UFS and NFS Performance CookBook
      Tuning Tips for I/O Intensive Applications



Last Revised: November 14 2001
This page is: http://www.dfwsug.org/cookbook.html




This guide is a set of the most common tuning tips for a high performance
NFS server over UFS in an HPC environment. Some of the tips are fairly
new and none should be applied blindly. "Consider your workload and
measure your performance" is always good advice.

The tuning recommendations in this document are organised into
Recipes. For each one we try to identify a Tell Tale Sign to alert
administrators that the given recipe has potential benefit in their
situation. We then describe the Fix (changing kernel parameters, mount
options, etc.), highlighting the potential Drawbacks of the
recommendation as well as the associated additional Memory
Requirements.

The first recommendation to anybody trying to improve the performance
of a data intensive setup is to use Solaris 8 (SunOS 5.8) or above on
the server. The Virtual Memory System of Solaris 8 was reworked
extensively, and the server will make much better use of available
memory. In Solaris 8 there is no longer any need to tune the
/etc/system parameters priority_paging, cachefree, lotsfree and
fastscan.

On pre-Solaris 8 machines the tuning tips can still be valid, but
identifying when they apply is more complex and additional tuning of
the above parameters is required.

Your feedback is most welcome.



Table of Contents and Summary

Recipe ufs_HW: GBs of data written to a file
    Tell tale sign: ufs_throttles keeps increasing
    Fix: increase ufs_HW
    Memory Requirement: ~ MAX(maxphys, ufs_HW) * #active files
    Applicability: UFS or NFS fileserver
    Created: July 19 2001
    Recipe ufs_HW: http://www.dfwsug.org/cookbook.html#ufs_HW

Recipe nfsd: NFS Server daemons
    Tell tale sign: number of active nfs threads equals nfsd param
    Fix: kill and restart nfsd with more threads
    Memory Requirement (KMEM): 16K of kernel stack + 1 NFS blocksize of data per active thread
    Drawback: starvation of user's non-NFS related work
    Applicability: NFS server
    Created: July 19 2001  Revised: Nov 26 2001
    Recipe nfsd: http://www.dfwsug.org/cookbook.html#nfsd

Recipe autoup1: System Working Set much bigger than System Memory
    Tell tale sign: constant filesystem paging and low memory
    Fix: decrease autoup, tune_t_fsflushr
    Memory Requirement: saves system memory
    Drawback: More and smaller disk I/O, higher system CPU time
    Applicability: UFS or NFS fileserver
    Created: July 19 2001
    Recipe autoup1: http://www.dfwsug.org/cookbook.html#autoup1

Recipe autoup2: Write intensive Working Set fits in memory
    Tell tale sign: no filesystem paging, high disk writes, apps waiting
    Fix: increase autoup, tune_t_fsflushr
    Memory Requirement: 0
    Drawback: file changes can be lost in case of failure
    Applicability: UFS or NFS fileserver
    Created: July 19 2001  Revised: Apr 02 2002
    Recipe autoup2: http://www.dfwsug.org/cookbook.html#autoup2

Recipe stripe1: Applications doing lots of Large I/O (>>8K)
    Tell tale sign: sustained disk activity, apps doing large read/write
    Fix: stripe your volume, set UFS maxcontig and adjust maxphys
    Memory Requirement: ~ MAX(maxphys, ufs_HW) * "# active file"
    Applicability: UFS or NFS fileserver
    Created: July 19 2001
    Recipe stripe1: http://www.dfwsug.org/cookbook.html#stripe1

Recipe stripe2: Many Small I/O (< 8K)
    Tell tale sign: sustained disk activity; apps doing small read/write
    Fix: stripe your volume, set UFS maxcontig and adjust maxphys
    Memory Requirement: ~ MAX(maxphys, ufs_HW) * "# active file"
    Applicability: UFS or NFS fileserver
    Created: July 19 2001
    Recipe stripe2: http://www.dfwsug.org/cookbook.html#stripe2

Recipe stripe3: Working with an NFS server
    Tell tale sign: NFS blocksize (32K) average physical I/O sizes
    Fix: See Recipe nfs3_blocksize
    Memory Requirement: ufs_HW * "# active file"
    Applicability: UFS or NFS fileserver
    Created: July 19 2001  Revised: Nov 15 2001
    Recipe stripe3: http://www.dfwsug.org/cookbook.html#stripe3

Recipe bufhwm: Large Active Filesystem (>>TB)
    Tell tale sign: small hit rate in the buffer cache
    Fix: increase bufhwm in /etc/system and reboot
    Memory Requirement (KMEM): MIN ( Total File System / 2M, bufhwm K )
    Applicability: UFS or NFS fileserver
    Created: July 19 2001
    Recipe bufhwm: http://www.dfwsug.org/cookbook.html#bufhwm

Recipe directio: Application I/O size >1MB, Active File Set > Memory
    Tell tale sign: high filesystem paging; apps doing large read/write
    Fix: mount -o forcedirectio
    Memory Requirement: saves system memory
    Drawback: each and every read/write call goes to disk
    Applicability: direct attached UFS fileserver
    Created: July 19 2001
    Recipe directio: http://www.dfwsug.org/cookbook.html#directio

Recipe ncsize: Large Number of active files (>>10000)
    Tell tale sign: low DNLC cache hit
    Fix: increase ncsize
    Memory Requirement (KMEM): ~ ncsize * 0.5K
    Applicability: UFS or NFS fileserver
    Created: July 19 2001
    Recipe ncsize: http://www.dfwsug.org/cookbook.html#ncsize

Recipe segmap_percent: Dedicated I/O server on large Dataset
    Tell tale sign: small segmap cache hit rate
    Fix: increase segmap_percent
    Memory Requirement: up to segmap_percent of memory may be used for I/O
    Drawback: application request for memory can cause paging storm
    Applicability: Dedicated I/O server NFS or UFS
    Created: Nov 13 2001
    Recipe segmap_percent: http://www.dfwsug.org/cookbook.html#segmap_percent

Recipe nfs3_max_threads: NFS Client Bursty Writes
    Tell tale sign: application shows high write and high wait time
    Fix: increase nfs3_max_threads
    Memory Requirement (KMEM): nfs3_max_threads * 2 * NFS blocksize per filesystem
    Drawback: the kernel threads will all be created even if they are of little use
    Applicability: NFS client
    Created: July 19 2001
    Recipe nfs3_max_threads: http://www.dfwsug.org/cookbook.html#nfs3_max_threads

Recipe nfs3_blocksize: Default NFS blocksize may throttle purely data intensive setup
    Tell tale sign: large client read/write call but physical I/O of sizes smaller than stripe unit
    Fix: adjust nfs3_bsize on client and nfs3_max_transfer_size on client and server
    Memory Requirement (KMEM): nfs3_max_threads * 2 * NFS blocksize per filesystem
    Drawback: larger blocks not suitable for small I/O
    Applicability: NFS client and Server
    Created: Nov 13 2001
    Recipe nfs3_blocksize: http://www.dfwsug.org/cookbook.html#nfs3_blocksize

Recipe nfs3_nra: NFS Client sequentially reads tens of MB at a time
    Tell tale sign: read performance (~3MB/s) much lower than disk or network capacity
    Fix: increase kernel parameter nfs3_nra to 8 or 16
    Memory Requirement (KMEM): nfs3_nra * NFS blocksize (32K) per active file
    Drawback: if readahead blocks go unused the network BW and processing is wasted
    Applicability: NFS client
    Created: July 19 2001
    Recipe nfs3_nra: http://www.dfwsug.org/cookbook.html#nfs3_nra

Recipe freebehind: Large file does not get cached
    Tell tale sign: repeated sequential reading of a file; causes I/O each time
    Fix: increase kernel parameter smallfile or set freebehind to 0
    Memory Requirement (KMEM): 0
    Drawback: largefile caching displaces all other cached files
    Applicability: UFS, NFS server
    Created: March 25 2002
    Recipe freebehind: http://www.dfwsug.org/cookbook.html#freebehind

Recipe one_at_a_time: small I/O size even for large read/write calls
    Tell tale sign: physical I/O size smaller than maxphys/maxcontig
    Fix: files should be created one at a time, preferably not over NFS
    Memory Requirement (KMEM): 0
    Drawback: not always possible; takes more time
    Applicability: UFS, NFS server
    Created: March 25 2002
    Recipe one_at_a_time: http://www.dfwsug.org/cookbook.html#one_at_a_time

Recipe false_readahead: random reads (> ~8k) on client causes large physical reads
    Tell tale sign: apps does small reads but physical I/O sizes are of size maxphys/maxcontig
    Fix: decrease maxcontig or segregate workload to separate filesystem
    Drawback: performance of large I/O will be reduced
    Applicability: UFS, NFS Client
    Created: Nov 13 2001  Revised: Mar 28 2002
    Recipe false_readahead: http://www.dfwsug.org/cookbook.html#false_readahead

Recipe clnt_max_conns: NFS Client reads or writes tens of MB
    Tell tale sign: write performance equal to single stream tcp test
    Fix: Increase kernel parameter clnt_max_conns to MIN(4,NCPU)
    Memory Requirement (KMEM): clnt_max_conns*tcp_max_buf
    Drawback: unused connections timeouts can lead to sluggish behavior
    Applicability: NFS client
    Created: July 19 2001
    Recipe clnt_max_conns: http://www.dfwsug.org/cookbook.html#clnt_max_conns

Recipe nocto: Apps does many open,write,close over NFS
    Tell tale sign: constant open/close activity; significant wait time
    Fix: mount -F nfs -o nocto
    Memory Requirement: N/A
    Drawback: file changes not seen immediately on other clients
    Applicability: NFS client
    Created: July 19 2001
    Recipe nocto: http://www.dfwsug.org/cookbook.html#nocto

Recipe RTFM: RTFM
    Tell tale sign: I'm lost
    Fix: READ
    Memory Requirement: you're allowed to take notes
    Drawback: it takes time
    Created: July 19 2001

NOTE: kernel parameters
NOTE: Page Cache
NOTE: Monitoring Paging
NOTE: Monitoring Applications
NOTE: Monitoring Disk Activity
NOTE: Network tuning

Recipe ufs_HW: GBs of data written to a file
    Tell tale sign: ufs_throttles keeps increasing
    Fix: increase ufs_HW
    Memory Requirement: ~ MAX(maxphys, ufs_HW) * #active files
    Applicability: UFS or NFS fileserver
    Created: July 19 2001
    Note: kernel parameters

For each file, UFS keeps track of the number of bytes of data being
written to disk. Those are bytes in transit between the page cache and
the disks. When this amount exceeds the threshold ufs_HW, subsequent
write(2) calls are blocked until enough of the I/O operations
complete. Note that what counts against ufs_HW is data being written
from the page cache to the disk device, but the operations that get
blocked are the writes trying to put more data into the page cache.

One can disable throttling by setting in /etc/system:

    set ufs:ufs_WRITES=0

It may be more prudent to set the ufs_HW/ufs_LW parameters to values
that limit the adverse condition:

    ufs_HW should be set to many times maxphys
    ufs_LW should be 2/3 of ufs_HW

When throttling happens, a process is blocked for a time of the order
of a physical write, say 0.01s. This means that a process can achieve
of the order of ufs_HW/0.01s, or 100*ufs_HW bytes/s. With ufs_HW=8M, a
process may not be able to output more than 800MB/sec. The default of
384K throttles a process at around 38MB/sec. As soon as you have a few
disks in a stripe, this limit is lurking.

Recipe ufs_HW: http://www.dfwsug.org/cookbook.html#ufs_HW
Warning: If you are not reading this on the official web site, please
check the site to see if the tip has been updated.

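The throttling estimate in Recipe ufs_HW is simple arithmetic, and the following Python sketch reproduces it. The function names are hypothetical helpers written for this illustration, and the 0.01s physical write time is the order-of-magnitude figure used in the recipe, not a measured value.

```python
# Rough estimate of the per-process write ceiling imposed by UFS throttling.
# Assumption (from the recipe): a throttled process waits about one physical
# write, taken here as 0.01 s, so throughput is bounded by ufs_HW / 0.01 s.

KB = 1024
MB = 1024 * KB


def throttle_ceiling(ufs_hw_bytes, write_latency_s=0.01):
    """Return the approximate write ceiling in bytes per second."""
    return ufs_hw_bytes / write_latency_s


def suggested_ufs_lw(ufs_hw_bytes):
    """Return ufs_LW as 2/3 of ufs_HW, as the recipe recommends."""
    return ufs_hw_bytes * 2 // 3


if __name__ == "__main__":
    for label, hw in (("384K (default)", 384 * KB), ("8M", 8 * MB)):
        ceiling_mb = throttle_ceiling(hw) / MB
        print(f"ufs_HW={label}: ~{ceiling_mb:.1f} MB/s, "
              f"ufs_LW={suggested_ufs_lw(hw)} bytes")
```

Measuring the actual physical write time on your disks and passing it as write_latency_s gives a better estimate than the default.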
Recipe nfsd: NFS Server daemons
    Tell tale sign: number of active nfs threads equals nfsd param
    Fix: kill and restart nfsd with more threads
    Memory Requirement (KMEM): 16K of kernel stack + 1 NFS blocksize of data per active thread
    Drawback: starvation of user's non-NFS related work
    Applicability: NFS server
    Created: July 19 2001  Revised: Nov 26 2001

The number of active nfs threads cannot be estimated easily. Below is
a command that will give you a hint, but it MUST NOT be used on a
production system, especially on an E10K (see bugids 4305932 &
4344513). I strongly discourage its use apart from test
configurations.

    #!/bin/csh
    echo '$<threadlist' | mdb -k |& grep svc_run | grep -v grep | wc -l

Note: nfsd threads are one type of thread that runs through svc_run.

The script /etc/init.d/nfs.server sets the maximum number of nfsd
server threads that can run at once. Those threads are created and
destroyed dynamically in the kernel. They each consume 16K of kernel
stack and most likely handle one NFS filesystem block. They also run
at a higher priority than the timeshare class.

Since the memory requirement is small and dynamic, there are no big
drawbacks to setting this value much higher on an NFS server. Note
though that on an NFS server that is also an application server this
can lead to applications being starved for CPU.

Edit /etc/init.d/nfs.server:

    /usr/lib/nfs/nfsd -a 1024

Then kill and restart the daemon:

    /etc/init.d/nfs.server stop
    /etc/init.d/nfs.server start

Recipe nfsd: http://www.dfwsug.org/cookbook.html#nfsd

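To get a feel for the kernel memory cost quoted in Recipe nfsd, here is a minimal Python sketch of the worst case where every thread is active. The function name is hypothetical; the 16K stack and 32K NFS blocksize are the figures given in this document.

```python
# Worst-case kernel memory used by nfsd threads, following the recipe's
# figures: 16K of kernel stack plus one NFS block of data per active thread.

KB = 1024
MB = 1024 * KB


def nfsd_kmem_bytes(active_threads, nfs_blocksize=32 * KB, stack_bytes=16 * KB):
    """Return the approximate KMEM consumed by the given number of active threads."""
    return active_threads * (stack_bytes + nfs_blocksize)


if __name__ == "__main__":
    for threads in (16, 256, 1024):
        print(f"{threads:5d} active threads: ~{nfsd_kmem_bytes(threads) / MB:.1f} MB")
```

Even with all 1024 threads busy the cost stays below 50MB, which is why the recipe calls the memory requirement cheap.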
Recipe autoup1: System Working Set much bigger than System Memory
    Tell tale sign: constant filesystem paging and low memory
    Fix: decrease autoup, tune_t_fsflushr
    Memory Requirement: saves system memory
    Drawback: More and smaller disk I/O, higher system CPU time
    Applicability: UFS or NFS fileserver
    Created: July 19 2001
    Note: kernel parameters

Here the filesystem page cache is not fully effective (NOTE: Page
Cache). "vmstat -p" shows constant filesystem activity and low memory.
If your disk subsystem is not saturated, there is hope of improving
performance by sending more data to it, thus freeing memory. If your
disk subsystem is already saturated, the workload is just too big and
adding memory or disks is the only hope.

The kernel autoup parameter tells the kernel how long it may wait
before syncing modified data to disk. You will want to reduce autoup
to a smaller value. It is a good idea to keep the ratio
(autoup / tune_t_fsflushr) at its default value of 6.

The effect being sought is to make fsflush handle the I/O instead of
the page scanner. The filesystem paging reported by "vmstat -p" should
decrease, while disk I/O and free memory should increase. The freed
memory should lead to better overall response.

You should also consider Recipe directio.

Recipe autoup1: http://www.dfwsug.org/cookbook.html#autoup1

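Keeping autoup and tune_t_fsflushr in their default ratio of 6 is easy to get wrong when editing by hand. The sketch below derives tune_t_fsflushr from a chosen autoup and prints candidate /etc/system lines. The helper names are hypothetical, and the output is a starting point to review, not a recommended setting.

```python
# Derive tune_t_fsflushr from a chosen autoup while keeping the ratio
# autoup / tune_t_fsflushr near the default value of 6 quoted in the recipe.

DEFAULT_RATIO = 6


def fsflush_settings(autoup_s, ratio=DEFAULT_RATIO):
    """Return (autoup, tune_t_fsflushr) in seconds; fsflushr is at least 1."""
    return autoup_s, max(1, autoup_s // ratio)


def etc_system_lines(autoup_s):
    """Return candidate /etc/system lines for the given autoup."""
    autoup, fsflushr = fsflush_settings(autoup_s)
    return [f"set autoup={autoup}", f"set tune_t_fsflushr={fsflushr}"]


if __name__ == "__main__":
    for autoup in (12, 30, 900):
        print("; ".join(etc_system_lines(autoup)))
```

The same helper covers Recipe autoup2, where autoup is raised instead of lowered.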
Recipe autoup2: Write intensive Working Set fits in memory
    Tell tale sign: no filesystem paging, high disk writes, apps waiting
    Fix: increase autoup, tune_t_fsflushr
    Memory Requirement: 0
    Drawback: file changes can be lost in case of failure
    Applicability: UFS or NFS fileserver
    Created: July 19 2001  Revised: Apr 02 2002
    Note: kernel parameters

In this situation, your application data is constantly being written
and rewritten to files that are cached in memory. However, a small
autoup is putting a load on your disk subsystem, which is not able to
keep up, and your application is waiting on I/O (even though this I/O
is going to cache). So:

    1- "vmstat -p" shows near 0 filesystem activity
    2- "iostat -xtc" shows constant disk write activity
    3- "truss -c <pid>" shows high wait time

By increasing autoup you decrease the amount of data sent to disk,
increase the chance of issuing larger I/O and benefit from write
cancellation. On the other hand, you slightly increase the chance of
losing data in case of server failure.

The default value of autoup is 30 seconds. If the accessed data
globally fits in memory, then your page cache is very effective and
you should increase autoup (e.g. 900) and tune_t_fsflushr (e.g. 150).
You will benefit from this tuning if your disk subsystem is not able
to keep up with your I/O requirements.

Say your working set is 30GB and fits in your available memory. With
an autoup of 30, fsflush will try to sync 30GB of data every 30
seconds, requiring a rate of 1GB/sec from the disk subsystem. If your
subsystem cannot achieve this bandwidth, your application will be
throttled and see some wait time. By increasing autoup, you should
decrease the amount of I/O actually sent to disk and reduce the
application wait time.

As a side note, on a CPU bound system this tuning can also help
performance by freeing the CPU time used by fsflush and avoiding the
scheduling perturbation it induces.

Recipe autoup2: http://www.dfwsug.org/cookbook.html#autoup2
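The 30GB example in Recipe autoup2 can be reproduced with a few lines of Python. This is a sketch of the recipe's worst-case reasoning, assuming the whole working set is dirty and must be synced once per autoup interval; the function name is hypothetical.

```python
# Disk bandwidth fsflush would need if the entire working set were dirty and
# had to be written out once every autoup seconds (worst-case assumption).

MB = 1024 ** 2
GB = 1024 ** 3


def required_flush_rate(working_set_bytes, autoup_s):
    """Return the sustained write rate in bytes per second."""
    return working_set_bytes / autoup_s


if __name__ == "__main__":
    working_set = 30 * GB
    for autoup in (30, 240, 900):
        rate_mb = required_flush_rate(working_set, autoup) / MB
        print(f"autoup={autoup:4d}s: ~{rate_mb:7.1f} MB/s")
```

With autoup=30 the result is 1GB/sec, as in the recipe; with autoup=900 it drops to roughly 34MB/sec.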
Recipe stripe1: Applications doing lots of Large I/O (>>8K)
    Tell tale sign: sustained disk activity, apps doing large read/write
    Fix: stripe your volume, set UFS maxcontig and adjust maxphys
    Memory Requirement: ~ MAX(maxphys, ufs_HW) * "# active file"
    Applicability: UFS or NFS fileserver
    Created: July 19 2001

Here the system shows sustained disk activity and the applications
are issuing large (>>8K) read or write calls (NOTE: Monitoring Disk
Activity, NOTE: Monitoring Applications). In case of small I/O, see
Recipe stripe2 in conjunction with this one.

The tuning described here requires considering these 4 areas:

    1- building a striped volume with a correct interlace
    2- tuning the filesystem cluster size
    3- understanding how the kernel works on page boundaries
    4- having apps issue read/write calls that cause sequential page
       accesses

To get a high disk throughput in this context you want to send as big
a chunk as possible to each disk. Considering a disk capable of 100
IOPS (physical I/Os per second) and 10MB/sec, you want to try to send
to each disk on the order of 100K ((Disk Bandwidth MB/s)/(Disk IOPS))
per operation.

1- When building a volume, the contiguous chunk given to a disk before
switching to the next one is the interlace factor. Solaris Volume
Manager (formerly known as DiskSuite) defaults to 64K. For an
exclusively sequential workload, or a multi-stream (multi-threaded or
multi-process) workload on a small number of disks, this could be
tuned to 128K or 256K. A bigger interlace allows a bigger throughput
per individual disk. The drawback of using too large a value is that
you will only see the performance boost of striping when using I/O
sizes larger than the stripe size or on highly concurrent workloads.
64K is a good compromise; it is the smallest size that allows reaching
a good portion of the available disk bandwidth while keeping many
disks busy. For disks capable of 40MB/sec, the interlace should
definitely be increased (up to 256K).

2- The filesystem software internally maintains the notion of read and
write clusters. For UFS, both the read and write cluster sizes are set
to the largest of maxphys and maxcontig. Since maxphys is expressed in
bytes and maxcontig in 8K filesystem blocks:

    rd/wr cluster size = MAX ( maxphys, maxcontig * 8192 )

The maxcontig parameter is set either when constructing a new
filesystem with "newfs -C" or on an already built filesystem with
"tunefs -a". A common value of maxcontig is 128, which tells the
filesystem to work with 1MB clusters. Bigger values are possible, but
in that case you should maintain maxphys <= maxcontig*8K and
experiment at your own risk (Note: kernel parameters). Clusters of 1MB
should allow a single stream to drive 16 disks at close to 100MB/sec.

3- The kernel needs to fetch data from disk or push data to disk in
response to your applications' requests. This happens on a page by
page basis. If pages are accessed in sequential order, the kernel
considers that the user is doing sequential access to the file. Note
that in cases where the applications do random (small) I/O but the
accessed pages are contiguous, clustered reads will still work. On
output, data written to a given file is accumulated page by page until
the amount of data reaches the cluster size or until a page not
contiguous with the current cluster is accessed; at that point, a
physical I/O is issued.

4- Your applications must issue large read/write system calls
(otherwise see Recipe stripe2). Verify with:

    truss -t read,write,pread,pwrite,kaio -p <pid>

It is also interesting to compute the average physical I/O size seen
by the system (NOTE: Monitoring Disk Activity). In a good situation,
the I/O seen by volumes should approach the cluster size and the
physical I/O seen by individual disks should be close to the stripe
unit (the interlace).

Recipe stripe1: http://www.dfwsug.org/cookbook.html#stripe1
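The two formulas in Recipe stripe1, the per-operation chunk size a disk can absorb and the UFS cluster size, are sketched below in Python. The function names are hypothetical, and the 128K maxphys used in the example is an assumed value for illustration; check the actual setting on your system.

```python
# Sizing helpers for Recipe stripe1.
# - io_size_per_op: (disk bandwidth) / (disk IOPS), the chunk to aim for per disk
# - cluster_size:   MAX(maxphys, maxcontig * 8192), the UFS rd/wr cluster size

UFS_BLOCK = 8192  # maxcontig is expressed in 8K filesystem blocks
KB = 1024


def io_size_per_op(disk_bw_bytes_s, disk_iops):
    """Return the I/O size (bytes) that lets a disk reach its bandwidth."""
    return disk_bw_bytes_s / disk_iops


def cluster_size(maxphys_bytes, maxcontig_blocks):
    """Return the UFS read/write cluster size in bytes."""
    return max(maxphys_bytes, maxcontig_blocks * UFS_BLOCK)


if __name__ == "__main__":
    # Disk from the recipe: 100 IOPS and 10 MB/s (decimal MB, as in the text).
    print(f"target per-disk I/O: {io_size_per_op(10_000_000, 100) / 1000:.0f}K")
    # Assumed maxphys of 128K with the common maxcontig of 128.
    print(f"cluster size: {cluster_size(128 * KB, 128) // KB}K")
```

With maxcontig=128 the cluster size comes out at 1024K, the 1MB cluster mentioned in the recipe.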
Recipe stripe2: Many Small I/O (< 8K)
    Tell tale sign: sustained disk activity; apps doing small read/write
    Fix: stripe your volume, set UFS maxcontig and adjust maxphys
    Memory Requirement: ~ MAX(maxphys, ufs_HW) * "# active file"
    Applicability: UFS or NFS fileserver
    Created: July 19 2001

You can get the benefit of Recipe stripe1 even if your applications do
small (<8K) read(2) and write(2) calls. However, this requires that
the files are large (>>1MB; they are implicitly large in the case of
large I/O), that they are accessed sequentially and that the system
page cache is effective (NOTE: Page Cache).

In this situation the I/O size sent to the volume may be as large as
the cluster size. For example, if an application does a long series of
1K writes, the pageout code will be able to cluster up to maxcontig
blocks and issue one large physical I/O. The average physical I/O size
should then be closer to the cluster size than to 1K (NOTE: Monitoring
Disk Activity); Recipe stripe1 will then apply.

A different situation occurs for small files or non-sequential access.
The physical I/O size will then be more related to the file sizes, and
the throughput of the disk subsystem will degrade. A single threaded
process doing a backup of 8K files will not be able to achieve more
than 800K/sec no matter how many disks there are in the array.

More performance can be obtained by using multiple streams (by
threading the application or using asynchronous I/O calls), but the
maximum sustained throughput when the average file size is smaller
than the interlace can be estimated as:

    MIN( #streams, #disks ) * AVG file size * 100/seconds

The memory requirement falls drastically, to #streams * AVG file size.

Recipe stripe2: http://www.dfwsug.org/cookbook.html#stripe2

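The small-file throughput bound in Recipe stripe2 can be evaluated for a planned backup or copy job with the sketch below. The function name is hypothetical, and the 100 physical I/Os per second per disk is the same rough figure the recipe uses.

```python
# Upper bound on sustained throughput when the average file is smaller than
# the interlace: MIN(#streams, #disks) * AVG file size * 100 per second.

KB = 1024


def small_file_throughput(streams, disks, avg_file_bytes, iops_per_disk=100):
    """Return the estimated maximum throughput in bytes per second."""
    return min(streams, disks) * avg_file_bytes * iops_per_disk


if __name__ == "__main__":
    for streams in (1, 4, 16, 64):
        rate_kb = small_file_throughput(streams, 16, 8 * KB) / KB
        print(f"{streams:3d} streams on 16 disks, 8K files: ~{rate_kb:.0f} K/s")
```

One stream gives the 800K/sec of the recipe's backup example, and adding streams beyond the number of disks brings no further gain.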
Recipe stripe3: Working with an NFS server
    Tell tale sign: average physical I/O size close to the NFS blocksize (32K)
    Fix: See Recipe nfs3_blocksize
    Memory Requirement: ufs_HW * "# active file"
    Applicability: NFS fileserver
    Created: July 19 2001  Revised: Nov 15 2001

Large NFS requests are broken up into chunks of a certain size. That
chunk is the NFS blocksize; it is negotiated between the client and
the server at mount time and defaults, on Sun, to a payload of 32K of
data (see Recipe nfs3_blocksize). Also, the Solaris NFS client is
multi-threaded in the kernel (see Recipe nfs3_max_threads). A single
large NFS read or write will be worked on by multiple concurrent
kernel threads. Those concurrent requests issued on the client will
arrive on the server side in no particular order, which jeopardizes
the clustering done by the server.

Considering the above, it is somewhat difficult for an NFS server to
issue physical I/Os that are much larger than the blocksize for
writes. This applies to the NFS v3 protocol. NFS v2 uses 8K blocks and
all requests are synchronous.

The NFS server will still attempt to cluster, so using a large
interlace (64K-256K) can still be a good thing. But an NFS v3 server
that reaches approximately 4MB/sec/disk (one 32K block per 0.01s disk
latency, per disk) is close to saturation.

Note that while clustering is considerably modified when going through
NFS, the UFS throttling is not. This is why ufs_HW should be tuned on
an NFS server. See Recipe ufs_HW.

Recipe stripe3: http://www.dfwsug.org/cookbook.html#stripe3

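The per-disk saturation figure in Recipe stripe3 follows from one NFS block per disk latency. The Python sketch below computes it for several blocksizes, which also shows why Recipe nfs3_blocksize suggests matching the stripe unit. The function name is hypothetical and the 0.01s latency is the rough figure used in this document.

```python
# Per-disk throughput when each physical I/O carries one NFS block and takes
# about one disk latency (0.01 s in the recipe): blocksize / latency.

KB = 1024
MB = 1024 * KB


def per_disk_rate(nfs_blocksize_bytes, disk_latency_s=0.01):
    """Return the approximate per-disk saturation rate in bytes per second."""
    return nfs_blocksize_bytes / disk_latency_s


if __name__ == "__main__":
    for bs_kb in (8, 32, 64, 128, 256):
        rate_mb = per_disk_rate(bs_kb * KB) / MB
        print(f"blocksize {bs_kb:3d}K: ~{rate_mb:5.1f} MB/s per disk")
```

The 32K default lands a little above 3MB/sec per disk, in line with the 3-5MB/sec range quoted for Sun NFSv3 clients and servers.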
Recipe bufhwm: Large Active Filesystem (>>TB)
    Tell tale sign: small hit rate in the buffer cache
    Fix: increase bufhwm
    Memory Requirement: MIN ( Total File System / 2M, bufhwm K )
    Applicability: UFS or NFS fileserver
    Drawback: may consume memory for little benefit
    Created: July 19 2001

You should expect to tune bufhwm away from its default value if your
filesystem is more than 40000 times your system memory (e.g. 1GB
system and 40TB filesystem). The real tell tale sign is a small hit
ratio on the buffer cache during periods of high activity:

    "sar -b 1 10" shows %rcache or %wcache < 90%

A maximum of bufhwm KB of kernel memory is used to cache metadata
information (e.g. block indirection data). bufhwm defaults to 2% of
system memory and cannot be more than 20%. The bufhwm configured on
your system can be obtained with "/usr/sbin/sysdef | grep bufhwm".

The requirement for bufhwm should be:

    'Sum Total of Active Filesystem Size' / 2M

For a 100GB filesystem, configure 50MB of "bufhwm" kernel memory and
set bufhwm = 50000 (in units of K). Note: kernel parameters.

The value of bufhwm is a high water mark in the sense that the kernel
will consume memory, if required, up to this mark. Note that files
that are used even just once (for example during a full filesystem
backup) will contribute their metadata to the buffer cache. In
practice this means that the kernel memory consumed will be:

    KMEM = MIN ( Total File System / 2M, bufhwm K )

Recipe bufhwm: http://www.dfwsug.org/cookbook.html#bufhwm

Recipe directio: Application I/O size >1MB, Active File Set > Memory
    Tell tale sign: high filesystem paging; apps doing large read/write
    Fix: mount -o forcedirectio
    Memory Requirement: saves system memory
    Drawback: each and every read/write call goes to disk
    Applicability: direct attached UFS fileserver
    Created: July 19 2001

In this scenario:

    1- "vmstat -p" shows constant filesystem paging, indicating an I/O
       intensive workload on a low memory system,
    2- truss of your favorite apps shows that the issued read and
       write calls are almost exclusively > 1MB,
    3- and your disk subsystem is capable of sustaining of the order
       of 50MB/sec or more.

There is then little cache reuse, and the time taken to write to the
page cache is comparable to the time taken to write to disk. In this
situation the use of a page cache implies an extra copy and uses
memory, while the usual benefits do not apply.

When all those conditions are present, the page cache should be turned
off with mount -o forcedirectio. The usual case here is a workload
explicitly developed for a directio filesystem. Note that a small
write will be slowed by a factor of about 10000. A 64K write would be
slowed by a factor of 100. So this truly applies to applications
issuing exclusively large I/O. Also, with this tuning, if 2
applications read a given file they will each issue disk I/O for that
file, as opposed to having the second one read from the memory based
page cache.

The UFS page cache (NOTE: Page Cache) is ON by default, and one needs
to mount with the -o forcedirectio option to disable it in the
situations that require it.

This recipe can either solve your problem or degrade performance even
more. It all depends on access patterns and sizes. If you can
influence the size of the read and write calls, make them bigger (up
to 1M).

If only a specific application (rather than the global workload) would
benefit from directio, and if that application's data cannot be
segregated on a different directio filesystem, then using the
directio(3C) call in this application should be considered.

Recipe directio: http://www.dfwsug.org/cookbook.html#directio

Recipe ncsize: Large Number of active files (>>10000)
    Tell tale sign: low DNLC cache hit
    Fix: increase ncsize
    Memory Requirement (KMEM): ~ ncsize * 0.5K
    Applicability: UFS or NFS fileserver
    Created: July 19 2001  Revised: Nov 14 2001

In this situation "vmstat -s | grep lookups" shows a cache hit ratio
smaller than 90%. Also see "kstat -n dnlcstats".

Filenames and inode information are cached in the kernel. The size of
the cache is controlled by the ncsize kernel parameter. This tuning
should be applied when a system is showing a low (<90%) DNLC cache hit
ratio; ufs_ninode will automatically be adjusted at boot time to
ncsize. Note: kernel parameters.

You will consume in kernel memory a few bytes per ncsize entry and
0.5K per ufs_ninode entry. For a Solaris 8 machine, the default
settings should put both ncsize and ufs_ninode at:

    ncsize = ufs_ninode = total memory / 16K

This means that up to 1/32nd of memory may be used for this cache when
running with default settings.

Recipe ncsize: http://www.dfwsug.org/cookbook.html#ncsize

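The default sizing and memory cost in Recipe ncsize can be checked with a short Python sketch. The function names are hypothetical, and the 0.5K per entry is the approximate figure quoted in the recipe (it ignores the few extra bytes per DNLC entry).

```python
# Default DNLC/inode cache sizing and its approximate kernel memory cost,
# using the recipe's figures: ncsize = memory / 16K, about 0.5K per entry.

KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def default_ncsize(total_memory_bytes):
    """Return the Solaris 8 default ncsize (= ufs_ninode) per the recipe."""
    return total_memory_bytes // (16 * KB)


def dnlc_kmem_bytes(ncsize, bytes_per_entry=512):
    """Return the approximate kernel memory used when the cache is full."""
    return ncsize * bytes_per_entry


if __name__ == "__main__":
    for mem_gb in (1, 4, 16):
        n = default_ncsize(mem_gb * GB)
        print(f"{mem_gb:2d}GB memory: ncsize={n}, "
              f"KMEM up to ~{dnlc_kmem_bytes(n) // MB}MB")
```

On a 4GB machine this gives ncsize=262144 and about 128MB of kernel memory, which is the 1/32nd of memory mentioned in the recipe.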
Recipe segmap_percent: Dedicated I/O server on large Dataset
    Tell tale sign: small segmap cache hit rate
    Fix: increase segmap_percent
    Memory Requirement: up to segmap_percent of memory may be used for I/O
    Drawback: application request for memory can cause paging storm
    Applicability: Dedicated I/O server NFS or UFS
    Created: Nov 13 2001

It is well known that most of the memory can be used to cache
filesystem data. This leads to important performance improvements when
accessing this data. However, only a portion of memory is actually
readily mapped in the kernel in "segmap" to be the target of an actual
I/O. For a read or write call, being in segmap or not can cause a
performance difference of approximately 20%.

Solaris 8 introduced a new kernel parameter called segmap_percent that
controls the size of segmap. Prior to Solaris 8 the segment was of a
fixed 256M size. With Solaris 8 the segmap is sized to be a portion of
free memory after boot, with a default value of 12%.

The cache hit rate in segmap can be computed from the output of this
command:

    kstat unix:0:segmap:get\* 1

The segmap efficiency since boot should be:

    (get_reclaim + get_use) / getmap

The interesting measure is the rate of increase (the difference
between 2 successive sets of data from the kstat command) of:

    getmap - (get_reclaim + get_use)

which is the number of pages per second that may be impacted by this
tuning. If this number is small, no benefit should be expected. We can
associate an increase in performance of approximately 20% with those
pages if and only if we manage to keep them in segmap by increasing
segmap_percent. There is no guarantee that increasing segmap will
cause the pages to be in segmap. This is workload dependent.

For example, if the number increases by 1000 pages per second, that
corresponds to 1000 8K pages or 8MB/sec of I/O that may benefit by 20%
if increasing segmap_percent causes the data to become fully mapped in
segmap.

On a dedicated I/O server it may be beneficial to increase this value.
This actually consumes little additional memory for segmap structures
(< 1%), but it should be noted that the segmap portion of the
filesystem cache is not considered free memory. The non-segmapped
portion is considered free memory. So it is rather important to keep a
portion of memory free and available. Setting segmap_percent to a
large portion (> 50%) of memory can definitely create paging storms,
causing serious performance degradation.

Recipe segmap_percent: http://www.dfwsug.org/cookbook.html#segmap_percent

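The segmap hit ratio and miss rate described in Recipe segmap_percent can be computed from two successive kstat samples. The Python sketch below assumes the counters have already been parsed into dictionaries keyed by the kstat names getmap, get_reclaim and get_use; the function names and the sample numbers are hypothetical.

```python
# Segmap efficiency from kstat counters, following the recipe:
#   hit ratio = (get_reclaim + get_use) / getmap
#   misses    = getmap - (get_reclaim + get_use)
# The miss rate between two samples is the page rate that may benefit.

PAGE_SIZE = 8192  # 8K pages, as in the recipe's example


def segmap_hit_ratio(sample):
    """Return the segmap hit ratio since boot for one kstat sample."""
    return (sample["get_reclaim"] + sample["get_use"]) / sample["getmap"]


def segmap_misses(sample):
    """Return the cumulative number of getmap calls that missed segmap."""
    return sample["getmap"] - (sample["get_reclaim"] + sample["get_use"])


def miss_rate_pages(prev, cur, interval_s):
    """Return missed pages per second between two samples."""
    return (segmap_misses(cur) - segmap_misses(prev)) / interval_s


if __name__ == "__main__":
    # Made-up samples taken 1 second apart, for illustration only.
    prev = {"getmap": 10000, "get_reclaim": 6000, "get_use": 3000}
    cur = {"getmap": 12000, "get_reclaim": 6700, "get_use": 3300}
    rate = miss_rate_pages(prev, cur, 1.0)
    print(f"hit ratio since boot: {segmap_hit_ratio(cur):.2f}")
    print(f"miss rate: {rate:.0f} pages/s, ~{rate * PAGE_SIZE / 1e6:.1f} MB/s")
```

With these sample numbers the miss rate is 1000 pages per second, the roughly 8MB/sec case discussed in the recipe.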
Recipe nfs3_max_threads: NFS Client Bursty Writes
  Tell tale sign: application shows high write and high wait time
  Fix: increase nfs3_max_threads
  Memory Requirement (KMEM): nfs3_max_threads * 2 * NFS blocksize per filesystem
  Drawback: the kernel threads will all be created even if they are of little use
  Applicability: NFS client
  Created: July 19 2001
  Revised:
  See NOTE: kernel parameters

A target application here cycles between NFS write-intensive and non-I/O-intensive phases, both being significant contributors to total elapsed time.

NFS operations for a mount point are spread over up to nfs3_max_threads kernel threads. The Solaris 8 default value for this parameter is 8. The client thus caches up to nfs3_max_threads * 2 * blocksize (default 32K) of written data before blocking processes on writes to the same NFS mount point. Increasing nfs3_max_threads allows applications to proceed through the I/O phase without being throttled by NFS and network performance. The I/O can then be executed by the kernel concurrently with the application's non-I/O processing. nfs3_max_threads should be kept to a reasonable value, smaller than approximately 8*NCPU. Note that the nfs3_max_threads limit is per NFS filesystem.

Consider an application that loops through a series of 1MB writes, each followed by a compute phase that lasts 1 sec. This application only requires an average of 1MB/s of network bandwidth. With the default nfs3_max_threads of 8 and an NFS blocksize of 32K the client is able to buffer 2 * 8 * 32K = 512K of data without throttling applications. This means that the application doing a 1MB write will block and the I/O operation will proceed at the speed of the network or the server side disk (whichever is slower). On a 10MB/s connection, 0.1 second will go to the I/O phase, followed by 1 second of computation, for a 1.1s cycle time.

On the other hand, if one sets nfs3_max_threads to 16, the I/O operations will be fully buffered on the client side in the kernel (2 * 16 * 32K = 1MB). The application will see memcpy speed of more than 100MB/sec. The 1MB of data will be handled in less than 0.01 sec. The application will then proceed to its non-I/O part of 1 sec while the kernel processes the I/O asynchronously. The cycle time will be 1.01s, a speedup of close to 10%.

What has been achieved by this tuning is parallelism between the data transfer over the wire (done by the kernel) and the application processing.

An application must keep the file descriptor open for this tuning to work. If the application closes the file descriptor at the end of every I/O phase then Recipe nocto must be implemented as well.

This tuning will not increase the maximum sustained NFS performance; it only helps application performance in situations where NFS has not saturated the network or disk subsystem.

Recipe nfs3_max_threads: http://www.dfwsug.org/cookbook.html#nfs3_max_threads
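The buffering arithmetic of Recipe nfs3_max_threads can be replayed with a few lines of Python. This is only a sketch of the simplified model used in the text: the function names are hypothetical, and the assumption that a write larger than the client buffer is charged entirely at wire speed is a simplification, not a description of the actual NFS client.

```python
KB = 1024
MB = 1024 * KB


def client_write_buffer(max_threads, blocksize=32 * KB):
    """Bytes of written data the client caches per mount point before blocking."""
    return max_threads * 2 * blocksize


def cycle_time(write_bytes, compute_s, max_threads,
               wire_bw=10 * MB, memcpy_bw=100 * MB, blocksize=32 * KB):
    """Seconds per write+compute cycle in the simplified model of the recipe.

    Assumption: a write that fits in the client buffer completes at memcpy
    speed; a larger write is charged entirely at wire speed.
    """
    if write_bytes <= client_write_buffer(max_threads, blocksize):
        io_s = write_bytes / memcpy_bw   # fully buffered in the kernel
    else:
        io_s = write_bytes / wire_bw     # throttled by network or server disk
    return io_s + compute_s


if __name__ == "__main__":
    for threads in (8, 16):
        buf_kb = client_write_buffer(threads) // KB
        t = cycle_time(1 * MB, 1.0, threads)
        print(f"nfs3_max_threads={threads}: buffer={buf_kb}K cycle={t:.2f}s")
```

With 8 threads the buffer is 512K and the cycle takes 1.10s; with 16 threads the buffer is 1024K and the cycle takes 1.01s, which are the figures quoted in the recipe.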
Recipe nfs3_blocksize: Default NFS blocksize may throttle purely data intensive setup
  Tell tale sign: large client read/write calls but physical I/O of sizes smaller than stripe unit
  Fix: adjust nfs3_bsize on client and nfs3_max_transfer_size on client and server
  Memory Requirement (KMEM): nfs3_max_threads * 2 * NFS blocksize per filesystem
  Drawback: larger blocks not suitable for small I/O
  Applicability: NFS client and Server
  Created: Nov 13 2001
  Revised:

For a data-intensive setup over NFS, the default NFSv3 transfer size can cause suboptimal utilisation of the disk subsystem. Requests are broken up into blocksize chunks and this governs the size of physical I/Os. The 32K default for NFSv3 Sun clients and servers can cause disks to saturate at 3-5 MB/sec.

To get a blocksize larger than 32K one needs to change the client side nfs3_bsize and both the client and server nfs3_max_transfer_size. It is not useful to change the size, rsize, wsize mount options. To get a blocksize smaller than 32K it is sufficient to change the client nfs3_bsize parameter.

A reasonable value for the blocksize when working on a data intensive setup would be one that matches the stripe unit. Physical I/O size to individual disks should then not be limited by the NFS framework.

Note: UDP cannot use a blocksize larger than 64K, and we actually do not recommend changing the blocksize when working with NFS over UDP.

Recipe nfs3_blocksize: http://www.dfwsug.org/cookbook.html#nfs3_blocksize
Recipe nfs3_nra: NFS Client sequentially reads tens of MB at a time
  Tell tale sign: read performance (~3MB/s) much lower than disk or network capacity
  Fix: increase kernel parameter nfs3_nra to 8 or 16
  Memory Requirement (KMEM): nfs3_nra * NFS blocksize (32K) per active file
  Drawback: if readahead blocks go unused, the network bandwidth and processing are wasted
  Applicability: NFS client
  Created: July 19 2001
  Revised:

nfs3_nra is the number of NFS blocks that a client tries to read ahead. If the latency of an RPC call is, say, 0.01 seconds, the maximum throughput that can be achieved for a read call using the default blocksize will be nfs3_nra*32K/0.01s. You also need to keep nfs3_max_threads >= nfs3_nra. See NOTE: kernel parameters.

Recipe nfs3_nra: http://www.dfwsug.org/cookbook.html#nfs3_nra
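The throughput bound nfs3_nra*32K/0.01s is easy to tabulate. The sketch below only evaluates that formula; the function names are hypothetical, and the 0.01 second RPC latency is the example value from the recipe, not a measured one.

```python
KB = 1024
MB = 1024 * KB


def readahead_throughput(nfs3_nra, rpc_latency_s=0.01, blocksize=32 * KB):
    """Upper bound on read throughput in bytes/s: nra * blocksize / latency."""
    return nfs3_nra * blocksize / rpc_latency_s


def threads_sufficient(nfs3_max_threads, nfs3_nra):
    """The recipe asks for nfs3_max_threads >= nfs3_nra."""
    return nfs3_max_threads >= nfs3_nra


if __name__ == "__main__":
    for nra in (1, 8, 16):
        mb_s = readahead_throughput(nra) / MB
        print(f"nfs3_nra={nra:2d}: at most {mb_s:.1f} MB/s")
    print("8 threads enough for nra=16?", threads_sufficient(8, 16))
```

Under these assumptions a single outstanding 32K block gives about 3MB/s, the figure in the tell tale sign, while 8 and 16 blocks allow about 25MB/s and 50MB/s. With the default of 8 threads, a readahead of 16 would also require raising nfs3_max_threads.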
Recipe freebehind: Large file does not get cached
  Tell tale sign: repeated sequential reading of a file causes I/O each time
  Fix: increase kernel parameter smallfile or set freebehind to 0
  Memory Requirement (KMEM): 0
  Drawback: large file caching displaces all other cached files
  Applicability: UFS, NFS server
  Created: March 25 2002
  Revised:

Prior to Solaris 8, it was very important to prevent the file cache from consuming all of memory. UFS freebehind was implemented such that when a file was read sequentially, pages would be freed from the cache as soon as they had been copied to the user buffer. This allowed working on large files without consuming a lot of physical memory.

Today the pages of the file cache are also considered free memory, which means that there is a less stringent requirement to free them from the page cache once consumed. Freebehind is activated for files bigger than smallfile (Solaris 8 default of 32K).

The size of files that can be fully cached by a system is a matter of policy to be set for each machine. On a dedicated I/O server that is known to work on large files and that benefits from caching them, one can set the smallfile parameter to a much larger value, for example 1/8th of total memory, or disable freebehind altogether. See NOTE: kernel parameters.

Recipe freebehind: http://www.dfwsug.org/cookbook.html#freebehind
Recipe one_at_a_time: small I/O size even for large read/write calls
  Tell tale sign: physical I/O size smaller than maxphys/maxcontig
  Fix: files should be created one at a time, preferably not over NFS
  Memory Requirement (KMEM): 0
  Drawback: not always possible; takes more time
  Applicability: UFS, NFS server
  Created: March 25 2002
  Revised:

At file creation time, or when a file is extended, sequences of disk blocks (extents) become associated with file offsets. The size of the extents also governs the potential size of I/O performed on the given file. When all is well the extents match the maxcontig parameter. However, when multiple files are extended simultaneously, the filesystem may end up creating extents that are smaller than the ideal values, and the I/O size then suffers with no possible remedy.

To avoid this issue, if possible, one should create files one after the other using cluster sized write calls. Also, because of the multi-threaded nature of the NFS client (see Recipe nfs3_max_threads), file creation is best done on the server.

Note that, in general, control over extent sizes is not guaranteed and will depend on the history of the filesystem.

Recipe one_at_a_time: http://www.dfwsug.org/cookbook.html#one_at_a_time
Recipe false_readahead: small (<32K) random reads on client cause large physical reads
  Tell tale sign: app does small reads but physical I/O sizes are of size maxphys/maxcontig
  Fix: decrease maxcontig or segregate workload to separate filesystem
  Drawback: performance of large I/O will be reduced
  Applicability: UFS, NFS Client
  Created: Nov 13 2001
  Revised: Mar 28 2002

Working with read or write sizes slightly greater than the pagesize (8K) can cause serious performance problems. Here an application is issuing, for example, 9K reads on a very large file before issuing an lseek to a totally different portion of the file. The small read request, because it touches 2 different pages, is considered sequential access and will cause a full-length cluster to be read ahead. Note that this can happen for just about any size of read call.

If, for example, maxphys was tuned to 1MB (see Recipe stripe1) then a small client read request can cause a rather unexpectedly large server I/O. This may not be a problem on a server with a lot of memory and if the data will eventually be used by clients of this server. But if the server has a very large filesystem, if files are larger on average than the cluster size, and if they are read sparingly from clients, this can be a concern.

To work around this issue one can reduce the maxcontig parameter for the filesystem involved. However, this may impact other applications that use large sequential I/O from that client. The only other option is to segregate this type of workload to a separate filesystem.

Recipe false_readahead: http://www.dfwsug.org/cookbook.html#false_readahead
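To see why a 9K read, or even a much smaller one, can look sequential, it helps to count the pages a request touches. The sketch below is only an illustration: the function names are hypothetical, it uses the 8K pagesize quoted in the recipe, and it does not model the actual UFS readahead heuristic, only the page counting and the resulting ratio between the cluster read and the bytes requested.

```python
PAGESIZE = 8 * 1024  # pagesize quoted in the recipe


def pages_touched(offset, length, pagesize=PAGESIZE):
    """Number of distinct pages covered by a read of `length` bytes at `offset`."""
    if length <= 0:
        return 0
    first = offset // pagesize
    last = (offset + length - 1) // pagesize
    return last - first + 1


def read_amplification(request_bytes, cluster_bytes):
    """Ratio of bytes physically read (one full cluster) to bytes requested."""
    return cluster_bytes / request_bytes


if __name__ == "__main__":
    print("9K read at offset 0:", pages_touched(0, 9 * 1024), "pages")
    print("2-byte read at offset 8191:", pages_touched(8191, 2), "pages")
    print("8K aligned read:", pages_touched(0, 8 * 1024), "page")
    ratio = read_amplification(9 * 1024, 1024 * 1024)
    print(f"9K request, 1MB cluster: about {ratio:.0f}x the data requested")
```

A 9K read always spans at least two pages, and so does a tiny read that happens to straddle a page boundary. With maxphys at 1MB, a full cluster read for a 9K request moves more than a hundred times the data the application asked for.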
Recipe clnt_max_conns: NFS Client reads or writes tens of MB
  Tell tale sign: write performance equal to single stream TCP test
  Fix: increase kernel parameter clnt_max_conns to MIN(4,NCPU)
  Memory Requirement (KMEM): clnt_max_conns*tcp_max_buf
  Drawback: unused connection timeouts can lead to sluggish behavior
  Applicability: NFS client
  Created: July 19 2001
  Revised:

Driving TCP at Gb speed may require more than 1 CPU to achieve maximum throughput. All requests from an NFS client share a pool of connections, and the Solaris 8 default connection pool size is 1. This is sufficient for 100 Mbit ethernet but may require tuning for faster interfaces. A single connection can serialise the traffic onto one CPU, which limits throughput.

The drawback is still under investigation. This tuning may lead to sluggish response and is not recommended, due to lack of experience with it. See NOTE: kernel parameters.

Recipe clnt_max_conns: http://www.dfwsug.org/cookbook.html#clnt_max_conns
Recipe nocto: App does many open, write, close over NFS
  Tell tale sign: constant open/close activity; significant wait time
  Fix: mount -F nfs -o nocto
  Memory Requirement: N/A
  Drawback: file changes not seen immediately on other clients
  Applicability: NFS client
  Created: July 19 2001
  Revised:

This tuning will be effective for applications that go through open, write, close sequences, or for workloads that write and then close very many files.

By default a Sun NFS client will wait for its writes to complete to disk when closing a file. This ensures that the rest of the world will see changes to files that have been closed by an application. NFS need not guarantee such consistency, but it nevertheless will by default. To take advantage of the weaker consistency model provided by this feature you may use the undocumented flag:

  mount -F nfs -o nocto

The close(2) call will then complete without waiting for the data to be synchronized to disk. The adverse side-effect is that other clients will not see the changes until some unpredictable later time. This adverse effect is of course irrelevant if the files are mostly read and written from a single client. Beware that in other situations users will definitely notice the absence of close-to-open consistency (editing and compiling from different machines is one common case).

The benefit from this tuning is bounded by the time spent in close. From a truss -c output one can get the number of times close(2) was called, and a rule of thumb would be to associate one disk latency (0.01 second) with each call.

Recipe nocto: http://www.dfwsug.org/cookbook.html#nocto
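The rule of thumb for Recipe nocto (one disk latency per close(2) call) can be turned into a quick estimate of the upper bound on the benefit. This is a sketch only: the function names are hypothetical, and the close count and elapsed time in the example are made-up inputs standing in for numbers you would read from your own truss -c output.

```python
def nocto_time_bound(close_calls, disk_latency_s=0.01):
    """Upper bound, in seconds, on the time that mounting with nocto can save."""
    return close_calls * disk_latency_s


def nocto_fraction(close_calls, elapsed_s, disk_latency_s=0.01):
    """Bound expressed as a fraction of the total elapsed time (capped at 1.0)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return min(1.0, nocto_time_bound(close_calls, disk_latency_s) / elapsed_s)


if __name__ == "__main__":
    # Hypothetical run: 20000 close(2) calls in a job that took 600 seconds.
    closes, elapsed = 20000, 600.0
    print(f"at most {nocto_time_bound(closes):.0f}s saved "
          f"({nocto_fraction(closes, elapsed):.0%} of elapsed time)")
```

If the bound comes out as a small fraction of elapsed time, the loss of close-to-open consistency is unlikely to be worth it.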
Recipe RTFM
  Tell tale sign: I'm lost
  Fix: READ
  Memory Requirement: you're allowed to take notes
  Drawback: it takes time
  Created: July 19 2001
  Revised:

  NFS Performance and Tuning Guide for Sun Hardware (there is a need for an updated version)
  Solaris Tunable Parameters Reference Manual
  Richard McDougall's Web Site
  Jim Mauro, Richard McDougall, Solaris Internals: Core Kernel Architecture, ISBN 0-13-022496-0, (C) Prentice Hall, 2000
  Brent Callaghan, NFS Illustrated, ISBN 0-201-32570-5, (C) Addison-Wesley
  Solaris - Tuning Your TCP/IP Stack

Recipe RTFM: http://www.dfwsug.org/cookbook.html#RTFM
NOTE: kernel parameters

Kernel parameters can be changed in the /etc/system file. For example:

  set maxphys = 1048576
  set bufhwm = 50000
  set ncsize = 1000000
  set autoup = 30
  set tune_t_fsflushr = 5
  set ufs:ufs_HW = 8388608
  set ufs:ufs_LW = 5592405
  set nfs:nfs3_max_threads = 16
  set nfs:nfs3_nra = 8
  set rpcmod:clnt_max_conns = 4

You can find out your system's parameters with mdb (or adb) as root:

  $ mdb -k
  ncsize/1D
  ncsize:            33952
  ufs_ninode/1D
  ufs_ninode:        33952
  ufs_throttles/1D
  ufs_throttles:     200
  nfs3_max_threads/1d      <<<< note lowercase 'd' denotes a 16-bit integer
  nfs3_max_threads:  8

For bufhwm:

  /usr/sbin/sysdef | grep bufhwm

You may set some parameters with mdb -k -w. Kernel parameters are often used to initialize structures, some of them at boot time, which means that it is not always possible to change them dynamically on a running kernel. The following take immediate effect: ufs_HW.

NOTE: Page Cache

The UFS page cache uses the system memory as a cache for all filesystems. This means that written data is acknowledged to your applications very quickly (basically at memcpy speed). Read data that hits in the cache is also delivered at memcpy speed. Multiple writes to a file will result in fewer, bigger physical I/Os.

Write cancellation happens when there are multiple writes (from one or more processes) to a given file. The last write cancels previous ones, and this leads to fewer disk writes.

The existence of the page cache also means that if there is a catastrophic failure of the kernel (crash or power outage), in-flight data will not appear in the files after the reboot. However, upon closing a file (or on process exit) the outstanding data is synched to disk and you are not exposed to this effect. Only data written to open files is at risk. When using the page cache, the way to guarantee that data is sent to stable storage is to use one of the fsync(3C) and fdatasync(3RT) calls.

Thanks to the page cache, the size of I/O sent to devices (physical or logical) can be bigger than the application's I/O size.
This can be a very significant benefit of using the page cache.

NOTE: Monitoring Paging

Solaris 8 introduced the -p flag to vmstat, which breaks down paging statistics between executable, anonymous and filesystem data. A system that shows constant sustained paging in the anonymous or executable columns is very probably showing signs that there is not enough physical memory to sustain the given workload.

Paging statistics for filesystem data are harder to interpret because they do not include all pages. One cannot, for example, match the data seen by the disks (see NOTE: Monitoring Disk Activity) with the paging statistics shown by vmstat. If paging statistics are dominated by the filesystem columns, this can be a sign that the system is low on memory because the disk subsystem is not draining the I/O requests fast enough. Improving the I/O through tuning or additional hardware may be sufficient to heal the system.

As with most Solaris *stat commands, the first set of output represents the average since boot time.

NOTE: Monitoring Applications

An application's I/O size and access pattern can be checked by using truss(1). For example:

  $ truss -t read,write,pread,pwrite,aio,lseek -p <pid>

To see if I/O is important to the performance of a given process, one can use "truss -c" to get basic statistics on time spent in I/O:

  $ truss -c -t read,write,lseek,close,open -p <pid>

The wait time will often be associated with I/O. Understanding the pattern and size is critical to many tuning criteria.

NOTE: Monitoring Disk Activity

To measure your average physical I/O size, use the command iostat -xtc 1. Discard the first set of output from the iostat command (it shows averages since boot time). The kr/s column gives the amount of data read per second (in K) while the r/s column is the number of read operations per second on the device. By dividing these numbers you get the average device I/O size (likewise for writes using columns kw/s and w/s).
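The division described above is simple enough to script. The sketch below takes the kr/s and r/s (or kw/s and w/s) values as plain numbers; the function name and the sample figures are hypothetical, and no attempt is made to parse iostat output.

```python
def avg_io_size_kb(kb_per_s, ops_per_s):
    """Average physical I/O size in K: kr/s divided by r/s (or kw/s by w/s)."""
    if ops_per_s == 0:
        return 0.0  # idle device: no I/O to average
    return kb_per_s / ops_per_s


if __name__ == "__main__":
    # Hypothetical sample: 6400 K read per second in 200 read operations.
    print(f"average read size: {avg_io_size_kb(6400.0, 200.0):.1f}K")
    # Hypothetical sample: 20480 K written per second in 20 write operations.
    print(f"average write size: {avg_io_size_kb(20480.0, 20.0):.1f}K")
```

The first sample works out to 32K reads, the second to 1024K writes. Comparing such figures with the stripe unit and with maxphys shows whether the physical I/O size is where the recipes aim to put it.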
The tunings described in this guide are aimed at increasing the physical I/O size such that each disk gets I/Os of around a stripe unit and each metadevice gets cluster size (maxphys) I/Os. The actual numbers will depend on the workload mix, the I/O pattern, the kernel tunables (most notably those described in this document), and the Volume Management Software.

NOTE: Network tuning

These settings may help to achieve high throughput in synthetic tests. You should measure their effectiveness on your workload before implementing them in the general case. I apply them (blindly) on both sides of connections.

In /etc/system:

  set sq_max_size = 16      (max recommended of 100; default of 2)

With ndd:

  ndd -set /dev/tcp tcp_deferred_acks_max 16

  hiwat = 512000
  lowat = 2/3 * hiwat
  ndd -set /dev/tcp tcp_xmit_hiwat $hiwat
  ndd -set /dev/tcp tcp_xmit_lowat $lowat
  ndd -set /dev/tcp tcp_recv_hiwat $hiwat
  ndd -set /dev/tcp tcp_maxpsz_multiplier 64

KMEM: hiwat bytes per connection
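The watermark values above involve a small computation (lowat = 2/3 * hiwat). The sketch below expands it into the concrete ndd command lines listed in this note. The function name is hypothetical, and rounding lowat down to a whole number of bytes is an assumption; the commands are only printed, never executed.

```python
def tcp_watermark_commands(hiwat=512000):
    """Return the ndd command lines for a given hiwat, with lowat = 2/3 * hiwat."""
    lowat = hiwat * 2 // 3  # assumption: round down to an integer byte count
    return [
        f"ndd -set /dev/tcp tcp_xmit_hiwat {hiwat}",
        f"ndd -set /dev/tcp tcp_xmit_lowat {lowat}",
        f"ndd -set /dev/tcp tcp_recv_hiwat {hiwat}",
    ]


if __name__ == "__main__":
    for line in tcp_watermark_commands():
        print(line)
```

With the hiwat of 512000 used in this note, lowat comes out at 341333 bytes.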