UFS and NFS Performance CookBook
Tuning Tips for I/O Intensive Applications
Last Revised: November 14 2001
This page is: http://www.dfwsug.org/cookbook.html
This guide is a set of the most common tuning tips for a high performance
NFS server over UFS in an HPC environment. Some of the tips are fairly
new and none should be applied blindly. "Consider your workload and
measure your performance" is always good advice.
The tuning recommendations in this document are organised into
Recipes. For each one we try to identify a Tell Tale Sign that alerts
administrators to a situation in which the recipe may help. We then
describe the Fix (changing kernel parameters, mount options, etc.),
highlighting the potential Drawbacks of the recommendation as well as
the associated Memory Requirements.
The first recommendation to anybody trying to improve the performance of a
data intensive setup is to use Solaris 8 (SunOS 5.8) or above
on the server. The virtual memory system of Solaris 8 was extensively
reworked, and the server will make much better use of available memory. In
Solaris 8 there is no longer any need to tune the /etc/system parameters
priority_paging, cachefree, lotsfree and fastscan.
On pre-Solaris 8 machines, the tuning tips can still be valid, but
identifying when they apply is more complex and additional tuning of the
above parameters is required.
Your feedback is most welcome.
Table of Contents and Summary
Recipe ufs_HW: GBs of data written to a file
Tell tale Sign: ufs_throttles keeps increasing
Fix: increase ufs_HW
Memory Requirement: ~ MAX(maxphys,ufs_HW) * #active files
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised:
Recipe ufs_HW: http://www.dfwsug.org/cookbook.html#ufs_HW
Recipe nfsd: NFS Server daemons
Tell tale sign: number of active nfs threads equals nfsd param
Fix: kill and restart nfsd with more threads
Memory Requirement (KMEM): 16K of kernel stack + 1 NFS blocksize of data per active thread
Drawback: starvation of user's non-NFS related work
Applicability: NFS server
Created: July 19 2001
Revised: Nov 26 2001
Recipe nfsd: http://www.dfwsug.org/cookbook.html#nfsd
Recipe autoup1: System Working Set much bigger than System Memory
Tell tale sign: constant filesystem paging and low memory
Fix: decrease autoup, tune_t_fsflushr
Memory Requirement: saves system memory
Drawback: More and smaller disk I/O, higher system CPU time
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised:
Recipe autoup1: http://www.dfwsug.org/cookbook.html#autoup1
Recipe autoup2: Write intensive Working Set fits in memory
Tell tale sign: no filesystem paging, high disk writes, apps waiting
Fix: increase autoup, tune_t_fsflushr
Memory Requirement: 0
Drawback: file changes can be lost in case of failure
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised: Apr 02 2002
Recipe autoup2: http://www.dfwsug.org/cookbook.html#autoup2
Recipe stripe1: Applications doing lots of Large I/O (>>8K)
Tell tale sign: sustained disk activity, apps doing large read/write
Fix: stripe your volume, set UFS maxcontig and adjust maxphys
Memory Requirement: ~ MAX (maxphys, ufs_HW) * "# active file"
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised:
Recipe stripe1: http://www.dfwsug.org/cookbook.html#stripe1
Recipe stripe2: Many Small I/O (< 8K)
Tell tale sign: sustained disk activity; apps doing small read/write
Fix: stripe your volume, set UFS maxcontig and adjust maxphys
Memory Requirement: ~ MAX (maxphys, ufs_HW) * "# active file"
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised:
Recipe stripe2: http://www.dfwsug.org/cookbook.html#stripe2
Recipe stripe3: Working with an NFS server
Tell tale sign: average physical I/O sizes close to the NFS blocksize (32K)
Fix: See Recipe nfs3_blocksize
Memory Requirement: ufs_HW * "# active file"
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised: Nov 15 2001
Recipe stripe3: http://www.dfwsug.org/cookbook.html#stripe3
Recipe bufhwm: Large Active Filesystem (>>TB)
Tell tale sign: small hit rate in the buffer cache
Fix: increase bufhwm in /etc/system and reboot
Memory Requirement (KMEM): MIN ( Total File System / 2M, bufhwm K )
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised:
Recipe bufhwm: http://www.dfwsug.org/cookbook.html#bufhwm
Recipe directio: Application I/O size >1MB, Active File Set > Memory
Tell tale sign: high filesystem paging; apps doing large read/write
Fix: mount -o forcedirectio
Memory Requirement: saves system memory
Drawback: each and every read/write call goes to disk
Applicability: direct attached UFS fileserver
Created: July 19 2001
Revised:
Recipe directio: http://www.dfwsug.org/cookbook.html#directio
Recipe ncsize: Large Number of active files (>>10000)
Tell tale sign: low DNLC cache hit rate
Fix: increase ncsize
Memory Requirement (KMEM): ~ ncsize* 0.5K
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised:
Recipe ncsize: http://www.dfwsug.org/cookbook.html#ncsize
Recipe segmap_percent: Dedicated I/O server on large Dataset
Tell tale sign: small segmap cache hit rate
Fix: increase segmap_percent
Memory Requirement: up to segmap_percent of memory may be used for I/O
Drawback: application request for memory can cause paging storm
Applicability: Dedicated I/O server NFS or UFS
Created: Nov 13 2001
Revised:
Recipe segmap_percent: http://www.dfwsug.org/cookbook.html#segmap_percent
Recipe nfs3_max_threads: NFS Client Bursty Writes
Tell tale sign: application shows high write and high wait time
Fix: increase nfs3_max_threads
Memory Requirement (KMEM): nfs3_max_threads * 2 * NFS blocksize per filesystem
Drawback: the kernel threads will all be created even if they are of little use
Applicability: NFS client
Created: July 19 2001
Revised:
Recipe nfs3_max_threads: http://www.dfwsug.org/cookbook.html#nfs3_max_threads
Recipe nfs3_blocksize: Default NFS blocksize may throttle purely data intensive setup
Tell tale sign: large client read/write calls but physical I/O sizes smaller than stripe unit
Fix: adjust nfs3_bsize on client and nfs3_max_transfer_size on client and server
Memory Requirement (KMEM): nfs3_max_threads * 2 * NFS blocksize per filesystem
Drawback: larger blocks not suitable for small I/O
Applicability: NFS client and Server
Created: Nov 13 2001
Revised:
Recipe nfs3_blocksize: http://www.dfwsug.org/cookbook.html#nfs3_blocksize
Recipe nfs3_nra: NFS Client sequentially reads tens of MB at a time
Tell tale sign: read performance (~3MB/s) much lower than disk or network capacity
Fix: increase kernel parameter nfs3_nra to 8 or 16
Memory Requirement (KMEM): nfs3_nra * NFS blocksize (32K) per active file
Drawback: if readahead blocks go unused, network bandwidth and processing are wasted
Applicability: NFS client
Created: July 19 2001
Revised:
Recipe nfs3_nra: http://www.dfwsug.org/cookbook.html#nfs3_nra
Recipe freebehind: Large file does not get cached
Tell tale sign: repeated sequential reading of a file; causes I/O each time
Fix: increase kernel parameter smallfile or set freebehind to 0
Memory Requirement (KMEM): 0
Drawback: largefile caching displaces all other cached files
Applicability: UFS, NFS server
Created: March 25 2002
Revised:
Recipe freebehind: http://www.dfwsug.org/cookbook.html#freebehind
Recipe one_at_a_time: small I/O size even for large read/write calls
Tell tale sign: physical I/O size smaller than maxphys/maxcontig
Fix: files should be created one at a time, preferably not over NFS
Memory Requirement (KMEM): 0
Drawback: not always possible; takes more time
Applicability: UFS, NFS server
Created: March 25 2002
Revised:
Recipe one_at_a_time: http://www.dfwsug.org/cookbook.html#one_at_a_time
Recipe false_readahead: random reads (> ~8k) on client cause large physical reads
Tell tale sign: apps do small reads but physical I/O sizes are of size maxphys/maxcontig
Fix: decrease maxcontig or segregate workload to separate filesystem
Drawback: performance of large I/O will be reduced
Applicability: UFS, NFS Client
Created: Nov 13 2001
Revised: Mar 28 2002
Recipe false_readahead: http://www.dfwsug.org/cookbook.html#false_readahead
Recipe clnt_max_conns: NFS Client reads or writes tens of MB
Tell tale sign: write performance equal to single stream tcp test
Fix: Increase kernel parameter clnt_max_conns to MIN(4,NCPU)
Memory Requirement (KMEM): clnt_max_conns*tcp_max_buf
Drawback: timeouts on unused connections can lead to sluggish behavior
Applicability: NFS client
Created: July 19 2001
Revised:
Recipe clnt_max_conns: http://www.dfwsug.org/cookbook.html#clnt_max_conns
Recipe nocto: Apps do many open,write,close over NFS
Tell tale sign: constant open/close activity; significant wait time
Fix: mount -F nfs -o nocto
Memory Requirement: N/A
Drawback: file changes not seen immediately on other clients
Applicability: NFS client
Created: July 19 2001
Revised:
Recipe nocto: http://www.dfwsug.org/cookbook.html#nocto
Recipe RTFM: RTFM
Tell tale sign: I'm lost
Fix: READ
Memory Requirement: you're allowed to take notes
Drawback: it takes time
Created: July 19 2001
Revised:
NOTE: kernel parameters
NOTE: Page Cache
NOTE: Monitoring Paging
NOTE: Monitoring Applications
NOTE: Monitoring Disk Activity
NOTE: Network tuning
Recipe ufs_HW: GBs of data written to a file
Tell tale Sign: ufs_throttles keeps increasing
Fix: increase ufs_HW
Memory Requirement: ~ MAX(maxphys,ufs_HW) * #active files
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised:
Note: kernel parameters
For each file, UFS keeps track of the number of bytes of data being
written to disk. Those are bytes in transit between the page cache
and the disks. When this amount exceeds the threshold ufs_HW,
subsequent write(2) calls are blocked until enough of the I/O operations
complete. Note that what counts against ufs_HW is data
being written from the page cache to the disk device, but the
operations that get blocked are the writes trying to put more
data into the page cache.
One can disable throttling by setting ufs:ufs_WRITES=0 in /etc/system.
It may be more prudent to set the ufs_HW/ufs_LW parameters to values that
should limit the adverse condition:
ufs_HW should be set to many times maxphys
ufs_LW should be 2/3 of ufs_HW
When throttling happens, a process is blocked for a time of the order
of a physical write, say 0.01s. This means that a process can achieve
of the order of ufs_HW/0.01s or 100*ufs_HW bytes/s. With ufs_HW=8M, a
process may not be able to output more than 800MB/sec. The default
of 384K throttles a process at around 38MB/sec. As soon as you get a few
disks in a stripe this limit is lurking.
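The arithmetic above can be reproduced with a short sketch. It assumes, as the text does, that a throttled writer waits about 0.01s per physical write; the function names are illustrative only and are not kernel interfaces.

```python
# Rough per-process write ceiling imposed by UFS write throttling.
# Assumption (from the text): a throttled writer waits about one physical
# write, taken here as 0.01 s, and can push roughly ufs_HW bytes per wait.

KB = 1024
MB = 1024 * KB


def throttle_ceiling(ufs_hw_bytes, write_latency_s=0.01):
    """Approximate bytes/second a single process can write to one file."""
    return ufs_hw_bytes / write_latency_s


def suggested_ufs_lw(ufs_hw_bytes):
    """Low water mark at 2/3 of the high water mark, as recommended above."""
    return (2 * ufs_hw_bytes) // 3


for hw in (384 * KB, 8 * MB):
    ceiling_mb = throttle_ceiling(hw) / MB
    print(f"ufs_HW={hw} bytes: ceiling about {ceiling_mb:.1f} MB/s, "
          f"ufs_LW={suggested_ufs_lw(hw)} bytes")
```

The real latency of a physical write depends on the disk subsystem, so measure it before trusting the resulting figure.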
Recipe ufs_HW: http://www.dfwsug.org/cookbook.html#ufs_HW
Warning: If you are not reading this on the official web site, please check the site to see if the tip has been updated.
Recipe nfsd: NFS Server daemons
Tell tale sign: number of active nfs threads equals nfsd param
Fix: kill and restart nfsd with more threads
Memory Requirement (KMEM): 16K of kernel stack + 1 NFS blocksize of data per active thread
Drawback: starvation of user's non-NFS related work
Applicability: NFS server
Created: July 19 2001
Revised: Nov 26 2001
The number of active nfs threads cannot be estimated easily. Below is a
command that will give you a hint, but it MUST NOT be used on a production
system, especially on an E10K (see bugids 4305932 & 4344513). I strongly
discourage its use apart from test configurations.
#!/bin/csh
echo '$<threadlist' | mdb -k |& grep svc_run | grep -v grep | wc -l
Note: nfsd threads are one type of thread that runs through svc_run.
The script /etc/init.d/nfs.server sets the maximum number of nfsd
server threads that can run at once. Those threads are created and
destroyed dynamically in the kernel. They each consume 16K of kernel
stack and most likely handle one NFS filesystem block. They also run
at a higher priority than the timeshare class.
Since the memory requirement is small and dynamic, there are no big
drawbacks to setting this value much higher on an NFS server. Note
though that for an NFS server that is also an application server this
can lead to applications being starved for CPU.
Edit /etc/init.d/nfs.server :
/usr/lib/nfs/nfsd -a 1024
Then kill and restart the daemon:
/etc/init.d/nfs.server stop
/etc/init.d/nfs.server start
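As a rough check of the memory requirement stated for this recipe, the sketch below multiplies the number of active threads by 16K of kernel stack plus one NFS block. The 32K blocksize is the default quoted elsewhere in this guide; the function name is illustrative only.

```python
# Worst-case kernel memory if every nfsd thread is active at once.
# Assumption: each active thread holds 16K of stack plus one 32K NFS block.

KB = 1024
MB = 1024 * KB


def nfsd_kmem_bytes(active_threads, nfs_blocksize=32 * KB, stack=16 * KB):
    """Kernel memory used by the given number of active nfsd threads."""
    return active_threads * (stack + nfs_blocksize)


print(f"1024 active threads: {nfsd_kmem_bytes(1024) // MB} MB of kernel memory")
```

Since threads are created and destroyed dynamically, this is an upper bound reached only when all threads are busy.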
Recipe nfsd: http://www.dfwsug.org/cookbook.html#nfsd
Recipe autoup1: System Working Set much bigger than System Memory
Tell tale sign: constant filesystem paging and low memory
Fix: decrease autoup, tune_t_fsflushr
Memory Requirement: saves system memory
Drawback: More and smaller disk I/O, higher system CPU time
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised:
Note: kernel parameters
Here the filesystem page cache is not fully effective (NOTE: Page Cache).
"vmstat -p" shows constant filesystem activity and low memory. If
your disk subsystem is not saturated, then there is hope of improving
performance by sending more data to it, thus freeing memory. If your
disk subsystem is already saturated then the workload is just too
big. Adding memory or disks is the only hope.
The kernel autoup parameter is an indication to the kernel of how
long it can wait before syncing data to disk. You will want to
reduce autoup to a smaller value. It's a good idea to keep the
ratio
(autoup / tune_t_fsflushr )
at its default value of 6.
The effect sought is to make fsflush handle the I/O
instead of the page scanner. The filesystem paging reported by
"vmstat -p" should decrease; disk I/O and free memory should increase.
The freed memory should lead to better overall response.
You should also consider Recipe directio.
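Keeping the autoup / tune_t_fsflushr ratio at 6 can be sketched as follows. The helper name is hypothetical, and the printed lines assume the usual "set name = value" form of /etc/system entries; check them against your own system before use.

```python
# Derive tune_t_fsflushr from a chosen autoup while keeping the ratio at 6.
# Assumption: both values are whole seconds and tune_t_fsflushr must be >= 1.

def fsflush_settings(autoup, ratio=6):
    """Return a matching (autoup, tune_t_fsflushr) pair."""
    tune_t_fsflushr = max(1, autoup // ratio)
    return autoup, tune_t_fsflushr


for wanted in (30, 12, 900):
    autoup, flushr = fsflush_settings(wanted)
    print(f"set autoup = {autoup}")
    print(f"set tune_t_fsflushr = {flushr}")
```

A value of 30 gives back the default pair (30, 5); smaller values suit this recipe, larger ones suit Recipe autoup2.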
Recipe autoup1: http://www.dfwsug.org/cookbook.html#autoup1
Recipe autoup2: Write intensive Working Set fits in memory
Tell tale sign: no filesystem paging, high disk writes, apps waiting
Fix: increase autoup, tune_t_fsflushr
Memory Requirement: 0
Drawback: file changes can be lost in case of failure
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised: Apr 02 2002
Note: kernel parameters
In this situation, your application data is constantly being
written and rewritten to files that are cached in memory. However, a
small autoup is putting a load on your disk subsystem, which is not
able to keep up with the demand, and your application is waiting
on I/O (even if this I/O is going to cache).
So:
1- "vmstat -p" shows near 0 filesystem activity
2- "iostat -xtc" shows constant disk write activity
3- "truss -c <pid>" shows high wait time
By increasing autoup you decrease the amount of data sent to disk,
increase the chance of issuing larger I/O and benefit from write
cancellation. On the other hand you slightly increase the chance of
losing data in case of server failure.
The default value of autoup is 30 seconds.
If the accessed data fits globally in memory then your page
cache is very effective and you should increase autoup (e.g. 900) and
tune_t_fsflushr (e.g. 150).
You will benefit from this tuning if your disk subsystem is not able
to keep up with your I/O requirements. Say your working set is 30GB
and fits in your available memory. With an autoup of 30, fsflush
will try to sync 30GB of data every 30 sec, requiring a rate of
1GB/sec from the disk subsystem. If your subsystem cannot achieve
this bandwidth, your application will be throttled and see some wait time.
By increasing autoup, you should decrease the amount of I/O actually
sent to disk and reduce the application wait time.
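The 30GB example can be turned into a small calculation. This is a worst-case sketch that assumes the whole working set is dirtied during every autoup interval; real workloads usually rewrite only part of it, and the function name is illustrative.

```python
# Disk bandwidth fsflush would need if the entire working set were dirty
# and had to be written once per autoup interval (worst case).

MB = 1024 * 1024
GB = 1024 * MB


def required_flush_rate(working_set_bytes, autoup_s):
    """Bytes/second the disk subsystem must absorb in the worst case."""
    return working_set_bytes / autoup_s


for autoup in (30, 900):
    rate_mb = required_flush_rate(30 * GB, autoup) / MB
    print(f"autoup={autoup}s: up to {rate_mb:.0f} MB/s")
```

Compare the result with what "iostat -xtc" shows your disks actually sustaining before choosing a value.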
As a side note, on a CPU bound system this tuning can also help
performance by freeing up the CPU time used by fsflush and avoiding the
scheduling perturbation induced by fsflush.
Recipe autoup2: http://www.dfwsug.org/cookbook.html#autoup2
Recipe stripe1: Applications doing lots of Large I/O (>>8K)
Tell tale sign: sustained disk activity, apps doing large read/write
Fix: stripe your volume, set UFS maxcontig and adjust maxphys
Memory Requirement: ~ MAX (maxphys,ufs_HW) * "# active file"
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised:
Here the system shows sustained disk activity; the applications are
issuing large (>>8k) read or write calls.
NOTE: Monitoring disk activity
NOTE: Monitoring applications
See Recipe stripe2 in conjunction with this one in case of small I/O.
The tuning described here requires considering these 4 areas:
1- building a striped volume with a correct interlace
2- tuning the filesystem cluster size
3- understanding how the kernel works on page boundaries
4- having apps issue read/write calls that cause sequential page accesses
To get high disk throughput in this context you want to send as big
a chunk as possible to each disk. Considering a disk capable of 100
IOPS (physical I/Os per second) and 10MB/sec, you want to try to send
to each disk on the order of 100K ((Disk Bandwidth MB/s)/(Disk IOPS))
per operation.
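The rule of thumb above is simply bandwidth divided by IOPS. A minimal sketch, using the example disk from the text and the 40MB/sec disk mentioned later in this recipe (the function name is illustrative):

```python
# Target size of each physical I/O so that a disk limited to a given number
# of operations per second can still deliver its full bandwidth.

KB = 1024
MB = 1024 * KB


def target_io_size(disk_bandwidth_bytes_per_s, disk_iops):
    """Bytes per operation needed to reach the disk bandwidth."""
    return disk_bandwidth_bytes_per_s / disk_iops


for bw_mb in (10, 40):
    size_kb = target_io_size(bw_mb * MB, 100) / KB
    print(f"{bw_mb} MB/s disk at 100 IOPS: about {size_kb:.0f}K per operation")
```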
1- When building a volume, the contiguous chunk given to a disk before
switching to the next one is the interlace factor. Solaris Volume
Manager (formerly known as DiskSuite) defaults to 64K; for exclusively
sequential, multi-stream (multi-threaded or multi-process)
workloads on a small number of disks this could be tuned to 128K or
256K. A bigger interlace allows a bigger throughput per individual
disk. The drawback of using too large a value is that you will only
see the performance boost of striping when using I/O sizes larger
than the stripe size or on highly concurrent workloads. 64K is a good
compromise; it is the smallest size that allows reaching a good
portion of available disk bandwidth while keeping many disks busy.
For disks capable of 40MB/sec, the interlace should definitely be
increased (up to 256K).
2- The filesystem software internally maintains the notion of read and
write clusters. For UFS, both read and write cluster sizes are set to
the largest of maxphys and maxcontig. Since maxphys is expressed
in bytes and maxcontig in 8K filesystem blocks:
rd/wr cluster size = MAX ( maxphys, maxcontig * 8192 )
The maxcontig parameter is set either when constructing a new
filesystem with "newfs -C" or on an already built filesystem with
"tunefs -a". A common value of maxcontig is 128, which tells the
filesystem to work with 1MB clusters. Larger values are
possible but in that case you should maintain maxphys <= maxcontig*8K
and experiment at your own risk (Note: kernel parameters).
Clusters of 1MB should allow a single stream to drive 16 disks at
close to 100MB/sec.
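The cluster size formula, and how many disks one cluster touches, can be sketched as below. The 128K maxphys and the 64K interlace used in the example are assumptions chosen for illustration; substitute the values from your own system.

```python
# UFS read/write cluster size and the number of stripe units it spans.
# Assumptions for the example: maxphys = 128K, interlace = 64K.

KB = 1024
MB = 1024 * KB


def cluster_size(maxphys_bytes, maxcontig_blocks, fs_block=8192):
    """rd/wr cluster size = MAX(maxphys, maxcontig * 8192)."""
    return max(maxphys_bytes, maxcontig_blocks * fs_block)


def disks_per_cluster(cluster_bytes, interlace_bytes):
    """How many stripe units (hence disks) one full cluster covers."""
    return cluster_bytes // interlace_bytes


cluster = cluster_size(128 * KB, 128)
print(f"cluster size: {cluster // KB}K")
print(f"disks driven by one cluster at 64K interlace: "
      f"{disks_per_cluster(cluster, 64 * KB)}")
```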
3- The kernel needs to fetch data from disk or push data to disk in
response to your applications' requests.
This happens on a page by page basis. If pages are accessed in
sequential order, the kernel considers that the user is doing
sequential access to the file. Note that in cases where the
applications do random (small) I/O but the accessed pages are contiguous,
clustered reads will still work.
On output, data written to a given file is accumulated page by page
until the amount of data reaches the cluster size or until a page not
contiguous with the current cluster is accessed. At that point, a
physical I/O is issued.
4- Your applications must issue large read/write system calls
(otherwise see Recipe stripe2). Verify with:
truss -t read,write,pread,pwrite,kaio -p <pid>
It is also interesting to compute the average physical I/O size seen
by the system (NOTE: Monitoring Disk Activity).
In a good situation, the I/O seen by volumes should approach the
cluster size and the physical I/O seen by individual disks should be
close to the stripe unit (the interlace).
Recipe stripe1: http://www.dfwsug.org/cookbook.html#stripe1
Recipe stripe2: Many Small I/O (< 8K)
Tell tale sign: sustained disk activity; apps doing small read/write
Fix: stripe your volume, set UFS maxcontig and adjust maxphys
Memory Requirement: ~ MAX (maxphys, ufs_HW) * "# active file"
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised:
You can get the benefit of Recipe stripe1 even if your applications do
small (<8K) read(2) and write(2) calls. However, this requires that the
files are large (>>1MB; they are implicitly large in the case of
large I/O), accessed sequentially and that the system page cache is
effective (NOTE: Page Cache). In this situation the I/O size sent
to the volume may be as large as the cluster size.
For example, if an app does a long series of 1K writes, the pageout
will be able to kluster up to maxcontig blocks and issue one large
physical I/O. The average physical I/O size should then be closer to
the cluster size than to 1K (NOTE: Monitoring Disk Activity).
Recipe stripe1 will then apply.
A different situation occurs for small files or non-sequential
access; then the physical I/O size will be more related to the file
sizes and the throughput of the disk subsystem will
degrade. A single threaded process doing a backup of 8K files will
not be able to achieve more than 800 K/sec no matter how many disks
there are in the disk array. More performance can be obtained by using
multiple streams (by threading the application or using asynchronous
I/O calls) but the maximum sustained throughput when the average file
size is smaller than the interlace can be estimated as:
MIN( #streams, #disks ) * AVG file size * 100/second
The memory requirement falls drastically to #streams * AVG file size.
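The estimate above is easy to evaluate for a few configurations. The sketch assumes, like the formula, about 100 physical I/Os per second per disk; the function names are illustrative.

```python
# Throughput ceiling when files are smaller than the interlace:
# MIN(#streams, #disks) * average file size * 100 I/Os per second.

KB = 1024


def small_file_throughput(streams, disks, avg_file_bytes, ios_per_s=100):
    """Estimated maximum sustained bytes/second."""
    return min(streams, disks) * avg_file_bytes * ios_per_s


def small_file_memory(streams, avg_file_bytes):
    """Memory in flight: #streams * average file size."""
    return streams * avg_file_bytes


for streams in (1, 4, 32):
    rate_kb = small_file_throughput(streams, 16, 8 * KB) // KB
    print(f"{streams} stream(s) over 16 disks, 8K files: {rate_kb} K/sec")
```

With a single stream the result is the 800 K/sec quoted above, whatever the number of disks.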
Recipe stripe2: http://www.dfwsug.org/cookbook.html#stripe2
Recipe stripe3: Working with an NFS server
Tell tale sign: average physical I/O sizes close to the NFS blocksize (32K)
Fix: See Recipe nfs3_blocksize
Memory Requirement: ufs_HW * "# active file"
Applicability: NFS fileserver
Created: July 19 2001
Revised: Nov 15 2001
Large NFS requests are broken up into chunks of a certain size. That
chunk is the NFS blocksize; it is negotiated between the client and
the server at mount time and defaults, on Sun, to a payload of 32K of
data (see Recipe nfs3_blocksize). Also, the Solaris NFS client is
multi-threaded in the kernel (see Recipe nfs3_max_threads). A single
large NFS read or write will be worked on by multiple concurrent
kernel threads. Those concurrent requests issued on the client will
arrive on the server side in no particular order, which hampers
the clustering done by the server.
Considering the above, it is somewhat difficult for an NFS server to
issue physical I/Os that are much larger than the blocksize for
writes. This applies to the NFS v3 protocol. NFS v2 uses 8K blocks and
all requests are synchronous.
The NFS server will still attempt to cluster, so using a large
interlace (64K-256K) can still be a good thing. But an NFS v3 server
that reaches approximately 4 MB/sec/disk (one 32K block per 0.01s of
disk latency, per disk) is close to saturation.
Note that while clustering is considerably modified when going through
NFS, the UFS throttling is not. This is why ufs_HW should be tuned on
an NFS server. See Recipe ufs_HW.
Recipe stripe3: http://www.dfwsug.org/cookbook.html#stripe3
Recipe bufhwm: Large Active Filesystem (>>TB)
Tell tale sign: small hit rate in the buffer cache
Fix: increase bufhwm
Memory Requirement: MIN ( Total File System / 2M, bufhwm K )
Applicability: UFS or NFS fileserver
Drawback: may consume memory for little benefit
Created: July 19 2001
Revised:
You should expect to tune bufhwm from its default value if your
filesystem is more than 40000 times your system memory (e.g. 1GB system
and 40TB filesystem). The real tell tale sign is a small hit ratio on
the buffer cache during periods of high activity:
"sar -b 1 10" shows %rcache or %wcache < 90%
A maximum of bufhwm KB of kernel memory is used to cache metadata
information (e.g. block indirection data). bufhwm defaults to 2% of
system memory and cannot be more than 20%. The bufhwm configured on
your system can be obtained with "/usr/sbin/sysdef | grep bufhwm".
The requirement for bufhwm should be:
'Sum Total of Active Filesystem Size' / 2M.
For a 100GB filesystem then configure 50MB of "bufhwm" kernel memory
and set bufhwm = 50000 (in units of K) Note: kernel parameters.
The value of bufhwm is a high water mark in the sense that the kernel
will consume memory, if required, up to this mark. Note that files that
are used even just once (for example during a full filesystem backup)
will contribute their metadata to the buffer cache. In practice this
means that the kernel memory consumed will be
KMEM = MIN ( Total File System / 2M, bufhwm K )
Recipe bufhwm: http://www.dfwsug.org/cookbook.html#bufhwm
Recipe directio: Application I/O size >1MB, Active File Set > Memory
Tell tale sign: high filesystem paging; apps doing large read/write
Fix: mount -o forcedirectio
Memory Requirement: saves system memory
Drawback: each and every read/write call goes to disk
Applicability: direct attached UFS fileserver
Created: July 19 2001
Revised:
In this scenario:
1- "vmstat -p" shows constant filesystem paging, indicating an I/O
intensive workload on a low memory system,
2- truss of your favorite apps shows that the issued read
and write calls are almost exclusively > 1MB,
3- and your disk subsystem is capable of sustaining of the order
of 50MB/sec or more,
then there is little cache reuse and the time taken to write to the page
cache is comparable to the time taken to write to disk. In this
situation the use of a page cache implies an extra copy, uses memory
and the usual benefits do not apply. When all those conditions are
present the page cache should be turned off with the forcedirectio
mount option (mount -o forcedirectio). The usual case here is a
workload explicitly developed for a directio filesystem.
Note that a small write will be slowed by a factor of about 10000. A
64K write would be slowed by a factor of 100. So this truly applies
to applications issuing exclusively large I/O. Also, with this tuning,
if 2 applications read a given file, they will each be issuing disk
I/O for that file as opposed to having the second one read from the
memory based page cache.
The UFS page cache (NOTE: Page Cache) is ON by default and one needs
to mount with the forcedirectio option to disable the page cache in
the situations that require it.
This recipe can either solve your problem or degrade performance even
more. It all depends on access pattern and sizes. If you can
influence the size of the read and write calls, make them bigger (up
to 1M).
If only a specific application (rather than the global workload)
would benefit from directio and if that application's data cannot be
segregated on a different directio filesystem, then using the
directio(3C) call in this application should be considered.
Recipe directio: http://www.dfwsug.org/cookbook.html#directio
Recipe ncsize: Large Number of active files (>>10000)
Tell tale sign: low DNLC cache hit rate
Fix: increase ncsize
Memory Requirement (KMEM): ~ ncsize* 0.5K
Applicability: UFS or NFS fileserver
Created: July 19 2001
Revised: Nov 14 2001
In this situation "vmstat -s | grep lookups" shows a cache hit ratio
smaller than 90%. Also see "kstat -n dnlcstats".
Filenames and inode information are cached in the kernel. The size of
the cache is controlled by the ncsize kernel parameter. This tuning
should be applied when a system is showing a low (<90%) DNLC cache
hit rate; ufs_ninode will automatically be adjusted at boot time to ncsize.
Note: kernel parameters.
You will consume in kernel memory a few bytes per ncsize entry and 0.5K
per ufs_ninode entry. For a Solaris 8 machine, the default settings
should set both ncsize and ufs_ninode to
ncsize = ufs_ninode = total memory / 16K
This means that up to 1/32nd of memory may be used for this cache
when running with default settings.
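The default sizing and its memory cost can be checked with a short sketch. It uses only the figures given above (total memory / 16K entries, about 0.5K of kernel memory per entry); the function names and the 4GB example machine are illustrative.

```python
# Default ncsize / ufs_ninode and the kernel memory the inode cache may use.
# Assumption (from the text): about 0.5K of kernel memory per ufs_ninode entry.

KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def default_ncsize(total_memory_bytes):
    """ncsize = ufs_ninode = total memory / 16K."""
    return total_memory_bytes // (16 * KB)


def inode_cache_kmem(ncsize, bytes_per_entry=512):
    """Approximate kernel memory used when the cache is full."""
    return ncsize * bytes_per_entry


mem = 4 * GB
n = default_ncsize(mem)
print(f"ncsize = {n}, cache may use {inode_cache_kmem(n) // MB} MB "
      f"(1/{mem // inode_cache_kmem(n)} of memory)")
```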
Recipe ncsize: http://www.dfwsug.org/cookbook.html#ncsize
Recipe segmap_percent: Dedicated I/O server on large Dataset
Tell tale sign: small segmap cache hit rate
Fix: increase segmap_percent
Memory Requirement: up to segmap_percent of memory may be used for I/O
Drawback: application request for memory can cause paging storm
Applicability: Dedicated I/O server NFS or UFS
Created: Nov 13 2001
Revised:
It is well known that most of the memory can be used to cache
filesystem data. This leads to important performance improvements when
accessing this data. However, only a portion of memory is actually
readily mapped in the kernel in "segmap" to be the target of an
actual I/O. For a read or write call, whether or not the data is in
segmap can cause a performance difference of approximately 20%.
Solaris 8 introduced a new kernel parameter called segmap_percent
that controls the size of segmap. Prior to Solaris 8 the segment had
a fixed size of 256M. With Solaris 8, segmap is sized as a
portion of free memory after boot, with a default value of 12%.
The cache hit rate in segmap can be computed from the output of this
command: "kstat unix:0:segmap:get\* 1"
The segmap efficiency since boot is
(get_reclaim + get_use) / getmap
The interesting measure is the rate of increase (the difference
between 2 successive sets of data from the kstat command) of:
getmap - (get_reclaim + get_use)
which is a number of pages per second that may be affected by this
tuning. If this number is small, no benefit should be expected. We
can associate an approximately 20% increase in performance with those
pages if and only if we manage to keep them in segmap by increasing
segmap_percent. There is no guarantee that increasing segmap will
cause the pages to be in segmap. This is workload dependent. For
example, if the number increases by 1000 pages per second, then that
corresponds to 1000 8K pages or 8MB/sec of I/O that may benefit by 20%
if increasing segmap_percent causes the data to become fully mapped
in segmap.
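Both measures can be computed from two successive kstat samples. In the sketch below the counter values are made up for illustration and the samples are plain dictionaries; reading them from the kstat output is left out.

```python
# Segmap hit ratio since boot and miss rate between two kstat samples.
# The sample values are hypothetical; real ones come from
# "kstat unix:0:segmap:get\* 1".

PAGE = 8192  # 8K pages, as in the text


def segmap_hit_ratio(sample):
    """(get_reclaim + get_use) / getmap."""
    return (sample["get_reclaim"] + sample["get_use"]) / sample["getmap"]


def segmap_misses(sample):
    """getmap - (get_reclaim + get_use)."""
    return sample["getmap"] - (sample["get_reclaim"] + sample["get_use"])


def segmap_miss_rate(prev, curr, interval_s):
    """Pages per second that were not found in segmap."""
    return (segmap_misses(curr) - segmap_misses(prev)) / interval_s


prev = {"getmap": 500000, "get_reclaim": 300000, "get_use": 100000}
curr = {"getmap": 505000, "get_reclaim": 303000, "get_use": 101000}

rate = segmap_miss_rate(prev, curr, 1)
print(f"hit ratio since boot: {segmap_hit_ratio(curr):.2f}")
print(f"miss rate: {rate:.0f} pages/s, about {rate * PAGE / 1e6:.1f} MB/s")
```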
On a dedicated I/O server it may be beneficial to increase this value.
This actually consumes little additional memory for segmap
structures (< 1%) but it should be noted that the segmap portion of
the filesystem cache is not considered free memory. The non-segmapped
portion is considered free memory.
So it is rather important to keep a portion of memory free and
available. Setting segmap_percent to a large portion (> 50%) of
memory can definitely create paging storms causing serious
performance degradation.
Recipe segmap_percent: http://www.dfwsug.org/cookbook.html#segmap_percent
Recipe nfs3_max_threads: NFS Client Bursty Writes
Tell tale sign: application shows high write and high wait time
Fix: increase nfs3_max_threads
Memory Requirement (KMEM): nfs3_max_threads * 2 * NFS blocksize per filesystem
Drawback: the kernel threads will all be created even if they are of little use
Applicability: NFS client
Created: July 19 2001
Revised:
Note: kernel parameters
A target application here cycles between NFS write intensive and
non I/O intensive operations, both being significant contributors to
total elapsed time.
NFS operations for a mount point are spread over up to
nfs3_max_threads kernel threads. The Solaris 8 default value for this
parameter is 8. The client thus caches up to nfs3_max_threads * 2 *
blocksize (default 32K) of written data before blocking processes on
writes to the same NFS mount point. Increasing nfs3_max_threads
allows applications to proceed through the I/O phase without being
throttled by NFS and network performance. The I/O can then be
executed by the kernel concurrently with the application's non I/O
related processing.
nfs3_max_threads should be kept to a reasonable value, smaller than
approximately 8*NCPU. Note that the nfs3_max_threads limit is per NFS
filesystem.
Consider an application that loops through a series of 1MB writes,
each followed by a compute phase that lasts 1 sec. This application
only requires an average of 1MB/s of network bandwidth. With the
default nfs3_max_threads of 8 and an NFS blocksize of 32K the client
is able to buffer 2 * 8 * 32K = 512K of data without throttling
applications. This means that the application doing a 1MB write will
block and the I/O operation will proceed at the speed of the network
or server side disk (whichever is slowest). On a 10MB/s connection,
0.1 second will go to the I/O phase followed by 1 second of
computation, for a 1.1s cycle time.
On the other hand, if one sets nfs3_max_threads to 16 then the
I/O operations will be fully buffered on the client side in the
kernel. The application will see memcpy speed of more than 100MB/sec.
The 1MB of data will be handled in less than 0.01 sec. The
application will then proceed to its non-I/O part of 1 sec while the
kernel processes the I/O asynchronously. The cycle time will be
1.01s, roughly a 9% speedup. What has been achieved in this tuning is
parallelism between the data transfer over the wire (done by the
kernel) and the application processing.
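The cycle-time arithmetic of this example can be sketched in Python. The model is deliberately crude and follows the text: a write that fits in the client buffer costs memcpy time, any other write is charged entirely at network speed. The function names are illustrative, and the 100MB/s memcpy figure is simply the lower bound quoted above.

```python
KB = 1024
MB = 1024 * KB


def client_write_buffer(max_threads, blocksize=32 * KB):
    """Bytes the client can cache per mount point: threads * 2 * blocksize."""
    return 2 * max_threads * blocksize


def cycle_time(write_bytes, compute_s, max_threads,
               net_bw=10 * MB, memcpy_bw=100 * MB, blocksize=32 * KB):
    """Seconds for one write phase plus one compute phase (simplified)."""
    if write_bytes <= client_write_buffer(max_threads, blocksize):
        io_s = write_bytes / memcpy_bw   # fully buffered in the kernel
    else:
        io_s = write_bytes / net_bw      # application blocks on the wire
    return io_s + compute_s


for threads in (8, 16):
    print("nfs3_max_threads =", threads,
          "buffer =", client_write_buffer(threads) // KB, "K",
          "cycle =", round(cycle_time(1 * MB, 1.0, threads), 3), "s")
```

The same function can be used to check how large nfs3_max_threads must be for a given write size before the write phase stops blocking.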
An application must keep the file descriptor open for this tuning to
work. If the application closes the file descriptor at the end of
every I/O phase then Recipe nocto must be implemented as well.
This tuning will not help increase the maximum sustained NFS
performance but only application performance in situations where NFS
has not saturated the network or disk subsystem.
Recipe nfs3_max_threads: http://www.dfwsug.org/cookbook.html#nfs3_max_threads
Recipe nfs3_blocksize: Default NFS blocksize may throttle purely data intensive setup
Tell tale sign: large client read/write call but physical I/O of sizes smaller than stripe unit
Fix: adjust nfs3_bsize on client and nfs3_max_transfer_size on client and server
Memory Requirement (KMEM): nfs3_max_threads * 2 * NFS blocksize per filesystem
Drawback: larger blocks not suitable for small I/O
Applicability: NFS client and Server
Created: Nov 13 2001
Revised:
For a data-intensive setup over NFS, the default NFS3 transfer size
can cause suboptimal utilisation of the disk subsystem. Requests are
broken up in blocksize chunks and this governs the size of physical
I/Os. The 32K default for NFSv3 Sun clients and Servers can cause
disks to saturate at 3-5 MB/sec.
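One way to see where a figure of this order can come from is to multiply the I/O size by the number of operations a disk completes per second. In the Python sketch below, the 100 to 150 operations per second are an assumed figure for a single disk, not a value from this guide, and the 256K size is only an example of a larger block.

```python
def disk_throughput_mb_s(io_size_kb, ops_per_s):
    """Throughput in MB/s (1 MB = 1024 K) of a disk doing fixed-size I/Os."""
    return io_size_kb * ops_per_s / 1024


# Assumed operation rates for one disk (hypothetical values).
for ops in (100, 150):
    print(ops, "ops/s at 32K:", disk_throughput_mb_s(32, ops), "MB/s")

# Same assumed operation rate with a larger, 256K I/O size.
print("100 ops/s at 256K:", disk_throughput_mb_s(256, 100), "MB/s")
```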
To get a larger blocksize than 32K one needs to change the client
side nfs3_bsize and both the client and server
nfs3_max_transfer_size. It is not useful to change the mount
size options (rsize, wsize). To get a smaller blocksize than 32K it is
sufficient to change the client nfs3_bsize parameter.
A reasonable value for the blocksize when working on a data intensive
setup would be one that matches the stripe unit. Physical I/O size to
individual disks should then not be limited by the NFS framework.
Note: UDP cannot use a blocksize larger than 64K and we actually do not
recommend changing the blocksize when working with NFS over UDP.
Recipe nfs3_blocksize: http://www.dfwsug.org/cookbook.html#nfs3_blocksize
Recipe nfs3_nra: NFS Client sequentially reads tens of MB at a time
Tell tale sign: read performance (~3MB/s) much lower than disk or network capacity
Fix: increase kernel parameter nfs3_nra to 8 or 16
Memory Requirement (KMEM): nfs3_nra * NFS blocksize (32K) per active file
Drawback: if readahead blocks go unused the network BW and processing is wasted
Applicability: NFS client
Created: July 19 2001
Revised:
nfs3_nra is the number of NFS blocks that a client tries to
read ahead. If the latency of an RPC call is, say, 0.01 seconds, the
maximum throughput that can be achieved for a read call using the
default blocksize will be nfs3_nra*32K/0.01s.
You also need to keep nfs3_max_threads >= nfs3_nra.
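The throughput bound above is easy to tabulate. This Python sketch evaluates nfs3_nra * blocksize / latency for a few readahead values and checks the nfs3_max_threads >= nfs3_nra condition; the function names and the sample values are illustrative only.

```python
def max_read_mb_s(nfs3_nra, rpc_latency_s=0.01, blocksize_kb=32):
    """Upper bound on read throughput in MB/s (1 MB = 1024 K)."""
    return nfs3_nra * blocksize_kb / 1024 / rpc_latency_s


def threads_cover_readahead(nfs3_max_threads, nfs3_nra):
    """True when nfs3_max_threads >= nfs3_nra, as the recipe requires."""
    return nfs3_max_threads >= nfs3_nra


for nra in (1, 8, 16):
    print("nfs3_nra =", nra, "->", round(max_read_mb_s(nra), 2), "MB/s max")

# With 8 threads, a readahead of 16 would violate the condition.
print(threads_cover_readahead(8, 16))
```

A readahead of a single 32K block at 0.01s latency gives a bound of the same order as the ~3MB/s quoted in the tell tale sign.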
Note: kernel parameters
Recipe nfs3_nra: http://www.dfwsug.org/cookbook.html#nfs3_nra
Recipe freebehind: Large file does not get cached
Tell tale sign: repeated sequential reading of a file; causes I/O each time
Fix: increase kernel parameter smallfile or set freebehind to 0
Memory Requirement (KMEM): 0
Drawback: largefile caching displaces all other cached files
Applicability: UFS, NFS server
Created: March 25 2002
Revised:
Prior to Solaris 8, it was very important to prevent the file cache
from consuming all of memory. UFS freebehind was implemented such
that when a file was read sequentially, pages would be freed from the
cache as soon as they had been copied to the user buffer. This allowed
working on large files without consuming a lot of physical memory.
Today the pages of the file cache are also considered as free memory,
which means that there is a less stringent requirement to free them
from the page cache once consumed.
Freebehind is activated for files bigger than smallfile (Solaris
8 default of 32K). The size of files that can be fully cached by a
system is a matter of policy to be set for each machine. On a
dedicated I/O server that is known to work on large files and that
benefits from caching them, one can set the smallfile parameter to a
much larger value, for example 1/8th of total memory, or disable
freebehind altogether.
Note: kernel parameters
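As a small worked example of the 1/8th-of-memory rule, the Python sketch below computes a smallfile value and formats candidate /etc/system lines. The helper names are hypothetical, and the ufs: module prefix is an assumption modelled on the ufs:ufs_HW entry in the kernel parameters note; check the Solaris Tunable Parameters Reference Manual before using either line.

```python
GB = 1024 ** 3


def smallfile_for(total_memory_bytes, divisor=8):
    """smallfile sized as a fraction (default 1/8th) of total memory."""
    return total_memory_bytes // divisor


def etc_system_lines(total_memory_bytes=None):
    """Candidate /etc/system lines (the 'ufs:' prefix is an assumption).

    With a memory size, raise smallfile; without one, disable freebehind.
    """
    if total_memory_bytes is None:
        return ["set ufs:freebehind = 0"]
    return ["set ufs:smallfile = %d" % smallfile_for(total_memory_bytes)]


# Hypothetical server with 4GB of memory.
print("\n".join(etc_system_lines(4 * GB)))
print("\n".join(etc_system_lines()))
```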
Recipe freebehind: http://www.dfwsug.org/cookbook.html#freebehind
Recipe one_at_a_time: small I/O size even for large read/write calls
Tell tale sign: physical I/O size smaller than maxphys/maxcontig
Fix: files should be created one at a time, preferably not over NFS
Memory Requirement (KMEM): 0
Drawback: not always possible; takes more time
Applicability: UFS, NFS server
Created: March 25 2002
Revised:
At file creation time or when a file is extended, sequences of disk
blocks (extents) become associated with file offsets. The size of the
extents also governs the potential size of I/O performed on the given
file. When all is well the extents match the maxcontig parameter.
However, when multiple files are extended simultaneously, the
filesystem may end up creating extents that are smaller than the
ideal values, and the I/O size then suffers with no possible remedy.
To avoid this issue, if possible, one should create files one after
the other using cluster sized write calls. Also, because the
NFS client is multi-threaded (see Recipe nfs3_max_threads),
file creation is best done on the server.
Note that, in general, control over extent sizes is not guaranteed
and will depend on the history of the filesystem.
Recipe one_at_a_time: http://www.dfwsug.org/cookbook.html#one_at_a_time
Recipe false_readahead: small (<32K) random reads on client cause large physical reads
Tell tale sign: app does small reads but physical I/O sizes are of size maxphys/maxcontig
Fix: decrease maxcontig or segregate workload to separate filesystem
Drawback: performance of large I/O will be reduced
Applicability: UFS, NFS Client
Created: Nov 13 2001
Revised: Mar 28 2002
Working with read or write sizes slightly greater than the pagesize
(8K) can cause serious performance problems. Here an application is
issuing, for example, 9K reads on a very large file, before issuing an
lseek to a totally different portion of the file.
Because the small read request touches 2 different pages, it is
considered sequential access and causes a full length cluster to be
read ahead. Note that this can happen for just about any sized read
call.
If, for example, maxphys was tuned to 1MB (see Recipe stripe1) then a
small client read request can cause an unexpectedly large server I/O.
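The page-counting argument can be checked with a few lines of Python. The sketch assumes 8K pages as in the text; the function names and the sample offsets are illustrative. It shows that a 9K read always spans two pages, that a much smaller unaligned read can do the same, and how much extra data a 1MB cluster read represents.

```python
PAGE = 8 * 1024


def pages_touched(offset, length, page=PAGE):
    """Number of distinct pages covered by a read of `length` at `offset`."""
    first = offset // page
    last = (offset + length - 1) // page
    return last - first + 1


def read_amplification(request_bytes, cluster_bytes):
    """How many times more data is read than the application asked for."""
    return cluster_bytes / request_bytes


print(pages_touched(0, 9 * 1024))      # 9K read at a page boundary
print(pages_touched(0, 8 * 1024))      # aligned 8K read
print(pages_touched(8000, 512))        # small read straddling a boundary
print(round(read_amplification(9 * 1024, 1024 * 1024), 1))
```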
This may not be a problem on a server with a lot of memory and if the
data will eventually be used by clients of this server. But if the
server has a very large filesystem, if files are larger on average
than the clustersize and if they are read sparingly by clients, this
can be a concern.
To work around this issue one can reduce the maxcontig parameter for
the filesystem involved. However, this may affect other applications
that use large sequential I/O from that client. The
only other option is to segregate this type of workload to a separate
filesystem.
Recipe false_readahead: http://www.dfwsug.org/cookbook.html#false_readahead
Recipe clnt_max_conns: NFS Client reads or writes tens of MB.
Tell tale sign: write performance equal to single stream tcp test
Fix: Increase kernel parameter clnt_max_conns to MIN(4,NCPU)
Memory Requirement (KMEM): clnt_max_conns*tcp_max_buf
Drawback: unused connection timeouts can lead to sluggish behavior
Applicability: NFS client
Created: July 19 2001
Revised:
Driving TCP at Gbit speed may require more than 1 CPU to achieve
maximum throughput. All requests from an NFS client share a pool of
connections and the Solaris 8 default connection pool size is 1. This
is sufficient for 100 Mbit ethernet but may require tuning for
faster interfaces. A single connection can lead to serialising the
traffic onto one CPU, limiting throughput.
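The Fix and Memory Requirement lines of this recipe reduce to two small formulas, sketched here in Python. The function names are illustrative, and the tcp_max_buf value of 1MB is only a sample input; read the real value from your own system.

```python
def suggested_clnt_max_conns(ncpu):
    """MIN(4, NCPU), as given in the Fix line of this recipe."""
    return min(4, ncpu)


def kmem_bytes(clnt_max_conns, tcp_max_buf):
    """Kernel memory estimate: clnt_max_conns * tcp_max_buf."""
    return clnt_max_conns * tcp_max_buf


# Hypothetical clients with 2 and 8 CPUs and a sample tcp_max_buf of 1MB.
for ncpu in (2, 8):
    conns = suggested_clnt_max_conns(ncpu)
    print(ncpu, "CPUs ->", conns, "connections,",
          kmem_bytes(conns, 1048576), "bytes of KMEM")
```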
The drawback is still under investigation. This tuning may lead to
sluggish response and is not recommended, because of a lack of
experience with it.
Note: kernel parameters
Recipe clnt_max_conns: http://www.dfwsug.org/cookbook.html#clnt_max_conns
Recipe nocto: App does many open,write,close over NFS
Tell tale sign: constant open/close activity; significant wait time
Fix: mount -F nfs -o nocto
Memory Requirement: N/A
Drawback: file changes not seen immediately on other clients
Applicability: NFS client
Created: July 19 2001
Revised:
This tuning will be effective for applications that go through
open,write,close sequences, or workloads that write then close very
many files.
By default a Sun NFS client will wait for its writes to complete to
disk when closing a file. This ensures that the rest of the world
will see changes to files that have been closed by an application.
NFS need not guarantee such consistency but it nevertheless will by
default. To take advantage of the weaker consistency model provided
by this feature you may use the undocumented flag:
mount -F nfs -o nocto
The close(2) call will then complete without waiting for the data to
be synchronized to disk. The adverse side effect is that other
clients will not see the changes until some unpredictable later time.
This adverse effect is of course irrelevant if the files are mostly
read and written from a single client. Beware that in other
situations users will definitely notice the absence of close-to-open
consistency (editing and compiling from different machines is
one common case).
The benefit from this tuning is bounded by the time spent in close.
From truss -c output one can get the number of times close(2) was
called, and a rule of thumb would be to associate one disk latency
(0.01 second) with each call.
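The rule of thumb translates directly into a bound on the time that nocto could save. In this Python sketch the close count and elapsed time are hypothetical numbers of the kind one would read from truss -c output; the function names are illustrative.

```python
def close_time_bound(close_calls, disk_latency_s=0.01):
    """Upper bound on seconds spent waiting in close(2)."""
    return close_calls * disk_latency_s


def share_of_runtime(close_calls, elapsed_s, disk_latency_s=0.01):
    """Fraction of the elapsed time that nocto could remove at best."""
    return close_time_bound(close_calls, disk_latency_s) / elapsed_s


# Hypothetical run: 20000 close(2) calls in a job lasting 1000 seconds.
print("seconds in close:", close_time_bound(20000))
print("share of runtime:", share_of_runtime(20000, 1000.0))
```

If the resulting share is small, the loss of close-to-open consistency is unlikely to be worth the gain.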
Recipe nocto: http://www.dfwsug.org/cookbook.html#nocto
Recipe RTFM
Tell tale sign: I'm lost
Fix: READ
Memory Requirement: you're allowed to take notes
Drawback: it takes time
Created: July 19 2001
Revised:
NFS Performance and Tuning Guide for Sun Hardware: There is a need for an updated version.
Solaris Tunable Parameters Reference Manual
Richard McDougall's Web Site
Jim Mauro, Richard McDougall Solaris Internals: Core Kernel Architecture
ISBN 0-13-022496-0 (C) Prentice Hall, 2000
Brent Callaghan, NFS Illustrated
ISBN 0-201-32570-5 (C) Addison-Wesley
Solaris - Tuning Your TCP/IP Stack
Recipe RTFM: http://www.dfwsug.org/cookbook.html#RTFM
NOTE: kernel parameters
Your kernel parameters can be changed in the /etc/system file
For example:
set maxphys = 1048576
set bufhwm = 50000
set ncsize = 1000000
set autoup = 30
set tune_t_fsflushr = 5
set ufs:ufs_HW = 8388608
set ufs:ufs_LW = 5592405
set nfs:nfs3_max_threads = 16
set nfs:nfs3_nra = 8
set rpcmod:clnt_max_conns = 4
You can find out your system's parameters with mdb (or adb) as root:
$ mdb -k
ncsize/1D
ncsize: 33952
ufs_ninode/1D
ufs_ninode: 33952
ufs_throttles/1D
ufs_throttles: 200
nfs3_max_threads/1d <<<< note lowercase 'd' denotes a 16-bit integer
nfs3_max_threads: 8
For bufhwm
/usr/sbin/sysdef | grep bufhwm
You may set some parameters with mdb -k -w.
Kernel parameters are often used to initialize structures, some of them at boot time.
This means that it is not always possible to change them dynamically on a running kernel.
The following take immediate effect: ufs_HW.
NOTE: Page Cache
The UFS page cache uses the system memory as a cache for all
filesystems. This means that written data is acknowledged very
quickly (basically at memcpy speed) to your applications. Read data
that hits in the cache is also delivered at memcpy speed. Multiple
writes to a file will result in fewer, bigger physical I/Os. Write
cancellation happens when there are multiple writes (from one or more
processes) to a given file. The last write cancels previous ones and
this leads to fewer disk writes.
The existence of the page cache also means that if there is a
catastrophic failure of the kernel (crash or power outage) in-flight
data will not appear in the files after the reboot. However, upon
closing a file (or on process exit) the outstanding data is synched
to disk and you are not exposed to this effect. Only data written to
open files is at risk. When using the page cache, the way to
guarantee that data is sent to stable storage is to use one of the
fsync(3C) and fdatasync(3RT) calls.
Thanks to the page cache the size of I/O sent to devices (physical or
logical) can be bigger than the application's I/O size. This can be a
very significant benefit of using the page cache.
NOTE: Monitoring Paging
Solaris 8 introduced the -p flag to vmstat, which breaks down paging
statistics between executable, anonymous and filesystem data. A
system that shows constant sustained paging in the anonymous or
executable columns is very probably showing signs that there is not
enough physical memory to sustain the given workload. Paging
statistics for filesystem data are harder to interpret because they
do not include all pages. One cannot, for example, match the data seen
by the disks (see NOTE: Monitoring Disk Activity) with the paging
statistics shown by vmstat.
If paging statistics are dominated by the filesystem columns this
can be a sign that the system is low on memory because the disk
subsystem is not draining the I/O requests fast enough. Improving
the I/O through tuning or additional hardware may be sufficient to
heal a system.
As with most Solaris *stat commands, the first set of output
represents averages since boot time.
NOTE: Monitoring Applications
An application's I/O size and access pattern can be checked by
using truss(1). For example:
$ truss -t read,write,pread,pwrite,aio,lseek -p <pid>
To see if I/O is important to the performance of a given process
one can use "truss -c" to get basic statistics on time spent in I/O:
$ truss -c -t read,write,lseek,close,open -p <pid>
The wait time will often be associated with I/O.
Understanding the pattern and size is critical to many
tuning criteria.
NOTE: Monitoring Disk Activity
To measure your average physical I/O size, use the command iostat -xtc
1. Discard the first set of output from the iostat command (it
contains averages since boot time). The kr/s column gives the amount
of data read per second (in K) while the r/s column is the number of
read operations per second on the device. By dividing these numbers
you get the average device I/O size (likewise for writes using columns
kw/s and w/s).
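The division described above is shown here as a short Python sketch. The sample r/s, kr/s, w/s and kw/s figures are made up; in practice they would be copied from one iostat interval for the device of interest.

```python
def avg_io_size_kb(kb_per_s, ops_per_s):
    """Average physical I/O size in K; 0.0 when the device is idle."""
    if ops_per_s == 0:
        return 0.0
    return kb_per_s / ops_per_s


# Hypothetical values read from one iostat interval.
reads = {"r/s": 200.0, "kr/s": 6400.0}
writes = {"w/s": 50.0, "kw/s": 400.0}

print("avg read size (K):", avg_io_size_kb(reads["kr/s"], reads["r/s"]))
print("avg write size (K):", avg_io_size_kb(writes["kw/s"], writes["w/s"]))
```

An average close to 32K on an NFS server, as in the read sample here, would suggest that the NFS blocksize rather than the stripe unit is setting the physical I/O size.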
The tunings described in this guide are aimed at increasing the
physical I/O size such that each disk gets an I/O size of around a
stripe unit and each metadevice gets cluster size (maxphys) I/O.
The actual numbers will depend on the workload mix, the I/O
pattern, the kernel tunables (most notably those described in this
document) as well as the Volume Management Software.
NOTE: Network tuning
These settings may help to achieve high throughput in synthetic tests.
You should measure their effectiveness on your workload before
implementing them in the general case. I apply them (blindly) on
both sides of connections.
In /etc/system:
* max recommended of 100; default of 2
set sq_max_size = 16
From a root shell:
ndd -set /dev/tcp tcp_deferred_acks_max 16
hiwat=512000
lowat=`expr $hiwat \* 2 / 3`
ndd -set /dev/tcp tcp_xmit_hiwat $hiwat
ndd -set /dev/tcp tcp_xmit_lowat $lowat
ndd -set /dev/tcp tcp_recv_hiwat $hiwat
ndd -set /dev/tcp tcp_maxpsz_multiplier 64
KMEM: hiwat bytes per connection