AIX NIM Server Tuning Part 1
My NIM server is not just a NIM server. Overnight, starting at 1am, it goes out to every client LPAR and collects the previous day's NMON log files. These files are then processed and loaded into an RRD (Round Robin Database), from which performance graphs and statistics are produced. As a result, my NIM server generates a lot of disk and network I/O while it processes all the NMON data files, on top of the mksysb backups it runs from client LPARs.
What I discovered with vmstat -IWwt 1 is that my LPAR was suffering from ‘Free Frame Waits’ and at times had zero free memory pages. At the same time, I could see that lrud was working hard to find free memory pages (the fr and sr columns), which in turn pushed the System CPU percentage (sy) above 60%. When lrud runs, its run priority is fixed at 16, so it runs ahead of almost everything else: when AIX is short of free memory pages, finding free pages is the priority. The vmstat data below also shows the LPAR running over its entitlement. Notice how the number of interrupts (in) and disk write I/Os (fo) starts to decrease as there is no free memory left to service the disk I/O pages.
Before Tuning - vmstat -IWwt 1
kthr memory page faults cpu time
--------------- --------------------- ------------------------------------ ------------------ ----------------------- --------
r b p w avm fre fi fo pi po fr sr in sy cs us sy id wa pc ec hr mi se
0 0 0 0 1062138 53142 0 7551 0 0 0 0 1467 35556 7301 27 59 14 0 2.19 546.5 02:06:47
10 0 0 0 1061920 46157 0 7039 0 0 0 0 1299 33949 6655 28 60 12 0 2.18 545.6 02:06:48
10 0 0 0 1061923 37408 0 7631 0 0 0 0 1581 35491 7759 29 58 12 0 2.22 554.2 02:06:49
12 0 0 0 1062044 27639 0 10832 0 0 0 0 1589 40811 7548 32 57 10 0 2.20 550.8 02:06:50
11 0 0 0 1061916 18920 0 8768 0 0 0 0 1664 38953 7563 31 57 13 0 2.27 568.1 02:06:51
10 0 0 0 1062358 8176 0 10116 0 0 0 0 1718 45134 8154 35 52 13 0 2.18 545.2 02:06:52
10 0 0 0 1062367 354 0 9374 0 0 1516 24162 1712 42742 7402 31 59 10 0 2.20 549.2 02:06:53
8 2 0 0 1062359 9 0 6245 0 0 5642 7827 1604 31486 9165 27 61 9 3 2.18 544.8 02:06:54
8 0 0 4 1062482 1 0 3983 0 0 4166 4166 1242 22106 11106 17 65 14 4 2.07 517.6 02:06:55
8 2 0 4 1062148 1 0 4048 0 0 3582 3582 1201 21817 10544 17 66 13 4 2.12 529.3 02:06:56
8 3 0 0 1062156 95 0 3489 0 0 3661 3665 1096 18140 11086 15 67 13 5 2.14 535.0 02:06:57
7 2 0 0 1062924 277 0 2878 0 0 3792 3792 1011 17573 14152 14 68 11 7 2.15 536.5 02:06:58
7 1 0 4 1062374 0 0 3616 0 0 2839 2839 999 17407 5951 14 69 15 2 2.12 529.0 02:06:59
11 0 0 0 1062374 3 0 4040 0 0 3994 3993 1258 22807 7490 15 67 15 3 2.18 544.6 02:07:00
8 2 0 0 1062382 669 0 2067 0 0 2746 2748 786 13160 9221 11 71 12 6 1.97 491.3 02:07:01
18 0 0 0 1063343 5 0 2359 0 0 2624 2624 757 13805 6144 14 70 12 3 1.85 463.5 02:07:02
14 1 0 1 1063715 0 0 1663 0 0 2015 2015 619 10625 5914 16 69 11 5 1.84 460.8 02:07:03
11 3 0 6 1064916 0 0 4224 0 0 5472 5471 1213 104773 16118 23 65 7 5 2.12 529.1 02:07:04
16 0 0 5 1065228 1 0 5503 0 0 5824 5824 1364 31827 12211 25 64 8 3 2.07 516.4 02:07:05
8 2 0 0 1065934 278 0 4360 0 0 5379 5387 1305 22972 13452 21 66 9 5 1.91 477.8 02:07:06
Free Frame Waits
The LPAR was suffering from Free Frame Waits: as shown in the vmstat -IWwt 1 output above, the fre column sits at or near 0 (zero) for several seconds. You will also notice that the number of file page-outs (fo) dropped from around 9,000 per second to 2,000 or 3,000 per second. Most people assume that slow disk I/O is caused by slow disks or adapters, but that is not always true. The cause of the I/O slowness here was that AIX had no free memory frames to hold the new I/O pages until lrud had scanned and freed enough pages. The historical output from ‘vmstat -s’ showed that in 199 days of uptime the free frame list had been exhausted 3,573,746 times.
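A quick way to check this counter on any LPAR is to filter the vmstat -s statistics. A minimal sketch; on our server the output looked like this:
# vmstat -s | grep 'free frame waits'
              3573746 free frame waits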
Tuning for Free Frame Waits is done with the ‘vmo’ command and the ‘minfree’ and ‘maxfree’ settings. For our environment, we found that the following settings were needed to minimise the Free Frame Waits.
vmo -po minfree=12288
vmo -po maxfree=14336
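These are permanent (-p), dynamic changes, so they take effect immediately and survive a reboot. IBM's tuning guidance is that the gap between maxfree and minfree should be at least as large as the maximum page read-ahead (j2_maxPageReadAhead for JFS2); here 14336 - 12288 = 2048 pages, which comfortably covers the 1024-page read-ahead configured later in this article. The new values (plus defaults and valid ranges) can be confirmed with:
# vmo -L minfree
# vmo -L maxfree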
IO Buffer Shortages
The LPAR also suffered from I/O buffer shortages for disk I/O and JFS2 file I/O, as shown in the vmstat -v sample below.
164918 pending disk I/Os blocked with no pbuf <== Disk I/O Blocked
0 paging space I/Os blocked with no psbuf <== Paging Space I/O Blocked
2288 filesystem I/Os blocked with no fsbuf <== JFS I/O Blocked
0 client filesystem I/Os blocked with no fsbuf <== NFS or VxFS I/O Blocked
5329075 external pager filesystem I/Os blocked with no fsbuf <== JFS2 I/O Blocked
Pending Disk IOs Blocked with no pbuf
These disk I/Os are blocked when an AIX Volume Group has run out of pbufs for the I/O directed to it. You can check each Volume Group's pbuf configuration with lvmo.
# lvmo -v nimadmvg -a
vgname = nimadmvg
pv_pbuf_count = 512
total_vg_pbufs = 512
max_vg_pbufs = 16384
pervg_blocked_io_count = 824
pv_min_pbuf = 512
max_vg_pbuf_count = 0
global_blocked_io_count = 847
aio_cache_pbuf_count = 0
Tuning is done on a per Volume Group basis. Do not add too many pbufs to a Volume Group, as they consume pinned memory pages. The best approach is to increase pv_pbuf_count slowly, in small amounts, and monitor until the pervg_blocked_io_count stops incrementing. So in this example, for nimadmvg, I am adding an extra 512 pbufs per Physical Volume in the volume group.
# lvmo -v nimadmvg -o pv_pbuf_count=1024
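To confirm the extra pbufs are enough, re-check the blocked I/O counter periodically and make sure it has stopped climbing. A minimal sketch, assuming the same nimadmvg volume group and a 5 minute interval:
# Watch the blocked I/O counter for nimadmvg every 5 minutes
while true
do
  date
  lvmo -v nimadmvg -a | grep pervg_blocked_io_count
  sleep 300
done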
Please note:
The maximum number of pbufs for a volume group depends on the number of disks in the Volume Group. The total_vg_pbufs figure can never be greater than max_vg_pbufs, and once total_vg_pbufs equals max_vg_pbufs you cannot add any more pbufs. In that situation, if you still see the pervg_blocked_io_count rising, you should consider moving some of the data within the volume group to another volume group, or adding more disks to the existing volume group and rebalancing your data.
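Since this ceiling is driven by the number of disks, it is worth checking how many physical volumes the volume group currently contains, for example:
# lsvg -p nimadmvg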
External pager filesystem I/Os blocked with no fsbuf.
These JFS2 disk I/Os are blocked when AIX does not have sufficient fsbufs in pinned memory to hold the I/O requests. As a rule of thumb, a 5-digit blocked count per 90 days of uptime is an acceptable tolerance. Your first tuning option should be j2_dynamicBufferPreallocation, as this is a dynamic change and takes immediate effect.
If the count is 6 digits, set:
ioo -o j2_dynamicBufferPreallocation=128
If the count is 7+ digits, set:
ioo -o j2_dynamicBufferPreallocation=256
From the AIX man page:
A value of 16 represents 256K. Filesystem does not need remounting. The bufstructs for Enhanced JFS are now dynamic; the number of buffers that start on the paging device is controlled by j2_nBufferPerPagerDevice, but buffers are allocated and destroyed dynamically past this initial value. If the number of "external pager filesystem I/Os blocked with no fsbuf" (from vmstat -v) increases, the j2_dynamicBufferPreallocation should be increased for that filesystem, as the I/O load on the filesystem may be exceeding the speed of preallocation. A value of 0 will disable dynamic buffer allocation completely.
I set our server to j2_dynamicBufferPreallocation=256.
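After changing j2_dynamicBufferPreallocation, keep watching the blocked I/O counter to see whether it is still climbing; the grep pattern below matches the vmstat -v line shown earlier:
# vmstat -v | grep 'external pager filesystem I/Os blocked'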
Your second tuning option should be j2_nBufferPerPagerDevice.
Please note:
j2_nBufferPerPagerDevice is now a restricted tunable, and IBM recommend that you open a Sev3 PMR first to confirm your actions before proceeding. This option requires the filesystems to be remounted to take effect.
If the first option wasn't enough and the count is 6 digits, set:
ioo -o j2_nBufferPerPagerDevice=5120
If the count is 7+ digits, set:
ioo -o j2_nBufferPerPagerDevice=10240
From the AIX man page:
File system must be remounted. This tunable only specifies the number of bufstructs that start on the paging device. Enhanced JFS will allocate more dynamically. Ideally, this value should not be tuned, and instead j2_dynamicBufferPreallocation should be tuned. However, it may be appropriate to change this value if, when using vmstat -v, the number of "external pager filesystem I/Os blocked with no fsbuf" increases quickly (and continues increasing) and j2_dynamicBufferPreallocation tuning has already been attempted. If the kernel must wait for a free bufstruct, it puts the process on a wait list before the start I/O is issued and will wake it up once a bufstruct has become available. May be appropriate to increase if striped logical volumes or disk arrays are being used.
We set our server to j2_nBufferPerPagerDevice=5120.
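Because j2_nBufferPerPagerDevice only takes effect at mount time, each JFS2 filesystem has to be unmounted and remounted (or the LPAR rebooted). A sketch for a single filesystem, using a hypothetical /export/nim mount point; the filesystem must not be in use while you do this:
# umount /export/nim
# mount /export/nim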
Sequential Read Ahead.
Our NIM server processes lots of large files. These files are read sequentially for processing, and the large mksysb backup images are also read sequentially by TSM when the daily backups run. We chose to increase j2_maxPageReadAhead from its default of 128 pages to 1024 pages. Each page is 4 KB in size, so the JFS2 filesystem will pre-fetch up to 4096 KB of data when files are read sequentially.
ioo -po j2_maxPageReadAhead=1024
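The new value, along with its default and valid range, can be checked with:
# ioo -L j2_maxPageReadAhead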
JFS2 Write Behind
By default, JFS2 writes out sequential I/O in clusters of 32 pages per file, which is 128 KB of data. As the NIM server stores many large mksysb files, we decided to increase this value to 256 pages, which is 1 MB. This is the number of sequentially written pages kept in RAM before they are flushed, meaning we send fewer I/Os to the storage array, but each I/O is larger.
ioo -po j2_nPagesPerWriteBehindCluster=256
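Finally, the non-restricted JFS2 ioo settings changed in this article can be reviewed in one go (j2_nBufferPerPagerDevice is restricted, so displaying it may require the -F flag):
# ioo -a | grep -E 'j2_dynamicBufferPreallocation|j2_maxPageReadAhead|j2_nPagesPerWriteBehindCluster'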