A Mistake (vm.nr_hugepages)

Here is the problem: the operating system was RHEL 5.3 64-bit, running Oracle and other applications. After boot, checking memory usage with top/free/ps showed several GB of memory missing, and it was not obvious where it had gone.

After a few days of fiddling, the cause turned out to be a kernel parameter set too large, so that at boot more memory than expected was locked by the system. The system uses HugePages for Oracle's SGA, and at some earlier point I had reduced the SGA without shrinking this parameter (vm.nr_hugepages) to match. The root cause of this rookie mistake was not fully understanding the parameter. Embarrassing.

One more question: the shared memory segment shown by ipcs -m is slightly larger than the configured SGA. Why? For example, with the SGA set to 8 GB, show sga confirms 8 GB allocated (8589934592 bytes), yet ipcs -m reports 8592031744 bytes, 2 MB more than 8 GB. Consequently, with vm.nr_hugepages=4096 (HugePages size = 2 MB, so in theory 4096 * 2 MB is exactly 8 GB), the system could not allocate the full 8 GB to the SGA; for now vm.nr_hugepages is set to 4196. None of the articles I have read mention this.
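One way to size the parameter is to take the segment size that ipcs -m actually reports, rather than the nominal SGA, and round up to whole huge pages. A minimal sketch using the figures above (2 MB huge pages; here 4097 pages exactly cover the segment, so the 4196 setting locks roughly 200 MB more than needed):

```shell
# Round the segment size reported by `ipcs -m` up to whole huge pages.
seg_bytes=8592031744                   # segment size from ipcs -m above
page_bytes=$((2048 * 1024))            # Hugepagesize from /proc/meminfo
pages=$(( (seg_bytes + page_bytes - 1) / page_bytes ))
echo "vm.nr_hugepages should be at least $pages"
```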

References on Linux HugePages

Enabling huge pages for the Oracle SGA on RHEL 5

Memory

MEMORY_TARGET (SGA_TARGET) or HugePages – which to choose?

Enabling direct I/O for Oracle 10gR2 on RHEL 5

Platform: Red Hat Enterprise Linux 5.3 64-bit, Oracle 10gR2 10.2.0.4 64-bit

Before enabling direct I/O in Oracle

SQL> show parameter filesystemio_options

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
filesystemio_options                 string      ASYNCH

After restarting the Oracle instance, cached memory is 160700 KB
# free
             total       used       free     shared    buffers     cached
Mem:      32887744    4583296   28304448          0       2992    160700
-/+ buffers/cache:    4419604   28468140
Swap:      4192924          0    4192924

Now run a full scan on a large table
SQL> select /*+ full(channels) parallel(channels,4) */ count(*) from channels;
  COUNT(*)
----------
 793103894
Elapsed: 00:13:09.42

Check memory usage again: cached memory is now 9669284 KB
# free
             total       used       free     shared    buffers     cached
Mem:      32887744   14155316   18732428          0      26484   9669284
-/+ buffers/cache:    4459548   28428196
Swap:      4192924          0    4192924
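The before/after cached figures quantify how much data the buffered full scan pulled through the page cache; a quick calculation with the numbers from the two free runs above:

```shell
# Page cache growth caused by the buffered full scan, in MB.
before_kb=160700                       # cached before the query
after_kb=9669284                       # cached after the query
echo "$(( (after_kb - before_kb) / 1024 )) MB added to the page cache"
```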

Enable direct I/O
SQL> alter system set filesystemio_options=SETALL scope=spfile;
System altered.
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.

Flush the cached memory (sync writes dirty pages back first; echoing 3 into drop_caches drops both the page cache and the dentry/inode caches)
# sync; echo 3 > /proc/sys/vm/drop_caches
# free
             total       used       free     shared    buffers     cached
Mem:      32887744    4374724   28513020          0        540     47532
-/+ buffers/cache:    4326652   28561092
Swap:      4192924          0    4192924

Restart the Oracle instance
SQL> startup
ORACLE instance started.

Total System Global Area 4294967296 bytes
Fixed Size                  2089432 bytes
Variable Size             301993512 bytes
Database Buffers         3976200192 bytes
Redo Buffers               14684160 bytes
Database mounted.
Database opened.
SQL> show parameter filesystemio_options

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
filesystemio_options                 string      SETALL

At this point cached memory is 94612 KB
# free
             total       used       free     shared    buffers     cached
Mem:      32887744    4503964   28383780          0       1872     94612
-/+ buffers/cache:    4407480   28480264
Swap:      4192924          0    4192924

Run the same query again
SQL> select /*+ full(channels) parallel(channels,4) */ count(*) from channels;
  COUNT(*)
----------
 793103894
Elapsed: 00:03:37.87

Considerably faster, and cached memory is 96872 KB, essentially unchanged
# free
             total       used       free     shared    buffers     cached
Mem:      32887744    4559484   28328260          0      43556     96872
-/+ buffers/cache:    4419056   28468688
Swap:      4192924          0    4192924

Enabling huge pages for the Oracle SGA on RHEL 5

On a 64-bit operating system, enabling huge pages memory mapping for the Oracle SGA makes more efficient use of system memory.
For background on huge pages and Oracle memory allocation, see the two articles below; both are very detailed and well worth reading.
Memory
Pythian Goodies: The Answer to Free Memory, Swap, Oracle, and Everything

Platform: Red Hat Enterprise Linux 5.3 64-bit, Oracle 10gR2 10.2.0.4 64-bit
Steps to enable huge pages:

Check the default (small) page size
# getconf PAGE_SIZE
4096

Check the huge page size
# grep Hugepagesize /proc/meminfo
Hugepagesize:     2048 kB

Comparing the two: one is 4 KB, the other is 2 MB

Suppose we want to allocate 4 GB to the SGA; that requires 4 GB / 2 MB = 2048 pages.
Add one line to /etc/sysctl.conf:
vm.nr_hugepages = 2052  # slightly more than 2048
# echo "vm.nr_hugepages = 2052" >> /etc/sysctl.conf
# sysctl -p
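It is worth checking that the kernel actually managed to allocate everything you asked for: on a long-running or fragmented system, HugePages_Total can come up short of the requested value, in which case a reboot usually fixes it. A small check, assuming the 2052-page request above:

```shell
# Compare the requested huge page count against what the kernel allocated.
requested=2052
allocated=$(awk '/^HugePages_Total/ {print $2}' /proc/meminfo)
if [ "${allocated:-0}" -lt "$requested" ]; then
    echo "only ${allocated:-0} of $requested huge pages allocated"
fi
```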

Huge pages are locked in memory while in use and are never swapped out.
Add the following to /etc/security/limits.conf (16777216 KB is for headroom; any value larger than the 4 GB SGA will do)
# cat >> /etc/security/limits.conf <<EOF
> oracle           soft    memlock        16777216
> oracle           hard    memlock        16777216
> EOF
#
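A quick sanity check that the memlock value covers the SGA (values in KB, as limits.conf expects; the 4 GB SGA is the example used in this post):

```shell
# limits.conf memlock is in KB; it must be at least as large as the SGA.
sga_kb=$((4 * 1024 * 1024))            # 4 GB SGA in KB
memlock_kb=16777216                    # value written to limits.conf above
[ "$memlock_kb" -ge "$sga_kb" ] && echo "memlock ($memlock_kb KB) covers the SGA"
```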

Restart the Oracle instance, then check huge page usage
# cat /proc/meminfo |grep Huge
HugePages_Total:  2052
HugePages_Free:   1702
HugePages_Rsvd:   1699
Hugepagesize:     2048 kB
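In this output only Total minus Free = 350 pages are actually mapped so far; the 1699 reserved pages are SGA pages Oracle has committed to but not yet touched, and they move from "free" to "in use" as the instance warms up. The arithmetic:

```shell
# Huge pages currently mapped vs. reserved-but-untouched,
# from the /proc/meminfo figures above.
total=2052; free_pages=1702; rsvd=1699
in_use=$((total - free_pages))
echo "in use: $in_use pages, reserved for later use: $rsvd pages"
```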

Enabling asynchronous I/O for Oracle 10gR2 on RHEL 5

A record of the steps; the content mainly comes from this article

Platform: Red Hat Enterprise Linux 5 64-bit, Oracle 10gR2 10.2.0.4 64-bit

1. First, as root, install the required RPM packages

# rpm -Uvh libaio-0.3.106-3.2.x86_64.rpm
# rpm -Uvh libaio-devel-0.3.106-3.2.x86_64.rpm

2. Enable asynchronous I/O at the system level
  Unlike the RHEL 3 procedure described in [Note 225751.1], there is no need to set aio-max-size, and that file does not exist under '/proc/sys/fs': the limit on I/O size was removed as of the 2.6 kernel [Note 549075.1]. Also, per [Note 471846.1], Oracle recommends setting aio-max-nr to 1048576 or higher.

# echo 1048576 > /proc/sys/fs/aio-max-nr

To make this kernel parameter permanent, add the following line to /etc/sysctl.conf:
fs.aio-max-nr = 1048576
Apply the change:
# /sbin/sysctl -p
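To see how much of that ceiling is actually in use, compare the kernel's current allocation counter against the limit (both files exist on 2.6 and later kernels):

```shell
# aio-nr counts AIO requests currently allocated system-wide;
# aio-max-nr is the ceiling configured above.
in_use=$(cat /proc/sys/fs/aio-nr)
limit=$(cat /proc/sys/fs/aio-max-nr)
echo "AIO requests: $in_use in use, ceiling $limit"
```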

3. Enable asynchronous I/O at the database level
  First change the database parameters. Unlike the RHEL 3 procedure in [Note 225751.1], Oracle 10gR2 enables asynchronous I/O support by default and the database software does not need to be recompiled; there is no 'skgaioi.o' file under '$ORACLE_HOME/rdbms/lib'. In some cases, however, Oracle cannot report I/O activity or events to the operating system [Note 365416.1], so the following steps are needed.

Switch to the oracle user from here on

SQL> alter system set disk_asynch_io=TRUE scope=spfile;

SQL> alter system set filesystemio_options=asynch scope=spfile;

SQL> shutdown immediate
$ cd $ORACLE_HOME/rdbms/lib
$ ln -s /usr/lib64/libaio.so.1 skgaio.o
$ make PL_ORALIBS=-laio -f ins_rdbms.mk async_on
SQL> startup

  AIO is already enabled by default in Oracle 10gR2. You can check whether the oracle binary was built with AIO support using ldd or nm; any output means it is enabled.

[oraprod@db01 ~]$ /usr/bin/ldd $ORACLE_HOME/bin/oracle | grep libaio
libaio.so.1 => /usr/lib64/libaio.so.1 (0x00002aaaac4a9000)
[oraprod@db01 ~]$ /usr/bin/nm $ORACLE_HOME/bin/oracle | grep io_getevent
w io_getevents@@LIBAIO_0.4

4. Check whether asynchronous I/O is in use
  According to [Note 370579.1], you can tell whether AIO is active from the slabinfo statistics (slab is the Linux kernel's memory allocator): the AIO-related structures are allocated there, and non-zero values in the second and third columns of the kiocb line mean AIO is in use. Unlike kernel 2.4.x, no kiobuf line is shown, because kiobuf was removed from the kernel as of 2.5.43.

$ cat /proc/slabinfo | grep kio
kioctx 64 110 384 10 1 : tunables 54 27 8 : slabdata 11 11 0
kiocb 13 315 256 15 1 : tunables 120 60 8 : slabdata 21 21 44
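The column of interest can be pulled out with awk; this sketch prints the active-object count of the kiocb cache (second column in the 2.6 slabinfo format shown above). Note that /proc/slabinfo is typically readable only by root:

```shell
# Active kiocb objects; a non-zero count means AIO requests in flight.
active=$(awk '$1 == "kiocb" {print $2}' /proc/slabinfo 2>/dev/null)
echo "active kiocb objects: ${active:-unavailable without root}"
```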

Linux I/O Performance Testing

I recently wanted to get a sense of our development environment's I/O performance, so I tested it with four tools: dd, orion, iozone, and bonnie++.
The development machine's configuration:
Intel SR1625 server, 2 CPUs, 32 GB RAM, 8 x 7200 rpm SATA disks in RAID 1+0 on the motherboard's built-in controller
OS: RHEL 5.3 64-bit
Since physical memory is 32 GB, every test used 60 GB+ of data to avoid cache effects.
1. Start with the built-in dd command, block size 8k
dd gives only a rough result, and it measures sequential rather than random I/O.
Read test
# time dd if=/dev/sda2 of=/dev/null bs=8k count=8388608
8388608+0 records in
8388608+0 records out
68719476736 bytes (69 GB) copied, 516.547 seconds, 133 MB/s
real    8m36.926s
user    0m0.117s
sys     0m55.216s
Write test
# time dd if=/dev/zero of=/opt/iotest bs=8k count=8388608
8388608+0 records in
8388608+0 records out
68719476736 bytes (69 GB) copied, 888.398 seconds, 77.4 MB/s
real    14m48.743s
user    0m3.678s
sys     2m47.158s
Read + write test
# time dd if=/dev/sda2 of=/opt/iotest bs=8k count=8388608
8388608+0 records in
8388608+0 records out
68719476736 bytes (69 GB) copied, 1869.89 seconds, 36.8 MB/s
real    31m10.343s
user    0m2.613s
sys     3m25.548s
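One caveat with dd write tests: dd reports throughput once data reaches the page cache, not the disk. The 64 GB data size here hides most of that effect, but for smaller runs GNU dd's conv=fdatasync makes it fsync before reporting, so the flush time is included in the figure. A small illustration (file name and the 8 MB size are just examples):

```shell
# Write 8 MB, then fsync before dd prints its throughput figure, so the
# result reflects data actually on disk rather than in the page cache.
dd if=/dev/zero of=./ddtest.tmp bs=8k count=1024 conv=fdatasync 2>&1
rm -f ./ddtest.tmp
```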
2. Next, Oracle's orion tool
Decompress it and it is ready to run
# gzip -d orion_linux_x86-64.gz
The libaio library is needed for the asynchronous I/O tests
# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64
# echo $LD_LIBRARY_PATH
:/opt/oracle/product/10.2.0/lib:/usr/lib64
Create a configuration file mytest.lun that simply lists the partitions to test. Note the file-name prefix must match the testname used below.
# vi mytest.lun
Check the contents of mytest.lun
# cat mytest.lun
/dev/sda2
Run a simple test first
# ./orion_linux_x86-64 -run simple -testname mytest -num_disks 8
Check the test results
# cat mytest_20081111_1431_summary.txt
ORION VERSION 11.1.0.7.0
Commandline:
-run simple -testname mytest -num_disks 8
This maps to this test:
Test: mytest
Small IO size: 8 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 0%
Cache Size: Not Entered
Duration for each Data Point: 60 seconds
Small Columns:,      0
Large Columns:,      0,      1,      2,      3,      4,      5,      6,      7,      8,      9,     10,     11,     12,     13,     14,     15,     16
Total Data Points: 38
Name: /dev/sda2 Size: 629143441920
1 FILEs found.
Maximum Large MBPS=56.97 @ Small=0 and Large=7
Maximum Small IOPS=442 @ Small=40 and Large=0
Minimum Small Latency=14.62 @ Small=1 and Large=0
The maximum MBPS is 56.97 and the maximum IOPS is 442
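As a plausibility check, 442 random IOPS spread over the 8 spindles is roughly 55 IOPS per disk, within the expected range for 7200 rpm SATA drives behind RAID 1+0:

```shell
# Rough per-spindle IOPS from the orion result above.
echo "$(( 442 / 8 )) IOPS per disk"
```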
Next, test 8k random reads
# ./orion_linux_x86-64 -run advanced -testname mytest -num_disks 8 -size_small 8 -size_large 8 -type rand &
Check the results
# cat mytest_20081111_1519_summary.txt
ORION VERSION 11.1.0.7.0
Commandline:
-run advanced -testname mytest -num_disks 8 -size_small 8 -size_large 8 -type rand
This maps to this test:
Test: mytest
Small IO size: 8 KB
Large IO size: 8 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 0%
Cache Size: Not Entered
Duration for each Data Point: 60 seconds
Small Columns:,      0
Large Columns:,      0,      1,      2,      3,      4,      5,      6,      7,      8,      9,     10,     11,     12,     13,     14,     15,     16
Total Data Points: 38
Name: /dev/sda2 Size: 629143441920
1 FILEs found.
Maximum Large MBPS=3.21 @ Small=0 and Large=13
Maximum Small IOPS=448 @ Small=38 and Large=0
Minimum Small Latency=15.16 @ Small=1 and Large=0
The maximum MBPS is 3.21 (why so low??) and the maximum IOPS is 448
Next, a 1 MB sequential read test. It failed, for reasons unknown...
# ./orion_linux_x86-64 -run advanced -testname mytest -num_disks 8 -size_small 1024 -size_large 1024 -type seq
ORION: ORacle IO Numbers — Version 11.1.0.7.0
mytest_20081114_1349
Test will take approximately 73 minutes
Larger caches may take longer
rwbase_run_test: rwbase_reap_req failed
rwbase_run_process: rwbase_run_test failed
rwbase_rwluns: rwbase_run_process failed
orion_warm_cache: Warming cache failed. Continuing
Check the results
# cat mytest_20081111_1620_summary.txt
ORION VERSION 11.1.0.7.0
Commandline:
-run advanced -testname mytest -num_disks 8 -size_small 1024 -size_large 1024 -type seq
This maps to this test:
Test: mytest
Small IO size: 1024 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Sequential Streams
Number of Concurrent IOs Per Stream: 4
Force streams to separate disks: No
Simulated Array Type: CONCAT
Write: 0%
Cache Size: Not Entered
Duration for each Data Point: 60 seconds
No results; the test failed
3. Now test with iozone
Install it
# tar -xvf iozone3_345.tar
# make linux-AMD64
Use a 64 GB file, testing only read/reread and write/rewrite with record sizes from 4k to 16k, and also generate an Excel file, iozone.wks
# ./iozone -Rab iozone.wks -s64G -i 0 -i 1 -y 4k -q 16k
        Iozone: Performance Test of File I/O
                Version $Revision: 3.345 $
                Compiled for 64 bit mode.
                Build: linux-AMD64
        Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
                     Al Slater, Scott Rhine, Mike Wisner, Ken Goss
                     Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                     Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                     Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
                     Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
                     Fabrice Bacchella, Zhenghua Xue, Qin Li.
        Run began: Tue Nov 11 10:23:25 2008
        Excel chart generation enabled
        Auto Mode
        File size set to 67108864 KB
        Using Minimum Record Size 4 KB
        Using Maximum Record Size 16 KB
        Command line used: ./iozone -Rab iozone.wks -s64G -i 0 -i 1 -y 4k -q 16k
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride                                 
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
        67108864       4   72882   69470   104898   125512
        67108864       8   72083   69256   133689   109061
        67108864      16   73375   69155   142019   116034
iozone test complete.
Excel output is below:
"Writer report"
        "4"  "8"  "16"
"67108864"   72882  72083  73375
"Re-writer report"
        "4"  "8"  "16"
"67108864"   69470  69256  69155
"Reader report"
        "4"  "8"  "16"
"67108864"   104898  133689  142019
"Re-Reader report"
        "4"  "8"  "16"
"67108864"   125512  109061  116034
So 8k writes run at about 72 MB/s and reads at about 133 MB/s, fairly close to the dd results
Now test 8k random read/write on a 64 GB file
# ./iozone -Rab iozone.wks -s64G -i 2 -y 8k -q 8k
        Iozone: Performance Test of File I/O
                Version $Revision: 3.345 $
                Compiled for 64 bit mode.
                Build: linux-AMD64
        Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
                     Al Slater, Scott Rhine, Mike Wisner, Ken Goss
                     Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                     Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                     Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
                     Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
                     Fabrice Bacchella, Zhenghua Xue, Qin Li.
        Run began: Fri Nov 14 15:52:01 2008
        Excel chart generation enabled
        Auto Mode
        File size set to 67108864 KB
        Using Minimum Record Size 8 KB
        Using Maximum Record Size 8 KB
        Command line used: ./iozone -Rab iozone.wks -s64G -i 2 -y 8k -q 8k
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride                                 
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
        67108864       8
Error reading block at 6501007360
read: Success
The test errored out (??)
4. Finally, bonnie++
Install it
# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64
# ./configure
# make
# make install
Start the test; the default file size is twice physical memory
# bonnie++ -d /opt/IOTest/ -m sva17 -u root
Using uid:0, gid:0.
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sva17           63G 52391  84 35222   7 34323   6 56362  88 131568  10 176.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
sva17,63G,52391,84,35222,7,34323,6,56362,88,131568,10,176.7,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
Sequential writes: 52391 KB/s per-character at 84% CPU; 35222 KB/s per-block at 7% CPU
Sequential reads: 56362 KB/s per-character at 88% CPU; 131568 KB/s per-block at 10% CPU
Random seeks: 176.7 per second at 0% CPU
The last two sections are all '+++++' (no result?)
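The final comma-separated line of bonnie++ output is machine-readable, so individual figures can be pulled out with awk. A sketch that extracts the sequential block-read rate (field 11; the CSV is copied from the run above):

```shell
# Parse the CSV summary line that bonnie++ prints at the end of a run.
csv='sva17,63G,52391,84,35222,7,34323,6,56362,88,131568,10,176.7,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++'
block_read=$(printf '%s\n' "$csv" | awk -F, '{print $11}')
echo "sequential block read: $block_read KB/s"
```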
Conclusion: each tool constructs a different test environment with a different emphasis, so results can differ considerably.
MBPS:
dd and iozone are fairly close: roughly 130+ reading and 70+ writing.
orion reads at about 57; writes were not tested (a write test would destroy every file on the partition!)
bonnie++: about 130 per-block read and 35 per-block write; about 56 per-character read and 52 per-character write
IOPS:
dd: no result
orion: about 440 (read-only)
iozone: errored out
bonnie++: 176.7 (read/write)