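This problem occurred on Linux (SLES 12 SP2). It is not a common one, but it suggests a new line of troubleshooting.
The symptom: once the number of database processes reached roughly 300, no further connections to the database could be made, and the following error was reported.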
ERROR:
 ORA-12518: TNS:listener could not hand off client connection
 
 
 15-AUG-2017 01:40:01 * (CONNECT_DATA=(CID=(PROGRAM=myapp)(HOST=__jdbc__)(USER=admin))(SERVER=DEDICATED)(SERVICE_NAME=oracle)) * (ADDRESS=(PROTOCOL=tcp)(HOST=11.22.33.44)(PORT=1521)) * establish * oracle * 12518
 TNS-12518: TNS:listener could not hand off client connection
  TNS-12536: TNS:operation would block
   TNS-12560: TNS:protocol adapter error
   TNS-00506: Operation would block
   Linux Error: 11: Resource temporarily unavailable
The problem could be reproduced every time, but the user could not find where the limit came from. ulimit -a showed no obvious restriction:
sa-server-0:grid:+ASM1 # ulimit -a
 core file size          (blocks, -c) 0
 data seg size           (kbytes, -d) unlimited
 scheduling priority             (-e) 0
 file size               (blocks, -f) unlimited
 pending signals                 (-i) 513378
 max locked memory       (kbytes, -l) 64
 max memory size         (kbytes, -m) unlimited
 open files                      (-n) 1000000
 pipe size            (512 bytes, -p) 8
 POSIX message queues     (bytes, -q) 819200
 real-time priority              (-r) 0
 stack size              (kbytes, -s) 8192
 cpu time               (seconds, -t) unlimited
 max user processes              (-u) 1000000
 virtual memory          (kbytes, -v) unlimited
 file locks                      (-x) unlimited
The limits of the process itself showed nothing abnormal either:
sa-server-0:~ # cat /proc/5497/limits
 Limit                     Soft Limit           Hard Limit           Units    
 Max cpu time              unlimited            unlimited            seconds  
 Max file size             unlimited            unlimited            bytes    
 Max data size             unlimited            unlimited            bytes    
 Max stack size            33554432             unlimited            bytes    
 Max core file size        unlimited            unlimited            bytes    
 Max resident set          unlimited            unlimited            bytes    
 Max processes             513378               513378               processes
 Max open files            65536                65536                files    
 Max locked memory         unlimited            unlimited            bytes    
 Max address space         unlimited            unlimited            bytes    
 Max file locks            unlimited            unlimited            locks    
 Max pending signals       513378               513378               signals  
 Max msgqueue size         819200               819200               bytes    
 Max nice priority         0                    0                    
 Max realtime priority     0                    0                    
 Max realtime timeout      unlimited            unlimited            us        
We asked the user to capture an strace of the listener. It confirmed that the clone() call fails, with EAGAIN (Resource temporarily unavailable) as the reason:
STRACE
 -------------------
 filename=listener.strace
 
 11404      0.000022 poll([{fd=8, events=POLLIN|POLLRDNORM}, {fd=11, events=POLLIN|POLLRDNORM}, {fd=13, events=POLLIN|POLLRDNORM}, {fd=14, events=POLLIN|POLLRDNORM}, {fd=15, events=POLLIN|POLLRDNORM}, {fd=16, events=POLLIN|POLLRDNORM}, {fd=17, events=POLLIN|POLLRDNORM}, {fd=3, events=POLLIN|POLLRDNORM}], 8, 60000) = 2 ([{fd=15, revents=POLLIN|POLLRDNORM}, {fd=3, revents=POLLIN|POLLRDNORM}]) <0.000012>
 11404      0.000043 read(3, "\0\367\0\0\1\0\0\0\0016\1,\fA \0\177\377O\230\0\0\0\1\0\275\0:\0\0\0\0"..., 8208) = 247 <0.000010>
 11404      0.000028 fcntl(3, F_GETFL)   = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000008>
 11404      0.000021 fcntl(3, F_SETFL, O_RDWR) = 0 <0.000008>
 11404      0.000023 times({tms_utime=5483, tms_stime=2588, tms_cutime=440, tms_cstime=60}) = 1720115043 <0.000009>
 11404      0.000096 fcntl(3, F_SETFD, 0) = 0 <0.000010>
 11404      0.000027 pipe([18, 19])      = 0 <0.000012>
 11404      0.000026 pipe([20, 21])      = 0 <0.000011>
 11404      0.000024 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4e29b769d0) = -1 EAGAIN (Resource temporarily unavailable) <0.000197>  <============
 11404      0.000219 close(18)           = 0 <0.000011>
 11404      0.000022 close(19)           = 0 <0.000010>
 11404      0.000023 close(20)           = 0 <0.000009>
 11404      0.000021 close(21)           = 0 <0.000009> 
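To pick the failed process creation out of a long trace, something like the following could be used. It is only a minimal sketch: the function name `find_failed_clone` is made up for illustration, and it assumes a trace in the same format as above (captured, for example, with `strace -f -r -T -o listener.strace -p <listener pid>`).

```shell
#!/bin/sh
# Hypothetical helper: list clone()/fork()/vfork() calls that failed with
# EAGAIN in an strace output file.
find_failed_clone() {
    trace_file="$1"
    # A failed call ends in "= -1 EAGAIN (...)" in strace output.
    grep -E '(clone|fork|vfork)\(.*= -1 EAGAIN' "$trace_file"
}
```

Calling `find_failed_clone listener.strace` would print only the lines where the listener could not create a new server process.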
The OS log gave the first real clue:
2017-08-16T02:36:55.560027+08:00 server-0 kernel: [ 165.619978] cgroup: fork rejected by pids controller in /system.slice/ohasd.service
The message 'fork rejected by pids controller' shows that a limit on the number of processes was in effect.
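The log line names the cgroup that hit its limit, so the pids controller counters for that unit can be read directly. The following is a rough sketch, assuming a cgroup v1 layout with the pids controller mounted under /sys/fs/cgroup/pids. The function name `show_pids_usage` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the current task count and the limit for one
# cgroup directory. The directory must contain the pids controller files
# pids.current and pids.max.
show_pids_usage() {
    cg_dir="$1"
    current=$(cat "$cg_dir/pids.current")
    max=$(cat "$cg_dir/pids.max")   # contains "max" when there is no limit
    echo "tasks: $current / $max"
}
```

Under that assumption it would be called as `show_pids_usage /sys/fs/cgroup/pids/system.slice/ohasd.service`. The same numbers should also be available from `systemctl show ohasd.service -p TasksCurrent -p TasksMax`. Note that the counter includes threads as well as processes, so the limit can be reached with far fewer database processes than its value suggests.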
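The root cause turned out to be the systemd resource control introduced on SUSE 12, which has this default parameter:
DefaultTasksMax was at its default value (512).
 systemd limits the maximum number of tasks that may be created in a unit.
  This value caps the number of tasks (processes and threads) in each unit's pids cgroup, independently of ulimit, which is why neither ulimit -a nor /proc/<pid>/limits showed it. Setting the parameter to unlimited resolved the problem: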
Edit /etc/systemd/system.conf
Set DefaultTasksMax to 'infinity', then reboot the host.
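The edit itself can be scripted. The following is a minimal sketch, assuming a stock system.conf with a single [Manager] section and a DefaultTasksMax line that may be commented out. The function name `set_default_tasks_max` is hypothetical, and it takes the file path as an argument so that it can be tried on a copy first. It relies on GNU sed for `-i`.

```shell
#!/bin/sh
# Hypothetical helper: set DefaultTasksMax=infinity in a systemd manager
# configuration file.
set_default_tasks_max() {
    conf="$1"
    if grep -Eq '^[#[:space:]]*DefaultTasksMax=' "$conf"; then
        # Replace the existing (possibly commented-out) setting in place.
        sed -i -E 's/^[#[:space:]]*DefaultTasksMax=.*/DefaultTasksMax=infinity/' "$conf"
    else
        # No such line yet: append it (assumes [Manager] is the last section).
        echo 'DefaultTasksMax=infinity' >> "$conf"
    fi
}
```

Run as root, `set_default_tasks_max /etc/systemd/system.conf` followed by a reboot would correspond to the fix described above.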