For a couple of days at work I have had to investigate network issues on an application; it seems we have a strange ARP configuration that may be the main cause. Along the way, I discovered another strange thing: ICMP packets sent in burst mode (like a flood) are lost at variable rates. Here are the notes from my investigation.
Let’s start at the beginning of the story: I had to reproduce a high-frequency request/reply pattern on a server in order to reproduce what is happening on an application having some trouble. For this I chose ICMP as it is easy to simulate. Unfortunately ping flood is only available to root, but since IBM apparently thinks a normal user should also be able to overload an interface, they created the -l (preload) option, which allows the same kind of violent method. Thank you IBM.
Here is my script to produce a heavy and aggressive ping:
#!/bin/ksh
num=100
server=localhost
totsend=0
totlos=0
while true ; do
   res=`ping -q -w 2 -s 1024 -l $(( $num - 1 )) -c $num $server | tail -2 | tr '\n' '#'`
   echo $res
   # e.g. 100 packets transmitted, 89 packets received, 11% packet loss#round-trip min/avg/max = 0/0/1 ms#
   received=`echo $res | cut -d ',' -f 2 | cut -d ' ' -f 2`
   totsend=$(( $totsend + $num ))
   totlos=$(( $totlos + $num - $received ))
   echo "overall lost : $(( 1000 * $totlos / $totsend )) o/oo == $totlos / $totsend"
   sleep 1
done
As localhost is a good example (it consistently loses 39% of the traffic), it is the interesting one to investigate.
With netstat we can investigate where the packets have been dropped. The first approach is to check ICMP:
#netstat -p icmp
icmp:
        1132 calls to icmp_error
        0 errors not generated because old message was icmp
        Output histogram:
                echo reply: 743
                destination unreachable: 1132
                echo: 369853
        0 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        Input histogram:
                echo reply: 369853
                destination unreachable: 1137
                echo: 743
        743 message responses generated
So here we have some errors, but in fact, when the tping.sh script is launched, these numbers do not change. That looks strange, but why not... let's take a look at the IP protocol:
#netstat -p IP
ip:
64337400 total packets received
[...]
579 packets dropped due to the full socket receive buffer
0 dead gateway detection packets sent
By running this command and the script in parallel, it is easy to see that this counter grows as we lose ICMP packets.
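To watch that counter live while the burst script runs, here is a minimal sketch, assuming the counter keeps the exact wording shown in the output above:

#!/bin/ksh
# Watch the "dropped due to the full socket receive buffer" counter every
# second while tping.sh runs in another terminal.
while true ; do
   drops=`netstat -p IP | grep "dropped due to the full socket receive buffer" | awk '{ print $1 }'`
   echo "`date '+%H:%M:%S'` drops=$drops"
   sleep 1
done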
It seems that this socket receive buffer is the reason for the loss. This makes sense in light of what I have measured on different servers: the faster the response time, the more loss we get:
- localhost : 39%
- different LPAR / same frame : 15%
- different LPAR / different frame : 5%
- different LPAR / different datacenter : 0.5%
- Linux box on a VM : no loss
So … we can assume that the faster the response, the quicker the input buffer fills up. What is interesting is that what gets through is about 61% of 100 ICMP packets of 1024 bytes, roughly 62KB… we can assume this buffer is limited to about 64KB. I suspect this can be an issue when our application tries to transfer large XML messages in bulk over the loopback interface between internal services.
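To put a rough number on that threshold, a variant of tping.sh can sweep the payload size with a fixed burst of 100 packets and show where the loss starts. This is only a sketch under the same assumptions as tping.sh (ping output format and the -l preload option), and the size list is arbitrary:

#!/bin/ksh
# Sweep the ICMP payload size with a fixed burst of 100 packets; the loss
# should appear roughly when 100 * size goes beyond the suspected ~64KB buffer.
server=localhost
num=100
for size in 256 512 640 768 1024 ; do
   res=`ping -q -w 2 -s $size -l $(( $num - 1 )) -c $num $server | grep "packet loss"`
   echo "size=$size bytes, burst=$(( $num * $size )) bytes -> $res"
done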
This can also be seen by running netstat -D, where the Idrops column grows for the IP protocol in my case.
Let’s confirm this hypothesis by looking at the configuration…
# no -a
The size of the receive queue is defined by the ipqmaxlen parameter.
# no -a | grep ipqmaxlen
              ipqmaxlen = 100
This means 100 packets can be stored in the queue. By the way, I'm only able to process 61 out of 100… so I assume this is due to my ICMP payload of 1024 bytes potentially being split into multiple packets. Let's change this to 110 packets of 512B…
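Concretely this just means setting num=110 in tping.sh and replacing -s 1024 with -s 512, which is equivalent to a single burst like this:

# Same options as tping.sh, with a 512-byte payload and a 110-packet preload
ping -q -w 2 -s 512 -l 109 -c 110 localhost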
ksh ./tping.sh
110 packets transmitted, 103 packets received, 6% packet loss#round-trip min/avg/max = 0/0/1 ms#
overall lost : 63 o/oo == 7 / 110
Considering that some of the packets are processed before all of them are sent, it makes sense to have 103 processed and 7 lost.
So the next step is to ask root to change this parameter to something higher!
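For reference, the change itself would look like the following (root only; 512 is just an example value, not a recommendation):

# Show the current value
no -o ipqmaxlen
# Raise it; on recent AIX levels ipqmaxlen is a reboot-type tunable, so the
# change may need the -r flag and a reboot to take effect:
no -r -o ipqmaxlen=512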
In fact, later, after more reading and no effective solution, I learned that this parameter affects the ipintrq counter. In my case the value was 0 (no problem)!
# netstat -s | grep ipin
        0 ipintrq overflows
Once this was done, I was expecting a better result! Unfortunately not 🙁
So, let's consider the other parameters: the IBM documentation helps.
But I found no parameter of real interest at this point.
Let’s take a look at the network stack to get a better understanding; the source of this picture is here.
So, according to the error counter (full socket receive buffer) and this diagram, I would say the socket receive buffer getting full is the main issue. The catch is that I'm running ICMP, not TCP/UDP… anyway, I will assume the mechanisms are similar. This would also confirm that the receive queue (at a lower level) is not the cause.
If the drop happened at a lower layer, it would be interesting to investigate the parameters on the interface; they can be accessed with the following command:
# lsattr -E -l enXXX
As I’m reproducing the problem on the loopback interface, these parameters impose no limitation here and should not be the cause.
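On a physical adapter the story would be different: besides lsattr, a first check could be the adapter statistics, looking for resource errors or overflow counters growing during the burst. This is only a sketch; ent0 below is an example device name:

# Detailed adapter statistics; look for "No Resource Errors", receive/transmit
# errors or overflow counters growing while the burst test runs.
entstat -d ent0 | grep -i -E "error|overflow|resource"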
Let’s take a look at the mbufs: these buffers are allocated to store network data. They are pre-allocated per CPU.
# netstat -m
******* CPU 0 *******
By size     inuse     calls failed  delayed   free    hiwat  freed
64            200    481148      0        6    376   524288      0
128          1082     35437      0       33    198   393216      0
256          4864    101700      0      300    208   786432      0
512          6757   7985069      0      839    163  1638400      0
1024           73    151994      0       31     55   393216      0
2048         1826    332816      0     1013    338   983040      0
4096           10       128      0        3      5    65536      0
8192           16      1627      0       12      4    16384      0
16384         385       495      0       54      1     8192      0
32768         137    471826      0      114    372     4096      0
65536         187       308      0      114     10     4096      0
131072          2        75      0        0   5120    10240      0
These buffers are allocated by size (first column). The next columns are the following:
- inuse = how many buffers of this size are currently in use
- calls = how many requests have been made for this size
- failed = how many requests could not be satisfied
- delayed = how many requests have been delayed (when delaying is allowed)
- free = how many buffers of this size are currently available for requests
- hiwat = below this limit, the free list is not returned (freed) to the system
- freed = number of buffers returned (freed) to the system
So only the failed column is a good indicator of an mbuf memory issue. On my system there are no failures. According to some documentation, the network stack mainly uses the 256-byte and 4096-byte mbufs. The delayed mbufs could also be a point to take into consideration when performance issues are detected.
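A quick way to double-check that there really is no failure is to sum the failed column over all CPUs; a minimal one-liner in the same spirit as the scripts further down, assuming the column layout shown above:

# Sum the "failed" column (4th field) of netstat -m across all CPUs;
# a non-zero total would point to real mbuf exhaustion.
netstat -m | grep "^[0-9]" | awk '{ total += $4 } END { print "total failed:", total }'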
Another way to check buffer allocation is to use netstat -M; optionally we can sort the result on failed events:
# netstat -M | sort -n -k 4 | tail -3
 2048       62     72035      477        0       89  F1000E007E368E00
 2048       62    198183      694        0       94  F1000E007E37FE00
 2048      512    464072    84068      512      639  F1000E00A40E3400
It seems there is something to look at here, especially as it keeps growing… but how to identify the source of this (no solution found)?
So let’s get back to the network parameters; here are the ones related to buffers (accessible with the no -o yyyyy command):
- sb_max : the maximum buffer size for any socket, in bytes (2MB on my system)
- tcp_recvspace : the size for TCP receive sockets, in bytes (256KB on my system)
- tcp_sendspace : the size for TCP send sockets, in bytes (256KB on my system)
- udp_recvspace : the size for UDP receive sockets, in bytes (256KB on my system)
- udp_sendspace : the size for UDP send sockets, in bytes (256KB on my system)
sb_max must be higher than tcp* + udp*. My problem is: which of these families does ICMP belong to?
According to my different tests, I estimate the drop occurs when the amount of data is around 64 to 74KB. This is lower than the limits defined here.
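To list the current values on a given box, a simple loop over no is enough (parameter names as listed above; no -o with no value just displays the tunable):

# Print the socket-buffer-related tunables
for p in sb_max tcp_recvspace tcp_sendspace udp_recvspace udp_sendspace ; do
   no -o $p
done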
The maximum amount of memory allowed for mbufs is defined by this parameter:
# lsattr -E -l sys0 -a maxmbuf
On my system, the value 0 indicates no specific limit.
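If a limit were set and needed to be raised, the attribute lives on sys0 and would be changed with chdev. This is only a sketch with an illustrative value, to be validated against the AIX documentation before touching anything:

# Display the current limit (in KB, 0 = no specific limit)
lsattr -E -l sys0 -a maxmbuf
# Raise it to e.g. 131072 KB (illustrative value only, root required)
chdev -l sys0 -a maxmbuf=131072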
To go further, the netpmon tool is really interesting for getting a per-process network usage analysis.
To use it :
# netpmon
Run trcstop command to signal end of trace.
Fri Apr 4 10:35:09 2014
System: AIX 7.1 Node: xxxaixxxxxx Machine: xxxxxxx
# ...
# trcstop
And we can get a report like this:
First Level Interrupt Handler CPU Usage Statistics:
---------------------------------------------------
                                                   Network
FLIH                              CPU Time   CPU %   CPU %
----------------------------------------------------------
data page fault                     4.2324   1.221   0.000
UNKNOWN                             0.3702   0.107   0.000
PPC decrementer                     0.0878   0.025   0.000
external device                     0.0083   0.002   0.000
queued interrupt                    0.0078   0.002   0.000
instruction page fault              0.0000   0.000   0.000
----------------------------------------------------------
Total (all FLIHs)                   4.7066   1.358   0.000

========================================================================

TCP Socket Call Statistics (by Process):
----------------------------------------
                                        ------ Read -----   ----- Write -----
Process (top 20)            PID         Calls/s   Bytes/s   Calls/s   Bytes/s
------------------------------------------------------------------------
df_engine              29294834           92.20   2470242     80.96    107869
java                   12255424           69.29    511481     69.38     46462
sshd:                  27328748            0.14      2269      0.21        13

========================================================================

Detailed TCP Socket Call Statistics (by Process):
-------------------------------------------------

PROCESS: df_engine   PID: 29294834
reads:                  3995
  read sizes (bytes):   avg 26791.2  min 8208   max 65536     sdev NaNQ
  read times (msec):    avg 2.422    min 0.003  max 3013.614  sdev 48.963
writes:                 3508
  write sizes (bytes):  avg 1332.3   min 13     max 14679     sdev 3378.6
  write times (msec):   avg 0.044    min 0.013  max 19.546    sdev 0.365

PROCESS: java   PID: 12255424
It provides good information on per-process network usage.
I also used the following script to check the mbuf delay activity:
while true ; do
   v=0
   netstat -m | grep "^[0123456789]" | sort -k 5 | tr -s " " | \
      cut -d " " -f 5 | grep -v "failures" | \
      while read ; do v=$(( $v + $REPLY )) ; done
   echo `date '+%H:%M:%S' ` $v
   sleep 1
done
This second script is useful to track memory failures on the 512- and 4096-byte blocks used by the network stack:
#!/bin/sh
while true ; do
   b512=`netstat -M | tr -s " " | grep "^ 512 " | cut -d " " -f 5 | \
      ( v=0 ; while read ; do v=$(( $v + $REPLY )) ; done ; echo $v )`
   b4096=`netstat -M | tr -s " " | grep "^ 4096 " | cut -d " " -f 5 | \
      ( v=0 ; while read ; do v=$(( $v + $REPLY )) ; done ; echo $v )`
   # %M (minutes), not %m (month), in the timestamp
   echo `date '+%H:%M:%S' ` $b512 $b4096
   sleep 5
done
This analysis is really interesting because there is a correlation between the observed network freezes and the growth of the delayed mbufs. My analysis is the following: the frame is stored in the IP stack and tries to go up to TCP; for this, an mbuf is required. If the TCP stack has no more mbufs available, I assume the IP layer puts the frame aside (in a kind of parking lot) and processes the others (when smaller, they can still find an mbuf). What is strange is how long a frame can stay parked, sometimes 30 seconds in my case, while in parallel many subsequent frames are processed immediately.
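To make this correlation easier to see, the burst test and the delayed-mbuf counter can be logged side by side with the same timestamps; a minimal sketch combining the scripts above (log file names are arbitrary):

#!/bin/ksh
# Log the delayed-mbuf total (5th netstat -m column) every second into one
# file while tping.sh writes into another; comparing the two timelines shows
# whether the loss peaks match the delayed-mbuf growth. Stop with Ctrl-C.
( while true ; do
     d=`netstat -m | grep "^[0-9]" | awk '{ t += $5 } END { print t }'`
     echo "`date '+%H:%M:%S'` delayed=$d"
     sleep 1
  done ) > /tmp/delayed.log &

ksh ./tping.sh > /tmp/tping.log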
By the way, it is possible to give more memory to the TCP queues by changing this parameter:
This article gave me lots of elements for the analysis :o) thank you.