
iSCSI performance

Posted: Tue Dec 09, 2014 9:46 pm
by cfreak
Hi,
I have a new 4-node 7450 running 3.2.1 MU1, connected to 32 blades running vSphere 5.5u2 over iSCSI through 4x 6120XG switches. Each vSphere host has 8 paths to each datastore, round robin with iops=1.
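
For reference, the round robin / iops=1 policy is applied per device along these lines (the naa ID below is just a placeholder for one of the 3PAR LUNs):

Code: Select all

# list devices and their current path selection policy
esxcli storage nmp device list

# set round robin with an IOPS limit of 1 on one 3PAR device
# (placeholder device ID; repeat per device or use a SATP claim rule instead)
esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=1 --device=naa.60002ac0000000000000000000000001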

When I migrate VMs from our P4900 to the new 3PAR (1 TB thin VV), the average I/O latency inside the Linux guests increases by quite some margin.

I have spent the last two weeks debugging and have run out of ideas (besides ordering FC hardware), so perhaps you guys can help me.

Benchmarks with fio also show the increased latency for single IO requests:

Code: Select all

root@3par:/tmp# fio --rw=randwrite --refill_buffers --name=test --size=100M --direct=1 --bs=4k --ioengine=libaio --iodepth=1
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/3136K /s] [0 /784  iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=5096
  write: io=102400KB, bw=3151.7KB/s, iops=787 , runt= 32497msec
    slat (usec): min=28 , max=2825 , avg=43.49, stdev=28.71
    clat (usec): min=967 , max=6845 , avg=1219.50, stdev=156.39
     lat (usec): min=1004 , max=6892 , avg=1263.63, stdev=160.23
    clat percentiles (usec):
     |  1.00th=[ 1012],  5.00th=[ 1048], 10.00th=[ 1064], 20.00th=[ 1112],
     | 30.00th=[ 1160], 40.00th=[ 1192], 50.00th=[ 1224], 60.00th=[ 1240],
     | 70.00th=[ 1272], 80.00th=[ 1288], 90.00th=[ 1336], 95.00th=[ 1384],
     | 99.00th=[ 1608], 99.50th=[ 1816], 99.90th=[ 3184], 99.95th=[ 3376],
     | 99.99th=[ 5792]
    bw (KB/s)  : min= 3009, max= 3272, per=100.00%, avg=3153.97, stdev=51.20
    lat (usec) : 1000=0.42%
    lat (msec) : 2=99.21%, 4=0.34%, 10=0.02%
  cpu          : usr=0.82%, sys=3.95%, ctx=25627, majf=0, minf=21
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=25600/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=3151KB/s, minb=3151KB/s, maxb=3151KB/s, mint=32497msec, maxt=32497msec

Disk stats (read/write):
    dm-0: ios=0/26098, merge=0/0, ticks=0/49716, in_queue=49716, util=92.15%, aggrios=0/25652, aggrmerge=0/504, aggrticks=0/33932, aggrin_queue=33888, aggrutil=91.79%
  sda: ios=0/25652, merge=0/504, ticks=0/33932, in_queue=33888, util=91.79%


Code: Select all

root@p4900:/tmp# fio --rw=randwrite --refill_buffers --name=test --size=100M --direct=1 --bs=4k --ioengine=libaio --iodepth=1
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 100MB)
Jobs: 1 (f=1): [w] [100.0% done] [0K/7892K /s] [0 /1973  iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3193
  write: io=102400KB, bw=7464.2KB/s, iops=1866 , runt= 13719msec
    slat (usec): min=21 , max=1721 , avg=29.71, stdev=14.23
    clat (usec): min=359 , max=22106 , avg=502.07, stdev=222.14
     lat (usec): min=388 , max=22138 , avg=532.20, stdev=222.72
    clat percentiles (usec):
     |  1.00th=[  386],  5.00th=[  402], 10.00th=[  410], 20.00th=[  426],
     | 30.00th=[  438], 40.00th=[  454], 50.00th=[  470], 60.00th=[  490],
     | 70.00th=[  516], 80.00th=[  548], 90.00th=[  596], 95.00th=[  660],
     | 99.00th=[ 1032], 99.50th=[ 1192], 99.90th=[ 2672], 99.95th=[ 4192],
     | 99.99th=[ 8032]
    bw (KB/s)  : min= 6784, max= 8008, per=100.00%, avg=7464.89, stdev=339.23
    lat (usec) : 500=64.13%, 750=32.82%, 1000=1.85%
    lat (msec) : 2=1.04%, 4=0.10%, 10=0.05%, 50=0.01%
  cpu          : usr=2.01%, sys=5.86%, ctx=25635, majf=0, minf=20
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=25600/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=7464KB/s, minb=7464KB/s, maxb=7464KB/s, mint=13719msec, maxt=13719msec

Disk stats (read/write):
    dm-0: ios=0/25647, merge=0/0, ticks=0/12712, in_queue=12712, util=87.10%, aggrios=0/25617, aggrmerge=0/166, aggrticks=0/12748, aggrin_queue=12736, aggrutil=86.71%
  sda: ios=0/25617, merge=0/166, ticks=0/12748, in_queue=12736, util=86.71%

Re: iSCSI performance

Posted: Tue Dec 09, 2014 11:18 pm
by afidel
Is that during the move, or after the VM is done moving?

Re: iSCSI performance

Posted: Tue Dec 09, 2014 11:22 pm
by cfreak
This benchmark is from one old and one new test VM.

Re: iSCSI performance

Posted: Wed Dec 10, 2014 1:51 pm
by JohnMH
Since you have the front-end host view, I would start by looking at the back-end storage view, e.g. see what the 3PAR VLUN is doing in the IMC under Reporting > Charts. You'll probably find it sits idle waiting for data, with the occasional spike.
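
If you prefer the CLI, something along these lines shows per-VLUN performance while the test runs (flags from memory, so check the CLI help; the host name is a placeholder):

Code: Select all

# per-VLUN performance, refreshed every 5 seconds, idle rows suppressed,
# reads and writes reported separately
statvlun -ni -rw -d 5 -host esxhost01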

3PAR uses interrupt coalescing on the controller host HBAs to reduce CPU load in a multi-tenant environment, so if you only have a single-threaded app, or you only test with a very low queue depth in a benchmark, you'll see higher latencies. With write coalescing, the HBA holds these I/Os until its buffer fills before sending an interrupt to the controller CPU to process them, so you incur a wait state.

If you really do have a single-threaded app then turn off "intcoal" on the HBA port and it will issue an interrupt for every I/O posted; you will probably also want to adjust the host HBA queue depth.
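
From memory the commands look something like this; the exact syntax can vary by InForm OS release, and 0:2:1 is just a placeholder for your iSCSI port positions:

Code: Select all

# show current port parameters, including the IntCoal setting
showport -par

# disable interrupt coalescing on one host-facing port (repeat per port)
controlport intcoal disable 0:2:1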

If you don't have a single-threaded app then you would be better off testing with a much higher queue depth to simulate multiple hosts or multi-threaded apps on the same HBA port. This fills the buffer quickly and gets the system moving, so you won't have to wait before the interrupt kicks in. Never test with a low queue depth, as you just aren't stressing the system.
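
For example, something along these lines on the same test VM would be a fairer comparison (the depth and job count are just a starting point, tune them to your host HBA queue depth):

Code: Select all

# same 4k random write test, but with a deep queue and several workers
fio --rw=randwrite --refill_buffers --name=test --size=100M --direct=1 --bs=4k --ioengine=libaio --iodepth=32 --numjobs=8 --group_reporting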

See this post viewtopic.php?f=18&t=883&p=4246&hilit=interrupt#p4246

Re: iSCSI performance

Posted: Wed Dec 10, 2014 6:25 pm
by afidel
Oh, and a 100MB test file isn't going to tell you anything; you need a working set several times the size of the storage cache to get any value out of a synthetic benchmark.
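
Something sized and timed more like this would be a better starting point (the --size and --runtime values are just placeholders, pick a size a few times larger than the total node cache):

Code: Select all

# bigger working set plus a fixed runtime so cache warm-up doesn't dominate
fio --rw=randwrite --refill_buffers --name=test --size=64G --runtime=300 --time_based --direct=1 --bs=4k --ioengine=libaio --iodepth=32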

Re: iSCSI performance

Posted: Wed Dec 10, 2014 8:07 pm
by Schmoog
JohnMH wrote:Since you have the front-end host view, I would start by looking at the back-end storage view, e.g. see what the 3PAR VLUN is doing in the IMC under Reporting > Charts. You'll probably find it sits idle waiting for data, with the occasional spike.

3PAR uses interrupt coalescing on the controller host HBAs to reduce CPU load in a multi-tenant environment, so if you only have a single-threaded app, or you only test with a very low queue depth in a benchmark, you'll see higher latencies. With write coalescing, the HBA holds these I/Os until its buffer fills before sending an interrupt to the controller CPU to process them, so you incur a wait state.

If you really do have a single-threaded app then turn off "intcoal" on the HBA port and it will issue an interrupt for every I/O posted; you will probably also want to adjust the host HBA queue depth.

If you don't have a single-threaded app then you would be better off testing with a much higher queue depth to simulate multiple hosts or multi-threaded apps on the same HBA port. This fills the buffer quickly and gets the system moving, so you won't have to wait before the interrupt kicks in. Never test with a low queue depth, as you just aren't stressing the system.

See this post http://www.3parug.com/viewtopic.php?f=1 ... rupt#p4246


+100 to what John and afidel said. I see this particular issue come up relatively often here. These systems are designed to function under load in a multi-threaded, multi-tenant environment. If your benchmarking isn't stressing the system, the performance numbers will be lackluster.

It's a little counterintuitive, because conventional wisdom would say lower load = higher performance. But with extremely small workloads that is not necessarily true. A workload that isn't big enough to hit the buffer queues, write cache, etc. won't produce very high performance numbers on a system that is otherwise idle.