which vv or host contributed to high service times?
F400, most of these are on VVs from an FC CPG. I believe our backup LUN for DB2 is on a single NL VV. DB backup runs from 19:30 to about 23:00.
Had an event early this morning. Going back through System Reporter (SR) around the time of the event: IOPS were under 1,000 in the samples from about 00:04 and 00:09 through 00:34, with very high service times, and then they fell off.
For my first report, I selected the DB host and all associated VVs. For the second report, I selected six application hosts ( no DB ) and their associated VVs. Both reports show high service times in the same time period.
VLUN Perf
The DB host/VVs showed a READ service time of 18 ms and a WRITE service time of 31 ms ( 21.8 ms total ) with about 501 IOPS and a queue length of 6.
The APP hosts/VVs showed a READ service time of 10.5 ms and a WRITE service time of 25.6 ms ( 25.5 ms total ) with about 60 IOPS and a queue length of 1.
I checked some SQL VVs just to see how they were behaving; they look similar: low IOPS ( 443 ), high service time ( 34.5 ms ).
Checking ports: link transfers ( can someone explain what this counter measures? ) at 00:09 were 92,980 transfers/s.
Ports
Port performance for all ports at 00:09 was 90,593 total IOPS.
Bandwidth for all ports at 00:09 was 1.25 million KBytes/s total.
Total service time across all ports at 00:09 was 5.6 ms.
Queue length was 216 at 00:09.
Is there a way to see whether a particular host or set of VVs was causing this? It could have been my DB host, but its IOPS are quite low during the same window. Perhaps a 3PAR admin job?
Re: which vv or host contributed to high service times?
If you have the external System Reporter, take a look at the Hi-Res VLUN Perf graphs.
Otherwise, try the IMC.
Re: which vv or host contributed to high service times?
I believe I ran hi-res performance and just couldn't pin it down to a specific LUN... I guess I could if I did them one at a time?
IMC, that is the real-time tool in the Management Console, correct?
I have thought about staying up late and just watching it. I know there are some statvv-type commands I could run... has anyone had success running these on a scheduled basis from a desktop or a host?
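For what it's worth, here is a minimal sketch of that kind of scheduled collection, assuming the array accepts CLI commands over SSH and the paramiko library is installed. The array address, account, target host name, and statvlun flags are placeholders from memory, so verify them against the CLI help on your system before relying on this.

#!/usr/bin/env python3
"""Poll 3PAR VLUN stats on a schedule and append them to a log file."""
import time
import paramiko

ARRAY = "3par-f400.example.com"  # placeholder: your array's address
USER = "monitor"                 # placeholder: a read-only CLI account
PASSWORD = "changeme"            # a key file would be better; this is a sketch
# One 5-second, non-idle, read/write-split sample of the DB host's VLUNs.
COMMAND = "statvlun -ni -rw -host dbhost01 -d 5 -iter 1"
INTERVAL = 300                   # seconds between samples

def sample():
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(ARRAY, username=USER, password=PASSWORD)
    try:
        _, stdout, _ = client.exec_command(COMMAND)
        return stdout.read().decode()
    finally:
        client.close()

if __name__ == "__main__":
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open("statvlun.log", "a") as log:
            log.write("=== " + stamp + " ===\n" + sample() + "\n")
        time.sleep(INTERVAL)

Run it overnight from any desktop or host that can reach the array, then grep the log for the samples around the event window.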
We have found Solarwinds beneficial in this particular case in pinpointing a possible culprit.
Re: which vv or host contributed to high service times?
Try using Hi-res to view port perf, for DISK ports, compared by N:S:P... see if any particular loops spike.
Do the same for front-end ports, compared by N:S:P... you're looking for hosts that are NOT using round robin correctly, which is pretty difficult to spot this way.
If you find a spike on the back end, you can track it down to a shelf, and probably to a PD, with a PD perf report limited to the disks on that spiked loop. You should be able to SSH to the InServ and pull detailed logs to see if there were LESB errors, a bad chunklet being swapped, etc.
If you find a spike on the front end, you can build a list of hosts zoned to that FE port and work your way down... VLUN perf, limited to one host at a time, compared by N:S:P; the lines should sit nearly on top of each other if round robin is set up properly.
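To make that "lines on top of each other" test concrete, here is a rough sketch of the same check done numerically. The hosts, ports, and per-path IOPS below are invented for illustration; in practice you would pull them from a System Reporter export or parsed statvlun output.

# Flag hosts whose I/O is skewed across array ports, which usually
# means multipath round robin is not working on that host.
from statistics import mean

iops_by_host = {
    # host: { "n:s:p": IOPS averaged over the problem window }
    "dbhost01":  {"2:1:1": 240, "2:1:2": 251, "3:1:1": 246, "3:1:2": 238},
    "apphost03": {"2:1:1": 590, "2:1:2": 12,  "3:1:1": 605, "3:1:2": 9},
}

for host, paths in iops_by_host.items():
    avg = mean(paths.values())
    # Largest relative deviation of any single path from the host's average.
    worst = max(abs(v - avg) / avg for v in paths.values())
    verdict = "OK" if worst < 0.25 else "SKEWED - check multipathing"
    print(f"{host}: {avg:.0f} IOPS/path avg, worst deviation {worst:.0%} -> {verdict}")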
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.
Re: which vv or host contributed to high service times?
Richard Siemers wrote: Try using Hi-res to view port perf, for DISK ports, compared by N:S:P... see if any particular loops spike.
Very interesting. Looking at the 00:21 sample from yesterday morning: yes, indeed, 3:2:3 and 2:2:3 are at 1,874 and 1,901 IOPS respectively, whereas the remaining six ports are under 1,000 IOPS each ( 970, 960, 947, 964, 947, 950 ).
Bandwidth on 3:2:3 and 2:2:3 is 92,000 and 95,000 KBytes/s; the remaining six are around 38,000 each.
Service time on 3:2:3 and 2:2:3 is 36 ms and 45 ms respectively; the others average 5 ms each.
Average busy: 3:2:3 and 2:2:3 are nearly 100% ( 96% and 97% ); the other six are in the low-80% range.
A glance at the colors and I see RED 3:2:3 and GREEN 2:2:3 consistently above the rest. Is this a rebalance issue?
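To put numbers on that skew, a quick back-of-the-envelope check using only the figures quoted in this post (not array output):

from statistics import mean

# The two hot back-end loops vs. the other six ports at the 00:21 sample.
hot = {"3:2:3": 1874, "2:2:3": 1901}           # IOPS
others = [970, 960, 947, 964, 947, 950]        # IOPS

print(f"hot loops avg:   {mean(hot.values()):.0f} IOPS")
print(f"other loops avg: {mean(others):.0f} IOPS")
print(f"IOPS ratio:      {mean(hot.values()) / mean(others):.1f}x")

# Bandwidth (KBytes/s) and service time show the same pattern:
bw_ratio = mean([92000, 95000]) / 38000
svc_ratio = mean([36, 45]) / 5
print(f"bandwidth ratio: {bw_ratio:.1f}x, service-time ratio: {svc_ratio:.1f}x")

So the two hot loops are carrying roughly twice the IOPS and two and a half times the bandwidth of the others, with service times around eight times worse, which is a pronounced imbalance rather than noise.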
Edit: adding to this,
Cage 4: loop A is 3:2:3, loop B is 2:2:3; it holds both FC and NL disk.
Cage 10: loop A is 3:2:3, loop B is 2:2:3; it holds both FC and NL disk.
Of the 12 cages shown, our NL disk is only in those two cages, 4 and 10. Related?
Question: the CPG where the database lives, and which seems to be most affected, is currently RAID 1. Would tuning it to a RAID 5 CPG help?
Richard Siemers wrote: Do the same for front-end ports, compared by N:S:P... you're looking for hosts that are NOT using round robin correctly, which is pretty difficult to spot this way.
Is that the host port type here, versus the disk port type in the previous step?
Port Types : host ; Port Rates : --All Port Rates-- ; Ports (n:s:p) : --All Ports-- ; Compare : n:s:p
Select Peak : total_iops
Those all look fairly balanced... I'm only seeing 2:1:1, 2:1:2, 3:1:1, and 3:1:2.
Richard Siemers wrote: If you find a spike on the back end, you can track it down to a shelf, and probably to a PD, with a PD perf report limited to the disks on that spiked loop. You should be able to SSH to the InServ and pull detailed logs to see if there were LESB errors, a bad chunklet being swapped, etc.
If you find a spike on the front end, you can build a list of hosts zoned to that FE port and work your way down... VLUN perf, limited to one host at a time, compared by N:S:P; the lines should sit nearly on top of each other if round robin is set up properly.
Re: which vv or host contributed to high service times?
Does showportlesb show cumulative information? Not exactly sure how to interpret it. I see large numbers for LossSync and InvWord on all controllers. I see occasional double digits on LinkFail for some PDs on some controllers, but mostly 0 or 1.
Loop <2:2:3> Time since last save: 144:01:37
ID ALPA LinkFail LossSync LossSig PrimSeq InvWord InvCRC
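As a rough aid to interpretation: those counters are cumulative since the "last save" shown in the header, so large absolute values on their own are not alarming; what matters is whether a counter keeps climbing between two snapshots taken a few minutes apart. Here is a sketch of that delta check; the two snapshot rows are invented to match the column layout quoted above.

SNAP_T0 = """\
0 0xEF 1 523411 0 2 881234 0
1 0xE8 0 498122 0 1 790441 0
"""
SNAP_T1 = """\
0 0xEF 1 523411 0 2 881234 0
1 0xE8 14 512209 3 1 804112 0
"""
COLS = ["LinkFail", "LossSync", "LossSig", "PrimSeq", "InvWord", "InvCRC"]

def parse(snapshot):
    # Returns {disk ID: [six counter values]}, skipping the ID and ALPA columns.
    rows = {}
    for line in snapshot.strip().splitlines():
        fields = line.split()
        rows[fields[0]] = [int(f) for f in fields[2:]]
    return rows

t0, t1 = parse(SNAP_T0), parse(SNAP_T1)
for disk, after in t1.items():
    deltas = [b - a for a, b in zip(t0[disk], after)]
    if any(deltas):
        grew = {name: d for name, d in zip(COLS, deltas) if d}
        print(f"disk ID {disk}: counters still climbing -> {grew}")

A disk whose LinkFail or InvCRC keeps incrementing between samples is a much stronger signal than a big static LossSync total.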
Re: which vv or host contributed to high service times?
mohuddle wrote: Edit: adding to this,
Cage 4: loop A is 3:2:3, loop B is 2:2:3; it holds both FC and NL disk.
Cage 10: loop A is 3:2:3, loop B is 2:2:3; it holds both FC and NL disk.
Of the 12 cages shown, our NL disk is only in those two cages, 4 and 10. Related?
Question: the CPG where the database lives, and which seems to be most affected, is currently RAID 1. Would tuning it to a RAID 5 CPG help?
Yes, I would speculate that you have a hardware imbalance, especially if cages 4 and 10 are in a single daisy chain, which it sounds like they are from the above ( both on 3:2:3 and 2:2:3 ), and they contain ALL of your system's NL drives as stated.
You said F400, 8 total back-end ports, and 12 cages... from all that info, I assume you have 4 nodes with 2 loops each... which means you should have 3 cages per loop, yet we only see 2 cages above on those loops.
Ideally, you should have equal counts of NL and FC spindles on each loop.
Basically, and oversimplified: when you run backups, which I assume write to your NL drives, all 8 loops of FC disks are being used for read IO, and 2 of those 8 have the extra duty of handling all of the writes to the NL drives. Does that sound like an accurate summary?
If all the above is true, then a tune will not help you. You will need to work with HP to rebalance the NL disks across all of your loops, which can be slow and difficult.
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.
Re: which vv or host contributed to high service times?
To further drill into your response-time issue, I think you can go one step further: PD perf, limited to 3:2:3 and 2:2:3, compared by DISK TYPE ( FC vs. NL ).
Pretty sure it will be the NL causing the alarm.
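A rough sketch of that comparison done by hand, in case it helps. The per-PD service times below are invented for illustration; real numbers would come from a System Reporter PD-perf export or statpd output for the disks on those two loops.

from statistics import mean

pd_samples = [
    # ( disk type, service time in ms ) for PDs on the 3:2:3 / 2:2:3 loops
    ("FC", 6.1), ("FC", 5.4), ("FC", 7.0), ("FC", 5.8),
    ("NL", 41.2), ("NL", 38.7), ("NL", 45.9), ("NL", 39.4),
]

by_type = {}
for devtype, svct in pd_samples:
    by_type.setdefault(devtype, []).append(svct)

for devtype, times in sorted(by_type.items()):
    print(f"{devtype}: avg {mean(times):.1f} ms over {len(times)} PDs")

If the NL average stands well above the FC average on the hot loops, that confirms the NL spindles are what is dragging the loop service times up.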
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.
Re: which vv or host contributed to high service times?
Edit: We only have TWO CONTROLLER NODES
Linking an image of the cage setup. Does this confirm what you suspected in your previous post?
Thanks a ton for your help, it is much appreciated.
Re: which vv or host contributed to high service times?
Considering a look at QoS on the F400 as a short-term patch.
Ultimately, the plan is to use a temporary Peer Motion license to move some of these hosts over to our new 7400, and/or to map new LUNs from the 7400 and migrate the data.