3PAR F400 performance issues.
Posted: Thu Jan 03, 2013 11:41 am
by Keith_DFG
Hi All,
We're experiencing some anomalous performance issues on our 2-node F400, which we purchased recently.
In short, our SQL nodes are experiencing high disk waits, and hence we're getting SQL blocking.
The guest OS is showing disk latency and high wait times.
The VMware layer is showing everything is dandy.
Looking at the disk system reports and the individual LUNs, the average service time looks good (13ms).
However, I've just run a report on the 3PAR disk port utilisation and am seeing high Avg % Busy on the disk ports, 80%+ continuously.
Could people have a look at the stats doc I've attached to this post and let me know if it looks abnormally high to you?
Thanks all. We just cannot see what's causing the problem, short of calling our supplier, who seems intent on clouding over the problem.
Re: 3PAR F400 performance issues.
Posted: Tue Jan 08, 2013 3:41 pm
by Richard Siemers
Hello Keith,
In my environment, the AVG % Busy metric produces some crazy scary numbers. I don't trust/use that one very much.
Also, the data you provided is "All Disk Ports" without a comparison chosen. I believe these metrics are an aggregate of all your disk port traffic... so when I see approx 400 KBytes/second, I can't tell if this is 1 port or the sum of 4 ports.
Here is what I do to look for bottlenecks. Start at the disk, and work your way up.
Start with a Hi-res PD Perf report, past 24 hours, peak set to total_iops, and compare set to PDID. I run this once for just the FC disks, and once again for just the NL disks. Look for FC disks close to 180 IOPS and NL disks close to 80 IOPS, as that is about as many IOPS as these drives can sustain.
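If you'd rather sanity-check this live from the CLI, statpd gives the same per-PD numbers. A rough sketch; the -p -devtype selector is how I'd filter by drive type, but verify the exact flags against your InForm release:
    statpd -p -devtype FC -iter 3    (FC drives only, 3 samples)
    statpd -p -devtype NL -iter 3    (NL drives only)
Watch the IO/s column for PDs pinned near those 180/80 ceilings.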
Move up to a Hi-res PORT Perf report, same time window, set the COMPARE to N:S:P and set the port type to disk. I am assuming these are 4Gb links between the nodes and the disk shelves; check to make sure none of the ports is getting close to 400 MBytes/second, which is roughly the ceiling of a 4Gb link. Consider how many disk shelves you have, and whether any are daisy-chained off each other... if you have available FC ports on the back of the nodes, you should be using those instead of daisy chaining.
Move up to Hi-res PORT Perf, compare N:S:P, and set the port type to HOST... similar to the above. You should have at least 4 host ports imho, and your traffic should be relatively balanced across all of them. If you see one maxed out and 3 or more idle, we need to re-evaluate your host cabling and multi-pathing configuration at the ESX host.
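For both port views above, the CLI equivalents would be roughly this (statport is standard, but double-check the -disk/-host selectors on your release):
    statport -disk -iter 3    (back-end disk ports)
    statport -host -iter 3    (front-end host ports)
The KBytes/s column per N:S:P is what you compare against that ~400 MByte/s 4Gb ceiling, and it makes the host port balance obvious at a glance.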
Lastly, look at Hi-res VLUN Perf, first compared by HOST. Analyse which host(s) are the heavy hitters, then change the compare to N:S:P and filter to just one of the heavy-hitting hosts. You should see multiple N:S:P lines sitting on top of each other, working equally, if your round-robin multi-pathing is set correctly.
VMware ESX has multipath policies per LUN... so you may need to filter down by host AND by each LUN at a time to validate that round robin is using all available paths. I think ESX defaults to Most Recently Used?
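On ESXi 5.x you can check and set the policy per device from the shell; something like the below (the naa ID is a placeholder, substitute your own, and the syntax differs on 4.x):
    esxcli storage nmp device list    (lists each device with its current Path Selection Policy)
    esxcli storage nmp device set -d naa.xxxxxxxxxxxxxxxx -P VMW_PSP_RR    (force round robin on one device)
The default PSP actually comes from whichever SATP rule claims the array, so it's worth checking per LUN rather than assuming.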
At this point, you should have an indication of whether the bottleneck is hardware or elsewhere up the VMware stack. If you want to give more details, perhaps we can help some more.
1) What types of disks are in the F400, and how many of each? SSD, FC, NL?
2) How many disk shelves?
3) How is your CPG set up? RAID type and set size?
4) How are your LUNs set up? Thin provisioned? Size? Quantity?
5) Datastores: are these VMware datastores or Raw Device Mappings? Thin provisioned? Eager or lazy zeroed?
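If the CLI is handy, most of 1 through 4 falls out of a few show commands (standard 3PAR CLI, though the exact columns vary a bit by release):
    showpd          (drive count, type and size per PD)
    showcage        (number of shelves and what's in them)
    showcpg -sdg    (RAID type, set size and growth settings per CPG)
    showvv          (LUN count, size, thin vs full provisioning)
Question 5 is on the VMware side.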
Hope this helps!
--Richard
Re: 3PAR F400 performance issues.
Posted: Tue Jan 08, 2013 4:55 pm
by wkdpanda
I think Keith was saying that the VMware was dandy, but the SQL cluster was having issues.
Given that, everything said still applies. Look for bottlenecks at the disk level, and work your way up.
-Andy
Re: 3PAR F400 performance issues.
Posted: Wed Jan 09, 2013 10:04 am
by Richard Siemers
wkdpanda wrote:I think Keith was saying that the VMware was dandy, but the SQL cluster was having issues.
Given that, everything said still applies. Look for bottlenecks at the disk level, and work your way up.
-Andy
I took "dandy" to also imply that ESX is not showing high latency in its physical disk performance metrics. If there is a physical-disk-related bottleneck, I would expect ESX performance reporting/alerting to paint red all over that and be far from dandy.
That said, you might want to circle back to vSphere and look at the "physical disk command latency" of the ESX server under the "Hosts and Clusters" advanced performance charts. Check that chart and look for excessive times over 10ms. A single spike once in a while is fine; what you are looking for are lines that remain over 10ms for several polling cycles straight. Each line should represent a separate LUN or datastore, so you can zero in on the hot one(s).
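Alternatively, esxtop from the ESXi shell shows the same thing live (standard esxtop keys, though the column layout varies by version):
    esxtop    (then press u for the per-device disk view)
Watch DAVG/cmd (device latency) versus GAVG/cmd (latency as the guest sees it). Sustained DAVG over ~10ms matches the threshold above; high GAVG with low DAVG would point at the hypervisor layer rather than the array.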
Re: 3PAR F400 performance issues.
Posted: Fri Jan 11, 2013 4:40 am
by Keith_DFG
Thanks for the replies, guys. I will collate the info requested, run the reports suggested, and report back shortly.
Re: 3PAR F400 performance issues.
Posted: Tue May 28, 2013 7:50 pm
by Keith_DFG
Sorry for the delay, been working on other projects.
We now have our supplier and HP involved in this investigation, as we're experiencing a lot of problems.
We were advised to upgrade to 3.1.2 and then enable AO on the CPGs, which our SAN techs did.
We're getting constant 85%+ port throughput, and it plateaus as well.
Attached are the graphs of physical device command latency, as requested.
These are just two of our SQL hosts; we have some 40+ G7 blades fully loaded, 10 of which are our production SQL boxes.
Re: 3PAR F400 performance issues.
Posted: Tue May 28, 2013 7:53 pm
by Keith_DFG
I will hopefully have a reply to the other questions shortly; I'm working nights this week moving our SQL SRS servers onto cheaper storage.