relevance of port_perf "AVG Busy %" in SR reports?

zQUEz
Posts: 33
Joined: Mon Aug 20, 2012 1:54 pm
Location: Atlanta, GA

relevance of port_perf "AVG Busy %" in SR reports?

Post by zQUEz »

Hi all,
Using System Reporter's port_perf reports (HiRes and Hourly) raises the following questions:

1) How much weight does everyone put on the "Avg busy %" for host ports? Way back, my HP CE told me not to put too much weight on it with regard to VLUN reports, but I am wondering about port_perf?

2) I can see some correlation between host ports at 100% avg busy and spikes in svctm on VLUNs. Based on that, I am thinking about redistributing our SAN zones across more ports, but I want to get others' opinions on whether they would make the same judgment.

3) My total KB/s on those same host ports only maxes out around 100,000 KB/s at the 100% avg busy mark. For a 4 Gb port, shouldn't I be seeing closer to 500,000 if those ports were truly maxed? I never see high port utilization on the switch side.

4) During those same 100% avg busy periods, I also see queue lengths ranging from the low teens to the mid 40s. These ports have 55 initiators (a mix of VMware, Windows, and Linux) behind them, and svctm for the ports is in the high teens of milliseconds.

My point in giving you these stats is that, other than the avg busy %, the numbers don't strike me as bad. But because of that correlation between high VLUN svctm spikes and high AVG Busy %, I am wondering if my analysis is mistaken.
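The numbers in questions 3 and 4 can be sanity-checked with a quick script. This is only a sketch: the 80% encoding-efficiency figure comes from 4GFC's 8b/10b line encoding, and Little's law is a general queueing identity; neither is specific to System Reporter, and the inputs are the rough figures quoted above.

```python
# Back-of-envelope check on the stats above. The encoding-efficiency
# figure is an assumption: 4GFC uses 8b/10b encoding, so roughly 80%
# of the nominal line rate carries payload.

def fc_max_kbps(gbit_nominal, encoding_efficiency=0.8):
    """Approximate usable one-way throughput of an FC link, in KB/s."""
    return gbit_nominal * 1e9 * encoding_efficiency / 8 / 1000

def littles_law_iops(avg_queue_len, svctm_ms):
    """Little's law: L = lambda * W, so lambda (IOPS) = L / W."""
    return avg_queue_len / (svctm_ms / 1000.0)

port_max = fc_max_kbps(4)      # ~400,000 KB/s usable on a 4 Gb port
observed = 100_000             # KB/s seen while "100% busy"
print(f"bandwidth utilisation: {observed / port_max:.0%}")   # -> 25%

# Rough implied IOPS at the quoted queue length and service time:
print(f"{littles_law_iops(avg_queue_len=40, svctm_ms=18):.0f} IOPS")  # -> 2222
```

By bandwidth alone the port is only about a quarter used, which is why "100% busy" looks suspicious next to the KB/s numbers.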

What do others here think?
Richard Siemers
Site Admin
Posts: 1333
Joined: Tue Aug 18, 2009 10:35 pm
Location: Dallas, Texas

Re: relevance of port_perf "AVG Busy %" in SR reports?

Post by Richard Siemers »

I do not like that metric as implemented. My best guess is that it is based on some formula/calculation of IOPS * bandwidth... and the definition of what 100% "looks like" has not been updated or tweaked properly.

I have compared this metric to the port performance metrics reported by Cisco Fabric Manager's %Util and they are not even close.

To answer your questions:

1) A little. I prefer a hybrid of the Cisco metrics and the storage metrics.

2) Assuming you mean you have extra ports on your 3PAR that you haven't connected to your SAN yet, then yes: using all of your 3PAR host ports and redistributing hosts so that you have fewer hosts per port is a great idea.

3) That's my observation as well. It's hard to find published figures, but HBAs (the ones we use in our servers AND the ones 3PAR/NetApp/IBM/etc. use in their storage) have IOPS limits and queue depth limits as well. It's possible some of these unpublished limits are factored into the metric. For example, if you run a benchmark with a 1k IO size, you will never hit the MB/s limits... the IOPS will be the bottleneck.

4) What do your backend disk IO ports look like? Generally speaking, I see a lot more traffic there than I do on the host ports. I suspect that bottlenecks anywhere in the system will show up as queue time and svctm on the front-end ports.
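The small-IO point in answer 3 can be sketched numerically. Both figures below are illustrative assumptions, not real device specs: the 150,000 IOPS HBA cap is purely hypothetical (real limits are unpublished, as noted above), and the 400,000 KB/s link figure assumes a 4 Gb port at 8b/10b encoding efficiency.

```python
# Which limit binds first for a given IO size: a hypothetical HBA IOPS
# cap, or the link bandwidth? (Both defaults are illustrative assumptions.)

def bottleneck(io_size_kb, hba_iops_cap=150_000, link_kbps=400_000):
    """Return the binding limit and the achievable IOPS at that limit."""
    bw_limited_iops = link_kbps / io_size_kb
    if bw_limited_iops < hba_iops_cap:
        return "bandwidth", bw_limited_iops
    return "iops", float(hba_iops_cap)

# 1 KB IOs: the IOPS cap binds long before the link fills up,
# moving only ~150,000 KB/s across a 400,000 KB/s link.
print(bottleneck(io_size_kb=1))    # -> ('iops', 150000.0)

# 64 KB IOs: bandwidth binds, and the port can actually fill.
print(bottleneck(io_size_kb=64))   # -> ('bandwidth', 6250.0)
```

The crossover IO size is simply link_kbps / hba_iops_cap; below it, adding bandwidth buys nothing.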
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.
zQUEz
Posts: 33
Joined: Mon Aug 20, 2012 1:54 pm
Location: Atlanta, GA

Re: relevance of port_perf "AVG Busy %" in SR reports?

Post by zQUEz »

OK, great, thanks for the perspective. It doesn't sound like I am missing any fundamental piece of information then. I will proceed to rezone, but I will be cautious about it opening the floodgates to performance the likes of which we haven't yet seen. Generally, unless I happen to be getting 100% cache hits, I doubt FC would ever be the bottleneck for an entire system, since disk is always far slower. But we will see.

Our disk IO ports do show much heavier usage, though with AO always running, I find it difficult to separate real work from AO consuming whatever IO is left over.

Late yesterday, I actually found a single day back in 2012 where our host ports spiked to close to 500,000 KB/s - it was for a very short period, about an hour. I can't imagine what we were doing back then, and it only happened once. Since then it has been steady around 100,000. So I guess it is possible with the right workload. I still need to track it down to a VV and host.