Schmoog wrote:
What is the issue you had with the DVS and SRM? I am using both myself and haven't run into any issues...
We routinely run into an issue where SRM tries to power on a VM in a recovery plan and we get a "device 0" error on the network adapter, which means the dvSwitch port has a conflict. We have to manually flip the VLAN and flip it back to get the VM to grab another dvSwitch port. We can also manually force the VM onto a known open dvSwitch port. (A rough sketch of scripting that flip is below.)
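For anyone who wants to script that workaround instead of clicking through the UI every time, here is a rough pyvmomi sketch. Everything in it is a placeholder for your environment (vCenter address, credentials, VM and portgroup names), and it assumes you have a spare portgroup on the same dvSwitch to flip through; treat it as a starting point, not a tested tool.

```python
# Hedged sketch: flip a VM's vNIC to a temporary distributed portgroup and
# back, forcing vSphere to allocate a fresh dvPort. All names are placeholders.
import ssl
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_obj(content, vimtype, name):
    """Return the first inventory object of the given type with this name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    try:
        return next(o for o in view.view if o.name == name)
    finally:
        view.Destroy()

def wait(task):
    """Block until a vCenter task finishes."""
    while task.info.state not in (vim.TaskInfo.State.success,
                                  vim.TaskInfo.State.error):
        time.sleep(1)

def move_nic_to_portgroup(vm, pg):
    """Repoint the VM's first vNIC at the given distributed portgroup,
    which makes vSphere hand out a new dvPort on reconnect."""
    nic = next(d for d in vm.config.hardware.device
               if isinstance(d, vim.vm.device.VirtualEthernetCard))
    nic.backing = vim.vm.device.VirtualEthernetCard.DistributedVirtualPortBackingInfo(
        port=vim.dvs.PortConnection(
            portgroupKey=pg.key,
            switchUuid=pg.config.distributedVirtualSwitch.uuid))
    spec = vim.vm.ConfigSpec(deviceChange=[
        vim.vm.device.VirtualDeviceConfigSpec(
            operation=vim.vm.device.VirtualDeviceConfigSpec.Operation.edit,
            device=nic)])
    wait(vm.ReconfigVM_Task(spec=spec))

# Lab-only shortcut: skip certificate verification.
si = SmartConnect(host="vcenter.example.com", user="user", pwd="pass",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    vm = find_obj(content, vim.VirtualMachine, "stuck-vm")
    temp_pg = find_obj(content, vim.dvs.DistributedVirtualPortgroup, "temp-pg")
    prod_pg = find_obj(content, vim.dvs.DistributedVirtualPortgroup, "prod-pg")
    move_nic_to_portgroup(vm, temp_pg)  # flip away...
    move_nic_to_portgroup(vm, prod_pg)  # ...and back to grab a new dvPort
finally:
    Disconnect(si)
```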
Schmoog wrote:
We haven't run into any issues with Broadcom, mainly because we don't use them. All our ESX hosts are connected via Emulex 10GbE for Ethernet and Emulex 8Gb FC for storage. Years ago we were using several HP blade systems which used QLogic, and we had a lot of issues with firmware management etc., so since then I steer clear of QLogic.
We use 1Gb Broadcom NICs (standard on our Dell blades) for our DR test bubble. I personally would never use Broadcom for production traffic; their drivers are some of the worst I have seen through the years. There is an issue in 5.5 U1 where Update Manager does not recognize the difference caused by an extra bundled Broadcom driver that shipped in 5.1 U1, so the upgrade fails until you manually remove the VIB (a rough sketch of that cleanup is below).
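In case it helps anyone hitting the same failed remediation, here is a hedged Python/paramiko sketch of that cleanup over SSH. This is my own way of wrapping the esxcli commands, not anything official; the host, credentials, and especially the VIB name are placeholders, since the exact conflicting VIB isn't named above and you should identify it from the list output first.

```python
# Rough sketch: find and remove the leftover driver VIB over SSH before
# retrying the 5.5 U1 upgrade. All names here are placeholders.
import paramiko

def run(client, cmd):
    """Run a command on the ESXi host and return stdout as text."""
    _, stdout, _ = client.exec_command(cmd)
    return stdout.read().decode()

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # lab-only shortcut
client.connect("esx01.example.com", username="root", password="pass")
try:
    # List installed VIBs to spot the stale Broadcom driver from 5.1 U1
    # (bnx2/bnx2x are the usual Broadcom NIC driver names).
    print(run(client, "esxcli software vib list | grep -i bnx"))

    vib_name = "REPLACE-WITH-VIB-NAME-FROM-LIST"  # placeholder, see output above
    print(run(client, "esxcli software vib remove -n " + vib_name))
    # Reboot the host in a maintenance window, then rerun the Update Manager
    # remediation.
finally:
    client.close()
```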
Schmoog wrote:
As far as the storage goes, my impression was that the ESX cluster was supposed to heartbeat through the datastores, such that if the storage is lost, ESX is able to detect the fault and fail over. That being said, any storage-related failover has in my experience been extremely slow, much slower than I really like. When storage is not removed gracefully, ESX takes FOREVER to figure out what is going on and react accordingly. Likewise, even FC multipathing (even A/A round robin like 3PAR uses) can be embarrassingly slow. One of these days I am going to have to power off one of my FC fabrics to move its power circuit, and I'm really not looking forward to having to do that.
I thought the same about heartbeats, but I believe the datastore heartbeats are only a secondary heartbeat behind the management interface. If all storage to a live host goes away (in our case a dual-port HBA failed), the host actually treats it as APD (All Paths Down). At a certain point it eventually will fail over, but as long as the management interface responds, the host is considered alive. We argued with support that this is a weak way to verify healthy status. Their response was to use VMware Tools heartbeats. That is risky: if Tools stops responding in an otherwise healthy VM, the VM gets rebooted on another host. (A sketch for checking what your cluster's HA config actually relies on is below.)
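If anyone wants to verify what their cluster is actually configured to fall back on, here is a minimal pyvmomi sketch that dumps the HA (das) heartbeat datastore and VM monitoring settings. The connection details and cluster name are placeholders for your environment.

```python
# Hedged sketch: print a cluster's HA heartbeat and VM-monitoring settings.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Lab-only shortcut: skip certificate verification.
si = SmartConnect(host="vcenter.example.com", user="user", pwd="pass",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "prod-cluster")
    view.Destroy()

    das = cluster.configurationEx.dasConfig
    print("HA enabled:       ", das.enabled)
    print("Host monitoring:  ", das.hostMonitoring)
    # vmMonitoring is the Tools-heartbeat restart feature support pointed us
    # at; values: vmMonitoringDisabled / vmMonitoringOnly / vmAndAppMonitoring.
    print("VM monitoring:    ", das.vmMonitoring)
    # The datastores HA uses for its secondary (datastore) heartbeat channel.
    print("Heartbeat policy: ", das.hBDatastoreCandidatePolicy)
    print("Heartbeat stores: ", [d.name for d in das.heartbeatDatastore])
finally:
    Disconnect(si)
```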
Schmoog wrote:
Regarding the clustering, my feelings on the matter are mainly that while VMware's cluster implementation may have its warts, it's still far and away better than managing a Hyper-V cluster built on MSCS. Years ago we were managing Microsoft clusters based on 2003, 2008, and 2008 R2, and they were nothing but a nightmare.
Fully agree here. We had 20 remote locations with two-node Hyper-V clusters that we just converted to VMware because Hyper-V clustering is so bad!
Let me say that we have over 700 VMs in SRM, in dozens of Protection Groups and dozens of Recovery Plans. The product works well but still has a lot of room for improvement. My concern is that VMware lacks a vision of how to truly integrate their products, particularly the ones that are directly dependent on each other (SRM/vCenter/vSphere).