Schmoog wrote:
What is the issue you had with the DVS and SRM? I am using both myself and haven't run into any issues...
We routinely run into an issue where SRM tries to power on a VM in a recovery plan and we get a "device 0" error on the network adapter, which means the dvSwitch port has a conflict. We have to manually flip the VLAN and flip it back to get the VM to grab another dvSwitch port. We can also manually force the VM onto a known open dvSwitch port. (A rough sketch of scripting that flip is below.)
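For anyone who wants to script that workaround instead of clicking through the UI every time, here is a rough pyvmomi sketch. Everything in it is a placeholder for your environment (vCenter address, credentials, VM and portgroup names), and it assumes you have a spare portgroup on the same dvSwitch to flip through; treat it as a starting point, not a tested tool.

```python
# Hedged sketch: flip a VM's vNIC to a temporary distributed portgroup and
# back, forcing vSphere to allocate a fresh dvPort. All names are placeholders.
import ssl
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_obj(content, vimtype, name):
    """Return the first inventory object of the given type with this name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    try:
        return next(o for o in view.view if o.name == name)
    finally:
        view.Destroy()

def wait(task):
    """Block until a vCenter task finishes."""
    while task.info.state not in (vim.TaskInfo.State.success,
                                  vim.TaskInfo.State.error):
        time.sleep(1)

def move_nic_to_portgroup(vm, pg):
    """Repoint the VM's first vNIC at the given distributed portgroup,
    which makes vSphere hand out a new dvPort on reconnect."""
    nic = next(d for d in vm.config.hardware.device
               if isinstance(d, vim.vm.device.VirtualEthernetCard))
    nic.backing = vim.vm.device.VirtualEthernetCard.DistributedVirtualPortBackingInfo(
        port=vim.dvs.PortConnection(
            portgroupKey=pg.key,
            switchUuid=pg.config.distributedVirtualSwitch.uuid))
    spec = vim.vm.ConfigSpec(deviceChange=[
        vim.vm.device.VirtualDeviceConfigSpec(
            operation=vim.vm.device.VirtualDeviceConfigSpec.Operation.edit,
            device=nic)])
    wait(vm.ReconfigVM_Task(spec=spec))

# Lab-only shortcut: skip certificate verification.
si = SmartConnect(host="vcenter.example.com", user="user", pwd="pass",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    vm = find_obj(content, vim.VirtualMachine, "stuck-vm")
    temp_pg = find_obj(content, vim.dvs.DistributedVirtualPortgroup, "temp-pg")
    prod_pg = find_obj(content, vim.dvs.DistributedVirtualPortgroup, "prod-pg")
    move_nic_to_portgroup(vm, temp_pg)  # flip away...
    move_nic_to_portgroup(vm, prod_pg)  # ...and back to grab a new dvPort
finally:
    Disconnect(si)
```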
Schmoog wrote:
We haven't run into any issues with Broadcom, mainly because we don't use them. All our ESX hosts are connected via Emulex 10GbE for Ethernet and Emulex 8Gb FC for storage. Years ago we were using several HP blade systems which used QLogic, and we had a lot of issues with firmware management etc., so since then I steer clear of QLogic.
We use 1Gb Broadcom NICs (standard on our Dell blades) for our DR test bubble. I personally would never use Broadcom for production traffic; their drivers are some of the worst I have seen through the years. There is an issue in 5.5 U1 where Update Manager does not recognize the difference caused by an extra bundled Broadcom driver that shipped in 5.1 U1, so the upgrade fails until you manually remove the VIB (a rough sketch of that cleanup is below).
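In case it helps anyone hitting the same failed remediation, here is a hedged Python/paramiko sketch of that cleanup over SSH. This is my own way of wrapping the esxcli commands, not anything official; the host, credentials, and especially the VIB name are placeholders, since the exact conflicting VIB isn't named above and you should identify it from the list output first.

```python
# Rough sketch: find and remove the leftover driver VIB over SSH before
# retrying the 5.5 U1 upgrade. All names here are placeholders.
import paramiko

def run(client, cmd):
    """Run a command on the ESXi host and return stdout as text."""
    _, stdout, _ = client.exec_command(cmd)
    return stdout.read().decode()

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # lab-only shortcut
client.connect("esx01.example.com", username="root", password="pass")
try:
    # List installed VIBs to spot the stale Broadcom driver from 5.1 U1
    # (bnx2/bnx2x are the usual Broadcom NIC driver names).
    print(run(client, "esxcli software vib list | grep -i bnx"))

    vib_name = "REPLACE-WITH-VIB-NAME-FROM-LIST"  # placeholder, see output above
    print(run(client, "esxcli software vib remove -n " + vib_name))
    # Reboot the host in a maintenance window, then rerun the Update Manager
    # remediation.
finally:
    client.close()
```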
Schmoog wrote:
As far as the storage goes, my impression was that the ESX cluster was supposed to heartbeat through the datastores, such that if the storage is lost, ESX is able to detect the fault and fail over. That being said, any storage-related failover has in my experience been extremely slow, much slower than I really like. When storage is not removed gracefully, ESX takes FOREVER to figure out what is going on and react accordingly. Likewise, even FC multipathing (even A/A round robin like 3PAR uses) can be embarrassingly slow. One of these days I am going to have to power off one of my FC fabrics to move its power circuit, and I'm really not looking forward to having to do that.
I thought the same about heartbeats, but I believe the datastore heartbeats are only a secondary heartbeat behind the management interface. If all storage to a live host goes away (in our case a dual-port HBA failed), the host actually treats it as APD (All Paths Down). At a certain point it eventually will fail over, but as long as the management interface responds, the host is considered alive. We argued with support that this is a weak way to verify healthy status. Their response was to use VMware Tools heartbeats. That is risky: if Tools stops responding in an otherwise healthy VM, the VM gets rebooted on another host. (A sketch for checking what your cluster's HA config actually relies on is below.)
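If anyone wants to verify what their cluster is actually configured to fall back on, here is a minimal pyvmomi sketch that dumps the HA (das) heartbeat datastore and VM monitoring settings. The connection details and cluster name are placeholders for your environment.

```python
# Hedged sketch: print a cluster's HA heartbeat and VM-monitoring settings.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Lab-only shortcut: skip certificate verification.
si = SmartConnect(host="vcenter.example.com", user="user", pwd="pass",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "prod-cluster")
    view.Destroy()

    das = cluster.configurationEx.dasConfig
    print("HA enabled:       ", das.enabled)
    print("Host monitoring:  ", das.hostMonitoring)
    # vmMonitoring is the Tools-heartbeat restart feature support pointed us
    # at; values: vmMonitoringDisabled / vmMonitoringOnly / vmAndAppMonitoring.
    print("VM monitoring:    ", das.vmMonitoring)
    # The datastores HA uses for its secondary (datastore) heartbeat channel.
    print("Heartbeat policy: ", das.hBDatastoreCandidatePolicy)
    print("Heartbeat stores: ", [d.name for d in das.heartbeatDatastore])
finally:
    Disconnect(si)
```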
Schmoog wrote:
Regarding the clustering, my feelings on the matter are mainly that while VMware's cluster implementation may have its warts, it's still far and away better than managing a Hyper-V cluster built on MSCS. Years ago we were managing Microsoft clusters based on 2003, 2008, and 2008 R2, and they were nothing but a nightmare.
Fully agree here. We had 20 remote locations with two-node Hyper-V clusters that we just converted to VMware because Hyper-V clustering is so bad!
Let me say that we have over 700 VMs in SRM, in dozens of Protection Groups and dozens of Recovery Plans. The product works well but still has a lot of room for improvement. My concern is that VMware lacks a vision of how to truly integrate their products, particularly the ones that are directly dependent on each other (SRM/vCenter/vSphere).