Peer Persistence pathing issue with VMware after Remote Copy switchover
Posted: Wed Dec 03, 2014 3:30 pm
I have VMware 5.1 running on datastores presented from two 3PAR 7400 arrays that are running synchronous Remote Copy between two sites over FC. I had to perform maintenance on a blade chassis at the primary site, so I ran the switchover command on all of my remote copy groups containing datastores.
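For context, this is roughly what I ran per group from the 3PAR CLI (the group name here is just a placeholder for my actual remote copy group names):

showrcopy groups
setrcopygroup switchover RCG_datastore01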
All of the datastores switched over correctly. However, when the maintenance was over and I performed another switchover, all of the datastores switched back except one. I confirmed with showrcopy that all of the remote copy groups had indeed returned to their original state, but for some reason the ALUA pathing on some of the hosts failed to update, and their "Active (I/O)" paths were still pointing at the secondary storage array.
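For what it's worth, this is how I've been checking the path state on each host (the naa ID is just a placeholder for the actual device ID of the datastore LUN):

esxcli storage nmp device list -d naa.<device_id>
esxcli storage core path list -d naa.<device_id>

The first command shows the path selection policy and working paths for the device, and the second shows the group state (active vs standby) of each individual path.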
To correct this, I ran another switchover on the remote copy group containing the datastore with the incorrect active paths. This time the switchover did not complete successfully: I got a 3PAR alert (pasted below) and the remote copy group went into a full resync. During this time the datastore was completely offline, even though some of the hosts had the correct active paths to the secondary array. The datastore was still visible within VMware, but when I tried to browse it with the vSphere client it showed no files at all.
At this point I called and opened a case, but so far HP's suggestion is to perform another switchover and to unmount and remount the datastore. As things stand, I'm left with a datastore that is only visible on two out of 12 hosts. The 10 hosts that cannot see the datastore have all of their Active (I/O) paths pointed at the wrong (secondary) site, and the paths they should be using are all in "Standby" mode. I have a production VM running a database on this datastore, so I cannot afford to lose all connectivity to it, and I cannot Storage vMotion it.
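If it helps anyone follow along, I'm assuming the unmount/remount they are suggesting would be something like this on each affected host (the datastore label is a placeholder):

esxcli storage filesystem list
esxcli storage filesystem unmount -l <datastore_label>
esxcli storage filesystem mount -l <datastore_label>

I'm hesitant to try that while the production VM is still running on the datastore.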
My long-term plan is to move the VMs off of this datastore and delete it, but I'm hoping to get a better understanding of what is actually happening; luckily this hit my least populated datastore. Is there a way to force the Active (I/O) paths back to the correct storage array?
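So far the only non-disruptive thing I can think of trying on the affected hosts is a rescan and a VMFS refresh, something like:

esxcli storage core adapter rescan --all
vmkfstools -V

but I have no idea whether that will actually pull the ALUA state back in line with what showrcopy is reporting.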
FYI, my host sets are using persona 11, the 3PAR OS is 3.1.3 MU1, and I'm using round-robin multipathing on the ESXi side. Here is the alert I got when attempting the switchover:
Severity: Degraded
Type: Component state change
Message: Remote Copy Volume 15795(vv_name_goes_here)
Degraded (Volume Unsynced - promote of snapshot failed {0x8} )
ID: 2474
Message Code: 0x03700de
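Also, in case the multipathing setup matters, this is roughly how I've been checking the claim rule and round-robin settings on the hosts (device ID is again a placeholder):

esxcli storage nmp satp rule list | grep -i 3PAR
esxcli storage nmp psp roundrobin deviceconfig get -d naa.<device_id>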