An existing periodic remote copy setup that's worked flawlessly for several years has suddenly stopped working after a lengthy network outage about two weeks ago. I have been on vacation and returned to a real mess. To make matters worse, my SAN administrator had already turned in his notice and is long gone now.
Both targets are marked as "FAILED" under Remote Copy Configuration/Targets.
All RCIP ports report READY and can ping each other with no issues, but the links are in "Down" status.
What other information can I provide to help here? My only option for the remote copy groups is "Failover remote copy groups."
RC in "Failed" status after network outage
Re: RC in "Failed" status after network outage
I don't think RCIP ports should respond to ping. If they do, then I think you might have an IP conflict causing you trouble.
edit: I suggest looking at the checkrclink command in CLI to check the status... Or maybe log a call with HPE.
The views and opinions expressed are my own and do not necessarily reflect those of my current or previous employers.
Re: RC in "Failed" status after network outage
Code:
Running Client Side
Running link test on: 0:3:1
Test length (secs): 10
Destination Addr: X.X.210.X
Local IP Addr: X.X.121.X
Local Device name: eth1
------------------------------------------------------------
Measuring link latency
------------------------------------------------------------
Average measured latency: 12.575 ms
Pings Lost: 0 %
------------------------------------------------------------
Starting max MTU test, from 0:3:1 -> X.X.210.X
------------------------------------------------------------
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
MTU: 1500
------------------------------------------------------------
Starting throughput test, from 0:3:1 -> X.X.210.X
------------------------------------------------------------
------------------------------------------------------------
Client connecting to X.X.210.X, TCP port 5001
TCP window size: 4096 KByte (WARNING: requested 2048 KByte)
------------------------------------------------------------
[ 12] local X.X.121.X port 45132 connected with X.X.210.X port 5001
[ 6] local X.X.121.X port 45126 connected with X.X.210.X port 5001
[ 9] local X.X.121.X port 45129 connected with X.X.210.X port 5001
[ 8] local X.X.121.X port 45128 connected with X.X.210.X port 5001
[ 4] local X.X.121.X port 45125 connected with X.X.210.X port 5001
[ 5] local X.X.121.X port 45124 connected with X.X.210.X port 5001
[ 3] local X.X.121.X port 45123 connected with X.X.210.X port 5001
[ 11] local X.X.121.X port 45131 connected with X.X.210.X port 5001
[ 7] local X.X.121.X port 45127 connected with X.X.210.X port 5001
[ 10] local X.X.121.X port 45130 connected with X.X.210.X port 5001
All the ports report virtually identical output from checkrclink.
It appears to "hang" at this point and never returns to the CLI. I'm not sure whether that's normal.
HPE support is sadly not an option; executive leadership decided that we have no need for HP support on hardware that will be replaced in a few months. I mean, why would we need support for our production SAN and DR environment? That's crazy talk.
Re: RC in "Failed" status after network outage
For the "message too long" you need to reduce MTU. Try 1450 and increase one by one until it stops working. Not sure if that is your only problem.
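If it helps, the "start at 1450 and step up by one" approach can be sketched roughly like this. It is only an illustration: `probe` is a hypothetical callback standing in for whatever per-size test you run by hand (it is not a 3PAR command), and `fake_probe` just simulates a path that drops anything above 1476 bytes.

```python
from typing import Callable, Optional


def find_max_mtu(probe: Callable[[int], bool],
                 start: int = 1450,
                 ceiling: int = 1500) -> Optional[int]:
    """Walk the MTU up one byte at a time and return the last size that worked.

    `probe` is a hypothetical callback: it should return True when a test
    at that MTU succeeds. Returns None if even `start` fails.
    """
    best = None
    for mtu in range(start, ceiling + 1):
        if not probe(mtu):
            break  # first failing size: stop and keep the previous one
        best = mtu
    return best


# Stand-in probe for illustration only: pretends the path cannot carry
# anything larger than 1476 bytes.
def fake_probe(mtu: int) -> bool:
    return mtu <= 1476


print(find_max_mtu(fake_probe))  # prints 1476
```

Whatever value comes out is the one to set on the RCIP ports on both ends.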
Re: RC in "Failed" status after network outage
I'm going to throw my hat in the ring here as well and say network issue. The only time I have seen it fail like that is when it couldn't reach the other side. We use periodic remote copy as well.
Can your network team help you to see if the traffic is reaching the other side? What caused the outage? Maybe a switch they are connected to lost a config? Someone forgot to do a write mem?
I'm a total newb on this (only been a 3PAR admin for about a year and a half), but here are a couple of commands that may give you more info. They may not help, but I wasn't sure if you knew them at all.
Show remote copy links
showrctransport -rcip
Show all targets and links for the remote copy group
showrcopy -d targets or links
Start and stop RCOPY from the command line
Get list of Rcopy groups
showrcopy
Stop Rcopy groups
stoprcopygroup <groupname>
Start Rcopy groups
startrcopygroup <groupname>
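If you want to grab the read-only ones from both arrays in one go, something like the sketch below might save some typing. It assumes the arrays accept CLI commands over plain `ssh user@host <command>`; the host and user names are made up, and only the two show commands already listed above are used. The stop/start commands are left out on purpose since they change state.

```python
import subprocess
from typing import Callable, Dict, List

# Read-only commands taken from the list above; nothing here changes state.
READ_ONLY_CHECKS: List[str] = [
    "showrctransport -rcip",
    "showrcopy",
]


def ssh_argv(host: str, user: str, cli_command: str) -> List[str]:
    """Build the argv for running one CLI command over ssh (assumed access method)."""
    return ["ssh", f"{user}@{host}", cli_command]


def run_checks(host: str, user: str,
               runner: Callable = subprocess.run) -> Dict[str, str]:
    """Run each read-only check and return its captured stdout keyed by command.

    `runner` defaults to subprocess.run; it is a parameter so the function
    can be exercised without a real array.
    """
    results: Dict[str, str] = {}
    for cmd in READ_ONLY_CHECKS:
        done = runner(ssh_argv(host, user, cmd),
                      capture_output=True, text=True, timeout=60)
        results[cmd] = done.stdout
    return results


# Hypothetical usage (host and user are placeholders):
#   for cmd, out in run_checks("prod-array.example", "monitor").items():
#       print(f"== {cmd} ==\n{out}")
```

Running it against both production and DR and diffing the output makes it easier to spot where the two sides disagree about link state.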
Report back if you figure it out.
Re: RC in "Failed" status after network outage
RCIP pings are succeeding.
I've run "stoprcopy" and "startrcopy" several times on both ends.
showrctransport -rcip reports "State" as "Missing" on all four ports (two local, two remote). Configuration looks good otherwise.
checkrclink freezes as reported in the previous listing, with the server running in production and the test run from DR.
When running startserver in DR, I get the following at the end, after the normal MTU check and whatnot:
Code:
------------------------------------------------------------
Starting throughput test, from 0:3:1 -> x.x.121.x
------------------------------------------------------------
Could not connect with server.
Please ensure server is running.
============================================================
TEST SUMMARY from 0:3:1 -> x.x.121.x
Test Started: Mon Jun 17 13:44:04 EDT 2019
Test Finished: Mon Jun 17 13:45:09 EDT 2019
============================================================
Latency: 12.058 ms
Lost pings: 0 %
Through-put: 0 Bits/second
Max MTU: 1500
Tx TCP Segs: 688
Rx TCP Segs: 647
TCP retrans: 8 %
Errored Segs: 0 %
Check remote server is running.
Link 0:3:1 is NOT SUITABLE for Remote Copy Use
============================================================
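For anyone comparing several of these summaries across ports, here is a rough sketch of pulling the fields out of the summary block and flagging the pattern seen above (pings get through, throughput is zero). The parsing rules are my own guess based only on the output in this post, not on any documented format, and the function names are made up.

```python
import re
from typing import Dict

SAMPLE = """\
TEST SUMMARY from 0:3:1 -> x.x.121.x
Test Started: Mon Jun 17 13:44:04 EDT 2019
Test Finished: Mon Jun 17 13:45:09 EDT 2019
Latency: 12.058 ms
Lost pings: 0 %
Through-put: 0 Bits/second
Max MTU: 1500
Tx TCP Segs: 688
Rx TCP Segs: 647
TCP retrans: 8 %
Errored Segs: 0 %
Check remote server is running.
Link 0:3:1 is NOT SUITABLE for Remote Copy Use
"""

# "Name: value" lines where the name is letters, spaces or hyphens only.
FIELD = re.compile(r"^\s*([A-Za-z][A-Za-z \-]*?):\s+(.+?)\s*$")


def parse_summary(text: str) -> Dict[str, str]:
    """Collect the 'Name: value' lines plus the suitability verdict, if present."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
        elif "NOT SUITABLE" in line:
            fields["verdict"] = "NOT SUITABLE"
    return fields


def reachable_but_no_throughput(fields: Dict[str, str]) -> bool:
    """True when at least some pings came back but the throughput test moved no data."""
    lost = float(fields.get("Lost pings", "100").split()[0])
    throughput = float(fields.get("Through-put", "0").split()[0])
    return lost < 100 and throughput == 0


summary = parse_summary(SAMPLE)
print(summary["Latency"])                    # prints 12.058 ms
print(reachable_but_no_throughput(summary))  # prints True
```

A True result here only says that ping traffic passes while the throughput test does not; it does not say why.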
Re: RC in "Failed" status after network outage
We're seeing an almost identical situation here. We were testing a fiber failover, and after we finished testing and put everything back how it was, we saw the exact same situation as you: links in "down" status even though all the ports are pingable and up.
Same thing -- DR-side shows links up, Prod shows links down.
Code:
PRODSAN1 cli% checkrclink startclient 0:9:1 x.x.110.131 60
Running Client Side
Running link test on: 0:9:1
Test length (secs): 60
Destination Addr: x.x.110.131
Local IP Addr: x.x.110.41
Local Device name: eth1
------------------------------------------------------------
Measuring link latency
------------------------------------------------------------
Average measured latency: 16.761 ms
Pings Lost: 3 %
------------------------------------------------------------
Starting max MTU test, from 0:9:1 -> x.x.110.131
------------------------------------------------------------
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
MTU: 1500
------------------------------------------------------------
Starting throughput test, from 0:9:1 -> x.x.110.131
------------------------------------------------------------
Could not connect with server.
Please ensure server is running.
============================================================
TEST SUMMARY from 0:9:1 -> x.x.110.131
Test Started: Wed Jul 3 19:02:31 EDT 2019
Test Finished: Wed Jul 3 19:02:37 EDT 2019
============================================================
Latency: 16.761 ms
Lost pings: 3 %
Through-put: 0 Bits/second
Max MTU: 1500
Tx TCP Segs: 806
Rx TCP Segs: 758
TCP retrans: 16 %
Errored Segs: 0 %
Check remote server is running.
Link 0:9:1 is NOT SUITABLE for Remote Copy Use
============================================================
Re: RC in "Failed" status after network outage
No resolution yet. I've got some professional assistance scheduled early next week from a vendor, since leadership is still unwilling to spring for actual HP support.
Re: RC in "Failed" status after network outage
Professional services assistance never happened, but this article at least allowed me to clean up the mess. It didn't fix the connectivity issue, but it let me un-replicate the volumes and clean up snapshots and whatnot.
https://community.hpe.com/t5/3PAR-Store ... crmidV7kuU
Specifically,
cli% setrcopytarget no_mirror_config <target array name>
is what let me clean up the leftover RC pieces.