3PAR Users Group

A Storage Administrator Community




 Post subject: RC in "Failed" status after network outage
Posted: Thu Jun 13, 2019 7:20 pm

Joined: Thu Jun 13, 2019 7:12 pm
Posts: 5
An existing periodic remote copy setup that's worked flawlessly for several years has suddenly stopped working after a lengthy network outage about two weeks ago. I have been on vacation and returned to a real mess. To make matters worse, my SAN administrator had already turned in his notice and is long gone now.

Both targets are marked as "FAILED" under Remote Copy Configuration/Targets.

All RCIP ports report READY, and can ping each other with no issues, but the LINKS are "Down" status.

What other information can I provide to help here? My only option for the remote copy groups is "Failover remote copy groups."


 Post subject: Re: RC in "Failed" status after network outage
Posted: Fri Jun 14, 2019 6:06 am

Joined: Mon Sep 21, 2015 2:11 pm
Posts: 934
Location: Europe
I don't think RCIP ports should respond to ping. If they do, then I think you might have an IP conflict causing you trouble.

edit: I suggest looking at the checkrclink command in CLI to check the status... Or maybe log a call with HPE.

_________________
The views and opinions expressed are my own and do not necessarily reflect those of my current or previous employers.


 Post subject: Re: RC in "Failed" status after network outage
Posted: Fri Jun 14, 2019 8:20 am

Joined: Thu Jun 13, 2019 7:12 pm
Posts: 5
Code:
Running Client Side
Running link test on:  0:3:1
Test length (secs):    10
Destination Addr:      X.X.210.X
Local IP Addr:           X.X.121.X
Local Device name:       eth1

------------------------------------------------------------
Measuring link latency
------------------------------------------------------------

Average measured latency: 12.575 ms
Pings Lost:               0 %

------------------------------------------------------------
Starting max MTU test, from 0:3:1 -> X.X.210.X
------------------------------------------------------------
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500

MTU: 1500

------------------------------------------------------------
Starting throughput test, from 0:3:1 -> X.X.210.X
------------------------------------------------------------
------------------------------------------------------------
Client connecting to X.X.210.X, TCP port 5001
TCP window size: 4096 KByte (WARNING: requested 2048 KByte)
------------------------------------------------------------
[ 12] local X.X.121.X port 45132 connected with X.X.210.X port 5001
[  6] local X.X.121.X port 45126 connected with X.X.210.X port 5001
[  9] local X.X.121.X port 45129 connected with X.X.210.X port 5001
[  8] local X.X.121.X port 45128 connected with X.X.210.X port 5001
[  4] local X.X.121.X port 45125 connected with X.X.210.X port 5001
[  5] local X.X.121.X port 45124 connected with X.X.210.X port 5001
[  3] local X.X.121.X port 45123 connected with X.X.210.X port 5001
[ 11] local X.X.121.X port 45131 connected with X.X.210.X port 5001
[  7] local X.X.121.X port 45127 connected with X.X.210.X port 5001
[ 10] local X.X.121.X port 45130 connected with X.X.210.X port 5001


All the ports report virtually identical output from checkrclink.

It appears to "hang" at this point and never returns to the CLI. I'm not sure whether that's normal.

HPE support is sadly not an option; executive leadership decided that we have no need for HP support on hardware that will be replaced in a few months. I mean, why would we need support for our production SAN and DR environment? That's crazy talk.


 Post subject: Re: RC in "Failed" status after network outage
Posted: Fri Jun 14, 2019 10:39 am

Joined: Mon Sep 21, 2015 2:11 pm
Posts: 934
Location: Europe
For the "message too long" you need to reduce MTU. Try 1450 and increase one by one until it stops working. Not sure if that is your only problem.
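
For anyone following along, the arithmetic behind probing MTU with ping can be sketched like this. It assumes plain IPv4 with no IP options (a 20-byte header) plus the 8-byte ICMP echo header, so the largest ping payload that fits in one unfragmented packet is the MTU minus 28. The function name is only for illustration, and this says nothing about how checkrclink probes internally.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an ICMP echo packet")
    return payload


# Print the payload size to try for each candidate MTU.
for mtu in (1450, 1500):
    print(mtu, max_ping_payload(mtu))
```

With a 1500-byte MTU that works out to a 1472-byte payload, and 1422 bytes for an MTU of 1450.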

 Post subject: Re: RC in "Failed" status after network outage
Posted: Fri Jun 14, 2019 3:57 pm

Joined: Thu Nov 30, 2017 11:20 am
Posts: 70
Location: WI
I'm going to throw my hat in the ring here as well and say network issue. The only time I have seen it fail like that is when it couldn't reach the other side. We use periodic remote copy as well.

Can your network team help you see if the traffic is reaching the other side? What caused the outage? Maybe a switch they are connected to lost its config? Did someone forget to do a write mem?
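
A successful ping only shows that the address answers ICMP, not that a TCP session can be set up across the path. Here is a minimal Python sketch for checking the TCP side. It assumes you have a test host on the same subnet/VLAN as the RCIP ports (it cannot run on the array itself) and that something is actually listening on the far end. The function name is made up for illustration, and the address and port in the usage note are placeholders; 5001 is simply the port that the checkrclink throughput test reports in the output earlier in this thread.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Usage would look like tcp_port_reachable("192.0.2.10", 5001), with your own far-side address in place of the placeholder. A False result while ping succeeds would point at something in the path (firewall, ACL, routing) treating TCP differently from ICMP.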

I'm a total newb on this (only been a 3PAR admin for about a year and a half), but here are a couple of commands that may give you more info. They may not help, but I'm not sure if you knew them already.

Show remote copy links
showrctransport -rcip

Show all targets or links for the remote copy groups
showrcopy -d targets
showrcopy -d links

Start and stop remote copy from the command line

Get a list of remote copy groups
showrcopy

Stop a remote copy group
stoprcopygroup <groupname>

Start a remote copy group
startrcopygroup <groupname>


Report back if you figure it out.


 Post subject: Re: RC in "Failed" status after network outage
Posted: Mon Jun 17, 2019 12:50 pm

Joined: Thu Jun 13, 2019 7:12 pm
Posts: 5
RCIP pings are succeeding.

I've run stoprcopy and startrcopy several times on both ends.

showrctransport -rcip reports "State" as "Missing" on all four ports (two local, two remote). Configuration looks good otherwise.

checkrclink freezes, as reported in the previous listing, with the server in production and testing from DR.

When running startserver in DR, I get the following at the end, after the normal MTU check and whatnot:

Code:
------------------------------------------------------------
Starting throughput test, from 0:3:1 -> x.x.121.x
------------------------------------------------------------

Could not connect with server.
Please ensure server is running.


============================================================
TEST SUMMARY from 0:3:1 -> x.x.121.x
Test Started:     Mon Jun 17 13:44:04 EDT 2019
Test Finisshed:   Mon Jun 17 13:45:09 EDT 2019
============================================================

Latency:                  12.058 ms
Lost pings:                    0 %
Through-put:                   0 Bits/second
Max MTU:                    1500
Tx TCP Segs:                 688
Rx TCP Segs:                 647
TCP retrans:                   8 %
Errored Segs:                  0 %


Check remote server is running.


Link 0:3:1 is NOT SUITABLE for Remote Copy Use

============================================================


 Post subject: Re: RC in "Failed" status after network outage
Posted: Wed Jul 03, 2019 6:12 pm

Joined: Wed Jul 03, 2019 6:06 pm
Posts: 1
We're seeing an almost identical situation here. We were testing a fiber failover, and after we finished testing and put everything back how it was, we saw exactly the same situation as you: links in "Down" status even though all the ports are up and pingable.

Same thing -- DR-side shows links up, Prod shows links down.

Code:
PRODSAN1 cli% checkrclink startclient 0:9:1 x.x.110.131 60
Running Client Side
Running link test on:  0:9:1
Test length (secs):    60
Destination Addr:      x.x.110.131
Local IP Addr:           x.x.110.41
Local Device name:       eth1

------------------------------------------------------------
Measuring link latency
------------------------------------------------------------

Average measured latency: 16.761 ms
Pings Lost:               3 %

------------------------------------------------------------
Starting max MTU test, from 0:9:1 -> x.x.110.131
------------------------------------------------------------
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500

MTU: 1500

------------------------------------------------------------
Starting throughput test, from 0:9:1 -> x.x.110.131
------------------------------------------------------------

Could not connect with server.
Please ensure server is running.


============================================================
TEST SUMMARY from 0:9:1 -> x.x.110.131
Test Started:     Wed Jul  3 19:02:31 EDT 2019
Test Finished:   Wed Jul  3 19:02:37 EDT 2019
============================================================

Latency:                  16.761 ms
Lost pings:                    3 %
Through-put:                   0 Bits/second
Max MTU:                    1500
Tx TCP Segs:                 806
Rx TCP Segs:                 758
TCP retrans:                  16 %
Errored Segs:                  0 %


Check remote server is running.


Link 0:9:1 is NOT SUITABLE for Remote Copy Use

============================================================
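
If you end up running this test on several ports, the summary block is easy to compare programmatically. Below is a rough Python sketch that pulls the numeric "Name: value" lines out of a saved checkrclink summary. It is based only on the output format pasted in this thread; the function name is hypothetical, and other 3PAR OS versions may print different fields.

```python
import re

# Matches lines such as "TCP retrans:                  16 %".
# The value must be a plain number, so dates and IP addresses are skipped.
SUMMARY_LINE = re.compile(
    r"^\s*([A-Za-z][A-Za-z \-]*?):\s+(\d+(?:\.\d+)?)(?:\s+(\S.*?))?\s*$"
)


def parse_summary(text: str) -> dict:
    """Map each numeric summary field to a (value, unit) tuple."""
    fields = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            name, value, unit = match.groups()
            fields[name] = (float(value), unit or "")
    return fields


sample = """\
Latency:                  16.761 ms
Lost pings:                    3 %
Through-put:                   0 Bits/second
Max MTU:                    1500
TCP retrans:                  16 %
"""

summary = parse_summary(sample)
print(summary["Through-put"], summary["TCP retrans"])
```

Comparing the parsed values across ports makes it easier to see which links report zero throughput and how the retransmit percentages differ.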


 Post subject: Re: RC in "Failed" status after network outage
Posted: Wed Jul 03, 2019 7:10 pm

Joined: Thu Jun 13, 2019 7:12 pm
Posts: 5
No resolution yet. I've got some professional assistance scheduled early next week from a vendor. Leadership is still unwilling to spring for actual HP support.

