Hello,
I'm not new to the industry or to backup and deduplication appliances, but I am new to Quantum and the DXi. My company is a Quantum partner.
To get to know the product, I downloaded the v5000 community edition, installed it in the main lab, configured an NFS share and ran a couple of days' worth of Veeam backups to it: 4.66 TB before reduction, 1.24 TB after reduction. All good. No problems whatsoever.
I then installed a second one (called dxi02), configured directory/file-based replication and hit "synchronize now", which ran for about 9 hours and then died with the error "Namespace Replication did not complete successfully for : backups to target: dxi02.sezaro.local. Error getting replication service status. Cannot determine the possible reason for failure".
It keeps doing this totally randomly. It can run for many hours before breaking off, or die after a couple of minutes, and anything in between. It has been doing this for the past 3 days now. I keep pressing "abort" and "synchronize now" and off it goes again.
If I understand the concept correctly, it must make an initial copy of those 1.24 TB and the accompanying metadata (using the "synchronize now" button in the GUI).
The link between the lab and the remote site is 50 Mbit, which means 12.3 MB/s. I used hypervisor-based throttling to bring it down to 10 MB/s. Transferring 1.24 TB of share data plus metadata at 10 MB/s will take a while, but that's OK.
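(Back-of-the-envelope: 1.24 TB is roughly 1,240,000 MB, so at 10 MB/s that's about 124,000 seconds, or around 34 hours, for the initial copy.)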
The link between the lab and the remote site is a reliable VPN connection which is monitored, and other traffic runs over it just fine. I don't use replication encryption because of the VPN.
Firewall ports from the lab's v5000 (called dxi01) towards the remote one (dxi02) are all open, and firewall issues are extremely unlikely, as are networking/VPN problems. The link is stable as a rock.
I must learn to troubleshoot these kinds of problems. The error "Namespace Replication did not complete successfully for : backups to target: dxi02.sezaro.local. Error getting replication service status. Cannot determine the possible reason for failure" is not very helpful.
The second (target) DXi never shows any problems whatsoever, by the way.
Where do I start looking? Which logs, etc.? I use IP addresses everywhere in the replication configuration at this point, to exclude possible DNS issues.
I attached a screenshot from dxi01 (the source) which shows the replication in progress and the speed.
The second screenshot is from the same system, dxi01, almost 4 hours later (again, the "dying interval" is totally random).
Kind regards,
Steven Rodenburg
Update: I found /var/log/DXi/tsunami.log on the source DXi, which contains info about the replication process. Right at the time another replication session ("synchronize now") died, I noticed this in that log file:
ERROR - 2022-02-14-22:55:01.247 - replicationd Error.cpp(161) [replicationd] process_error() - BFSTv2 error occurred [/home/builder/work/QuantumVTL/lib/tsunami/Blockpool/BP/BfstV2.cpp(89)::connect_handle()]): ErrorCode: 229336895 ErrorString: Busy(bfst2_connect() failed: VServer "remote-no-encrypt" connection limit failed due to VServer "remote-group".
VServer "remote-group" limit reached (limit 10).
)
ERROR - 2022-02-14-22:55:01.247 - replicationd RemoteBfstV2.cpp(224) [replicationd] connect() - Failed to connect remote-no-encrypt@172.16.1.28:1062: ErrorCode: 229336895 ErrorString: Busy(bfst2_connect() failed: VServer "remote-no-encrypt" connection limit failed due to VServer "remote-group".
VServer "remote-group" limit reached (limit 10).
).
ERROR - 2022-02-14-22:55:01.247 - replicationd RERequest.cpp(167) [replicationd] send() - request_id 6020 failed: ErrorCode: 229336895 ErrorString: Busy(bfst2_connect() failed: VServer "remote-no-encrypt" connection limit failed due to VServer "remote-group".
VServer "remote-group" limit reached (limit 10).
).
ERROR - 2022-02-14-22:55:01.247 - replicationd REChargebackThread.cpp(614) [replicationd] doProcess() - Failed to send Replication Chargeback stats to target machine : 172.16.1.28, node: backups
INFO - 2022-02-14-22:58:01.926 - replicationd Target.cpp(196) [replicationd] discover() - Probing target "172.16.1.28": Does support BFST tunnel.
WARN - 2022-02-14-23:01:04.951 - root QuickMemen(0) [qlog] - Limit reached: /usr/libexec/accounts-daemon Puniq=15468 MaxKB=5000
WARN - 2022-02-14-23:01:04.963 - root QuickMemen(0) [qlog] - Limit reached: /usr/sbin/lightdm Puniq=11276 MaxKB=3500
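To catch this the moment a session dies, I now keep a live filter running on the source. A trivial sketch (the pattern is just what matched in my log):
________________________________________________
# follow the replication log and flag the error lines shown above
tail -F /var/log/DXi/tsunami.log | grep -E 'ERROR|limit reached'
________________________________________________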
On the target DXi, in the same tsunami.log, I don't see any errors, but I do see this warning a couple of times:
"WARN - 2022-02-14-14:01:04.946 - root QuickMemen(0) [qlog] - Limit reached: /usr/bin/vmtoolsd Puniq=3988 MaxKB=3000"
For reference: the source has IP 192.168.10.28 (small: 4 CPU, 32 GB RAM) and the target DXi has IP 172.16.1.28 (tiny: 2 CPU, 4 GB RAM). Both have the standard 5 TB disk layout.
I have a hunch that the target DXi does not have enough memory. I doubled it to 8 GB and assigned it to the OS. Let's see if that helps.
Update: upping the memory of the target DXi to 8 GB did not change anything. It still dies randomly. The source (dxi01) always starts by saying it paused the replication due to a low-capacity issue or an issue with the remote unit. The latter seems to be true, according to the errors in the logs. It's always the same series of errors, by the way, so at least it's consistent.
What does "limit reached (limit 10)" mean anyway?
Update: I looked at the regular Linux logs like /var/log/messages. On the target, no errors, but on the source I get entries like this when a replication dies again.
A small snippet from the source DXi's /var/log/messages file (no need to post five miles of this stuff):
________________________________________________
2022-02-14-23:53:38.302 dxi01 Blockpool[9939]: S: [22966] (Replicate BLOB) Failed to replicate BLOB 51A36DE373F5150F97C13C92A2287A9C. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.303 dxi01 Blockpool[9939]: S: [22966] (Close Remote Session) Unable to finish the current transaction. Network connection lost. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.306 dxi01 Blockpool[9939]: S: [22966] Network connection lost. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.306 dxi01 Blockpool[9939]: S: [22966] The connection was aborted.
2022-02-14-23:53:38.317 dxi01 Blockpool[9939]: S: [22972] (Replicate BLOB) Failed to replicate BLOB 5191E0CC259652D3F5549C2580C744B9. Error writing parcel to client server. Error writing data to network socket: 172.16.1.28:1062. Broken pipe
2022-02-14-23:53:38.317 dxi01 Blockpool[9939]: S: [22972] (Close Remote Session) Unable to finish the current transaction. Network connection lost. Error writing parcel to client server. Error writing data to network socket: 172.16.1.28:1062. Broken pipe
2022-02-14-23:53:38.317 dxi01 Blockpool[9939]: S: [22971] (Replicate BLOB) Failed to replicate BLOB 5126ED7CCDAFDF237F34195DFFC35650. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.318 dxi01 Blockpool[9939]: S: [22971] (Close Remote Session) Unable to finish the current transaction. Network connection lost. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.320 dxi01 Blockpool[9939]: S: [22968] (Replicate BLOB) Failed to replicate BLOB 50C41A038199E5CCD33CF602FA2ED1B4. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.320 dxi01 Blockpool[9939]: S: [22968] (Close Remote Session) Unable to finish the current transaction. Network connection lost. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.320 dxi01 Blockpool[9939]: S: [22971] Network connection lost. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.321 dxi01 Blockpool[9939]: S: [22971] The connection was aborted.
2022-02-14-23:53:38.322 dxi01 Blockpool[9939]: S: [22972] Network connection lost. Error writing parcel to client server. Error writing data to network socket: 172.16.1.28:1062. Broken pipe
2022-02-14-23:53:38.322 dxi01 Blockpool[9939]: S: [22972] The connection was aborted.
2022-02-14-23:53:38.324 dxi01 Blockpool[9939]: S: [22968] Network connection lost. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.325 dxi01 Blockpool[9939]: S: [22968] The connection was aborted.
2022-02-14-23:53:38.334 dxi01 Blockpool[9939]: S: [22967] (Replicate BLOB) Failed to replicate BLOB 51364EB234501A15F424CC32A0A97925. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.334 dxi01 Blockpool[9939]: S: [22967] (Close Remote Session) Unable to finish the current transaction. Network connection lost. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.337 dxi01 Blockpool[9939]: S: [22967] Network connection lost. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.337 dxi01 Blockpool[9939]: S: [22967] The connection was aborted.
2022-02-14-23:53:38.343 dxi01 Blockpool[9939]: S: [22969] (Replicate BLOB) Failed to replicate BLOB 50F7EC364318CAF75C3B55065D001369. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.343 dxi01 Blockpool[9939]: S: [22969] (Close Remote Session) Unable to finish the current transaction. Network connection lost. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.346 dxi01 Blockpool[9939]: S: [22969] Network connection lost. Error reading network socket: 172.16.1.28:1062. Socket has been closed at remote.
2022-02-14-23:53:38.347 dxi01 Blockpool[9939]: S: [22969] The connection was aborted.
2022-02-14-23:53:38.360 dxi01 Blockpool[9939]: S: [24716] (Remote Connect) This server has either terminated the connection unexpectedly or is not a Blockpool-compatible server.
2022-02-14-23:53:38.366 dxi01 Blockpool[9939]: S: [24720] (Remote Connect) Error connecting to 172.16.1.28:1062. Error connecting socket: Connection refused.
2022-02-14-23:53:38.372 dxi01 Blockpool[9939]: S: [24717] (Remote Connect) This server has either terminated the connection unexpectedly or is not a Blockpool-compatible server.
2022-02-14-23:53:38.373 dxi01 Blockpool[9939]: S: [24721] (Remote Connect) Error connecting to 172.16.1.28:1062. Error connecting socket: Connection refused.
2022-02-14-23:53:38.373 dxi01 Blockpool[9939]: S: [24718] (Remote Connect) This server has either terminated the connection unexpectedly or is not a Blockpool-compatible server.
2022-02-14-23:53:38.378 dxi01 Blockpool[9939]: S: [24719] (Remote Connect) This server has either terminated the connection unexpectedly or is not a Blockpool-compatible server.
2022-02-14-23:53:38.394 dxi01 Blockpool[9939]: S: [24722] (Remote Connect) Error connecting to 172.16.1.28:1062. Error connecting socket: Connection refused.
2022-02-14-23:53:38.402 dxi01 Blockpool[9939]: S: [24723] (Remote Connect) Error connecting to 172.16.1.28:1062. Error connecting socket: Connection refused.
2022-02-14-23:53:38.405 dxi01 Blockpool[9939]: S: [24724] (Remote Connect) Error connecting to 172.16.1.28:1062. Error connecting socket: Connection refused.
________________________________________________
So this does smell like a networking issue of some sort, even though the VPN is stable as a rock (nothing else that goes over the link has any problems). On the other hand, these errors are of the type "terminated the connection unexpectedly", "socket has been closed at remote" and "connection refused", which suggests that the target DXi is killing off TCP sessions?
Update: replication still dies randomly (usually it runs for a few hours before it does), and I've not seen those networking error messages since.
The errors I copy-pasted earlier (the "VServer "remote-no-encrypt" connection limit failed due to VServer "remote-group". VServer "remote-group" limit reached (limit 10)." kind) show up every time replication dies.
Hi Steven,
I'm really sorry I didn't notice this before. You are always welcome to contact me directly at Quantum. In any case, this was brought to my attention and I'll try to find an answer.
Hi Steve ... I've been looking this over and doing some research. The problem looks to be related to it being a VM and what happens when you throttle down port speeds, particularly via the VM settings.
This is one reason we didn't enable replication-speed throttling from within the app, and it is what causes problems when speeds are reduced at the VM level. We are working on making the product handle this better, but at the moment any kind of throttling of replication speeds on the VM can cause issues with ports not being closed properly and hanging open.
When you do a sync, you are syncing everything on the target DXi. Data not protected by a snapshot will be removed if it is outdated, as the target DXi is made to mirror the sending DXi.
It looks like you have a solid connection. It's not clear, but did you ever run these long synchronizations without reducing the network speed?
Thanks!
Hi Steve,
The connection is a 50 Mbit / 12.3 MB/s link. It's rock solid. The issue happens regardless of whether I don't throttle at all (full 12.3 MB/s) or use traffic shaping on the hypervisor where the target runs (10.0 MB/s).
The reason for throttling is to keep the replication from bogging down the link completely and consuming all the available bandwidth.
The solution should be able to deal with low bandwidth and dodgy links anyway, virtual DXi or not. Veeam pride themselves on their WAN Accelerator being able to handle "reasonably unstable links" (and yes, I've tested that claim by doing terrible things during its replication, and it survived all the misery I threw at it with flying colours).
In my case, replication starts as soon as Veeam finishes writing another backup file during the nightly backups. I do directory/file replication, not namespace replication (as I don't see the point in this use case).
Small backup files are transferred without problems, but with the bigger ones (big delta, lots of new unique data) it always dies, several times. I then kill -9 replicationd, which triggers the replication to resume because a new replicationd is spawned.
That the system cannot detect a zombie replicationd process and kill it by itself is unfortunate. There is obviously some kind of watchdog, because roughly 3 seconds after I kill replicationd, a new one is spawned and replication resumes, until it dies again a couple of hours later. In other words, the watchdog (my wording) can currently detect the absence of replicationd (and start a new one), but not a hung/stuck replicationd.
Anyway, what I used to do was basically keep killing replicationd every couple of hours after it went zombie again, and repeat (retry) until everything was replicated. The big weekly backups especially often took (past tense) several retries.
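For what it's worth, that manual routine can be automated with a throwaway loop: watch tsunami.log for the failure signature and kick replicationd so the built-in respawn (those ~3 seconds mentioned above) picks things up again. Completely unsupported and crude, but it is literally what I was doing by hand:
________________________________________________
#!/bin/bash
# Stopgap only: when the known failure signature appears in the log,
# kill the stuck replicationd; the watchdog respawns it within seconds
# and the replication resumes.
tail -F /var/log/DXi/tsunami.log | while read -r line; do
  case "$line" in
    *"limit reached (limit 10)"*) pkill -9 replicationd ;;
  esac
done
________________________________________________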
I talked to Ines, and she told me to set "MaxReplicationStreams" to "4" (default: 8) in /opt/DXi/theSeer/ObjectManager.conf. Since then, all is fine. Even the big backups make it through, and replicationd has not died since.
This needs to be done on the source DXi. If you want bi-directional replication, set this value on the remote DXi as well (as it is then a "source" as well as a "target").
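For reference, this is roughly what the change looks like from the shell. I'm not certain the file syntax is identical on every release, so check with grep first; and it is my assumption (not something I've seen documented) that a freshly respawned replicationd rereads the file:
________________________________________________
# check the current value first (file syntax may vary per release)
grep -i MaxReplicationStreams /opt/DXi/theSeer/ObjectManager.conf

# set MaxReplicationStreams to 4 (factory default is 8) ...
vi /opt/DXi/theSeer/ObjectManager.conf

# ... then kill replicationd; the watchdog respawns it and (assumption)
# the new instance picks up the lowered stream count
pkill -9 replicationd
________________________________________________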
Kind regards,
Steven Rodenburg
WARNING: I updated the firmware to 4.5.0.3, after which replication crashed again in the same way.
Cause: the update reset "MaxReplicationStreams" to the factory default of "8". I set it to "4" again and now all is well.
To get the two DXis back in sync with each other, I had to do a manual "synchronize now" on each replicated share.
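Lesson learned: after any firmware update, verify that the setting survived before trusting replication again:
________________________________________________
grep -i MaxReplicationStreams /opt/DXi/theSeer/ObjectManager.conf
________________________________________________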