Enviroment: NBU 8.1.1 (Master/Media/Client) on RHEL 6.9
Symptom:
We have a large Oracle DB client with approximately 50TB, and every week once full backup and the other six days archive log backups.
And they almost all work fine, just except for only few archive log backup failed occasionally with the following similar job detail ...
03/04/2022 02:19:36 - Info nbjm(pid=21023) starting backup job (jobid=3375140) for client jcyc01db01bak\
policy jcyc01db-EDCDB-arch\
schedule Default-Application-Backup
03/04/2022 02:19:36 - Info nbjm(pid=21023) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=3375140\
request id:{7BC315D6-9B1E-11EC-9BB9-3BDF5C092EF3})
03/04/2022 02:19:36 - requesting resource jc41nbubk7-stu
03/04/2022 02:19:36 - requesting resource jcbak.NBU_CLIENT.MAXJOBS.jcyc01db01bak
03/04/2022 02:19:37 - granted resource jcbak.NBU_CLIENT.MAXJOBS.jcyc01db01bak
03/04/2022 02:19:37 - granted resource MediaID=@aaabT;DiskVolume=/jc41nbubk7_nbudata1;DiskPool=jc41nbubk7_dp;Path=/jc41nbubk7_nbudata1;StorageServer=jc41nbubk7;MediaServer=jc41nbubk7
03/04/2022 02:19:37 - granted resource jc41nbubk7-stu
03/04/2022 02:19:37 - estimated 0 kbytes needed
03/04/2022 02:19:37 - Info nbjm(pid=21023) started backup (backupid=jcyc01db01bak_1646331577) job for client jcyc01db01bak\
policy jcyc01db-EDCDB-arch\
schedule Default-Application-Backup on storage unit jc41nbubk7-stu
03/04/2022 02:19:37 - started process bpbrm (4140324)
03/04/2022 02:19:38 - connecting
03/04/2022 02:18:29 - Info bpbrm(pid=4140324) jcyc01db01bak is the host to backup data from
03/04/2022 02:18:29 - Info bpbrm(pid=4140324) reading file list for client
03/04/2022 02:18:30 - Info bpbrm(pid=4140324) listening for client connection
03/04/2022 02:18:30 - Info bpbrm(pid=4140324) INF - Client read timeout = 3600
03/04/2022 02:18:31 - Info bpbrm(pid=4140324) accepted connection from client
03/04/2022 02:18:31 - Info dbclient(pid=107296) Backup started
03/04/2022 02:18:31 - Info bpbrm(pid=4140324) bptm pid: 4140327
03/04/2022 02:19:40 - connected; connect time: 000:00:02
03/04/2022 02:18:32 - Info bptm(pid=4140327) start
03/04/2022 02:18:32 - Info bptm(pid=4140327) using 262144 data buffer size
03/04/2022 02:18:32 - Info bptm(pid=4140327) using 30 data buffers
03/04/2022 02:18:33 - Info bptm(pid=4140327) start backup
03/04/2022 02:18:33 - Info bptm(pid=4140327) backup child process is pid 4140354
03/04/2022 02:18:33 - Info dbclient(pid=107296) dbclient(pid=107296) wrote first buffer(size=262144)
03/04/2022 02:19:42 - begin writing
03/04/2022 02:37:04 - Info bptm(pid=4140327) waited for full buffer 5690 times\
delayed 71496 times
03/04/2022 02:37:04 - Info bptm(pid=4140327) EXITING with status 0 <----------
03/04/2022 02:46:14 - Error bpbrm(pid=4140324) socket read failed: errno = 104 - Connection reset by peer
03/04/2022 02:46:15 - Info dbclient(pid=107296) done. status: 13: file read failed
03/04/2022 02:47:23 - end writing; write time: 000:27:41
36382721
0
3375140
32751
So now I wonder why the bpbrm would fail at 02:46:14 after the bptm “successfully” finished at 02:37:04.
Or in other words, what the matter would illustrate between the “successful“” bptm and the failed bpbrm ...?
Backup size is large.
As per provided job details error is : socket read failed: errno = 104 - Connection reset by peer
This error can be came due to network connection break for some second.
Please confirm, Is firewall connectivity between client & Media server? and How is the network connectivity (LAN/SAN)?
If firewall in between, please ask firewall admins to monitor network traffic between the media server and client when backup is kicked off.
Please provide the media server bpbrm & bptm if possible.
Neither the firewall blocked nor the LAN network broken, but just failed occasionally ..., it mostly works fine!
1). bpbrm.log:
02:38:59.384 [4142698.4142698] <2> send_job_file: job ID 3375211, ftype = 3 msg len = 84, msg = LOG 1646332739 4 dbclient 70797 dbclient(pid=70797) wrote first buffer(size=262144)
02:38:59.386 [4142698.4142698] <2> bpbrm send_ping: PING
02:38:59.408 [4142698.4142698] <2> Bpbrm_siginfo_print: 0: delay 0 signo SIGUSR1:10 code 0 pid 4142701
02:38:59.408 [4142698.4142698] <2> bpbrm mm_sig: received ready signal from media manager
02:45:07.047 [4140489.4140489] <2> bpbrm send_ping: PING
02:46:14.993 [4140324.4140324] <16> bpbrm readline: socket read failed: errno = 104 - Connection reset by peer
02:46:14.993 [4140324.4140324] <2> set_job_details: Tfile (3375140): LOG 1646333174 16 bpbrm 4140324 socket read failed: errno = 104 - Connection reset by peer
02:46:14.993 [4140324.4140324] <2> send_job_file: job ID 3375140, ftype = 3 msg len = 91, msg = LOG 1646333174 16 bpbrm 4140324 socket read failed: errno = 104 - Connection reset
by peer
02:46:14.993 [4140324.4140324] <8> vnet_get_user_credential_path: [vnet_vxss.c:1621] status 35 0x23
02:46:14.993 [4140324.4140324] <8> vnet_check_user_certificate: [vnet_vxss_helper.c:4123] vnet_get_user_credential_path failed 35 0x23
02:46:14.993 [4140324.4140324] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jcbak, query type 1
As per my observation this issues occurred due to not sufficient Buffer size on media server.
File read failed and disconnected the connection.
As per bptm log showing buffer size issue.
Please check what is the current BUFFER size and increase it.
I don't believe buffer size is the issue here. However I do recommend using a higer NUMBER_DATA_BUFFERS. A value of 128 or 256 would be my recommendation.
If there is a firewall between client and media server, please implement TCP keep alive. It is a quick implementation and doesn't require any reboots. Basically TCP keep alive sends idle packages with a predefined interval to prevent firewall from closing them. Cisco firewalls closes idle connection after one hour per default. I am not 100% sure this is the root cause, but it is my best guess.