In this blog we are going to cover some of the most frequent replication issues that can happen to any actively used database. If they are not addressed properly, they can leave your standby slave nodes in an inconsistent state, and a slave that is not consistent with the master can easily leave you with a single point of failure. Having a consistent slave is more important than having N slaves. We will go over a couple of these errors in this post.
Slave_IO_State: Waiting for master to send event
Master_Host: genexdemo-master.com
Master_User: replication_user
Master_Port: 3306
Master_Log_File: mysql-bin.011788
Read_Master_Log_Pos: 440461708
Relay_Log_Pos: 401162462
Relay_Master_Log_File: mysql-bin.011787
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 8 failed executing transaction '542d20a8-51ee-11e7-93e0-0050569a05da:7085897700' at master log mysql-bin.011787, end_log_pos 403406130. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Exec_Master_Log_Pos: 401162249
Seconds_Behind_Master: NULL
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1032
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 8 failed executing transaction '542d20a8-51ee-11e7-93e0-0050569a05da:7085897700' at master log mysql-bin.011787, end_log_pos 403406130. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
SQL_Delay: 0
Slave_SQL_Running_State:
Channel_Name:
Issue Analysis
MySQL Error Code – 1032
Error Message
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 8 failed executing transaction ‘542d20a8-51ee-11e7-93e0-0050569a05da:7085897700’ at master log mysql-bin.011787, end_log_pos 403406130. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
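The message points to the performance_schema.replication_applier_status_by_worker table; a minimal query to see which worker failed and why (assuming MySQL 5.7 or later, where this table exists) could look like this:

mysql> SELECT WORKER_ID, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
       FROM performance_schema.replication_applier_status_by_worker
       WHERE LAST_ERROR_NUMBER <> 0\G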
Problem Statement
Data is missing on the slave
Cause of Problem
Someone may have accidentally deleted records directly on the slave instead of on the master. When transactions touching the same rows later arrive through replication from the master, the slave ends up in the failed state shown above. In short, transactions from the master that reference records that no longer exist on the slave fail with error 1032.
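As an illustration, a sequence like the one below (the table name and row are made up) leaves the slave unable to apply the master's event:

-- On the slave, an accidental direct write removes a row
mysql> DELETE FROM demo_db.orders WHERE id = 42;

-- Later the master updates or deletes the same row; with row-based
-- replication the applier cannot find the row on the slave, so the
-- worker fails with error 1032 and the SQL thread stops.
mysql> UPDATE demo_db.orders SET status = 'shipped' WHERE id = 42;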
Short-term Solution
Refresh the inconsistent tables using an xtrabackup, MySQL Enterprise Backup (meb), or mysqldump backup taken from the master.
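For example, a single inconsistent table could be refreshed with mysqldump roughly as follows (database, table and file names are illustrative; --set-gtid-purged=OFF keeps the dump from touching the slave's gtid_executed):

# On the master: dump only the inconsistent table with a consistent snapshot
mysqldump --single-transaction --set-gtid-purged=OFF demo_db orders > orders.sql

# On the slave: pause the applier, reload the table, resume replication
mysql -e "STOP SLAVE SQL_THREAD;"
mysql demo_db < orders.sql
mysql -e "START SLAVE SQL_THREAD;"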
Permanent Solution
Identify the transactions causing discrepancies on the slaves, fix those transactions, and make sure they flow properly through replication. Mark your slaves as super_read_only so accidental direct writes from all users are blocked.
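A minimal sketch of locking a slave down (super_read_only is available from MySQL 5.7.8 onwards and does not block the replication threads):

-- On the slave: block writes from all users, including SUPER users
mysql> SET GLOBAL super_read_only = ON;

# And persist it in my.cnf so it survives a restart
[mysqld]
super_read_only = ON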
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: genexdemo-master.dom
Master_User: replication_user
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.002011
Read_Master_Log_Pos: 266143481
Relay_Log_File: mysql-relay-bin.000028
Relay_Log_Pos: 264952719
Relay_Master_Log_File: mysql-bin.000011
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1062
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '52347a8b-fbc3-11e7-afa2-0050568181e9:122276511' at master log mysql-bin.000011, end_log_pos 264955815. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 264955505
Relay_Log_Space: 266143517
Until_Condition: None
Last_SQL_Errno: 1062
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '52347a8b-fbc3-11e7-afa2-0050568181e9:122276511' at master log mysql-bin.000011, end_log_pos 264955815. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
1 row in set (0.00 sec)
Issue Analysis
MySQL Error Code – 1062
Issue
The slave fails to replicate because of a duplicate entry for the primary key of the table.
Error Message
Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '52347a8b-fbc3-11e7-afa2-0050568181e9:122276511' at master log mysql-bin.000011, end_log_pos 264955815. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Reason
As already mentioned for error code 1032, we refresh the tables that are not in sync. Each instance is quite large, and we take online backups from the master.
We stop replication on the slave before starting the backup from the master, so the replication coordinates of the backup end up ahead of the coordinates at which the slave was stopped.
In other words, the imported table reflects a state newer than the position at which replication stopped on the slave. Until replication catches up to the state of the refreshed table, we keep getting duplicate entry errors for data that already exists in that table.
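For instance, when the dump is taken with mysqldump --master-data=2, the coordinates it was taken at are written as a commented CHANGE MASTER TO line near the top of the file and can be compared against Exec_Master_Log_Pos on the slave (file name and position below are illustrative):

grep -m1 "CHANGE MASTER TO" orders.sql
# -- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000011', MASTER_LOG_POS=264960000;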
Quick Resolution
It is absolutely fine to skip duplicate entry errors (error code 1062) on a table refreshed this way.
Option 1:
mysql> stop slave;
mysql> set global sql_slave_skip_counter=1;
mysql> start slave;
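Note that sql_slave_skip_counter only applies to binary log file/position based replication; with GTID-based replication (which the GTIDs in the errors above suggest is in use), the equivalent is to inject an empty transaction for the failing GTID, roughly like this (the GTID value comes from the error message):

mysql> STOP SLAVE;
mysql> SET GTID_NEXT='52347a8b-fbc3-11e7-afa2-0050568181e9:122276511';
mysql> BEGIN; COMMIT;
mysql> SET GTID_NEXT='AUTOMATIC';
mysql> START SLAVE;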
Option 2:
You can also delete the duplicate records on the slave if you are sure the data on the slave is corrupt; deleting those records from the slave and letting replication backfill them is another possible option, as in the sketch below.
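A sketch of that approach with a made-up table and key value (the conflicting key is reported in the error message and in replication_applier_status_by_worker):

-- On the slave: remove the row that conflicts with the replicated insert
mysql> DELETE FROM demo_db.orders WHERE id = 1001;
mysql> START SLAVE;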
Permanent Resolution
Perform a consistency check between the master and the slave and record the tables that are not in a consistent state. Take a consistent backup from the master or from any other healthy slave, whichever is possible, load that backup onto the broken slave, and resume replication.
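One common way to run such a check is Percona Toolkit's pt-table-checksum, executed against the master so the checksums propagate to the slaves through replication (host, user, password and database names below are placeholders):

# On the master: checksum tables; results land in percona.checksums on every slave
pt-table-checksum --replicate=percona.checksums --databases=demo_db \
  h=genexdemo-master.com,u=checksum_user,p=secret

# pt-table-sync can then print the statements needed to fix the differences
pt-table-sync --print --replicate=percona.checksums \
  h=genexdemo-master.com,u=checksum_user,p=secret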
Conclusion
There can be multiple ways to address replication issues, and in an ideal world replication is the most reliable way of having a standby server. However, we have to accept that we cannot build systems that never fail; what gives us an edge over failures is how well we are prepared and equipped to handle them.
Do you know all the possible failures and their causes in your system? Do you have monitoring that can help you detect the failure points? Do you have support that can pull you out of such failures efficiently? At genexdb we help our customers build the solutions and processes their environment needs to handle such scenarios.
Please reach out to us at [email protected] if you need help with improving and managing your database systems. We support a wide range of database and data streaming software; visit genexdbs.com to learn more about us.