添加链接
link管理
链接快照平台
  • 输入网页链接,自动生成快照
  • 标签化管理网页链接
  1. #### 20191219 10:10:10,234 | com.alibaba.druid.filter.logging.Log4jFilter.statementLogError(Log4jFilter.java:152) | ERROR | {conn-10593, pstmt-38675} execute error. update operation_service set offlinemark = ? , resourcestatus = ? where RowGuid = ?

  2. com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Lock wait timeout exceeded; try restarting transaction

上述这个错误,接触 MySQL 的同学或多或少应该都遇到过,专业一点来说,这个报错我们称之为 锁等待超时

下面先写下统一的模拟场景,用于复现行锁超时问题,便于大家理解:

  1. --表结构

  2. CREATE TABLE `emp` (

  3. `id` int(11) NOT NULL,

  4. KEY `idx_id` (`id`)

  5. ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4


  6. 从1~100w插入100w行记录。


  7. --测试过程:

  8. 事务1:

  9. start transaction;

  10. delete from emp where id = 1;

  11. select * from emp where id in (select id from emp); -->模拟慢查询,执行时间很久,事务因此一直不提交,行锁也不释放.

  12. commit;


  13. 事务2:

  14. start transaction;

  15. delete from emp where id < 10; --> 处于等待id=1的行锁状态,当达到行锁超时时间(这里我配置了超时时间为 5s)后,返回行锁超时报错

  16. rollback;

5.1 手动复现场景

这个场景通常只需要通过 innodb 行锁等待脚本就可以知道当前 MySQL 的 innodb 行锁等待情况,例如我们一边模拟上述报错场景(模拟页面操作),另一边使用脚本查询(需要在超时之前查询,否则超时报错后就看不到了)。
  1. /*innodb 行锁等待脚本*/

  2. SELECT r.trx_mysql_thread_id waiting_thread,r.trx_query waiting_query,

  3. concat(timestampdiff(SECOND,r.trx_wait_started,CURRENT_TIMESTAMP()),'s') AS duration,

  4. b.trx_mysql_thread_id blocking_thread,t.processlist_command state,b.trx_query blocking_current_query,e.sql_text blocking_last_query

  5. FROM information_schema.innodb_lock_waits w

  6. JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id

  7. JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id

  8. JOIN performance_schema.threads t on t.processlist_id = b.trx_mysql_thread_id

  9. JOIN performance_schema.events_statements_current e USING(thread_id)

  1. --查看general_log是否开启,及文件名

  2. mysql> show variables like '%general_log%';

  3. +------------------+--------------------------------------+

  4. | Variable_name | Value |

  5. +------------------+--------------------------------------+

  6. | general_log | OFF |

  7. | general_log_file | /data/mysql_data/192-168-188-155.log |

  8. +------------------+--------------------------------------+


  9. --暂时开启general_log

  10. mysql> set global general_log = 1;

  11. Query OK, 0 rows affected (0.00 sec)


  12. --暂时关闭general_log

  13. mysql> set global general_log = 0;

  14. Query OK, 0 rows affected (0.00 sec)

开启 general_ log 后,手动复现的时候通过 innodb 行锁等待脚本查询结果中的线程 ID,去 general_ log 找到对应的事务分析即可,如下:

根据线程 ID 可以很轻易的从 general_log 中找到对应时间点的事务操作(实际场景下可能需要通过管道命令过滤)。如上图所示,事务 1 与事务 2 的全部 SQL 都可以找到,再通过这些 SQL 去代码中找到对应的位置即可,比如上图中线程 ID 为 111 的事务,执行 select * from emp where id in (select id from emp) 后到真正提交,过了 1min 左右,原因要么就是这条 SQL 查询慢,要么就是代码在执行其他交互操作。
PS:general_log 由于会记录所有 SQL,所以对 MySQL 性能影响较大,且容易暴涨,所以只在问题排查时暂时开启,问题排查后,请及时关闭!
我一般是通过在 Linux 上后台跑监控脚本(innodb_ lock_ monitor.sh)来记录 MySQL 阻塞情况,脚本如下:
  1. #!/bin/bash


  2. #账号、密码、监控日志

  3. user="root"

  4. password="Gepoint"

  5. logfile="/root/innodb_lock_monitor.log"


  6. while true

  7. do

  8. num=`mysql -u${user} -p${password} -e "select count(*) from information_schema.innodb_lock_waits" |grep -v count`

  9. if [[ $num -gt 0 ]];then

  10. date >> /root/innodb_lock_monitor.log

  11. mysql -u${user} -p${password} -e "SELECT r.trx_mysql_thread_id waiting_thread,r.trx_query waiting_query, \

  12. concat(timestampdiff(SECOND,r.trx_wait_started,CURRENT_TIMESTAMP()),'s') AS duration,\

  13. b.trx_mysql_thread_id blocking_thread,t.processlist_command state,b.trx_query blocking_query,e.sql_text \

  14. FROM information_schema.innodb_lock_waits w \

  15. JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id \

  16. JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id \

  17. JOIN performance_schema.threads t on t.processlist_id = b.trx_mysql_thread_id \

  18. JOIN performance_schema.events_statements_current e USING(thread_id) \G " >> ${logfile}

  19. fi

  20. sleep 5

  21. done

再次查看
  1. --使用 nohup 命令后台运行监控脚本

  2. [root@192-168-188-155 ~]# nohup sh innodb_lock_monitor.sh &

  3. [2] 31464

  4. nohup: ignoring input and appending output to ‘nohup.out’


  5. --查看 nohup.out 是否出现报错

  6. [root@192-168-188-155 ~]# tail -f nohup.out

  7. mysql: [Warning] Using a password on the command line interface can be insecure.

  8. mysql: [Warning] Using a password on the command line interface can be insecure.

  9. mysql: [Warning] Using a password on the command line interface can be insecure.


  10. --定时查看监控日志是否有输出(没有输出的话,这个日志也不会生成哦!)

  11. [root@192-168-188-155 ~]# tail -f innodb_lock_monitor.log

  12. Wed Feb 5 11:30:11 CST 2020

  13. *************************** 1. row ***************************

  14. waiting_thread: 112

  15. waiting_query: delete from emp where id < 10

  16. duration: 3s

  17. blocking_thread: 111

  18. state: Sleep

  19. blocking_query: NULL

  20. sql_text: select * from emp where id in (select id from emp)

介绍下主要原理:

1. 主要用的表有 2 张 events_ transactions_ history_ long 和 events_ statements_ history_ long。
2. transactions_ history_ long 会记录历史事务信息,events_ statements_ history_ long 则记录历史 SQL。
3. 从 transactions_ history_ long 中得到回滚事务的线程 ID,再根据时间范围去筛选出可疑的事务,最后从 events_ statements_ history_ long 得到事务对应的 SQL,从中排查哪个为源头。

缺点:

1. history_ long 相关表默认保留记录有限,可能会把有用的数据刷掉,尤其是在 SQL 运行较多的系统。
2. 如果要加大 history_ long 相关表的最大保留行数,需要重启 MySQL,无法在线修改参数。
3. history_ long 相关表记录中的时间均为相对时间,也就是距离 MySQL 启动的时长,看起来不是很方便。
4. history_ long 相关表不会主动记录行锁等待的信息,所以只能通过先根据时间范围刷选出可疑的事务,再进一步分析,不如脚本监控定位的准。
  1. /*开启performance_schema相关监控项,需要提前开启performance_schema*/

  2. UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' where name = 'transaction';

  3. UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' where name like '%events_transactions%';

  4. UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' where name like '%events_statements%';



  5. /*查看回滚事务SQL,确认是否是日志里报错的事务*/

  6. SELECT a.THREAD_ID

  7. ,b.EVENT_ID

  8. ,a.EVENT_NAME

  9. ,CONCAT (b.TIMER_WAIT / 1000000000000,'s') AS trx_druation

  10. ,CONCAT (a.TIMER_WAIT / 1000000000000,'s') sql_druation

  11. ,a.SQL_TEXT,b.STATE,a.MESSAGE_TEXT

  12. FROM performance_schema.events_statements_history_long a

  13. JOIN performance_schema.events_transactions_history_long b ON a.THREAD_ID = b.THREAD_ID

  14. AND (a.NESTING_EVENT_ID = b.EVENT_ID OR a.EVENT_ID = b.NESTING_EVENT_ID)

  15. WHERE b.autocommit = 'NO' AND a.SQL_TEXT IS NOT NULL AND b.STATE = 'ROLLED BACK'


  16. /*查看该时间段内可疑事务即超过5s的事务SQL,这里默认innodb_lock_wait_timeout为5s*/

  17. SELECT a.THREAD_ID

  18. ,b.EVENT_ID

  19. ,a.EVENT_NAME

  20. ,CONCAT (b.TIMER_WAIT / 1000000000000,'s') AS trx_druation

  21. ,CONCAT (a.TIMER_WAIT / 1000000000000,'s') sql_druation

  22. ,a.SQL_TEXT,b.STATE,a.MESSAGE_TEXT,a.ROWS_AFFECTED,a.ROWS_EXAMINED,a.ROWS_SENT

  23. FROM performance_schema.events_statements_history_long a

  24. JOIN performance_schema.events_transactions_history_long b ON a.THREAD_ID = b.THREAD_ID

  25. AND (a.NESTING_EVENT_ID = b.EVENT_ID OR a.EVENT_ID = b.NESTING_EVENT_ID)

  26. WHERE b.autocommit = 'NO' AND SQL_TEXT IS NOT NULL AND b.STATE = 'COMMITTED'

  27. AND b.TIMER_WAIT / 1000000000000 > 5

  28. AND b.TIMER_START < (SELECT TIMER_START FROM performance_schema.events_transactions_history_long

  29. WHERE THREAD_ID = 70402 /*上述SQL查询结果中的线程ID*/

  30. AND EVENT_ID = 518) /*上述SQL查询结果中的事件ID*/

  31. AND b.TIMER_END > ( SELECT TIMER_END FROM performance_schema.events_transactions_history_long

  32. WHERE THREAD_ID = 70402 /*上述SQL查询结果中的线程ID*/

  33. AND EVENT_ID = 518) /*上述SQL查询结果中的事件ID*/

  34. ORDER BY a.THREAD_ID