Tracking down MySQL Deadlock ERROR 1213 (40001) in a webgame

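How the case surfaced:
In the exception log of a webgame we operate, we saw errors reported when the program executed MySQL statements. Most of them were "deadlock found when trying to get lock; try restarting transaction"; a few were "error number: 1205:lock wait timeout exceeded; try restarting transaction", as shown below: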
001 --> 2012-11-22 06:05:36 --> error -->system/database/driver.php--777--log--debug
002 --> 2012-11-22 06:05:36 --> error -->system/database/driver.php--295--error--jv_driver
003 --> 2012-11-22 06:05:36 --> error -->system/database/activerecord.php--947--query--jv_driver
004 --> 2012-11-22 06:05:36 --> error -->server/models/mrolemonster.php--84--update--jv_activerecord
005 --> 2012-11-22 06:05:36 --> error -->server/daemon/update.php--392--kill--mrolemonster
006 --> 2012-11-22 06:05:36 --> error --> database: xxx_roles_xxx(10.1.1.75) --> error number: 1205:#####lock wait timeout exceeded; try restarting transaction##### --> error message: #####db_query_error --> query error: update `monster` set `kills` = kills + 1 where `id` = '30036' and `role_id` = '19863'.##### --> query elapsed counter: 184293;time 590.4272678 --> database connection has be closed:dbwrole

001 --> 2012-11-28 15:59:47 --> error -->system/database/driver.php--777--log--debug
002 --> 2012-11-28 15:59:47 --> error -->system/database/driver.php--295--error--jv_driver
003 --> 2012-11-28 15:59:47 --> error -->system/database/activerecord.php--948--query--jv_driver
004 --> 2012-11-28 15:59:47 --> error -->server/models/mrole.php--1143--update--jv_activerecord
005 --> 2012-11-28 15:59:47 --> error -->server/daemon/update_other.php--283--updaterolestate--mrole
006 --> 2012-11-28 15:59:47 --> error --> database: xxx_roles_xxx(10.1.1.72) --> error number: 1213:#####deadlock found when trying to get lock; try restarting transaction##### --> error message: #####db_query_error --> query error: update `role_state` set `state` = 1 where `role_id` = '53016'.##### --> query elapsed counter: 4972;time 4.2417307 --> database connection has be closed:dbwrole
007 --> 2012-11-28 15:59:47 --> error -->system/database/driver.php--616--log--debug
008 --> 2012-11-28 15:59:47 --> error -->server/daemon/combat_update.php--308--transcomplete--jv_driver
009 --> 2012-11-28 15:59:47 --> error --> db transaction failure.
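Going by the English error text, two exceptions had occurred: a deadlock and a lock wait timeout. Both were hit by our long-running backend PHP daemons. The code lines in the trace amount to nothing more than a call that executes a SQL statement; there is nothing special about them. An experienced programmer can tell at a glance that this is not a bug in the program itself but lock contention between MySQL transactions; the client (the PHP daemon) has no syntax errors. So how do we investigate?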
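A string of questions:
What kind of problem is this? How do we investigate it? When does a deadlock happen? How do I know that one happened? Where do I look once it has happened? How do I determine all the SQL statements in the transactions involved? Which transactions were they? Who took a lock first? Who took one later? Who failed to get a lock? Who received the deadlock error? What is a deadlock? Why did it happen? How can it be avoided? What other factors play a role?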
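Not a clue:
It's data exchanged between programs, so bring out the almighty strace?
Trace whom? The client (PHP)? Do you know which client will hit the problem? Do you know when it will happen? Between the moment you start capturing and the moment you catch a deadlock, how much data would that be?
Trace whom? The server (MySQL)? You must be joking. mysqld serves every client connection on its own thread, so with strace -ff you would get one output file per thread; imagine how many files would be created, and the data volume would hardly be small.
"Taking the enemy general's head in the midst of ten thousand troops" is a skill I don't have. Using strace to chase this kind of error is a non-starter.
Who reported the error? Obviously MySQL. So start at the source: the MySQL server.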
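Capturing the scene:
We need to reconstruct the scene of the crime. Fortunately, our monitoring records both the binlog and the output of show engine innodb status. On the MySQL server in question, run "show engine innodb status" to get the current state of the InnoDB engine. It looks roughly like this: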
......
------------------------
latest detected deadlock
------------------------
121128 15:59:46
*** (1) transaction:
transaction ac512256, active 0 sec starting index read
mysql tables in use 1, locked 1
lock wait 4 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 1
mysql thread id 122562823, os thread handle 0x7fa5c4fbe700, query id 7457663621 10.1.1.8 s001_gamedb updating
update `role_state` set `state` = 1 where `role_id` = '53016'
*** (1) waiting for this lock to be granted:
record locks space id 477 page no 1386 n bits 128 index `primary` of table `xxx_roles_xxx`.`role_state` trx id ac512256 lock_mode x locks rec but not gap waiting
record lock, heap no 17 physical record: n_fields 80; compact format; info bits 0
 0: len 3; hex 00cf18; asc ;;
............
*** (2) transaction:
transaction ac512250, active 0 sec inserting, thread declared inside innodb 500
mysql tables in use 1, locked 1
6 lock struct(s), heap size 1248, 3 row lock(s), undo log entries 2
mysql thread id 122679850, os thread handle 0x7fac007ff700, query id 7457663711 10.1.1.8 s001_gamedb update
replace into `role_fight` (`role_id`, `life_max`, `mana_max`, `attack_physical`, `attack_internal`,****) values ('53016', 4967, 3291, 350, 174, ***)
*** (2) holds the lock(s):
record locks space id 477 page no 1386 n bits 128 index `primary` of table `xxx_roles_xxx`.`role_state` trx id ac512250 lock_mode x locks rec but not gap
record lock, heap no 17 physical record: n_fields 80; compact format; info bits 0
............
*** (2) waiting for this lock to be granted:
record locks space id 427 page no 488 n bits 192 index `primary` of table `xxx_roles_xxx`.`role_fight` trx id ac512250 lock_mode x locks rec but not gap waiting
record lock, heap no 64 physical record: n_fields 51; compact format; info bits 0
......
*** we roll back transaction (1)
......
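This is a trimmed-down version. I kept only the latest detected deadlock section, which describes the most recent deadlock InnoDB detected. For a more detailed explanation, see the description of "standard monitor and lock monitor output" in the official MySQL manual.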
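To make the structure of that section concrete, here is a minimal sketch that pulls out each transaction's thread id, the tables it holds or waits for, and the victim. It assumes the lowercased layout shown in this article (real output is mixed case and may differ between MySQL versions); `parse_deadlock` and `SAMPLE` are hypothetical names, and the replace statement in the sample is abbreviated.

```python
import re

# Abbreviated copy of the deadlock section shown above (column list shortened).
SAMPLE = """\
latest detected deadlock
------------------------
121128 15:59:46
*** (1) transaction:
transaction ac512256, active 0 sec starting index read
mysql thread id 122562823, os thread handle 0x7fa5c4fbe700, query id 7457663621 10.1.1.8 s001_gamedb updating
update `role_state` set `state` = 1 where `role_id` = '53016'
*** (1) waiting for this lock to be granted:
record locks space id 477 page no 1386 n bits 128 index `primary` of table `xxx_roles_xxx`.`role_state` trx id ac512256 lock_mode x locks rec but not gap waiting
*** (2) transaction:
transaction ac512250, active 0 sec inserting, thread declared inside innodb 500
mysql thread id 122679850, os thread handle 0x7fac007ff700, query id 7457663711 10.1.1.8 s001_gamedb update
replace into `role_fight` (`role_id`) values ('53016')
*** (2) holds the lock(s):
record locks space id 477 page no 1386 n bits 128 index `primary` of table `xxx_roles_xxx`.`role_state` trx id ac512250 lock_mode x locks rec but not gap
*** (2) waiting for this lock to be granted:
record locks space id 427 page no 488 n bits 192 index `primary` of table `xxx_roles_xxx`.`role_fight` trx id ac512250 lock_mode x locks rec but not gap waiting
*** we roll back transaction (1)
"""

SECTION = re.compile(
    r"\*\*\* \((\d+)\) (transaction|holds the lock|waiting for this lock)", re.I)
VICTIM = re.compile(r"\*\*\* we roll back transaction \((\d+)\)", re.I)
THREAD = re.compile(r"mysql thread id (\d+)", re.I)
TABLE = re.compile(r"of table `[^`]+`\.`([^`]+)`")
MODES = {"transaction": "header", "holds the lock": "holds",
         "waiting for this lock": "waits"}


def parse_deadlock(text):
    """Summarize a 'latest detected deadlock' section as a small dict."""
    result = {"transactions": {}, "victim": None}
    current, mode = None, None
    for line in text.splitlines():
        m = SECTION.match(line)
        if m:  # a new "*** (n) ..." block starts
            current = int(m.group(1))
            mode = MODES[m.group(2).lower()]
            result["transactions"].setdefault(
                current, {"thread_id": None, "holds": [], "waits": []})
            continue
        m = VICTIM.match(line)
        if m:
            result["victim"] = int(m.group(1))
            continue
        if current is None:
            continue
        trx = result["transactions"][current]
        m = THREAD.match(line)
        if m and mode == "header":
            trx["thread_id"] = int(m.group(1))
            continue
        m = TABLE.search(line)
        if m and mode in ("holds", "waits"):
            trx[mode].append(m.group(1))
    return result
```

Calling `parse_deadlock(SAMPLE)` yields one entry per transaction, which makes it easy to see at a glance who holds what, who waits for what, and which transaction was rolled back.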
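OK, we have found a case. Save this InnoDB status output for later, then quickly check the program's exception log for a deadlock at the same point in time. Sure enough, the exception log recorded this case (it is the log at the beginning of this article).
Next, pull from the binlog the SQL statements MySQL executed in the roughly 10 minutes before and after this moment.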
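Analyzing the case:
From the engine status we can see the data MySQL kept about the lock contention between the two transactions.
Transaction 1 "executed" (note the quotation marks)
update `role_state` set `state` = 1 where `role_id` = '53016'
It found that the record it wanted to modify was already locked with lock_mode x (see: InnoDB lock modes) and started waiting for that lock to be released.
Transaction 2 executed
replace into `role_fight` (`role_id`, `life_max`, `mana_max`, `attack_physical`, `attack_internal`,****) values ('53016', 4967, 3291, 350, 174, ***)
It also found that the record it needed was already locked with lock_mode x.
At the very end, MySQL gives a very important piece of information: "we roll back transaction (1)". MySQL rolled back transaction 1, so transaction 1's statement is the one that received the deadlock error, and it should be the one recorded in our program's exception log. (InnoDB picks the transaction it considers cheapest to roll back as the victim, so the victim is not necessarily the transaction that made the last lock request.) Meanwhile, MySQL went on to execute transaction 2, so transaction 2's SQL statements must be recorded in the binlog.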
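Peeling back the layers:
How do we determine which SQL statements transaction 1 and transaction 2 executed?
From the show engine innodb status output, we know the following about transaction 2:
SQL statement (role_id is the unique identifier on the business side): replace into `role_fight` (`role_id`, `life_max`, `mana_max`, `attack_physical`, `attack_internal`,****) values ('53016', 4967, 3291, 350, 174, ***)
Thread id (MySQL's unique identifier): mysql thread id 122679850
Execution time (the timeline): 121128 15:59:46
With these three identifiers, plus the binlog's start and end markers "begin" and "commit", we can determine the SQL statements in this transaction with near certainty.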
The binlog looks roughly like this:
# at 511750764
#121128 15:59:46 server id 1 end_log_pos 511750843 query thread_id=122679850 exec_time=0 error_code=0
set timestamp=1354089586/*!*/;
begin
/*!*/;
# at 511750843
#121128 15:59:46 server id 1 end_log_pos 511751090 query thread_id=122679850 exec_time=0 error_code=0
use xxx_roles_xxx/*!*/;
set timestamp=1354089586/*!*/;
update `role_pet` set `in_supporting` = 0, `levelup_pause_time` = 1354089587, `auto_feed` = 0, `supporting_pause_time` = 1354089587
where `role_id` = '53016'
and `id` = 9234
/*!*/;
# at 511751090
#121128 15:59:46 server id 1 end_log_pos 511751240 query thread_id=122679850 exec_time=0 error_code=0
set timestamp=1354089586/*!*/;
update `role_state` set `pet` = 0, `pet_level` = 0
where `role_id` = '53016'
/*!*/;
# at 511751240
#121128 15:59:46 server id 1 end_log_pos 511751885 query thread_id=122679850 exec_time=0 error_code=0
set timestamp=1354089586/*!*/;
replace into `role_fight` (`role_id`, `life_max`, `mana_max`, `attack_physical`, `attack_internal`, `defend_physical`, `defend_internal`, `dodge_rate`, `critical_rate`, `hit_rate`, `speed`, `defend_physical_plus`, `defend_internal_plus`, `dodge_level`,*****) values ('53016', 4967, 3291, 350, 174, 518, 254, 500, 300, 9500, 913, 668, 668, 261, 700, 97, 133, 40.9, 34, *****)
/*!*/;
# at 511751885
#121128 15:59:46 server id 1 end_log_pos 511751912 xid = 7457663579
commit/*!*/;
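Matching on thread_id and the begin/commit markers can be automated. The sketch below is one possible way to do it, assuming the text layout that mysqlbinlog prints for statement-based events (as in the dump above) and assuming transactions are not interleaved in the binlog. `transactions_for_thread` is a hypothetical helper, and the thread ids and offsets in `SAMPLE` are made up.

```python
import re

# Made-up miniature binlog dump with two threads (11 and 22).
SAMPLE = """\
# at 100
#121128 15:59:46 server id 1 end_log_pos 179 query thread_id=11 exec_time=0 error_code=0
set timestamp=1354089586/*!*/;
begin
/*!*/;
# at 179
#121128 15:59:46 server id 1 end_log_pos 300 query thread_id=11 exec_time=0 error_code=0
use xxx_roles_xxx/*!*/;
set timestamp=1354089586/*!*/;
update `role_state` set `pet` = 0
where `role_id` = '53016'
/*!*/;
# at 300
#121128 15:59:46 server id 1 end_log_pos 327 xid = 1
commit/*!*/;
# at 327
#121128 15:59:47 server id 1 end_log_pos 400 query thread_id=22 exec_time=0 error_code=0
set timestamp=1354089587/*!*/;
begin
/*!*/;
# at 400
#121128 15:59:47 server id 1 end_log_pos 500 query thread_id=22 exec_time=0 error_code=0
set timestamp=1354089587/*!*/;
update `monster` set `kills` = kills + 1 where `id` = '30036'
/*!*/;
# at 500
#121128 15:59:47 server id 1 end_log_pos 527 xid = 2
commit/*!*/;
"""


def transactions_for_thread(binlog_text, thread_id):
    """Return a list of transactions (lists of SQL strings) run by one thread."""
    transactions, current = [], None
    # Every event starts with a "# at <offset>" line.
    for event in re.split(r"^# at \d+\s*$", binlog_text, flags=re.M):
        lines = [ln for ln in event.strip().splitlines() if ln.strip()]
        if not lines:
            continue
        header, body = lines[0], lines[1:]
        parts = []
        for ln in body:
            ln = ln.replace("/*!*/;", "").strip()
            # Skip delimiters and session bookkeeping, keep the real SQL.
            if not ln or ln.lower().startswith(("set timestamp", "use ")):
                continue
            parts.append(ln)
        sql = " ".join(parts)
        m = re.search(r"thread_id=(\d+)", header)
        if m and int(m.group(1)) == thread_id:
            if sql.lower() == "begin":
                current = []           # a transaction of our thread starts
            elif current is not None:
                current.append(sql)
        elif re.search(r"\bxid = \d+", header, re.I) and current is not None:
            transactions.append(current)  # the commit closes it
            current = None
    return transactions
```

For the sample, `transactions_for_thread(SAMPLE, 11)` returns a single transaction containing only the `role_state` update; statements from thread 22 are ignored.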
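OK, that is all of transaction 2's SQL statements. What about transaction 1? How do we find those?
Start from the PHP exception report, which gives the main SQL statement involved and the code lines in the trace, then follow the code logic to work out all the SQL statements in that transaction. After that, look in the binlog for a similar entry for the same user and the same operation: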
# at 511805324
#121128 15:59:53 server id 1 end_log_pos 511805403 query thread_id=122562823 exec_time=0 error_code=0
set timestamp=1354089593/*!*/;
begin
/*!*/;
# at 511805403
#121128 15:59:53 server id 1 end_log_pos 511805560 query thread_id=122562823 exec_time=0 error_code=0
use xxx_roles_xxx/*!*/;
set timestamp=1354089593/*!*/;
update `role_fight` set `last_update_life` = '1354089587'
where `role_id` = '53016'
/*!*/;
# at 511805560
#121128 15:59:53 server id 1 end_log_pos 511805695 query thread_id=122562823 exec_time=0 error_code=0
set timestamp=1354089593/*!*/;
update `role_state` set `state` = 1
where `role_id` = '53016'
/*!*/;
# at 511805695
#121128 15:59:53 server id 1 end_log_pos 511805889 query thread_id=122562823 exec_time=0 error_code=0
use xxx_roles_xxx/*!*/;
set timestamp=1354089593/*!*/;
delete from `queue_combats_update_roles`
where `combat_id` = 'f27d62dad8efcaeb04cd8f5d7c0424db'
and `role_id` = '53016'
/*!*/;
# at 511805889
#121128 15:59:53 server id 1 end_log_pos 511805916 xid = 7457670215
commit/*!*/;
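(Don't read too much into the fact that the thread_id in the binlog above matches the thread_id in show engine innodb status. Our program is a long-running daemon that keeps its MySQL connection open and never tears it down, so the ids are the same. Also, this binlog entry is the transaction the program resubmitted after it hit the deadlock and MySQL rolled it back; it is the same transaction at a later point in time.)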
Reconstructing the case:
Based on all the SQL statements of the two InnoDB transactions at the scene, and on how an InnoDB deadlock forms (thanks to Daxiong from the DBA team for the correction), we can reconstruct the case roughly like this:
Transaction 1:
update `role_fight` set `last_update_life` = '1354089587' where `role_id` = '53016'
update `role_state` set `state` = 1 where `role_id` = '53016'
Transaction 2:
update `role_state` set `pet` = 0, `pet_level` = 0 where `role_id` = '53016'
replace into `role_fight` (`role_id`, `life_max`, `mana_max`, `attack_physical`, `attack_internal`,****) values ('53016', 4967, 3291, 350, 174, ***)
These four statements are the entire cause of this deadlock.
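The execution order was most likely as follows (steps 5 and 6 may have happened in the opposite order; the resulting deadlock is the same):
Step | Transaction 1 | Transaction 2 | Notes
1 | begin | |
2 | | begin |
3 | | update `role_state` set `pet` = 0, `pet_level` = 0 where `role_id` = '53016' | Transaction 2 takes an X lock on the role_state record with role_id 53016.
4 | update `role_fight` set `last_update_life` = '1354089587' where `role_id` = '53016' | | Transaction 1 takes an X lock on the role_fight record with role_id 53016.
5 | | replace into `role_fight` (`role_id`, `life_max`, `mana_max`, `attack_physical`, `attack_internal`,****) values ('53016', 4967, 3291, 350, 174, ***) | This is the key step. Transaction 2 tries to take an X lock on the role_fight record for this role_id, finds it already locked by someone else (transaction 1), and starts waiting for that transaction to commit... and waits...
6 | update `role_state` set `state` = 1 where `role_id` = '53016' | | Transaction 1 tries to take an exclusive X lock on the role_state record with role_id 53016 and finds it locked by another transaction, one that is itself waiting for transaction 1 to commit. MySQL immediately rolls back transaction 1. (PHP sees the deadlock error returned by MySQL, writes it to the exception log and sends a rollback, although MySQL has already "helped" by rolling it back.)
7 | | [executed successfully...] | Transaction 2 sees that the other side has released its lock, acquires the X lock, and the modification succeeds.
8 | | commit | The PHP program sees that the previous statement completed without error and sends commit to commit the transaction.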
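The table boils down to a cycle in a wait-for graph: transaction 2 waits for transaction 1, and transaction 1 waits for transaction 2. The following toy model replays lock requests in order and reports the request that closes such a cycle. It is a simplified illustration, not InnoDB's actual algorithm: locks are exclusive only, nothing is ever released, and `find_deadlock` and the resource names are hypothetical.

```python
def find_deadlock(requests):
    """Replay (transaction, resource) X-lock requests in order.

    Returns the first request that closes a wait cycle, or None.
    """
    owner = {}      # resource -> transaction holding its X lock
    waits_for = {}  # blocked transaction -> transaction it waits for
    for trx, resource in requests:
        holder = owner.get(resource)
        if holder is None or holder == trx:
            owner[resource] = trx  # lock granted
            continue
        waits_for[trx] = holder    # trx is now blocked on holder
        # Follow the chain of waiters; reaching trx again means a cycle.
        node = holder
        while node in waits_for:
            node = waits_for[node]
            if node == trx:
                return trx, resource
    return None


# The order from the table above (steps 3 to 6).
as_observed = [
    ("T2", "role_state:53016"),
    ("T1", "role_fight:53016"),
    ("T2", "role_fight:53016"),
    ("T1", "role_state:53016"),
]

# Both transactions touch role_fight first: T1 merely waits for T2.
same_order = [
    ("T2", "role_fight:53016"),
    ("T1", "role_fight:53016"),
    ("T2", "role_state:53016"),
]
```

With `as_observed` the last request closes the cycle, while `same_order` produces only a plain lock wait and no deadlock.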
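There also seems to be a parameter:
What is innodb_lock_wait_timeout for? According to the official MySQL manual, it limits how long a transaction waits for a lock before giving up. It has nothing to do with deadlocks: as soon as MySQL detects a deadlock, it immediately rolls back one of the transactions involved, and this parameter never comes into play.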
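Since the two errors mean different things, the client can treat them differently. Below is a minimal sketch of a retry wrapper that re-runs the whole transaction on error 1213 and passes everything else, including 1205, up to the caller (by default a lock wait timeout rolls back only the timed-out statement, so the caller has to decide what to do with the open transaction). `DbError`, `run_transaction` and `work` are hypothetical stand-ins for whatever your database driver provides.

```python
import time

ER_LOCK_WAIT_TIMEOUT = 1205
ER_LOCK_DEADLOCK = 1213


class DbError(Exception):
    """Hypothetical driver error that carries the MySQL error number."""

    def __init__(self, errno, message=""):
        super().__init__(message)
        self.errno = errno


def run_transaction(work, max_attempts=3, backoff=0.05, sleep=time.sleep):
    """Run work() (one whole begin..commit transaction), retrying on 1213.

    A deadlock victim has already been rolled back by the server, so the
    only sensible retry unit is the entire transaction.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except DbError as exc:
            # Anything but a deadlock, or the final attempt: give up.
            if exc.errno != ER_LOCK_DEADLOCK or attempt == max_attempts:
                raise
            sleep(backoff * attempt)  # brief pause before trying again
```

The pause between attempts is only there to let the competing transaction finish; the values are arbitrary placeholders.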
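How to avoid it:
1. Reduce the number of statements in a transaction.
2. Adjust the order in which the SQL statements execute, turning a "deadlock" into a "lock wait". Waiting a little is better than having the whole transaction rolled back and running the entire flow all over again.
3. Others; additions are welcome.
About lock waits:
Reduce the number of SQL statements per transaction to keep transactions small. Making lookups faster and queries shorter is, of course, just as essential. We found that several of our SQL statements did not use an index, which effectively locked the whole table and led to lock waits...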
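One way to apply the "adjust the execution order" advice is to agree on a fixed table order and have every transaction issue its writes in that order. The sketch below assumes the statements within a transaction are independent of each other (otherwise they cannot be freely reordered); `TABLE_ORDER` and `in_lock_order` are hypothetical names, and the replace statement is abbreviated.

```python
# Hypothetical project-wide convention: always touch role_fight before role_state.
TABLE_ORDER = ["role_fight", "role_state"]


def in_lock_order(statements):
    """Sort (table, sql) pairs by the agreed table order.

    sorted() is stable, so statements on the same table keep their
    relative order; tables not listed in TABLE_ORDER go last.
    """
    rank = {table: i for i, table in enumerate(TABLE_ORDER)}
    return sorted(statements, key=lambda s: rank.get(s[0], len(rank)))


# Transaction 2 from this case, in the order it was originally issued.
transaction_2 = [
    ("role_state", "update `role_state` set `pet` = 0, `pet_level` = 0 where `role_id` = '53016'"),
    ("role_fight", "replace into `role_fight` (`role_id`) values ('53016')"),
]

for table, sql in in_lock_order(transaction_2):
    print(table, "->", sql)
```

After reordering, transaction 2 locks `role_fight` first, the same as transaction 1, so the two can at worst wait for each other instead of deadlocking.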
Note:
It's the end of the year and I'm padding my KPIs; please bear with me.
Original article: "webgame中mysql deadlock error 1213 (40001)错误的排查历程". Thanks to the original author for sharing.
