Debugging Deadlocks (in Ruby on Rails) | by Ollie Haydon-Mulligan | Mar, 2022



Until recently, I hadn’t dug into the causes and implications of database deadlocks. Then one day, an endpoint I’d been working on suffered several thousand ActiveRecord::Deadlocked errors in a few hours 😬.

In this post, I’ll share some things I learnt about how deadlocks happen, how to avoid them, and why they are a problem. (Scroll to the bottom for those takeaways).

But in the main I’m going to focus on how I went about investigating them. Here’s the TL;DR of techniques that might help you debug a deadlock:

  • get a snapshot of what locks have been granted and denied to each running transaction with ActiveRecord::Base.connection.execute("SHOW ENGINE INNODB STATUS").each{|row| puts row} and innodb_status_output_locks enabled
  • replicate overlapping transactions (in specs) to see how they impact each other. Turn off use_transactional_tests and start separate transactions in separate Threads
  • after a deadlock has happened, check the “LATEST DETECTED DEADLOCK” section of the “SHOW ENGINE INNODB STATUS” output to get a detailed report on the cause

Background

It’ll help to have an idea of the endpoint I’d been working on.

It was responsible for creating or updating a Device record, by

a) setting some attributes on the Device itself (i.e. on a row in the devices table), including token and platform, and

b) setting a list of push notification ‘subscriptions’ for the Device; each ‘subscription’ is a record stored in a separate push_notification_subscriptions table with a device_id referring to the Device and a public_alias which identifies a push notification type

The endpoint is called by apps when they’ve fetched a token from Firebase. The up-to-date Device record – with the Firebase token set – can later be used to send push notifications (via Firebase) to that device 🎉.
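To make the data model concrete, here is a hedged sketch of the two tables as a migration. The table and column names come from the post; the types, options, and migration class are assumptions:

```ruby
# Hypothetical migration sketching the two tables described above.
# Table and column names are from the post; everything else is assumed.
class CreateDevicesAndSubscriptions < ActiveRecord::Migration[6.1]
  def change
    create_table :devices do |t|
      t.string :token     # the Firebase token
      t.string :platform
      t.timestamps
    end

    create_table :push_notification_subscriptions do |t|
      # The foreign key constraint turns out to matter later in the
      # investigation: inserting a subscription requires the referenced
      # devices row to exist (and keep existing) for the duration.
      t.references :device, null: false, foreign_key: true
      t.string :public_alias  # identifies a push notification type
      t.timestamps
    end
  end
end
```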

The change I had made was to insert all of the PushNotificationSubscription records for the Device in a single database insert, after a single delete operation to clear the out-of-date subscriptions. This cut the number of inserts significantly, and sped up the endpoint a lot.

But it also prompted a lot of deadlocks.

These were raised during the insert operation (for inserting all the PushNotificationSubscriptions), and we realised a lot of the time this insert was being carried out unnecessarily. So we added a check that massively cut the number of inserts, thereby cutting the number of deadlocks.

But still we were getting a steady-ish stream of deadlocks… and wanted to understand why, and what the implications were.

The Investigation 🕵️

1. Getting better visibility

We already suspected part of the cause of the deadlocks was rapid requests coming from the same device to the same endpoint. We confirmed (by adding some logging) that the apps do send requests less than 100ms apart from the same user. But how might these overlapping requests cause deadlocks?

I read up on locking in MySQL and some Stack Overflow advice about avoiding deadlocks, but I was struggling to apply the information to this particular problem. Reasoning about what locks should or might be requested and denied felt way too speculative. I needed to get some reliable and detailed information about what locks were actually being requested.

So I switched from looking into how locks work to finding out how to get visibility on what locks are being requested/granted/released/blocked.

The key tool that really helped lift the veil of mystery is SHOW ENGINE INNODB STATUS. Execute it in MySQL and it gives a detailed snapshot of what’s going on in the database. Turn on the innodb_status_output_locks option and that snapshot tells you exactly which locks each transaction currently holds and is waiting for.

To log out the report at a given point in a Rails app, run:

ActiveRecord::Base.connection.execute("SHOW ENGINE INNODB STATUS").each{|row| puts row}

To turn on innodb_status_output_locks, start MySQL with --innodb_status_output_locks. If you run MySQL using docker-compose, add --innodb_status_output_locks to the service command. Or you can turn the option on at runtime, after MySQL has already started, by executing SET GLOBAL innodb_status_output_locks = ON in a MySQL session.

2. Replicating the scenario

I wanted to replicate the “rapid requests” scenario locally so I could dig into what locks were being requested and try to understand how they could result in deadlocks.

After a bit of trial and error, I had the following:
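The original spec isn’t reproduced here, so this is a hedged reconstruction of the scenario. DeviceCreationService and Device appear in the post; the call signature, params, and RSpec structure are assumptions. The key ingredients are turning off use_transactional_tests and running each request’s transaction in its own Thread:

```ruby
RSpec.describe DeviceCreationService do
  # Each Thread must run its own real database transaction, so the usual
  # wrap-every-example-in-a-transaction behaviour has to be disabled.
  self.use_transactional_tests = false

  it "handles two rapid requests for the same device" do
    device = Device.create!(token: "old token", platform: "ios")

    # Two "requests" in quick succession, each in its own thread
    # (and therefore its own connection and transaction).
    threads = 2.times.map do
      Thread.new do
        # Hypothetical service entry point; the real signature may differ.
        DeviceCreationService.call(
          device_id: device.id,
          token: "new token",
          subscriptions: %w[comment cooksnap_reminder]
        )
      end
    end
    threads.each(&:join)
  end
end
```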

This triggers two transactions in quick succession, each of which attempts to

  • update the existing device with a new token
  • delete the device’s existing push notification subscriptions
  • insert new push notification subscriptions for the device

3. Exploring “SHOW ENGINE INNODB STATUS”

I then executed “SHOW ENGINE INNODB STATUS” at various points in the DeviceCreationService.

The key section in the output is the list of TRANSACTIONS. When I ran it after the second transaction had begun, that section included:

---TRANSACTION 1133883, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 212, OS thread handle 140640384386816, query id 8100 172.19.0.1 root updating
UPDATE `devices` SET `devices`.`token` = 'new token', `devices`.`updated_at` = '2022-01-24 11:25:15.671074' WHERE `devices`.`id` = 54
Trx read view will not see trx with id >= 1133883, sees < 1133882
------- TRX HAS BEEN WAITING 0 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 34394 page no 3 n bits 72 index PRIMARY of table `global_test`.`devices` trx id 1133883 lock_mode X locks rec but not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 12; compact format; info bits 0

(Reading these reports takes a bit of getting used to 😬 😅, and it definitely helped to study the MySQL docs on InnoDB locking and on the InnoDB status output first…)

Note that transaction 1133883 is waiting for a RECORD LOCK for a PRIMARY INDEX on table global_test.devices. This means the transaction is waiting for a lock on a specific row in the devices table.

Further down we see this:

---TRANSACTION 1133882, ACTIVE 0 sec
4 lock struct(s), heap size 1136, 5 row lock(s), undo log entries 4
MySQL thread id 211, OS thread handle 140640384657152, query id 8101 172.19.0.1 root starting
SHOW ENGINE INNODB STATUS
Trx read view will not see trx with id >= 1133882, sees < 1133882
TABLE LOCK table `global_test`.`devices` trx id 1133882 lock mode IX
RECORD LOCKS space id 34394 page no 3 n bits 72 index PRIMARY of table `global_test`.`devices` trx id 1133882 lock_mode X locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 12; compact format; info bits 0

Transaction 1133882 already has a RECORD LOCK for a PRIMARY INDEX on table global_test.devices.

In fact, 1133882 holds exactly the lock the other transaction needs, which makes sense. Both transactions want to update the same device record. The first transaction gets a lock on the record so it can carry out the update, along with the other changes in the push_notification_subscriptions table. The second transaction has to wait for the first transaction to be committed, at which point the lock is released and the second transaction can carry on.

So here we see the second transaction waiting for a lock, but there is no deadlock. The first transaction can carry on, commit, and release the lock the second transaction needs.

I was getting some understanding of how to read the “SHOW ENGINE INNODB STATUS” report and what locks were needed as part of DeviceCreationService. But I didn’t understand how the rapid requests could result in a deadlock. If both transactions want to update the same device record, there is no chance of a deadlock – the second simply waits for the first transaction to complete.

4. Testing a Theory

I therefore wondered if the deadlock might happen when neither request has to update the device record itself, so a lock on the device record isn’t needed.

Prompted by this advice from a Stack Overflow answer —

keep the order in which you delete items consistent

I came up with a theory that might account for deadlocks when one of the transactions called DELETE WHERE push_notification_subscriptions.device_id = X.

I was imagining this sequence of events:

TRANSACTION 1: `DELETE WHERE push_notification_subscriptions.device_id = X`. Transaction holds lock on first record it wants to delete (record 1, say).

TRANSACTION 2: `DELETE WHERE push_notification_subscriptions.device_id = X`. Transaction holds lock on first record it wants to delete (record 2, say).

TRANSACTION 1: Attempts to get lock on the next record it wants to delete, which happens to be record 2. Must wait for Transaction 2 to finish in order for it to release its lock on record 2.

TRANSACTION 2: Attempts to get lock on the next record it wants to delete, which happens to be record 1. Must wait for Transaction 1 to finish in order for it to release its lock on record 1.

Deadlock!

This scenario relied upon different transactions processing the same records in a different order when they each carry out the same DELETE WHERE push_notification_subscriptions.device_id = X operation.

I updated my spec so neither transaction needed the initial lock on the device record:

But no deadlock happened.

I tried to force the DELETE WHERE push_notification_subscriptions.device_id = X to delete the rows in different orders by changing it to device.push_notification_subscriptions.order("rand()").delete_all. No deadlocks. I experimented with different fixed orders, but whatever I did, the output of “SHOW ENGINE INNODB STATUS” always showed that the same push_notification_subscriptions row was being locked first. The second transaction always had to wait for the cooksnap_reminder record to be unlocked:

---TRANSACTION 1133922, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 216, OS thread handle 140640384657152, query id 8257 172.19.0.1 root updating
DELETE FROM `push_notification_subscriptions` WHERE `push_notification_subscriptions`.`device_id` = 55
Trx read view will not see trx with id >= 1133922, sees < 1133921
------- TRX HAS BEEN WAITING 0 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 34420 page no 3 n bits 72 index PRIMARY of table `global_test`.`push_notification_subscriptions` trx id 1133922 lock_mode X waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 7; compact format; info bits 32
0: len 8; hex 8000000000000132; asc 2;;
1: len 6; hex 000000114d61; asc Ma;;
2: len 7; hex 3b000001e72c7f; asc ; , ;;
3: len 8; hex 8000000000000037; asc 7;;
4: len 17; hex 636f6f6b736e61705f72656d696e646572; asc cooksnap_reminder;;
5: len 8; hex 99abf0bdd9061322; asc ";;
6: len 8; hex 99abf0bdd9061322; asc ";;

In general, if two concurrent transactions need locks on the same records and they process those records in different orders, there will be a deadlock. But here it seemed that whatever order was given to the DELETE operation, each transaction first needed a lock on the same record. That means the second transaction has to wait for the first to commit and release the lock – but there isn’t a deadlock, because the first transaction can carry on and isn’t blocked by a lock already held by the second transaction.

5. Testing Another Theory

At this point, it occurred to me that a deadlock doesn’t only happen when transactions both need a lock on records in the same table. It can happen when two transactions need locks on records in different tables in different orders.

That got me wondering whether our two rapid requests might each start by locking records in different tables. And then I thought: what if the first request doesn’t need to update the device record but the second request does? What if the two requests aren’t duplicates?

I now had another theory to test.

TRANSACTION 1: doesn’t need to update the device because the existing device record is up-to-date with the supplied params. Goes straight to locking a subscription record for deletion.

TRANSACTION 2: does need to update the device because the supplied params are different from the existing attributes. Starts by locking the device row.

TRANSACTION 1: for some reason, now needs to get a lock on the device row. Cannot get the lock because it is held by TRANSACTION 2.

TRANSACTION 2: now wants to start deleting subscription records so needs a lock on the first subscription record. Cannot get the lock because it is held by TRANSACTION 1.

At this point, neither transaction can progress. They each need the other to finish in order to get the lock they are waiting for. It’s a deadlock!

I recalled — having just read through some old discussions about the same endpoint — that the rapid requests sent by the apps don’t necessarily contain the same params.

The app sends the /me/devices request on launch, then it seems that about 150–180ms after the FCM library explicitly tells us didReceiveRegistrationToken and we POST it again.

And, crucially, the firebase token can be different in the second request.

So a realistic scenario (simulated in the spec below) is that the first request includes the firebase token already set on the existing device record, then a follow-up request is sent to update that firebase token.
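The spec itself isn’t reproduced here, so this is a hedged sketch of it. The crucial difference from the earlier attempts is that the two requests are not duplicates: the first repeats the token already on the device, while the follow-up carries a new one (the service name and call signature are assumptions):

```ruby
device = Device.create!(token: "same token", platform: "ios")

t1 = Thread.new do
  # First request: the token is unchanged, so no UPDATE (and no lock)
  # on the devices row — the transaction goes straight to the
  # push_notification_subscriptions table.
  DeviceCreationService.call(
    device_id: device.id,
    token: "same token",
    subscriptions: %w[comment cooksnap_reminder]
  )
end

sleep 0.05  # make sure the first transaction starts first

t2 = Thread.new do
  # Follow-up request: the new token means an UPDATE, so this
  # transaction starts by locking the devices row.
  DeviceCreationService.call(
    device_id: device.id,
    token: "different token",
    subscriptions: %w[comment cooksnap_reminder]
  )
end

[t1, t2].each(&:join)
```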

And every time I ran this spec, I got a deadlock.

Now “SHOW ENGINE INNODB STATUS” was even more useful. If there has been a deadlock recently, the output includes a section called “LATEST DETECTED DEADLOCK”. So I executed “SHOW ENGINE INNODB STATUS” after the ensure, and it returned:

------------------------
LATEST DETECTED DEADLOCK
------------------------
2022-01-24 14:10:24 0x7fe9642b2700
*** (1) TRANSACTION:
TRANSACTION 1134002, ACTIVE 1 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 224, OS thread handle 140640384116480, query id 8565 172.19.0.1 root updating
DELETE FROM `push_notification_subscriptions` WHERE `push_notification_subscriptions`.`device_id` = 57
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 34420 page no 3 n bits 72 index PRIMARY of table `global_test`.`push_notification_subscriptions` trx id 1134002 lock_mode X waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 7; compact format; info bits 32
0: len 8; hex 800000000000013b; asc ;;;
1: len 6; hex 000000114db1; asc M ;;
2: len 7; hex 670000054921dd; asc g I! ;;
3: len 8; hex 8000000000000039; asc 9;;
4: len 17; hex 636f6f6b736e61705f72656d696e646572; asc cooksnap_reminder;;
5: len 8; hex 99abf0e29709c5fd; asc ;;
6: len 8; hex 99abf0e29709c5fd; asc ;;

*** (2) TRANSACTION:
TRANSACTION 1134001, ACTIVE 1 sec inserting
mysql tables in use 1, locked 1
5 lock struct(s), heap size 1136, 6 row lock(s), undo log entries 4
MySQL thread id 223, OS thread handle 140640384657152, query id 8566 172.19.0.1 root update
INSERT INTO `push_notification_subscriptions` (`public_alias`,`created_at`,`updated_at`,`device_id`) VALUES ('comment', '2022-01-24 14:10:23.556665', '2022-01-24 14:10:23.556665', 57) ON DUPLICATE KEY UPDATE `public_alias`=`public_alias`
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 34420 page no 3 n bits 72 index PRIMARY of table `global_test`.`push_notification_subscriptions` trx id 1134001 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;

Record lock, heap no 2 PHYSICAL RECORD: n_fields 7; compact format; info bits 32
0: len 8; hex 800000000000013b; asc ;;;
1: len 6; hex 000000114db1; asc M ;;
2: len 7; hex 670000054921dd; asc g I! ;;
3: len 8; hex 8000000000000039; asc 9;;
4: len 17; hex 636f6f6b736e61705f72656d696e646572; asc cooksnap_reminder;;
5: len 8; hex 99abf0e29709c5fd; asc ;;
6: len 8; hex 99abf0e29709c5fd; asc ;;

Record lock, heap no 3 PHYSICAL RECORD: n_fields 7; compact format; info bits 32
0: len 8; hex 800000000000013c; asc <;;
1: len 6; hex 000000114db1; asc M ;;
2: len 7; hex 67000005492213; asc g I" ;;
3: len 8; hex 8000000000000039; asc 9;;
4: len 20; hex 6d656e74696f6e65645f696e5f636f6d6d656e74; asc mentioned_in_comment;;
5: len 8; hex 99abf0e2970a95ef; asc ;;
6: len 8; hex 99abf0e2970a95ef; asc ;;

Record lock, heap no 4 PHYSICAL RECORD: n_fields 7; compact format; info bits 32
0: len 8; hex 800000000000013d; asc =;;
1: len 6; hex 000000114db1; asc M ;;
2: len 7; hex 67000005492249; asc g I"I;;
3: len 8; hex 8000000000000039; asc 9;;
4: len 18; hex 6d6f6465726174696f6e5f6d657373616765; asc moderation_message;;
5: len 8; hex 99abf0e2970b60c5; asc ` ;;
6: len 8; hex 99abf0e2970b60c5; asc ` ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 34394 page no 3 n bits 72 index PRIMARY of table `global_test`.`devices` trx id 1134001 lock mode S locks rec but not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 12; compact format; info bits 0
0: len 8; hex 8000000000000039; asc 9;;
1: len 6; hex 000000114db2; asc M ;;
2: len 7; hex 680000021414f6; asc h ;;
3: len 4; hex 80000039; asc 9;;
4: len 15; hex 646966666572656e7420746f6b656e; asc different token;;
5: len 3; hex 696f73; asc ios;;
6: len 8; hex 99abf0e297087e79; asc ~y;;
7: len 8; hex 99abf0e2970d59e7; asc Y ;;
8: SQL NULL;
9: SQL NULL;
10: SQL NULL;
11: SQL NULL;

*** WE ROLL BACK TRANSACTION (1)

This tells the story I predicted, but, confusingly, the first transaction (the one that began first) is called (2) in the report.

That transaction holds a lock on the cooksnap_reminder subscription record, because it went straight ahead to destroy the subscription records. It is waiting for a lock on the device record.

The other transaction (the second one to begin) holds a lock on the device record, because it started by updating that record. It is waiting for a lock on the cooksnap_reminder subscription record, because it wants to destroy that record.

But why does the first transaction need a lock on the device record if it didn’t actually need to update the device record? The output tells us it is stuck waiting for the lock at this operation:

INSERT INTO `push_notification_subscriptions` (`public_alias`,`created_at`,`updated_at`,`device_id`) VALUES ('comment', '2022-01-24 14:10:23.556665', '2022-01-24 14:10:23.556665', 57) ON DUPLICATE KEY UPDATE `public_alias`=`public_alias`

So inserting into the push_notification_subscriptions table requires a lock on the device record 🤯. But it isn’t waiting for an exclusive lock – it needs an S lock, or shared lock, which means the lock isn’t for updating or deleting the record. Then it dawned on me – the transaction is reading the device record to make sure that record exists, and continues to exist, while new push_notification_subscriptions records are inserted, because those new records refer to the device record. Sure enough, there is a foreign key constraint which means a subscription record can only be inserted if its device_id has a corresponding device record.

Another thing I realised: the transaction that errors is the one that starts by updating the device record. This is actually the one we want to succeed. So the result isn’t just a deadlock and failed request. The device ends up with an out-of-date token.

Avoiding the deadlock 👷

I then thought about how this deadlock could be avoided.

Option 1: Split up the transactions

The deadlock happens when the first transaction wants a lock on the device record that the second transaction is holding onto (and the second transaction wants a lock on a subscription record that the first transaction is holding onto). We can avoid this circular dependency by moving the device update operation into a separate transaction — this separate transaction can release the lock on the device without waiting for the first transaction to finish.

We’d end up with three transactions.

TRANSACTION 1: doesn’t need to update the device because the existing device record is up-to-date with the supplied params. Goes straight to locking a subscription record for deletion.

TRANSACTION 2: does need to update the device because the supplied params are different from the existing attributes. Locks the device row.

TRANSACTION 1: needs to get a lock on the device row in order to insert new subscription records. Cannot get the lock because it is held by TRANSACTION 2.

TRANSACTION 2: finishes updating the device, commits and releases its lock.

TRANSACTION 1: is free to get a lock on the device record, finishes inserting the new subscriptions and completes.

TRANSACTION 3: new transaction is opened to refresh subscription records. Wants a lock on the first subscription record to delete. Transaction 1 has released its locks so Transaction 3 can delete the persisted subscription records and insert new ones.

In code, this change amounts to removing the transaction that currently wraps the whole DeviceCreationService.

Instead of:

Device.transaction do
  if device.update(device_params)
    refresh_push_notification_subscriptions
  end
end

We can have:

if device.update(device_params)
  refresh_push_notification_subscriptions
end

And indeed, the spec no longer results in a deadlock.

The possible downside is that now if an error happens after updating the device, the device will still have been updated. It may be set up to receive push notifications, but not the correct types of notification.

Option 2: Ensure both transactions request the same lock first

Another solution is to stick with two transactions, but ensure the first begins by requesting a lock on the device record even though it doesn’t need to update that record.

Instead of:

Device.transaction do
  if device.update(device_params)
    refresh_push_notification_subscriptions
  end
end

We can have:

device.with_lock do
  if device.update(device_params)
    refresh_push_notification_subscriptions
  end
end

device.with_lock opens a transaction and does the following as an initial operation:

SELECT `devices`.* FROM `devices` WHERE `devices`.`id` = X LIMIT 1 FOR UPDATE

This SELECT … FOR UPDATE locks the row. If another concurrent transaction wants to acquire the same lock, it waits for the first transaction to finish.

The first transaction gets the lock on the device row first, and the second transaction (which first of all wants to update the device) cannot do anything until that lock is released. The second transaction waits for the first transaction to commit, then it can progress.

The important thing is that the second transaction doesn’t begin by getting some other lock that will block the first transaction in future. Both transactions begin by needing the same lock, which means the second waits without causing problems for the first.

And indeed, with this implementation, the spec no longer results in a deadlock, and the device ends up with the updated token.

Takeaways 🎁

1. Deadlocks don’t necessarily mean slow requests

Whenever I’d heard about “database deadlocks” in the past, I’d vaguely assumed they brought the database to a standstill and resulted in very slow requests. But when I was able to replicate a deadlock, the error happened as soon as the second transaction needed the lock it couldn’t get. There was no delay. With a deadlock, the database (or more precisely InnoDB, the storage engine) knows immediately that it’s impossible for the conflicting transactions to progress. So deadlocks don’t themselves slow requests down.

2. So what’s the problem?

Firstly, a deadlock can still indicate a slow transaction, because deadlocks are more likely when there are slow transactions. They only happen when transactions overlap, and that is more likely the longer each transaction takes. More deadlocks may indicate missing indexes, for example, or slow joins, or large transactions with lots of operations.

Secondly, a deadlock results in a transaction being rolled back, which can result in an attempted database change not being made, though this isn’t always the case — the transaction that does succeed may make the desired change anyway.

Thirdly, a deadlock is an error, and if that error is surfaced as a failed request to an API, it could prompt clients to retry requests, resulting in more traffic, which can have an effect on overall response times.

3. Avoiding deadlocks

In general, we can reduce the likelihood of deadlocks by trying to avoid slow transactions and by reducing the number of operations in transactions. The more a transaction does, the more likely a deadlock is, because the order of operations is more likely to be a problem: the longer the sequence of locks a transaction requires, the less likely that sequence is to be in the ‘right order’ (i.e. compatible with other transactions). It can be very important to combine operations in the same transaction — but when there isn’t a reason, we should keep in mind that there is a cost to doing so. (For some more general advice, see the section on minimising deadlocks in the MySQL docs.)

Understanding a specific deadlock means working out which transactions required the same locks in a different order. Fixing the problem can then mean splitting up the transactions into smaller parts, or reordering operations, or adding a lock to one or both to ensure they both start by requiring the same lock.
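One further mitigation worth knowing about: because InnoDB immediately rolls one of the transactions back, a deadlocked transaction is generally safe to reissue. As a minimal sketch (this helper is hypothetical, not from the investigation above), a bounded retry might look like this, with the error class injected so the example stays framework-agnostic — in Rails you would pass ActiveRecord::Deadlocked:

```ruby
# Hypothetical helper: retry the block a bounded number of times
# when the given deadlock error class is raised.
def with_deadlock_retry(error_class, attempts: 3)
  tries = 0
  begin
    yield
  rescue error_class
    tries += 1
    tries < attempts ? retry : raise
  end
end

# Usage with a stand-in error class:
class FakeDeadlock < StandardError; end

calls = 0
result = with_deadlock_retry(FakeDeadlock) do
  calls += 1
  raise FakeDeadlock if calls < 3  # fail twice, succeed on the third try
  :updated
end
# result => :updated, calls => 3
```

Retrying treats the symptom rather than the cause, so it pairs with — rather than replaces — fixing the lock ordering.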


