Concurrency and Avoiding Deadlocks

The_Polyglot
6 min read · Apr 27, 2022

The Tale

This lesson begins with a cautionary tale that some of you might relate to.

Our friend Joe has been spending every ounce of his free time building out what’s to be the next great app. He’s now on the brink of his first round of funding and must really home in on getting this POC finished.

When a user uploads a required asset, he wants it to be stored temporarily in a directory on his server with a database entry pointing to its location and file name. A cron job runs every ten minutes to sync each of these files to an AWS S3 bucket. After the S3 sync is successful, the job then truncates the table of all temporary uploads and waits for the next cron job to execute.
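As a rough sketch of what that tracking table might look like — the schema is an assumption for illustration only, and it’s reused in the examples below:

CREATE TABLE temp_uploads (
    id          BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id     BIGINT NOT NULL,
    file_name   VARCHAR(255) NOT NULL,
    file_path   VARCHAR(512) NOT NULL,
    uploaded_at DATETIME NOT NULL,
    synced      TINYINT(1) NOT NULL DEFAULT 0   -- assumed flag: set once the file reaches S3
) ENGINE=InnoDB;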

An S3 bucket is a file system in the cloud where you can store and access assets quickly. It makes for a much sounder approach than permanently keeping assets on local disk until you run out of space.

He originally used a delete statement to remove only the entries from the last 5 minutes, but found that once his user base grew and uploads began reaching 100k+ a day, the deletion became too slow to finish within its 5 minute window. Unlike a delete statement, which removes rows one at a time, truncate clears the table in a single, nearly instantaneous operation, so it seemed like the best fit.
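To make the difference concrete, here is a rough sketch of the two approaches, using the placeholder temp_uploads table from above (the cutoff is illustrative, not Joe’s actual query):

-- Original approach: row-level deletion of entries from the recent window.
DELETE FROM temp_uploads
WHERE uploaded_at < NOW() - INTERVAL 5 MINUTE;

-- New approach: wipe the whole table in one nearly instantaneous statement.
TRUNCATE TABLE temp_uploads;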

Joe submits his change to use truncate instead of delete and wakes up the next morning to find that he has two queries that have been executing since the night before! The two queries were:

1. The insert statement from when a user has uploaded their asset to temporary storage and needs a record of where it exists.

2. The truncate statement on the upload table that the job was to execute.
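Roughly speaking, the two stuck statements might have looked something like this — table and column names are guesses for illustration:

-- 1. The user-facing insert, run whenever an asset lands in temporary storage.
INSERT INTO temp_uploads (user_id, file_name, file_path, uploaded_at)
VALUES (42, 'avatar.png', '/tmp/uploads/avatar.png', NOW());

-- 2. The cron job's cleanup statement, run after a successful S3 sync.
TRUNCATE TABLE temp_uploads;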

Joe had never run into this scenario. He was able to kill the database process, but the users’ upload queries had already failed, leaving the system with no way of finding their uploaded assets. The scenario escalates as assets keep flooding in day and night and are lost, since the insert statements continue to clash with his job’s truncate statement.

Introducing Locks

Where Joe’s truncate and delete differ is in how they apply locks.

A lock can be applied at the database, table, or row level, with the intention of not allowing two statements to change the same data at the same time.

So if one statement locks a row in the users table to change a username at the same moment the user tries to update that same username, the second statement is placed in a queue until the first one has executed and its lock has been released.
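For example, assuming a simple users table, the second update below just sits in line until the first transaction commits and releases its row lock:

-- Session 1: takes a row-level lock on the user's row and holds it until COMMIT.
START TRANSACTION;
UPDATE users SET username = 'joe_the_builder' WHERE id = 42;

-- Session 2 (meanwhile): blocks on that same row until Session 1 finishes.
UPDATE users SET username = 'joe_2022' WHERE id = 42;

-- Session 1: releasing the lock lets Session 2's update proceed.
COMMIT;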

When the delete statement was being used, even though it was slower, it applied locks at the row level and only on the rows it was deleting, so it never collided with new inserts being made. The truncate statement places a lock on the entire table, preventing any other statements from running against it.

The job’s truncate statement was attempting to place a lock at the exact same time that the user’s insert query was attempting to place its own. When this happens, it’s called a deadlock, and it’s one of the biggest show stoppers you can come across in application development: when your tables are deadlocked, they are inaccessible to your entire user base until you manually clear the process ID for the database thread associated with it.
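In MySQL, that manual intervention usually means finding the stuck thread and killing it by hand (the id below is a placeholder):

-- List the running threads and the statement each one is stuck on.
SHOW FULL PROCESSLIST;

-- Kill the offending thread by its Id column.
KILL 123;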

A deadlock happens when two transactions wait indefinitely for each other to unlock data. Locks can be placed explicitly by a transaction or automatically by the database engine when performing operations such as delete and truncate. A deadlock that never resolves without admin intervention is also known as a deadly embrace.

[Figure: a visual representation of how Joe’s processes got caught in their deadlock]
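In SQL terms, the same deadly embrace can be reproduced with two transactions that lock rows in opposite order; the accounts table here is made up purely to show the pattern:

-- Session 1
START TRANSACTION;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- locks row 1

-- Session 2
START TRANSACTION;
UPDATE accounts SET balance = balance - 10 WHERE id = 2;  -- locks row 2

-- Session 1
UPDATE accounts SET balance = balance + 10 WHERE id = 2;  -- waits on Session 2's lock

-- Session 2
UPDATE accounts SET balance = balance + 10 WHERE id = 1;  -- waits on Session 1's lock: deadlock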

Where Joe Went Wrong

We can likely point out the bigger-picture mistakes Joe made, but as with any system flaw, we will start with smaller solutions and resort to larger ones as needed.

Keeping Transactions Small

Joe opted for truncation, which made sense at the time. This asset table would soon accumulate millions of entries, and clearing them one at a time with a delete statement would cause a world of other locking issues as jobs ran on top of each other. Unfortunately, truncate locks the entire table and is too risky an operation to run against a highly accessed table.

Transactions should be kept lean, without risky exclusive locks placed on heavily needed data or entire tables.

An exclusive lock means the resource is locked solely for one transaction’s access. No other read or update statements can be performed against the locked resource while the lock is held.
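One way to see an exclusive lock up close is SELECT ... FOR UPDATE, which holds an exclusive row lock for the life of the transaction (the table is again a placeholder):

START TRANSACTION;

-- Exclusively locks the matching row; other writers and locking reads
-- queue behind it until this transaction commits or rolls back.
SELECT * FROM temp_uploads WHERE id = 7 FOR UPDATE;

COMMIT;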

The solution here would be moving to more frequent, smaller deletion passes or handling uploads differently so that mass deletion is not needed. Deleting a million records one at a time is not ideal, but truncating every 5 minutes is even worse.
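A common middle ground is deleting in small batches so no single statement holds its locks for long. A rough sketch, where the synced flag (from the earlier schema sketch) and the batch size are assumptions:

-- Run on a short schedule; each pass removes a bounded chunk of already-synced
-- rows and releases its row locks quickly.
DELETE FROM temp_uploads
WHERE synced = 1
ORDER BY id
LIMIT 1000;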

Avoid Table-Wide Operations

If you are updating a significant number of rows, or in this case an entire table, you should consider the table’s use and how often it is exposed to user interaction. Now that you know the power of a table/row lock and how concurrency can go horribly wrong, you understand that locking up a table that a user or job frequently accesses is a death sentence for your processes.

In Joe’s case, the most straightforward solution would be to remove the job entirely and have his server upload assets directly to the S3 bucket. Bypassing the temporary directory and the batch upload removes all risk associated with the job. Of course there are tradeoffs: making a connection to S3 for every upload takes a little longer than dumping the asset in a directory that lives on the server. It’s up to the developer to weigh these tradeoffs and decide whether the app’s stability is worth the latency.

Other Remedies

Whether we change the job’s expectations or remove the job entirely, there are other remedies we can use so that a deadlock does not become the end-all.

innodb_lock_wait_timeout: setting this option in your database to, say, 30 seconds ensures that any statement stuck waiting on a lock for longer than 30 seconds is terminated and rolled back. The upside is that a deadlock ends quickly instead of shutting down your table until you notice and kill the process yourself. The downside is that any legitimate statement that waits longer than 30 seconds will now also be terminated, so adjust this setting with caution.
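In MySQL the setting can be adjusted at runtime, for example:

-- Default is 50 seconds; the global value applies to sessions started after the change.
SET GLOBAL innodb_lock_wait_timeout = 30;

-- Or just for the current session.
SET SESSION innodb_lock_wait_timeout = 30;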

Using temporary tables: I would advise avoiding temp tables if you can and looking at your problem from a wider perspective, but if you find yourself in a bind where you absolutely must do a full-table update or truncation, they are a way to maintain read access for other processes while you make expensive updates. Before your process begins an expensive update that would lock the table up, likely causing other client queries to be queued or deadlocked, create an empty clone of your table. Your large insertion is then performed against the clone, leaving the original table completely open and lock-free for client processes to access.
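A sketch of that idea in MySQL, with placeholder table names; the final rename is one common way to swap the finished clone into place once the heavy work is done:

-- 1. Create an empty clone with the same structure as the original.
CREATE TABLE temp_uploads_new LIKE temp_uploads;

-- 2. Do the expensive insertion against the clone; readers keep hitting temp_uploads.
INSERT INTO temp_uploads_new (user_id, file_name, file_path, uploaded_at)
SELECT user_id, file_name, file_path, uploaded_at
FROM staging_uploads;   -- staging_uploads is a hypothetical source of rows

-- 3. Swap the tables with an atomic rename, then drop the old copy.
RENAME TABLE temp_uploads TO temp_uploads_old,
             temp_uploads_new TO temp_uploads;
DROP TABLE temp_uploads_old;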

Conclusion

These are just a few remedies and suggestions that could steer our next data-reliant application away from concurrency issues. At the end of the day, concurrency starts at system design and how we plan to avoid overlap, but understanding how and why these overlaps occur feeds back into our initial planning. To understand locks better, I suggest creating a few tables and experimenting with concurrent inserts, reads, truncates, and deletions. Take the time to understand locks at the table level compared to the row level and why some locks are riskier than others.
