One fine morning, around 6 AM, I got a PagerDuty alert that our production systems were down. Some users could still intermittently reach the application, but most requests were failing with database errors. The application in question is a Golang microservice backed by a MySQL database, and the UI had become unresponsive because the backend could no longer serve API calls. This post is the story of how the problem was debugged and fixed, and why a seemingly simple UPDATE query caused so much harm.
Looking at the logs, it was clear that all our database calls had been failing since a release a day earlier. It was not clear why: the database's basic metrics (CPU, memory, etc.) all looked fine. After ruling out the obvious causes, we still had no idea what was wrong. Since live users were affected, we tried restarting the service, but we were out of luck: the system got back into the same state within seconds. So we decided to dig into the database itself.
After some debugging, we found that the database had approximately 380 stuck queries. Interestingly, the queries were nearly identical:
update agents set is_deleted = 0, last_activity_at = 1684463485031367053 where hostname = 'my-agent-1';
update agents set is_deleted = 0, last_activity_at = 1684555485031367111 where hostname = 'my-agent-2';
update agents set is_deleted = 0, last_activity_at = 1684477775031367223 where hostname = 'my-agent-3';
Such simple queries (simple only by looks, though) had been in a running state for hours. My first guess was that maybe the table had millions of records and the updates were therefore slow, but to my surprise the table only had around 1,500 records. Whaaaat??
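For context, here is a sketch of how such long-running statements can be spotted; the 60-second threshold is an arbitrary choice for illustration:

```sql
-- List statements that have been running for more than 60 seconds.
-- (information_schema.processlist; on MySQL 8.0 the
-- performance_schema.processlist table is the preferred source.)
SELECT id, time, state, info
FROM information_schema.processlist
WHERE command = 'Query' AND time > 60
ORDER BY time DESC;
```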
The next thing I did was check the InnoDB status output. Since a lot of queries were stuck, I suspected that transactions were blocking each other and that there might be a deadlock. SHOW ENGINE INNODB STATUS is a command that prints a lot of internal InnoDB performance counters, statistics, and information about transaction processing.
As I expected, after going through the output, I boiled it down to the following:
Table: agents
Primary key: (org_id, hostname, cluster_name)

Transaction 1:
  Query: UPDATE agents SET last_activity_at = 1684459174165685110, is_deleted = 0 WHERE hostname = 'my-agent-1'
  Holds a lock on the row with hostname = 'my-agent-2'
  Waiting for a lock on the row with hostname = 'my-agent-1'

Transaction 2:
  Query: UPDATE agents SET last_activity_at = 1684459021724497037, is_deleted = 0 WHERE hostname = 'my-agent-2'
  Holds a lock on the row with hostname = 'my-agent-1'
  Waiting for a lock on the row with hostname = 'my-agent-2'
It was surprising to see that the query updating the row with hostname my-agent-1 was holding a lock on the row with hostname my-agent-2, while the query updating the row with hostname my-agent-2 was holding a lock on my-agent-1, so each transaction was blocking the other: a deadlock. How on earth was that happening? Why would a query lock a row that it doesn't even need to update?
During the debugging, I also noticed something concerning. Here is the table structure:
CREATE TABLE `agents` (
  `org_id` varchar(64) NOT NULL DEFAULT '',
  `container_id` varchar(256) NOT NULL DEFAULT '',
  `last_activity_at` bigint(20) NOT NULL DEFAULT '0',
  `cluster_name` varchar(64) NOT NULL DEFAULT '',
  `hostname` varchar(100) NOT NULL,
  `is_deleted` tinyint(1) NOT NULL DEFAULT '0',
  `created_at` bigint(20) NOT NULL DEFAULT '0',
  PRIMARY KEY (`org_id`,`hostname`,`cluster_name`)
)
As is clear from the structure, the table has a composite primary key on (org_id, hostname, cluster_name), but our query's WHERE clause was only:

where hostname = 'my-agent-1';
Thus, our query was NOT using the composite index: MySQL can only use a multiple-column index when the query filters on a leftmost prefix of its columns, and hostname alone is not a leftmost prefix of (org_id, hostname, cluster_name). See the MySQL documentation on multiple-column indexes for details.
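The difference is visible in the query plan. The exact output shape varies by server version and data, so the comments below only indicate what we would expect to see:

```sql
-- Filtering on hostname alone cannot use the (org_id, hostname, cluster_name)
-- primary key, so we would expect a full scan (type: ALL or index).
EXPLAIN UPDATE agents SET is_deleted = 0
WHERE hostname = 'my-agent-1';

-- Filtering on the full key should show an indexed single-row lookup
-- (key: PRIMARY, with only a handful of examined rows).
EXPLAIN UPDATE agents SET is_deleted = 0
WHERE org_id = '123' AND hostname = 'my-agent-1' AND cluster_name = 'prod';
```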
This query was part of a huge stream of asynchronous events coming into the service. Because it was not using any index, each query had to scan many rows to find the one with the given hostname. Now here is a special thing about MySQL UPDATE queries which we usually ignore or have no idea of. From the MySQL documentation:

"UPDATE ... WHERE ... sets an exclusive next-key lock on every record the search encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row."
To our surprise, around 400 rows of our table were getting locked because we were not searching by an index. We normally use UPDATE queries in APIs and had never faced this problem, but in a concurrent system, UPDATE queries can end up locking many rows if indexes are not used properly.
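The surprising part of this behaviour can be reproduced with two database sessions; this is a sketch using the table and hostnames from this post:

```sql
-- Session 1: a non-indexed UPDATE scans the table and takes an exclusive
-- lock on every record it examines, not just the row it finally modifies.
BEGIN;
UPDATE agents SET is_deleted = 0 WHERE hostname = 'my-agent-1';

-- Session 2: this blocks until session 1 commits, even though it targets a
-- *different* row, because session 1 already locked the rows it scanned.
-- With many such sessions scanning concurrently, two of them can acquire
-- row locks in opposite orders and deadlock, as happened to us.
BEGIN;
UPDATE agents SET is_deleted = 0 WHERE hostname = 'my-agent-2';
```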
To fix the problem, we updated the query to use the full primary key, so it locks only one row. Here is the sample query after the fix:
update agents set is_deleted = 0, last_activity_at = 1684463485031367053 where org_id = '123' AND hostname = 'my-agent-1' AND cluster_name = 'prod';
Our database stopped responding to queries once the number of stuck queries reached approximately 400, because our maximum number of open database connections was set to 400. With no connections left, no queries could be served by the database, which caused an outage for the complete service.
I hope you either learned a thing or two or at least got a kick out of my misery. This issue was a pain to debug, but in hindsight, a quirky bug to blog about. Thanks for reading!