This post is based on a true story from one of our team members.
Here at EverSQL, we have a high appreciation for the MySQL database.
That said, every product has its pros and cons, and even some bugs here and there.
So let's dive into how we used some "bad practices" to work around a potential bug in MySQL's optimizer.
TL;DR: This case study describes a scenario in which the MySQL optimizer does not use an existing and seemingly suitable index to optimize a search query, which eventually resulted in a response time of 80 seconds for a simple page in a web application.
The Problem
The scenario from a business perspective: the development team was implementing a web page that presents data about each user's actions over the last year.
The scenario from a technical database perspective: the team was joining two tables (the same issue occurs with more than two) while applying a range condition to a table that is not the first table in the optimizer's execution plan. If this isn't entirely clear, look at the query and the EXPLAIN plan below and then come back to this sentence.
The tables involved in the scenario are users and user_actions, whose purposes are self-explanatory from their names and structure.
The users table contains 209 users, and the user_actions table holds 2,200,000,000 (yes, 2.2 billion) records.
CREATE TABLE users (
  id INT(10) NOT NULL,
  user_name VARCHAR(40) NOT NULL,
  status VARCHAR(10) NOT NULL,
  PRIMARY KEY (id),
  KEY status (status)
);

CREATE TABLE user_actions (
  id INT(10) NOT NULL AUTO_INCREMENT,
  user_id INT(10) NOT NULL,
  date DATE NOT NULL,
  PRIMARY KEY (id),
  KEY user_date (user_id, date)
);
Using the following simplified SQL query, we tried to obtain data about active users and their actions since the beginning of 2017.

SELECT *
FROM users u
JOIN user_actions ua ON u.id = ua.user_id
WHERE ua.date > '2017-01-01'
  AND u.status = 'Active';
The execution time of this SQL query was 80 seconds, which is, of course, unacceptable in any modern application for any user-initiated action (other than, perhaps, report generation, which isn't the case here).
So we turned to the EXPLAIN plan, hoping to find salvation there:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: u
type: ref
possible_keys: PRIMARY,status
key: status
key_len: 32
ref: const
rows: 203
Extra: Using where
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: ua
type: ref
possible_keys: user_date
key: user_date
key_len: 4
ref: u.id
rows: 3837000
Extra: Using where; Using index
As you can see, MySQL chose to use the user_date index, but the key_len is only 4 bytes! The full key length should be 7 bytes: 4 bytes for the INT user_id column plus 3 bytes for the DATE column. In other words, only the user_id part of the composite index is being used, and the range condition on date is applied as a filter rather than as an index range scan.
As mentioned above, we witnessed this behavior in cases where you join two or more tables and use a range condition that is not applied to the first table (first as in the EXPLAIN output). In our example, we joined `users` and `user_actions` while using a range condition on the `date` column from `user_actions` (the second table in the execution plan).
As you can imagine, this behavior can have a large impact on performance.
Possible Workarounds
Please note that some of these workarounds are not ideal in terms of design and "awesomeness"; some would even consider them "bad practices" when used without a good reason.
The only reason we suggest using them in this case is to work around the potential bug in the MySQL optimizer.
Solving the problem using bad practice #1 (Run a query in a loop)
Add another condition on the user_id column (the column used in the ON clause) to the query:

SELECT *
FROM users u
JOIN user_actions ua ON u.id = ua.user_id
WHERE ua.date > '2017-01-01'
  AND u.status = 'Active'
  AND u.id = 10;
Now, let's look at the explain plan again:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: u
type: const
possible_keys: PRIMARY,status
key: PRIMARY
key_len: 4
ref: const
rows: 1
Extra:
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: ua
type: ref
possible_keys: user_date
key: user_date
key_len: 7
ref: const
rows: 404090
Extra: Using where; Using index
In our case, this solution wasn't sufficient on its own, as we needed the data for all user_ids, not just one. Building on this workaround, we executed the query for each user_id in a loop in the application code. This may sound expensive and like a bad practice, but in our specific case the overall performance was significantly better than with the original query (which doesn't use the user_date index fully). Please note that if the number of users were significantly larger, this might not be the outcome.
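The per-user loop described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 as a stand-in for a MySQL connection (in production you would use a MySQL driver instead); the sample data and variable names are ours, not from the original application:

```python
import sqlite3

# In-memory SQLite stands in for the MySQL connection in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT, status TEXT);
    CREATE TABLE user_actions (id INTEGER PRIMARY KEY, user_id INTEGER, date TEXT);
    INSERT INTO users VALUES (1, 'alice', 'Active'), (2, 'bob', 'Disabled');
    INSERT INTO user_actions VALUES
        (1, 1, '2017-03-01'), (2, 1, '2016-05-01'), (3, 2, '2017-04-01');
""")

# Step 1: fetch the (small) list of active user ids.
active_ids = [row[0] for row in
              conn.execute("SELECT id FROM users WHERE status = 'Active'")]

# Step 2: run the range query once per user id, so each execution hits the
# (user_id, date) composite index with a constant user_id and a range on date.
rows = []
for uid in active_ids:
    rows.extend(conn.execute(
        "SELECT user_id, date FROM user_actions "
        "WHERE user_id = ? AND date > '2017-01-01'", (uid,)))

print(rows)  # actions in 2017+ for active users only
```

The trade-off is one round trip per user; this pays off only when the user list is small, as it was in our case.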
Solving the problem using bad practice #2 (Cartesian product)
Create a new dates table and populate it with all valid and relevant dates, from the minimum date relevant to your application all the way up to the maximum possible date.

CREATE TABLE `dates` (
  `date` date NOT NULL,
  PRIMARY KEY (`date`)
);
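Populating the dates table is a one-time task. Here is a minimal sketch in Python (again using sqlite3 as a stand-in for a MySQL connection; the date bounds shown are an assumption, pick whatever range fits your application):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dates (date TEXT NOT NULL PRIMARY KEY)")

# Generate one row per calendar day between the chosen bounds (inclusive).
start, end = date(2016, 1, 1), date(2018, 12, 31)
days = [(start + timedelta(days=i)).isoformat()
        for i in range((end - start).days + 1)]
conn.executemany("INSERT INTO dates (date) VALUES (?)", [(d,) for d in days])

count = conn.execute("SELECT COUNT(*) FROM dates").fetchone()[0]
print(count)  # 366 (leap 2016) + 365 + 365 = 1096
```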
Then join to it in the query:
SELECT STRAIGHT_JOIN *
FROM users u
JOIN dates d
JOIN user_actions ua ON u.id = ua.user_id AND d.date = ua.date
WHERE d.date > '2017-01-01'
  AND u.status = 'Active';
Let's look into the explain plan of the updated query:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: u
type: ref
possible_keys: PRIMARY,status
key: status
key_len: 32
ref: const
rows: 203
Extra: Using where
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: d
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 3
ref: NULL
rows: 12305
Extra: Using where; Using index; Using join buffer
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: ua
type: ref
possible_keys: user_date
key: user_date
key_len: 7
ref: u.id,d.date
rows: 30420
Extra: Using index
You've probably noticed that this query uses a Cartesian product of `users` and `dates`. I know, it doesn't look good at all. BUT, in our case (and in similar ones we've seen), this workaround improved execution time by orders of magnitude: the query's response time dropped from 80 seconds to 0.04 seconds, a 2,000x improvement!
* The STRAIGHT_JOIN hint is used to steer MySQL toward the right execution plan; after all, a Cartesian product is not usually the optimizer's first choice.
Please note that there are already several bug reports on this (e.g., bug 8569, bug 19548), but as far as we know, no fix has been scheduled or released yet.
Conclusion
Remember, we do not recommend the approaches above for every query. Use them wisely, and only for the specific use case we described above: joining multiple tables while using a range condition. And again: test, test, and test again before deciding which approach to go with.
Did you run into a similar issue and can't find a way around it? Tell us about your specific use case in the comments and we'll do our best to assist!
Did you find other valid workarounds? Great! We'd be more than glad to hear about them!