The table structure is:
CREATE TABLE `test` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `from` int(10) unsigned NOT NULL, `to` int(10) unsigned NOT NULL, `message` text NOT NULL, `sent` int(10) unsigned NOT NULL DEFAULT '0', `read` tinyint(1) unsigned NOT NULL DEFAULT '0', `direction` tinyint(1) unsigned NOT NULL DEFAULT '0', PRIMARY KEY (`id`), KEY `one` (`to`,`direction`,`from`,`id`), KEY `two` (`from`,`direction`,`to`,`id`), KEY `three` (`read`,`direction`,`to`), KEY `four` (`read`,`direction`,`from`) ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
I have a strange issue. Please have a look at the following query:
select test.id, test.from, test.to, test.message, test.sent, test.read, test.direction from test where ( (test.to = 244975 and test.direction <> 2 and test.direction <> 3 and ( (test.from = 204177 and test.id > 5341203) OR (test.from = 214518 and test.id > 5336549) OR (test.from = 231429 and test.id > 5338284) OR (test.from = 242739 and test.id > 5339541) OR (test.from = 243834 and test.id > 5340438) OR (test.from = 244354 and test.id > 5337489) OR (test.from = 244644 and test.id > 5338572) OR (test.from = 244690 and test.id > 5338467) ) ) or (test.from = 244975 and test.direction <> 1 and test.direction <> 3 and ( (test.to = 204177 and test.id > 5341203) OR (test.to = 214518 and test.id > 5336549) OR (test.to = 231429 and test.id > 5338284) OR (test.to = 242739 and test.id > 5339541) OR (test.to = 243834 and test.id > 5340438) OR (test.to = 244354 and test.id > 5337489) OR (test.to = 244644 and test.id > 5338572) OR (test.to = 244690 and test.id > 5338467) ) ) or (test.read <> 1 and test.direction <> 3 and test.direction <> 2 and test.to = 244975 and test.from not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) ) or (test.read <> 1 and test.direction = 2 and test.from = 244975 and test.to not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) ) ) order by test.id;
If I do an explain on this query, it goes through all the rows:
1 SIMPLE test index PRIMARY,one,two,three,four PRIMARY 4 1440596 Using where
If I remove both the "not in" statements, then it works fine:
select test.id, test.from, test.to, test.message, test.sent, test.read, test.direction from test where ( (test.to = 244975 and test.direction <> 2 and test.direction <> 3 and ( (test.from = 204177 and test.id > 5341203) OR (test.from = 214518 and test.id > 5336549) OR (test.from = 231429 and test.id > 5338284) OR (test.from = 242739 and test.id > 5339541) OR (test.from = 243834 and test.id > 5340438) OR (test.from = 244354 and test.id > 5337489) OR (test.from = 244644 and test.id > 5338572) OR (test.from = 244690 and test.id > 5338467) ) ) or (test.from = 244975 and test.direction <> 1 and test.direction <> 3 and ( (test.to = 204177 and test.id > 5341203) OR (test.to = 214518 and test.id > 5336549) OR (test.to = 231429 and test.id > 5338284) OR (test.to = 242739 and test.id > 5339541) OR (test.to = 243834 and test.id > 5340438) OR (test.to = 244354 and test.id > 5337489) OR (test.to = 244644 and test.id > 5338572) OR (test.to = 244690 and test.id > 5338467) ) ) or (test.read <> 1 and test.direction <> 3 and test.direction <> 2 and test.to = 244975 ) or (test.read <> 1 and test.direction = 2 and test.from = 244975 ) ) order by test.id;
Now the explain query returns:
1 SIMPLE test index_merge PRIMARY,one,two,three,four one,two 5,5 30 Using sort_union(one,two); Using where; Using filesort
I am not sure why it does not work right. What am I missing in the indexes?
5 Answers
Answers 1
If your mySQL version less then 5.0.7, the mysql issue can be reason
Have a look this ticket in MySQL bug tracking https://bugs.mysql.com/bug.php?id=10561
Answers 2
Mixing AND
and OR
would often cause weird query plan on MySQL, in my experience. I don't have enough data on the table to test, but I would try rewriting your query using UNION ALL
. After all, OR
in WHERE
is basically UNION
.
The idea is to break it down on smaller conditions so MySQL can use different index optimized for each part as opposed to jamming all together.
SELECT * FROM ( SELECT test.id, test.from, test.to, test.message, test.sent, test.read, test.direction FROM test WHERE test.to = 244975 AND test.direction <> 2 AND test.direction <> 3 AND ( (test.from = 204177 AND test.id > 5341203) OR (test.from = 214518 AND test.id > 5336549) OR (test.from = 231429 AND test.id > 5338284) OR (test.from = 242739 AND test.id > 5339541) OR (test.from = 243834 AND test.id > 5340438) OR (test.from = 244354 AND test.id > 5337489) OR (test.from = 244644 AND test.id > 5338572) OR (test.from = 244690 AND test.id > 5338467) ) UNION ALL SELECT test.id, test.from, test.to, test.message, test.sent, test.read, test.direction FROM test WHERE test.from = 244975 AND test.direction <> 1 AND test.direction <> 3 AND ( (test.to = 204177 and test.id > 5341203) OR (test.to = 214518 and test.id > 5336549) OR (test.to = 231429 and test.id > 5338284) OR (test.to = 242739 and test.id > 5339541) OR (test.to = 243834 and test.id > 5340438) OR (test.to = 244354 and test.id > 5337489) OR (test.to = 244644 and test.id > 5338572) OR (test.to = 244690 and test.id > 5338467) ) UNION ALL SELECT test.id, test.from, test.to, test.message, test.sent, test.read, test.direction FROM test WHERE test.read <> 1 AND test.direction <> 3 AND test.direction <> 2 AND test.to = 244975 AND test.from NOT IN (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) UNION ALL SELECT test.id, test.from, test.to, test.message, test.sent, test.read, test.direction FROM test WHERE test.read <> 1 AND test.direction = 2 AND test.from = 244975 AND test.to NOT IN (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) ) test ORDER BY test.id
Answers 3
It would be good to have a dump of sample data to test with, but I created some of my own anyway. Next I split each of the four outer OR conditions into sub-queries, UNIONed them, and moved the ordering to the final result set.
I've had issues with indexes when using complex WHERE clauses and to me it looks like you have a chat/messaging app and are attempting to get messages to and from a particular user in a single query. Personally, I'd split these into individual queries to simplify the code/queries.
Here is my query:
SELECT test.id, test.from, test.to, test.message, test.sent, test.read, test.direction FROM ( SELECT * FROM test WHERE test.to = 244975 AND test.direction not in (2,3) AND ( (test.from = 204177 AND test.id > 5341203) OR (test.from = 214518 AND test.id > 5336549) OR (test.from = 231429 AND test.id > 5338284) OR (test.from = 242739 AND test.id > 5339541) OR (test.from = 243834 AND test.id > 5340438) OR (test.from = 244354 AND test.id > 5337489) OR (test.from = 244644 AND test.id > 5338572) OR (test.from = 244690 AND test.id > 5338467) ) UNION SELECT * FROM test WHERE test.from = 244975 AND test.direction not in (1,3) AND ( (test.to = 204177 AND test.id > 5341203) OR (test.to = 214518 AND test.id > 5336549) OR (test.to = 231429 AND test.id > 5338284) OR (test.to = 242739 AND test.id > 5339541) OR (test.to = 243834 AND test.id > 5340438) OR (test.to = 244354 AND test.id > 5337489) OR (test.to = 244644 AND test.id > 5338572) OR (test.to = 244690 AND test.id > 5338467) ) UNION SELECT * FROM test WHERE test.read != 1 AND test.direction not in (2,3) AND test.to = 244975 AND test.from not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) UNION SELECT * FROM test WHERE test.read != 1 AND test.direction = 2 AND test.from = 244975 AND test.to not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) ) test ORDER BY test.id;
Answers 4
This is probably due to the additional level of nesting / complexity the in
condition on an additional column adds to your where clause.
The index merge union sort your 2nd query uses converts the where clause into a set of range conditions combined by OR
.
Each value you compare using in
counts as another range predicate, so adding the two in
conditions with 8 values each to your first query adds 64 more predicates.
As the number of predicates goes up, at some point the optimizer decides it is faster to just scan the whole table.
Answers 5
I am not sure why it does not work right. What am I missing in the indexes?
I'm pretty sure the query planner is working perfectly, you're not missing anything in the indexes that would help in this case. The query planner decided that it would be faster to use a different index because the two queries are very different.
We can make the optimizer use a union of the indexes for us which will make it considerably faster. You can keep the not in
and not change any of the or
statements. I ran some basic benchmarks of the method I used against the union method. Caveats apply because your DB configuration may be vastly different than mine. Running the query a 1000 times and doing this 3 times I took the best time for each query...
Optimized above
real 0m15.410s user 0m6.681s sys 0m2.641s
rewritten as a set of unions
real 0m17.747s user 0m6.798s sys 0m2.812s
NOT IN vs IN
not in
and in
are very different. The difference between these in this case is access pattern, do I need the data temporarily or as part of the result set. When you use not in
with a few keys and the index holds millions of keys it might need to read lots of records if the data is part of the result set. Even when using indexes not in
can be reading millions of records from disk... in
with a few keys and these are the keys you need is asking to find and use a small subset. The two access patterns are very different. The following example might help make this clear...
1. I don't want these 10 items from a 1,000,000 records I need the other 999,990, this reads the whole index. 2. I only want these 10 from a 1,000,000 records. This might only require one disk seek.
Number 2 is faster because the of the access pattern ie I found the 10 I need, Nunmber 1. may need to read a million records.
MySQL's query optimizer is seeing this ie the last two OR statements are asking for large subsets of data from either the table or the index ie case 1. above. Seeing that and the fact that it needed to use the primary key anyway the optimizer decided it would be quicker to use the primary key.
When you remove the not in
things change ie now the query planner can use the indexes because in the other two or
clauses they're in effect get me the few from the many
and it performs an index_merge on the two keys that share a to
and a from
column along with the id
.
To see what I mean don't remove the 'not in' part of the query, change it to in
to see what happens, on my machine the query plan changed to use a range index.
What follows is an acute example of what I'm describing above and how to take advantage of it ie how do we let the optimizer know we can work with less data.
Think like the optimizer and work with less data
The following SQL is several orders of magnitude faster in tests on a ~4 million row database. The key change is the following line
(select * from test where test.from_ in (244975, 204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) or test.to_ in (244975, 204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690)) as test
This one line is vastly reducing the dataset that mysql needs to work on because we're using in
instead of not in
. This is the new query.
select SQL_NO_CACHE test.id, test.from_, test.to_, test.message, test.sent, test.read_, test.direction from (select * from test where test.from_ in (244975, 204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) or test.to_ in (244975, 204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690)) as test where ( (test.to_ = 244975 and test.direction <> 2 and test.direction <> 3 and test.from_ in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) and ( (test.from_ = 204177 and test.id > 5341203) OR (test.from_ = 214518 and test.id > 5336549) OR (test.from_ = 231429 and test.id > 5338284) OR (test.from_ = 242739 and test.id > 5339541) OR (test.from_ = 243834 and test.id > 5340438) OR (test.from_ = 244354 and test.id > 5337489) OR (test.from_ = 244644 and test.id > 5338572) OR (test.from_ = 244690 and test.id > 5338467) ) ) or (test.from_ = 244975 and test.direction <> 1 and test.direction <> 3 and test.to_ in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) and ( (test.to_ = 204177 and test.id > 5341203) OR (test.to_ = 214518 and test.id > 5336549) OR (test.to_ = 231429 and test.id > 5338284) OR (test.to_ = 242739 and test.id > 5339541) OR (test.to_ = 243834 and test.id > 5340438) OR (test.to_ = 244354 and test.id > 5337489) OR (test.to_ = 244644 and test.id > 5338572) OR (test.to_ = 244690 and test.id > 5338467) )) or (test.read_ <> 1 and test.direction <> 2 and test.direction <> 3 and test.to_ = 244975 and test.from_ not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690)) or (test.read_ <> 1 and test.direction = 2 and test.from_ = 244975 and test.to_ not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690)) ) order by test.id;
The explain plan for this looks very different...
mysql> \. sql_fixed.sql *************************** 1. row *************************** id: 1 select_type: PRIMARY table: <derived2> type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 226 filtered: 100.00 Extra: Using where; Using filesort *************************** 2. row *************************** id: 2 select_type: DERIVED table: test type: index_merge possible_keys: one,two key: two,one key_len: 4,4 ref: NULL rows: 226 filtered: 100.00 Extra: Using sort_union(two,one); Using where 2 rows in set, 1 warning (0.01 sec)
Note the rows column is now 266. The optimizer being smart immediately can see that it doesn't need most of the data because up front we've told it to use an IN
statement with a few keys. Most query optimizers attach a high cost to disk accesses so anything that reduces this it typically preferred by the optimizer.
0 comments:
Post a Comment