Friday, April 15, 2016

mysql not using indexes when “not in” statement is present

Leave a Comment

The table structure is:

CREATE TABLE `test` (   `id` int(10) unsigned NOT NULL AUTO_INCREMENT,   `from` int(10) unsigned NOT NULL,   `to` int(10) unsigned NOT NULL,   `message` text NOT NULL,   `sent` int(10) unsigned NOT NULL DEFAULT '0',   `read` tinyint(1) unsigned NOT NULL DEFAULT '0',   `direction` tinyint(1) unsigned NOT NULL DEFAULT '0',   PRIMARY KEY (`id`),   KEY `one` (`to`,`direction`,`from`,`id`),   KEY `two` (`from`,`direction`,`to`,`id`),   KEY `three` (`read`,`direction`,`to`),   KEY `four` (`read`,`direction`,`from`) ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8; 

I have a strange issue. Please have a look at the following query:

select test.id, test.from, test.to, test.message, test.sent, test.read, test.direction from test  where (      (test.to = 244975 and test.direction <> 2 and test.direction <> 3 and          (         (test.from = 204177 and test.id > 5341203) OR          (test.from = 214518 and test.id > 5336549) OR         (test.from = 231429 and test.id > 5338284) OR         (test.from = 242739 and test.id > 5339541) OR         (test.from = 243834 and test.id > 5340438) OR         (test.from = 244354 and test.id > 5337489) OR         (test.from = 244644 and test.id > 5338572) OR         (test.from = 244690 and test.id > 5338467)          )      )      or       (test.from = 244975 and test.direction <> 1 and test.direction <> 3 and          (         (test.to = 204177 and test.id > 5341203) OR         (test.to = 214518 and test.id > 5336549) OR         (test.to = 231429 and test.id > 5338284) OR         (test.to = 242739 and test.id > 5339541) OR         (test.to = 243834 and test.id > 5340438) OR         (test.to = 244354 and test.id > 5337489) OR         (test.to = 244644 and test.id > 5338572) OR         (test.to = 244690 and test.id > 5338467)         )     )      or       (test.read <> 1 and test.direction <> 3 and test.direction <> 2 and test.to = 244975  and test.from not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690)      )      or      (test.read <> 1 and test.direction = 2 and test.from = 244975 and test.to not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690)      )        )         order by test.id; 

If I do an explain on this query, it goes through all the rows:

1   SIMPLE  test    index   PRIMARY,one,two,three,four  PRIMARY 4       1440596 Using where 

If I remove both the "not in" statements, then it works fine:

select test.id, test.from, test.to, test.message, test.sent, test.read, test.direction from test  where (      (test.to = 244975 and test.direction <> 2 and test.direction <> 3 and          (         (test.from = 204177 and test.id > 5341203) OR          (test.from = 214518 and test.id > 5336549) OR         (test.from = 231429 and test.id > 5338284) OR         (test.from = 242739 and test.id > 5339541) OR         (test.from = 243834 and test.id > 5340438) OR         (test.from = 244354 and test.id > 5337489) OR         (test.from = 244644 and test.id > 5338572) OR         (test.from = 244690 and test.id > 5338467)          )      )      or       (test.from = 244975 and test.direction <> 1 and test.direction <> 3 and          (         (test.to = 204177 and test.id > 5341203) OR         (test.to = 214518 and test.id > 5336549) OR         (test.to = 231429 and test.id > 5338284) OR         (test.to = 242739 and test.id > 5339541) OR         (test.to = 243834 and test.id > 5340438) OR         (test.to = 244354 and test.id > 5337489) OR         (test.to = 244644 and test.id > 5338572) OR         (test.to = 244690 and test.id > 5338467)         )     )      or       (test.read <> 1 and test.direction <> 3 and test.direction <> 2 and test.to = 244975       )      or      (test.read <> 1 and test.direction = 2 and test.from = 244975       )        )         order by test.id; 

Now the explain query returns:

1   SIMPLE  test    index_merge PRIMARY,one,two,three,four  one,two 5,5     30  Using sort_union(one,two); Using where; Using filesort 

I am not sure why it does not work right. What am I missing in the indexes?

5 Answers

Answers 1

If your mySQL version less then 5.0.7, the mysql issue can be reason

Have a look this ticket in MySQL bug tracking https://bugs.mysql.com/bug.php?id=10561

Answers 2

Mixing AND and OR would often cause weird query plan on MySQL, in my experience. I don't have enough data on the table to test, but I would try rewriting your query using UNION ALL. After all, OR in WHERE is basically UNION.

The idea is to break it down on smaller conditions so MySQL can use different index optimized for each part as opposed to jamming all together.

SELECT * FROM ( SELECT     test.id, test.from, test.to, test.message, test.sent, test.read, test.direction FROM     test  WHERE     test.to = 244975     AND test.direction <> 2     AND test.direction <> 3     AND (         (test.from = 204177 AND test.id > 5341203) OR          (test.from = 214518 AND test.id > 5336549) OR         (test.from = 231429 AND test.id > 5338284) OR         (test.from = 242739 AND test.id > 5339541) OR         (test.from = 243834 AND test.id > 5340438) OR         (test.from = 244354 AND test.id > 5337489) OR         (test.from = 244644 AND test.id > 5338572) OR         (test.from = 244690 AND test.id > 5338467)      ) UNION ALL SELECT     test.id, test.from, test.to, test.message, test.sent, test.read, test.direction FROM     test  WHERE     test.from = 244975     AND test.direction <> 1     AND test.direction <> 3     AND (         (test.to = 204177 and test.id > 5341203) OR         (test.to = 214518 and test.id > 5336549) OR         (test.to = 231429 and test.id > 5338284) OR         (test.to = 242739 and test.id > 5339541) OR         (test.to = 243834 and test.id > 5340438) OR         (test.to = 244354 and test.id > 5337489) OR         (test.to = 244644 and test.id > 5338572) OR         (test.to = 244690 and test.id > 5338467)     ) UNION ALL SELECT     test.id, test.from, test.to, test.message, test.sent, test.read, test.direction FROM     test  WHERE     test.read <> 1     AND test.direction <> 3     AND test.direction <> 2     AND test.to = 244975     AND test.from NOT IN (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) UNION ALL SELECT     test.id, test.from, test.to, test.message, test.sent, test.read, test.direction FROM     test  WHERE     test.read <> 1     AND test.direction = 2     AND test.from = 244975     AND test.to NOT IN (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) ) test ORDER BY test.id 

Answers 3

It would be good to have a dump of sample data to test with, but I created some of my own anyway. Next I split each of the four outer OR conditions into sub-queries, UNIONed them, and moved the ordering to the final result set.

I've had issues with indexes when using complex WHERE clauses and to me it looks like you have a chat/messaging app and are attempting to get messages to and from a particular user in a single query. Personally, I'd split these into individual queries to simplify the code/queries.

Here is my query:

SELECT test.id, test.from, test.to, test.message, test.sent, test.read, test.direction FROM (   SELECT *   FROM test   WHERE test.to = 244975     AND test.direction not in (2,3)     AND (       (test.from = 204177 AND test.id > 5341203)       OR (test.from = 214518 AND test.id > 5336549)       OR (test.from = 231429 AND test.id > 5338284)       OR (test.from = 242739 AND test.id > 5339541)       OR (test.from = 243834 AND test.id > 5340438)       OR (test.from = 244354 AND test.id > 5337489)       OR (test.from = 244644 AND test.id > 5338572)       OR (test.from = 244690 AND test.id > 5338467)     )   UNION   SELECT *   FROM test   WHERE test.from = 244975     AND test.direction not in (1,3)     AND (       (test.to = 204177 AND test.id > 5341203)       OR (test.to = 214518 AND test.id > 5336549)       OR (test.to = 231429 AND test.id > 5338284)       OR (test.to = 242739 AND test.id > 5339541)       OR (test.to = 243834 AND test.id > 5340438)       OR (test.to = 244354 AND test.id > 5337489)       OR (test.to = 244644 AND test.id > 5338572)       OR (test.to = 244690 AND test.id > 5338467)     )   UNION   SELECT *   FROM test   WHERE test.read != 1     AND test.direction not in (2,3)     AND test.to = 244975     AND test.from not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690)   UNION   SELECT *   FROM test   WHERE test.read != 1     AND test.direction = 2     AND test.from = 244975     AND test.to not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) ) test ORDER BY test.id; 

Answers 4

This is probably due to the additional level of nesting / complexity the in condition on an additional column adds to your where clause.

The index merge union sort your 2nd query uses converts the where clause into a set of range conditions combined by OR.

Each value you compare using in counts as another range predicate, so adding the two inconditions with 8 values each to your first query adds 64 more predicates.

As the number of predicates goes up, at some point the optimizer decides it is faster to just scan the whole table.

Answers 5

I am not sure why it does not work right. What am I missing in the indexes?

I'm pretty sure the query planner is working perfectly, you're not missing anything in the indexes that would help in this case. The query planner decided that it would be faster to use a different index because the two queries are very different.

We can make the optimizer use a union of the indexes for us which will make it considerably faster. You can keep the not in and not change any of the or statements. I ran some basic benchmarks of the method I used against the union method. Caveats apply because your DB configuration may be vastly different than mine. Running the query a 1000 times and doing this 3 times I took the best time for each query...

Optimized above

real    0m15.410s user    0m6.681s sys 0m2.641s 

rewritten as a set of unions

real    0m17.747s user    0m6.798s sys 0m2.812s 

NOT IN vs IN

not in and in are very different. The difference between these in this case is access pattern, do I need the data temporarily or as part of the result set. When you use not in with a few keys and the index holds millions of keys it might need to read lots of records if the data is part of the result set. Even when using indexes not in can be reading millions of records from disk... in with a few keys and these are the keys you need is asking to find and use a small subset. The two access patterns are very different. The following example might help make this clear...

1. I don't want these 10 items from a 1,000,000 records I need the other 999,990, this reads the whole index. 2. I only want these 10 from a 1,000,000 records. This might only require one disk seek. 

Number 2 is faster because the of the access pattern ie I found the 10 I need, Nunmber 1. may need to read a million records.

MySQL's query optimizer is seeing this ie the last two OR statements are asking for large subsets of data from either the table or the index ie case 1. above. Seeing that and the fact that it needed to use the primary key anyway the optimizer decided it would be quicker to use the primary key.

When you remove the not in things change ie now the query planner can use the indexes because in the other two or clauses they're in effect get me the few from the many and it performs an index_merge on the two keys that share a to and a from column along with the id.

To see what I mean don't remove the 'not in' part of the query, change it to in to see what happens, on my machine the query plan changed to use a range index.

What follows is an acute example of what I'm describing above and how to take advantage of it ie how do we let the optimizer know we can work with less data.

Think like the optimizer and work with less data

The following SQL is several orders of magnitude faster in tests on a ~4 million row database. The key change is the following line

(select * from test where test.from_ in (244975, 204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) or test.to_ in (244975, 204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690)) as test  

This one line is vastly reducing the dataset that mysql needs to work on because we're using in instead of not in. This is the new query.

select SQL_NO_CACHE test.id, test.from_, test.to_, test.message, test.sent, test.read_, test.direction  from (select * from test where test.from_ in (244975, 204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) or test.to_ in (244975, 204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690)) as test  where (   (test.to_ = 244975 and test.direction <> 2 and test.direction <> 3 and test.from_ in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) and          (            (test.from_ = 204177 and test.id > 5341203) OR           (test.from_ = 214518 and test.id > 5336549) OR         (test.from_ = 231429 and test.id > 5338284) OR         (test.from_ = 242739 and test.id > 5339541) OR         (test.from_ = 243834 and test.id > 5340438) OR         (test.from_ = 244354 and test.id > 5337489) OR         (test.from_ = 244644 and test.id > 5338572) OR         (test.from_ = 244690 and test.id > 5338467)          )        )        or       (test.from_ = 244975 and test.direction <> 1 and test.direction <> 3 and test.to_ in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690) and          (            (test.to_ = 204177 and test.id > 5341203) OR         (test.to_ = 214518 and test.id > 5336549) OR         (test.to_ = 231429 and test.id > 5338284) OR         (test.to_ = 242739 and test.id > 5339541) OR         (test.to_ = 243834 and test.id > 5340438) OR         (test.to_ = 244354 and test.id > 5337489) OR         (test.to_ = 244644 and test.id > 5338572) OR         (test.to_ = 244690 and test.id > 5338467)         ))       or       (test.read_ <> 1 and test.direction <> 2 and test.direction <> 3 and test.to_ = 244975  and test.from_ not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690))     or       (test.read_ <> 1 and test.direction = 2 and test.from_ = 244975 and test.to_ not in (204177, 214518, 231429, 242739, 243834, 244354, 244644, 244690))      )         order by test.id; 

The explain plan for this looks very different...

mysql> \. sql_fixed.sql *************************** 1. row ***************************            id: 1   select_type: PRIMARY         table: <derived2>          type: ALL possible_keys: NULL           key: NULL       key_len: NULL           ref: NULL          rows: 226      filtered: 100.00         Extra: Using where; Using filesort *************************** 2. row ***************************            id: 2   select_type: DERIVED         table: test          type: index_merge possible_keys: one,two           key: two,one       key_len: 4,4           ref: NULL          rows: 226      filtered: 100.00         Extra: Using sort_union(two,one); Using where 2 rows in set, 1 warning (0.01 sec) 

Note the rows column is now 266. The optimizer being smart immediately can see that it doesn't need most of the data because up front we've told it to use an IN statement with a few keys. Most query optimizers attach a high cost to disk accesses so anything that reduces this it typically preferred by the optimizer.

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment