Saturday, December 23, 2017

A SQL query searching for rows that satisfy Column1 <= X <= Column2 is very slow


I am using a MySQL DB, and have the following table:

CREATE TABLE SomeTable (
  PrimaryKeyCol BIGINT(20) NOT NULL,
  A BIGINT(20) NOT NULL,
  FirstX INT(11) NOT NULL,
  LastX INT(11) NOT NULL,
  P INT(11) NOT NULL,
  Y INT(11) NOT NULL,
  Z INT(11) NOT NULL,
  B BIGINT(20) DEFAULT NULL,
  PRIMARY KEY (PrimaryKeyCol),
  UNIQUE KEY FirstLastXPriority_Index (FirstX, LastX, P)
) ENGINE=InnoDB;

The table contains 4.3 million rows, and never changes once initialized.

The important columns of this table are FirstX, LastX, Y, Z and P.

As you can see, I have a unique index on the columns FirstX, LastX and P.

The columns FirstX and LastX define a range of integers.

The query I need to run on this table fetches for a given X all the rows having FirstX <= X <= LastX (i.e. all the rows whose range contains the input number X).

For example, if the table contains the rows (I'm including only the relevant columns):

FirstX     LastX      P        Y         Z
------     ------     -       ---       ---
100000     500000     1       111       222
150000     220000     2       333       444
180000     190000     3       555       666
550000     660000     4       777       888
700000     900000     5       999       111
750000     850000     6       222       333

and I need, for example, the rows that contain the value 185000, the first 3 rows should be returned.

The query I tried, which should be using the index, is:

SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10; 

Even without the LIMIT, this query should return a small number of records (less than 50) for any given X.

This query was executed by a Java application for 120000 values of X. To my surprise, it took over 10 hours (!) and the average time per query was 0.3 seconds.

This is not acceptable, not even near acceptable. It should be much faster.

I examined a single query that took 0.563 seconds to make sure the index was being used. The query I tried (the same as the query above with a specific integer value instead of ?) returned 2 rows.

I used EXPLAIN to find out what was happening:

id               1
select_type      SIMPLE
table            SomeTable
type             range
possible_keys    FirstLastXPriority_Index
key              FirstLastXPriority_Index
key_len          4
ref              NULL
rows             2104820
Extra            Using index condition

As you can see, the execution involved 2104820 rows (nearly 50% of the rows of the table), even though only 2 rows satisfy the conditions, so half of the index is examined in order to return just 2 rows.

Is there something wrong with the query or the index? Can you suggest an improvement to the query or the index?

EDIT:

Some answers suggested that I run the query in batches for multiple values of X. I can't do that, since I run this query in real time, as inputs arrive to my application. Each time an input X arrives, I must execute the query for X and perform some processing on the output of the query.

11 Answers

Answer 1

I found a solution that relies on properties of the data in the table. I would rather have a more general solution that doesn't depend on the current data, but for the time being that's the best I have.

The problem with the original query:

SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10; 

is that the execution may require scanning a large percentage of the entries in the FirstX,LastX,P index when the first condition FirstX <= ? is satisfied by a large percentage of the rows.

What I did to reduce the execution time is observe that LastX-FirstX is relatively small.

I ran the query:

SELECT MAX(LastX-FirstX) FROM SomeTable; 

and got 4200000.

This means that FirstX >= LastX - 4200000 for all the rows in the table.

So in order to satisfy LastX >= ?, we must also satisfy FirstX >= ? - 4200000.

So we can add a condition to the query as follows:

SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND FirstX >= ? - 4200000 AND LastX >= ? LIMIT 10; 

In the example I tested in the question, the number of index entries processed was reduced from 2104820 to 18 and the running time was reduced from 0.563 seconds to 0.0003 seconds.

I tested the new query with the same 120000 values of X. The output was identical to the old query. The time went down from over 10 hours to 5.5 minutes, which is over 100 times faster.

Answer 2

WHERE col1 < ... AND ... < col2 is virtually impossible to optimize.

Any useful query will involve a "range" on either col1 or col2. Two ranges (on two different columns) cannot be used in a single INDEX.

Therefore, any index you try has the risk of checking a lot of the table: INDEX(col1, ...) will scan from the start to where col1 hits .... Similarly for col2 and scanning until the end.

To add to your woes, the ranges are overlapping. So, you can't pull a fast one and add ORDER BY ... LIMIT 1 to stop quickly. And if you say LIMIT 10, but there are only 9, it won't stop until the start/end of the table.

One simple thing you can do (but it won't speed things up by much) is to swap the PRIMARY KEY and the UNIQUE. This could help because InnoDB "clusters" the PK with the data.

If the ranges did not overlap, I would point you at http://mysql.rjweb.org/doc.php/ipranges .

So, what can be done?? How "even" and "small" are the ranges? If they are reasonably 'nice', then the following would take some code, but should be a lot faster. (In your example, 100000 500000 is pretty ugly, as you will see in a minute.)

Define buckets to be, say, floor(number/100). Then build a table that correlates buckets and ranges. Samples:

FirstX  LastX   Bucket
123411  123488  1234
222222  222444  2222
222222  222444  2223
222222  222444  2224
222411  222477  2224

Notice how some ranges 'belong' to multiple buckets.

Then, the search is first on the bucket(s) in the query, then on the details. Looking for X=222433 would find two rows with bucket=2224, then decide that both are OK. But for X=222466, two rows have the bucket, but only one matches with firstX and lastX.

WHERE bucket = FLOOR(X/100)
  AND firstX <= X
  AND X <= lastX

with

INDEX(bucket, firstX) 

But... with 100000 500000, there would be 4001 rows because this range is in that many 'buckets'.
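
A minimal sketch of how such a bucket table might be built, assuming MySQL 8.0+ for the recursive CTE (the table, column and index names here are hypothetical):

-- Hypothetical bucket table: one row per (range, bucket) pair.
CREATE TABLE RangeBuckets (
  bucket INT NOT NULL,
  firstX INT NOT NULL,
  lastX  INT NOT NULL,
  P      INT NOT NULL,
  INDEX (bucket, firstX)
) ENGINE=InnoDB;

-- Expand every range into the buckets it touches. Very wide ranges may need
-- cte_max_recursion_depth (default 1000) raised for this to run.
INSERT INTO RangeBuckets (bucket, firstX, lastX, P)
WITH RECURSIVE expanded AS (
  SELECT P, FirstX, LastX, FLOOR(FirstX / 100) AS bucket FROM SomeTable
  UNION ALL
  SELECT P, FirstX, LastX, bucket + 1 FROM expanded
  WHERE bucket + 1 <= FLOOR(LastX / 100)
)
SELECT bucket, FirstX, LastX, P FROM expanded;

The WHERE clause shown above then runs against RangeBuckets, joining back to SomeTable on P to pick up Y and Z.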

Plan B (to tackle the wide ranges)

Segregate the ranges into wide and narrow. Do the wide ranges by a simple table scan, do the narrow ranges via my bucket method. UNION ALL the results together. Hopefully the "wide" table would be much smaller than the "narrow" table.
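
A rough sketch of what the Plan B query could look like, assuming the rows were split at load time into two hypothetical tables, WideRanges (the few long ranges) and NarrowRanges (bucketed as above):

(SELECT P, Y, Z
 FROM WideRanges                     -- small table, so a full scan is cheap
 WHERE FirstX <= ? AND ? <= LastX)
UNION ALL
(SELECT P, Y, Z
 FROM NarrowRanges                   -- served by INDEX(bucket, FirstX)
 WHERE bucket = FLOOR(? / 100)
   AND FirstX <= ? AND ? <= LastX)
LIMIT 10;

Placed after the last parenthesized SELECT, the LIMIT applies to the combined result.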

Answer 3

You need to add another index on LastX.

The unique index FirstLastXPriority_Index (FirstX,LastX,P) represents the concatenation of these values, so it will be useless with the 'AND LastX >= ?' part of your WHERE clause.
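
For reference, such an index could be added like this (the index name is arbitrary):

CREATE INDEX LastX_Index ON SomeTable (LastX);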

Answer 4

It seems that the only way to make the query fast is to reduce the number of fetched and compared fields. Here is the idea.

We can declare a new indexed field (for instance UNSIGNED BIGINT) and store both values FirstX and LastX in it, using an offset for one of the fields.

For example:

FirstX     LastX      CombinedX
100000     500000     100000500000
150000     220000     150000220000
180000     190000     180000190000
550000     660000     550000660000
70000      90000      070000090000
75         85         000075000085

An alternative is to declare the field as DECIMAL and store FirstX + LastX / MAX(LastX) in it. Later, look for the values satisfying the conditions by comparing against the single field CombinedX.

APPENDED

Then you can fetch the rows by checking only one field, with something like param1 = 160000:

SELECT * FROM new_table WHERE (CombinedX <= 160000*1000000 + 999999) AND (CombinedX % 1000000 >= 160000);

Here I assume that FirstX < LastX for all rows. Of course, you can calculate param1*offset in advance and store it in a variable against which the further comparisons will be done. You can also consider bitwise shifts instead of decimal offsets; decimal offsets were chosen because they are easier for a human to read in the sample.
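
A minimal sketch of the decimal-offset variant, assuming both FirstX and LastX fit into six decimal digits (the column and index names are hypothetical):

-- Pack both bounds into one number: CombinedX = FirstX * 1000000 + LastX.
ALTER TABLE SomeTable ADD COLUMN CombinedX BIGINT UNSIGNED NOT NULL DEFAULT 0;
UPDATE SomeTable SET CombinedX = FirstX * 1000000 + LastX;
CREATE INDEX CombinedX_Index ON SomeTable (CombinedX);

-- For X = 160000: FirstX <= X becomes a range condition on the packed column,
-- and LastX >= X becomes a modulo test on its low six digits.
SELECT P, Y, Z
FROM SomeTable
WHERE CombinedX < (160000 + 1) * 1000000
  AND CombinedX % 1000000 >= 160000;

Note that the B-tree index can only narrow rows by the first (range) condition; the modulo test still has to be evaluated against every row that passes it.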

Answer 5

Eran, I believe the solution you found yourself is the best in terms of minimum cost. It is normal to take the distribution properties of the data in the DB into account during the optimization process. Moreover, in large systems it is usually impossible to achieve satisfactory performance if the nature of the data is not taken into account.

However, this solution also has drawbacks, and the need to change the configuration parameter with every data change is the least of them. More important may be the following. Let's suppose that one day a very large range appears in the table. For example, let its length cover half of all possible values. I do not know the nature of your data, so I cannot know for certain whether such a range can ever appear; this is just an assumption. From the point of view of the result, it's okay: it just means that roughly every second query will now return one more record. But even one such interval will completely kill your optimization, because the condition FirstX <= ? AND FirstX >= ? - [MAX(LastX - FirstX)] will no longer effectively cut off enough records.

Therefore, if you cannot be sure that such long ranges will never appear, I would suggest keeping the same idea but approaching it from the other side. I propose that, when loading new data into the table, you break all long ranges into smaller ones whose length does not exceed a certain value. You wrote that the important columns of this table are FirstX, LastX, Y, Z and P. So you can choose some number N once, and every time you load data into the table, replace any range with LastX - FirstX > N by several rows:

FirstX;      FirstX + N
FirstX + N;  FirstX + 2N
...
FirstX + kN; LastX

and for each row keep the same values of Y, Z and P.
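
A rough sketch of how the split rows could be generated at load time, assuming MySQL 8.0+ for the recursive CTE and an arbitrary N of 10000 (the primary key for the new rows would need to be regenerated separately):

-- Produce one row per piece of length at most N = 10000 for every range;
-- each piece keeps the original P, Y and Z.
WITH RECURSIVE pieces AS (
  SELECT P, Y, Z, LastX,
         FirstX                       AS pieceFirst,
         LEAST(FirstX + 10000, LastX) AS pieceLast
  FROM SomeTable
  UNION ALL
  SELECT P, Y, Z, LastX,
         pieceLast,
         LEAST(pieceLast + 10000, LastX)
  FROM pieces
  WHERE pieceLast < LastX
)
SELECT P, Y, Z, pieceFirst AS FirstX, pieceLast AS LastX FROM pieces;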

For the data prepared that way, your query will always be the same:

SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND FirstX >= ? - N AND LastX >= ?

and will always be equally effective.

Now, how do you choose the best value for N? I would run some experiments with different values and see which works best. It is possible for the optimum to be less than the current maximum interval length of 4200000. At first this may be surprising, because decreasing N necessarily grows the table, so it can become much larger than 4.3 million rows. But in fact a large table is not a problem when your query uses the index well, and in this case, as N decreases, the index is used more and more efficiently.

Answer 6

Edit: Idea #2

Do you have control over the Java app? Because, honestly, 0.3 seconds for an index scan is not bad. Your problem is that you're trying to get a query, run 120,000 times, to have a reasonable end time.

If you do have control over the Java app, you could have it submit all the X values at once, so SQL doesn't have to do an index scan 120k times. Or you could even program the logic on the Java side, since it would be relatively easy to optimize there.

Original Idea:

Have you tried creating a Multiple-Column index?

The problem with having multiple indexes is that each index is only going to narrow it down to ~50% of the records - it has to then match those ~2 million rows of Index A against ~2 million rows of Index B.

Instead, if you get both columns in the same index, the SQL engine can first do a Seek operation to get to the start of the records, and then do a single Index Scan to get the list of records it needs. No matching one index against another.

I'd suggest not making this the Clustered Index, though. The reason for that? You're not expecting many results, so matching the Index Scan's results against the table isn't going to be time consuming. Instead, you want to make the Index as small as possible, so that the Index Scan goes as fast as possible. Clustered Indexes are the table - so a Clustered Index is going to have the same Scan speed as the table itself. Along the same lines, you probably don't want any fields other than FirstX and LastX in your index - make that Index as tiny as you can, so that the scan flies along.
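
In other words, something like this hypothetical narrow index (leaving the existing primary key as it is):

CREATE INDEX FirstLastX_Narrow ON SomeTable (FirstX, LastX);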

Finally, like you're doing now, you're going to need to clue the engine in that you're not expecting a large set of data back from the search - you want to make sure it's using that compact Index for its scan (instead of it saying, "Eh, I'd be better off just doing a full table scan.").

Answer 7

One way might be to partition the table into different ranges and then only query the partitions a given range fits into, making the amount it needs to check much smaller. This might not help overall, since the Java side may be slower, but it might put less stress on the database. There might also be a way to avoid querying the database so many times and use a more inclusive SQL statement (you might be able to send a list of values and have the SQL write the results to a different table).

Answer 8

Suppose you got the execution time down to 0.1 seconds. Would the resulting 3 hours and 20 minutes be acceptable?

The simple fact is that thousands of calls to the same query is incredibly inefficient. Quite aside from what the database has to endure, there is network traffic to think of, disk seek times and all kinds of processing overhead.

Supposing that you don't already have the 120,000 values for x in a table, that's where I would start. I would insert them into a table in batches of 500 or so at a time:

insert into xvalues (x)
select 14
union all select 18
union all select 42
/* and so on */

Then, change your query to join to xvalues.
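
A sketch of what that joined form might look like, assuming the X values have been loaded into a column x of the xvalues table:

SELECT xv.x, st.P, st.Y, st.Z
FROM xvalues xv
JOIN SomeTable st
  ON st.FirstX <= xv.x
 AND xv.x <= st.LastX;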

I reckon that optimisation alone will get your run-time down to minutes or seconds instead of hours (based on many such optimisations I have done through the years).

It also opens up the door for further optimisations. If the x values are likely to have at least some duplicates (say, at least 20% of values occur more than once) it may be worth investigating a solution where you only run the query for unique values and do the insert into SomeTable for every x with the matching value.

As a rule: anything you can do in bulk is likely to exponentially outperform anything you do row by row.

PS:

You referred to a query, but a stored procedure can also work with an input table. In some RDBMSs you can pass a table as a parameter. I don't think that works in MySQL, but you can create a temporary table that the calling code fills in and the stored procedure joins to, or a permanent table used in the same way. The major drawback of not using a temp table is that you may need to concern yourself with session management or discarding stale data. Only you will know if that is applicable to your case.
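
For example, a minimal sketch of the temporary-table variant (all names here are hypothetical):

-- Created and filled by the calling code, once per batch of X values:
CREATE TEMPORARY TABLE xvalues (x INT NOT NULL, PRIMARY KEY (x));

-- The stored procedure simply joins against it:
DELIMITER //
CREATE PROCEDURE lookup_ranges()
BEGIN
  SELECT xv.x, st.P, st.Y, st.Z
  FROM xvalues xv
  JOIN SomeTable st
    ON st.FirstX <= xv.x AND xv.x <= st.LastX;
END //
DELIMITER ;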

Answer 9

So, I don't have enough data to be sure of the run time. This will only work if column P is unique. In order to get two indexes working, I created two indexes and the following query...

Index A - FirstX, P, Y, Z
Index B - P, LastX
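
In SQL those might be created as follows (the index names are hypothetical, the table name is taken from the queries below):

CREATE INDEX IndexA ON asdf (FirstX, P, Y, Z);
CREATE INDEX IndexB ON asdf (P, LastX);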

This is the query

select A.P, A.Y, A.Z
from
    (select P, Y, Z from asdf A where A.firstx <= 185000) A
    join
    (select P from asdf A where A.LastX >= 185000) B
    ON A.P = B.P

For some reason this seemed faster than

select A.P, A.Y, A.Z
from asdf A
join asdf B on A.P = B.P
where A.firstx <= 185000
  and B.LastX >= 185000

Answer 10

Indexes will not help you in this scenario, except for a small percentage of all possible values of X. You have already figured this out.

Let's say, for example, that:

  • FirstX contains 100 distinct values from 1 to 100
  • LastX contains 100 distinct values from 1 to 142 and is greater than or equal to FirstX

And you have following indexes:

  1. FirstX, LastX, <covering fields>
  2. LastX, FirstX, <covering fields>

Now:

  • If X is 5 the clause FirstX <= 5 matches approximately 5% rows while LastX >= 5 matches approximately 96% rows. MySQL will use the first index.

  • If X is 135 the clause FirstX <= 135 matches approximately 100% rows while LastX >= 135 matches approximately 5% rows. MySQL will use the second index.

  • Any X between these two will cause MySQL to not use either index (I don't know the exact threshold but 5% worked in my tests).

Even if MySQL uses the index, there are just too many matches and it will most likely be used for covering instead of seeking.

Your solution is the best. What you are doing is defining upper and lower bound of "range" search:

WHERE FirstX <= 10000000
  AND FirstX >= 10000000 - 4200000
  AND ...

This will work even if you search FirstX for values in the middle. Having said that, you got lucky with 4200000 value, possibly because FirstX contains values up to 10x larger than this, no?


If it helps, you can do the following after loading the data:

ALTER TABLE testdata ADD COLUMN delta INT NOT NULL;
UPDATE testdata SET delta = LastX - FirstX;
ALTER TABLE testdata ADD INDEX delta (delta);

This makes selecting MAX(LastX - FirstX) easier.

Answer 11

To optimize this query:

SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10;

Here are 2 resources you can use:

  • descending indexes
  • spatial indexes

Descending indexes:

One option is to use an index that is descending on FirstX and ascending on LastX.

https://dev.mysql.com/doc/refman/8.0/en/descending-indexes.html

something like:

CREATE INDEX SomeIndex on SomeTable (FirstX DESC, LastX);

Conversely, you could create instead the index (LastX, FirstX DESC).

Spatial indexes:

Another option is to use a SPATIAL INDEX with (FirstX, LastX). If you think of FirstX and LastX as 2D spatial coordinates, then what your search does is select the points in a contiguous geographic area delimited by the lines FirstX <= X, LastX >= X, FirstX <= LastX and FirstX >= 0.

Here's a link on spatial indexes (not specific to MySQL, but with drawings):

https://docs.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview
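
For what it's worth, here is a rough sketch of the spatial variant in MySQL terms, assuming MySQL 8.0+ (InnoDB spatial indexes require a NOT NULL, SRID-restricted column); the column name is hypothetical:

-- Store each row as the point (FirstX, LastX).
ALTER TABLE SomeTable ADD COLUMN pt POINT SRID 0;
UPDATE SomeTable SET pt = POINT(FirstX, LastX);
ALTER TABLE SomeTable MODIFY pt POINT NOT NULL SRID 0;
ALTER TABLE SomeTable ADD SPATIAL INDEX (pt);

-- Matching rows for X = 185000 are the points inside the rectangle
-- FirstX in [0, X], LastX in [X, 2147483647].
SELECT P, Y, Z
FROM SomeTable
WHERE MBRContains(
        ST_GeomFromText('POLYGON((0 185000, 185000 185000, 185000 2147483647, 0 2147483647, 0 185000))'),
        pt)
LIMIT 10;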
