Wednesday, December 6, 2017

Improve speed of complex Postgres query in Rails app


I have a view in my app that visualizes a lot of data, and in the backend the data is produced using this query:

DataPoint Load (20394.8ms)
SELECT communities.id as com,
       consumers.name as con,
       array_agg(timestamp ORDER BY data_points.timestamp asc) as tims,
       array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
FROM "data_points"
     INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id"
     INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
     INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
     INNER JOIN "clusterings" ON "clusterings"."id" = "communities"."clustering_id"
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
  AND "data_points"."interval_id" = $3
  AND "clusterings"."id" = 1
GROUP BY communities.id, consumers.id
[["timestamp", "2015-11-20 09:23:00"], ["timestamp", "2015-11-27 09:23:00"], ["interval_id", 2]]

The query takes about 20 seconds to execute, which seems a bit excessive.

The code for generating the query is this:

res = {}
DataPoint.joins(consumer: { communities: :clustering })
         .where('clusterings.id': self,
                timestamp: chart_cookies[:start_date]..chart_cookies[:end_date],
                interval_id: chart_cookies[:interval_id])
         .group('communities.id')
         .group('consumers.id')
         .select('communities.id as com, consumers.name as con',
                 'array_agg(timestamp ORDER BY data_points.timestamp asc) as tims',
                 'array_agg(consumption ORDER BY data_points.timestamp ASC) as cons')
         .each do |d|
  res[d.com] ||= {}
  res[d.com][d.con] = d.tims.zip(d.cons)
  res[d.com]["aggregate"] ||= d.tims.map { |t| [t, 0] }
  res[d.com]["aggregate"] = res[d.com]["aggregate"].zip(d.cons).map { |(a, b), d| [a, (b + d)] }
end
res

And the relevant database models are the following:

  create_table "data_points", force: :cascade do |t|     t.bigint "consumer_id"     t.bigint "interval_id"     t.datetime "timestamp"     t.float "consumption"     t.float "flexibility"     t.datetime "created_at", null: false     t.datetime "updated_at", null: false     t.index ["consumer_id"], name: "index_data_points_on_consumer_id"     t.index ["interval_id"], name: "index_data_points_on_interval_id"     t.index ["timestamp", "consumer_id", "interval_id"], name: "index_data_points_on_timestamp_and_consumer_id_and_interval_id", unique: true     t.index ["timestamp"], name: "index_data_points_on_timestamp"   end    create_table "consumers", force: :cascade do |t|     t.string "name"     t.string "location"     t.string "edms_id"     t.bigint "building_type_id"     t.bigint "connection_type_id"     t.float "location_x"     t.float "location_y"     t.string "feeder_id"     t.bigint "consumer_category_id"     t.datetime "created_at", null: false     t.datetime "updated_at", null: false     t.index ["building_type_id"], name: "index_consumers_on_building_type_id"     t.index ["connection_type_id"], name: "index_consumers_on_connection_type_id"     t.index ["consumer_category_id"], name: "index_consumers_on_consumer_category_id"   end    create_table "communities_consumers", id: false, force: :cascade do |t|     t.bigint "consumer_id", null: false     t.bigint "community_id", null: false     t.index ["community_id", "consumer_id"], name: "index_communities_consumers_on_community_id_and_consumer_id"     t.index ["consumer_id", "community_id"], name: "index_communities_consumers_on_consumer_id_and_community_id"   end    create_table "communities", force: :cascade do |t|     t.string "name"     t.text "description"     t.bigint "clustering_id"     t.datetime "created_at", null: false     t.datetime "updated_at", null: false     t.index ["clustering_id"], name: "index_communities_on_clustering_id"   end    create_table "clusterings", force: :cascade do |t|     t.string "name"     t.text "description"     t.datetime "created_at", null: false     t.datetime "updated_at", null: false   end 

How can I make the query execute faster? Is it possible to refactor the query to simplify it, or to add some extra index to the database schema so that it takes a shorter time?

Interestingly, a slightly simplified version of the query, which I use in another view, runs much faster, in only 1161.4ms for the first request and 41.6ms for the following requests:

DataPoint Load (1161.4ms)
SELECT consumers.name as con,
       array_agg(timestamp ORDER BY data_points.timestamp asc) as tims,
       array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
FROM "data_points"
     INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id"
     INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
     INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
  AND "data_points"."interval_id" = $3
  AND "communities"."id" = 100
GROUP BY communities.id, consumers.name
[["timestamp", "2015-11-20 09:23:00"], ["timestamp", "2015-11-27 09:23:00"], ["interval_id", 2]]

Running the query with EXPLAIN (ANALYZE, BUFFERS) in dbconsole gives the following output:

GroupAggregate  (cost=12.31..7440.69 rows=246 width=57) (actual time=44.139..20474.015 rows=296 loops=1)
  Group Key: communities.id, consumers.id
  Buffers: shared hit=159692 read=6148105 written=209
  ->  Nested Loop  (cost=12.31..7434.54 rows=246 width=57) (actual time=20.944..20436.806 rows=49728 loops=1)
        Buffers: shared hit=159685 read=6148105 written=209
        ->  Nested Loop  (cost=11.88..49.30 rows=1 width=49) (actual time=0.102..6.374 rows=296 loops=1)
              Buffers: shared hit=988 read=208
              ->  Nested Loop  (cost=11.73..41.12 rows=1 width=57) (actual time=0.084..4.443 rows=296 loops=1)
                    Buffers: shared hit=396 read=208
                    ->  Merge Join  (cost=11.58..40.78 rows=1 width=24) (actual time=0.075..1.365 rows=296 loops=1)
                          Merge Cond: (communities_consumers.community_id = communities.id)
                          Buffers: shared hit=5 read=7
                          ->  Index Only Scan using index_communities_consumers_on_community_id_and_consumer_id on communities_consumers  (cost=0.27..28.71 rows=296 width=16) (actual time=0.039..0.446 rows=296 loops=1)
                                Heap Fetches: 4
                                Buffers: shared hit=1 read=6
                          ->  Sort  (cost=11.31..11.31 rows=3 width=16) (actual time=0.034..0.213 rows=247 loops=1)
                                Sort Key: communities.id
                                Sort Method: quicksort  Memory: 25kB
                                Buffers: shared hit=4 read=1
                                ->  Bitmap Heap Scan on communities  (cost=4.17..11.28 rows=3 width=16) (actual time=0.026..0.027 rows=6 loops=1)
                                      Recheck Cond: (clustering_id = 1)
                                      Heap Blocks: exact=1
                                      Buffers: shared hit=4 read=1
                                      ->  Bitmap Index Scan on index_communities_on_clustering_id  (cost=0.00..4.17 rows=3 width=0) (actual time=0.020..0.020 rows=8 loops=1)
                                            Index Cond: (clustering_id = 1)
                                            Buffers: shared hit=3 read=1
                    ->  Index Scan using consumers_pkey on consumers  (cost=0.15..0.33 rows=1 width=33) (actual time=0.007..0.008 rows=1 loops=296)
                          Index Cond: (id = communities_consumers.consumer_id)
                          Buffers: shared hit=391 read=201
              ->  Index Only Scan using clusterings_pkey on clusterings  (cost=0.15..8.17 rows=1 width=8) (actual time=0.004..0.005 rows=1 loops=296)
                    Index Cond: (id = 1)
                    Heap Fetches: 296
                    Buffers: shared hit=592
        ->  Index Scan using index_data_points_on_consumer_id on data_points  (cost=0.44..7383.44 rows=180 width=24) (actual time=56.128..68.995 rows=168 loops=296)
              Index Cond: (consumer_id = consumers.id)
              Filter: (("timestamp" >= '2015-11-20 09:23:00'::timestamp without time zone) AND ("timestamp" <= '2015-11-27 09:23:00'::timestamp without time zone) AND (interval_id = 2))
              Rows Removed by Filter: 76610
              Buffers: shared hit=158697 read=6147897 written=209
Planning time: 1.811 ms
Execution time: 20474.330 ms
(40 rows)

The bullet gem returns the following warnings:

USE eager loading detected
  Community => [:communities_consumers]
  Add to your finder: :includes => [:communities_consumers]

USE eager loading detected
  Community => [:consumers]
  Add to your finder: :includes => [:consumers]

6 Answers

Answer 1

Did you try adding an index on:

"data_points".timestamp" + "data_points".consumer_id"

OR

data_points".consumer_id only ?

Answer 2

What version of Postgres are you using? In Postgres 10, they introduced native table partitioning. If your "data_points" table is very large, this may significantly speed up your query since you are looking at a time range:

WHERE (data_points.TIMESTAMP BETWEEN $1 AND $2)  

One strategy you can look into is to add partitioning on the DATE value of the "timestamp" field. Then modify your query to include an extra filter so the partitioning kicks in:

WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)     AND (CAST("data_points"."timestamp" AS DATE) BETWEEN CAST($1 AS DATE) AND CAST($2 AS DATE))    AND "data_points"."interval_id" = $3     AND "data_points"."interval_id" = $3     AND "communities"."clustering_id"  = 1  

If your "data_points" table is very large and your "Timestamp" filtering range is small, this should help, since it would quickly filter out blocks of rows that don't need to be processed.

I haven't done this in Postgres myself, so I'm not sure how feasible or helpful it is, but it's something to look into :)

https://www.postgresql.org/docs/10/static/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE
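
For illustration, a minimal sketch of declarative range partitioning on the timestamp column (Postgres 10+); the table and partition names are illustrative and the column list is abbreviated:

-- Illustrative partitioned variant of the table (column list abbreviated).
CREATE TABLE data_points_partitioned (
    consumer_id bigint,
    interval_id bigint,
    "timestamp" timestamp,
    consumption double precision,
    flexibility double precision
) PARTITION BY RANGE ("timestamp");

-- One partition per month; add further partitions as the data grows.
CREATE TABLE data_points_2015_11 PARTITION OF data_points_partitioned
    FOR VALUES FROM ('2015-11-01') TO ('2015-12-01');
CREATE TABLE data_points_2015_12 PARTITION OF data_points_partitioned
    FOR VALUES FROM ('2015-12-01') TO ('2016-01-01');

With the table partitioned this way, a range filter on the partition key lets the planner skip partitions that cannot contain matching rows.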

Answer 3

Do you have a foreign key on clustering_id? Also, try altering your condition like this:

SELECT communities.id as com,
       consumers.name as con,
       array_agg(timestamp ORDER BY data_points.timestamp asc) as tims,
       array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
FROM "data_points"
     INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id"
     INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
     INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
  AND "data_points"."interval_id" = $3
  AND "communities"."clustering_id" = 1
GROUP BY communities.id, consumers.id
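
If the foreign key is indeed missing, a sketch of adding it (the constraint name is illustrative):

-- Illustrative constraint name; adds the foreign key the answer asks about.
ALTER TABLE communities
    ADD CONSTRAINT fk_communities_clusterings
    FOREIGN KEY (clustering_id) REFERENCES clusterings (id);

Note that in Postgres a foreign key does not create an index on the referencing column by itself, so the existing index_communities_on_clustering_id is what matters for the clustering_id filter.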

Answer 4

  1. You don't need to join clusterings. Try removing that join from your query and filtering on communities.clustering_id = 1 instead. This should get rid of three steps in your query plan, and it should save you the most, since the plan currently does a few index scans on clusterings inside three nested loops.

  2. You can also try to tweak the way you aggregate timestamps. I assume you don't need to aggregate them at the level of seconds?

  3. I'd also remove the "index_data_points_on_timestamp" index, since you already have a composite index whose leading column is timestamp, which makes the single-column index practically useless. Dropping it should improve your write performance (see the sketch after this list).
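
A minimal sketch of dropping that redundant index (point 3 above); in a Rails app this would normally live in a migration:

-- Drops the single-column timestamp index, which the composite index already covers.
DROP INDEX index_data_points_on_timestamp;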

Answer 5

The index on data_points.timestamp is not being used, perhaps due to the ::timestamp conversion.

I wonder if altering the column datatype or creating a functional index would help.

EDIT - the datetime in your CREATE TABLE is how Rails chooses to display the Postgres timestamp data type, I guess, and so there may be no conversion taking place after all.

Nevertheless, the index on timestamp is not being used; depending on your data distribution, though, that could be a very smart choice by the optimizer.
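
Purely for illustration, a functional (expression) index of the kind mentioned above might look like this; the name is illustrative, and it would only help if the query filtered on the same expression:

-- Illustrative expression index on the date part of "timestamp".
CREATE INDEX index_data_points_on_timestamp_date
    ON data_points ((("timestamp")::date));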

Answer 6

So here we have Postgres 9.3 and a long query. Before touching the query itself, make sure your database settings are optimal and suited to your read/write ratio and disk type (SSD or spinning disk), that autovacuum has not been switched off, that your tables and indexes are not bloated, and that the indexes used to build plans have good selectivity.

Check the column types and how much of each row is actually filled; changing column types can also reduce the table size and the query time.

Once you have ensured all of this, think about how Postgres executes the query and how to reduce its time and effort. An ORM is good for simple queries, but for complicated queries you should execute raw SQL and keep it in query service objects.

Write queries as simply as possible; Postgres also spends time parsing them.

Check the indexes on all join fields, and use EXPLAIN ANALYZE to verify that you now get optimal scan methods.

Next point: you are doing 4 joins! Postgres has to search for an optimal query plan among roughly 4! (four factorial) join orders, so consider using subqueries or a predefined table for this selection.

1) Use a separate query or function for the 4 joins (try subqueries):

SELECT * FROM "data_points" as predefined INNER JOIN "consumers" ON "consumers"."id" ="data_points"."consumer_id"  INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"  INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"  INNER JOIN "clusterings" ON "clusterings"."id" "communities"."clustering_id"   WHERE "data_points"."interval_id" = 2  AND "clusterings"."id" = 1  

2) Then filter the predefined result (don't use bind variables, just pass the values):

SELECT *
FROM predefined
WHERE predefined."timestamp" BETWEEN '2015-11-20 09:23:00' AND '2015-11-27 09:23:00'

3) You refer to data_points three times in the query; you need fewer references:

array_agg(timestamp ORDER BY data_points.timestamp asc) as tims
array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
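
Putting the idea together, here is a hedged sketch of the two steps above as a single query with a CTE (the CTE name is illustrative):

-- "predefined" pre-selects the joined rows; the outer query filters and aggregates.
WITH predefined AS (
    SELECT data_points.consumer_id,
           data_points."timestamp",
           data_points.consumption,
           communities.id AS community_id,
           consumers.name AS consumer_name
    FROM data_points
         INNER JOIN consumers ON consumers.id = data_points.consumer_id
         INNER JOIN communities_consumers ON communities_consumers.consumer_id = consumers.id
         INNER JOIN communities ON communities.id = communities_consumers.community_id
    WHERE data_points.interval_id = 2
      AND communities.clustering_id = 1
)
SELECT community_id  AS com,
       consumer_name AS con,
       array_agg("timestamp" ORDER BY "timestamp" ASC) AS tims,
       array_agg(consumption ORDER BY "timestamp" ASC) AS cons
FROM predefined
WHERE "timestamp" BETWEEN '2015-11-20 09:23:00' AND '2015-11-27 09:23:00'
GROUP BY community_id, consumer_id, consumer_name;

Note that before Postgres 12 a CTE acts as an optimization fence, so this really does behave like the "predefined selection" described above; whether that helps or hurts depends on how selective the interval_id and clustering_id filters are.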

Remember that a long-running query is never only about the query itself; it is also about the settings, the ORM usage, the SQL, and how Postgres works with all of it.
