Tuesday, October 16, 2018

Improving Postgres performance on graph-like queries of multi-level self-joins (comparison to Neo4j)


One of the claims Neo4j makes in their marketing is that relational databases aren't good at doing multi-level self-join queries:

[image: table from the Neo4j book comparing relational-database and Neo4j response times on friends-of-friends queries at increasing depth]

I found the code repository corresponding to the book that claim is taken from, and translated it to Postgres:

CREATE TABLE t_user (
  id bigserial PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE t_user_friend (
  id bigserial PRIMARY KEY,
  user_1 bigint NOT NULL REFERENCES t_user,
  user_2 bigint NOT NULL REFERENCES t_user
);

CREATE INDEX idx_user_friend_user_1 ON t_user_friend (user_1);
CREATE INDEX idx_user_friend_user_2 ON t_user_friend (user_2);

/* Create 1M users, each getting a random 10-character name */
INSERT INTO t_user (id, name)
  SELECT x.id, substr(md5(random()::text), 1, 10)
  FROM generate_series(1, 1000000) AS x(id);

/* For each user, create 50 random friendships for a total of 50M friendship records */
INSERT INTO t_user_friend (user_1, user_2)
  SELECT g1.x AS user_1, (1 + (random() * 999999))::int AS user_2
  FROM generate_series(1, 1000000) AS g1(x),
       generate_series(1, 50) AS g2(y);
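(One housekeeping step worth doing after a bulk load of this size, so the planner has fresh statistics and the visibility map is set; my addition, not part of the original setup script:)

```sql
VACUUM ANALYZE t_user;
VACUUM ANALYZE t_user_friend;
```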

And these are the queries at various depths Neo4j is comparing against:

/* Depth 2 */
SELECT COUNT(DISTINCT f2.user_2) AS cnt
FROM t_user_friend f1
  INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
WHERE f1.user_1 = 1;

/* Depth 3 */
SELECT COUNT(DISTINCT f3.user_2) AS cnt
FROM t_user_friend f1
  INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
  INNER JOIN t_user_friend f3 ON f2.user_2 = f3.user_1
WHERE f1.user_1 = 1;

/* Depth 4 */
SELECT COUNT(DISTINCT f4.user_2) AS cnt
FROM t_user_friend f1
  INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
  INNER JOIN t_user_friend f3 ON f2.user_2 = f3.user_1
  INNER JOIN t_user_friend f4 ON f3.user_2 = f4.user_1
WHERE f1.user_1 = 1;

/* Depth 5 */
SELECT COUNT(DISTINCT f5.user_2) AS cnt
FROM t_user_friend f1
  INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
  INNER JOIN t_user_friend f3 ON f2.user_2 = f3.user_1
  INNER JOIN t_user_friend f4 ON f3.user_2 = f4.user_1
  INNER JOIN t_user_friend f5 ON f4.user_2 = f5.user_1
WHERE f1.user_1 = 1;

I was roughly able to reproduce the book's claimed results, getting these sorts of execution times against the 1M users, 50M friendships:

| Depth | Count(*) | Time (s) |
|-------|----------|----------|
| 2     | 2497     | 0.067    |
| 3     | 117301   | 0.118    |
| 4     | 997246   | 8.409    |
| 5     | 999999   | 214.56   |

(Here's an EXPLAIN ANALYZE of a depth 5 query)

My question is, is there a way to improve the performance of these queries to meet or exceed Neo4j's execution time of ~2s at depth level 5?

I tried with this recursive CTE:

WITH RECURSIVE chain(user_2, depth) AS (
  SELECT t.user_2, 1 AS depth
  FROM t_user_friend t
  WHERE t.user_1 = 1
UNION
  SELECT t.user_2, c.depth + 1 AS depth
  FROM t_user_friend t, chain c
  WHERE t.user_1 = c.user_2
    AND depth < 4
)
SELECT COUNT(*)
FROM (SELECT DISTINCT user_2 FROM chain) AS temp;

However, it's still pretty slow: depth 4 takes 5s and depth 5 takes 48s (EXPLAIN ANALYZE).

1 Answer

Answer 1

I'd like to note from the start that comparing relational and non-relational databases is not a like-for-like comparison.

It is likely that a non-relational database maintains some extra pre-calculated structures as the data is updated. This makes updates somewhat slower and requires more disk space. The pure relational schema you are using doesn't have anything extra, which keeps updates as fast as possible and disk usage to a minimum.

I'll focus on what could be done with the given schema.


First, I'd create a composite index:

CREATE INDEX idx_user_friend_user_12 ON t_user_friend (user_1, user_2); 

One such index should be enough.
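A quick way to check that the index is being used as intended (my sketch; actual plans will vary): the (user_1, user_2) index covers both join columns, so after a VACUUM has set the visibility map, the per-level lookups can become index-only scans.

```sql
VACUUM t_user_friend;  -- set the visibility map so index-only scans are possible

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(DISTINCT f2.user_2) AS cnt
FROM t_user_friend f1
  INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
WHERE f1.user_1 = 1;
```

Look for "Index Only Scan using idx_user_friend_user_12" nodes in the output; "Heap Fetches: 0" confirms the heap is not being touched.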

Then, we know that there are only 1M users in total, so the final result can't exceed 1M rows.

The 5-level query ends up joining 312.5M rows (50·50·50·50·50). This is far more than the maximum possible result, which means there are a lot of duplicates.
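The arithmetic behind that, spelled out (a sketch): if duplicates are removed after each level, no level's input can exceed the 1M distinct users, so each level scans at most ~50M candidate rows instead of the row count multiplying by ~50 per level.

```sql
-- Naive join: row count multiplies by ~50 per level
SELECT 50^5 AS rows_depth5;            -- 312500000 candidate paths

-- With per-level dedup, each level's input is capped at 1M users,
-- so each level scans at most ~1M * 50 candidate rows
SELECT 1000000 * 50 AS per_level_cap;  -- 50000000
```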

So, I'd try to materialize intermediate results and eliminate duplicates early in the process.

We know that Postgres materializes CTEs (unconditionally so before version 12), so I'd try to use that.

Something like this:

WITH CTE12 AS (
  SELECT DISTINCT f2.user_2
  FROM t_user_friend f1
    INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
  WHERE f1.user_1 = 1
),
CTE3 AS (
  SELECT DISTINCT f3.user_2
  FROM CTE12
    INNER JOIN t_user_friend f3 ON CTE12.user_2 = f3.user_1
),
CTE4 AS (
  SELECT DISTINCT f4.user_2
  FROM CTE3
    INNER JOIN t_user_friend f4 ON CTE3.user_2 = f4.user_1
)
SELECT COUNT(DISTINCT f5.user_2) AS cnt
FROM CTE4
  INNER JOIN t_user_friend f5 ON CTE4.user_2 = f5.user_1;

Most likely the SELECT DISTINCT steps would require sorts, which in turn would allow the planner to use merge joins.
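One caveat worth adding (not in the original answer): since PostgreSQL 12 the planner may inline plain CTEs rather than materialize them. If the approach above relies on materialization, say so explicitly with the MATERIALIZED keyword, e.g. for the first level:

```sql
WITH CTE12 AS MATERIALIZED (
  SELECT DISTINCT f2.user_2
  FROM t_user_friend f1
    INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
  WHERE f1.user_1 = 1
)
SELECT COUNT(*) FROM CTE12;
```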


As far as I can tell from the execution plan for the query above (https://explain.depesz.com/s/Sjov), Postgres is not smart enough and does some unnecessary sorts. It also uses a hash aggregate for some of the SELECT DISTINCT steps, which requires an extra sort.

So, the next attempt would be to use temporary tables with proper indexes for each step explicitly.

Also, I'd define the idx_user_friend_user_12 index as UNIQUE. It may give the optimizer an extra hint.
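A caveat on making that index unique: the test data is generated randomly, so some (user_1, user_2) pairs may repeat and the unique index would then fail to build. A sketch that deduplicates first (assuming the id column picks which duplicate to keep):

```sql
-- Remove duplicate friendship pairs, keeping the row with the lowest id
DELETE FROM t_user_friend a
USING t_user_friend b
WHERE a.user_1 = b.user_1
  AND a.user_2 = b.user_2
  AND a.id > b.id;

DROP INDEX IF EXISTS idx_user_friend_user_12;
CREATE UNIQUE INDEX idx_user_friend_user_12 ON t_user_friend (user_1, user_2);
```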

It would be interesting to see how the following performs.

CREATE TABLE temp12 (
  user_2 bigint NOT NULL PRIMARY KEY
);
CREATE TABLE temp3 (
  user_2 bigint NOT NULL PRIMARY KEY
);
CREATE TABLE temp4 (
  user_2 bigint NOT NULL PRIMARY KEY
);

INSERT INTO temp12 (user_2)
SELECT DISTINCT f2.user_2
FROM t_user_friend f1
  INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
WHERE f1.user_1 = 1;

INSERT INTO temp3 (user_2)
SELECT DISTINCT f3.user_2
FROM temp12
  INNER JOIN t_user_friend f3 ON temp12.user_2 = f3.user_1;

INSERT INTO temp4 (user_2)
SELECT DISTINCT f4.user_2
FROM temp3
  INNER JOIN t_user_friend f4 ON temp3.user_2 = f4.user_1;

SELECT COUNT(DISTINCT f5.user_2) AS cnt
FROM temp4
  INNER JOIN t_user_friend f5 ON temp4.user_2 = f5.user_1;

DROP TABLE temp12;
DROP TABLE temp3;
DROP TABLE temp4;

As an added bonus of explicit temp tables, you can measure how much time each extra level takes.
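In psql, \timing on reports elapsed time per statement, which covers this. Alternatively, a DO block can report a step's duration itself; a sketch for one level, using clock_timestamp() (which, unlike now(), advances within a transaction):

```sql
DO $$
DECLARE
  t0 timestamptz := clock_timestamp();
BEGIN
  INSERT INTO temp3 (user_2)
  SELECT DISTINCT f3.user_2
  FROM temp12
    INNER JOIN t_user_friend f3 ON temp12.user_2 = f3.user_1;
  RAISE NOTICE 'level 3 took %', clock_timestamp() - t0;
END $$;
```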
