How to instruct SQLAlchemy ORM to execute multiple queries in parallel when loading relationships?

I am using SQLAlchemy's ORM. I have a model that has multiple many-to-many relationships:

User User <--MxN--> Organization User <--MxN--> School User <--MxN--> Credentials

I am implementing these using association tables, so there are also User_to_Organization, User_to_School and User_to_Credentials tables that I don't directly use.

Now, when I attempt to load a single User (using its PK identifier) and its relationships (and related models) using joined eager loading, I get horrible performance (15+ seconds). I assume this is due to this issue:

When multiple levels of depth are used with joined or subquery loading, loading collections-within- collections will multiply the total number of rows fetched in a cartesian fashion. Both forms of eager loading always join from the original parent class.

If I introduce another level or two to the hierarchy:

Organization <--1xN--> Project School <--1xN--> Course Project <--MxN--> Credentials Course <--MxN--> Credentials

The query takes 50+ seconds to complete, even though the total amount of records in each table is fairly small.

Using lazy loading, I am required to manually load each relationship, and there are multiple round trips to the server.

e.g. Operations, executed serially as queries:

Get user
Get user's Organizations
Get user's Schools
Get user's credentials
For each Organization, get its Projects
For each School, get its Courses
For each Project, get its Credentials
For each Course, get its Credentials

Still, it all finishes in less than 200ms.

I was wondering if there is anyway to indeed use lazy loading, but perform the relationship loading queries in parallel. For example, using the concurrent module, asyncio or by using gevent.

e.g. Step 1 (in parallel):

Get user
Get user's Organizations
Get user's Schools
Get user's credentials

Step 2 (in parallel):

For each Organization, get its Projects
For each School, get its Courses

Step 3 (in parallel):

For each Project, get its Credentials
For each Course, get its Credentials

Actually, at this point, making a subquery type load can also work, that is, return Organization and OrganizationID/Project/Credentials in two separate queries:

e.g. Step 1 (in parallel):

Get user
Get user's Organizations
Get user's Schools
Get user's credentials

Step 2 (in parallel):

Get Organizations
Get Schools
Get the Organizations' Projects, join with Credentials
Get the Schools' Courses, join with Credentials

2 Answers

Answers 1

The first thing you're going to want to do is check to see what queries are actually being executed on the db. I wouldn't assume that SQLAlchemy is doing what you expect unless you're very familiar with it. You can use echo=True on your engine configuration or look at some db logs (not sure how to do that with mysql).

You've mentioned that you're using different loading strategies so I guess you've read through the docs on that ( http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html). For what you're doing, I'd probably recommend subquery load, but it totally depends on the number of rows / columns you're dealing with. In my experience it's a good general starting point though.

One thing to note, you might need to something like:

db.query(Thing).options(subqueryload('A').subqueryload('B')).filter(Thing.id==x).first()

With filter.first rather that get, as the latter case won't re-execute queries according to your loading strategy if the primary object is already in the identity map.

Finally, I don't know your data - but those numbers sound pretty abysmal for anything short of a huge data set. Check that you have the correct indexes specified on all your tables.

You may have already been through all of this, but based on the information you've provided, it sounds like you need to do more work to narrow down your issue. Is it the db schema, or is it the queries SQLA is executing?

Either way, I'd say, "no" to running multiple queries on different connections. Any attempt to do that could result in inconsistent data coming back to your app, and if you think you've got issues now..... :-)

Answers 2

MySQL has no parallelism in a single connection. For the ORM to do such would require multiple connections to MySQL. Generally, the overhead of trying to do such is "not worth it".

To get a user, his Organizations, Schools, etc, can all be done (in mysql) via a single query:

SELECT user, organization, ...     FROM Users     JOIN Organizations ON ...     etc.

This is significantly more efficient than

SELECT user FROM ...; SELECT organization ... WHERE user = ...; etc.

(This is not "parallelism".)

Or maybe your "steps" are not quite 'right'?...

SELECT user, organization, project     FROM Users     JOIN Organizations ...     JOIN Projects ...

That gets, in a single step, all users, together with all their organizations and projects.

But is a "user" associated with a "project"? If not, then this is the wrong approach.

If the ORM is not providing a mechanism to generate queries like those, than it is "getting in the way".

Coding Question

Thursday, February 9, 2017

How to instruct SQLAlchemy ORM to execute multiple queries in parallel when loading relationships?

2 Answers

Answers 1

Answers 2

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment

Search

Popular Posts

Labels

Blog Archive

Find Us On Facebook