I am using SQLAlchemy's ORM. I have a model that has multiple many-to-many relationships:
User User <--MxN--> Organization User <--MxN--> School User <--MxN--> Credentials
I am implementing these using association tables, so there are also User_to_Organization, User_to_School and User_to_Credentials tables that I don't directly use.
Now, when I attempt to load a single User (using its PK identifier) and its relationships (and related models) using joined eager loading, I get horrible performance (15+ seconds). I assume this is due to this issue:
When multiple levels of depth are used with joined or subquery loading, loading collections-within- collections will multiply the total number of rows fetched in a cartesian fashion. Both forms of eager loading always join from the original parent class.
If I introduce another level or two to the hierarchy:
Organization <--1xN--> Project School <--1xN--> Course Project <--MxN--> Credentials Course <--MxN--> Credentials
The query takes 50+ seconds to complete, even though the total amount of records in each table is fairly small.
Using lazy loading, I am required to manually load each relationship, and there are multiple round trips to the server.
e.g. Operations, executed serially as queries:
- Get user
- Get user's Organizations
- Get user's Schools
- Get user's credentials
- For each Organization, get its Projects
- For each School, get its Courses
- For each Project, get its Credentials
- For each Course, get its Credentials
Still, it all finishes in less than 200ms.
I was wondering if there is anyway to indeed use lazy loading, but perform the relationship loading queries in parallel. For example, using the concurrent
module, asyncio
or by using gevent
.
e.g. Step 1 (in parallel):
- Get user
- Get user's Organizations
- Get user's Schools
- Get user's credentials
Step 2 (in parallel):
- For each Organization, get its Projects
- For each School, get its Courses
Step 3 (in parallel):
- For each Project, get its Credentials
- For each Course, get its Credentials
Actually, at this point, making a subquery type load can also work, that is, return Organization and OrganizationID/Project/Credentials in two separate queries:
e.g. Step 1 (in parallel):
- Get user
- Get user's Organizations
- Get user's Schools
- Get user's credentials
Step 2 (in parallel):
- Get Organizations
- Get Schools
- Get the Organizations' Projects, join with Credentials
- Get the Schools' Courses, join with Credentials
2 Answers
Answers 1
The first thing you're going to want to do is check to see what queries are actually being executed on the db. I wouldn't assume that SQLAlchemy is doing what you expect unless you're very familiar with it. You can use echo=True
on your engine configuration or look at some db logs (not sure how to do that with mysql).
You've mentioned that you're using different loading strategies so I guess you've read through the docs on that ( http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html). For what you're doing, I'd probably recommend subquery load, but it totally depends on the number of rows / columns you're dealing with. In my experience it's a good general starting point though.
One thing to note, you might need to something like:
db.query(Thing).options(subqueryload('A').subqueryload('B')).filter(Thing.id==x).first()
With filter.first
rather that get
, as the latter case won't re-execute queries according to your loading strategy if the primary object is already in the identity map.
Finally, I don't know your data - but those numbers sound pretty abysmal for anything short of a huge data set. Check that you have the correct indexes specified on all your tables.
You may have already been through all of this, but based on the information you've provided, it sounds like you need to do more work to narrow down your issue. Is it the db schema, or is it the queries SQLA is executing?
Either way, I'd say, "no" to running multiple queries on different connections. Any attempt to do that could result in inconsistent data coming back to your app, and if you think you've got issues now..... :-)
Answers 2
MySQL has no parallelism in a single connection. For the ORM to do such would require multiple connections to MySQL. Generally, the overhead of trying to do such is "not worth it".
To get a user
, his Organizations
, Schools
, etc, can all be done (in mysql) via a single query:
SELECT user, organization, ... FROM Users JOIN Organizations ON ... etc.
This is significantly more efficient than
SELECT user FROM ...; SELECT organization ... WHERE user = ...; etc.
(This is not "parallelism".)
Or maybe your "steps" are not quite 'right'?...
SELECT user, organization, project FROM Users JOIN Organizations ... JOIN Projects ...
That gets, in a single step, all users, together with all their organizations and projects.
But is a "user" associated with a "project"? If not, then this is the wrong approach.
If the ORM is not providing a mechanism to generate queries like those, than it is "getting in the way".
0 comments:
Post a Comment