Tuesday, April 26, 2016

How do I optimize an ActiveRecord find_in_batches query?

Leave a Comment

I'm using Rails 4.0.0 and Ruby 2.0.0. My Post (as in blog posts) model is associated with a user with a combination of the user's user_name, first_name, last_name. I'd like to migrate the data so that posts are associated to users by a foreign key, which is the user's id.

I have about 11 million records in the posts table.

I'm running the below code to migrate the data, using a rake task on a Linux server. However, my task keeps getting "Killed" by the sever, presumably due to the rake task, specifically the below code, consuming too much memory.

I've found that lowering the batch_size to 20 and increasing sleep(10) to sleep(60) allows the task to run longer, updating more records in total without being Killed, but takes significantly more time.

How can I optimize this code for speed and memory usage?

Post.where(user_id: nil).find_in_batches(batch_size: 1000) do |posts|   puts "*** Updating batch beginning with post #{posts.first.id}..."   sleep(10) # Hopefully, saving some memory usage.   posts.each do |post|     begin       user = User.find_by(user_name: post.user_name, first_name: post.first_name, last_name: post.last_name)       post.update(user_id: user.id)     rescue NoMethodError => error # user could be nil, so user.id will raise a NoMethodError       puts "No user found."     end   end   puts "*** Finished batch." end 

5 Answers

Answers 1

Do all the work in the database which is WAY faster than moving data back and forth.

This can be accomplished with ActiveRecord. Of course PLEASE test this before you unleash it on important data.

Post   .where(user_id: nil)   .joins("inner join users on posts.user_name = users.user_name")   .update_all("posts.user_id = users.id") 

Further, if posts have an index on user_id, and users has an index on user_name, then that will help this particular query run more quickly.

Answers 2

Check out the #uncached method on AR models. Basically, for request optimization, AR will cache a lot of query data as it is doing #find_in_batches, but it's a hinderance to large processing scripts like this.

Post.uncached do   # perform all your heavy query magic here end 

Ultimately, if that doesn't work, consider using the mysql2 gem to avoid the AR overhead, as long as you're not depending on any callbacks/business logic in the update.

Answers 3

If a join is possible I'd go with the approach from z5h. Otherwise you could add an index to the user model (possibly in a separate migration) and also skip the validations, callbacks and stuff when updating each post:

add_index :users, [:user_name, :first_name, :last_name] # Speed up search queries Post.where(user_id: nil).find_each do |post|   if user = User.find_by(user_name:  post.user_name,                          first_name: post.first_name,                          last_name:  post.last_name)     post.update_columns(user_id: user.id) # ...to skip validations and callbacks.   end end 

Please note that find_each is equivalent to find_in_batches + iterating over each post, but possibly not faster (see Rails Guides on Active Record Query Interface)

Good luck!

Answers 4

Combining other answers, I was able to join tables, and update multiple columns, in batches of 1000 rows, with a reduction in speed and without my process being killed by the server.

Here's the combines approach that I found to work best, keeping the code within the ActiveRecord API as much as possible.

Post.uncached do   Post.where(user_id: nil, organization_id: nil).find_each do |posts|     puts "** Updating batch beginning with post #{posts.first.id}..."      # Update 1000 records at once     posts.map!(&:id) # posts is an array, not a relation     Post.where(id: posts).       joins("INNER JOIN users ON (posts.user_name = users.user_name)").       joins("INNER JOIN organizations ON (organizations.id = users.organization_id)").       update_all("posts.user_id = users.id, posts.organization_id = organizations.id")      puts "** Finished batch."   end end 

Answers 5

Add new temporary boolean attribute updated

Post.where(updated: false).find_in_batches(batch_size: 1000) do |posts|   ActiveRecord::Base.transaction do     puts "*** Updating batch beginning with post #{posts.first.id}..."     posts.each do |post|       user = User.find_by(user_name: post.user_name, first_name: post.first_name, last_name: post.last_name)       if user         post.update_columns(user_id: user.id, updated: true)       else         post.update_columns(updated: true)       end     end     puts "*** Finished batch."   end end 
If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment