I'm using Rails 4.0.0 and Ruby 2.0.0. My Post
(as in blog posts) model is associated with a user with a combination of the user's user_name, first_name, last_name. I'd like to migrate the data so that posts are associated to users by a foreign key, which is the user's id.
I have about 11 million records in the posts
table.
I'm running the below code to migrate the data, using a rake task on a Linux server. However, my task keeps getting "Killed" by the sever, presumably due to the rake task, specifically the below code, consuming too much memory.
I've found that lowering the batch_size
to 20 and increasing sleep(10)
to sleep(60)
allows the task to run longer, updating more records in total without being Killed, but takes significantly more time.
How can I optimize this code for speed and memory usage?
Post.where(user_id: nil).find_in_batches(batch_size: 1000) do |posts| puts "*** Updating batch beginning with post #{posts.first.id}..." sleep(10) # Hopefully, saving some memory usage. posts.each do |post| begin user = User.find_by(user_name: post.user_name, first_name: post.first_name, last_name: post.last_name) post.update(user_id: user.id) rescue NoMethodError => error # user could be nil, so user.id will raise a NoMethodError puts "No user found." end end puts "*** Finished batch." end
5 Answers
Answers 1
Do all the work in the database which is WAY faster than moving data back and forth.
This can be accomplished with ActiveRecord. Of course PLEASE test this before you unleash it on important data.
Post .where(user_id: nil) .joins("inner join users on posts.user_name = users.user_name") .update_all("posts.user_id = users.id")
Further, if posts have an index on user_id
, and users has an index on user_name
, then that will help this particular query run more quickly.
Answers 2
Check out the #uncached
method on AR models. Basically, for request optimization, AR will cache a lot of query data as it is doing #find_in_batches
, but it's a hinderance to large processing scripts like this.
Post.uncached do # perform all your heavy query magic here end
Ultimately, if that doesn't work, consider using the mysql2
gem to avoid the AR overhead, as long as you're not depending on any callbacks/business logic in the update.
Answers 3
If a join is possible I'd go with the approach from z5h. Otherwise you could add an index to the user model (possibly in a separate migration) and also skip the validations, callbacks and stuff when updating each post:
add_index :users, [:user_name, :first_name, :last_name] # Speed up search queries Post.where(user_id: nil).find_each do |post| if user = User.find_by(user_name: post.user_name, first_name: post.first_name, last_name: post.last_name) post.update_columns(user_id: user.id) # ...to skip validations and callbacks. end end
Please note that find_each
is equivalent to find_in_batches
+ iterating over each post, but possibly not faster (see Rails Guides on Active Record Query Interface)
Good luck!
Answers 4
Combining other answers, I was able to join tables, and update multiple columns, in batches of 1000 rows, with a reduction in speed and without my process being killed by the server.
Here's the combines approach that I found to work best, keeping the code within the ActiveRecord API as much as possible.
Post.uncached do Post.where(user_id: nil, organization_id: nil).find_each do |posts| puts "** Updating batch beginning with post #{posts.first.id}..." # Update 1000 records at once posts.map!(&:id) # posts is an array, not a relation Post.where(id: posts). joins("INNER JOIN users ON (posts.user_name = users.user_name)"). joins("INNER JOIN organizations ON (organizations.id = users.organization_id)"). update_all("posts.user_id = users.id, posts.organization_id = organizations.id") puts "** Finished batch." end end
Answers 5
Add new temporary boolean attribute updated
Post.where(updated: false).find_in_batches(batch_size: 1000) do |posts| ActiveRecord::Base.transaction do puts "*** Updating batch beginning with post #{posts.first.id}..." posts.each do |post| user = User.find_by(user_name: post.user_name, first_name: post.first_name, last_name: post.last_name) if user post.update_columns(user_id: user.id, updated: true) else post.update_columns(updated: true) end end puts "*** Finished batch." end end
0 comments:
Post a Comment