Tuesday, April 12, 2016

How to handle file paths in distributed environment


I'm working on setting up a distributed Celery environment to do OCR on PDF files. I have about 3M PDFs, and OCR is CPU-bound, so the idea is to create a cluster of servers to process the OCR.

As I'm writing my task, I've got something like this:

@app.task
def do_ocr(pk, file_path):
    content = run_tesseract_command(file_path)
    item = Document.objects.get(pk=pk)
    item.content = content
    item.save()
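
For context, run_tesseract_command isn't defined in the question; a minimal sketch of what it might look like is below. It assumes the pdftoppm (poppler-utils) and tesseract binaries are on PATH, since tesseract doesn't read PDFs directly; none of this is from the original post.

import subprocess
import tempfile
from pathlib import Path

def run_tesseract_command(file_path):
    # Render each PDF page to a PNG with pdftoppm, then OCR each page with tesseract
    text_parts = []
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(['pdftoppm', '-png', file_path, f'{tmp}/page'], check=True)
        for page in sorted(Path(tmp).glob('page*.png')):
            # 'stdout' tells tesseract to write the recognized text to standard output
            result = subprocess.run(
                ['tesseract', str(page), 'stdout'],
                capture_output=True, text=True, check=True,
            )
            text_parts.append(result.stdout)
    return '\n'.join(text_parts)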

The question I have is: what is the best way to make file_path work in a distributed environment? How do people usually handle this? Right now all my files simply live in a single directory on one of our servers.

3 Answers

Answer 1

Well, there are multiple ways to handle it, but let's stick to one of the simplest:

  • since you'd like to process a large number of files using multiple servers, my first suggestion would be to use the same OS on each server, so you won't have to worry about cross-platform compatibility
  • using the word 'cluster' indicates that all of those servers should know their mutual state - that adds complexity, so try to switch to a farm of stateless workers (by 'stateless' I mean "not knowing about the others", as they should be aware of at least their own state, e.g.: IDLE, IN_PROGRESS, QUEUE_FULL, or more if needed)
  • for the file-list processing part you could use a pull or a push model:
    • the push model could be easily implemented by a simple app that crawls the files and dispatches them (e.g. over SCP, FTP, whatever) to a set of available servers; the servers can monitor their local directories for changes and pick up new files to process; it's also very easy to scale - just spin up more servers and update the push client (even at runtime); the only limit is your push client's performance (a rough sketch of such a push client follows after this list)
    • the pull model is a little trickier, because you have to handle more complexity; having a set of servers implies maintaining a proper starting index and offset per node - that makes error handling more difficult, and it doesn't scale easily (imagine adding twice as many servers to speed up the processing and having to update the indices and offsets correctly on each node... it seems like an error-prone solution)
  • I assume that network traffic isn't a big concern - having 3M files to process will generate it somewhere, one way or the other
  • collecting/storing the results is a different ballpark, but here the list of possible solutions is limitless
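
A rough sketch of such a push client in Python is below; the worker hostnames, the watched inbox directory, and the use of the system scp binary are assumptions for illustration, not part of the answer above.

import itertools
import subprocess
from pathlib import Path

WORKERS = ['worker1.internal', 'worker2.internal']  # hypothetical worker hostnames
REMOTE_INBOX = '/var/ocr/inbox'                     # directory each worker watches for new files
LOCAL_PDF_DIR = Path('/var/your/app/pdfs')

def push_files():
    # Crawl the local PDF directory and dispatch each file to the next worker in turn
    targets = itertools.cycle(WORKERS)
    for pdf in LOCAL_PDF_DIR.rglob('*.pdf'):
        host = next(targets)
        # Copy the file into the worker's watched inbox; the worker picks it up from there
        subprocess.run(['scp', str(pdf), f'{host}:{REMOTE_INBOX}/'], check=True)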

Answer 2

If you are in a Linux environment, the easiest way is to mount a remote filesystem, using sshfs, under /mnt for each node in the cluster. Then you can pass the node name to the do_ocr function and work as if all data were local to the current node.

For example, say your cluster has N nodes named node1, ..., nodeN.
Let's configure node1: for each remote node, mount its filesystem. Here's a sample /etc/fstab for node1:

sshfs#user@node2:/var/your/app/pdfs    /mnt/node2    fuse    port=<port>,defaults,user,noauto,uid=1000,gid=1000    0    0
....
sshfs#user@nodeN:/var/your/app/pdfs    /mnt/nodeN    fuse    port=<port>,defaults,user,noauto,uid=1000,gid=1000    0    0

In /mnt on the current node (node1), create a symlink named after the current server, pointing to the PDFs path:

ln -s /var/your/app/pdfs node1 

Your /mnt folder should then contain the remote filesystems and the symlink:

user@node1:/mnt$ ls -lsa
0 lrwxrwxrwx  1 user user      16 apr 12  2016 node1 -> /var/your/app/pdfs
0 lrwxrwxrwx  1 user user      16 apr 12  2016 node2
...
0 lrwxrwxrwx  1 user user      16 apr 12  2016 nodeN

Then your function should look like this:

import os

MOUNT_POINT = '/mnt'

@app.task
def do_ocr(pk, node_name, file_path):
    content = run_tesseract_command(os.path.join(MOUNT_POINT, node_name, file_path))
    item = Document.objects.get(pk=pk)
    item.content = content
    item.save()

It works as if all files were on the current machine, but there is remote-access logic working for you transparently.
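
For completeness, here's a hedged sketch of how the dispatching side might enqueue tasks under this layout; NODE_NAME, the source_path field, and the directory walk are illustrative assumptions, with paths stored relative to /var/your/app/pdfs.

import os
import socket

PDF_ROOT = '/var/your/app/pdfs'   # same path on every node
NODE_NAME = socket.gethostname()  # assumes hostnames match the /mnt symlink/mount names

def enqueue_local_pdfs():
    # Walk this node's local PDF directory and enqueue one OCR task per file
    for dirpath, _, filenames in os.walk(PDF_ROOT):
        for name in filenames:
            if not name.lower().endswith('.pdf'):
                continue
            rel_path = os.path.relpath(os.path.join(dirpath, name), PDF_ROOT)
            doc = Document.objects.create(source_path=rel_path)  # hypothetical model field
            # Any worker can now resolve the file as /mnt/<node_name>/<rel_path>
            do_ocr.delay(doc.pk, NODE_NAME, rel_path)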

Answer 3

Since I'm missing a lot of details about your architecture and your application specifics, take this answer as guidance rather than a strict recipe. You can take this approach, in the following order:

1- Deploy an internal file server that stores all the files in one place and serves them

Example:

http://internal-ip-address/storage/filenameA.pdf

http://internal-ip-address/storage/filenameB.pdf

http://internal-ip-address/storage/filenameC.pdf

and so on ...

2- Install/Deploy Redis

3- Create an upload client/service/process that takes the files you want to upload and passes them to the above storage location (/storage/), so your files become available as soon as they are uploaded; at the same time, push the full file URL to a predefined Redis list/queue (built on the linked list data structure), like this: http://internal-ip-address/storage/filenameA.pdf

You can get more details about LPUSH and RPOP under Redis Lists here: http://redis.io/topics/data-types-intro (a small producer sketch follows after the examples below)

Examples:

  1. A file upload form that stores the files directly to the storage area
  2. A file upload utility/command-line tool/background process that you create yourself or base on an existing tool, which takes files from a specific location - be it a web address or some other server that has your files - and uploads them to the storage location
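
For illustration, a minimal sketch of the producer side of step 3 using the redis-py client; the Redis host and the list name pdf_urls are hypothetical.

import redis

r = redis.Redis(host='internal-redis-host', port=6379)  # hypothetical Redis host

def enqueue_uploaded_file(filename):
    # After a file lands in /storage/, push its full URL onto the Redis list
    url = 'http://internal-ip-address/storage/' + filename
    r.lpush('pdf_urls', url)  # LPUSH adds to the head; workers RPOP/BRPOP from the tail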

4- Now we come to your Celery workers: each worker should pull (RPOP) one of the file URLs from the Redis queue, download the file from the internal file server (built in the first step), and do the required processing the way you want (a worker sketch follows after the quoted Redis note below).

An important thing to note from Redis documentation:

Lists have a special feature that make them suitable to implement queues, and in general as a building block for inter process communication systems: blocking operations.

However it is possible that sometimes the list is empty and there is nothing to process, so RPOP just returns NULL. In this case a consumer is forced to wait some time and retry again with RPOP. This is called polling, and is not a good idea in this context because it has several drawbacks

So Redis implements commands called BRPOP and BLPOP which are versions of RPOP and LPOP able to block if the list is empty: they'll return to the caller only when a new element is added to the list, or when a user-specified timeout is reached.
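
Putting steps 3 and 4 together with BRPOP, a minimal worker loop might look like the sketch below; redis-py and requests are assumed, and pdf_urls and process_pdf are hypothetical placeholders for your queue name and your OCR logic.

import redis
import requests

r = redis.Redis(host='internal-redis-host', port=6379)  # hypothetical Redis host

def worker_loop():
    while True:
        # BRPOP blocks for up to 30 seconds; returns (list_name, value) or None on timeout
        item = r.brpop('pdf_urls', timeout=30)
        if item is None:
            continue
        _, url = item
        response = requests.get(url.decode())  # values come back as bytes
        response.raise_for_status()
        process_pdf(response.content)  # hypothetical: hand the PDF bytes to your OCR task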

Let me know if that answers your question.

Things to keep in mind

  • You can add as many workers as you want, since this solution is very scalable; your only bottleneck is the Redis server, which you can cluster, and whose queue you can persist in case of a power outage or server crash

  • You can replace Redis with RabbitMQ, Beanstalk, Kafka, or any other queuing/messaging system, but Redis has been nominated in this race due to its simplicity and the myriad of features it offers out of the box.
