Monday, February 6, 2017

Extracting Tables from PDF files in Ruby

What is the best way to extract tables that are embedded in PDF documents?

I am not interested in solutions that work only with JRuby, or that rely on third-party APIs or websites.

Can you share some Ruby code on how to extract the table(s)? Which gems are best suited for the job?

I'm sure someone has had the same problem before :) I appreciate your help!

3 Answers

Answers 1

You may want to take a look at this answer (How to convert PDF to Excel or CSV in Rails 4). It solves the same problem you are trying to solve.

Answers 2

Check out this gem; I think it's what you're looking for: the pdf-reader gem.

Answers 3

You can extract data from a PDF with poppler's command-line tools. Depending on your exact requirements, this may be sufficient.

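require 'shellwords'

# Runs pdftotext; the output file is written next to the PDF.
def extract_to_text(pdf_path)
  command = ['pdftotext', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end

# Runs pdftohtml; the output is written next to the PDF.
def extract_to_html(pdf_path)
  command = ['pdftohtml', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end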

These methods extract the PDF to a text file and an HTML file, respectively, saved in the same location as your PDF.
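
If the goal is table rows rather than raw text, one possible next step is to ask pdftotext to preserve the page layout (its -layout option) and then split each line into cells. The sketch below is only a heuristic, not a general table parser: it assumes columns are separated by two or more spaces, and the method names pdf_layout_text and parse_layout_rows are made up for this illustration.

```ruby
require 'open3'

# Run `pdftotext -layout` and return the extracted text as a string.
# Passing "-" as the output file sends the text to stdout instead of
# writing a .txt file next to the PDF.
def pdf_layout_text(pdf_path)
  stdout, status = Open3.capture2('pdftotext', '-layout', pdf_path, '-')
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  stdout
end

# Split layout-preserved text into rows of cells.
# Assumption (heuristic): cells are separated by runs of 2+ whitespace
# characters, and a line with fewer than min_columns cells is not a table row.
def parse_layout_rows(text, min_columns: 2)
  text.each_line
      .map { |line| line.strip.split(/\s{2,}/) }
      .select { |cells| cells.size >= min_columns }
end
```

With a hypothetical file report.pdf, parse_layout_rows(pdf_layout_text('report.pdf')) would return an array of rows, each an array of cell strings. Cells that themselves contain double spaces, or tables with empty cells, will need extra handling.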

On macOS, you can install poppler with Homebrew:

brew install poppler 