What is the best way to extract tables that are embedded in PDF documents?
I am not interested in solutions which work only for JRuby, or which make use of third-party APIs or websites.
Can you share some Ruby code on how to extract the table(s)? Which gems are best suited for the job?
I'm sure someone has had the same problem before :) I appreciate your help!
3 Answers
Answer 1
You may want to take a look at this answer (How to convert PDF to Excel or CSV in Rails 4). It solves the same problem you are trying to solve.
Answer 2
Check out the pdf-reader gem. I think it's what you're looking for.
Answer 3
You can extract data from a PDF with poppler. Depending on your exact requirements, this may be sufficient.
require 'shellwords'

def extract_to_text(pdf_path)
  command = ['pdftotext', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end

def extract_to_html(pdf_path)
  command = ['pdftohtml', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end

These commands extract the PDF to a text file and an HTML file, respectively, saved in the same location as your PDF.
You can install poppler on a Mac with Homebrew:
brew install poppler      
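If the goal is rows and columns rather than raw text, one possible follow-up is to run pdftotext with its -layout option (which tries to keep the original physical layout) and split each line on runs of whitespace. The following is only a rough sketch: the helper names parse_table_rows and extract_table_rows are made up for illustration, and it assumes that columns are separated by at least two spaces and that cells never contain two consecutive spaces. Real-world PDFs often break these assumptions.

```ruby
require 'open3'

# Hypothetical helper: turn layout-preserved text into an array of rows,
# each row being an array of cell strings.
# Assumption: columns are separated by runs of two or more whitespace characters.
def parse_table_rows(text)
  text.each_line
      .map(&:rstrip)
      .reject(&:empty?)
      .map { |line| line.strip.split(/\s{2,}/) }
end

# Hypothetical helper: run pdftotext with -layout and send the result to
# stdout ("-") instead of a file, then parse it into rows.
# Passing the arguments separately avoids the shell, so no escaping is needed.
def extract_table_rows(pdf_path)
  output, status = Open3.capture2('pdftotext', '-layout', pdf_path, '-')
  raise "pdftotext failed for #{pdf_path}" unless status.success?

  parse_table_rows(output)
end
```

Calling extract_table_rows('report.pdf') (the file name is just a placeholder) would return something you can pass to Ruby's CSV library, provided the table in your PDF is regular enough for the whitespace heuristic to hold.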