Saturday, December 9, 2017

How to Extract Tables as Text from a PDF Using Python


I have a PDF that contains tables, text, and some images. I want to extract the tables wherever they appear in the PDF.

Right now I find the tables manually, page by page. When I find one, I capture that page and save it into another PDF.

import PyPDF2

PDFfilename = "Sammamish.pdf"  # filename of your PDF/directory where your PDF is stored

pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb"))  # PdfFileReader object

pg4 = pfr.getPage(126)  # extract pg 127 (page indexes start at 0)

writer = PyPDF2.PdfFileWriter()  # create PdfFileWriter object
# add pages
writer.addPage(pg4)

NewPDFfilename = "allTables.pdf"  # filename of your PDF/directory where you want your new PDF to be
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream)  # write pages to new PDF

My goal is to extract all tables from the whole PDF document.

Please have a look at the sample image of a page in the PDF.

Please help me out, guys. Thanks in advance.

4 Answers

Answers 1

In my opinion you have 4 possibilities:

  • You may process the PDF directly using tabula.

  • You may convert the PDF to text using pdftotext, then parse the text with Python.

  • You may use an external tool to convert your PDF file to Excel or CSV, then use the required Python module to open the Excel/CSV file.

  • You may also convert the PDF to an image file, then use any recent OCR software (which reconstructs tables automatically from the picture) to get the data.
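As a rough illustration of the second possibility, the sketch below parses text that keeps the page layout (for example, the output of `pdftotext -layout`) and splits each line into cells wherever two or more spaces occur. The function name `rows_from_layout_text`, the `min_columns` threshold, and the sample text are made up for this example; real PDFs will often need a more careful rule.

```python
import re

# Two or more consecutive spaces are treated as a column gap. This is an
# assumption that fits layout-preserved text, not every PDF.
COLUMN_GAP = re.compile(r" {2,}")


def rows_from_layout_text(text, min_columns=2):
    """Return the lines that look like table rows, each split into cells."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        cells = COLUMN_GAP.split(stripped)
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


# Hypothetical sample resembling layout-preserved text from one page.
sample = """Annual report for the city

Year    Population    Budget
2015    50,000        1.2M
2016    52,500        1.4M

Notes follow on the next page."""

table = rows_from_layout_text(sample, min_columns=3)
for row in table:
    print(row)
```

Ordinary sentences contain only single spaces, so they yield one cell and are dropped, while the three table lines are kept. The resulting rows could then be written out with the `csv` module.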

Your question is very similar to:

Regards

Answers 2

Just as a keyword for your further research: there is also the option to use zonal OCR. I have used this with good success in a project. However, this method is not suited for high-volume or high-speed processing, and it requires you to define an extraction template for each field you need.


On the plus side, since it works visually, it works with any kind of table (text, image, scan).
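To make the idea of an extraction template more concrete, here is a minimal sketch. It assumes the OCR step has already produced words with bounding boxes; the tuple shape `(text, left, top, right, bottom)`, the field names, and all coordinates are hypothetical and not taken from any particular OCR tool.

```python
# Extraction template: field name -> zone (left, top, right, bottom) in page
# pixels. Field names and coordinates are hypothetical.
TEMPLATE = {
    "invoice_number": (400, 50, 600, 80),
    "total": (400, 700, 600, 730),
}


def extract_fields(words, template):
    """Collect the words whose centre lies inside each template zone.

    `words` is a list of (text, left, top, right, bottom) tuples, an assumed
    shape for what an OCR engine returns.
    """
    result = {}
    for field, (zl, zt, zr, zb) in template.items():
        hits = []
        for text, left, top, right, bottom in words:
            cx = (left + right) / 2
            cy = (top + bottom) / 2
            if zl <= cx <= zr and zt <= cy <= zb:
                hits.append((top, left, text))
        # Simplified reading order: sort by top edge, then by left edge.
        hits.sort()
        result[field] = " ".join(text for _, _, text in hits)
    return result


# Made-up OCR output for one page.
ocr_words = [
    ("Invoice", 300, 55, 380, 75),
    ("INV-001", 410, 55, 500, 75),
    ("Total", 300, 705, 360, 725),
    ("1,250.00", 420, 705, 520, 725),
]

fields = extract_fields(ocr_words, TEMPLATE)
print(fields)
```

Each field needs its own zone, which is why the approach becomes laborious when there are many fields or many different page layouts.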

Answers 3

You can try converting your PDF file to an Excel file. You can then use the openpyxl library to extract the data from the Excel file, load the data into an array, and convert the array to JSON.
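The last two steps (array, then JSON) might look like the sketch below. It assumes the first row holds the column names. The helper `rows_to_json`, the file name in the comment, and the sample rows are illustrative only. The sample rows stand in for what openpyxl would return, so the sketch runs without a workbook.

```python
import json


def rows_to_json(rows):
    """Turn rows (header row first) into a JSON array of objects."""
    rows = list(rows)
    if not rows:
        return "[]"
    header = [str(name) for name in rows[0]]
    records = [dict(zip(header, row)) for row in rows[1:]]
    # default=str keeps cells such as dates from raising a TypeError.
    return json.dumps(records, indent=2, default=str)


# With openpyxl installed, the rows could come from a workbook, e.g.:
#   from openpyxl import load_workbook
#   sheet = load_workbook("tables.xlsx").active   # hypothetical file name
#   rows = sheet.iter_rows(values_only=True)
sample_rows = [
    ("Year", "Population"),
    (2015, 50000),
    (2016, 52500),
]

output = rows_to_json(sample_rows)
print(output)
```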

Answers 4

Try converting your file to a CSV file (for example, by exporting it from Excel) and read it using the csv reader.

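import csv

# newline='' is what the csv module documentation recommends when opening files
with open('file_name', newline='') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',')
    for line in reader:
        standards = line['standards']  # value of the 'standards' column
        # .......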

Refer to the Python documentation for the csv module here: https://docs.python.org/3.6/library/csv.html

You can also follow this answer: Extract PDF files
