Saturday, December 9, 2017

How to Extract Table as Text from the PDF using python

By Hường Hana 3:30 AM pdf, python

I have a PDF that contains tables, text, and some images. I want to extract the tables wherever they appear in the PDF.

Right now I find the tables manually, page by page. When I find one, I copy that page and save it into another PDF.

import PyPDF2

PDFfilename = "Sammamish.pdf"  # filename of your PDF/directory where your PDF is stored

pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb"))  # PdfFileReader object

pg4 = pfr.getPage(126)  # extract pg 127

writer = PyPDF2.PdfFileWriter()  # create PdfFileWriter object
# add pages
writer.addPage(pg4)

NewPDFfilename = "allTables.pdf"  # filename of your PDF/directory where you want your new PDF to be
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream)  # write pages to new PDF

My goal is to extract the tables from the whole PDF document.

Please help me out, guys. Thanks in advance.

4 Answers

Answers 1

In my opinion you have four possibilities:

You may process the PDF directly using tabula.
You may convert the PDF to text using pdftotext, then parse the text with Python.
You may use an external tool to convert your PDF file to Excel or CSV, then use the appropriate Python module to open the Excel/CSV file.
You may also convert the PDF to an image file, then use any recent OCR software (which reconstructs tables automatically from the picture) to get the data.
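
As a rough illustration of the pdftotext route, here is a minimal sketch. It assumes the pdftotext command-line tool (from Poppler) is installed and on your PATH, and that table columns in its -layout output are separated by at least two spaces. The function names and the two-space rule are my own assumptions, not part of any library, and real tables will likely need more careful handling.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return the extracted text.

    Assumes the pdftotext command-line tool (Poppler) is installed and on PATH.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_table_rows(text, min_columns=2):
    """Split each line on runs of 2+ whitespace characters.

    Lines that yield fewer than `min_columns` cells are treated as
    ordinary prose and skipped. This is a heuristic, not a real table detector.
    """
    rows = []
    for line in text.splitlines():
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


if __name__ == "__main__":
    # Hand-written sample that imitates pdftotext -layout output.
    sample = "Name      Qty   Price\nBolt      10    0.25\nA plain sentence of text.\n"
    for row in parse_table_rows(sample):
        print(row)
```

For a real file you would call parse_table_rows(pdf_to_layout_text("Sammamish.pdf")) and then filter the rows further, since multi-column prose can also match the heuristic.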

Your question is very similar to:

Regards

Answers 2

Just as a keyword for your further research: there is also the option to use zonal OCR. I have used this with good success in a project. However, this method is not suited to high-volume or high-speed processing, and it requires you to define an extraction template for each field you need.

On the plus side, since it works visually, it works with any kind of table (text, image, scan).
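
To make the "extraction template" idea concrete, here is a minimal sketch. It assumes some OCR engine has already returned each recognized word together with the pixel position of its box. The Word tuple, the field names, and the zone coordinates are all hypothetical and only illustrate the principle.

```python
from collections import namedtuple

# Hypothetical OCR output: one recognized word plus the top-left corner of its box.
Word = namedtuple("Word", ["text", "x", "y"])

# Hypothetical extraction template: field name -> (left, top, right, bottom) in pixels.
TEMPLATE = {
    "invoice_no": (400, 50, 600, 80),
    "total": (400, 700, 600, 730),
}


def extract_fields(words, template):
    """Collect the words that fall inside each zone, in reading order."""
    fields = {}
    for name, (left, top, right, bottom) in template.items():
        inside = [w for w in words if left <= w.x < right and top <= w.y < bottom]
        inside.sort(key=lambda w: (w.y, w.x))  # top to bottom, then left to right
        fields[name] = " ".join(w.text for w in inside)
    return fields
```

Every page layout needs its own template, which is why the approach does not scale well to many different document types.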

Answers 3

You can try converting your PDF file to an Excel file, then use the openpyxl library to extract the data from the Excel file, load that data into an array, and convert the array to JSON.
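
A minimal sketch of that pipeline could look like the following. It assumes the third-party openpyxl package is installed, that the PDF has already been converted to an .xlsx file by some other tool, and that the first row of the sheet is a header. The function names are my own.

```python
import json


def read_sheet_rows(xlsx_path):
    """Return all rows of the active worksheet as a list of tuples.

    Assumes the third-party openpyxl package is installed.
    """
    from openpyxl import load_workbook  # imported lazily; only needed for this step

    workbook = load_workbook(xlsx_path, read_only=True, data_only=True)
    sheet = workbook.active
    return list(sheet.iter_rows(values_only=True))


def rows_to_json(rows):
    """Treat the first row as the header and return the rest as a JSON array of objects."""
    if not rows:
        return "[]"
    header, *body = rows
    records = [dict(zip(header, row)) for row in body]
    return json.dumps(records)
```

With a converted file you would call rows_to_json(read_sheet_rows("converted.xlsx")), where "converted.xlsx" is a placeholder name.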

Answers 4

Try converting your file to an Excel file, save it as CSV, and read it using the csv reader.

import csv

with open('file_name') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',')
    for line in reader:
        standards = line['standards']
        .......

Refer to the Python documentation for the csv module here: https://docs.python.org/3.6/library/csv.html

You can also follow this answer: Extract PDF files
