I have a PDF which contains tables, text, and some images. I want to extract every table in the PDF.
Right now I find the tables manually, page by page. When I find one, I capture that page and save it into another PDF.
    import PyPDF2

    PDFfilename = "Sammamish.pdf"  # filename of your PDF/directory where your PDF is stored
    pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb"))  # PdfFileReader object
    pg4 = pfr.getPage(126)  # extract pg 127 (page indexes are zero-based)
    writer = PyPDF2.PdfFileWriter()  # create PdfFileWriter object
    # add pages
    writer.addPage(pg4)
    NewPDFfilename = "allTables.pdf"  # filename of your PDF/directory where you want your new PDF to be
    with open(NewPDFfilename, "wb") as outputStream:
        writer.write(outputStream)  # write pages to new PDF
My goal is to extract the tables from the whole PDF document.
Please help me out. Thanks in advance.
4 Answers
Answers 1
In my opinion you have 4 possibilities:
You can process the PDF directly using tabula.
You can convert the PDF to text using pdftotext, then parse the text with Python.
You can use an external tool to convert the PDF file to Excel or CSV, then use the appropriate Python module to open the Excel/CSV file.
You can also convert the PDF to an image file, then use any recent OCR software (which reconstructs tables automatically from the picture) to get the data.
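The second option (pdftotext, then parsing in Python) can be sketched roughly as follows. This is only a heuristic, not a robust table detector: it assumes the text was produced with `pdftotext -layout`, so that columns are padded with at least two spaces, and it treats consecutive lines with the same number of columns as one table. The names `pdf_to_text`, `split_row`, `find_tables`, `min_columns` and `min_rows` are made up for this example.

```python
import re
import subprocess

# Assumption: `pdftotext -layout` separates columns with two or more spaces.
COLUMN_GAP = re.compile(r" {2,}")


def pdf_to_text(pdf_path):
    """Run the pdftotext CLI (must be installed) and return the text.

    The trailing "-" asks pdftotext to write to stdout instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def split_row(line):
    """Split one text line into cells at every run of 2+ spaces."""
    return [cell for cell in COLUMN_GAP.split(line.strip()) if cell]


def find_tables(text, min_columns=2, min_rows=2):
    """Group consecutive lines with the same column count into tables."""
    tables, current = [], []
    for line in text.splitlines():
        cells = split_row(line)
        is_row = len(cells) >= min_columns
        if is_row and (not current or len(cells) == len(current[0])):
            current.append(cells)
            continue
        # The run of matching rows ended: keep it if it is long enough.
        if len(current) >= min_rows:
            tables.append(current)
        # A row with a different column count starts a new candidate table.
        current = [cells] if is_row else []
    if len(current) >= min_rows:
        tables.append(current)
    return tables
```

Typical use would be `find_tables(pdf_to_text("Sammamish.pdf"))`. Tables with empty cells or cells that contain double spaces will confuse this heuristic, so check the output against the original pages.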
Your question is very similar to:
Regards
Answers 2
Just as a keyword for your further research: there is also the option to use zonal OCR. I have used this with good success in a project. But this method is not suited for high-volume/high-speed work, and it requires you to define an extraction template for each field you need.
On the plus side, since it works visually, it works with any kind of table (text, image, scan).
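To give an idea of what an "extraction template" can look like, here is a minimal sketch. It is not taken from any particular zonal OCR product: the `Zone` class, `zone_to_pixels`, `extract_fields` and the `ocr_region` callable are all hypothetical names, and the actual OCR engine is left as a function you pass in (it receives a pixel box and returns the recognized text).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

PixelBox = Tuple[int, int, int, int]  # left, top, right, bottom


@dataclass(frozen=True)
class Zone:
    """One template field; coordinates are fractions of the page size."""
    name: str
    left: float
    top: float
    right: float
    bottom: float


def zone_to_pixels(zone: Zone, page_width: int, page_height: int) -> PixelBox:
    """Scale a relative zone to pixel coordinates for a rendered page."""
    return (
        round(zone.left * page_width),
        round(zone.top * page_height),
        round(zone.right * page_width),
        round(zone.bottom * page_height),
    )


def extract_fields(
    template: List[Zone],
    page_size: Tuple[int, int],
    ocr_region: Callable[[PixelBox], str],
) -> Dict[str, str]:
    """Run the supplied OCR callable once per zone in the template."""
    width, height = page_size
    return {
        zone.name: ocr_region(zone_to_pixels(zone, width, height)).strip()
        for zone in template
    }
```

Storing the zones as fractions rather than pixels keeps one template usable at different scan resolutions, but each page layout still needs its own template.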
Answers 3
You can try converting your PDF file to an Excel file, then use the openpyxl library to extract the data from the Excel file, load the rows into an array, and convert the array to JSON.
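A minimal sketch of the Excel-to-JSON part, assuming the PDF has already been converted to an .xlsx file by some other tool and that the first row of the sheet is the header. `load_rows` and `rows_to_json` are names invented for this example; openpyxl is a third-party package and is only imported when `load_rows` is called.

```python
import json


def load_rows(xlsx_path):
    """Read every row of the active worksheet as a list of cell values."""
    import openpyxl  # third-party: pip install openpyxl

    workbook = openpyxl.load_workbook(xlsx_path, read_only=True, data_only=True)
    try:
        sheet = workbook.active
        return [list(row) for row in sheet.iter_rows(values_only=True)]
    finally:
        workbook.close()


def rows_to_json(rows):
    """Turn rows (first row = header) into a JSON array of objects."""
    if not rows:
        return "[]"
    header, *body = rows
    keys = [str(name) for name in header]
    records = [dict(zip(keys, row)) for row in body]
    # default=str covers values json cannot encode directly, such as dates.
    return json.dumps(records, indent=2, default=str)
```

For example, `rows_to_json(load_rows("tables.xlsx"))`, where "tables.xlsx" is a placeholder file name.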
Answers 4
Try converting your file to a CSV file and read it using the csv reader.
    import csv
    with open('file_name', newline='') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for line in reader:
            standards = line['standards']
            # .......
Refer to the Python documentation for the csv module here: https://docs.python.org/3.6/library/csv.html
You can also follow this answer: Extract PDF files