Sunday, June 3, 2018

How to read pdf file to a text file in a proper format using Spire.PDF or any other library?

How can I read PDF files and save their contents to a text file using Spire.PDF? For example, here is a PDF file, and here is the desired text file produced from that PDF.

I tried the below code to read the file and save it to a text file

    using System.IO;
    using System.Text;
    using Spire.Pdf;

    // Load the PDF and collect the extracted text of every page.
    PdfDocument doc = new PdfDocument();
    doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");

    StringBuilder buffer = new StringBuilder();
    foreach (PdfPageBase page in doc.Pages) {
        buffer.Append(page.ExtractText());
    }
    doc.Close();

    // Write the text to disk and open it with the default application.
    String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
    File.WriteAllText(fileName, buffer.ToString());
    System.Diagnostics.Process.Start(fileName);

But the output text file is not properly formatted: it contains unnecessary whitespace, and complete paragraphs are broken across multiple lines.

How do I get the desired result as in the desired text file?

Additionally, is it possible to detect and mark (for example, by adding a tag) text that is bold, italic, or underlined? Things also get more problematic for pages that have multiple columns of text.

3 Answers

Answers 1

Using iText

    File inputFile = new File("input.pdf");
    PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));

    SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
    PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
    canvasProcessor.processPageContent(pdfDocument.getPage(1));

    System.out.println(stes.getResultantText());

This is (as the code says) a basic/simple text extraction strategy. More advanced examples can be found in the documentation.
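To give a feel for what a more position-aware strategy does (iText ships one called LocationTextExtractionStrategy), here is a rough, library-free sketch. It is not iText code: the `PositionedTextSketch` and `Chunk` names, the coordinates, and the `lineTolerance` parameter are all made up for illustration. The idea is to take text chunks in whatever order the content stream emits them, group them into lines by baseline, and read each line left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical illustration only; none of these names come from iText. */
final class PositionedTextSketch {

    /** A run of text plus the page coordinates where it starts (PDF y grows upward). */
    static final class Chunk {
        final double x;
        final double y;
        final String text;

        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Groups chunks into lines by baseline, then orders each line left to right. */
    static String toText(List<Chunk> chunks, double lineTolerance) {
        // Top of the page first: larger y values come earlier.
        List<Chunk> byHeight = new ArrayList<>(chunks);
        byHeight.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

        // Start a new line whenever the baseline moves by more than the tolerance.
        List<List<Chunk>> lines = new ArrayList<>();
        double lineY = 0;
        for (Chunk c : byHeight) {
            if (lines.isEmpty() || Math.abs(lineY - c.y) > lineTolerance) {
                lines.add(new ArrayList<>());
                lineY = c.y;
            }
            lines.get(lines.size() - 1).add(c);
        }

        // Within each line, read left to right and separate chunks with one space.
        StringBuilder out = new StringBuilder();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            StringJoiner joined = new StringJoiner(" ");
            for (Chunk c : line) {
                joined.add(c.text);
            }
            out.append(joined.toString()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Chunks deliberately listed out of reading order.
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk(72, 686, "Second line"));
        chunks.add(new Chunk(120, 700, "world"));
        chunks.add(new Chunk(72, 700, "Hello"));
        System.out.print(toText(chunks, 2.0));
    }
}
```

Note that this naive ordering reads straight across the page, so on a multi-column layout it would interleave the columns. That is one reason multi-column pages are harder, as the question points out.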

Answers 2

Use IronOCR

    var Ocr = new IronOcr.AutoOcr();
    var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
    File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));

For reference: https://ironsoftware.com/csharp/ocr/

With this you should get formatted text output, though not exactly the output you want.

If you want the exact, pre-interpreted output, then you should check paid OCR products such as the OmniPage Capture SDK and the ABBYY FineReader SDK.

Answers 3

That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PDF; I work with Java and the PDFBox library, but any attempt to extract text from a PDF is heuristic and hence imperfect. This problem has received considerable attention, and some applications give better results than others, so you may want to survey all the available options. Still, I think you'll have to clean up the result.
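As a starting point for that cleanup, here is a minimal sketch in plain Java that works on the already-extracted text, so it does not depend on which PDF library produced it. It assumes (and this is only an assumption about the input) that a blank line marks a paragraph boundary and that every other line break is just a layout wrap. The `ExtractedTextCleanup` class and its `clean` method are names invented for this example.

```java
/** Hypothetical post-processing helper; not part of any PDF library. */
final class ExtractedTextCleanup {

    /**
     * Collapses runs of whitespace and rejoins lines that belong to one paragraph.
     * Assumption: blank lines separate paragraphs; other line breaks are wraps.
     */
    static String clean(String raw) {
        // Normalize Windows and old Mac line endings before splitting.
        String[] lines = raw.replace("\r\n", "\n").replace('\r', '\n').split("\n", -1);

        StringBuilder out = new StringBuilder();
        StringBuilder paragraph = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim().replaceAll("\\s+", " ");
            if (trimmed.isEmpty()) {
                // Blank line: the current paragraph (if any) is finished.
                flush(paragraph, out);
                continue;
            }
            if (paragraph.length() > 0) {
                paragraph.append(' ');
            }
            paragraph.append(trimmed);
        }
        flush(paragraph, out);
        return out.toString();
    }

    /** Moves the pending paragraph into the output, separated by one blank line. */
    private static void flush(StringBuilder paragraph, StringBuilder out) {
        if (paragraph.length() == 0) {
            return;
        }
        if (out.length() > 0) {
            out.append("\n\n");
        }
        out.append(paragraph);
        paragraph.setLength(0);
    }

    public static void main(String[] args) {
        String raw = "The  quick   brown\n   fox jumps\n\n\nNext   para\r\nhere.";
        System.out.println(clean(raw));
    }
}
```

This will not fix everything: words hyphenated across lines, headers and footers, and multi-column pages would each need their own heuristic, and the blank-line assumption may not hold for every extractor's output.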
