How can I read PDF files and save their contents to a text file using Spire.PDF? For example: here is a PDF file, and here is the desired text file from that PDF.
I tried the code below to read the file and save it to a text file:
    using System.IO;
    using System.Text;
    using Spire.Pdf;

    PdfDocument doc = new PdfDocument();
    doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");

    // Append the extracted text of every page to one buffer.
    StringBuilder buffer = new StringBuilder();
    foreach (PdfPageBase page in doc.Pages)
    {
        buffer.Append(page.ExtractText());
    }
    doc.Close();

    // Write the result next to the PDF and open it in the default viewer.
    string fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
    File.WriteAllText(fileName, buffer.ToString());
    System.Diagnostics.Process.Start(fileName);
But the output text file is not properly formatted: it has unnecessary whitespace, and complete paragraphs are broken into multiple lines.
How do I get output that matches the desired text file?
Additionally, is it possible to detect and mark (for example, with a tag) text that is bold, italic, or underlined? Things also get more problematic for pages that have multiple columns of text.
3 Answers
Answer 1
Using iText
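    import java.io.File;
    import com.itextpdf.kernel.pdf.PdfDocument;
    import com.itextpdf.kernel.pdf.PdfReader;
    import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
    import com.itextpdf.kernel.pdf.canvas.parser.listener.SimpleTextExtractionStrategy;

    File inputFile = new File("input.pdf");
    PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
    SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
    PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
    canvasProcessor.processPageContent(pdfDocument.getPage(1)); // pages are 1-based
    System.out.println(stes.getResultantText());
    pdfDocument.close();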
This is (as the code says) a basic/simple text extraction strategy. More advanced examples can be found in the documentation.
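Whichever strategy is used, the extracted string will often still contain runs of spaces and hard line breaks inside paragraphs. The following is a minimal post-processing sketch in plain Java, not an iText feature: the class `TextReflow` and its method `reflow` are hypothetical names, and it assumes that a blank line in the extracted text marks a paragraph boundary, which will not hold for every PDF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of iText, Spire.PDF, or any other library.
final class TextReflow {

    /** Collapses whitespace runs and joins hard-wrapped lines; a blank line ends a paragraph. */
    static String reflow(String raw) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // \R matches any line break (\n, \r\n, ...); -1 keeps trailing empty lines.
        for (String line : raw.split("\\R", -1)) {
            String cleaned = line.trim().replaceAll("\\s+", " ");
            if (cleaned.isEmpty()) {
                // Blank line: close the paragraph collected so far, if any.
                if (current.length() > 0) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Non-blank line: append it to the current paragraph with a single space.
                if (current.length() > 0) current.append(' ');
                current.append(cleaned);
            }
        }
        if (current.length() > 0) paragraphs.add(current.toString());
        return String.join("\n\n", paragraphs);
    }

    public static void main(String[] args) {
        String extracted = "The  quick   brown\nfox jumps.\n\nSecond   para\ncontinues.\n";
        System.out.println(reflow(extracted));
    }
}
```

In the snippet above, the result of `stes.getResultantText()` could be passed through `TextReflow.reflow` before it is written out. Headings, lists, and tables that rely on single line breaks would be merged by this sketch, so it is only a starting point.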
Answer 2
Use IronOCR
    var Ocr = new IronOcr.AutoOcr();
    // Verbatim strings (@"...") keep the backslashes in the Windows paths literal.
    var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
    File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));
For reference: https://ironsoftware.com/csharp/ocr/
With this you should get formatted text output, but not exactly the output you want.
If you want exact, pre-interpreted output, you should look at paid OCR products such as the OmniPage Capture SDK and the ABBYY FineReader SDK.
Answer 3
That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PDF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention, and some applications get better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.
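To make the "place this character there" point concrete, here is a simplified sketch in plain Java of the kind of guessing an extractor has to do. It is not how PDFBox or Spire.PDF are implemented; the `GlyphLayout` and `Glyph` classes, the `toText` method, and the two tolerance parameters are all hypothetical. Glyphs whose y coordinates lie within `yTol` are treated as one line, and a horizontal gap larger than `gapTol` is treated as a space.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a PDF page: positioned glyphs only, with no notion of words or lines.
final class GlyphLayout {

    static final class Glyph {
        final double x, y, width;
        final char ch;

        Glyph(double x, double y, double width, char ch) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.ch = ch;
        }
    }

    /** Guesses lines from y coordinates and spaces from horizontal gaps. */
    static String toText(List<Glyph> glyphs, double yTol, double gapTol) {
        // PDF y grows upward, so the largest y is the top of the page.
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed());

        // Cluster glyphs into lines: close to the line's first glyph means same line.
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : byY) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - g.y) <= yTol) {
                last.add(g);
            } else {
                List<Glyph> line = new ArrayList<>();
                line.add(g);
                lines.add(line);
            }
        }

        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            // Within a line, read left to right.
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            if (out.length() > 0) out.append('\n');
            Glyph prev = null;
            for (Glyph g : line) {
                // A gap wider than gapTol after the previous glyph is taken to be a space.
                if (prev != null && g.x - (prev.x + prev.width) > gapTol) out.append(' ');
                out.append(g.ch);
                prev = g;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> page = new ArrayList<>();
        // Glyphs are deliberately out of reading order, as they may be in a content stream.
        page.add(new Glyph(25, 100.0, 5, 'o'));
        page.add(new Glyph(0, 100.0, 5, 'H'));
        page.add(new Glyph(5, 80.0, 5, 'k'));
        page.add(new Glyph(5, 100.2, 5, 'i'));
        page.add(new Glyph(20, 99.9, 5, 'y'));
        page.add(new Glyph(0, 80.0, 5, 'o'));
        System.out.println(toText(page, 1.0, 2.0));
    }
}
```

Both thresholds are guesses, which is why output differs between tools. Note also that two columns sharing a baseline would be merged into one line by this sketch, which is one reason multi-column pages tend to come out garbled.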