Showing posts with label pdf. Show all posts

Wednesday, September 19, 2018

Extract a single page (or range of pages) from pdf data without loading the whole pdf (which takes too much RAM sometimes)


Using PDFKit in Swift, you can use PDFDocument to open PDF files. That's easy and works well. But I'm building a custom PDF viewer (for comic book PDFs) that suits my needs, and I have one problem: a viewer doesn't need the whole PDF file in memory. I only need a few pages at a time.

Also, the pdfs consist only of images. There's no text or anything.

When instantiating a PDFDocument, the whole pdf data is being loaded into memory. If you have really huge pdf files (over 1GB) this isn't optimal (and can crash on some devices). As far as I know, there's no way in PDFKit to only load parts of a pdf document.

Is there anything I can do about that? I haven't found a swift/obj-c library that can do this (though I don't really know the right keywords to search for it).

My workaround would be to preprocess PDFs and save each page as an image in the documents directory (or similar) using FileManager. That would result in a tremendous number of files but would solve the memory problem. I'm not sure I like this approach, though.

Update:

So I did what @Prcela and @Sahil Manchanda proposed. It seems to be working for now.

@yms: Hm, that could be a problem, indeed. Does this even happen when there are only images? Without anything else in the pdf.

@Carpsen90: They are local (saved in the documents directory).

1 Answer

Answers 1

I have an idea how you could achieve this in PDFKit. According to the documentation, there is a function that selects content between given pages, which would probably solve your problem if you added it to a collectionFlowView.

func selection(from startPage: PDFPage, atCharacterIndex startCharacter: Int, to endPage: PDFPage, atCharacterIndex endCharacter: Int) -> PDFSelection? 

However, since you mainly have images, there is another function that selects parts of the PDF based on CGPoints:

func selection(from startPage: PDFPage, at startPoint: CGPoint, to endPage: PDFPage, at endPoint: CGPoint) -> PDFSelection? 

Also have a look at PDFView (https://developer.apple.com/documentation/pdfkit/pdfview), as this might be what you need if you only want to view the pages without annotation editing and so on.

I also prepared a little code to extract one page below. Hope it helps.

import PDFKit
import UIKit

class PDFViewController: UIViewController {

    override func viewDidLoad() {
        super.viewDidLoad()

        guard let url = Bundle.main.url(forResource: "myPDF", withExtension: "pdf") else {
            fatalError("INVALID URL")
        }
        let pdf = PDFDocument(url: url)
        let page = pdf?.page(at: 10) // returns a PDFPage instance (the index is zero-based)
        // now you have one page extracted and you can play around with it.
    }
}

EDIT 1: Have a look at this code excerpt. I understand that the whole PDF still gets loaded; however, this approach might be more memory efficient, as iOS may handle it better in a PDFView:

func readBook() {

    if let oldBookView = self.view.viewWithTag(3) {
        oldBookView.removeFromSuperview()
        // This removes the old book view when the user chooses a new book language
    }

    if #available(iOS 11.0, *) {
        let pdfView: PDFView = PDFView()
        let path = BookManager.getBookPath(bookLanguageCode: book.bookLanguageCode)
        let url = URL(fileURLWithPath: path)
        if let pdfDocument = PDFDocument(url: url) {
            pdfView.displayMode = .singlePageContinuous
            pdfView.autoScales = true
            pdfView.document = pdfDocument
            pdfView.tag = 3 // I assigned a tag to this view so that later on I can easily find and remove it when the user chooses a new book language
            let lastReadPage = getLastReadPage()

            if let page = pdfDocument.page(at: lastReadPage) {
                pdfView.go(to: page)
                // Subscribe to notifications so the last read page can be saved
                // Must subscribe after displaying the last read page or else, the first page will be displayed instead
                NotificationCenter.default.addObserver(self, selector: #selector(self.saveLastReadPage), name: .PDFViewPageChanged, object: nil)
            }
        }

        self.containerView.addSubview(pdfView)
        setConstraints(view: pdfView)
        addTapGesture(view: pdfView)
    }
}

Tuesday, September 11, 2018

Convert Non-Searchable Pdf to Searchable Pdf in Windows Python


I need a solution to convert a PDF file in which every page is an image (a page can contain text, a table, or a combination of both) into a searchable PDF.

I have used ABBYY FineReader Online, which does the job perfectly well, but I am looking for a solution that can be implemented in Python on Windows.

I have done a detailed analysis, and below are the links that came close to what I want, but not exactly:

Scanned Image/PDF to Searchable Image/PDF

It suggests using Ghostscript to convert the PDF to images first and then converts them directly to text. I don't believe tesseract converts non-searchable PDFs to searchable PDFs.

Converting searchable PDF to a non-searchable PDF

The above solution does the reverse, i.e. it converts searchable to non-searchable. Also, I think these apply to Ubuntu/Linux/macOS.

Can someone please tell me what the Python code should be for converting a non-searchable PDF to a searchable one on Windows?


UPDATE 1

I have got the desired result with Asprise Web Ocr. Below is the link and code:

https://asprise.com/royalty-free-library/python-ocr-api-overview.html

I am looking for a solution that uses only Python libraries on Windows, because:

  1. I do not want to pay subscription costs in the future.
  2. I need to convert thousands of documents daily, and it would be cumbersome to upload each one to an API, download the result, and so on.

UPDATE 2

I know how to convert a non-searchable PDF directly to text. But I am asking whether there is any way to convert a non-searchable PDF to a searchable PDF. I have the code for converting the PDF to text using PyPDF2.

3 Answers

Answers 1

Well you don't actually need to transform everything inside the pdf to text. Text will remain text, table will remain table and if possible image should become text. You would need a script that actually reads the pdf as is, and begins the conversion on blocks. The script would write blocks of text until the document has been read completely and then transform it into a pdf. Something like

if line_is_text():
    write_the_line_as_is()
elif line_is_img():
    transform_img_in_text()  # comments below code
...

Now, transform_img_in_text() could be implemented with many external libraries; one you could use is:

Tesseract OCR Python

You can download this lib via pip, instructions provided in the link above.
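
For what it's worth, the Tesseract command-line tool can itself write a searchable PDF: passing the pdf output config makes it emit a PDF with an invisible text layer over the page image. The sketch below only assembles and runs that command for one page image. It assumes the tesseract executable is installed and on PATH and that the PDF pages have already been rasterized to images (for example with Ghostscript); the function names and file names are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang> pdf"""
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def ocr_image_to_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract on one page image and return the path of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends the extension to the output base name itself.
    return output_base + ".pdf"
```

Calling ocr_image_to_searchable_pdf("page-001.png", "page-001") would leave page-001.pdf next to the image; the per-page PDFs still have to be merged afterwards.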

Answers 2

If an online ocr solution is acceptable to you, the free OCR API from OCR.space can also create searchable PDFs and works well.

In the free version the created PDF contains a watermark. To remove the watermark you need to upgrade to their commercial PRO plan. You can test the api with the web form on the front page.

OCR.space is also available as non-subscription on-premise option, but I am unsure about the price. Personally I use the free ocr api with good success.

Answers 3

I've used pypdfocr in the past to do this. It hasn't been updated recently though.

From the README:

pypdfocr filename.pdf --> filename_ocr.pdf will be generated 

Read carefully the Install instructions for Windows.


Monday, August 27, 2018

Add pdf metadata with accents in python


I want to change the metadata of the pdf file using this code:

from PyPDF2 import PdfFileReader, PdfFileWriter

title = "Vice-présidence pour l'éducation"
fin = open(filename, 'rb')
reader = PdfFileReader(fin)
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
metadata = reader.getDocumentInfo()

metadata.update({'/Title':title})

writer.addMetadata(metadata)

fout = open(filename, 'wb')
writer.write(fout)

fin.close()
fout.close()

It works fine if the title is in English (no accents), but when it has accents I get the following error:

TypeError: createStringObject should have str or unicode arg 

How can I add a title with accents to the metadata?

Thank you

1 Answer

Answers 1

The only way to get this error message is to pass the wrong type for the parameter string to the createStringObject(string) function in the library itself.

It checks for a string or bytes type using these definitions in utils.py:

import builtins
bytes_type = type(bytes())  # Works the same in Python 2.X and 3.X
string_type = getattr(builtins, "unicode", str)

I can only reproduce your error if I rewrite your code with an obviously wrong type like this (code is rewritten using with statement but only the commented line is important):

from PyPDF2 import PdfFileReader, PdfFileWriter

with open(inputfile, "rb") as fr, open(outputfile, "wb") as fw:
    reader = PdfFileReader(fr)
    writer = PdfFileWriter()

    writer.appendPagesFromReader(reader)
    metadata = reader.getDocumentInfo()

    # metadata.update({'/Title': "Vice-présidence pour l'éducation"})
    metadata.update({'/Title': [1, 2, 3]})  # <- wrong type here !
    writer.addMetadata(metadata)

    writer.write(fw)

It seems that the type of your string title = "Vice-présidence pour l'éducation" does not match whatever bytes_type or string_type resolves to. Either the title variable has an unexpected type (which I cannot see in your code, maybe because you reduced it to an MCVE), or bytes_type or string_type does not resolve to the types the library author intended (this could be a bug in the library or a broken installation; it's hard for me to tell).

Without reproducible code, it's hard to provide a solution. But hopefully this points you in the right direction. Maybe it's enough to convert your string to whatever type bytes_type or string_type resolves to. Other solutions would have to be made on the library side or would simply be hacks.
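
As a rough illustration of converting the title to one of the accepted types, here is a minimal Python 3 sketch. ensure_text is a hypothetical helper, not part of PyPDF2, and it assumes that a title arriving as bytes is UTF-8 encoded; anything that is neither text nor bytes is rejected, which mirrors the kind of check that raises the error above.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as a text string, decoding bytes with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# The accented title, supplied here as UTF-8 bytes for the sake of the example.
title = ensure_text(b"Vice-pr\xc3\xa9sidence pour l'\xc3\xa9ducation")
```

The result could then be passed on as metadata.update({'/Title': title}).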


Monday, June 18, 2018

Heading identification with Regex


I'm working on a program that parses a .pdf file and looks for a specific section. Once it finds the section it finds all subsections of that section and their content and stores it in a dictionary<string, string>. I start by reading the entire pdf into a string, and then use this function to locate the "marking" section.

private string GetMarkingSection(string text)
{
  int startIndex = 0;
  int endIndex = 0;
  bool startIndexFound = false;
  Regex rx = new Regex(HEADINGREGEX);
  foreach (Match match in rx.Matches(text))
  {
    if (startIndexFound)
    {
      endIndex = match.Index;
      break;
    }
    if (match.ToString().ToLower().Contains("marking"))
    {
      startIndex = match.Index;
      startIndexFound = true;
    }
  }
  return text.Substring(startIndex, (endIndex - startIndex));
}

Once the marking section is found, I use this to find subsections.

private Dictionary<string, string> GetSubsections(string text)
{
  Dictionary<string, string> subsections = new Dictionary<string, string>();
  string[] unprocessedSubSecs = Regex.Split(text, SUBSECTIONREGEX);
  string title = "";
  string content = "";
  foreach(string s in unprocessedSubSecs)
  {
    if(s != "") //sometimes it pulls in empty strings
    {
      Match m = Regex.Match(s, SUBSECTIONREGEX);
      if (m.Success)
      {
        title = s;
      }
      else
      {
        content = s;
        if (!String.IsNullOrWhiteSpace(content) && !String.IsNullOrWhiteSpace(title))
        {
          subsections.Add(title, content);
        }
      }
    }
  }
  return subsections;
}

Getting these methods to work the way I want them to isn't an issue, the problem is getting them to work with each of the documents. I'm working on a commercial application so any API that requires a license isn't going to work for me. These documents are anywhere from 1-16 years old, so the formatting varies quite a bit. Here is a link to some sample headings and subheadings from various documents. But to make it easy, here are the regex patterns I'm using:

  • Heading: (?m)^(\d+\.\d+\s[ \w,\-]+)\r?$
  • Subheading: (?m)^(\d\.[\d.]+ ?[ \w]+) ?\r?$
  • Master Key: (?m)^(\d\.?[\d.]*? ?[ \-,:\w]+) ?\r?$

Since some headings use the subheading format in other documents I am unable to use the same heading regex for each file, and the same goes for my subheading regex.

My alternative to this was that I was going to write a master key (listed in the regex link) to identify all types of headings and then locate the last instance of a numeric character in each heading (5.1.X) and then look for 5.1.X+1 to find the end of that section.

That's when I ran into another problem. Some of these files have absolutely no proper structure. Most of them go from 5.2->7.1.5 (5.2->5.3/6.0 would be expected)

I'm trying to wrap my head around a solution for something like this, but I've got nothing... I am open to ideas not involving regex as well.
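
One idea that does not depend on the next number being predictable, sketched in Python rather than C# for brevity: parse each heading number into a tuple of integers and end the section at the first later heading that is not nested under the starting number. A jump such as 5.2 -> 7.1.5 then closes section 5.2 just as 5.3 would. The regex below is a simplified stand-in for the master key pattern, find_section is a hypothetical name, and any body line that happens to start with a number followed by a space would be misread as a heading.

```python
import re

# Simplified heading pattern: dotted number, whitespace, title, end of line.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?[ \t]+([^\r\n]*?)[ \t]*\r?$", re.MULTILINE)


def parse_number(label):
    """Turn '5.2.1' into (5, 2, 1)."""
    return tuple(int(part) for part in label.split("."))


def find_section(text, keyword):
    """Return the section whose heading contains keyword, including its subsections."""
    matches = list(HEADING.finditer(text))
    for i, match in enumerate(matches):
        if keyword.lower() not in match.group(2).lower():
            continue
        start_number = parse_number(match.group(1))
        end = len(text)
        for later in matches[i + 1:]:
            number = parse_number(later.group(1))
            # A heading that does not share the starting prefix ends the section.
            if number[:len(start_number)] != start_number:
                end = later.start()
                break
        return text[match.start():end]
    return None
```

Subsections such as 5.2.1 stay inside the returned text because their number tuple starts with (5, 2).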

0 Answers


Sunday, June 3, 2018

How to read pdf file to a text file in a proper format using Spire.PDF or any other library?


How can I read pdf files and save contents to a text file using Spire.PDF? For example: Here is a pdf file and here is the desired text file from that pdf

I tried the below code to read the file and save it to a text file

PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");

StringBuilder buffer = new StringBuilder();

foreach (PdfPageBase page in doc.Pages)
{
    buffer.Append(page.ExtractText());
}

doc.Close();
String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);

But the output text file is not properly formatted. It has unnecessary whitespace, complete paragraphs are broken into multiple lines, and so on.

How do I get the desired result as in the desired text file?

Additionally, is it possible to detect and mark (e.g. add a tag to) text in bold, italic or underlined form as well? Things also get more problematic for pages that have multiple columns of text.

3 Answers

Answers 1

Using iText

File inputFile = new File("input.pdf");

PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));

SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));

System.out.println(stes.getResultantText());

This is (as the code says) a basic/simple text extraction strategy. More advanced examples can be found in the documentation.

Answers 2

Use IronOCR

var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf(@"E:\Demo.pdf"); // verbatim strings keep the backslash literal
File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));

For reference https://ironsoftware.com/csharp/ocr/

Using this you should get formatted text output, but not exactly the output you want.

If you want exact pre-interpreted output, you should check paid OCR services like the OmniPage Capture SDK and the ABBYY FineReader SDK.

Answers 3

That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PDF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention and some applications have better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.
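
As an idea of what such a cleanup pass might look like, here is a small Python sketch (not tied to Spire.PDF or PDFBox) that collapses runs of whitespace and rejoins lines that were broken inside a paragraph. It assumes the extractor leaves at least one blank line between paragraphs, which is not guaranteed for every document, and reflow is a made-up name.

```python
import re


def reflow(text):
    """Join hard-wrapped lines into paragraphs and collapse repeated whitespace."""
    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    cleaned = []
    for paragraph in paragraphs:
        # split() with no arguments drops newlines, tabs and repeated spaces.
        joined = " ".join(paragraph.split())
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)
```

This does nothing for multi-column pages or for bold and italic detection; those need layout information that plain extracted text no longer carries.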


Sunday, May 27, 2018

Place image over PDF


How can I place an image over an existing PDF file at a specific coordinate location? The PDF represents a drawing sheet with one page. The image will be scaled. I'm checking ReportLab but can't find the answer. Thanks.

7 Answers

Answers 1

http://pybrary.net/pyPdf/:

from pyPdf import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(file("document1.pdf", "rb"))
watermark = PdfFileReader(file("watermark.pdf", "rb"))

# merge the watermark onto the first page and add it to the output
page = input1.getPage(0)
page.mergePage(watermark.getPage(0))
output.addPage(page)

# finally, write "output" to document-output.pdf
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()

I think it works like a watermark; see the manual for a better idea.

Answers 2

It's been 5 years; I think these answers need some TLC. Here is a complete solution.

The following is tested with Python 2.7

Install dependencies

pip install reportlab
pip install pypdf2

Do the magic

from reportlab.pdfgen import canvas
from PyPDF2 import PdfFileWriter, PdfFileReader

# Create the watermark from an image
c = canvas.Canvas('watermark.pdf')

# Draw the image at x, y. I positioned the x,y to be where i like here
c.drawImage('test.png', 15, 720)

# Add some custom text for good measure
c.drawString(15, 720,"Hello World")
c.save()

# Get the watermark file you just created
watermark = PdfFileReader(open("watermark.pdf", "rb"))

# Get our files ready
output_file = PdfFileWriter()
input_file = PdfFileReader(open("test2.pdf", "rb"))

# Number of pages in input document
page_count = input_file.getNumPages()

# Go through all the input file pages to add a watermark to them
for page_number in range(page_count):
    print "Watermarking page {} of {}".format(page_number, page_count)
    # merge the watermark with the page
    input_page = input_file.getPage(page_number)
    input_page.mergePage(watermark.getPage(0))
    # add page from input file to output document
    output_file.addPage(input_page)

# finally, write "output" to document-output.pdf
with open("document-output.pdf", "wb") as outputStream:
    output_file.write(outputStream)

References:

New home of pypdf: http://mstamy2.github.io/PyPDF2/

Reportlab docs: http://www.reportlab.com/apis/reportlab/2.4/pdfgen.html

Reportlab complete user guide: https://www.reportlab.com/docs/reportlab-userguide.pdf

Answers 3

I combined ReportLab (http://www.reportlab.com/software/opensource/rl-toolkit/download/) and pyPDF (http://pybrary.net/pyPdf/) to insert an image directly without having to generate the PDF up front:

from pyPdf import PdfFileWriter, PdfFileReader
from reportlab.pdfgen import canvas
from StringIO import StringIO

# Using ReportLab to insert image into PDF
imgTemp = StringIO()
imgDoc = canvas.Canvas(imgTemp)

# Draw image on Canvas and save PDF in buffer
imgPath = "path/to/img.png"
imgDoc.drawImage(imgPath, 399, 760, 160, 160)    ## at (399,760) with size 160x160
imgDoc.save()

# Use PyPDF to merge the image-PDF into the template
page = PdfFileReader(file("document.pdf","rb")).getPage(0)
overlay = PdfFileReader(StringIO(imgTemp.getvalue())).getPage(0)
page.mergePage(overlay)

#Save the result
output = PdfFileWriter()
output.addPage(page)
output.write(file("output.pdf","wb"))  # binary mode, so the PDF bytes are written unchanged

Answers 4

Thx to the previous answers. My way with python3.4

# -*- coding: utf-8 -*-
from io import BytesIO
from PyPDF2 import PdfFileWriter, PdfFileReader
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4

def gen_pdf():
    # there are 66 slides (1.jpg, 2.jpg, 3.jpg...)
    path = 'slades/{0}.jpg'
    pdf = PdfFileWriter()

    for num in range(1, 67):  # for each slide
        # Using ReportLab Canvas to insert image into PDF
        imgTemp = BytesIO()
        imgDoc = canvas.Canvas(imgTemp, pagesize=A4)
        # Draw image on Canvas and save PDF in buffer
        imgDoc.drawImage(path.format(num), -25, -45)
        # x, y - start position
        # in my case -25, -45 needed
        imgDoc.save()
        # Use PyPDF to merge the image-PDF into the template
        pdf.addPage(PdfFileReader(BytesIO(imgTemp.getvalue())).getPage(0))

    pdf.write(open("output.pdf","wb"))


if __name__ == '__main__':
    gen_pdf()

Answers 5

This is what worked for me

from PyPDF2 import PdfFileWriter, PdfFileReader

def watermarks(temp, watermar, new_file):
    template = PdfFileReader(open(temp, 'rb'))
    wpdf = PdfFileReader(open(watermar, 'rb'))
    watermark = wpdf.getPage(0)
    output = PdfFileWriter()  # collects the watermarked pages

    for i in xrange(template.getNumPages()):
        page = template.getPage(i)
        page.mergePage(watermark)
        output.addPage(page)

    # write the file once, after all pages have been added
    with open(new_file, 'wb') as f:
        output.write(f)

Answers 6

This is quite easy to do with PyMuPDF without merging two PDFs:

import fitz

src_pdf_filename = 'source.pdf'
dst_pdf_filename = 'destination.pdf'
img_filename = 'barcode.jpg'

# http://pymupdf.readthedocs.io/en/latest/rect/
# Set position and size according to your needs
img_rect = fitz.Rect(100, 100, 120, 120)

document = fitz.open(src_pdf_filename)

# We'll put image on first page only but you could put it elsewhere
page = document[0]
page.insertImage(img_rect, filename=img_filename)

# See http://pymupdf.readthedocs.io/en/latest/document/#Document.save and
# http://pymupdf.readthedocs.io/en/latest/document/#Document.saveIncr for
# additional parameters, especially if you want to overwrite existing PDF
# instead of writing new PDF
document.save(dst_pdf_filename)

document.close()
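
A note on coordinates, since the answers here mix two conventions: ReportLab's drawImage measures from the bottom-left corner of the page with y growing upward, while PyMuPDF rectangles are measured from the top-left corner with y growing downward. The helpers below are a small, library-free sketch with made-up names for scaling an image into a box while keeping its aspect ratio and for converting a top-left y coordinate into the bottom-left one. All values are in PDF points.

```python
def fit_within(img_w, img_h, max_w, max_h):
    """Scale (img_w, img_h) to fit inside (max_w, max_h), keeping the aspect ratio."""
    scale = min(max_w / img_w, max_h / img_h)
    return img_w * scale, img_h * scale


def bottom_left_y(top_y, height, page_height):
    """Convert the top edge of a box, measured from the page top, to a bottom-left-origin y."""
    return page_height - top_y - height


# An 800x400 pixel image placed in a 200x200 point box, 100 points below the top
# of a page that is 842 points tall (roughly A4).
width, height = fit_within(800, 400, 200, 200)
y = bottom_left_y(100, height, 842)
```

With these numbers the image becomes 200 by 100 points, and its lower edge sits 642 points above the bottom of the page, which is the y value drawImage would expect.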

Answers 7

Since it's an existing PDF, the easiest way to do it is:

  1. Convert the PDF to .doc or .odt (check http://www.zamzar.com/).
  2. Add images to the converted file however you want.
  3. Convert back to PDF (OpenOffice and LibreOffice make it easy to save PDFs).

PS: If the PDF file needs further editing, always keep a backup of the source .doc file so that changes can be made easily; too many conversions degrade file quality.


Thursday, May 24, 2018

How to implement a PDF viewer that loads pages asynchronously


We need to allow users of our mobile app to browse a magazine with an experience that is fast, fluid and feels native to the platform (similar to iBooks/Google Books).

Some features we need are being able to see thumbnails of the whole magazine and searching for specific text.

The problem is that our magazines are over 140 pages long and we can’t force our users to have to fully download the whole ebook/PDF beforehand. We need pages to be loaded asynchronously, that is, to let users start reading without having to fully download the content.

I studied PDFKit for iOS however I didn’t find any mention in the documentation about downloading a PDF asynchronously.

Are there any solutions/libraries to implement this functionality on iOS and Android?

2 Answers

Answers 1

What you're looking for is called linearization. According to this answer:

The first object immediately after the %PDF-1.x header line shall contain a dictionary key indicating the /Linearized property of the file.

This overall structure allows a conforming reader to learn the complete list of object addresses very quickly, without needing to download the complete file from beginning to end:

  • The viewer can display the first page(s) very fast, before the complete file is downloaded.

  • The user can click on a thumbnail page preview (or a link in the ToC of the file) in order to jump to, say, page 445, immediately after the first page(s) have been displayed, and the viewer can then request all the objects required for page 445 by asking the remote server via byte range requests to deliver these "out of order" so the viewer can display this page faster. (While the user reads pages out of order, the downloading of the complete document will still go on in the background...)
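
To check whether a given file is already linearized, a client or publishing script can look for that dictionary key near the start of the file. The sketch below is a simple heuristic under the assumption that the linearization dictionary sits within the first 1024 bytes, as the PDF specification requires for linearized files; looks_linearized is a hypothetical name, and the check does not validate the hint tables or the rest of the structure.

```python
def looks_linearized(data):
    """Heuristically report whether PDF bytes start with a linearization dictionary."""
    head = data[:1024]
    return b"%PDF-" in head and b"/Linearized" in head
```

For a file on disk, reading only the first 1024 bytes (for example open(path, "rb").read(1024)) is enough input for this check.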

You can use this native library to linearize a PDF.

However, I wouldn't recommend it, as rendering the PDFs won't be fast, fluid or feel native. For those reasons, as far as I know, there is no native mobile app that does linearization. Moreover, you would have to create your own rendering engine for the PDF, as most PDF viewing libraries do not support linearization. What you should do instead is convert each individual page of the PDF to HTML on the server end and have the client load pages only when required and cache them. We will also save the PDF's plain text separately in order to enable search. This way everything will be smooth, as the resources will be lazy loaded. In order to achieve this you can do the following.

First, on the server end, whenever you publish a PDF, the pages of the PDF should be split into HTML files as explained above. Page thumbs should also be generated from those pages. Assuming that your server is running Python with the Flask microframework, this is what you do.

from flask import Flask, request, send_file, jsonify
from werkzeug.utils import secure_filename
import os
from pyPdf import PdfFileWriter, PdfFileReader
import imgkit
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
import io
import sqlite3
from PIL import Image

app = Flask(__name__)


@app.route('/publish', methods=['GET', 'POST'])
def upload_file():
    if request.method == 'POST':
        f = request.files['file']
        filePath = "pdfs/" + secure_filename(f.filename)
        f.save(filePath)
        savePdfText(filePath)
        inputpdf = PdfFileReader(open(filePath, "rb"))

        for i in xrange(inputpdf.numPages):
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            with open("document-page%s.pdf" % i, "wb") as outputStream:
                output.write(outputStream)
            # convert the page to HTML first, then render that HTML to an image
            os.system("pdf2htmlEX --zoom 1.3 document-page%s.pdf" % i)
            imgkit.from_file("document-page%s.html" % i, "document-page%s.jpg" % i)
            saveThum("document-page%s.jpg" % i)
    return "OK"


def saveThum(infile):
    size = 124, 124
    outfile = os.path.splitext(infile)[0] + ".thumbnail"
    if infile != outfile:
        try:
            im = Image.open(infile)
            im.thumbnail(size, Image.ANTIALIAS)
            im.save(outfile, "JPEG")
        except IOError:
            print("cannot create thumbnail for '%s'" % infile)


def savePdfText(data):
    fp = open(data, 'rb')
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    db = sqlite3.connect("pdfText.db")
    cursor = db.cursor()
    cursor.execute('create table if not exists pagesTextTables(id INTEGER PRIMARY KEY,pageNum TEXT,pageText TEXT)')
    db.commit()
    pageNum = 1
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
        pageText = retstr.getvalue()
        # empty the buffer so the next row holds only the next page's text
        retstr.seek(0)
        retstr.truncate(0)
        cursor.execute('INSERT INTO pagesTextTables(pageNum,pageText) values(?,?) ', (str(pageNum), pageText))
        db.commit()
        pageNum = pageNum + 1
    fp.close()
    db.close()


@app.route('/page', methods=['GET', 'POST'])
def getPage():
    # request.values reads the parameter from the query string or the form body
    page_num = request.values['page_num']
    return send_file("document-page%s.html" % page_num, as_attachment=True)


@app.route('/thumb', methods=['GET', 'POST'])
def getThum():
    page_num = request.values['page_num']
    return send_file("document-page%s.thumbnail" % page_num, as_attachment=True)


@app.route('/search', methods=['GET', 'POST'])
def search():
    query = request.values['query']
    db = sqlite3.connect("pdfText.db")
    cursor = db.cursor()
    # bind the query as a parameter instead of concatenating it into the SQL
    cursor.execute("SELECT pageNum FROM pagesTextTables WHERE pageText LIKE ?", ('%' + query + '%',))
    result = [row[0] for row in cursor.fetchall()]
    db.close()
    return jsonify(queryResults=result)

Here is an explanation of what the flask app is doing.

  1. The /publish route is responsible for publishing your magazine: turning every page into HTML, saving the PDF's text to an SQLite db and generating thumbnails for those pages. I've used pyPdf for splitting the PDF into individual pages, pdf2htmlEX to convert the pages to HTML, imgkit to render that HTML to images and PIL to generate thumbs from those images. Also, a simple SQLite db saves the pages' text.
  2. The /page, /thumb and /search routes are self-explanatory. They simply return the HTML, thumb or search query results.
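
A search like the one behind the /search route can pass the user's text as a bound parameter, so the query string never becomes part of the SQL itself. The following is a self-contained sketch using only the standard library and an in-memory database; the table and column names follow the ones used above, while the function names and sample rows are invented for illustration.

```python
import sqlite3


def create_index(db):
    """Create the per-page text table if it does not exist yet."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pagesTextTables "
        "(id INTEGER PRIMARY KEY, pageNum INTEGER, pageText TEXT)"
    )


def add_page(db, page_num, text):
    """Store the extracted text of one page."""
    db.execute(
        "INSERT INTO pagesTextTables (pageNum, pageText) VALUES (?, ?)",
        (page_num, text),
    )


def search_pages(db, query):
    """Return the page numbers whose text contains query."""
    rows = db.execute(
        "SELECT pageNum FROM pagesTextTables WHERE pageText LIKE ? ORDER BY pageNum",
        ("%" + query + "%",),
    )
    return [row[0] for row in rows]


db = sqlite3.connect(":memory:")
create_index(db)
add_page(db, 1, "Cover story: summer travel")
add_page(db, 2, "Recipes for the season")
add_page(db, 3, "Travel tips for winter")
matches = search_pages(db, "travel")
```

SQLite's LIKE ignores case for ASCII letters by default, so the lower-case query also finds the page that starts with "Travel". For 140-page magazines a plain LIKE scan is probably fine; a full-text index would be the next step if it is not.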

Secondly, on the client end you simply download the HTML page whenever the user scrolls to it. Let me give you an example for Android. First, you'd want to create some utils to handle the requests:

public static byte[] GetPage(int mPageNum) throws IOException {
    return CallServer("page", "page_num", Integer.toString(mPageNum));
}

public static byte[] GetThum(int mPageNum) throws IOException {
    return CallServer("thumb", "page_num", Integer.toString(mPageNum));
}

private static byte[] CallServer(String route, String requestName, String requestValue) throws IOException {
    OkHttpClient client = new OkHttpClient.Builder()
            .connectTimeout(30, TimeUnit.SECONDS)
            .writeTimeout(30, TimeUnit.SECONDS)
            .readTimeout(30, TimeUnit.SECONDS)
            .build();
    MultipartBody.Builder mMultipartBody = new MultipartBody.Builder()
            .setType(MultipartBody.FORM)
            .addFormDataPart(requestName, requestValue);

    RequestBody mRequestBody = mMultipartBody.build();
    Request request = new Request.Builder()
            .url("yourUrl/" + route).post(mRequestBody)
            .build();
    Response response = client.newCall(request).execute();
    return response.body().bytes();
}

The helper utils above simply handle the queries to the server for you; they should be self-explanatory. Next, you simply create a RecyclerView with a WebView ViewHolder, or better yet an AdvancedWebView, as it will give you more power with customization.

public static class ViewHolder extends RecyclerView.ViewHolder {
    private AdvancedWebView mWebView;

    public ViewHolder(View itemView) {
        super(itemView);
        mWebView = (AdvancedWebView) itemView;
    }
}

private class ContentAdapter extends RecyclerView.Adapter<YourFrament.ViewHolder> {
    @Override
    public ViewHolder onCreateViewHolder(ViewGroup container, int viewType) {
        return new ViewHolder(new AdvancedWebView(container.getContext()));
    }

    @Override
    public int getItemViewType(int position) {
        return 0;
    }

    @Override
    public void onBindViewHolder(ViewHolder holder, int position) {
        handlePageDownload(holder.mWebView);
    }

    private void handlePageDownload(AdvancedWebView mWebView){....}

    @Override
    public int getItemCount() {
        return numberOfPages;
    }
}

That should be about it.

Answers 2

I am sorry to say it, but there is no library or SDK available that provides asynchronous page loading. It is next to impossible on a mobile device to open a PDF file without downloading the full file.

Solution:

I have already done R&D on this and met the same requirement in a project. I am not sure whether iBooks and Google Books use the mechanism below, but it works fine for your requirements.

  • Divide your PDF into n parts (e.g. if the PDF has 150 pages, each part contains 15 pages). This will take some effort on the web end.
  • Once the first part has downloaded successfully, display it to the user while the other parts download asynchronously.
  • After all parts of the PDF have downloaded, use the code below to merge them into one PDF file.
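The first step (dividing the PDF into parts) comes down to computing page ranges. The following is only a sketch of that arithmetic in Python; the function name `split_into_parts` and the choice of inclusive, 1-based ranges are assumptions for illustration, not part of the answer's project.

```python
def split_into_parts(total_pages, pages_per_part):
    """Return inclusive, 1-based (first_page, last_page) ranges, one per part.

    Hypothetical helper: the last part may be shorter than pages_per_part.
    """
    if total_pages < 0 or pages_per_part <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_part must be > 0")
    return [
        (start, min(start + pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_part)
    ]


# The example from the list above: 150 pages, 15 pages per part.
parts = split_into_parts(150, 15)
```

With the numbers from the answer this yields ten parts, so the client can display pages 1 to 15 as soon as the first part arrives and fetch the rest in the background.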

How to Merge PDF file

    UIGraphicsBeginPDFContextToFile(oldFile, paperSize, nil);

    // Get document access to the PDF from which you want the pages (opened once, before the loop)
    CGPDFDocumentRef newDocument = CGPDFDocumentCreateWithURL((CFURLRef) newUrl);

    for (pageNumber = 1; pageNumber <= count; pageNumber++) {
        UIGraphicsBeginPDFPageWithInfo(paperSize, nil);

        // Get graphics context to draw the page
        CGContextRef currentContext = UIGraphicsGetCurrentContext();

        // Flip and scale context to draw the pdf correctly
        CGContextTranslateCTM(currentContext, 0, paperSize.size.height);
        CGContextScaleCTM(currentContext, 1.0, -1.0);

        // Get the page you want
        CGPDFPageRef newPage = CGPDFDocumentGetPage(newDocument, pageNumber);

        // Draw the page
        CGContextDrawPDFPage(currentContext, newPage);
    }

    // Clean up
    CGPDFDocumentRelease(newDocument);
    newDocument = nil;
    newUrl = nil;

    UIGraphicsEndPDFContext();

Reference: How to merge PDF file.

Update: The main advantage of this mechanism is that the logic remains the same on all devices, both Android and iOS.


Wednesday, May 9, 2018

How to convert DOCX to PDF with nameddest to be linkable in FireFox and Chrome


From my website, I'm linking several sections within a PDF document using URL in format http://www.example.com/Document.pdf#nameddest=sectionXY (as discussed e.g. here). My PDF document is manually created from a DOCX document using the "PDF export" function in MS-Word 2016. The labels are marked as MS-Word bookmarks in the source document.

Unfortunately, the PDF viewer in the web browser scrolls to the proper section only in Google Chrome. In other browsers (FireFox, IE 11 or Edge) the PDF document always opens on the first page.

I'm sure my solution used to work several years ago in Chrome, FireFox and IE.

Is there any way to make it work at least in Chrome and FireFox? I'm able to use another converter (or even some PDF library) but I cannot afford to have my source document in any other format than DOCX. I'm even able to mark my "labels" another way than using MS Word bookmarks.

2 Answers

Answers 1

This is probably not something you'll be able to change.

The PDF standard itself for instance does not really specify whether links like the one you posted should work. So support for them is not something you commonly find.

Of course if the browser is open source, you may always post a pull request.

Answers 2

Word's PDF export may or may not create "named destinations" in the PDF file. It seems to vary based on platform (Mac vs Windows) and by version (2010 vs 2016).

LibreOffice can import .docx and has PDF export with specific options for creating named destinations from the document bookmarks.

Chrome uses PDFium and Firefox uses PDF.js as its built-in PDF viewer, both of which support navigating to named destinations with #nameddest=sectionXY as well as some other navigation styles specified in the RFC for the application/pdf media type, like #page=2. (See a related question on linking to sections.)
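To make the two fragment styles concrete, here is a small Python sketch that builds such links. The helper name `pdf_link` is hypothetical, and percent-encoding the destination name is an assumption about what viewers accept, not something stated in the answer.

```python
from urllib.parse import quote


def pdf_link(base_url, nameddest=None, page=None):
    """Build a PDF link with a #nameddest= or #page= fragment.

    Hypothetical helper: exactly one of nameddest or page must be given.
    """
    if (nameddest is None) == (page is None):
        raise ValueError("pass exactly one of nameddest or page")
    if nameddest is not None:
        # Percent-encode characters that are not safe in a URL fragment.
        return f"{base_url}#nameddest={quote(nameddest, safe='')}"
    return f"{base_url}#page={int(page)}"


section_link = pdf_link("http://www.example.com/Document.pdf", nameddest="sectionXY")
page_link = pdf_link("http://www.example.com/Document.pdf", page=2)
```

Whether the viewer actually scrolls to the target still depends on the named destination existing in the PDF file, which is what the check in the next paragraph is for.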

You can check your PDF file for named destinations with Poppler's pdfinfo or other tools (see a related question on Unix.SE about listing named destinations).


Wednesday, January 31, 2018

PDF as blank page in HTML


My problem is, everything is fine opening PDFs using my browsers, until I uploaded a pdf with a form inside. Then, if I embed it, it returns a blank page. But the other pdfs with forms open normally. Please see my code below:

    <object data="{{ asset($test->file_path) }}" type="application/pdf" width="100%" height="100%">
        <embed src="{{ asset($test->file_path) }}" type='application/pdf'>
        <center>
            <a href="{{ route('download.test', ['id' => $test->id]) }}" class="btn btn-primary">Please click here to view</a>
        </center>
    </object>

Note: I've also tried to use <iframe> but still returns blank page.

4 Answers

Answers 1

    <a href="{{ route('download.test', ['id' => $test->id]) }}" target="_blank" class="btn btn-primary">Please click here to view</a>

Answers 2

It's late, and I'm tired, so apologies if I misread the question.

I noticed that the PDF is hosted on a site that doesn't support HTTPS. It showed a blank page if it was embedded on a site using HTTPS, but worked fine when it was using HTTP.

I think you need to either move the PDF to a site that supports HTTPS or make the site hosting the PDF start using HTTPS.
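The situation described in this answer (an HTTPS page embedding an HTTP resource) can be checked up front. This is a rough Python sketch; the function name `is_insecure_embed` is hypothetical, and real browsers apply more nuanced mixed-content rules than this simple scheme comparison.

```python
from urllib.parse import urlparse


def is_insecure_embed(page_url, resource_url):
    """Return True when an https page would embed a plain-http resource.

    Simplified assumption: only the URL schemes are compared.
    """
    page_scheme = urlparse(page_url).scheme.lower()
    resource_scheme = urlparse(resource_url).scheme.lower()
    return page_scheme == "https" and resource_scheme == "http"


# Hypothetical URLs, for illustration only.
blocked = is_insecure_embed("https://example.com/view", "http://example.org/form.pdf")
allowed = is_insecure_embed("http://example.com/view", "http://example.org/form.pdf")
```

If the check returns True, the two fixes named above apply: host the PDF somewhere that supports HTTPS, or enable HTTPS on the host that serves it.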

Answers 3

Consider using Objects and Iframes (Rather than Object and Embed)

Something like this should work for you:

    <object data="http://foersom.com/net/HowTo/data/OoPdfFormExample.pdf" type="application/pdf" width="100%" height="100%">
        <iframe src="http://foersom.com/net/HowTo/data/OoPdfFormExample.pdf" width="100%" height="100%" style="border: none;">
            This browser does not support PDFs. Please download the PDF to view it: <a href="/pdf/example.pdf">Download PDF</a>
        </iframe>
    </object>

This worked when I tested it locally but I can't show JSFiddle since it uses HTTPS. Also, have a look at these examples: https://pdfobject.com/static.html

Answers 4

Not sure if this will work as I am not able to test your case. You can try this, it always works for me. Try replacing http://yoursite.com/the.pdf with the correct path.

    <object data="http://yoursite.com/the.pdf" type="application/pdf" width="750px" height="750px">
        <embed src="http://yoursite.com/the.pdf" type="application/pdf">
            <p>This browser does not support PDFs. Please download the PDF to view it: <a href="http://yoursite.com/the.pdf">Download PDF</a>.</p>
        </embed>
    </object>

Sunday, January 14, 2018

First page of PDF in WkWebView is rendered pixelated


I'm having some issues with rendering a multi paged pdf in the WkWebView.

If it's a single page it looks fine. But the first page of any multiple page PDF looks bad. I can't figure out why.

Example

I load the request like this

wkWebView.load(URLRequest(url: request as! URL)) 

I keep the wkWebView inside another view, and to avoid a scroll within a scroll (there is a reason for this; I know how scrollView works) I then set the height of the wkWebView and its scrollView to the same height as the contentSize.

    wkWebView.frame.size.height = wkWebView.scrollView.contentSize.height
    wkWebView.frame.size.width = UIScreen.main.bounds.width
    wkWebView.autoresizingMask = [.flexibleWidth, .flexibleHeight]
    wkWebView.scrollView.delegate = self

3 Answers

Answers 1

The reason why this happens is because you need to add your wkWebView as a subview to the scrollView that will contain it BEFORE you do the actual request. Otherwise the pages might end up pixelated and they will also not behave correctly when zoomed in.

Solution:

    self.scrollView.addSubview(wkWebView)
    wkWebView.load(URLRequest(url: request as! URL))

Won't work:

    wkWebView.load(URLRequest(url: request as! URL))
    self.scrollView.addSubview(wkWebView)

Similar to this issue: https://stackoverflow.com/a/44623268/3418097


Answers 3

Use the following methods:

[self.wkWebView loadFileURL:fileURL allowingReadAccessToURL:baseUrl]; 

Note: fileURL is the path of the HTML file to load; baseUrl is the directory one level above that file. This is a pitfall: baseUrl and fileURL can't be the same!


Friday, December 15, 2017

Why does bottom table cell that has centered-text get cut off when displayed as PDF in iOS?


Question

I have an iOS app that takes an html file, turns it into a PDF, and displays it to a WebKit web view. I have a weird problem where the bottom table cell gets cut off when I display to PDF. The weird thing is that the bottom table is only cut off when the text is centered. If it is left-aligned, everything works fine. Why does this happen?

Please note that I am not looking for a work around solution. Rather, I am seeking to understand why this happens. Is this a bug in iOS? Or did I miss something?

Overview

In the image below, there are two tables <table>. One table is large, has a reddish (coral) background, and has a height of 505px. The other table is below the first with a white background (height is not set). Both have some text. The text is centered for both tables.

The navbar title shows the details of the current view. For example, as shown in the image below, a title of PDF Landscape 505 means that the view shows a PDF in landscape dimensions with a main table height of 505px.

enter image description here

The Problem

The problem arises when I increase the height by 10px. In the image below, the main table height is 515px and the lower table is now cut off.

enter image description here

Take the exact same html and css code and change only the text-alignment to be left-aligned. Now the lower table is not cut off anymore. I also changed the background color to green for distinction. Green means that text is left-aligned. Red means that text is centered.

enter image description here

The following image shows a main-table height of 745px and still the lower table is not cut off because it is left-aligned.

enter image description here

Code

Below is the html code used for this test.

    <!DOCTYPE html>
    <html>
    <head>
      <title>#COLOR#</title>
      <meta charset="utf-8">
      <style>
      table, th, td {
        border-collapse: collapse;
        border: 3px solid black;
        text-align: #ALIGN#;
      }
      table.main-table {
        width: 1012px;
        height: #HEIGHT#px;
        background-color: #COLOR#;
      }
      table.bottom-table {
        width: 1012px;
      }
      </style>
    </head>

    <body>

      <table class="main-table">
        <tr><td>Hello World.</td></tr>
      </table>
      <table class="bottom-table">
        <tr><td>This text gets cut off when centered. It does *not* get cut when left-aligned.</td></tr>
      </table>

    </body>
    </html>

In MyViewController, the getHTML() function pulls the html source from sample.html. The function then replaces #ALIGN#, #COLOR#, and #HEIGHT# with their respective values.

    func getHTML() -> String? {
        let htmlPath: String? = Bundle.main.path(forResource: htmlResource, ofType: "html")
        guard let path = htmlPath else { return nil }
        do {
            // Load the HTML template code into a String variable.
            var html = try String(contentsOfFile: path)
            html = html.replacingOccurrences(of: "#HEIGHT#", with: tableHeight.description)
            html = html.replacingOccurrences(of: "#COLOR#", with: colorState.rawValue)
            html = html.replacingOccurrences(of: "#ALIGN#", with: alignState.rawValue)
            return html
        } catch {
            print("Error: " + error.localizedDescription)
        }
        return nil
    }

The PDFBuilder class handles the PDF creation with one function:

    static func exportHTMLToPDF(html: String, frame: CGRect) -> Data {
        // Set a printable frame and inset
        let pageFrame = frame
        let insetRect = pageFrame.insetBy(dx: 10.0, dy: 10.0)

        // Create a UIPrintPageRenderer and set the paperRect and printableRect using above values.
        let pageRenderer = UIPrintPageRenderer()
        pageRenderer.setValue(pageFrame, forKey: "paperRect")
        pageRenderer.setValue(insetRect, forKey: "printableRect")

        // Create a printFormatter and pass the HTML code as a string.
        let printFormatter = UIMarkupTextPrintFormatter(markupText: html)

        // Add the printFormatter to the pageRenderer
        pageRenderer.addPrintFormatter(printFormatter, startingAtPageAt: 0)

        // This data var is where the PDF will be stored once created.
        let data = NSMutableData()

        // This is where the PDF gets drawn.
        UIGraphicsBeginPDFContextToData(data, pageFrame, nil)
        UIGraphicsBeginPDFPage()
        pageRenderer.drawPage(at: 0, in: UIGraphicsGetPDFContextBounds())
        print("bounds: " + UIGraphicsGetPDFContextBounds().debugDescription)
        UIGraphicsEndPDFContext()

        return data as Data
    }

This project is also on Github: https://github.com/starkindustries/PrintPDFTest

Troubleshooting Steps Taken

  1. Dimensions: I tried playing around with the size of the PDF dimensions. This did not make a difference in the result.
  2. Columns: I tried using multiple columns instead of just one. Same problem; no difference.
  3. Tables: I tried using just one table instead of two. With one table and two cells, the bottom cell still gets cut off when text is centered; bottom cell does not get cut when text is left-aligned.
  4. iPhones: I tried simulating on different iPhone devices (iPhone 6s, iPhone 8, iPad Pro). Same issue.
  5. Div: If you use a div instead of a second table, the problem is magically fixed. But why does a div work and not a table?

App Notes

The top toolbar has two buttons: green and red. Green makes the text left-aligned. Red makes the text centered.

enter image description here

The bottom toolbar has five buttons: rewind, html, portrait, landscape, and fast forward. The rewind button reduces the main table's height by 10px. Fast forward increases the height by 10px. The html button shows the html view. Portrait and landscape show PDFs in their respective orientations.

enter image description here

0 Answers


Thursday, December 14, 2017

Wrong title and file name displayed for PDF in primefaces p:media component


Using <p:media value="#{commonManagedBean.pdfStreamContent}" player="pdf"/> with streamed content.

Noticed that the PDF viewer's title shows "dynamiccontent.properties" while the PDF is loading and then changes to the actual title of the PDF.

Also, on downloading the PDF from the PDF viewer, the file name is displayed as "dynamiccontent.pdf" in the browser.

    public DefaultStreamedContent getPdfStreamContent() {
        super.setPdfFile(new DefaultStreamedContent());
        byte[] pdfFile = getPdfList().get(0);
        if (null != pdfFile) {
            super.getPdfFile().setStream(new ByteArrayInputStream(pdfFile));
            super.getPdfFile().setContentType("application/pdf");
            super.getPdfFile().setName("document.pdf");
        }
        // Return on every code path so the method compiles.
        return super.getPdfFile();
    }

Generated HTML code of <p:media> component:

    <object
        width="950"
        height="455"
        data="/warcom/javax.faces.resource/dynamiccontent.properties.jsf?ln=primefaces&amp;v=6.0&amp;pfdrid=d6762c48-9cfd-4596-b955-b1075967f062&amp;pfdrt=sc&amp;pfdrid_c=true"
        type="application/pdf">
    </object>

I have tried the code below, but it does not seem to work:

    new DefaultStreamedContent(getData(), "application/pdf", "test.pdf");

How can I change the file name to something else?

Using PrimeFaces 6.0

0 Answers


Saturday, December 9, 2017

How to Extract Table as Text from the PDF using python


I have a PDF which contains tables, text and some images. I want to extract the tables wherever they appear in the PDF.

Right now I am finding the tables manually, page by page. From there I capture that page and save it into another PDF.

    import PyPDF2

    PDFfilename = "Sammamish.pdf"  # filename of your PDF/directory where your PDF is stored

    pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb"))  # PdfFileReader object

    pg4 = pfr.getPage(126)  # extract pg 127

    writer = PyPDF2.PdfFileWriter()  # create PdfFileWriter object
    # add pages
    writer.addPage(pg4)

    NewPDFfilename = "allTables.pdf"  # filename of your PDF/directory where you want your new PDF to be
    with open(NewPDFfilename, "wb") as outputStream:
        writer.write(outputStream)  # write pages to new PDF

My goal is to extract the tables from the whole PDF document.

Please have a look at the sample image of a page in PDF

Please help me out Guys. Thanks in advance.

4 Answers

Answers 1

In my opinion you have 4 possibilities:

  • You may process the PDF directly using tabula

  • You may convert the PDF to text using pdftotext, then parse the text with Python

  • You may use an external tool to convert your PDF file to Excel or CSV, then use the required Python module to open the Excel/CSV file

  • You may also convert the PDF to an image file, then use any recent OCR software (which reconstructs tables automatically from the picture) to get the data
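As an illustration of the second option, here is a rough Python sketch of the parsing step only. It assumes the text was already produced by a layout-preserving conversion such as `pdftotext -layout`, and that table cells are separated by runs of two or more spaces; the function name `parse_layout_rows` and the sample text are made up for this example.

```python
import re


def parse_layout_rows(text, min_columns=2):
    """Split layout-preserved text into rows of cells.

    Assumption: cells are separated by two or more spaces, and any line
    with fewer than min_columns cells is ordinary prose, not a table row.
    """
    rows = []
    for line in text.splitlines():
        cells = [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


# Invented sample text standing in for converter output.
sample = """
Annual report

City         Population    Area
Sammamish    65000         24.0
Redmond      73000         17.2
"""
rows = parse_layout_rows(sample)
```

This heuristic breaks on cells that contain double spaces or on tables with empty cells, which is one reason a dedicated tool such as tabula may be the better first choice.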

Your question is near similar with:

Regards

Answers 2

Just as a keyword for your further research: there is also the option to use zonal OCR. I have used this with good success in a project. But this method is not suited for high-volume/high-speed processing, and it requires you to define an extraction template for each field you need:

enter image description here

On the plus side, since it works visually, it works with any kind of table (text, image, scan).

Answers 3

You can try to convert your PDF file to an Excel file, then use the openpyxl library to extract the data from the Excel file, add that data to an array and convert the array to JSON.

Answers 4

Try converting your file to a CSV file (for example via Excel) and read it using the CSV reader.

    import csv

    with open('file_name') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for line in reader:
            standards = line['standards']
            .......

Refer python documents here for CSV module https://docs.python.org/3.6/library/csv.html

Also you can follow this answer Extract PDF files


Friday, November 24, 2017

How to serve blob and have good filename for all users?


I have a PDF file as a blob object. I want to serve to my users, and right now I'm doing:

    html = '<iframe src="' + URL.createObjectURL(blob) + '">';

That works fine for people that want to use their in-browser PDF tool.

But...some people have their browser set to automatically download PDFs. For those people, the name of the downloaded file is some random string based on the blob URL. That's a bad experience for them.

I know I can also do:

<a href="blobURL" download="some-filename.pdf"> 

But that's a bad experience for the people who want to use in-browser PDF readers, since it forces them to download the file.

Is there a way to make everybody have good file names and to allow everybody to read the PDF the way they want to (in their browser or in their OS's reader)?

Thanks

3 Answers

Answers 1

At least looking at Google Chrome, if the user disables the PDF Viewer (using the option "Download PDF files instead of automatically opening them in Chrome") then window.navigator.plugins will show neither "Chromium PDF Plugin" nor "Chromium PDF Viewer". If the option is left at the default setting, the viewer will show in the plugin list.

Using this method, one can utilize window.navigator.plugins to check if any of the elements' names are either of the aforementioned plugins. Then, depending upon that result, either display a <iframe> or a <a href="blobUrl" download="file.pdf">. For other browsers I imagine that different methods would have to be used. You can also check for a "Acrobat Reader" plugin, which some machines may have instead, or even just the word "PDF".

On a side note, it does look like it is possible to detect if the default Firefox PDF viewer is enabled by using http://www.pinlady.net/PluginDetect/PDFjs/ .

Answers 2

Try to append &filename=thename.pdf to the binary, metadata or http header:

Content-Disposition: attachment; filename="thename.pdf"
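Note that this header only applies when the PDF is served over HTTP, not to a blob URL, and that `attachment` asks the browser to download the file. A server that wants in-browser viewers to keep working while still suggesting a file name could send `inline` instead. The Python sketch below builds such a header value; the helper name `content_disposition` is hypothetical, and the `filename*` form follows the extended-parameter style for non-ASCII names.

```python
from urllib.parse import quote


def content_disposition(filename, inline=True):
    """Build a Content-Disposition header value with a suggested file name.

    Hypothetical helper: emits a plain ASCII fallback plus a
    percent-encoded UTF-8 filename* parameter.
    """
    disposition = "inline" if inline else "attachment"
    # ASCII fallback: replace non-ASCII characters and double quotes.
    fallback = filename.encode("ascii", "replace").decode("ascii").replace('"', "'")
    encoded = quote(filename, safe="")
    return f"{disposition}; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


header_value = content_disposition("thename.pdf")
```

How each browser uses the suggested name for an inline document is up to the browser, so this is worth testing with the viewers you care about.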

Answers 3

I have looked through the documentation of createObjectURL(blob), it will always return a unique and specific format of url. It is not possible to change the URL here.

The plugin thing is not consistent across browsers.

Now here is my radical idea

  1. Find or create (if not available) a JS library that can create PDF files from a blob and save them to the server. (I looked through some of them, like 'jsPDF' and 'pdfkit', but none of them use blobs.)

  2. Save the file to the server with a valid name.

  3. Use that name in the iframe.


Friday, September 29, 2017

save a web view content as pdf file

Note: I am working with Swift 4 on OS X. I would like to generate a PDF file from a WebView.

At the moment my WebView loads an HTML string and shows it successfully. If I press a button, the print panel opens and is ready to print my WebView content in the correct format.

This is my print code:

    var webView = WebView()
    var htmlString = "MY HTML STRING"

    override func viewDidLoad() {
        webView.mainFrame.loadHTMLString(htmlString, baseURL: nil)
    }

    func webView(_ sender: WebView!, didFinishLoadFor frame: WebFrame!) {

        let printInfo = NSPrintInfo.shared
        printInfo.paperSize = NSMakeSize(595.22, 841.85)
        printInfo.isHorizontallyCentered = true
        printInfo.isVerticallyCentered = true
        printInfo.orientation = .portrait
        printInfo.topMargin = 50
        printInfo.rightMargin = 0
        printInfo.bottomMargin = 50
        printInfo.leftMargin = 0
        printInfo.verticalPagination = .autoPagination
        printInfo.horizontalPagination = .fitPagination
        //webView.mainFrame.frameView.printOperation(with: printInfo).run()

        let printOp: NSPrintOperation = NSPrintOperation(view: webView.mainFrame.frameView.documentView, printInfo: printInfo)
        printOp.showsPrintPanel = true
        printOp.showsProgressPanel = false
        printOp.run()

    }

enter image description here

Now I would like to have another button, which saves the content directly as a PDF file.

I tried this:

    let pdfData = webView.mainFrame.frameView.documentView.dataWithPDF(inside: webView.mainFrame.frameView.documentView.frame)
    let document = PDFDocument(data: pdfData)
    let fileURL = try! FileManager.default.url(for: .documentDirectory, in: .userDomainMask, appropriateFor: nil, create: false).appendingPathComponent("myPDF.pdf")
    document?.write(to: fileURL)

But the resulting PDF looks horrible:

enter image description here

Does anybody have an idea?

UPDATE This is a part of my web view result:

enter image description here

and that is the result of my print preview / PDF file. The color is missing, but not everywhere; the "logo" (picture) is shown in color: enter image description here

1 Answers

Answers 1

The problem is that dataWithPDF uses the view's current frame to decide how wide the generated PDF should be. Since your WebView's frame in your app is probably narrower than an 8.5 x 11" page, you're getting a PDF that is not wide enough. You could adjust the WebView's frame to the right size, make the PDF, and then adjust it back, or you could create a new WebView, render it offscreen, set it to the appropriate size, and create the PDF. That's a bit of a pain in the backside, though. Wouldn't it be great if there were a way to programmatically do what the "PDF" button in the Print dialog does, since the print system handles all this stuff for you automatically?

Well, turns out you can! But you have to dive into the poorly-documented world of Core Printing.

    func makePDF(at url: URL, for webView: WebView, printInfo: NSPrintInfo) throws {
        webView.preferences.shouldPrintBackgrounds = true

        guard let printOp = webView.mainFrame.frameView.printOperation(with: printInfo) else {
            throw MyPrintError.couldntGetPrintOperation // or something like this
        }

        let session = PMPrintSession(printOp.printInfo.pmPrintSession())
        let settings = PMPrintSettings(printOp.printInfo.pmPrintSettings())

        if PMSessionSetDestination(session,
                                   settings,
                                   PMDestinationType(kPMDestinationFile),
                                   kPMDocumentFormatPDF as CFString,
                                   url as CFURL) != noErr {
            throw MyPrintError.couldntSetDestination // or something like this
        }

        printOp.showsPrintPanel = false
        printOp.run()
    }

The key is the PMSessionSetDestination call, which allows us to configure the print session to print to a PDF instead of to an actual printer. Then we just tell NSPrintOperation not to show the print panel, run the operation, and presto! PDF printed.


Wednesday, August 23, 2017

Limitations on opening pdf file in Android

I am trying to open some PDFs from my Android application. I am using an Intent to do that:

    Intent intent = new Intent();
    intent.setDataAndType(Uri.parse(url), "application/pdf");
    startActivity(intent);

This code works well for some PDFs but it fails when I try to open others.

This is the message that Android is showing to me:

There is a problem with the file.

I should mention that the PDFs that open without problems are created with one Crystal Report template and the PDFs that fail are created with another one.

By contrast, if I open the URLs of the failing PDFs in my browser (on my computer), I do not get any error, so I guess there is some limitation on Android that affects some PDFs and not others (depending on the Crystal Report template), but I cannot see it.

What limitations exist on opening a PDF file on Android? (Size, some parameters of Crystal Report that are not allowed, etc...)

I have ruled out a size limitation because the PDF files that cause problems are smaller than the files that do not give any error.

Tests I have done:

  • Opening wrong PDFs on browser ~~> OK
  • Downloading wrong PDF on mobile phone and open it ~~> OK
  • Opening wrong PDFs on APP ~~> Error
  • Opening good PDF on APP of the company that PDFs crash ~~> OK

EDIT

I have noticed that I was using the http:// protocol but the PDF is served over https://, so I have changed it in the Uri.parse call.

When I made this change, the app crashed and an error was shown in the log:

android.content.ActivityNotFoundException: No Activity found to handle Intent

Also, I have noticed that the PDFs that do not give me any error are at URLs with the http:// protocol instead of https://, so I guess that the https:// protocol may be the problem.

Am I only able to open http:// requests with an Intent?

9 Answers

Answers 1

It could be that the Android PDF viewer app failed to interpret the file. Have you tried copying/downloading the exact same file to your Android phone and opening it from there?

Also, I'd suggest using Intent.createChooser to launch the PDF viewer, to be safe in case no PDF viewer is installed and to give the user the option to choose an app:

    Intent intent = new Intent();
    intent.setDataAndType(Uri.parse(url), "application/pdf");
    Intent chooserIntent = Intent.createChooser(intent, "Open Report");
    startActivity(chooserIntent);

Answers 2

I found a workaround to view my PDF in my Android application (but it does not allow me to download the file after it is shown in the application). If I open my PDF through Google Docs, I can open it with my Intent.

This is what I had to add before my url:

https://docs.google.com/gview?embedded=true&url= 

so the final url will be like:

https://docs.google.com/gview?embedded=true&url=https://myUrl.pdf 

Here is the whole code I am using right now:

    String url = "https://docs.google.com/gview?embedded=true&url=https://url.pdf";
    Intent intent = new Intent(Intent.ACTION_VIEW, Uri.parse(url));
    startActivity(intent);

But this is still not enough, because I would like to open it without needing a third-party application. Also, opening a PDF with Google Docs is slow and I have to wait too long until the PDF finally opens.

If anyone knows how to open it with native Android please let me know.


What happens if I do not open it with Google Docs?

With the same code, but using just my URL instead of the Google Docs URL, Android lets me choose between two applications: Adobe Acrobat Reader and Google Chrome.

  • If I open it with Google Chrome, it downloads the PDF but does not open it automatically.

  • If I open it with Adobe Acrobat Reader, it gives me the following error:

    The PDF cannot be shown (it cannot be opened)

Answers 3

Simply add the following library for Android:

//in app level build

    compile 'com.joanzapata.pdfview:android-pdfview:1.0.4@aar'

//inside your xml file

    <com.joanzapata.pdfview.PDFView
        android:id="@+id/pdfview"
        android:layout_width="match_parent"
        android:layout_height="match_parent"/>

//inside java code

    pdfView.fromAsset(pdfName)
        .pages(0, 2, 1, 3, 3, 3)
        .defaultPage(1)
        .showMinimap(false)
        .enableSwipe(true)
        .onDraw(onDrawListener)
        .onLoad(onLoadCompleteListener)
        .onPageChange(onPageChangeListener)
        .load();

For more info, use this GitHub link:

https://github.com/JoanZapata/android-pdfview

Answers 4

You can use a WebView to open a PDF from a URL like this (yourUrl is the URL of your PDF):

    webview.loadUrl("http://drive.google.com/viewerng/viewer?embedded=true&url=" + yourUrl);

Answers 5

If API >= 21 you can use PdfRenderer to create a bitmap of each page, but the result is view-only, not editable. Here is an example I made up on the fly; it lacks navigation buttons, but those shouldn't be too hard to implement.

    PdfRenderer renderer = new PdfRenderer(ParcelFileDescriptor.open(new File("/path/to/file.pdf"),
            ParcelFileDescriptor.MODE_READ_ONLY));
    PdfRenderer.Page page = renderer.openPage(0);
    Bitmap bitmap = Bitmap.createBitmap(page.getWidth(), page.getHeight(),
            Bitmap.Config.ARGB_8888);
    page.render(bitmap, null, null, PdfRenderer.Page.RENDER_MODE_FOR_DISPLAY);
    imageView.setImageBitmap(bitmap);
    page.close();
    renderer.close();

Edit

PdfRenderer requires a local file for the file descriptor. So viewing through the "cloud" is, to my knowledge, not possible with this approach.

Answers 6

When using an intent to open a PDF file over the https:// protocol, https:// is definitely not the problem.

I see that you are trying this method, which defines the data type:

    Intent intent = new Intent();
    intent.setDataAndType(Uri.parse(url), "application/pdf");
    startActivity(intent);

but it will cause:

ActivityNotFoundException: No Activity found to handle Intent

If you use this other method, you probably won't see PDF readers among the options to open this kind of file:

    Intent intent = new Intent();
    intent.setDataAndType(Uri.parse(url), "application/pdf");
    Intent chooserIntent = Intent.createChooser(intent, "Open Report");
    startActivity(chooserIntent);

Try this method without the type definition instead; it should work:

    Intent intent = new Intent(Intent.ACTION_VIEW);
    intent.setData(Uri.parse(url));
    startActivity(intent);

Another possible cause of this problem when opening the PDF file via an intent is the default application for this kind of file: you may have a corrupt or invalid application configured, so reset the defaults:

Go to Settings. Tap Applications. Tap Default Applications.

Select the app and tap Clear defaults.

Answers 7

You can try this,

    public void OpenPDFFile(String file_name) {
        try {
            File pdfFile = new File(getFile(), file_name); // File path
            if (pdfFile.exists()) // Check whether the file exists
            {
                Uri path = Uri.fromFile(pdfFile);
                Intent objIntent = new Intent(Intent.ACTION_VIEW);
                objIntent.setDataAndType(path, "application/pdf");
                objIntent.setFlags(Intent.FLAG_ACTIVITY_NO_HISTORY);
                try {
                    startActivity(objIntent);
                    Log.e("IR", "No exception");
                }
                catch (ActivityNotFoundException e) {
                    Log.e("IR", "error: " + e.getMessage());
                    Toast.makeText(DownloadedPdfActivity.this,
                            "No Application Available to View PDF",
                            Toast.LENGTH_SHORT).show();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

or

    mWebviewPdf.getSettings().setJavaScriptEnabled(true);     mWebviewPdf.getSettings().setLoadWithOverviewMode(true);     mWebviewPdf.getSettings().setUseWideViewPort(true);     pDialog = new ProgressDialog(PdfActivity.this, android.R.style.Theme_DeviceDefault_Dialog);     pDialog.setTitle("PDF");     pDialog.setMessage("Loading...");     pDialog.setIndeterminate(false);     pDialog.setCancelable(false);     mWebviewPdf.setWebViewClient(new WebViewClient() {         @Override         public void onPageStarted(WebView view, String url, Bitmap favicon) {             super.onPageStarted(view, url, favicon);             pDialog.show();         }          @Override         public void onPageFinished(WebView view, String url) {             super.onPageFinished(view, url);             pDialog.dismiss();         }     });     **mWebviewPdf.loadUrl("https://docs.google.com/gview?embedded=true&url=" + mUrl);**    String url= "https://docs.google.com/gview?embedded=true&url=" + mUrl;  Intent intent = new Intent(Intent.ACTION_VIEW, Uri.parse(url));  startActivity(intent); 

You can show the PDF in a WebView.

Answers 8

You can use the following library for Android. In the app-level build file, add:

compile 'com.github.barteksc:android-pdf-viewer:2.6.1'

in your xml:

<com.github.barteksc.pdfviewer.PDFView
    android:id="@+id/pdfView"
    android:layout_width="match_parent"
    android:layout_height="match_parent"/>

in your activity/ fragment:

pdfView.fromUri(Uri)
or
pdfView.fromFile(File)
or
pdfView.fromBytes(byte[])
or
pdfView.fromStream(InputStream) // stream is written to bytearray - native code cannot use Java Streams
or
pdfView.fromSource(DocumentSource)
or
pdfView.fromAsset(String)

For more information, refer to this link.

Answers 9

Your problem is that the PDF is downloaded into the app's data/data folder. When you try to open it using the default intent, the external app doesn't have permission to open a file inside the data/data folder. You need to download the file to a directory outside the app's private storage and then open it from there; that should work.


Friday, August 11, 2017

Limitations on opening pdf file in Android


I am trying to open some PDFs from my Android application. I am using an Intent to do that:

Intent intent = new Intent();
intent.setDataAndType(Uri.parse(url), "application/pdf");
startActivity(intent);

This code works well for some PDFs but fails when I try to open others.

This is the message that Android shows me:

There is a problem with the file.

I should mention that the PDFs that open without problems are created with one Crystal Report template, and the PDFs that fail are created with another one.

By contrast, if I open the URL of the failing PDFs in the browser on my computer, they open without any error. So I guess there is some limitation on Android that affects one kind of PDF (one Crystal Report template) but not the other, though I cannot see what it is.

What limitations exist on opening a pdf file on Android? (Size, some parameters of Crystal Report that are not allowed, etc...)

I have ruled out a size limitation, because the PDF files that cause problems are smaller than the files that open without errors.

1 Answers

Answers 1

It could be that the Android PDF viewer app failed to interpret the file. Have you tried copying or downloading the exact same file to your Android phone and opening it from there?

Also, I'd suggest using Intent.createChooser to launch the PDF viewer, to be safe when no PDF viewer is installed and to give the user the option to choose the app:

Intent intent = new Intent();
intent.setDataAndType(Uri.parse(url), "application/pdf");
Intent chooserIntent = Intent.createChooser(intent, "Open Report");
startActivity(chooserIntent);

Monday, July 24, 2017

Faking a Streaming Response in Django to Avoid Heroku Timeout


I have a Django app that uses django-wkhtmltopdf to generate PDFs on Heroku. Some of the responses exceed the 30-second timeout. Because this is a proof of concept running on the free tier, I'd prefer not to tear apart what I have to move to a worker/poll process. My current view looks like this:

def dispatch(self, request, *args, **kwargs):
    do_custom_stuff()
    return super(MyViewClass, self).dispatch(request, *args, **kwargs)

Is there a way I can override the dispatch method of the view class to fake a streaming response like this, or use the Empty Chunking approach mentioned here, to send an empty response until the PDF is rendered? Sending an empty byte will restart the timeout process, giving plenty of time to send the PDF.
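The idea in the question can be sketched in plain Python as a generator that emits filler bytes while the slow job runs in a background thread, then emits the real payload. This is only a sketch under assumptions: `keepalive_stream`, `slow_pdf`, the filler byte and the interval are all hypothetical names and values, and whether filler bytes in front of the payload are acceptable depends on the client reading the response.

```python
import threading
import time


def keepalive_stream(render, interval=1.0, filler=b" "):
    """Yield filler bytes until render() finishes, then yield its result."""
    result = {}

    def worker():
        # Run the slow job in a background thread and capture the outcome.
        try:
            result["body"] = render()
        except Exception as exc:  # re-raised in the consumer below
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    while thread.is_alive():
        yield filler                   # keeps bytes flowing to the client
        thread.join(timeout=interval)  # wait up to `interval` seconds

    if "error" in result:
        raise result["error"]
    yield result["body"]


def slow_pdf():
    # Hypothetical stand-in for the slow PDF rendering step.
    time.sleep(0.3)
    return b"%PDF-1.4 ..."


chunks = list(keepalive_stream(slow_pdf, interval=0.1))
```

In a Django view, a generator like this could presumably be passed to `StreamingHttpResponse`. Note that the status code and headers are sent before the first chunk, so a rendering failure can no longer be reported as an HTTP error status.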

1 Answers

Answers 1

I solved a similar problem using Celery, something like this.

from django.http import JsonResponse

def start_long_process_view(request, pk):
    # Queue the long-running job and return its task id right away.
    task = do_long_processing_stuff.delay()
    return JsonResponse({"task": task.id})

Then you can have a second view that can check the task state.

from celery.result import AsyncResult
from django.http import JsonResponse

def check_long_process(request, task_id):
    # Look up the task by id and report its current state.
    result = AsyncResult(task_id)
    return JsonResponse({"state": result.state})

Finally, using JavaScript you can fetch the status right after the task has been started. Updating every half second will be more than enough to give your users good feedback.

If you think Celery is too much, there are lighter alternatives that will work just fine: https://djangopackages.org/grids/g/workers-queues-tasks/
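The polling logic itself is small. The answer suggests doing it in JavaScript in the browser; the sketch below shows the same loop in Python purely to illustrate the control flow. `wait_for_task` and `fetch_state` are hypothetical names, and `fetch_state` stands in for whatever HTTP call reads the `state` field from the status view. `SUCCESS` and `FAILURE` are Celery's built-in names for finished states.

```python
import time


def wait_for_task(fetch_state, interval=0.5, timeout=60.0):
    """Poll fetch_state() until it reports a finished state or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state in ("SUCCESS", "FAILURE"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {state!r} after {timeout} seconds")
        time.sleep(interval)


# Stand-in for the status endpoint: reports PENDING twice, then SUCCESS.
_states = iter(["PENDING", "PENDING", "SUCCESS"])
final_state = wait_for_task(lambda: next(_states), interval=0.01)
```

The timeout is there so that a task that never finishes does not leave the client polling forever.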


Saturday, July 22, 2017

Changing the text and background color with Apple's PDFKit framework


I'd like to change the text and background color of a displayed PDF document using Apple's PDFKit Framework to show the documents in "Night Mode" (dark background, light foreground, just like in Adobe Reader).

I know the PDFPage class has a drawWithBox:toContext: method, which can be overridden in a subclass to add effects (like a watermark, as shown in this WWDC 2017 session), but I don't know how to set the color properties.

Is there a way to do this with the PDFKit library or any other low-level API (Quartz) from Apple?

1 Answers

Answers 1

To set the text color in a PDF, you can use:

// pageSize and textWidth are assumed to be defined elsewhere (e.g. as instance variables).
-(CGRect)addText:(NSString*)text withFrame:(CGRect)frame font:(NSString*)fontName fontSize:(float)fontSize andColor:(UIColor*)color{
    const CGFloat *val = CGColorGetComponents(color.CGColor);

    CGContextRef currentContext = UIGraphicsGetCurrentContext();
    CGContextSetRGBFillColor(currentContext, val[0], val[1], val[2], val[3]);
    UIFont *font = [UIFont fontWithName:fontName size:fontSize];
    CGSize stringSize = [text sizeWithFont:font
                         constrainedToSize:CGSizeMake(pageSize.width - 2*20-2*20, pageSize.height - 2*20 - 2*20)
                             lineBreakMode:NSLineBreakByWordWrapping];

    CGRect renderingRect = CGRectMake(frame.origin.x, frame.origin.y, textWidth, stringSize.height);

    [text drawInRect:renderingRect withFont:font lineBreakMode:NSLineBreakByWordWrapping alignment:NSTextAlignmentLeft];

    frame = CGRectMake(frame.origin.x, frame.origin.y, textWidth, stringSize.height);

    return frame;
}

And for the background color, use the following:

CGContextRef currentContext = UIGraphicsGetCurrentContext();
CGContextSetFillColorWithColor(currentContext, [UIColor blueColor].CGColor);
CGContextFillRect(currentContext, CGRectMake(0, 110.5, pageSize.width, pageSize.height));

Friday, July 21, 2017

Xamarin Android: How to Share PDF File From Assets Folder? Via Whatsapp I get message, the file you picked was not a document


I use Xamarin Android. I have a PDF file stored in the Assets folder of my Xamarin Android project.


I want to share this file in WhatsApp, but I receive a message: "the file you picked was not a document"


I tried two ways:

This is the first way

var SendButton = FindViewById<Button>(Resource.Id.SendButton);
SendButton.Click += (s, e) =>
{
    // Create a new file in the external storage and copy the file from assets folder to external storage folder
    Java.IO.File dstFile = new Java.IO.File(Environment.ExternalStorageDirectory.Path + "/my-pdf-File--2017.pdf");
    dstFile.CreateNewFile();

    var inputStream = new FileInputStream(Assets.OpenFd("my-pdf-File--2017.pdf").FileDescriptor);
    var outputStream = new FileOutputStream(dstFile);
    CopyFile(inputStream, outputStream);

    // to let system scan the audio file and detect it
    Intent intent = new Intent(Intent.ActionMediaScannerScanFile);
    intent.SetData(Uri.FromFile(dstFile));
    this.SendBroadcast(intent);

    // share the Uri of the file
    var sharingIntent = new Intent();
    sharingIntent.SetAction(Intent.ActionSend);
    sharingIntent.PutExtra(Intent.ExtraStream, Uri.FromFile(dstFile));
    sharingIntent.SetType("application/pdf");

    this.StartActivity(Intent.CreateChooser(sharingIntent, "@string/QuotationShare"));
};

This is the second

// Other way
var SendButton2 = FindViewById<Button>(Resource.Id.SendButton2);
SendButton2.Click += (s, e) =>
{
    Intent intent = new Intent(Intent.ActionSend);
    intent.SetType("application/pdf");

    Uri uri = Uri.Parse(Environment.ExternalStorageDirectory.Path + "/my-pdf-File--2017.pdf");
    intent.PutExtra(Intent.ExtraStream, uri);

    try
    {
        StartActivity(Intent.CreateChooser(intent, "Share PDF file"));
    }
    catch (System.Exception ex)
    {
        Toast.MakeText(this, "Error: Cannot open or share created PDF report. " + ex.Message, ToastLength.Short).Show();
    }
};

Alternatively, when I share via email, the PDF file is sent empty ("corrupt file").

What can I do?

2 Answers

Answers 1

The solution is to copy the .pdf file from the assets folder to local storage. Then we are able to share the file.

First copy the file:

string fileName = "my-pdf-File--2017.pdf";

var localFolder = Android.OS.Environment.ExternalStorageDirectory.AbsolutePath;
var MyFilePath = System.IO.Path.Combine(localFolder, fileName);

using (var streamReader = new StreamReader(Assets.Open(fileName)))
{
    using (var memstream = new MemoryStream())
    {
        streamReader.BaseStream.CopyTo(memstream);
        var bytes = memstream.ToArray();
        // write to local storage
        System.IO.File.WriteAllBytes(MyFilePath, bytes);

        MyFilePath = $"file://{localFolder}/{fileName}";
    }
}

Then share the file, from local storage:

var fileUri = Android.Net.Uri.Parse(MyFilePath);

var intent = new Intent();
intent.SetFlags(ActivityFlags.ClearTop);
intent.SetFlags(ActivityFlags.NewTask);
intent.SetAction(Intent.ActionSend);
intent.SetType("*/*");
intent.PutExtra(Intent.ExtraStream, fileUri);
intent.AddFlags(ActivityFlags.GrantReadUriPermission);

var chooserIntent = Intent.CreateChooser(intent, title);
chooserIntent.SetFlags(ActivityFlags.ClearTop);
chooserIntent.SetFlags(ActivityFlags.NewTask);
Android.App.Application.Context.StartActivity(chooserIntent);

Answers 2

the file you picked was not a document

I had this issue when trying to share a .pdf file from the assets folder via WhatsApp; it gave me the same error as in your question:

the file you picked was not a document 

Finally I found a solution: copy the .pdf file from the assets folder to the Download folder. It works fine:

var pathFile = Android.OS.Environment.GetExternalStoragePublicDirectory(Android.OS.Environment.DirectoryDownloads);

Java.IO.File dstFile = new Java.IO.File(pathFile.AbsolutePath + "/my-pdf-File--2017.pdf");

