How to implement a PDF viewer that loads pages asynchronously

We need to allow users of our mobile app to browse a magazine with an experience that is fast, fluid and feels native to the platform (similar to iBooks/Google Books).

Some features we need are thumbnails of the whole magazine and search for specific text.

The problem is that our magazines are over 140 pages long, and we can't force our users to download the whole ebook/PDF beforehand. We need pages to load asynchronously, that is, to let users start reading before the content has fully downloaded.

I studied PDFKit for iOS; however, I didn't find any mention in the documentation of downloading a PDF asynchronously.

Are there any solutions/libraries to implement this functionality on iOS and Android?

2 Answers

Answers 1

What you're looking for is called linearization. According to this answer:

The first object immediately after the %PDF-1.x header line shall contain a dictionary key indicating the /Linearized property of the file.

This overall structure allows a conforming reader to learn the complete list of object addresses very quickly, without needing to download the complete file from beginning to end:

The viewer can display the first page(s) very fast, before the complete file is downloaded.

The user can click on a thumbnail page preview (or a link in the ToC of the file) in order to jump to, say, page 445, immediately after the first page(s) have been displayed, and the viewer can then request all the objects required for page 445 by asking the remote server via byte range requests to deliver these "out of order" so the viewer can display this page faster. (While the user reads pages out of order, the downloading of the complete document will still go on in the background...)

You can use this native library to linearize a PDF.
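As a rough illustration of the two ideas in the quote, here is a minimal Python sketch. It assumes the linearization dictionary sits within the first 1024 bytes of the file and that the server honours HTTP byte range requests; `is_linearized` and `range_header` are hypothetical helper names, not part of any PDF library.

```python
# Hypothetical helpers for illustration; not part of any PDF library.

def is_linearized(head: bytes) -> bool:
    """Return True if the first bytes of a PDF declare a /Linearized dictionary."""
    return head.startswith(b"%PDF-") and b"/Linearized" in head[:1024]


def range_header(offset: int, length: int) -> dict:
    """Build the HTTP header asking for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Both ends of an HTTP byte range are inclusive, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


# Made-up file header, only to exercise the check.
sample = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 52345 /N 140 >>\nendobj\n"
print(is_linearized(sample))   # True
print(range_header(0, 1024))   # {'Range': 'bytes=0-1023'}
```

A viewer would first request the head of the file with `range_header(0, 1024)`, check it with `is_linearized`, and only then decide whether fetching pages out of order is possible.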

However, I wouldn't recommend it, as rendering the PDFs won't be fast, fluid or feel native. For those reasons, as far as I know, no native mobile app does linearization. Moreover, you would have to create your own PDF rendering engine, as most PDF viewing libraries do not support linearization.

What you should do instead is convert each individual page of the PDF to HTML on the server and have the client load pages only when required and cache them. You also save the PDF's plain text separately in order to enable search. This way everything will be smooth, as the resources are lazy loaded. To achieve this, you can do the following.

Firstly, on the server end, whenever you publish a PDF, its pages should be split into HTML files as explained above. Page thumbnails should also be generated from those pages. Assuming that your server runs Python with the Flask microframework, this is what you do.

    import io
    import os
    import sqlite3
    import subprocess

    import imgkit
    from flask import Flask, jsonify, request, send_file
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from PIL import Image
    # PyPDF2 is the successor of pyPdf; these class names exist in PyPDF2 < 3.0
    from PyPDF2 import PdfFileReader, PdfFileWriter
    from werkzeug.utils import secure_filename

    app = Flask(__name__)


    @app.route('/publish', methods=['POST'])
    def upload_file():
        f = request.files['file']
        filePath = "pdfs/" + secure_filename(f.filename)
        f.save(filePath)
        savePdfText(filePath)
        with open(filePath, "rb") as pdfFile:
            inputpdf = PdfFileReader(pdfFile)
            # Note: page files are numbered from 0, rows in the text table from 1
            for i in range(inputpdf.numPages):
                # Write page i to its own single-page PDF
                output = PdfFileWriter()
                output.addPage(inputpdf.getPage(i))
                with open("document-page%s.pdf" % i, "wb") as outputStream:
                    output.write(outputStream)
                # PDF -> HTML (pdf2htmlEX writes document-page<i>.html), HTML -> JPEG, JPEG -> thumb
                subprocess.call(["pdf2htmlEX", "--zoom", "1.3", "document-page%s.pdf" % i])
                imgkit.from_file("document-page%s.html" % i, "document-page%s.jpg" % i)
                saveThum("document-page%s.jpg" % i)
        return "published"


    def saveThum(infile):
        size = 124, 124
        outfile = os.path.splitext(infile)[0] + ".thumbnail"
        if infile != outfile:
            try:
                im = Image.open(infile)
                im.thumbnail(size, Image.LANCZOS)  # LANCZOS is the current name for ANTIALIAS
                im.save(outfile, "JPEG")
            except IOError:
                print("cannot create thumbnail for '%s'" % infile)


    def savePdfText(pdfPath):
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        db = sqlite3.connect("pdfText.db")
        cursor = db.cursor()
        cursor.execute('create table if not exists pagesTextTables(id INTEGER PRIMARY KEY,pageNum TEXT,pageText TEXT)')
        # Process each page contained in the document.
        with open(pdfPath, 'rb') as fp:
            for pageNum, page in enumerate(PDFPage.get_pages(fp), start=1):
                interpreter.process_page(page)
                cursor.execute('INSERT INTO pagesTextTables(pageNum,pageText) values(?,?)',
                               (str(pageNum), retstr.getvalue()))
                # Empty the buffer so the next row holds only the next page's text
                retstr.seek(0)
                retstr.truncate(0)
        db.commit()
        db.close()
        device.close()


    @app.route('/page', methods=['GET', 'POST'])
    def getPage():
        # request.values merges the query string and form fields
        page_num = int(request.values['page_num'])
        return send_file("document-page%s.html" % page_num, as_attachment=True)


    @app.route('/thumb', methods=['GET', 'POST'])
    def getThum():
        page_num = int(request.values['page_num'])
        return send_file("document-page%s.thumbnail" % page_num, as_attachment=True)


    @app.route('/search', methods=['GET', 'POST'])
    def search():
        query = request.values['query']
        db = sqlite3.connect("pdfText.db")
        cursor = db.cursor()
        # Bind the query as a parameter instead of concatenating it into the SQL
        cursor.execute("SELECT * from pagesTextTables Where pageText LIKE ?", ('%' + query + '%',))
        result = cursor.fetchone()
        db.close()
        return jsonify(queryResults=result)

Here is an explanation of what the flask app is doing.

The /publish route is responsible for publishing your magazine: it turns every page into HTML, saves the PDF's text to an SQLite db and generates thumbnails for those pages. I've used pyPDF for splitting the PDF into individual pages, pdf2htmlEX to convert the pages to HTML, imgkit to render that HTML to images and PIL to generate thumbs from those images. Also, a simple SQLite db saves the pages' text.
The /page, /thumb and /search routes are self-explanatory. They simply return the HTML, thumb or search query results.
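For the /search route, one way to query the pagesTextTables table safely is with a bound parameter. The sketch below is self-contained: it uses an in-memory database with made-up sample rows, and `search_pages` is a hypothetical helper name. It also escapes the LIKE wildcards and returns every matching page, both of which are assumptions about what a search feature would want rather than something the answer specifies.

```python
import sqlite3


def search_pages(db: sqlite3.Connection, query: str) -> list:
    """Return the page numbers whose stored text contains `query`."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = db.execute(
        "SELECT pageNum FROM pagesTextTables "
        "WHERE pageText LIKE ? ESCAPE '\\' ORDER BY id",
        ("%" + escaped + "%",),
    ).fetchall()
    return [int(row[0]) for row in rows]


# In-memory database with made-up rows, using the same schema as the answer.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pagesTextTables("
           "id INTEGER PRIMARY KEY, pageNum TEXT, pageText TEXT)")
db.executemany(
    "INSERT INTO pagesTextTables(pageNum, pageText) VALUES (?, ?)",
    [("1", "Editor's letter"), ("2", "Travel: 100% Lisbon"), ("3", "Lisbon by night")],
)
print(search_pages(db, "lisbon"))  # [2, 3]
```

SQLite's LIKE is case-insensitive for ASCII characters by default, which is why the lowercase query matches both pages.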

Secondly, on the client end, you simply download the HTML page whenever the user scrolls to it. Let me give you an example for Android. Firstly, you'd want to create some utils to handle the requests.

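    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    import okhttp3.MultipartBody;
    import okhttp3.OkHttpClient;
    import okhttp3.Request;
    import okhttp3.RequestBody;
    import okhttp3.Response;

    // Static helpers; place them inside a utility class of your choice.

    // One shared client so connections are reused across calls.
    private static final OkHttpClient CLIENT = new OkHttpClient.Builder()
            .connectTimeout(30, TimeUnit.SECONDS)
            .writeTimeout(30, TimeUnit.SECONDS)
            .readTimeout(30, TimeUnit.SECONDS)
            .build();

    public static byte[] GetPage(int mPageNum) throws IOException {
        return CallServer("page", "page_num", Integer.toString(mPageNum));
    }

    public static byte[] GetThum(int mPageNum) throws IOException {
        return CallServer("thumb", "page_num", Integer.toString(mPageNum));
    }

    // Posts a single form field to the given route and returns the raw response body.
    private static byte[] CallServer(String route, String requestName, String requestValue) throws IOException {
        RequestBody mRequestBody = new MultipartBody.Builder()
                .setType(MultipartBody.FORM)
                .addFormDataPart(requestName, requestValue)
                .build();
        Request request = new Request.Builder()
                .url("yourUrl/" + route)
                .post(mRequestBody)
                .build();
        // try-with-resources closes the response and releases the connection.
        try (Response response = CLIENT.newCall(request).execute()) {
            return response.body().bytes();
        }
    }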

The helper utils above simply handle the queries to the server for you; they should be self-explanatory. Next, you simply create a RecyclerView with a WebView ViewHolder, or better yet an advanced webview, as it will give you more power with customization.

    public static class ViewHolder extends RecyclerView.ViewHolder {
        private AdvancedWebView mWebView;

        public ViewHolder(View itemView) {
            super(itemView);
            mWebView = (AdvancedWebView) itemView;
        }
    }

    private class ContentAdapter extends RecyclerView.Adapter<YourFragment.ViewHolder> {
        @Override
        public ViewHolder onCreateViewHolder(ViewGroup container, int viewType) {
            return new ViewHolder(new AdvancedWebView(container.getContext()));
        }

        @Override
        public int getItemViewType(int position) {
            return 0;
        }

        @Override
        public void onBindViewHolder(ViewHolder holder, int position) {
            handlePageDownload(holder.mWebView);
        }

        private void handlePageDownload(AdvancedWebView mWebView){....}

        @Override
        public int getItemCount() {
            return numberOfPages;
        }
    }

That should be about it.
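The "load only when required and cache" behaviour that handlePageDownload is meant to provide can be sketched independently of Android. The Python below is only an illustration of the idea, assuming a small least-recently-used cache is acceptable; `PageCache` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real call to the /page route.

```python
from collections import OrderedDict
from typing import Callable


class PageCache:
    """Fetch pages on demand and keep the most recently used ones in memory."""

    def __init__(self, fetch: Callable[[int], bytes], capacity: int = 10):
        self._fetch = fetch          # e.g. a function that calls the /page route
        self._capacity = capacity
        self._pages = OrderedDict()  # page number -> page bytes, oldest first

    def get(self, page_num: int) -> bytes:
        if page_num in self._pages:
            self._pages.move_to_end(page_num)   # mark as recently used
            return self._pages[page_num]
        data = self._fetch(page_num)            # network hit only on a miss
        self._pages[page_num] = data
        if len(self._pages) > self._capacity:
            self._pages.popitem(last=False)     # evict the least recently used page
        return data


calls = []  # records which pages were actually fetched


def fake_fetch(page_num: int) -> bytes:
    # Stand-in for a real HTTP request; made up for this example.
    calls.append(page_num)
    return b"<html>page %d</html>" % page_num


cache = PageCache(fake_fetch, capacity=2)
for n in (1, 2, 1, 3):
    cache.get(n)
print(calls)  # [1, 2, 3]
```

The second request for page 1 is served from memory, so only three fetches happen for four page views. On a phone the capacity would be tuned to the memory budget, and evicted pages could be written to disk instead of dropped.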

Answers 2

I am sorry to say, but there is no library or SDK available that provides asynchronous page loading. It is next to impossible on a mobile device to open a PDF file without downloading the full file.

Solution:

I have already done R&D on this and fulfilled the same requirement in a project. I am not sure whether iBooks and Google Books use the mechanism below, but it works fine for your requirements.

Divide your PDF into n parts (e.g. if you have 150 pages in the PDF, then every part contains 15 pages; this will take some effort on the web end).
Once the first part has downloaded successfully, display it to the user while the other parts download asynchronously.
After all parts of the PDF have downloaded, use the code below to merge them.

How to Merge PDF file

    // Open the source PDF once, before the loop
    CGPDFDocumentRef newDocument = CGPDFDocumentCreateWithURL((__bridge CFURLRef)newUrl);

    UIGraphicsBeginPDFContextToFile(oldFile, paperSize, nil);

    for (size_t pageNumber = 1; pageNumber <= count; pageNumber++) {
        UIGraphicsBeginPDFPageWithInfo(paperSize, nil);

        // Get graphics context to draw the page
        CGContextRef currentContext = UIGraphicsGetCurrentContext();

        // Flip and scale context to draw the pdf correctly
        CGContextTranslateCTM(currentContext, 0, paperSize.size.height);
        CGContextScaleCTM(currentContext, 1.0, -1.0);

        // Get the page you want (owned by the document, so it needs no release)
        CGPDFPageRef newPage = CGPDFDocumentGetPage(newDocument, pageNumber);

        // Drawing the page
        CGContextDrawPDFPage(currentContext, newPage);
    }

    UIGraphicsEndPDFContext();

    // Clean up
    CGPDFDocumentRelease(newDocument);

Reference: How to merge PDF file.

Update: The main advantage of this mechanism is that the logic remains the same for all devices, Android and iOS.
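The splitting step in this answer (150 pages in parts of 15) comes down to computing page ranges on the server. A minimal Python sketch of that arithmetic follows; `split_ranges` is a hypothetical helper name, and the inclusive, 1-based numbering is an assumption chosen to match how PDF pages are usually counted.

```python
def split_ranges(total_pages: int, pages_per_part: int) -> list:
    """Return inclusive, 1-based (first_page, last_page) pairs for each part."""
    if total_pages < 1 or pages_per_part < 1:
        raise ValueError("both arguments must be positive")
    return [
        (start, min(start + pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_part)
    ]


parts = split_ranges(150, 15)
print(len(parts), parts[0], parts[-1])  # 10 (1, 15) (136, 150)
```

When the page count is not a multiple of the part size, the last part is simply shorter: a 140-page magazine in parts of 15 ends with the range (136, 140).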

Coding Question

Thursday, May 24, 2018
