Thursday, July 13, 2017

PhantomJS (Selenium) Cannot Load PDFs from direct urls


I was trying to save a PDF from a link via PhantomJS (Selenium), so I referred to this code that turns webpages into PDFs. It worked just fine when I ran exactly the same code.

So, I have this pdf I wanted to save from a direct url and I tried that script... it didn't work. It just saves a PDF with 1 white page. That's all...

My Code :

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By


    def execute(script, args):
        driver.execute('executePhantomScript', {'script': script, 'args': args})

    driver = webdriver.PhantomJS('phantomjs')

    # hack while the python interface lags
    driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

    driver.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')

    try:
        WebDriverWait(driver, 40).until(EC.presence_of_element_located((By.ID, 'plugin')))
    except Exception as TimeoutException:
        print("I waited for far too long and still couldn't find the view.")
        pass

    # set page format
    # inside the execution script, webpage is "this"
    pageFormat = '''this.paperSize = {format: "A4", orientation: "portrait" };'''
    execute(pageFormat, [])

    # render current page
    render = '''this.render("test2.pdf")'''
    execute(render, [])

I'm not sure what's happening or why. I need some assistance.

EDIT: This is just the test PDF that I was trying to get via Selenium. There are some other PDFs which I need to get and that website is checking god-knows-what to decide whether it's a human or a bot. So, Selenium is the only way.

EDIT 2 : So, here's the website I was practicing on : http://services.ecourts.gov.in/ecourtindia/cases/case_no.php?state_cd=26&dist_cd=8&appFlag=web

Select "Cr Rev - Criminal Revision" from "Case Type" drop down and input any number in case number and year. Click on "Go".

This will show a little table, click on "view" and it should show a table on full page.

Scroll down to the "orders" table and you should see "Copy of order". That's the PDF I'm trying to get. I have tried requests as well and it did not work.
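The thread does not say why the plain requests attempt failed on this site. One thing that may be worth trying, purely as an assumption, is to carry the cookies of the Selenium session over to a separate HTTP request for the PDF. The sketch below only builds such a request with the standard library; `cookie_header` and `build_request` are hypothetical helper names, and the input is assumed to be the list of dicts that `driver.get_cookies()` returns (each with `name` and `value` keys).

```python
import urllib.request


def cookie_header(selenium_cookies):
    """Join cookie dicts (as returned by driver.get_cookies()) into a Cookie header value."""
    return "; ".join(
        "%s=%s" % (cookie["name"], cookie["value"]) for cookie in selenium_cookies
    )


def build_request(url, selenium_cookies, user_agent):
    """Build a request that reuses the browser session's cookies and user agent."""
    headers = {
        "Cookie": cookie_header(selenium_cookies),
        "User-Agent": user_agent,
    }
    return urllib.request.Request(url, headers=headers)
```

The returned object could then be passed to `urllib.request.urlopen`, or the same headers could be handed to `requests.get(url, headers=...)`. Whether the site accepts such a request depends on checks that are not described here.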

3 Answers

Answer 1

Currently, PhantomJS and headless Chrome don't support downloading files. If you are OK with the Chrome browser, please see my example below. It finds the link (a) elements, adds a download attribute to each one, and finally clicks the link to download the file to the default Downloads folder.

    import time

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('http://www.planetpublish.com/free-ebooks/93/heart-of-darkness/')

    # collect the download links on the page
    pdfLinks = driver.find_elements_by_css_selector(".entry-content ul > li > a")
    for pdfLink in pdfLinks:
        # add a download attribute (set to the link text) before clicking
        script = "arguments[0].setAttribute('download',arguments[1]);"
        driver.execute_script(script, pdfLink, pdfLink.text)
        time.sleep(1)
        pdfLink.click()
        # give the download a few seconds before moving on
        time.sleep(3)

    driver.quit()
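The fixed `time.sleep(3)` above may be too short for a large file. As a rough alternative, the sketch below polls a download folder until Chrome's temporary `.crdownload` files are gone. `wait_for_download` is a hypothetical helper, and it assumes the browser saves into a dedicated folder that starts out empty (not the shared default Downloads folder, which would usually make it return immediately).

```python
import os
import time


def wait_for_download(directory, timeout=30.0, poll_interval=0.5):
    """Return the sorted file names in `directory` once downloads look finished.

    "Finished" here means the directory holds at least one file and none of
    them ends with Chrome's temporary ".crdownload" suffix. Raises
    TimeoutError if that does not happen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = sorted(os.listdir(directory))
        in_progress = [name for name in names if name.endswith(".crdownload")]
        if names and not in_progress:
            return names
        if time.monotonic() >= deadline:
            raise TimeoutError("download did not finish within %s seconds" % timeout)
        time.sleep(poll_interval)
```

In the loop above, a call such as `wait_for_download(download_dir)` could replace `time.sleep(3)`, where `download_dir` is whatever folder the browser has been configured to save into.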

Answer 2

If you're just downloading PDFs that aren't protected behind JavaScript or similar checks (essentially the straightforward case), I suggest using the requests library instead.

    import requests

    url = 'http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf'
    r = requests.get(url)

    with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
        f.write(r.content)

    # If large file
    with requests.get(url, stream=True) as r:
        with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
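When a site answers with an HTML page (an error or a bot check) instead of the document, the code above will still write it to a .pdf file. A small sanity check, sketched here under the assumption that a well-formed PDF starts with the bytes `%PDF-`, can catch that before saving. `looks_like_pdf` and `save_if_pdf` are hypothetical helper names, and the check is deliberately strict.

```python
def looks_like_pdf(data):
    """Return True if `data` (bytes) begins with the PDF header marker."""
    return data[:5] == b"%PDF-"


def save_if_pdf(data, path):
    """Write `data` to `path` only when it looks like a PDF; return True if saved."""
    if not looks_like_pdf(data):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True
```

With the first snippet above, this might be used as `save_if_pdf(r.content, 'The_Scarlet_Letter_T.pdf')`; a `False` result suggests the response body is worth inspecting.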

Answer 3

I recommend you look at the pdfkit library.

    import pdfkit

    pdfkit.from_url('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf', 'out.pdf')

It makes downloading PDFs very simple with Python. You will also need to download this for the library to work.

You could also try the code from this link, shown below:

    #!/usr/bin/env python
    from contextlib import closing
    from selenium.webdriver import Firefox # pip install selenium
    from selenium.webdriver.support.ui import WebDriverWait

    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')
        button = browser.find_element_by_name('button')
        button.click()
        # wait for the page to load
        WebDriverWait(browser, timeout=10).until(
            lambda x: x.find_element_by_id('someId_that_must_be_on_new_page'))
        # store it to string variable
        page_source = browser.page_source
    print(page_source)

which you will need to edit to make work for your pdf.
