Wednesday, June 14, 2017

Scraping a website with a login page


I currently log in to the Timeform website using the following script.

import time
from selenium import webdriver

browser = webdriver.Chrome('E:/Shared Folders/Users/runnerjp/chromedriver/chromedriver.exe')
browser.get("https://www.timeform.com/horse-racing/account/sign-in?returnUrl=%2Fhorse-racing%2F")
time.sleep(3)

# fill in the sign-in form
username = browser.find_element_by_id("EmailAddress")
password = browser.find_element_by_id("Password")
username.send_keys("usr")
password.send_keys("pass")

login_attempt = browser.find_element_by_xpath("//input[@type='submit']")
time.sleep(3)
login_attempt.submit()

It works, but the Chrome WebDriver hammers my CPU. Is there an alternative that does not require physically loading the page in a browser to sign in?

9 Answers

Answer 1

All of the answers here have some merit, but the right choice depends on the type of website being scraped and how it handles authentication.
If the web page generates some or all of its content through JavaScript/AJAX requests, then Selenium is the only way to go, since it can execute JavaScript. To keep CPU usage to a minimum, however, you can use a "headless" browser such as PhantomJS. PhantomJS is built on WebKit, the engine that Chrome's own rendering engine was forked from, so pages render very similarly; you can develop and test your code with Chrome and switch drivers at the end.
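A minimal sketch of that switch, assuming PhantomJS is installed and on your PATH. Apart from the driver line, this is the question's own Selenium code:

import time
from selenium import webdriver

# PhantomJS is headless: no window is rendered, so CPU usage drops
browser = webdriver.PhantomJS()
browser.get("https://www.timeform.com/horse-racing/account/sign-in?returnUrl=%2Fhorse-racing%2F")
time.sleep(3)

browser.find_element_by_id("EmailAddress").send_keys("usr")
browser.find_element_by_id("Password").send_keys("pass")
browser.find_element_by_xpath("//input[@type='submit']").submit()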

If the content of the page is "static", you can use the requests module. How you do this depends on whether the web page uses the "basic" authentication baked into the HTTP protocol (most sites don't), in which case:

import requests

requests.get('https://api.github.com/user', auth=('user', 'pass'))

as suggested by CodeMonkey

But if it uses something else, you'll have to analyse the login form to find the address the POST request is sent to, then build a request to that address, putting the username and password into fields named after the form's input elements.
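A hedged sketch of that approach for the site in the question, using requests together with BeautifulSoup (an extra dependency) to pick up any hidden form fields. The field names EmailAddress and Password come from the original script; the assumption that the form posts back to the sign-in URL itself is a guess you would verify by inspecting the form's action attribute:

import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://www.timeform.com/horse-racing/account/sign-in?returnUrl=%2Fhorse-racing%2F"

session = requests.Session()  # keeps cookies across requests

# fetch the login page first so we can copy any hidden fields (e.g. an anti-forgery token)
soup = BeautifulSoup(session.get(LOGIN_URL).text, "html.parser")
payload = {tag["name"]: tag.get("value", "")
           for tag in soup.find_all("input", type="hidden") if tag.get("name")}
payload["EmailAddress"] = "usr"  # field names taken from the question's script
payload["Password"] = "pass"

# assumption: the form posts back to the same URL
response = session.post(LOGIN_URL, data=payload)
print(response.status_code)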

Answer 2

Use requests instead. You can use it to log in:

import requests

requests.get('https://api.github.com/user', auth=('user', 'pass'))

More information here: http://docs.python-requests.org/en/master/user/authentication/

Answer 3

You can use TestCafe.


TestCafe is a free, open-source framework for functional web testing (end-to-end testing). TestCafe is based on Node.js and doesn't use WebDriver at all.

TestCafe-powered tests are executed on the server side. To obtain DOM elements, TestCafe provides a powerful, flexible system of Selectors. TestCafe can also execute JavaScript on the tested web page using the ClientFunction feature (see the documentation).

TestCafe tests run very fast, see for yourself. The high speed does not affect stability, thanks to a built-in smart wait system.

Installation of TestCafe is very easy:

1) Check that you have Node.js on your PC (or install it).

2) To install TestCafe, open a command prompt and type:

npm install -g testcafe 

Writing a test is not rocket science. Here is a quick start:

1) Copy and paste the following code into your text editor and save it as "test.js":

import { Selector } from 'testcafe';

fixture `Getting Started`
    .page `http://devexpress.github.io/testcafe/example`;

test('My first test', async t => {
    await t
        .typeText('#developer-name', 'John Smith')
        .click('#submit-button')
        .expect(Selector('#article-header').innerText).eql('Thank you, John Smith!');
});

2) Run the test in your browser (e.g. Chrome) by typing the following command in cmd:

testcafe chrome test.js 

3) Get the descriptive result in the console output.

TestCafe allows you to test against various browsers: local, remote (on devices, be it a browser on a Raspberry Pi or Safari on iOS), cloud (e.g. Sauce Labs) or headless (e.g. Nightmare). This means you can easily use TestCafe with your Continuous Integration infrastructure.

You can use the same approach to scrape data and save it to a file easily.

Answer 4

Using a headless browser will consume significantly less CPU and memory, so try using PhantomJS instead of Chrome. There is a nice blog post about using PhantomJS with Selenium here:

https://realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/

Answer 5

Another alternative is the 'grab' module:

from grab import Grab

g = Grab()
g.go('https://www.timeform.com/horse-racing/account/sign-in?returnUrl=%2Fhorse-racing%2F')

# fill in the sign-in form and submit it
g.doc.set_input('EmailAddress', 'some@email.com')
g.doc.set_input('Password', 'somepass')
g.doc.submit()

print(g.doc.body)

Answer 6

You can use mechanize; it took 3.22 seconds on my old notebook to log in and parse the site.

from mechanize import Browser
import time  # just to measure elapsed time and check performance

started_time = time.time()

browser = Browser()
url = 'https://www.timeform.com/horse-racing/account/sign-in?returnUrl=%2Fhorse-racing%2F'
browser.open(url)

# the sign-in form is the first form on the page
browser.select_form(nr=0)
browser["EmailAddress"] = 'putyouremailhere'
browser["Password"] = 'p4ssw0rd'

logged = browser.submit()
page_content = logged.read()  # body of the page you are redirected to
print(page_content)

# you can delete this section:
elapsed_time = time.time() - started_time
print(str(elapsed_time) + ' seconds')

I hope it helps! :)

Answer 7

There are several ways to do it:

  1. If you really need the whole of Selenium's functionality (JavaScript execution, etc.), try a headless browser driver (i.e. GhostDriver); however, you won't save as much CPU time as you would with the second option below.
  2. Instead of Selenium, which is quite heavy, you can use a lightweight tool such as RoboBrowser (Python 3), mechanize or browserplus; see the sketch after this list. You can save a lot of CPU time that way, but these tools don't support JavaScript and lack some of Selenium's advanced features.
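A minimal RoboBrowser sketch of the login from the question. The field names come from the original script; the assumption that the sign-in form is the first form on the page would need to be verified against the actual HTML:

from robobrowser import RoboBrowser

browser = RoboBrowser(history=True)
browser.open('https://www.timeform.com/horse-racing/account/sign-in?returnUrl=%2Fhorse-racing%2F')

# assumption: the sign-in form is the first form on the page
form = browser.get_forms()[0]
form['EmailAddress'].value = 'usr'  # field names taken from the question
form['Password'].value = 'pass'
browser.submit_form(form)

print(browser.parsed)  # the page returned after submitting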

Answer 8

Yes. Rather than Selenium, Chromium or any other headless browser, you should use plain HTTP (making calls to the URL directly).

The requests and urllib modules will help here.

For this you need to identify the form's parameters and the request method. Once you have identified everything required to call the URL, you can use requests or urllib. You also need to check what kind of response you get back.

Here is good documentation for Requests: http://docs.python-requests.org/en/master/

Example using requests:

Case: here we are submitting a form that has two fields, id and pwd. The method specified in the form is POST, and the names specified in the form are user_id and user_pwd for id and pwd respectively. Clicking the button sends the request to 'some url'.

import requests

url = 'some url'  # the address the form posts to

dataToSend = {'user_id': 'id you want to pass', 'user_pwd': 'specify pwd here'}

# specify headers and cookies here if required
response = requests.post(url, data=dataToSend,
                         headers={'content-type': 'specify if required',
                                  'user-agent': 'chrome...'})

if response.status_code == 200:
    contentReceived = response.content
    # Observe the received content; most of the time it will be JSON,
    # so you may need to decode it here.
    if contentReceived == 'Response is same that you have expected':
        print("Success")
    else:
        print("Failed")
else:
    print("Failed")

Refer to my other answers on how to use requests, cookies and Selenium.

Answer 9

I recommend Scrapy: https://scrapy.org/. It uses Twisted under the hood, so it is very efficient.

If you need to execute JavaScript, there is also the scrapy-splash package: https://github.com/scrapy-plugins/scrapy-splash.

There is a dedicated section in the Scrapy documentation about login pages: https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-userlogin
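For completeness, a hedged sketch of what that looks like for the site in the question, using Scrapy's FormRequest.from_response. The field names come from the original script; the spider name and the after_login logic are placeholders:

import scrapy

class TimeformSpider(scrapy.Spider):
    name = 'timeform'  # placeholder spider name
    start_urls = ['https://www.timeform.com/horse-racing/account/sign-in?returnUrl=%2Fhorse-racing%2F']

    def parse(self, response):
        # from_response copies the form's hidden fields (e.g. anti-forgery tokens) automatically
        return scrapy.FormRequest.from_response(
            response,
            formdata={'EmailAddress': 'usr', 'Password': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # placeholder: start scraping here, now that the session is authenticated
        self.logger.info('Logged in, landed on %s', response.url)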
