Showing posts with label web-scraping.

Monday, June 25, 2018

In Selenium, how do I access the warning window when information is not found?


In Python 3 and Selenium I have this program to enter codes on a website and store the information returned:

from selenium import webdriver

profile = webdriver.FirefoxProfile()
browser = webdriver.Firefox(profile)

# Site that is accessed
browser.get('https://www.fazenda.sp.gov.br/SigeoLei131/Paginas/ConsultaDespesaAno.aspx?orgao=')

# Select year 2018
browser.find_element_by_xpath('/html/body/form/div[3]/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr[1]/td[2]/select/option[1]').click()

# Enter the code 07022473000139
browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_rblDoc_0"]').click()
browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_txtCPF"]').send_keys('07022473000139')
browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_btnPesquisar"]').click()

# Stores the information found
company = browser.find_element_by_xpath('/html/body/form/div[3]/table/tbody/tr[2]/td/table/tbody/tr[3]/td/div/table/tbody/tr[2]/td[1]').text
value = browser.find_element_by_xpath('/html/body/form/div[3]/table/tbody/tr[2]/td/table/tbody/tr[3]/td/div/table/tbody/tr[2]/td[2]').text

# Go back one screen to do another search
browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_btnVoltar"]').click()

I have a list of codes to search the site in the field: "CNPJ/CPF/Inscrição genérica/UGE favorecida :"

From this list I have already discovered that some codes do not exist in the site's database (I do not know how many). When I type a code that does not exist (like '07022473000136'), a window opens on the site with the message "Não existe Credor com o filtro informado" ("No creditor exists with the given filter"), and I can only continue by pressing the OK button.

I did not find this warning message in the site's source code, so I still do not know how to handle it.

Please, would anyone know how to test in Selenium whether or not the code exists? And if it does not, how do I press the OK button to continue?

-/-

Below is a new test that searches for several codes stored in a dataframe. This program worked:

from selenium import webdriver
from selenium.common.exceptions import NoAlertPresentException
from selenium.webdriver.support.select import Select
import pandas as pd

deputados_socios_empresas = pd.read_csv("resultados/empresas_deputados.csv", sep=',', encoding='utf-8',
                                        converters={'cnpj': lambda x: str(x), 'cpf': lambda x: str(x), 'documento': lambda x: str(x)})

profile = webdriver.FirefoxProfile()
browser = webdriver.Firefox(profile)
# Important command to pause, during loop
browser.implicitly_wait(10)

# Site that is accessed
browser.get('https://www.fazenda.sp.gov.br/SigeoLei131/Paginas/ConsultaDespesaAno.aspx?orgao=')

# List to store the data
pagamentos = []

for num, row in deputados_socios_empresas.iterrows():
    # Variable with code to search
    empresa = (row['cnpj']).strip()

    # Search for each code in four years
    for vez in [2015, 2016, 2017, 2018]:
        ano = str(vez)

        # Select year
        Select(browser.find_element_by_name('ctl00$ContentPlaceHolder1$ddlAno')).select_by_visible_text(ano)

        # Fill in the code to search
        browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_rblDoc_0"]').click()
        browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_txtCPF"]').send_keys(empresa)
        browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_btnPesquisar"]').click()

        try:
            found = True
            alert = browser.switch_to.alert
            alert.accept()
            found = False
            # Message shows that the code was not found that year
            print("CNPJ " + empresa + " não encontrado no ano " + ano)
            browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_btnVoltar"]').click()
        except NoAlertPresentException:
            pass

        if found:
            results = browser.find_element_by_xpath("//table[@id='ctl00_ContentPlaceHolder1_gdvCredor']//tr[2]")
            cia = results.find_element_by_xpath("td[1]").text
            valor = results.find_element_by_xpath("td[2]").text

            # Message shows that the code was found that year
            print("CNPJ " + empresa + " encontrado no ano " + ano)

            # Fills dictionary with found data
            dicionario = {"cnpj": empresa,
                          "ano": ano,
                          "empresa": cia,
                          "valor": valor,
                          }
            pagamentos.append(dicionario)

            # Go back one screen to do another search
            browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_btnVoltar"]').click()

# Create the dataframe
df_pagamentos = pd.DataFrame(pagamentos)

4 Answers

Answers 1

First of all, you should not be using such brittle xpaths. I think you copied them from the auto-generated xpath option, which is not good.
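For illustration, here is the same dropdown located both ways (both lines are taken from code in this post); the attribute-based locator survives layout changes, while the absolute path breaks as soon as any ancestor changes:

from selenium import webdriver
from selenium.webdriver.support.select import Select

browser = webdriver.Firefox()
browser.get('https://www.fazenda.sp.gov.br/SigeoLei131/Paginas/ConsultaDespesaAno.aspx?orgao=')

# brittle: an absolute path copied from the inspector
browser.find_element_by_xpath('/html/body/form/div[3]/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr[1]/td[2]/select/option[1]').click()

# robust: anchored on a stable name attribute (as used in the updated code below)
Select(browser.find_element_by_name('ctl00$ContentPlaceHolder1$ddlAno')).select_by_visible_text("2018")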

Next, what you need is to check for the presence of the alert in the failure case. Below is the updated code that should work for you:

from selenium import webdriver
from selenium.common.exceptions import NoAlertPresentException
from selenium.webdriver.support.select import Select

profile = webdriver.FirefoxProfile()
browser = webdriver.Firefox(profile)

# Site that is accessed
browser.get('https://www.fazenda.sp.gov.br/SigeoLei131/Paginas/ConsultaDespesaAno.aspx?orgao=')

# Select year 2018
Select(browser.find_element_by_name('ctl00$ContentPlaceHolder1$ddlAno')).select_by_visible_text("2018")

# Enter the code 07022473000139
browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_rblDoc_0"]').click()
browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_txtCPF"]').send_keys('07022473000136')
# browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_txtCPF"]').send_keys('07022473000139')

browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_btnPesquisar"]').click()

try:
    found = True
    alert = browser.switch_to.alert
    alert.accept()
    found = False
    print("no data was found")
except NoAlertPresentException:
    pass

if found:
    results = browser.find_element_by_xpath("//table[@id='ctl00_ContentPlaceHolder1_gdvCredor']//tr[2]")
    company = results.find_element_by_xpath("td[1]").text
    value = results.find_element_by_xpath("td[2]").text

    print(company, value)
    # Go back one screen to do another search
    browser.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_btnVoltar"]').click()

Answers 2

You can use action chains to send keystrokes to the browser rather than to a particular page element.

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

actions = ActionChains(driver)
actions.send_keys(Keys.ENTER)
actions.perform()

Also see: http://selenium-python.readthedocs.io/api.html?highlight=send_keys#module-selenium.webdriver.common.action_chains

Answers 3

You need to add this import statement

from selenium.common.exceptions import UnexpectedAlertPresentException 

and after these lines

driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_rblDoc_0"]').click()
driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_txtCPF"]').send_keys('07022473000136')
driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_btnPesquisar"]').click()

add

try:
    driver.switch_to.alert.accept()
except UnexpectedAlertPresentException:
    print('passing')

which will accept the alert present in the browser.

Edit 1: Tarun's answer is a much better implementation. Please use that one instead.

Answers 4

To do the operation in a slightly more organized manner, you can try the below approach. Your defined xpaths are error prone. However, driver.switch_to_alert().accept() is not a big issue to deal with; what matters is where you place it. I've tried with three searches, of which the one in the middle should encounter the issue and handle it in the proper way.

This is what I would do in this situation:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import UnexpectedAlertPresentException

url = 'https://www.fazenda.sp.gov.br/SigeoLei131/Paginas/ConsultaDespesaAno.aspx?orgao='

def get_values(driver, link):
    for keyword in ["07022473000139", "07022473000136", "07022473000139"]:
        driver.get(link)
        wait = WebDriverWait(driver, 10)
        item = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[name='ctl00$ContentPlaceHolder1$ddlAno']")))
        select = Select(item)
        select.select_by_visible_text("2018")
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#ctl00_ContentPlaceHolder1_rblDoc_0'))).click()
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#ctl00_ContentPlaceHolder1_txtCPF'))).send_keys(keyword)
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#ctl00_ContentPlaceHolder1_btnPesquisar'))).click()

        try:
            company = wait.until(EC.visibility_of_element_located((By.XPATH, "//tr[contains(.//th,'Credor')]/following::a"))).text
            value = wait.until(EC.visibility_of_element_located((By.XPATH, "//tr[contains(.//th[2],'Valor')]/following::td[2]"))).text
            print(f'{company}\n{value}\n')

        except UnexpectedAlertPresentException:
            driver.switch_to_alert().accept()
            print("Nothing found" + "\n")

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_values(driver, url)
    finally:
        driver.quit()

Tuesday, June 19, 2018

Trouble running a parser created using scrapy with selenium


I've written a scraper in Python Scrapy, in combination with Selenium, to scrape some titles from a website. The css selectors defined within my scraper are flawless. I want my scraper to keep clicking through to the next page and parse the information embedded in each page. It does fine for the first page, but when it's time for the Selenium part to play its role, the scraper keeps clicking on the same link over and over again.

As this is my first time working with Selenium along with Scrapy, I don't have any idea how to move on successfully. Any fix will be highly appreciated.

If I try like this then it works smoothly (there is nothing wrong with selectors):

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h1.faqsno-heading"))):
                name = elem.find_element_by_css_selector("div[id^='arrowex']").text
                print(name)

            try:
                self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
                self.wait.until(EC.staleness_of(elem))
            except TimeoutException:
                break

But my intention is to make my script run this way:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self, link):
        self.driver.get(link)
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # It keeps clicking on the same link over and over again
        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
        self.wait.until(EC.staleness_of(elem))

    def parse(self, response):
        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}

            try:
                self.click_nextpage(response.url)  # initiate the method to do the clicking
            except TimeoutException:
                break

These are the titles visible on that landing page (to let you know what I'm after):

INDIA INCLUSION FOUNDATION
INDIAN WILDLIFE CONSERVATION TRUST
VATSALYA URBAN AND RURAL DEVELOPMENT TRUST

Getting the data from that site is not my goal in itself, so any alternative approach other than what I've tried above is useless to me. My only intention is to find a solution for the way I tried in my second approach.

2 Answers

Answers 1

In case you need a pure Selenium solution:

driver.get("https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx")  while True:     for item in wait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div[id^='arrowex']"))):         print(item.text)     try:         driver.find_element_by_xpath("//input[@text='Next' and not(contains(@class, 'disabledImageButton'))]").click()     except NoSuchElementException:         break 

Answers 2

Whenever the page gets loaded using the 'Next Page' arrow (using Selenium), it gets reset back to page 1. I am not sure about the reason for this (maybe the JavaScript). Hence I changed the approach to use the input field to enter the needed page number and hit the ENTER key to navigate.

Here is the modified code. Hope this may be useful for you.

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"
    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self, link, number):
        self.driver.get(link)
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # Instead of clicking the arrow (which resets to page 1),
        # type the wanted page number and press ENTER
        inputElement = self.driver.find_element_by_xpath("//input[@id='ctl00_SPWebPartManager1_g_d6877ff2_42a8_4804_8802_6d49230dae8a_ctl00_txtPageNumber']")
        inputElement.clear()
        inputElement.send_keys(number)
        inputElement.send_keys(Keys.ENTER)
        self.wait.until(EC.staleness_of(elem))

    def parse(self, response):
        number = 1
        while number < 10412:  # Website shows it has 10411 pages.
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}
                print(name)

            try:
                number += 1
                self.click_nextpage(response.url, number)  # initiate the method to do the clicking
            except TimeoutException:
                break

Wednesday, June 6, 2018

Get reactInstance from Javascript in Chrome Extension


I'm currently facing a problem while developing a Chrome extension. This extension is used on a ReactJS-based website, and I need to scrape some data from the page. Here is an example of the page:

<div class="UserWallet">    <tbody>       <tr class"Trans"><td>...</td></tr>       ...    <tbody> </div> 

When I use the Chrome inspector, I can see that my <div class="UserWallet"> has a property __reactInternalInstance. I found a function findReact(element) used to get the React instance. This function is used in another Chrome extension called Steemit-More-Info. I have the exact same function and use the same HTML element as the parameter, but my function is not working. When I do $(".UserWallet"), the result doesn't contain the property __reactInternalInstance. But in the other extension it works, with the same jQuery code and the same findReact function.

Here is the code for findReact.

var findReact = function(dom) {
   for (var key in dom) {
      if (key.startsWith("__reactInternalInstance$")) {
        var compInternals = dom[key]._currentElement;
        var compWrapper = compInternals._owner;
        var comp = compWrapper._instance;
        return comp;
      }
    }
    return null;
};

Has anyone ever faced this problem? Is there a special library that I need to include in my extension to be able to scrape the reactInstance?

Thank you,

cedric_g

2 Answers

Answers 1

A workaround idea for when you can modify the webpage.

Maybe you can put the needed data in a specific tag like this:

<div class="UserWallet">     <span       style={{ display: 'none' }}       id="my-user-wallet-data"     >{ JSON.stringify(myData) }</span>   ...  </div> 

And then you can easily get this element by using

var wallet = document.getElementById('my-user-wallet-data')

And so you can read the inner text of this element; since it was written with JSON.stringify, you can parse it back with JSON.parse.

Answers 2

As stated here:

Heads up that in React 16 this hack won't work because internal property names changed.

The website you're trying to scrape might use React 16, so that approach won't work. One hacky way I can think of is to hook into the react-devtools and grab the instance that the react-devtools already populate for you.

The following code gets the MarkdownEditor component (the last one) from the React homepage; run it in the console and you can get the result.

var reactDevToolsHook = window.__REACT_DEVTOOLS_GLOBAL_HOOK__._fiberRoots;
var instArray = [...reactDevToolsHook[Object.keys(reactDevToolsHook)[0]]];
var mdInst = instArray[8];
console.log(mdInst.current.child.stateNode)

[Screenshots in the original post: the console output, and the same component instance shown in React Devtools.]

NOTE: you have to install React Developer Tools to make this trick work


Friday, April 20, 2018

Unable to make a split screen scroll to the bottom


I've written a script in Python, in combination with Selenium, to make the screen of a webpage scroll downward. The content is within the left-side window; if I scroll down, more items become visible. I've tried the approach below, but it doesn't seem to work. Any help on this will be highly appreciated.

Check out this: website link.

What I've tried so far:

import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("find_the_link_above")

elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#pannello-espositori .luogo")))

for item in range(3):
    elem.send_keys(Keys.END)
    time.sleep(3)

driver.quit()

When I execute the above script, it throws a "cannot focus element" exception.

4 Answers

Answers 1

Try to use the following method for that:

def scroll_down():
    """A method for scrolling down the page."""

    # Get scroll height.
    last_height = driver.execute_script("return document.querySelector('#pannello-espositori').scrollHeight;")

    while True:
        # Scroll down to the bottom.
        driver.execute_script("window.iScrollElenco.scrollBy(0, -arguments[0]);", last_height)

        # Wait to load the page.
        time.sleep(2)

        # Calculate new scroll height and compare with last scroll height.
        new_height = driver.execute_script("return document.querySelector('#pannello-espositori').scrollHeight;")
        if new_height == last_height:
            break

        last_height = new_height

Use this method when you want to scroll down content in the left-side panel (it uses the height of that panel).
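For context, a minimal sketch of how this helper might be driven (the URL is the one used in Answer 2 below; scroll_down() reads the module-level driver defined here):

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://catalogo.marmomac.it/it/cat")  # URL taken from Answer 2
time.sleep(3)    # give the left-side panel time to render
scroll_down()    # the helper defined above
driver.quit()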

Hope it helps you! Let me know about the result.

Answers 2

Try this. You can see the scrolling effect as the script scrolls to the elements in the left panel.

This solution scrolls through the first 100 elements.

from selenium import webdriver
import time

def scroll_element_into_view(element):
    driver.execute_script(
        "arguments[0].scrollIntoView(true);",
        element)
    time.sleep(0.2)  # increase/decrease time as you want delay in your view

driver = webdriver.Chrome()
driver.maximize_window()
driver.set_page_load_timeout(5)
try:
    driver.get("http://catalogo.marmomac.it/it/cat")
    time.sleep(3)
    total_elems = driver.find_elements_by_css_selector(".scroller .elemento")
    print(len(total_elems))
    for i in range(len(total_elems)):
        scroll_element_into_view(total_elems[i])
except Exception as e:
    print(e)
finally:
    driver.quit()

As you have mentioned, scrolling loads more elements. The below script handles that too. Here we can use the total count, which is already shown at the top of the panel.

For example, the count is 1669:

  1. First it will scroll from element 1 to 100
  2. Again find the total elements, which is now 150
  3. So it will scroll from 101 to 150
  4. Again find the total elements, which is now 200
  5. So it will scroll from 151 to 200

This process continues until element 1669. (Store the previous count in one variable and update it after every loop.)

try:
    driver.get("http://catalogo.marmomac.it/it/cat")
    time.sleep(3)
    total_elems = 0
    total_count = int(driver.find_element_by_css_selector(".totali").text)
    while total_elems < total_count:
        elems = driver.find_elements_by_css_selector(".scroller .elemento")
        found_elms = len(elems)
        for i in range(total_elems, found_elms):
            scroll_element_into_view(elems[i])
        total_elems = found_elms
except Exception as e:
    print(e)
finally:
    driver.quit()

Answers 3

Have you tried something like:
Option 1

driver.execute_script("arguments[0].scrollIntoView();", element)

Option 2

from selenium.webdriver.common.action_chains import ActionChains

element = driver.find_element_by_id("my-id")

actions = ActionChains(driver)
actions.move_to_element(element).perform()

Answers 4

The selector "#pannello-espositori .luogo" gives part of the text in the first element in the panel. If you scroll down, this element is no longer visible and might not be able to get any more commands. You can locate the entire list and use ActionChains to scroll to the last element each time. It will be scrolled into view and will reload the list

actions = ActionChains(driver)
for item in range(3):
    elements = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#pannello-espositori .elemento")))
    actions.move_to_element(elements[-1]).perform()

Saturday, March 24, 2018

Trouble using lambda function within my scraper


I've written a script to parse the name and price of certain items from craigslist. The xpaths I've defined within my scraper are working ones. The thing is, when I scrape the items in the usual way and apply a try/except block, I can avoid an IndexError when the value of a certain price is None. I even tried a customized function to make it work, and found success as well.

However, in the below snippet I would like to apply a lambda function to avoid the IndexError. I tried but could not succeed.

Btw, when I run the code it neither fetches anything nor throws any error.

import requests
from lxml.html import fromstring

page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = fromstring(page)

# I wish to fix this function to make a go
get_val = lambda item, path: item.text if item.xpath(path) else ""

for item in tree.xpath('//li[@class="result-row"]'):
    link = get_val(item, './/a[contains(@class,"hdrlnk")]')
    price = get_val(item, './/span[@class="result-price"]')
    print(link, price)

1 Answers

Answers 1

First of all, your lambda function get_val returns the text of the item if the path exists, not the text of the searched node. This is probably not what you want. If you want to return the text content of the (first) element matching the path, you should write:

get_val = lambda item, path: item.xpath(path)[0].text if item.xpath(path) else "" 

Please note that xpath returns a list. I assume here that you have only one element in that list.
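As a quick standalone illustration of that list behaviour (the snippet below uses made-up HTML, not real craigslist markup):

from lxml.html import fromstring

item = fromstring('<div><span class="result-price">₨1000</span></div>')
nodes = item.xpath('.//span[@class="result-price"]')
print(nodes)          # a list of matching elements
print(nodes[0].text)  # '₨1000' -- hence the [0] before .text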

With the corrected lambda, the output of your script is something like this:

...
Residential Plot @ Sarjapur Check Post ₨1000
Prestige dolce vita apartments in whitefield, Bangalore
Brigade Golden Triangle, ₨12500000
Nikoo Homes, ₨6900000

But I think you want a link, not the text. If this is the case, read below.

Ok, how to get a link? When you have an anchor a, you get its href (the link) from the table of attributes: a.attrib["href"].

So, as I understand it: in the case of the price you want the text, but in the case of the anchor you want the value of one specific attribute, href. Here's the real use of lambdas. Rewrite your function like this:

def get_val(item, path, l):
    return l(item.xpath(path)[0]) if item.xpath(path) else ""

The parameter l is a function that is applied to the node. l may return the text of the node, or the href of an anchor:

link = get_val(item, './/a[contains(@class,"hdrlnk")]', lambda n: n.attrib["href"])
price = get_val(item, './/span[@class="result-price"]', lambda n: n.text)

Now the output is:

...
https://bangalore.craigslist.co.in/reb/d/residential-plot-sarjapur/6522786441.html ₨1000
https://bangalore.craigslist.co.in/reb/d/prestige-dolce-vita/6522754197.html
https://bangalore.craigslist.co.in/reb/d/brigade-golden-triangle/6522687904.html ₨12500000
https://bangalore.craigslist.co.in/reb/d/nikoo-homes/6522687772.html ₨6900000

Sunday, March 4, 2018

R httr post-authentication download works in interactive mode but fails in function

Leave a Comment

The code below works fine in interactive mode but fails when used in a function. It's pretty simple: two authentication POST commands followed by the data download. My goal is to get this working inside a function, not just in interactive mode.

This question is sort of a sequel to an earlier question; ICPSR recently updated their website. The minimal reproducible example below requires a free account, available at

https://www.icpsr.umich.edu/rpxlogin?path=ICPSR&request_uri=https%3a%2f%2fwww.icpsr.umich.edu%2ficpsrweb%2findex.jsp

I tried adding Sys.sleep(1) and various httr::GET/httr::POST calls, but nothing worked.

my_download <-
    function( your_email , your_password ){

        values <-
            list(
                agree = "yes",
                path = "ICPSR" ,
                study = "21600" ,
                ds = "" ,
                bundle = "rdata",
                dups = "yes",
                email=your_email,
                password=your_password
            )

        httr::POST("https://www.icpsr.umich.edu/cgi-bin/terms", body = values)
        httr::POST("https://www.icpsr.umich.edu/rpxlogin", body = values)

        tf <- tempfile()
        httr::GET(
            "https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2" ,
            query = values ,
            httr::write_disk( tf , overwrite = TRUE ) ,
            httr::progress()
        )

    }

# fails
my_download( "email@address.com" , "some_password" )

# stepping through works
debug( my_download )
my_download( "email@address.com" , "some_password" )

EDIT: the failure simply downloads this page as if not logged in (and not the dataset), so it's losing the authentication for some reason. If you are logged in to ICPSR, use private browsing to see the page:

https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2?study=21600&ds=1&bundle=rdata&path=ICPSR

thanks!

0 Answers


Saturday, February 10, 2018

Generate correct Scrapy hidden input form values for the ASP doPostBack() function


tl;dr: My attempts to overwrite the hidden field (the __EVENTTARGET attribute) that the server needs in order to return me a new page of geocaches failed, so the server returns me an empty page.

PS: My original post was closed due to abandoned votes, so I repost here after a massive edit of the first post.


I am trying to scrape some webpages which contain geocaches on a famous geocaching site, using Scrapy 1.5.0.

Because you need an account to run this code, I created a new temporary free account on the website for testing: dumbuser with password stackoverflow.


A) The actual working part of the process:

  • First, I enter the website through the login page (needed to reach the search page): https://www.geocaching.com/account/login
  • After a successful login, I search for items (geocaches) in some geographic places (for example France, Haute-Normandie).

This first search works without problem, and I have no difficulty parsing the first geocaches.

B) The problem part of the process: requesting next pages

The problem appears when I try to simulate a click to go to the next page of geocaches, for example going from page 1 to page 2.

[Image: the pager with numbered page links at the bottom of the results]

The website uses ASP with state synchronized between client and server, so we need to go to page 1, then page 2, then page 3, and so on during the scrape, in order to maintain the __VIEWSTATE variable (a hidden input) generated by the server between each form query.

The link on each number (see the image above) calls the JavaScript function javascript:__doPostBack(...), which injects content into already existing hidden fields before submitting the entire form.

As you can see in the __doPostBack function:

<script type="text/javascript"> //<![CDATA[ var theForm = document.forms['aspnetForm']; if (!theForm) {     theForm = document.aspnetForm; } function __doPostBack(eventTarget, eventArgument) {     if (!theForm.onsubmit || (theForm.onsubmit() != false)) {         theForm.__EVENTTARGET.value = eventTarget;         theForm.__EVENTARGUMENT.value = eventArgument;         theForm.submit();     } } //]]> </script> 

Example: when you click on the page 2 link, the JavaScript run is javascript:__doPostBack('ctl00$ContentBody$pgrTop$lbGoToPage_2',''). The form is submitted with:

  • __EVENTTARGET = ctl00$ContentBody$pgrTop$lbGoToPage_2
  • __EVENTARGUMENT = ''

C) First try to imitate this behavior:

In order to scrape many pages (limited here to the first few pages), I try to yield several FormRequest.from_response queries which simply overwrite the __EVENTTARGET and __EVENTARGUMENT attributes manually:

def parse_pages(self, response):

    self.parse_cachesList(response)

    ## EXTRACT NUMBER OF PAGES
    links = response.xpath('//td[@class="PageBuilderWidget"]/span/b[3]')
    print(links.extract_first())

    ## Try to extract pages 1 to 4, for example
    for page in range(1, 5):
        yield scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='aspnetForm']",
            formdata={'__EVENTTARGET': 'ctl00$ContentBody$pgrTop$lbGoToPage_' + str(page),
                      '__EVENTARGUMENT': '',
                      '__LASTFOCUS': ''},
            dont_click=True,
            callback=self.parse_cachesList,
            dont_filter=True
        )

D) Consequence:

The page returned by the server is empty, so there is something wrong in my strategy.

When I look at the generated HTML code returned by the server after the form POST, the __EVENTTARGET is never overwritten by Scrapy:

<input id="__EVENTTARGET" name="__EVENTTARGET" type="hidden" value=""/> <input id="__EVENTARGUMENT" name="__EVENTARGUMENT" type="hidden" value=""/> 

E) Question:

Could you help me understand why Scrapy doesn't replace/overwrite the __EVENTTARGET value here? Where is the problem in my strategy to simulate a user who clicks to follow each new page?

The complete code is downloadable here: code


UPDATE 1:

Using Fiddler, I finally found that the problem is linked to an input: ctl00$ContentBody$chkAll=Check All. This input is automatically copied by the scrapy.FormRequest.from_response method. If I remove this attribute from the POST request, it works. So, how can I remove this field? I tried an empty value, without result:

result = scrapy.FormRequest.from_response(
            response,
            formname="aspnetForm",
            formxpath="//form[@id='aspnetForm']",
            formdata={'ctl00$ContentBody$chkAll': '',
                      '__EVENTTARGET': 'ctl00$ContentBody$pgrTop$lbGoToPage_2',},
            dont_click=True,
            callback=self.parse_cachesList,
            dont_filter=True,
            meta={'proxy': 'http://localhost:8888'}
            )

1 Answers

Answers 1

Solved using a lot of patience, and the Fiddler tool to debug and resend the POST query to the server!

As Update 1 in my original question says, the problem comes from the input ctl00$ContentBody$chkAll in the form.

The way to remove an input from the POST form sent by FormRequest is simple; I found it in a Scrapy commit. Set the attribute to None in the formdata dictionary.

result = scrapy.FormRequest.from_response(
    response,
    formname="aspnetForm",
    formxpath="//form[@id='aspnetForm']",
    formdata={'ctl00$ContentBody$chkAll': None,
              '__EVENTTARGET': 'ctl00$ContentBody$pgrTop$lbGoToPage_2',},
    dont_click=True,
    callback=self.parse_cachesList,
    dont_filter=True
    )

Monday, January 1, 2018

Scrapy Very Basic Example


Hi, I have Python Scrapy installed on my Mac and I was trying to follow the very first example on their website.

They were trying to run the command:

scrapy crawl mininova.org -o scraped_data.json -t json 

I don't quite understand what this means. It looks like Scrapy turns out to be a separate program, and I don't think they have a command called crawl. In the example, they have a paragraph of code which is the definition of the classes MininovaSpider and TorrentItem. I don't know where these two classes should go. Do they go into the same file, and what is the name of this Python file?

2 Answers

Answers 1

You may have better luck looking through the tutorial first, as opposed to the "Scrapy at a glance" webpage.

The tutorial implies that Scrapy is, in fact, a separate program.

Running the command scrapy startproject tutorial will create a folder called tutorial with several files already set up for you.

For example, in my case, the modules/packages items, pipelines, settings and spiders have been added to the root package tutorial.

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

The TorrentItem class would be placed inside items.py, and the MininovaSpider class would go inside the spiders folder.
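As a rough sketch of what those two files might look like (the fields, URL and xpath here are illustrative, not the exact code from the Scrapy example):

# tutorial/items.py
import scrapy

class TorrentItem(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field()


# tutorial/spiders/mininova_spider.py
import scrapy
from tutorial.items import TorrentItem

class MininovaSpider(scrapy.Spider):
    name = "mininova.org"  # the name referenced by `scrapy crawl mininova.org`
    start_urls = ["http://www.mininova.org/today"]

    def parse(self, response):
        for href in response.xpath("//a/@href").extract():
            item = TorrentItem()
            item["url"] = response.urljoin(href)
            yield item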

Once the project is set up, the command-line parameters for Scrapy appear to be fairly straightforward. They take the form:

scrapy crawl <website-name> -o <output-file> -t <output-type> 

Alternatively, if you want to run scrapy without the overhead of creating a project directory, you can use the runspider command:

scrapy runspider my_spider.py 

Answers 2

TL;DR: see Self-contained minimum example script to run scrapy.

First of all, having a normal Scrapy project with a separate .cfg, settings.py, pipelines.py, items.py, a spiders package, etc. is the recommended way to keep and handle your web-scraping logic. It provides modularity and a separation of concerns that keep things organized, clear and testable.

If you are following the official Scrapy tutorial to create a project, you are running web-scraping via a special scrapy command-line tool:

scrapy crawl myspider 

But, Scrapy also provides an API to run crawling from a script.

There are several key concepts that should be mentioned:

  • Settings class - basically a key-value "container" which is initialized with default built-in values
  • Crawler class - the main class that acts like a glue for all the different components involved in web-scraping with Scrapy
  • Twisted reactor - since Scrapy is built on top of the Twisted asynchronous networking library, to start a crawler we need to put it inside the Twisted reactor, which is, in simple words, an event loop:

The reactor is the core of the event loop within Twisted – the loop which drives applications using Twisted. The event loop is a programming construct that waits for and dispatches events or messages in a program. It works by calling some internal or external “event provider”, which generally blocks until an event has arrived, and then calls the relevant event handler (“dispatches the event”). The reactor provides basic interfaces to a number of services, including network communications, threading, and event dispatching.

Here is a basic and simplified process of running Scrapy from a script:

  • create a Settings instance (or use get_project_settings() to use existing settings):

    settings = Settings()  # or settings = get_project_settings() 
  • instantiate Crawler with settings instance passed in:

    crawler = Crawler(settings) 
  • instantiate a spider (this is what it is all about eventually, right?):

    spider = MySpider() 
  • configure signals. This is an important step if you want to have post-processing logic, collect stats or, at the very least, ever finish crawling, since the Twisted reactor needs to be stopped manually. The Scrapy docs suggest stopping the reactor in the spider_closed signal handler:

Note that you will also have to shutdown the Twisted reactor yourself after the spider is finished. This can be achieved by connecting a handler to the signals.spider_closed signal.

def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()
    # stats here is a dictionary of crawling stats that you usually see on the console

    # here we need to stop the reactor
    reactor.stop()

crawler.signals.connect(callback, signal=signals.spider_closed)
  • configure and start crawler instance with a spider passed in:

    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
  • optionally start logging:

    log.start() 
  • start the reactor - this would block the script execution:

    reactor.run() 

Here is an example self-contained script that uses the DmozSpider spider and involves item loaders with input and output processors, and an item pipeline:

import json

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor


# define an item class
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


# define an item loader with input and output processors
class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

    desc_out = Join()


# define a pipeline
class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


# define a spider
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()


# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()


# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('ITEM_PIPELINES', {
    '__main__.JsonWriterPipeline': 100
})

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = DmozSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()

Run it in the usual way:

python runner.py 

and observe items exported to items.jl with the help of the pipeline:

{"desc": "", "link": "/", "title": "Top"} {"link": "/Computers/", "title": "Computers"} {"link": "/Computers/Programming/", "title": "Programming"} {"link": "/Computers/Programming/Languages/", "title": "Languages"} {"link": "/Computers/Programming/Languages/Python/", "title": "Python"} ... 

Gist is available here (feel free to improve):


Notes:

If you define settings by instantiating a Settings() object, you'll get all the default Scrapy settings. But if you want to, for example, configure an existing pipeline, set a DEPTH_LIMIT or tweak any other setting, you need to either set it in the script via settings.set() (as demonstrated in the example):

pipelines = {
    'mypackage.pipelines.FilterPipeline': 100,
    'mypackage.pipelines.MySQLPipeline': 200
}
settings.set('ITEM_PIPELINES', pipelines, priority='cmdline')

or, use an existing settings.py with all the custom settings preconfigured:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()



Monday, November 13, 2017

Python - Manipulate and read an already-open browser


I am struggling to find a method in Python which allows you to read data from a currently open web browser. Effectively, I am trying to download a massive table of data from a locally controlled company webpage and load it into a dataframe. The issue is that the website has a fairly complex authentication token process, which I have not been able to bypass with Selenium (trying a slew of webdrivers), Requests, urllib, or cookielib, using a variety of user parameters. I have given up on this front entirely, as I am almost positive that there is more to the authentication process than can easily be achieved with these libraries.

However, I did manage to bypass the required tokenization process when I quickly tested opening a new tab in a current browser which was already logged in, using the webbrowser module. Classically, webbrowser does not offer a read function, meaning that even though the page can be opened, the data on the page cannot be read into a pandas dataframe. This got me thinking I could use win32com, open a browser, log in, then run the rest of the script; but again, there is no general read ability on the Internet Explorer dispatch, meaning I can't send the information I want to pandas. I'm stumped. Any ideas?

I could acquire the necessary authentication token scripts, but I am sure that it would take a week or two before anything would happen on that front. I would obviously prefer to get something in the mean time while I wait for the actual auth scripts from the company.

Update: I received authentication tokens from the company; however, using them requires a Python package on another server I do not have access to, mostly because it's an oddity that I am using Python in my department. Thus the above still applies: I need a method for reading and manipulating an open browser.
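One avenue that may fit "read an already-open browser" (an assumption on my part, not something confirmed in this thread; verify it against your Chrome/chromedriver versions): Chrome can be started with remote debugging enabled, and Selenium can then attach to that already-open, already-logged-in instance and read pages from it.

# Assumption: Chrome was started manually first and you logged in there, e.g.:
#   chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-profile
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")

# Attaches to the running browser instead of launching a new one
driver = webdriver.Chrome(chrome_options=options)
html = driver.page_source  # rendered page content, e.g. for pandas.read_html(html)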

1 Answers

Answers 1

1) Start the browser with Selenium. 2) The script starts waiting for a certain element that tells it you have reached the required page. 3) You use this new browser window to log in to the page. 4) The script detects that you are logged in. 5) The script processes the page.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# start webdriver
chrome = webdriver.Chrome()

# initialise waiter with 300 seconds to wait.
waiter = WebDriverWait(chrome, 300)

# Will wait for the appearance of a #logout element.
# I assume it shows that you are logged in.
waiter.until(EC.presence_of_element_located((By.ID, "logout")))

# Extract data etc.

It might be easier to log in if you use the user's profile.

options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=FULL_PATH__TO_PROFILE")
chrome = webdriver.Chrome(chrome_options=options)

Maybe you should just try to get your page directly. Your session might have continued, so you might already be logged in:

chrome.get("https://your_page_here") 

Friday, October 27, 2017

How to locate the four elements using selenium in python


I am trying to post several parameters to this url and press 'submit' to download the generated csv file.

As the picture below shows, I think at least five steps are needed.

1 select 'proprietary' at 'availability'

2 pick 'EPIC exposures' instead of the default 'pointed observations'

3 click button 'submit' and wait one or two minutes

4 in the next page, when the search is done and all columns appear, press 'save table as csv'

For example, the first one:

find_element_by_class_name('ufa-gwt-DropdownList-TextBox').send_keys("proprietary")

find_element_by_xpath(".//*[@title='Observation Availability'][@type='text']").send_keys("Proprietary")

Its structure is like:

<input title="Observation Availability" style="width: 100%;" readonly="" class="ufa-gwt-DropdownList-TextBox" type="text"> 

[Image: the search form]

A new page appears after clicking 'submit'. Then clicking 'save table as' is needed, and clicking 'back to search' will go back.

3 Answers

Answers 1

Unfortunately, I don't think you're going to be able to do this via requests. As far as I can tell, there is no POST being made when you click "Submit". It appears as though all the data is being generated by JavaScript, which requests can't deal with.

You could try using something like Selenium to automate a browser (which can handle the JS) and then scrape data from there.
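A minimal sketch of that idea follows (the URL and the wait target are borrowed from Answer 3 below, which gives a complete working script):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import ui, expected_conditions as EC

driver = webdriver.Chrome()
driver.get('http://nxsa.esac.esa.int/nxsa-web/#search')  # JS-rendered search UI

# wait until the JS has rendered the search form, then scrape via the driver
ui.WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//td[text()="Observation and Proposal filters"]')))
html = driver.page_source  # rendered HTML, unlike requests' raw response
driver.quit()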

Answers 2

Try this. You will need to process the rest according to your needs; here is the gist part. It produces the below results:

import requests

url = "http://nxsa.esac.esa.int/nxsa-sl/servlet/observations-metadata?RESOURCE_CLASS=OBSERVATION&ADQLQUERY=SELECT%20DISTINCT%20OBSERVATION.OBSERVATION_OID,OBSERVATION.MOVING_TARGET,OBSERVATION.OBSERVATION_ID,EPIC_OBSERVATION_IMAGE.ICON,EPIC_OBSERVATION_IMAGE.ICON_PREVIEW,RGS_FLUXED_OBSERVATION_IMAGE.ICON,RGS_FLUXED_OBSERVATION_IMAGE.ICON_PREVIEW,EPIC_MOVING_TARGET_OBSERVATION_IMAGE.ICON,EPIC_MOVING_TARGET_OBSERVATION_IMAGE.ICON_PREVIEW,RGS_FLUXED_MOVING_TARGET_OBSERVATION_IMAGE.ICON,RGS_FLUXED_MOVING_TARGET_OBSERVATION_IMAGE.ICON_PREVIEW,OM_OBSERVATION_IMAGE.ICON_PREVIEW_V,OM_OBSERVATION_IMAGE.ICON_PREVIEW_B,OM_OBSERVATION_IMAGE.ICON_PREVIEW_L,OM_OBSERVATION_IMAGE.ICON_PREVIEW_U,OM_OBSERVATION_IMAGE.ICON_PREVIEW_M,OM_OBSERVATION_IMAGE.ICON_PREVIEW_S,OM_OBSERVATION_IMAGE.ICON_PREVIEW_W,OM_OBSERVATION_IMAGE.ICON_V,OM_OBSERVATION_IMAGE.ICON_B,OM_OBSERVATION_IMAGE.ICON_L,OM_OBSERVATION_IMAGE.ICON_U,OM_OBSERVATION_IMAGE.ICON_M,OM_OBSERVATION_IMAGE.ICON_S,OM_OBSERVATION_IMAGE.ICON_W,OBSERVATION.REVOLUTION,OBSERVATION.PROPRIETARY_END_DATE,OBSERVATION.RA_NOM,OBSERVATION.DEC_NOM,OBSERVATION.POSITION_ANGLE,OBSERVATION.START_UTC,OBSERVATION.END_UTC,OBSERVATION.DURATION,OBSERVATION.TARGET,PROPOSAL.TYPE,PROPOSAL.CATEGORY,PROPOSAL.AO,PROPOSAL.PI_FIRST_NAME,PROPOSAL.PI_SURNAME,TARGET_TYPE.DESCRIPTION,OBSERVATION.LII,OBSERVATION.BII,OBSERVATION.ODF_VERSION,OBSERVATION.PPS_VERSION,OBSERVATION.COORD_OBS,OBSERVATION.COORD_TYPE%20FROM%20FIELD_NOT_USED%20%20WHERE%20OBSERVATION.PROPRIETARY_END_DATE%3E%272017-10-18%27%20%20AND%20%20(PROPOSAL.TYPE=%27Calibration%27%20OR%20PROPOSAL.TYPE=%27Int%20Calibration%27%20OR%20PROPOSAL.TYPE=%27Co-Chandra%27%20OR%20PROPOSAL.TYPE=%27Co-ESO%27%20OR%20PROPOSAL.TYPE=%27GO%27%20OR%20PROPOSAL.TYPE=%27HST%27%20OR%20PROPOSAL.TYPE=%27Large%27%20OR%20PROPOSAL.TYPE=%27Large-Joint%27%20OR%20PROPOSAL.TYPE=%27Triggered%27%20OR%20PROPOSAL.TYPE=%27Target-Opportunity%27%20OR%20PROPOSAL.TYPE=%27TOO%27%20OR%20PROPOSAL.TYPE=%27Triggered-Joint%27)%20%20%20ORDER%20BY%20OBSERVATION.OBSERVATION_ID&PAGE=1&PAGE_SIZE=100&RETURN_TYPE=JSON"
res = requests.get(url)
data = res.json()
result = data['data']

for item in result:
    ID = item['OBSERVATION__OBSERVATION_ID']
    Surname = item['PROPOSAL__PI_SURNAME']
    Name = item['PROPOSAL__PI_FIRST_NAME']
    print(ID, Surname, Name)

Partial results (ID and Name):

0740071301 La Palombara Nicola
0741732601 Kaspi Victoria
0741732701 Kaspi Victoria
0741732801 Kaspi Victoria
0742150101 Grosso Nicolas
0742240801 Roberts Timothy

Btw, when you reach the target page you will notice two tabs there. These results are derived from the (OBSERVATIONS) tab. The link I used above can be found in the Chrome developer tools as well.

Answers 3

Since no one has posted a solution yet, here you go. You won't get far with requests, so selenium is your best choice here. If you want to use the below script without any modification, check that:

  • you are on linux or macos, or change dl_dir = '/tmp' to some directory you want
  • you have chromedriver installed, or change the driver to firefox in code (and adapt the download dir configuration according to what firefox wants)

Here is the environment tested with:

$ python -V
Python 3.5.3
$ chromedriver --version
ChromeDriver 2.33.506106 (8a06c39c4582fbfbab6966dbb1c38a9173bfb1a2)
$ pip list --format=freeze | grep selenium
selenium==3.6.0

I commented almost each and every line, so let the code do the talking:

import os
import time
from selenium import webdriver
from selenium.webdriver.common import by
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support import ui, expected_conditions as EC


def main():
    dl_dir = '/tmp'  # temporary download dir so I don't spam the real dl dir with csv files
    # check what files are downloaded before the scraping starts (will be explained later)
    csvs_old = {file for file in os.listdir(dl_dir) if file.startswith('NXSA-Results-') and file.endswith('.csv')}

    # I use chrome so check if you have chromedriver installed
    # pass custom dl dir to browser instance
    chrome_options = webdriver.ChromeOptions()
    prefs = {'download.default_directory' : '/tmp'}
    chrome_options.add_experimental_option('prefs', prefs)
    driver = webdriver.Chrome(chrome_options=chrome_options)
    # open page
    driver.get('http://nxsa.esac.esa.int/nxsa-web/#search')

    # wait for search ui to appear (abort after 10 secs)
    # once there, unfold the filters panel
    ui.WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((by.By.XPATH, '//td[text()="Observation and Proposal filters"]'))).click()
    # toggle observation availability dropdown
    driver.find_element_by_xpath('//input[@title="Observation Availability"]/../../td[2]/div/img').click()
    # wait until the dropdown elements are available, then click "proprietary"
    ui.WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((by.By.XPATH, '//div[text()="Proprietary" and @class="gwt-Label"]'))).click()
    # unfold display options panel
    driver.find_element_by_xpath('//td[text()="Display options"]').click()
    # deselect "pointed observations"
    driver.find_element_by_id('gwt-uid-241').click()
    # select "epic exposures"
    driver.find_element_by_id('gwt-uid-240').click()

    # uncomment if you want to go through the activated settings and verify them
    # when commented, the form is submitted immediately
    #time.sleep(5)

    # submit the form
    driver.find_element_by_xpath('//button/span[text()="Submit"]/../img').click()
    # wait until the results table has at least one row
    ui.WebDriverWait(driver, 10).until(EC.presence_of_element_located((by.By.XPATH, '//tr[@class="MPI"]')))
    # click on save
    driver.find_element_by_xpath('//span[text()="Save table as"]').click()
    # wait for dropdown with "CSV" entry to appear
    el = ui.WebDriverWait(driver, 10).until(EC.element_to_be_clickable((by.By.XPATH, '//a[@title="Save as CSV, Comma Separated Values"]')))
    # somehow, the clickability does not suffice - selenium still whines about the wrong element being clicked
    # as a dirty workaround, wait a fixed amount of time to let js finish ui update
    time.sleep(1)
    # click on "CSV" entry
    el.click()

    # now, selenium can't tell whether the file is being downloaded
    # we have to do it ourselves
    # this is a quick-and-dirty check that waits until a new csv file appears in the dl dir
    # replace with watchdogs or whatever
    dl_max_wait_time = 10  # secs
    seconds = 0
    while seconds < dl_max_wait_time:
        time.sleep(1)
        csvs_new = {file for file in os.listdir(dl_dir) if file.startswith('NXSA-Results-') and file.endswith('.csv')}
        if csvs_new - csvs_old:  # new file found in dl dir
            print('Downloaded file should be one of {}'.format([os.path.join(dl_dir, file) for file in csvs_new - csvs_old]))
            break
        seconds += 1

    # we're done, so close the browser
    driver.close()


# script entry point
if __name__ == '__main__':
    main()

If everything is fine, the script should output:

Downloaded file should be one of ['/tmp/NXSA-Results-1509061710475.csv'] 

Wednesday, January 4, 2017

How to get the next page after login with PhantomJS?


I have found so many questions about this on here, but not sure why they are not answered.

I am trying to crawl a web page after logging in, with this code (source):

var steps=[];
var testindex = 0;
var loadInProgress = false; //This is set to true when a page is still loading

/*********SETTINGS*********************/
var webPage = require('webpage');
var page = webPage.create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36';
page.settings.javascriptEnabled = true;
page.settings.loadImages = false; //Script is much faster with this field set to false
phantom.cookiesEnabled = true;
phantom.javascriptEnabled = true;
/*********SETTINGS END*****************/

console.log('All settings loaded, start with execution');
page.onConsoleMessage = function(msg) {
    console.log(msg);
};

/**********DEFINE STEPS THAT FANTOM SHOULD DO***********************/
steps = [
    //Step 1 - Open Amazon home page
    function(){
        console.log('Step 1 - Abrindo página de login');
        page.open("http://parceriascury.housecrm.com.br", function(status){

        });
    },
    //Step 3 - Populate and submit the login form
    function(){
        console.log('Step 3 - Preenchendo o form');
        page.evaluate(function(){
            document.getElementById("login").value="xxxxx";
            document.getElementById("senha").value="xxxxx";
            document.getElementById("frmlandingpage").submit();
        });
    },
    //Step 4 - Wait Amazon to login user. After user is successfully logged in, user is redirected to home page. Content of the home page is saved to AmazonLoggedIn.html. You can find this file where phantomjs.exe file is. You can open this file using Chrome to ensure that you are logged in.
    function(){
        console.log("Step 4 - Wait Amazon to login user. After user is successfully logged in, user is redirected to home page. Content of the home page is saved to AmazonLoggedIn.html. You can find this file where phantomjs.exe file is. You can open this file using Chrome to ensure that you are logged in.");

        var fs = require('fs');

        var result = page.evaluate(function() {
            return document.documentElement.outerHTML;
        });
        fs.write('C:\\phantomjs\\logado_cury_10.html',result,'w');
    },
];
/**********END STEPS THAT FANTOM SHOULD DO***********************/

//Execute steps one by one
interval = setInterval(executeRequestsStepByStep,5000);

function executeRequestsStepByStep(){
    if (loadInProgress == false && typeof steps[testindex] == "function") {
        //console.log("step " + (testindex + 1));
        steps[testindex]();
        testindex++;
    }
    if (typeof steps[testindex] != "function") {
        console.log("test complete!");
        phantom.exit();
    }
}

/**
 * These listeners are very important in order to phantom work properly. Using these listeners, we control loadInProgress marker which controls, weather a page is fully loaded.
 * Without this, we will get content of the page, even a page is not fully loaded.
 */
page.onLoadStarted = function() {
    loadInProgress = true;
    console.log('Loading started');
};
page.onLoadFinished = function() {
    loadInProgress = false;
    console.log('Loading finished');
};
page.onConsoleMessage = function(msg) {
    console.log(msg);
};

But the response is only this:

<html><head></head><body>ok</body></html> 

I need to get the content of the next page, at this URL:

http://parceriascury.housecrm.com.br/parceiro_busca 

I can access this page directly, but not with all of its content, because it requires being logged in.

There are no errors, and I don't know where I am making a mistake.

Edit: Other solutions are welcome. I think maybe curl... but it would have to run after the JS loading...
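If the login form does not depend on JavaScript, a plain HTTP client can do this without PhantomJS. Below is a minimal Python requests sketch, assuming the form posts the login and senha fields from the snippet above directly to the home page; both the POST target and the field names are assumptions, not verified against the site.

# Minimal sketch: log in with a cookie-preserving session, then fetch the
# protected page. The POST target and the field names are assumptions.
import requests

session = requests.Session()  # keeps session cookies between requests

# Submit the login form (assumed to post to the page itself)
session.post(
    "http://parceriascury.housecrm.com.br",
    data={"login": "xxxxx", "senha": "xxxxx"},
)

# The session now carries whatever cookies the login set,
# so the protected page can be requested directly.
resp = session.get("http://parceriascury.housecrm.com.br/parceiro_busca")
print(resp.text)

If the login happens via JavaScript/AJAX, this will not work, and a browser-driven tool such as PhantomJS is still the way to go.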

Sorry for my bad English.

1 Answer

Answers 1

This version of the code should work better:

var loadInProgress = false; // Set to true while a page is still loading

/*********SETTINGS*********************/
var page = require('webpage').create({
    viewportSize: { width: 1600, height: 900 },
    settings: {
        userAgent: 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        javascriptEnabled: 'true',
        loadImages: 'false'
    }
});
var fs = require('fs');
/*********SETTINGS END*****************/
console.log('All settings loaded, start with execution');

/**
 * These listeners are very important for PhantomJS to work properly. They
 * control the loadInProgress flag, which tracks whether a page is fully
 * loaded. Without them, we would get the content of the page even when it
 * is not fully loaded.
 */
page.onLoadStarted = function() {
    loadInProgress = true;
    console.log('Loading started');
};
page.onLoadFinished = function() {
    loadInProgress = false;
    console.log('Loading finished');
};
page.onConsoleMessage = function(msg) {
    console.log(msg);
};

// Log in to your account once in a normal browser, then inspect the cookies
// you got; you can reuse them here and the site will recognize you by them.
// The cookies below are an example for freebitco.in - replace them with the
// ones from your own site.
phantom.cookies = [{ // an array of objects
    'name'     : 'btc_address',
    'value'    : '1AuMxR6sPtB2Z6TkahSnpmm1H4KpYPBKqe',
    'domain'   : 'freebitco.in',
    'path'     : '/',
    'httponly' : false,
    'secure'   : true,
    'expires'  : (new Date()).getTime() + (1000 * 60 * 60 * 43800) // 5 years
}, {
    'name'     : 'password',
    'value'    : 'f574ca68a8650d1264d38da4b7687ca3bf631e6dfc59a98c89dd2564c7601f84',
    'domain'   : 'freebitco.in',
    'path'     : '/',
    'httponly' : false,
    'secure'   : true,
    'expires'  : (new Date()).getTime() + (1000 * 60 * 60 * 43800)
}];

// Open the target page directly; the cookies above keep us logged in.
page.open("http://parceriascury.housecrm.com.br/parceiro_busca", function(status) {
    console.log('Step 1 has been completed - we are on the target page!');
    setTimeout(step2, 5000); // Maybe we don't need to wait here and could run step2 immediately.
    function step2() {
        console.log("Step 2 - Saving the page content to logado_cury_10.html; open it in Chrome to confirm you are logged in.");
        var result = page.evaluate(function() {
            return document.documentElement.outerHTML;
        });
        fs.write('C:\\phantomjs\\logado_cury_10.html', result, 'w');
        phantom.exit();
    }
});
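The cookie-reuse idea in this answer also works outside PhantomJS. For comparison, here is a minimal Python requests sketch, assuming you have already copied a valid session cookie from a logged-in browser; the cookie name PHPSESSID is a placeholder, not taken from the actual site.

# Minimal sketch: reuse a cookie captured from a logged-in browser session.
# The cookie name and value below are placeholders.
import requests

cookies = {
    "PHPSESSID": "paste-value-from-your-logged-in-browser",  # placeholder
}

# With the right cookies attached, the protected page can be fetched directly.
resp = requests.get(
    "http://parceriascury.housecrm.com.br/parceiro_busca",
    cookies=cookies,
)
print(resp.text)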
Read More

Tuesday, March 15, 2016

How to handle reCaptcha on a third-party site in my client application

Leave a Comment

I was curious about how people build third-party apps for sites with NO public APIs, but I could not really find any tutorials on this topic. So I decided to just give it a try. I created a simple desktop application, which uses HttpClient to send GET requests to the site I frequently use, and then parses the response and displays the data in my WPF window. This approach worked pretty well (probably because the site is fairly simple).

However, today I tried to run my application from a different place, and I kept getting 403 errors in response to my application's requests. It turned out that the network I was using went through a VPN server, while the site I was trying to access used CloudFlare as a protection layer, which apparently forces VPN users to solve a reCaptcha in order to access the target site.

var baseAddress = new Uri("http://www.cloudflare.com");
using (var client = new HttpClient() { BaseAddress = baseAddress })
{
    var message = new HttpRequestMessage(HttpMethod.Get, "/");
    // this line returns the CloudFlare home page if I use a regular network,
    // and the reCaptcha page when I use a VPN
    var result = await client.SendAsync(message);
    // this line throws if I use a VPN (403 Forbidden)
    result.EnsureSuccessStatusCode();
}

Now the question is: what is the proper way to deal with CloudFlare protection in a client application? Do I have to display the reCaptcha in my application just like a web browser does? Do I have to set any particular headers to get a proper response instead of a 403? Any tips are welcome, as this is a completely new area for me.

P.S. I write in C# because this is the language I'm most comfortable with, but I don't mind answers using any other language as long as they answer the question.

2 Answers

Answers 1

I guess one way to go about it is to handle the captcha in a web browser, outside the client application.

  1. Parse the response to see if it is a captcha page (a Python sketch of this check follows the list).
  2. If it is, open this page in a browser.
  3. Let the user solve the captcha there.
  4. Fetch the CloudFlare cookies from the browser's cookie storage. You are going to need __cfduid (user ID) and cf_clearance (proof of solving the captcha).
  5. Attach those cookies to the requests sent by the client application.
  6. Use the application as normal for the next 24 hours (until the CloudFlare cookies expire).
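A rough Python sketch of steps 1 and 5 (using requests; the detection heuristic is an assumption, since CloudFlare's challenge markup varies, and the cookie values are placeholders):

# Sketch of steps 1 and 5: detect a probable CloudFlare captcha page,
# then retry with browser-captured clearance cookies attached.
import requests

def looks_like_captcha_page(response):
    # Heuristic only: challenge pages have tended to come back as 403
    # with a captcha form in the body. Adjust to what you actually see.
    body = response.text.lower()
    return response.status_code == 403 and "captcha" in body

session = requests.Session()
resp = session.get("http://www.cloudflare.com/")

if looks_like_captcha_page(resp):
    # Steps 2-4 happen outside this script: solve the captcha in a real
    # browser, then copy the two CloudFlare cookies from its storage.
    session.cookies.set("__cfduid", "copy-pasted-cookie-value", domain=".cloudflare.com")
    session.cookies.set("cf_clearance", "copy-pasted-cookie-value", domain=".cloudflare.com")
    resp = session.get("http://www.cloudflare.com/")  # retry, cookies attached

print(resp.status_code)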

Now the hard part here is (4). It's easy to manually copy-paste the cookies to make the code snippet from my question work over the VPN:

var baseAddress = new Uri("http://www.cloudflare.com");
var cookieContainer = new CookieContainer();
using (var client = new HttpClient(new HttpClientHandler() { CookieContainer = cookieContainer }, true) { BaseAddress = baseAddress })
{
    var message = new HttpRequestMessage(HttpMethod.Get, "/");
    // I've also copy-pasted all the headers from the browser;
    // some of those might be optional
    message.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0");
    message.Headers.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    message.Headers.Add("Accept-Encoding", "gzip, deflate");
    message.Headers.Add("Accept-Language", "en-US;q=0.5,en;q=0.3");
    // adding the CloudFlare cookies
    cookieContainer.Add(new Cookie("__cfduid", "copy-pasted-cookie-value", "/", "cloudflare.com"));
    cookieContainer.Add(new Cookie("cf_clearance", "copy-pasted-cookie-value", "/", "cloudflare.com"));
    var result = await client.SendAsync(message);
    result.EnsureSuccessStatusCode();
}

But I think it's going to be tricky to automate the process of fetching the cookies, because different browsers store cookies in different places and/or formats. Not to mention the fact that you need to use an external browser for this approach to work, which is really annoying. Still, it's something to consider.
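That said, for Firefox specifically, cookies live in a cookies.sqlite database inside the profile directory, so step 4 can be scripted. A Python sketch, assuming a known profile path (the path below is a placeholder) and the moz_cookies table layout, which can change between Firefox versions:

# Sketch: read the CloudFlare cookies out of a Firefox profile's cookies.sqlite.
# The profile path is a placeholder; the moz_cookies schema may vary by version.
import sqlite3

profile_db = "/path/to/firefox/profile/cookies.sqlite"  # placeholder path

conn = sqlite3.connect(profile_db)
rows = conn.execute(
    "SELECT name, value FROM moz_cookies "
    "WHERE host LIKE '%cloudflare.com' AND name IN ('__cfduid', 'cf_clearance')"
).fetchall()
conn.close()

cookies = dict(rows)  # e.g. {'__cfduid': '...', 'cf_clearance': '...'}
print(cookies)

Note that Firefox may lock the database while it is running, so you might have to copy the file first or close the browser.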

Answers 2

Answer to "build third-party apps for sites with NO public APIs" is that even though some Software Vendors don't have a public api's they have partner programs.

A good example is Netflix: they used to have a public API, and some of the apps developed while the public API was available were allowed to continue using it.

In your scenario, your client app acts as a web crawler (downloading HTML content and trying to parse information). What you are trying to do is crawl data behind CloudFlare that is not meant to be crawled by a third-party app (bot). From CloudFlare's side, they have done the correct thing by adding a captcha that prevents automated requests.

Further, if you try to send requests at a high frequency (many requests per second), and CloudFlare has threat-detection mechanisms in place, your IP address will be blocked. I assume they have already identified and blacklisted the VPN server's IP address you are using; that's why you are getting a 403.
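If you do end up crawling anyway, at least space your requests out. A trivial Python sketch with an arbitrary delay (the URLs are placeholders):

# Trivial throttling sketch: a fixed, arbitrary delay between requests.
import time
import requests

urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholders

for url in urls:
    resp = requests.get(url)
    # ... process resp here ...
    time.sleep(2)  # arbitrary delay to stay well under any rate limit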

Basically, you would be relying solely on security holes in the CloudFlare pages you try to access via the client app. This amounts to hacking CloudFlare (doing something CloudFlare has restricted), which I would not recommend.

If you have a cool idea, it is better to contact their developer team and discuss it with them.

Read More