Saturday, March 24, 2018

Trouble using lambda function within my scraper

I've written a script to parse the name and price of certain items from craigslist. The XPath expressions I've defined within my scraper work. When I scrape the items the usual way and wrap the lookup in a try/except block, I can avoid the IndexError raised when an item has no price. I also tried a custom function, and that worked as well.

However, in the snippet below I would like to use a lambda function to get rid of the IndexError. I tried but could not succeed.

Btw, when I run the code, it neither fetches anything nor throws any error.

import requests
from lxml.html import fromstring

page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = fromstring(page)

# I wish to fix this function to make a go
get_val = lambda item,path:item.text if item.xpath(path) else ""

for item in tree.xpath('//li[@class="result-row"]'):
    link = get_val(item,'.//a[contains(@class,"hdrlnk")]')
    price = get_val(item,'.//span[@class="result-price"]')
    print(link,price)

1 Answer


First of all, your lambda function get_val returns the text of the item if the path exists, and not the text of the matched node. This is probably not what you want. If you want to return the text content of the (first) element matching the path, you should write:

get_val = lambda item, path: item.xpath(path)[0].text if item.xpath(path) else "" 

Please note that xpath returns a list. I assume here that you have only one element in that list.

The output is something like this:

...
Residential Plot @ Sarjapur Check Post ₨1000
Prestige dolce vita apartments in whitefield, Bangalore
Brigade Golden Triangle, ₨12500000
Nikoo Homes, ₨6900000
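To see why the original version printed only blanks, here is a minimal sketch on a made-up two-row fragment. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs offline and without lxml; lxml elements expose the same .text attribute, and item.xpath(path) would take the place of item.findall(path). The markup, the sample values and the name broken are assumptions for illustration, not real craigslist data.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup loosely shaped like a result list (assumption).
SAMPLE = """
<ul>
  <li class="result-row">
    <a class="hdrlnk" href="/one.html">Plot A</a>
    <span class="result-price">1000</span>
  </li>
  <li class="result-row">
    <a class="hdrlnk" href="/two.html">Flat B</a>
  </li>
</ul>
"""

tree = ET.fromstring(SAMPLE.strip())
rows = tree.findall(".//li[@class='result-row']")

# Logic of the original lambda: it returns the text of the row itself,
# which is only the whitespace before the row's first child element.
broken = lambda item, path: item.text if item.findall(path) else ""

# Corrected logic: text of the first matched node, or "" when nothing matches.
def get_val(item, path):
    nodes = item.findall(path)  # with lxml this would be item.xpath(path)
    return nodes[0].text if nodes else ""

PRICE = ".//span[@class='result-price']"
for row in rows:
    print(repr(broken(row, PRICE)), repr(get_val(row, PRICE)))
```

For the first row, the original logic yields a whitespace-only string while the corrected helper yields the price; for the second row, which has no price, both yield an empty string and no IndexError is raised. Storing the match list in nodes also avoids evaluating the path twice.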

But I think you want a link, not the text. If this is the case, read below.

OK, how do you get a link? When you have an anchor a, you get its href (the link) from its dictionary of attributes: a.attrib["href"].
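As a small illustration of attribute access, the sketch below parses a single made-up anchor with the standard library's xml.etree.ElementTree, which is used here only so the example runs without lxml; lxml elements offer the same .attrib mapping and .get() method. The href value and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# A hypothetical anchor element (the href is made up for illustration).
a = ET.fromstring('<a class="hdrlnk" href="/reb/d/example/1.html">Example listing</a>')

href = a.attrib["href"]     # raises KeyError if the attribute is absent
text = a.text               # the text inside the anchor
title = a.get("title", "")  # .get() returns a default instead of raising

print(href, text, repr(title))
```

Using a.get("href", "") instead of a.attrib["href"] may be preferable when some anchors could lack the attribute.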

So as I understand it, for the price you want the text, but for the anchor you want the value of one specific attribute, href. Here's the real use of lambdas. Rewrite your function like this:

def get_val(item, path, l):
    return l(item.xpath(path)[0]) if item.xpath(path) else ""

The parameter l is a function that is applied to the node. l may return the text of the node, or the href of an anchor:

link = get_val(item,'.//a[contains(@class,"hdrlnk")]', lambda n: n.attrib["href"])
price = get_val(item,'.//span[@class="result-price"]', lambda n: n.text)

Now the output is:

...
https://bangalore.craigslist.co.in/reb/d/residential-plot-sarjapur/6522786441.html ₨1000
https://bangalore.craigslist.co.in/reb/d/prestige-dolce-vita/6522754197.html
https://bangalore.craigslist.co.in/reb/d/brigade-golden-triangle/6522687904.html ₨12500000
https://bangalore.craigslist.co.in/reb/d/nikoo-homes/6522687772.html ₨6900000
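The extractor-passing idea can be tried offline with the sketch below. It is a variation, not the exact code above: the path is evaluated only once, the parameters extract and default are hypothetical names, and the standard library's xml.etree.ElementTree stands in for lxml, so item.findall(path) replaces item.xpath(path). ElementTree's path syntax has no contains() function, so the class is matched exactly here. The markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the second row deliberately has no price.
SAMPLE = """
<ul>
  <li class="result-row">
    <a class="hdrlnk" href="/one.html">Plot A</a>
    <span class="result-price">1000</span>
  </li>
  <li class="result-row">
    <a class="hdrlnk" href="/two.html">Flat B</a>
  </li>
</ul>
"""

def get_val(item, path, extract, default=""):
    """Apply extract to the first node matching path, or return default."""
    nodes = item.findall(path)  # item.xpath(path) when using lxml
    return extract(nodes[0]) if nodes else default

tree = ET.fromstring(SAMPLE.strip())
results = []
for row in tree.findall(".//li[@class='result-row']"):
    # Each call passes a lambda that decides what to pull from the node.
    link = get_val(row, ".//a[@class='hdrlnk']", lambda n: n.attrib["href"])
    price = get_val(row, ".//span[@class='result-price']", lambda n: n.text)
    results.append((link, price))

print(results)
```

On this sample the result is [('/one.html', '1000'), ('/two.html', '')]: the row without a price falls back to the default rather than raising IndexError.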