Sunday, October 8, 2017

Selenium/python: extract text from a dynamically-loading webpage after every scroll


I'm using Selenium/python to automatically scroll down a social media website and scrape posts. I'm currently extracting all the text in one "hit" after scrolling a certain number of times (code below), but instead I want to extract just the newly-loaded text after each scroll.

For example, if the page initially contained the text "A, B, C", then after the first scroll it displayed "D, E, F", I'd want to store "A, B, C", then scroll, then store "D, E, F" etc.

The specific items I want to extract are the dates of the posts and the message text, which can be obtained with the CSS selectors '.message-date' and '.message-body', respectively (e.g., dates = driver.find_elements_by_css_selector('.message-date')).

Can anyone advise on how to extract just the newly-loaded text after each scroll?

Here's my current code (which extracts all the dates/messages after I finish scrolling):

from selenium import webdriver
import sys
import time
from selenium.webdriver.common.keys import Keys

# Load website to scrape
driver = webdriver.PhantomJS()
driver.get("https://stocktwits.com/symbol/USDJPY?q=%24USDjpy")

# Scroll the webpage
ScrollNumber = 3  # max scrolls
print(str(ScrollNumber) + " scrolldown will be done.")
for i in range(1, ScrollNumber):  # scroll down X times
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # Delay between 2 scrolls down to be sure the page loaded
    ## I WANT TO SAVE/STORE ANY NEWLY LOADED POSTS HERE RATHER
    ## THAN EXTRACTING IT ALL IN ONE GO AT THE END OF THE LOOP

# Extract messages and dates.
## I WANT TO EXTRACT THIS DATA ON THE FLY IN THE ABOVE
## LOOP RATHER THAN EXTRACTING IT HERE
dates = driver.find_elements_by_css_selector('.message-date')
messages = driver.find_elements_by_css_selector('.message-body')

6 Answers

Answer 1

You can store the number of posts already collected in a variable and use XPath with position() to get only the newly added posts.

dates = []
messages = []
num_of_posts = 0
for i in range(1, ScrollNumber):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    # Only pick up elements positioned after the posts we have already collected
    dates.extend(driver.find_elements_by_xpath('(//div[@class="message-date"])[position()>' + str(num_of_posts) + ']'))
    messages.extend(driver.find_elements_by_xpath('(//div[contains(@class, "message-body")])[position()>' + str(num_of_posts) + ']'))
    num_of_posts = len(dates)

Answer 2

I had the same issue with Facebook posts. My approach was to save each post's ID (or any value that is unique to the post, even a hash) in a list; then, on the next query, check whether that ID is already in the list before storing the post.

You can also remove the DOM nodes that have already been parsed, so that only the new ones remain.
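As a rough sketch of that idea (not the original poster's code): keep a set of keys for posts already stored, and after every scroll store only posts whose key is new. Since I don't know which attribute of a StockTwits post carries a stable ID, the sketch below uses an MD5 hash of the message text as the key and reuses the '.message-date' / '.message-body' selectors from the question.

import time
from hashlib import md5

from selenium import webdriver

driver = webdriver.PhantomJS()  # as in the question; any WebDriver works
driver.get("https://stocktwits.com/symbol/USDJPY?q=%24USDjpy")

seen = set()   # keys of posts we have already stored
posts = []     # (date, message) tuples, in the order they were scraped

for _ in range(3):  # number of scrolls
    dates = driver.find_elements_by_css_selector('.message-date')
    messages = driver.find_elements_by_css_selector('.message-body')

    for date_el, msg_el in zip(dates, messages):
        # A hash of the message text stands in for a unique post ID
        key = md5(msg_el.text.encode('utf-8')).hexdigest()
        if key in seen:
            continue  # already stored on a previous pass
        seen.add(key)
        posts.append((date_el.text, msg_el.text))

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # crude wait for the new posts to load

driver.quit()

Dropping already-processed nodes from the DOM with driver.execute_script, as suggested above, would additionally keep these element queries from slowing down as the page grows.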

Answer 3

As others have said, if you can do what you need by hitting the API directly, that's your best bet. If you absolutely must use Selenium, see my solution below.

I do something similar to the below for my needs.

  • I'm leveraging the :nth-child() aspect of CSS selectors to find elements individually as they load.
  • I'm also using selenium's explicit wait functionality (via the explicit package, pip install explicit) to efficiently wait for elements to load.

The script is quite fast (no calls to sleep()); however, the webpage itself has so much going on in the background that it often takes a while for Selenium to return control to the script.

from __future__ import print_function

from itertools import count
import sys
import time

from explicit import waiter, CSS
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait as Wait


# The CSS selectors we will use
POSTS_BASE_CSS = 'ol.stream-list > li'              # All li elements
POST_BASE_CSS = POSTS_BASE_CSS + ":nth-child({0})"  # li child element at index {0}
POST_DATE_CSS = POST_BASE_CSS + ' div.message-date'  # li child element at {0} with div.message-date
POST_BODY_CSS = POST_BASE_CSS + ' div.message-body'  # li child element at {0} with div.message-body


class Post(object):
    def __init__(self, driver, post_index):
        self.driver = driver
        self.date_css = POST_DATE_CSS.format(post_index)
        self.text_css = POST_BODY_CSS.format(post_index)

    @property
    def date(self):
        return waiter.find_element(self.driver, self.date_css, CSS).text

    @property
    def text(self):
        return waiter.find_element(self.driver, self.text_css, CSS).text


def get_posts(driver, url, max_screen_scrolls):
    """ Post object generator """
    driver.get(url)
    screen_scroll_count = 0

    # Wait for the initial posts to load:
    waiter.find_elements(driver, POSTS_BASE_CSS, CSS)

    for index in count(1):
        # Evaluate if we need to scroll the screen, or exit the generator.
        # If there is no element at this index, it means we need to scroll the screen.
        if len(driver.find_elements_by_css_selector('ol.stream-list > :nth-child({0})'.format(index))) == 0:
            if screen_scroll_count >= max_screen_scrolls:
                # Break if we have already done the max scrolls
                break

            # Get count of total posts on page
            post_count = len(waiter.find_elements(driver, POSTS_BASE_CSS, CSS))

            # Scroll down
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            screen_scroll_count += 1

            def posts_load(driver):
                """ Custom explicit wait function; waits for more posts to load in """
                return len(waiter.find_elements(driver, POSTS_BASE_CSS, CSS)) > post_count

            # Wait until new posts load in
            Wait(driver, 20).until(posts_load)

        # The list elements have sponsored ads and scripts mixed in with the posts we
        # want to scrape. Check if they have a div.message-date element and continue on
        # if not.
        includes_date_css = POST_DATE_CSS.format(index)
        if len(driver.find_elements_by_css_selector(includes_date_css)) == 0:
            continue

        yield Post(driver, index)


def main():
    url = "https://stocktwits.com/symbol/USDJPY?q=%24USDjpy"
    max_screen_scrolls = 4
    driver = webdriver.Chrome()
    try:
        for post_num, post in enumerate(get_posts(driver, url, max_screen_scrolls), 1):
            print("*" * 40)
            print("Post #{0}".format(post_num))
            print("\nDate: {0}".format(post.date))
            print("Text: {0}\n".format(post.text[:34]))

    finally:
        driver.quit()  # Use try/finally to make sure the driver is closed


if __name__ == "__main__":
    main()

Full disclosure: I'm the creator of the explicit package. You could easily rewrite the above using explicit waits directly, at the expense of readability.
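For instance, the waiter.find_element calls above could be replaced with a plain WebDriverWait helper along these lines (a sketch only; I'm assuming the explicit package wraps roughly this pattern internally):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def find_element(driver, css_selector, timeout=30):
    """Wait up to `timeout` seconds for an element matching the CSS selector, then return it."""
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )

# e.g. the Post.date property could then read:
#     return find_element(self.driver, self.date_css).text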

Answer 4

This does exactly what you want. However, I wouldn't scrape the site this way: it will just get slower and slower the longer it runs, and RAM usage will spiral out of control too.

import time
from hashlib import md5

import selenium.webdriver.support.ui as ui
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

URL = 'https://stocktwits.com/symbol/USDJPY?q=%24USDjpy'
CSS = By.CSS_SELECTOR

driver = webdriver.PhantomJS()  # any WebDriver works; the question used PhantomJS
driver.get(URL)


def scrape_for(max_seconds=300):
    found = set()
    end_at = time.time() + max_seconds
    wait = ui.WebDriverWait(driver, 5, 0.5)

    while True:
        # find elements
        elms = driver.find_elements(CSS, 'li.messageli')

        for li in elms:
            # get the information we need about each post
            text = li.find_element(CSS, 'div.message-content')
            key = md5(text.text.encode('ascii', 'ignore')).hexdigest()

            if key in found:
                continue

            found.add(key)

            try:
                date = li.find_element(CSS, 'div.message-date').text
            except NoSuchElementException:
                date = None

            yield text.text, date

        if time.time() > end_at:
            return  # end the generator (raising StopIteration is an error in Python 3.7+)

        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        wait.until(EC.invisibility_of_element_located(
                       (CSS, 'div#more-button-loading')))


for twit in scrape_for(60):
    print(twit)

driver.quit()

Answer 5

Just sleep after scrolling. To make Selenium behave like a real browser session, you need to wait until the page has loaded the new content. I recommend doing this with Selenium's wait functions, or simply adding a sleep() call to your code so the content has time to load.

time.sleep(5) 
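For the explicit-wait route, a minimal sketch might look like the following; it assumes the 'li.messageli' selector used in Answer 4 and simply waits for the post count to grow after a scroll:

from selenium.webdriver.support.ui import WebDriverWait

# Count posts, scroll, then block until more posts than before are present
count_before = len(driver.find_elements_by_css_selector('li.messageli'))
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements_by_css_selector('li.messageli')) > count_before
)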

Answer 6

This isn't really what you asked for (it isn't a Selenium solution); it's a solution using the API that the page calls in the background. In my opinion, using Selenium instead of the API is overkill.

Here's a script using the API:

import re

import requests

STREAM_ID_REGEX = re.compile(r"data-stream='symbol-(?P<id>\d+)'")
CSRF_TOKEN_REGEX = re.compile(r'<meta name="csrf-token" content="(?P<csrf>[^"]+)" />')

URL = 'https://stocktwits.com/symbol/USDJPY?q=%24USDjpy'
API_URL = 'https://stocktwits.com/streams/poll?stream=symbol&substream=top&stream_id={stream_id}&item_id={stream_id}'


def get_necessary_info():
    res = requests.get(URL)

    # Extract stream_id
    match = STREAM_ID_REGEX.search(res.text)
    stream_id = match.group('id')

    # Extract CSRF token
    match = CSRF_TOKEN_REGEX.search(res.text)
    csrf_token = match.group('csrf')

    return stream_id, csrf_token


def get_messages(stream_id, csrf_token, max_messages=100):
    base_url = API_URL.format(stream_id=stream_id)

    # Required headers
    headers = {
        'x-csrf-token': csrf_token,
        'x-requested-with': 'XMLHttpRequest',
    }

    messages = []
    more = True
    max_value = None
    while more:
        # Pagination
        if max_value:
            url = '{}&max={}'.format(base_url, max_value)
        else:
            url = base_url

        # Get JSON response
        res = requests.get(url, headers=headers)
        data = res.json()

        # Add returned messages
        messages.extend(data['messages'])

        # Check if there are more messages
        more = data['more']
        if more:
            max_value = data['max']

        # Check if we have enough messages
        if len(messages) >= max_messages:
            break

    return messages


def main():
    stream_id, csrf_token = get_necessary_info()
    messages = get_messages(stream_id, csrf_token)

    for message in messages:
        print(message['created_at'], message['body'])


if __name__ == '__main__':
    main()

And the first lines of the output:

Tue, 03 Oct 2017 03:54:29 -0000 $USDJPY (113.170) Daily Signal remains LONG from 109.600 on 12/09. SAR point now at 112.430. website for details
Tue, 03 Oct 2017 03:33:02 -0000 JPY: Selling JPY Via Long $USDJPY Or Long $CADJPY Still Attractive  - SocGen https://www.efxnews.com/story/37129/jpy-selling-jpy-long-usdjpy-or-long-cadjpy-still-attractive-socgen#.WdMEqnCGMCc.twitter
Tue, 03 Oct 2017 01:05:06 -0000 $USDJPY buy signal on 03 OCT 2017 01:00 AM UTC by AdMACD Trading System (Timeframe=H1) http://www.cosmos4u.net/index.php/forex/usdjpy/usdjpy-buy-signal-03-oct-2017-01-00-am-utc-by-admacd-trading-system-timeframe-h1 #USDJPY #Forex
Tue, 03 Oct 2017 00:48:46 -0000 $EURUSD nice H&S to take price lower on $USDJPY just waiting 4 inner trendline to break. built up a lot of strength
Tue, 03 Oct 2017 00:17:13 -0000 $USDJPY The Instrument reached the 100% from lows at 113.25 and sold in 3 waves, still can see 114.14 area.#elliottwave $USDX
...