Monday, March 19, 2018

Proxy Pooling System for Scrapy to temporarily stop using slow/timing out proxies

I've been looking around trying to find a decent pooling system for Scrapy but I can't find anything that has everything I need/want.

I'm looking for a solution to:

Rotate proxies

  • I'd like to switch randomly between proxies but never select the same proxy twice in a row. (Scrapoxy has this)
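
The rotation requirement above can be sketched in a few lines of plain Python. This is only an illustration: `pick_proxy` is a hypothetical helper, and the proxy URLs are placeholders rather than real servers.

```python
import random


def pick_proxy(proxies, previous=None):
    """Return a random proxy, avoiding `previous` when another choice exists."""
    candidates = [p for p in proxies if p != previous]
    # With a single-proxy pool there is no alternative, so fall back to it.
    return random.choice(candidates or proxies)


# Placeholder addresses, for illustration only.
proxies = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]

previous = None
for _ in range(5):
    previous = pick_proxy(proxies, previous)
    print(previous)
```

Filtering out the previous proxy before calling `random.choice` keeps the selection uniform over the remaining proxies, which a "shuffle and retry" loop does not guarantee.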

Impersonate Known Browsers

  • Impersonate Chrome, Firefox, Internet Explorer, Edge, Safari... etc (Scrapoxy has this)
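
One common way to approach this in Scrapy is a downloader middleware that sets the `User-Agent` header per request. The sketch below is an assumption about how that could look, not an existing package: `RandomUserAgentMiddleware` is a hypothetical name, and the User-Agent strings are illustrative samples that would need to be kept current.

```python
import random

# Illustrative User-Agent strings; a real deployment should keep these up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/604.5.6 "
    "(KHTML, like Gecko) Version/11.0.3 Safari/604.5.6",
]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: random User-Agent on every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

To take effect, a class like this would have to be registered in the project's `DOWNLOADER_MIDDLEWARES` setting.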

Blacklist Slow Proxies

  • If the proxy times out or is slow it should be blacklisted through a series of rules... (Scrapoxy only has blacklisting for number of instances / startups)

  • If a proxy is slow (takes over x time) it should be marked as Slow and a timestamp should be taken and a counter should be increased.

  • If a proxy times out it should be marked as Fail and a timestamp should be taken and a counter should be increased.
  • If a proxy has no slows for 15 minutes after receiving its last slow then the counter & timestamp should be zeroed and the proxy returns to a fresh state.
  • If a proxy has no fails for 30 minutes after receiving its last fail then the counter & timestamp should be zeroed and the proxy returns to a fresh state.
  • If a proxy is slow 5 times in 1 hour then it should be removed from the pool for 1 hour.
  • If a proxy times out 5 times in 1 hour then it should be blacklisted for 1 hour.
  • If a proxy gets blocked twice in 3 hours it should be blacklisted for 12 hours and marked as bad.
  • If a proxy gets marked as bad twice in 48 hours then it should notify me (email, Pushbullet... anything).

Does anyone know of such a solution? The main feature I need is the blacklisting of slow or timed-out proxies.

1 Answer

As your pooling rules are very specific, you may want to code your own. The code below implements part of your rules (you will have to implement the others yourself):

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import time
from random import shuffle

import pexpect


# Test a single proxy by opening a telnet connection to it.
def test_proxy(ip, port, max_timeout=1):
    child = pexpect.spawn("telnet " + ip + " " + str(port))
    time_send_request = time.time()
    time_to_answer = max_timeout
    try:
        # max_timeout is in seconds
        i = child.expect(["Connected to", "Connection refused"], timeout=max_timeout)
        time_to_answer = time.time() - time_send_request
    except (pexpect.TIMEOUT, pexpect.EOF):
        # EOF covers telnet exiting with some other error message
        i = -1
    finally:
        child.close()  # do not leave telnet processes behind
    if i == 0:
        return {"status": True, "time_to_answer": time_to_answer}
    return {"status": False, "time_to_answer": max_timeout}


# Test every proxy in the list, update its status and apply your custom rules.
def update_proxy_list_status(proxy_list):
    for i, proxy in enumerate(proxy_list):
        print("testing proxy " + str(i) + " " + proxy["ip"] + ":" + str(proxy["port"]))
        proxy_status = test_proxy(proxy["ip"], proxy["port"])
        proxy["status_ok"] = proxy_status["status"]
        print(proxy_status)

        # Here it is time to apply your own rules to update the proxy dict:
        #~ If a proxy is slow (takes over x time) it should be marked as Slow and a timestamp should be taken and a counter should be increased.
        #~ If a proxy times out it should be marked as Fail and a timestamp should be taken and a counter should be increased.
        #~ If a proxy has no slows for 15 minutes after receiving its last slow then the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy has no fails for 30 minutes after receiving its last fail then the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy is slow 5 times in 1 hour then it should be removed from the pool for 1 hour.
        #~ If a proxy times out 5 times in 1 hour then it should be blacklisted for 1 hour.
        #~ If a proxy gets blocked twice in 3 hours it should be blacklisted for 12 hours and marked as bad.
        #~ If a proxy gets marked as bad twice in 48 hours then it should notify me (email, Pushbullet... anything).

        if proxy_status["status"]:
            # modify the proxy dict with your own rules (adding timestamp, last check time, last down, last up etc...)
            pass
        else:
            # modify the proxy dict with your own rules (adding timestamp, last check time, last down, last up etc...)
            pass

    return proxy_list


# Select a good proxy and do the job.
def main():
    # First populate a proxy list. These example proxies come from http://free-proxy.cz/en/
    proxy_list = [
        {"ip": "167.99.2.12", "port": 8080},  # bad proxy
        {"ip": "167.99.2.17", "port": 8080},
        {"ip": "66.70.160.171", "port": 1080},
        {"ip": "192.99.220.151", "port": 8080},
        {"ip": "142.44.137.222", "port": 80},
        # [...]
    ]

    # Keeps track of the last used proxy (to avoid using the same one twice in a row).
    previous_proxy_ip = ""

    the_job = True
    while the_job:
        # Update the status of each proxy.
        proxy_list = update_proxy_list_status(proxy_list)

        # Keep only the proxies considered OK.
        good_proxy_list = [d for d in proxy_list if d["status_ok"]]

        # Shuffle so that the selection is random.
        shuffle(good_proxy_list)

        # Select a proxy different from the previous one.
        # current_proxy stays empty if no such proxy is available.
        current_proxy = {}
        for candidate in good_proxy_list:
            if candidate["ip"] != previous_proxy_ip:
                previous_proxy_ip = candidate["ip"]
                current_proxy = candidate
                break

        # Use the selected proxy to do the job.
        print("the current proxy is: " + str(current_proxy))

        # UPDATE SCRAPY PROXY

        # DO THE SCRAPY JOB
        print("DO MY SCRAPY JOB with the current proxy settings")

        # Wait a few seconds.
        time.sleep(5)


if __name__ == "__main__":
    main()
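
The "UPDATE SCRAPY PROXY" step is left open in the code above. One way to fill it, sketched here under assumptions, is a downloader middleware that writes the chosen proxy into `request.meta["proxy"]`, which Scrapy's built-in `HttpProxyMiddleware` honours. `PoolProxyMiddleware` is a hypothetical name, it expects the same proxy dicts as the answer (with `status_ok` already filled in), and how the list reaches the middleware (for example through a `from_crawler` class method) is left out.

```python
import random


class PoolProxyMiddleware:
    """Hypothetical downloader middleware that assigns a pooled proxy per request."""

    def __init__(self, proxy_list):
        self.proxy_list = proxy_list  # dicts with "ip", "port" and "status_ok"
        self.previous_ip = ""

    def process_request(self, request, spider):
        # Healthy proxies only, excluding the one used for the previous request.
        good = [
            p for p in self.proxy_list
            if p.get("status_ok") and p["ip"] != self.previous_ip
        ]
        if not good:
            # No alternative proxy available: the request is left untouched.
            return None
        proxy = random.choice(good)
        self.previous_ip = proxy["ip"]
        # Assumes plain HTTP proxies; adjust the scheme for other proxy types.
        request.meta["proxy"] = "http://%s:%d" % (proxy["ip"], proxy["port"])
        return None
```

With this approach the health checks can run separately and simply update `status_ok` in the shared list, instead of re-testing every proxy inside the scraping loop.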