Run Scrapy from Flask

I have this folder structure:

app.py          # flask app
app/
    datafoo/
        scrapy.cfg
        crawler.py
        blogs/
            pipelines.py
            settings.py
            middlewares.py
            items.py
            spiders/
                allmusic_feed.py
                allmusic_data/
                    delicate_tracks.jl

scrapy.cfg:

[settings]
default = blogs.settings

allmusic_feed.py:

import scrapy

from blogs.items import AllMusicItem


class AllMusicDelicateTracks(scrapy.Spider):  # one amongst many spiders
    name = "allmusic_delicate_tracks"
    allowed_domains = ["allmusic.com"]
    start_urls = [
        "http://web.archive.org/web/20160813101056/http://www.allmusic.com/mood/delicate-xa0000000972/songs",
    ]

    def parse(self, response):
        for sel in response.xpath('//tr'):
            item = AllMusicItem()
            item['artist'] = sel.xpath('.//td[@class="performer"]/a/text()').extract_first()
            item['track'] = sel.xpath('.//td[@class="title"]/a/text()').extract_first()
            yield item

crawler.py:

import json

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def blog_crawler(self, mood):
    item, jl = mood  # item = spider name, jl = output feed file
    process = CrawlerProcess(get_project_settings())
    process.crawl(item, domain='allmusic.com')
    process.start()

    allmusic = []
    allmusic_tracks = []
    allmusic_artists = []
    try:
        # jl is the file where crawled data is stored
        with open(jl, 'r+') as t:
            for line in t:
                allmusic.append(json.loads(line))
    except Exception as e:
        print(e, 'try another mood')

    for item in allmusic:
        allmusic_artists.append(item['artist'])
        allmusic_tracks.append(item['track'])
    return zip(allmusic_tracks, allmusic_artists)

app.py:

@app.route('/tracks', methods=['GET', 'POST'])
def tracks(name):
    from app.datafoo import crawler

    c = crawler()
    mood = ['allmusic_delicate_tracks', 'blogs/spiders/allmusic_data/delicate_tracks.jl']
    results = c.blog_crawler(mood)
    return results

If I simply run the app with python app.py, I get the following error:

ValueError: signal only works in main thread 

When I run the app with gunicorn -c gconfig.py app:app --log-level=debug --threads 2, it just hangs there:

127.0.0.1 - - [29/Jan/2018:03:40:36 -0200] "GET /tracks HTTP/1.1" 500 291 "http://127.0.0.1:8080/menu" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36" 

Lastly, running with gunicorn -c gconfig.py app:app --log-level=debug --threads 2 --error-logfile server.log, I get:

server.log

[2018-01-30 13:41:39 -0200] [4580] [DEBUG] Current configuration:
  proxy_protocol: False
  worker_connections: 1000
  statsd_host: None
  max_requests_jitter: 0
  post_fork: <function post_fork at 0x1027da848>
  errorlog: server.log
  enable_stdio_inheritance: False
  worker_class: sync
  ssl_version: 2
  suppress_ragged_eofs: True
  syslog: False
  syslog_facility: user
  when_ready: <function when_ready at 0x1027da9b0>
  pre_fork: <function pre_fork at 0x1027da938>
  cert_reqs: 0
  preload_app: False
  keepalive: 5
  accesslog: -
  group: 20
  graceful_timeout: 30
  do_handshake_on_connect: False
  spew: False
  workers: 16
  proc_name: None
  sendfile: None
  pidfile: None
  umask: 0
  on_reload: <function on_reload at 0x10285c2a8>
  pre_exec: <function pre_exec at 0x1027da8c0>
  worker_tmp_dir: None
  limit_request_fields: 100
  pythonpath: None
  on_exit: <function on_exit at 0x102861500>
  config: gconfig.py
  logconfig: None
  check_config: False
  statsd_prefix:
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  reload_engine: auto
  proxy_allow_ips: ['127.0.0.1']
  pre_request: <function pre_request at 0x10285cde8>
  post_request: <function post_request at 0x10285ced8>
  forwarded_allow_ips: ['127.0.0.1']
  worker_int: <function worker_int at 0x1027daa28>
  raw_paste_global_conf: []
  threads: 2
  max_requests: 0
  chdir: /Users/me/Documents/Code/Apps/app
  daemon: False
  user: 501
  limit_request_line: 4094
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  certfile: None
  on_starting: <function on_starting at 0x10285c140>
  post_worker_init: <function post_worker_init at 0x10285c848>
  child_exit: <function child_exit at 0x1028610c8>
  worker_exit: <function worker_exit at 0x102861230>
  paste: None
  default_proc_name: app:app
  syslog_addr: unix:///var/run/syslog
  syslog_prefix: None
  ciphers: TLSv1
  worker_abort: <function worker_abort at 0x1027daaa0>
  loglevel: debug
  bind: ['127.0.0.1:8080']
  raw_env: []
  initgroups: False
  capture_output: False
  reload: False
  limit_request_field_size: 8190
  nworkers_changed: <function nworkers_changed at 0x102861398>
  timeout: 120
  keyfile: None
  ca_certs: None
  tmp_upload_dir: None
  backlog: 2048
  logger_class: gunicorn.glogging.Logger
[2018-01-30 13:41:39 -0200] [4580] [INFO] Starting gunicorn 19.7.1
[2018-01-30 13:41:39 -0200] [4580] [DEBUG] Arbiter booted
[2018-01-30 13:41:39 -0200] [4580] [INFO] Listening at: http://127.0.0.1:8080 (4580)
[2018-01-30 13:41:39 -0200] [4580] [INFO] Using worker: threads
[2018-01-30 13:41:39 -0200] [4580] [INFO] Server is ready. Spawning workers
[2018-01-30 13:41:39 -0200] [4583] [INFO] Booting worker with pid: 4583
[2018-01-30 13:41:39 -0200] [4583] [INFO] Worker spawned (pid: 4583)
[2018-01-30 13:41:39 -0200] [4584] [INFO] Booting worker with pid: 4584
[2018-01-30 13:41:39 -0200] [4584] [INFO] Worker spawned (pid: 4584)
[2018-01-30 13:41:39 -0200] [4585] [INFO] Booting worker with pid: 4585
[2018-01-30 13:41:39 -0200] [4585] [INFO] Worker spawned (pid: 4585)
[2018-01-30 13:41:40 -0200] [4586] [INFO] Booting worker with pid: 4586
[2018-01-30 13:41:40 -0200] [4586] [INFO] Worker spawned (pid: 4586)
[2018-01-30 13:41:40 -0200] [4587] [INFO] Booting worker with pid: 4587
[2018-01-30 13:41:40 -0200] [4587] [INFO] Worker spawned (pid: 4587)
[2018-01-30 13:41:40 -0200] [4588] [INFO] Booting worker with pid: 4588
[2018-01-30 13:41:40 -0200] [4588] [INFO] Worker spawned (pid: 4588)
[2018-01-30 13:41:40 -0200] [4589] [INFO] Booting worker with pid: 4589
[2018-01-30 13:41:40 -0200] [4589] [INFO] Worker spawned (pid: 4589)
[2018-01-30 13:41:40 -0200] [4590] [INFO] Booting worker with pid: 4590
[2018-01-30 13:41:40 -0200] [4590] [INFO] Worker spawned (pid: 4590)
[2018-01-30 13:41:40 -0200] [4591] [INFO] Booting worker with pid: 4591
[2018-01-30 13:41:40 -0200] [4591] [INFO] Worker spawned (pid: 4591)
[2018-01-30 13:41:40 -0200] [4592] [INFO] Booting worker with pid: 4592
[2018-01-30 13:41:40 -0200] [4592] [INFO] Worker spawned (pid: 4592)
[2018-01-30 13:41:40 -0200] [4595] [INFO] Booting worker with pid: 4595
[2018-01-30 13:41:40 -0200] [4595] [INFO] Worker spawned (pid: 4595)
[2018-01-30 13:41:40 -0200] [4596] [INFO] Booting worker with pid: 4596
[2018-01-30 13:41:40 -0200] [4596] [INFO] Worker spawned (pid: 4596)
[2018-01-30 13:41:40 -0200] [4597] [INFO] Booting worker with pid: 4597
[2018-01-30 13:41:40 -0200] [4597] [INFO] Worker spawned (pid: 4597)
[2018-01-30 13:41:40 -0200] [4598] [INFO] Booting worker with pid: 4598
[2018-01-30 13:41:40 -0200] [4598] [INFO] Worker spawned (pid: 4598)
[2018-01-30 13:41:40 -0200] [4599] [INFO] Booting worker with pid: 4599
[2018-01-30 13:41:40 -0200] [4599] [INFO] Worker spawned (pid: 4599)
[2018-01-30 13:41:40 -0200] [4600] [INFO] Booting worker with pid: 4600
[2018-01-30 13:41:40 -0200] [4600] [INFO] Worker spawned (pid: 4600)
[2018-01-30 13:41:40 -0200] [4580] [DEBUG] 16 workers
[2018-01-30 13:41:47 -0200] [4583] [DEBUG] GET /menu
[2018-01-30 13:41:54 -0200] [4584] [DEBUG] GET /tracks

NOTE:

In this SO answer I learned that, in order to integrate Flask and Scrapy, you can use one of the following:

1. Python subprocess

2. Twisted-Klein + Scrapy

3. ScrapyRT

But I haven't had any luck adapting my specific code to any of these solutions.

I reckon a subprocess would be simpler and would suffice, since the user experience rarely requires a scraping thread, but I'm not sure. Roughly, what I have in mind is the sketch below.
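For reference, this is the kind of untested sketch I mean, driving my existing spider through the scrapy CLI in a child process (the working directory and output path are assumptions based on my layout above):

# Untested sketch: run the spider in a child process via the scrapy CLI,
# then read back the JSON-lines feed it wrote.
import json
import subprocess

def blog_crawler_subprocess(mood):
    spider, jl = mood  # spider name, output .jl path relative to the Scrapy project
    # 'scrapy crawl' has to run from the directory containing scrapy.cfg
    subprocess.check_call(['scrapy', 'crawl', spider, '-o', jl], cwd='app/datafoo')

    pairs = []
    with open('app/datafoo/' + jl) as feed:
        for line in feed:
            item = json.loads(line)
            pairs.append((item['track'], item['artist']))
    return pairs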

Could anyone please point me in the right direction here?

1 Answer

Answer 1

Here's a minimal example of how you can do it with ScrapyRT.

This is the project structure:

project/
├── scraping
│   ├── example
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── quotes.py
│   └── scrapy.cfg
└── webapp
    └── example.py

The scraping directory contains the Scrapy project. This project contains one spider, quotes.py, which scrapes some quotes from quotes.toscrape.com:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'text': quote.xpath('normalize-space(./span[@class="text"])').extract_first()
            }

To start ScrapyRT and have it listen for scraping requests, go to the Scrapy project's directory scraping and issue the scrapyrt command:

$ cd ./project/scraping
$ scrapyrt

ScrapyRT will now listen on localhost:9080.
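Before wiring it into Flask, you can sanity-check the endpoint directly. A quick sketch using requests (the status field is an assumption about ScrapyRT's response format; items is the key the Flask app below relies on):

import requests

# Ask ScrapyRT to run the 'quotes' spider and return the scraped items as JSON.
response = requests.get(
    'http://localhost:9080/crawl.json',
    params={'spider_name': 'quotes', 'start_requests': True},
)
data = response.json()
print(data.get('status'))   # expected to be 'ok' on success (assumed field)
print(len(data['items']))   # number of items the spider scraped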

The webapp directory contains a simple Flask app that scrapes quotes on demand (using the spider above) and displays them to the user:

from __future__ import unicode_literals

import json

import requests
from flask import Flask

app = Flask(__name__)


@app.route('/')
def show_quotes():
    params = {
        'spider_name': 'quotes',
        'start_requests': True
    }
    response = requests.get('http://localhost:9080/crawl.json', params)
    data = json.loads(response.text)
    result = '\n'.join('<p><b>{}</b> - {}</p>'.format(item['author'], item['text'])
                       for item in data['items'])
    return result

To start the app:

$ cd ./project/webapp
$ FLASK_APP=example.py flask run

Now when you point your browser at localhost:5000, you'll see the list of quotes freshly scraped from quotes.toscrape.com.
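The same pattern should carry over to your project: run scrapyrt from app/datafoo (where your scrapy.cfg lives) and have the /tracks view call it instead of starting CrawlerProcess inside the request. A rough sketch, reusing the spider name and item fields from your question (the HTML formatting is just an example):

import requests
from flask import Flask

app = Flask(__name__)


@app.route('/tracks', methods=['GET', 'POST'])
def tracks():
    params = {
        'spider_name': 'allmusic_delicate_tracks',
        'start_requests': True
    }
    response = requests.get('http://localhost:9080/crawl.json', params)
    data = response.json()
    # Pair tracks with artists, as blog_crawler did
    results = [(item['track'], item['artist']) for item in data['items']]
    return '<br>'.join('{} - {}'.format(track, artist) for track, artist in results)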
