Python Crawler(1) - Scrapy Introduction

sillycat

>python --version
Python 2.7.13

>pip --version
pip 9.0.1 from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages (python 2.7)

>pip install scrapy

https://docs.scrapy.org/en/latest/intro/overview.html
First example, saved as quotes_spider.py:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the pagination link once per page, after all quotes are yielded.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Run the spider and write the scraped items to quotes.json
>scrapy runspider quotes_spider.py -o quotes.json

https://docs.scrapy.org/en/latest/intro/tutorial.html
Start a New Project
>scrapy startproject tutorial

First spider, saved as tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
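
The page number in the file name comes from splitting the URL on "/". A small stand-alone sketch of that step (the helper name page_filename is made up for illustration and is not part of Scrapy) shows why the trailing slash in the start URLs matters:

```python
def page_filename(url):
    # 'http://quotes.toscrape.com/page/1/'.split("/") ends with [..., 'page', '1', ''],
    # so index -2 is the page number only when the URL ends with a slash.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


print(page_filename('http://quotes.toscrape.com/page/1/'))  # quotes-1.html
print(page_filename('http://quotes.toscrape.com/page/1'))   # quotes-page.html
```

Without the trailing slash every page would be written to the same quotes-page.html file.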

Run the Project
>scrapy crawl quotes

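A shortcut for start_requests: define start_urls as a class attribute instead, and Scrapy will request each URL with parse as the default callback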
start_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]

The Scrapy shell fetches the page and lets you try selectors against the response interactively
>scrapy shell 'http://quotes.toscrape.com/page/1'
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x104c3db90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x104c3d110>
[s]   spider     <DefaultSpider 'default' at 0x10582e550>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

>response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]

>response.css('title::text').extract()
[u'Quotes to Scrape']

>response.css('title::text').extract_first()
u'Quotes to Scrape'

>response.xpath('//title/text()').extract_first()
u'Quotes to Scrape'

>quote = response.css("div.quote")[0]
>title = quote.css("span.text::text").extract_first()
>title
u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'

>for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
...
{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'], 'author': u'Albert Einstein'}
{'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d', 'tags': [u'abilities', u'choices'], 'author': u'J.K. Rowling'}
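
The \u201c and \u201d escapes in that output are the curly quotation marks the site puts around each quote. If only the plain sentence is wanted, one possible cleanup (not part of the Scrapy tutorial, just an assumption about what one might want here) is to strip them from both ends:

```python
text = u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d'

# strip() with an argument removes any of the listed characters from both ends.
cleaned = text.strip(u'\u201c\u201d')
print(cleaned)
```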

Change the spider's parse method to extract the data instead of saving the HTML
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
            'tags': quote.css('div.tags a.tag::text').extract(),
        }

Write the scraped items to a JSON file
>scrapy crawl quotes -o quotes.json
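
The -o quotes.json option writes the items as one JSON array. A minimal sketch of reading that output back, using the two items printed in the shell session above as stand-in data so it runs without a crawl (the variable names are illustrative):

```python
import json

# Stand-in for the contents of quotes.json: the two items shown in the shell session.
sample = r'''[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]}
]'''

# With a real crawl result this would be json.load() on the opened quotes.json file.
items = json.loads(sample)

# Collect the tags used by each author.
tags_by_author = {}
for item in items:
    tags_by_author.setdefault(item['author'], []).extend(item['tags'])

print(', '.join(tags_by_author['J.K. Rowling']))  # abilities, choices
```

Note that -o appends to an existing file, so running the crawl twice leaves quotes.json as invalid JSON. Delete the file first, or use JSON Lines output (-o quotes.jl), where each item is a separate line.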

>response.css('li.next a::attr(href)').extract_first()
u'/page/2/'

Find the next page link and follow it
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

Or alternatively, response.follow accepts the relative URL directly, so urljoin is not needed

if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
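
In the first variant, response.urljoin resolves the href against the URL of the current response. For a page without a <base> tag this should give the same result as the standard library's urljoin, so the resolution can be tried outside Scrapy. This is a sketch under that assumption, not Scrapy's own code:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7, as used in this post

page_url = 'http://quotes.toscrape.com/page/1/'

# A root-relative href replaces the whole path of the current page URL.
print(urljoin(page_url, '/page/2/'))  # http://quotes.toscrape.com/page/2/

# An href that is already absolute is returned unchanged.
print(urljoin(page_url, 'http://quotes.toscrape.com/tag/humor/'))
```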

Author spider: follow the link to each author page and scrape the author details
import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = [ 'http://quotes.toscrape.com/' ]

    def parse(self, response):
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

>scrapy crawl author -o authors.json
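
One thing to watch in extract_with_css: extract_first() returns None when the selector matches nothing, and calling .strip() on None raises AttributeError. A possible guard, sketched as a plain function (the name clean and its default are hypothetical, not from the tutorial):

```python
def clean(value, default=''):
    # extract_first() returns None when the selector matches nothing,
    # and None has no .strip(), so fall back to a default instead.
    if value is None:
        return default
    return value.strip()


print(repr(clean('\n        Albert Einstein\n    ')))  # 'Albert Einstein'
print(repr(clean(None)))                               # ''
```

Inside the spider the same effect is available without a helper, since extract_first takes a default: response.css(query).extract_first(default='').strip()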

Pass arguments to the spider with -a; each one becomes an attribute on the spider instance
>scrapy crawl quotes -o quotes-humor.json -a tag=humor

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)
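
The URL-building part of start_requests can be checked without running a crawl. In this sketch FakeSpider is a hypothetical stand-in that, like scrapy.Spider, stores keyword arguments as instance attributes; start_url is also a made-up name:

```python
class FakeSpider(object):
    """Stand-in for scrapy.Spider: stores keyword arguments as attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def start_url(spider):
    url = 'http://quotes.toscrape.com/'
    # getattr with a default avoids AttributeError when -a tag=... was not given.
    tag = getattr(spider, 'tag', None)
    if tag is not None:
        url = url + 'tag/' + tag
    return url


print(start_url(FakeSpider(tag='humor')))  # http://quotes.toscrape.com/tag/humor
print(start_url(FakeSpider()))             # http://quotes.toscrape.com/
```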

2017-08-30 03:11