
Crawl a website with Scrapy


Introduction

In this article, we are going to see how to scrape information from a website, in particular, from all pages with a common URL pattern. We will see how to do that with Scrapy, a very powerful, and yet simple, scraping and web-crawling framework.

For example, you might be interested in scraping information about each article of a blog, and storing that information in a database. To achieve such a thing, we will see how to implement a simple spider using Scrapy, which will crawl the blog and store the extracted data into a MongoDB database.

We will consider that you have a working MongoDB server, and that you have installed the pymongo and scrapy Python packages, both installable with pip.

If you have never toyed around with Scrapy, you should first read this short tutorial.

First step, identify the URL pattern(s)

In this example, we’ll see how to extract the following information from each isbullsh.it blogpost:

  • title
  • author
  • tag
  • release date
  • url

We’re lucky: all posts have the same URL pattern, http://isbullsh.it/YYYY/MM/title. These links can be found on the paginated pages of the site homepage.

What we need is a spider which will follow all links following this pattern, scrape the required information from the target webpage, validate the data integrity, and populate a MongoDB collection.

Building the spider

We create a Scrapy project, following the instructions from their tutorial. We obtain the following project structure:

isbullshit_scraping/
├── isbullshit
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── isbullshit_spiders.py
└── scrapy.cfg
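
If you haven’t generated this skeleton yet, the startproject command produces it. A minimal sketch, assuming the project was named isbullshit (the top-level directory then appears to have been renamed to isbullshit_scraping):

$ scrapy startproject isbullshit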

We begin by defining, in items.py, the item structure which will contain the extracted information:

from scrapy.item import Item, Field

class IsBullshitItem(Item):
    title = Field()
    author = Field()
    tag = Field()
    date = Field()
    link = Field()
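
Items expose a dict-like interface; just to illustrate, here is a quick sketch of how the fields we just declared can be set and read (the values are placeholders):

from isbullshit.items import IsBullshitItem

item = IsBullshitItem()
item['title'] = "Some blogpost title"  # placeholder value
item['author'] = "Some author"         # placeholder value
print(item['title'])                   # -> Some blogpost title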

Now, let’s implement our spider, in isbullshit_spiders.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isbullshit.items import IsBullshitItem

class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://isbullsh.it'] # urls from which the spider will start crawling
    rules = [
        # r'page/\d+' : regular expression for http://isbullsh.it/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        # r'\d{4}/\d{2}/\w+' : regular expression for http://isbullsh.it/YYYY/MM/title URLs
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_blogpost'),
    ]

    def parse_blogpost(self, response):
        ...

Our spider inherits from CrawlSpider, which “provides a convenient mechanism for following links by defining a set of rules”. See the CrawlSpider documentation for more details.

We then define two simple rules:

  • Follow links pointing to http://isbullsh.it/page/X
  • Extract information from pages defined by a URL of pattern http://isbullsh.it/YYYY/MM/title, using the callback method parse_blogpost.
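
If you want to double-check that the two regular expressions used in the rules above match the intended URLs before unleashing the spider, here is a quick sketch with the standard re module (the sample URLs are made up):

import re

# Made-up URLs following the site's two patterns
print(re.search(r'page/\d+', 'http://isbullsh.it/page/2'))                      # matches -> link is followed
print(re.search(r'\d{4}/\d{2}/\w+', 'http://isbullsh.it/2012/02/some-title'))   # matches -> parse_blogpost is called
print(re.search(r'\d{4}/\d{2}/\w+', 'http://isbullsh.it/about'))                # None -> link is ignored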

Extracting the data

To extract the title, author, etc., from the HTML code, we’ll use the scrapy.selector.HtmlXPathSelector object, which uses the libxml2 HTML parser. If you’re not familiar with this object, you should read the XPathSelector documentation.

We’ll now define the extraction logic in the parse_blogpost method (I’ll only define it for the title and tag(s), it’s pretty much always the same logic):

def parse_blogpost(self, response):
    hxs = HtmlXPathSelector(response)
    item = IsBullshitItem()
    # Extract title
    item['title'] = hxs.select('//header/h1/text()').extract() # XPath selector for title
    # Extract tag(s)
    item['tag'] = hxs.select("//header/div[@class='post-data']/p/a/text()").extract() # XPath selector for tag(s)
    ...
    return item

Note: to be sure of the XPath selectors you define, I’d advise you to use Firebug, Firefox Inspect, or an equivalent tool to inspect the HTML code of a page, and then to test the selectors in a Scrapy shell. That only works if the data position is consistent across all the pages you crawl.
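
As an alternative to firing up a full Scrapy shell session, you can also try a selector from a plain Python script by building a response by hand. A minimal sketch, where the HTML snippet is a made-up simplification of the blog’s markup:

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# Made-up HTML mimicking the structure we target
html = "<html><body><header><h1>Some title</h1></header></body></html>"
response = HtmlResponse(url="http://isbullsh.it/", body=html)
print(HtmlXPathSelector(response).select('//header/h1/text()').extract())  # [u'Some title']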

Store the results in MongoDB

Each time the parse_blogpost method returns an item, we want it to be sent to a pipeline which will validate the data, and store everything in our Mongo collection.

First, we need to add a couple of things to settings.py:

ITEM_PIPELINES = ['isbullshit.pipelines.MongoDBPipeline',]

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "isbullshit"
MONGODB_COLLECTION = "blogposts"

Now that we’ve declared our pipeline and our MongoDB database and collection in the settings, we’re just left with the pipeline implementation. We just want to be sure that we do not have any missing data (e.g. a blogpost without a title, author, etc.).

Here is our pipelines.py file:

import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log


class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        for data in item:
            # here we only check that no field is empty,
            # but we could do any crazy validation we want
            if not item[data]:
                valid = False
                raise DropItem("Missing %s of blogpost from %s" % (data, item['link']))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Item written to MongoDB database %s/%s" %
                    (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                    level=log.DEBUG, spider=spider)
        return item

Release the spider!

Now, all we have to do is change directory to the root of our project and execute

$ scrapy crawl isbullshit

The spider will then follow all links pointing to a blogpost, retrieve the post title, author name, date, etc., validate the extracted data, and store it all in a MongoDB collection if validation went well.
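
To quickly check that documents actually landed in the collection, here is a small sketch with pymongo, assuming the MongoDB settings used above (localhost:27017, database "isbullshit", collection "blogposts") and the same old-style pymongo API as in the pipeline:

import pymongo

connection = pymongo.Connection("localhost", 27017)
collection = connection["isbullshit"]["blogposts"]
print(collection.count())     # number of scraped blogposts
print(collection.find_one())  # one of the stored items, if any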

Pretty neat, hm?

Conclusion

This case is pretty simplistic: all URLs follow a similar pattern and all links are hard-coded in the HTML; there is no JS involved. In the case where the links you want to reach are generated by JS, you’d probably want to check out Selenium. You could make the spider more complex by adding new rules or more complicated regular expressions, but I just wanted to demo how Scrapy works, not to get into crazy regex explanations.

Also, be aware that sometimes, there’s a thin line between playing with web-scraping and getting into trouble.

Finally, when toying with web-crawling, keep in mind that you might just flood the server with requests, which can sometimes get you IP-blocked :)
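
One simple way to be gentler with the server, for instance, is to add a download delay in settings.py (the value here is arbitrary):

# settings.py
DOWNLOAD_DELAY = 2  # seconds to wait between two requests to the same website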

Please, don’t be a d*ick.
