Python Crawler (2) Items and Pipelines
We can define the item classes in items.py as follows:
import scrapy

class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

class AuthorItem(scrapy.Item):
    name = scrapy.Field()
    desc = scrapy.Field()
    birth = scrapy.Field()
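For context, a minimal spider that fills QuoteItem might look like the sketch below (the quotes.toscrape.com start URL, the CSS selectors, and the 'tutorial' package name are assumptions, not taken from this project):

import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yield one QuoteItem per quote block on the page
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item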
We can define a pipeline in pipelines.py as follows. This DuplicatesPipeline drops any item whose name has already been seen:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.names = set()

    def process_item(self, item, spider):
        name = item['name'] + ' - Unique'
        if name in self.names:
            raise DropItem("Duplicate item found: %s" % item['name'])
        else:
            self.names.add(name)
            item['name'] = name
            return item
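To see what the pipeline does, you can call process_item by hand (just a sketch; the plain dicts below stand in for real AuthorItem instances):

from scrapy.exceptions import DropItem

pipeline = DuplicatesPipeline()
first = {'name': 'Albert Einstein'}
second = {'name': 'Albert Einstein'}

# The first item passes through and gets the ' - Unique' suffix
print(pipeline.process_item(first, spider=None)['name'])
# A second item with the same name is dropped
try:
    pipeline.process_item(second, spider=None)
except DropItem as exc:
    print(exc)  # Duplicate item found: Albert Einstein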
We can also add multiple pipelines. This one stores items in MongoDB:
import pymongo

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Use the item class name (e.g. QuoteItem) as the collection name
        collection_name = item.__class__.__name__
        self.db[collection_name].insert_one(dict(item))
        return item
To activate pipelines, register them in settings.py with an order number, for example:
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
In our tutorial project's settings.py:
ITEM_PIPELINES = {
    'tutorial.pipelines.DuplicatesPipeline': 300,
}
Pipelines run in ascending order of these numbers, so lower-numbered pipelines process each item first.
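Putting it together, a settings.py that enables both pipelines and configures MongoDB might look like the sketch below (the MONGO_URI value and the order number 400 are assumptions):

# settings.py (sketch)
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'tutorial'

ITEM_PIPELINES = {
    'tutorial.pipelines.DuplicatesPipeline': 300,  # runs first
    'tutorial.pipelines.MongoPipeline': 400,       # runs second
}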
Big Sample Project
https://github.com/gnemoug/distribute_crawler
Deployment
https://scrapyd.readthedocs.io/en/latest/install.html
https://github.com/istresearch/scrapy-cluster
Install scrapyd
>pip install scrapyd
You can talk to the server directly through its JSON API
https://scrapyd.readthedocs.io/en/latest/api.html
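For example, the daemon status and the deployed projects can be queried over plain HTTP (a sketch using the requests library; the endpoint paths come from the scrapyd API docs linked above):

import requests

SCRAPYD = 'http://localhost:6800'

# Pending/running/finished job counts for the daemon
print(requests.get(SCRAPYD + '/daemonstatus.json').json())

# Projects known to this scrapyd instance
print(requests.get(SCRAPYD + '/listprojects.json').json())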
Clients
https://github.com/scrapy/scrapyd-client
Deploy
https://github.com/scrapy/scrapyd-client#scrapyd-deploy
Install the client
>pip install scrapyd-client
Start the Server
>scrapyd
Visit the console
http://localhost:6800/
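scrapyd-deploy reads its target from the [deploy] section of the project's scrapy.cfg. For this tutorial project that section might look like the following (a sketch; the values are assumed to match the local setup above):

# scrapy.cfg
[deploy]
url = http://localhost:6800/
project = tutorial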
Deploy my simple project
>scrapyd-deploy
Packing version 1504042554
Deploying to project "tutorial" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "tutorial", "version": "1504042554", "spiders": 2, "node_name": "ip-10-10-21-215.ec2.internal"}
List targets
>scrapyd-deploy -l
default http://localhost:6800/
A possible cluster solution for the future
https://github.com/istresearch/scrapy-cluster
Try with Python 3.6 later
References:
https://github.com/gnemoug/distribute_crawler
https://www.douban.com/group/topic/38361104/
http://wiki.jikexueyuan.com/project/scrapy/item-pipeline.html
https://segmentfault.com/a/1190000009229896