Python Crawler (3) Services
Local Machine Service
Start the Service
>scrapyd
Call the API to Schedule a Spider
>curl http://localhost:6800/schedule.json -d project=default -d spider=author
{"status": "ok", "jobid": "3b9c84c28dae11e79ba4a45e60e77f99", "node_name": "ip-10-10-21-215.ec2.internal"}
More API Endpoints
http://scrapyd.readthedocs.io/en/stable/api.html#api
Call to Pass a Parameter
>curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider -d setting=DOWNLOAD_DELAY=2 -d arg1=val1
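The extra `-d` pairs above are just additional form fields: `setting` overrides a Scrapy setting for this run, and any other key (here `arg1`) is forwarded to the spider as an argument. A sketch of the equivalent request body built with the standard library (note that `urlencode` percent-encodes the inner `=` as `%3D`, which the server decodes back; curl sends it literally, but both are valid form encodings):

```python
from urllib.parse import urlencode

# Body equivalent to the curl call above.
body = urlencode({
    "project": "myproject",
    "spider": "somespider",
    "setting": "DOWNLOAD_DELAY=2",  # per-run override of a Scrapy setting
    "arg1": "val1",                 # passed through to the spider
}).encode()
```

POSTing `body` to `http://localhost:6800/schedule.json` schedules the run with those overrides.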
List Projects
>curl http://localhost:6800/listprojects.json
{"status": "ok", "projects": ["default", "tutorial"], "node_name": "ip-10-10-21-215.ec2.internal”}
List Spiders
>curl http://localhost:6800/listspiders.json?project=default
{"status": "ok", "spiders": ["author", "quotes"], "node_name": "ip-10-10-21-215.ec2.internal"}
UI of Status
http://localhost:6800/
http://scrapyd.readthedocs.io/en/stable/overview.html
Clustered Solution?
https://github.com/rmax/scrapy-redis (Redis-based components for Scrapy, useful for distributed crawling)
References:
http://scrapyd.readthedocs.io/en/stable/overview.html#how-scrapyd-works