Python Crawler(1) - Scrapy Introduction
>python --version
Python 2.7.13
>pip --version
pip 9.0.1 from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages (python 2.7)
>pip install scrapy
https://docs.scrapy.org/en/latest/intro/overview.html
First example, quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the "next" link, if there is one, and parse that page the same way.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Run the spider and write the scraped items to quotes.json
>scrapy runspider quotes_spider.py -o quotes.json
https://docs.scrapy.org/en/latest/intro/tutorial.html
Start a New Project
>scrapy startproject tutorial
First spider, saved as quotes_spider.py under the spiders directory
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save the raw page body to quotes-<page>.html.
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Save file %s' % filename)
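The file name in parse comes from the second-to-last piece of the URL after splitting on "/". A minimal sketch in plain Python 3 (no Scrapy needed; `page_filename` is a hypothetical helper name, not part of Scrapy) shows how that expression behaves:

```python
def page_filename(url):
    """Build the output file name the same way parse() does."""
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with ['page', '1', ''],
    # so index -2 is the page number only when the URL ends with a slash.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/2/"))  # quotes-2.html
```

Without the trailing slash the same expression picks up "page" instead of the number, so the start URLs should keep their trailing slash.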
Run the Project
>scrapy crawl quotes
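A shortcut for start_requests: define start_urls as a class attribute, and Scrapy builds the initial requests from it with parse as the default callback
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]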
The Scrapy shell fetches the page and lets you try selectors against the response interactively
>scrapy shell 'http://quotes.toscrape.com/page/1'
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x104c3db90>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x104c3d110>
[s] spider <DefaultSpider 'default' at 0x10582e550>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
>response.css('title::text').extract()
[u'Quotes to Scrape']
>response.css('title::text').extract_first()
u'Quotes to Scrape'
>response.xpath('//title/text()').extract_first()
u'Quotes to Scrape'
>quote = response.css("div.quote")[0]
>title = quote.css("span.text::text").extract_first()
>title
u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
>for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
...
{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'], 'author': u'Albert Einstein'}
{'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d', 'tags': [u'abilities', u'choices'], 'author': u'J.K. Rowling'}
Change the spider's parse method to extract the data
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Write the output to a JSON file
>scrapy crawl quotes -o quotes.json
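The exported file is a JSON array with one object per item yielded by parse. A minimal sketch in Python 3 of reading such a file back, assuming a hand-written sample in that shape (`SAMPLE` and `load_quotes` are illustrative names; the values come from the shell session earlier):

```python
import json

# Assumed sample in the shape `-o quotes.json` produces: a JSON array with
# one object per item yielded by parse().
SAMPLE = r'''[
  {"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d",
   "author": "J.K. Rowling",
   "tags": ["abilities", "choices"]}
]'''


def load_quotes(raw):
    """Parse the exported JSON text into a list of dicts."""
    return json.loads(raw)


quotes = load_quotes(SAMPLE)
for quote in quotes:
    print(quote["author"], quote["tags"])
```

With a real crawl, the same function would take the text read from quotes.json instead of the sample string.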
>response.css('li.next a::attr(href)').extract_first()
u'/page/2/'
Find Next Page
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    # Turn the relative href into an absolute URL before requesting it.
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
Or alternatively, response.follow accepts the relative URL directly
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
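response.urljoin resolves the relative href against the URL of the current response. The standard library's `urljoin` performs the same kind of resolution, so a plain Python 3 sketch (no Scrapy required; the variable names are illustrative) shows what the absolute URL looks like:

```python
from urllib.parse import urljoin

page_url = "http://quotes.toscrape.com/page/1/"  # URL of the page being parsed
next_href = "/page/2/"  # relative link returned by the li.next selector

# A href starting with "/" replaces the whole path of the base URL.
absolute = urljoin(page_url, next_href)
print(absolute)  # http://quotes.toscrape.com/page/2/
```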
Author Spider
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Follow the link to each author's page.
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # Follow the pagination link and parse the next listing page.
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
>scrapy crawl author -o authors.json
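One thing to watch in extract_with_css: extract_first() returns None when the selector matches nothing, and calling .strip() on None raises AttributeError. A small optional guard, sketched in plain Python 3 (`strip_or_default` is a hypothetical helper, not part of Scrapy):

```python
def strip_or_default(value, default=""):
    """Return value.strip(), or default when value is None."""
    if value is None:
        return default
    return value.strip()


print(repr(strip_or_default("  Albert Einstein \n")))  # 'Albert Einstein'
print(repr(strip_or_default(None)))                    # ''
```

Inside the spider this would wrap the result of `response.css(query).extract_first()`. extract_first also accepts a `default` argument, which is another way to avoid the None case.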
Receive Parameters
Arguments passed with -a become attributes of the spider
>scrapy crawl quotes -o quotes-humor.json -a tag=humor
    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        # Use the tag passed on the command line, if any.
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)
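The getattr lookup works because each `-a key=value` pair is handed to the spider as a keyword argument and stored on the instance. A minimal sketch of that mechanism in plain Python 3, using a hypothetical stand-in class rather than the real scrapy.Spider:

```python
class SpiderArgsSketch(object):
    """Hypothetical stand-in for a spider: keyword arguments become attributes."""

    def __init__(self, **kwargs):
        # Each `-a key=value` pair arrives here as a keyword argument.
        self.__dict__.update(kwargs)

    def start_url(self):
        # Same URL-building logic as start_requests() in the spider.
        url = "http://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        return url


print(SpiderArgsSketch(tag="humor").start_url())  # http://quotes.toscrape.com/tag/humor
print(SpiderArgsSketch().start_url())             # http://quotes.toscrape.com/
```

Argument values always arrive as strings, so numeric options need an explicit conversion in the spider.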
References:
https://www.debrice.com/building-a-simple-crawler/
https://gist.github.com/debrice/a34563fb078d9d2d15e8
https://scrapy.org/
https://medium.com/python-pandemonium/develop-your-first-web-crawler-in-python-scrapy-6b2ee4baf954