from
http://menno.b10m.net/blog/blosxom/perl
This article covers parsing data out of fetched HTML, and it makes use of XPath.
Scraping websites is usually pretty boring and annoying, but for some reason it always comes back. Tatsuhiko
Miyagawa comes to the rescue! His Web::Scraper makes scraping the web easy and fast.
Since the documentation is scarce (there are the POD and the slides of a presentation I missed), I'll post
this blog entry in which I'll show how to effectively scrape Yahoo! Search.
First we'll define what we want to see. We're going to run a query for 'Perl'. From that page, we want to
fetch the following things:
* title (the linked text)
* url (the actual link)
* description (the text beneath the link)
So let's start our first little script:
[code]
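use strict;
use warnings;
use Data::Dumper;   # used to dump the resulting data structure
use URI;
use Web::Scraper;

my $yahoo = scraper {
    # First a.yschttl link: its text goes to 'title', its href to 'url'
    process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
    # First div.yschabstr: its text goes to 'description'
    process "div.yschabstr", 'description' => "TEXT";
    result 'description', 'title', 'url';
};

print Dumper $yahoo->scrape(URI->new("http://search.yahoo.com/search?p=Perl"));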
[/code]
Now what happens here? The important stuff can be found in the process statements. Basically, you may
translate those lines to "Fetch an A-element with the CSS class named 'yschttl' and put the text in 'title',
and the href value in 'url'. Then fetch the text of the div with the class named 'yschabstr' and put that in
'description'."
The result looks something like this:
[code]
$VAR1 = {
    'url' => 'http://www.perl.com/',
    'title' => 'Perl.com',
    'description' => 'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.'
};
[/code]
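To try the two process rules without depending on the live search page, the same scraper can be pointed at a string of HTML. The sketch below is only an illustration: the markup in `$html` is invented to mimic the class names used above, and `$single` and `$hit` are names chosen for this example. It leaves out the result line and reads the fields straight from the returned hash.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup, invented for illustration. It only mimics the
# class names used in the post (a.yschttl, div.yschabstr).
my $html = <<'HTML';
<html><body>
  <ol>
    <li>
      <a class="yschttl" href="http://www.perl.com/">Perl.com</a>
      <div class="yschabstr">Central resource for Perl developers.</div>
    </li>
    <li>
      <a class="yschttl" href="http://www.perl.org/">Perl Mongers</a>
      <div class="yschabstr">Nonprofit organization.</div>
    </li>
  </ol>
</body></html>
HTML

my $single = scraper {
    # Without '[]' on the key, only the first matching node is used.
    process "a.yschttl",     'title' => 'TEXT', 'url' => '@href';
    process "div.yschabstr", 'description' => 'TEXT';
};

# scrape() is handed a string of HTML here instead of a URI object.
my $hit = $single->scrape($html);
print "$hit->{title} <$hit->{url}>\n$hit->{description}\n";
```

Even though the sample page holds two hits, only the first one is picked up, which is the same limitation the live query runs into.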
Fun and a good start, but hey, do we really get only one result for a query on 'Perl'? No way! We need a
loop!
The slides tell you to append '[]' to the key, to enable looping. The process lines then look like this:
[code]
process "a.yschttl", 'title[]' => 'TEXT', 'url[]' => '@href';
process "div.yschabstr", 'description[]' => "TEXT";
[/code]
And when we run it now, the result looks like this:
[code]
$VAR1 = {
    'url' => [
        'http://www.perl.com/',
        'http://www.perl.org/',
        'http://www.perl.com/download.csp',
        ...
    ],
    'title' => [
        'Perl.com',
        'Perl Mongers',
        'Getting Perl',
        ...
    ],
    'description' => [
        'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.',
        'Nonprofit organization, established to support the Perl community.',
        'Instructions on downloading a Perl interpreter for your computer platform. ... On CPAN, you will find Perl source in the /src directory. ...',
        ...
    ]
};
[/code]
That looks a lot better! We now get all the search results and could loop through the different arrays to get
the right title with the right url. But still we shouldn't be satisfied, for we don't want three arrays, we
want one array of hashes! For that we need a little trickery; we need another process line! All the stuff we
grab already is located in a big ordered list (the OL-element), so let's find that one first, and for each
list element (LI) find our title, url and description. For this we don't use the CSS selectors, but we'll go
for the XPath selectors (heck, we can do both, so why not?).
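For comparison, the manual route (walking the three parallel arrays by index) might look like the minimal sketch below. It is plain Perl and needs no Web::Scraper; the data in `$flat` is abbreviated sample data shaped like the '[]' output, and `$flat` and `@hits` are names chosen for this example.

```perl
use strict;
use warnings;

# Sample data shaped like the '[]' output (abbreviated, for illustration).
my $flat = {
    url   => [ 'http://www.perl.com/', 'http://www.perl.org/' ],
    title => [ 'Perl.com', 'Perl Mongers' ],
    description => [
        'Central resource for Perl developers.',
        'Nonprofit organization, established to support the Perl community.',
    ],
};

# Entries that share an index are assumed to belong to the same hit.
my @hits;
for my $i (0 .. $#{ $flat->{title} }) {
    push @hits, {
        title       => $flat->{title}[$i],
        url         => $flat->{url}[$i],
        description => $flat->{description}[$i],
    };
}

print "$_->{title} <$_->{url}>\n" for @hits;
```

This only works while the arrays stay aligned. If one hit has no description, every later description shifts up by one, which is a good reason to let the scraper group the fields per list item instead.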
To grab an XPath I really suggest Firebug, a Firefox add-on. With its easy point-and-click interface, you can
grab the path within seconds.
[code]
use Data::Dumper;
use URI;
use Web::Scraper;

my $yahoo = scraper {
    # Outer rule: one entry in 'results' per LI of the ordered list
    process "/html/body/div[5]/div/div/div[2]/ol/li", 'results[]' => scraper {
        # Inner rules run once per LI, so no '[]' is needed here
        process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
        process "div.yschabstr", 'description' => "TEXT";
        result 'description', 'title', 'url';
    };
    result 'results';
};

print Dumper $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );
[/code]
You see that we switched our title, url and description fields back to the old notation (without []), for we
don't want to loop those fields. We've moved the looping a step higher, to the li elements. Then we
open another scraper which will dump the hashes into the results array (note the '[]' in 'results[]').
The result is exactly what we wanted:
[code]
$VAR1 = [
    {
        'url' => 'http://www.perl.com/',
        'title' => 'Perl.com',
        'description' => 'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.'
    },
    {
        'url' => 'http://www.perl.org/',
        'title' => 'Perl Mongers',
        'description' => 'Nonprofit organization, established to support the Perl community.'
    },
    {
        'url' => 'http://www.perl.com/download.csp',
        'title' => 'Getting Perl',
        'description' => 'Instructions on downloading a Perl interpreter for your computer platform. ... On CPAN, you will find Perl source in the /src directory. ...'
    },
    ...
];
[/code]
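A long absolute path copied from a browser tool tends to break as soon as the page layout shifts, so a shorter expression may be easier to live with. The sketch below shows the same nested scraper run against an inline HTML string. The markup is invented for illustration, `//ol/li` is an assumed XPath that fits this sample page rather than Yahoo!'s real one, and `$nested` and `$results` are names chosen for this example.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup, invented for illustration.
my $html = <<'HTML';
<html><body>
  <ol>
    <li>
      <a class="yschttl" href="http://www.perl.com/">Perl.com</a>
      <div class="yschabstr">Central resource for Perl developers.</div>
    </li>
    <li>
      <a class="yschttl" href="http://www.perl.org/">Perl Mongers</a>
      <div class="yschabstr">Nonprofit organization.</div>
    </li>
  </ol>
</body></html>
HTML

my $nested = scraper {
    # '//ol/li' is an assumption that matches the sample markup above.
    process '//ol/li', 'results[]' => scraper {
        # These rules run once per list item, so the keys carry no '[]'.
        process 'a.yschttl',     'title' => 'TEXT', 'url' => '@href';
        process 'div.yschabstr', 'description' => 'TEXT';
        result 'description', 'title', 'url';
    };
    result 'results';
};

# With a single key passed to result, scrape() hands back that value:
# here an array reference holding one hash per list item.
my $results = $nested->scrape($html);
for my $hit (@$results) {
    print "$hit->{title} <$hit->{url}>\n  $hit->{description}\n";
}
```

Because each hash is built from a single list item, a hit that lacks a description simply ends up without that field instead of borrowing its neighbour's.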
Again Tatsuhiko impresses me with a Perl module. Well done! Very well done!
Update: Tatsuhiko had some wise words on this article:
A couple of things:
You might just skip result() stuff if you're returning the entire hash, which is the default. (The API is
stolen from Ruby's one that needs result() for some reason, but my perl port doesn't require) Now with less
code :)
The use of nested scraper in your example seems pretty good, but using hash reference could be also useful,
like:
[code]
my $yahoo = scraper {
    process "a.yschttl", 'results[]', {
        title => 'TEXT', url => '@href',
    };
};
[/code]
This way you'll get title and url from TEXT and @href from a.yschttl, which would be handier if you don't
need the description. TIMTOWTDI :)
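Both suggestions can be tried together on an inline HTML string. This is a minimal sketch under the same assumptions as before: the markup is invented for illustration, and `$links` and `$res` are names chosen for this example. With no result line, the whole hash comes back, so the list sits under the 'results' key.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup, invented for illustration.
my $html = <<'HTML';
<html><body>
  <ol>
    <li><a class="yschttl" href="http://www.perl.com/">Perl.com</a></li>
    <li><a class="yschttl" href="http://www.perl.org/">Perl Mongers</a></li>
  </ol>
</body></html>
HTML

my $links = scraper {
    # Hash-reference form: every matched link yields one
    # { title => ..., url => ... } entry in 'results'.
    process 'a.yschttl', 'results[]', { title => 'TEXT', url => '@href' };
};

# No result() call, so scrape() returns the entire hash.
my $res = $links->scrape($html);
print "$_->{title} <$_->{url}>\n" for @{ $res->{results} };
```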