[Repost] Scraping Yahoo! Search with Web::Scraper

From http://menno.b10m.net/blog/blosxom/perl. This article is about parsing the HTML you have fetched, and it makes use of XPath concepts.

Scraping websites is usually pretty boring and annoying, but for some reason it always comes back. Tatsuhiko Miyagawa comes to the rescue! His Web::Scraper makes scraping the web easy and fast. Since the documentation is scarce (there are the POD and the slides of a presentation I missed), I'll post this blog entry in which I'll show how to effectively scrape Yahoo! Search.

First we'll define what we want to see. We're going to run a query for 'Perl'. From that page, we want to fetch the following things:

* title (the linked text)
* url (the actual link)
* description (the text beneath the link)

So let's start our first little script:

[code]
use Data::Dumper; # this module dumps the resulting data structure
use URI;
use Web::Scraper;

my $yahoo = scraper {
    process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
    process "div.yschabstr", 'description' => "TEXT";
    result 'description', 'title', 'url';
};

print Dumper $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );
[/code]

Now what happens here? The important stuff can be found in the process statements. Basically, you may translate those lines to: "Fetch an A element with the CSS class named 'yschttl' and put its text in 'title' and its href value in 'url'. Then fetch the text of the div with the class named 'yschabstr' and put that in 'description'." The result looks something like this:

[code]
$VAR1 = {
    'url'         => 'http://www.perl.com/',
    'title'       => 'Perl.com',
    'description' => 'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.'
};
[/code]

Fun and a good start, but hey, do we really get only one result for a query on 'Perl'? No way! We need a loop! The slides tell you to append '[]' to the key to enable looping. The process lines then look like this:

[code]
process "a.yschttl", 'title[]' => 'TEXT', 'url[]' => '@href';
process "div.yschabstr", 'description[]' => "TEXT";
[/code]

And when we run it now, the result looks like this:

[code]
$VAR1 = {
    'url' => [
        'http://www.perl.com/',
        'http://www.perl.org/',
        'http://www.perl.com/download.csp',
        ...
    ],
    'title' => [
        'Perl.com',
        'Perl Mongers',
        'Getting Perl',
        ...
    ],
    'description' => [
        'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.',
        'Nonprofit organization, established to support the Perl community.',
        'Instructions on downloading a Perl interpreter for your computer platform. ... On CPAN, you will find Perl source in the /src directory. ...',
        ...
    ]
};
[/code]

That looks a lot better! We now get all the search results, and we can loop through the parallel arrays to pair the right title with the right url, as sketched below.
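For instance, here is a minimal sketch of that index-based pairing (my own illustration, not from the original post; it assumes the '[]' version of the scraper above, and `$res` is just an illustrative name):

[code]
# Walk the three parallel arrays returned by the '[]' version by index.
my $res = $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );

for my $i ( 0 .. $#{ $res->{url} } ) {
    print "$res->{title}[$i]\n";
    print "  $res->{url}[$i]\n";
    print "  $res->{description}[$i]\n\n";
}
[/code]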
But still we shouldn't be satisfied, for we don't want three arrays; we want one array of hashes! For that we need a little trickery: another process line. All the stuff we grab is located inside one big ordered list (the OL element), so let's find that one first, and then, for each list item (LI), find our title, url, and description. For this we won't use the CSS selectors; we'll go for the XPath selectors (heck, we can use both, so why not?). To grab an XPath I really suggest Firebug, a Firefox addon. With its easy point-and-click interface, you can grab the path within seconds.

[code]
use Data::Dumper;
use URI;
use Web::Scraper;

my $yahoo = scraper {
    process "/html/body/div[5]/div/div/div[2]/ol/li", 'results[]' => scraper {
        process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
        process "div.yschabstr", 'description' => "TEXT";
        result 'description', 'title', 'url';
    };
    result 'results';
};

print Dumper $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );
[/code]

You see that we switched our title, url, and description fields back to the old notation (without '[]'), for we don't want to loop over those fields. We've moved the looping a step higher, namely to the li elements. There we open another scraper, which dumps its hashes into the results array (note the '[]' in 'results[]'). The result is exactly what we wanted:

[code]
$VAR1 = [
    {
        'url'         => 'http://www.perl.com/',
        'title'       => 'Perl.com',
        'description' => 'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.'
    },
    {
        'url'         => 'http://www.perl.org/',
        'title'       => 'Perl Mongers',
        'description' => 'Nonprofit organization, established to support the Perl community.'
    },
    {
        'url'         => 'http://www.perl.com/download.csp',
        'title'       => 'Getting Perl',
        'description' => 'Instructions on downloading a Perl interpreter for your computer platform. ... On CPAN, you will find Perl source in the /src directory. ...'
    },
    ...
];
[/code]

Again Tatsuhiko impresses me with a Perl module. Well done! Very well done!

Update: Tatsuhiko had some wise words on this article:

A couple of things: you might just skip the result() stuff if you're returning the entire hash, which is the default. (The API is stolen from the Ruby original, which needs result() for some reason, but my Perl port doesn't require it.) Now with less code :)

The use of a nested scraper in your example seems pretty good, but using a hash reference can also be useful, like:

[code]
my $yahoo = scraper {
    process "a.yschttl", 'results[]', {
        title => 'TEXT',
        url   => '@href',
    };
};
[/code]

This way you'll get title and url from the TEXT and @href of a.yschttl, which is handier if you don't need the description. TIMTOWTDI :)
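To round off the update, here is a minimal runnable sketch of that hash-reference form as a complete script with a loop over the results (my own assembly, not from the original post; it relies on Web::Scraper returning the entire hash when result() is omitted, as described above):

[code]
#!/usr/bin/perl
use strict;
use warnings;
use URI;
use Web::Scraper;

# Hash-reference form: both fields come from the a.yschttl element itself.
my $yahoo = scraper {
    process "a.yschttl", 'results[]', {
        title => 'TEXT',
        url   => '@href',
    };
};

my $res = $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );

for my $r ( @{ $res->{results} } ) {
    print "$r->{title}\n  $r->{url}\n";
}
[/code]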