PHPCrawl webcrawler

 

 

1. PHPCrawl

PHPCrawl is a framework for crawling/spidering websites, written in PHP; in short, a webcrawler library or crawler engine for PHP.

PHPCrawl "spiders" websites and passes information about all found documents (pages, links, files ans so on) for futher processing to users of the library.

It provides several options to specify the behaviour of the crawler, such as URL- and content-type-filters, cookie-handling, robots.txt-handling, limiting options, multiprocessing and much more.
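
As a rough sketch of how some of these options can be combined (method names as given in the 0.8 class reference, applied to a MyCrawler subclass like the one defined in the quickstart below; double-check them against your installed version):

$crawler = new MyCrawler();
$crawler->setURL("www.example.com");    // hypothetical start URL
$crawler->obeyRobotsTxt(true);          // honor robots.txt rules
$crawler->enableCookieHandling(true);   // store and resend cookies
$crawler->setTrafficLimit(500 * 1024);  // one of the limiting options
// Multiprocessing requires CLI mode and the PCNTL/semaphore extensions
$crawler->goMultiProcessed(3);          // crawl with 3 parallel processes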

PHPCrawl is completely free open-source software and is licensed under the GNU GENERAL PUBLIC LICENSE v2.

To get a first impression of how to use the crawler, you may want to take a look at the quickstart guide or an example in the manual section.
A complete reference and documentation of all available options and methods of the framework can be found in the class-references section.

The current version of the phpcrawl package and older releases can be downloaded from a SourceForge mirror.

Note to users of phpcrawl version 0.7x or earlier: although some method names and parameters have changed in version 0.8, it should be fully compatible with older versions of phpcrawl.

 

Installation & Quickstart

The following steps show how to use phpcrawl:
  1. Unpack the phpcrawl package somewhere. That's all you have to do for installation.
  2. Include the phpcrawl main class in your script or project. It's located in the "libs" path of the package.

     

    include("libs/PHPCrawler.class.php"); 

     

    There are no other includes needed.
  3. Extend the PHPCrawler class and override the handleDocumentInfo() method with your own code to process the information about every document the crawler finds on its way.

    class MyCrawler extends PHPCrawler
    {
      function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
      {
        // Your code comes here!
        // Do something with the $PageInfo-object that
        // contains all information about the currently 
        // received document.
    
        // As example we just print out the URL of the document
        echo $PageInfo->url."\n";
      }
    } 

     

    For a list of all available information about a page or file within the handleDocumentInfo-method, see the PHPCrawlerDocumentInfo reference.

    Note to users of phpcrawl 0.7x or before: the old, overridable method "handlePageData()", which receives the document information as an array, is still present and still gets called. PHPCrawl 0.8 is fully compatible with scripts written for earlier versions.
  4. Create an instance of that class in your script or project, define the behaviour of the crawler and start the crawling process.

    $crawler = new MyCrawler();
    $crawler->setURL("www.foo.com");
    $crawler->addContentTypeReceiveRule("#text/html#");
    // ...
    
    $crawler->go();  

     

    For a list of all available setup options/methods of the crawler, take a look at the PHPCrawler class reference. A few more of them are sketched right after this list.
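
For illustration, here are a few more of those setup options beyond the ones used above (again, method names as listed in the 0.8 class reference; a hedged sketch, not a complete recipe):

$crawler = new MyCrawler();
$crawler->setURL("www.foo.com");
$crawler->setFollowMode(2);                     // only follow links on the same host
$crawler->setPageLimit(50);                     // stop after 50 documents
$crawler->setUserAgentString("MyCrawler/0.1");  // how the crawler identifies itself
$crawler->addURLFollowRule("#/docs/# i");       // only follow URLs matching this pattern
$crawler->go();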

Tutorial: Example Script

The following code is a simple example of using phpcrawl.

The listed script just "spiders" some pages of www.php.net until a traffic limit of 1 MB is reached, and prints out some information about all found documents.

Please note that this example script (and others) also comes in a file called "example.php" with the phpcrawl package. It's recommended to run it from the command line (PHP CLI).

 <?php

// It may take a while to crawl a site ...
set_time_limit(10000);

// Include the phpcrawl main class
include("libs/PHPCrawler.class.php");

// Extend the class and override the handleDocumentInfo()-method 
class MyCrawler extends PHPCrawler 
{
  function handleDocumentInfo($DocInfo) 
  {
    // Detect the linebreak to use for output ("\n" in CLI mode, otherwise "<br />").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP-status-Code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;
    
    // Print the referring URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;
    
    // Print whether the content of the document was received or not
    if ($DocInfo->received == true)
      echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    else
      echo "Content not received".$lb; 
    
    // Now you should do something with the content of the
    // received page or file ($DocInfo->source); we skip it in this example
    
    echo $lb;
    
    flush();
  } 
}

// Now create an instance of your class, define the behaviour
// of the crawler (see the class reference for more options and details)
// and start the crawling process.

$crawler = new MyCrawler();

// URL to crawl
$crawler->setURL("www.php.net");

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic limit to roughly 1 MB (in bytes;
// for testing we don't want to pull down the whole site)
$crawler->setTrafficLimit(1000 * 1024);

// That's enough, now here we go
$crawler->go();

// At the end, after the process has finished, we print a short
// report (see the method getProcessReport() for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";
    
echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb; 
?> 

 

 

Source: http://cuab.de/

Download: http://sourceforge.net/projects/phpcrawl/files/PHPCrawl/

 

2. PHP Crawler

PHP Crawler is a simple website search script for small-to-medium websites. The only requirements are PHP and MySQL; no shell access is required.

 

Source/Download: http://sourceforge.net/projects/php-crawler/

 

3. Crawl Web Pages in PHP and jQuery

You all know that Google crawls web pages and indexes them into its database of billions of pages, using a program called a spider. In the same spirit, I coded a simple mini web crawler that crawls a given web page, extracts the links it contains and displays them. It is built with PHP and jQuery: when a user types in a URL and clicks "crawl", the script fetches the whole page and displays the links present in it. You can see the demo below and download the script for free.
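
As a hedged, minimal sketch of the server-side part of that idea (fetch one page, extract its anchors) using only standard PHP — the URL and error handling here are illustrative assumptions, not taken from the tutorial:

<?php
// Minimal sketch: fetch a single page and print every link found on it.
// Illustrative only - not the actual script from the tutorial above.
$url = "http://www.example.com/";   // hypothetical target URL
$html = @file_get_contents($url);   // requires allow_url_fopen=On
if ($html === false) die("Could not fetch ".$url."\n");

$dom = new DOMDocument();
@$dom->loadHTML($html);             // suppress warnings about sloppy HTML

foreach ($dom->getElementsByTagName("a") as $link)
{
  echo $link->getAttribute("href")."\n";
}
?>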

 

Demo 

 

Source: http://www.spixup.org/2013/04/crawl-web-pages-in-php-and-jquery.html

 

 

 

