To build Heritrix in Eclipse


This guide uses Heritrix 1.14.4 (the release dated May 10, 2010, which is the latest version at the time of writing).

1. Download the following two archives from http://sourceforge.net/projects/archive-crawler/
heritrix-1.14.4.zip
heritrix-1.14.4-src.zip

2. Create a Java project in Eclipse, then extract both
heritrix-1.14.4.zip
heritrix-1.14.4-src.zip

3. From the extracted heritrix-1.14.4-src.zip, copy the three folders com, org, and st under src/java into the project's src folder.
4. From the extracted heritrix-1.14.4-src.zip, copy the conf folder under src into the project root directory.
5. From the extracted heritrix-1.14.4-src.zip, copy the lib folder into the project root directory.
6. From the extracted heritrix-1.14.4-src.zip, copy the file tlds-alpha-by-domain.txt under src/resources/org/archive/util into the project's org.archive.util package.
7. From the extracted heritrix-1.14.4.zip, copy the webapps folder into the project root directory.
If the folder is not named webapps, change the following method in Heritrix.java to match.

    /**
     * @throws IOException
     * @return Returns the directory under which reside the WAR files
     * we're to load into the servlet container.
     */
    public static File getWarsdir()
    throws IOException {
        return getSubDir("webapps");
    }
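
With steps 3 to 7 done, the project root should hold the src, conf, lib, and webapps folders. The following is a minimal sketch for checking that layout before moving on. The class ProjectLayoutCheck and its method are hypothetical helpers written for this post, not part of Heritrix.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Heritrix: reports which of the folders
// created in steps 3 to 7 are missing from the project root.
class ProjectLayoutCheck {
    // Folder names taken from the steps above.
    static final String[] EXPECTED = {"src", "conf", "lib", "webapps"};

    static List<String> missingDirs(File projectRoot) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!new File(projectRoot, name).isDirectory()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Pass the project root as the first argument, or run from inside it.
        File root = new File(args.length > 0 ? args[0] : ".");
        System.out.println("Missing folders: " + missingDirs(root));
    }
}
```

An empty list means all four folders are in place.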


8. Edit the configuration file heritrix.properties in the conf folder.

# Set the admin user name and password (user:password)
heritrix.cmdline.admin = admin:admin
# Set the port
heritrix.cmdline.port = 8080
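
To double-check the two values before launching, they can be read back with the standard java.util.Properties class. This is only a sketch: the class AdminSettings is hypothetical, and the assumption that the admin value has the form user:password comes from the admin:admin example above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Heritrix: reads the two settings from step 8.
class AdminSettings {
    final String user;
    final String password;
    final int port;

    AdminSettings(String user, String password, int port) {
        this.user = user;
        this.password = password;
        this.port = port;
    }

    static AdminSettings parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String admin = props.getProperty("heritrix.cmdline.admin");
        String port = props.getProperty("heritrix.cmdline.port");
        if (admin == null || port == null) {
            throw new IllegalArgumentException("admin and port must both be set");
        }
        // Assumed format: user:password, as in the admin:admin example.
        int colon = admin.indexOf(':');
        if (colon <= 0 || colon == admin.trim().length() - 1) {
            throw new IllegalArgumentException("expected user:password");
        }
        return new AdminSettings(admin.substring(0, colon).trim(),
                admin.substring(colon + 1).trim(),
                Integer.parseInt(port.trim()));
    }

    public static void main(String[] args) {
        AdminSettings s = parse("heritrix.cmdline.admin = admin:admin\n"
                + "heritrix.cmdline.port = 8080\n");
        System.out.println(s.user + " / " + s.password + " on port " + s.port);
    }
}
```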


9. Add all the JAR files in the lib folder to the project's build path.
10. Right-click org.archive.crawler.Heritrix.java in the project, open its run configuration, and select the Classpath tab.
Select User Entries, then click Advanced.
Select Add Folders and add the conf folder.
Click Run to start Heritrix.
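
Adding conf as a classpath folder makes the files inside it visible as classpath resources. The sketch below illustrates that general Java behavior with a class loader pointed at a single folder. The class ClasspathFolderDemo is hypothetical, and how Heritrix itself looks up heritrix.properties is not covered in this post, so treat this as an illustration rather than a description of Heritrix internals.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical demo, not part of Heritrix: checks whether a file inside a
// folder is visible as a resource once that folder is a classpath entry.
class ClasspathFolderDemo {
    static boolean visibleFrom(File folder, String resourceName) {
        URL[] entries;
        try {
            entries = new URL[] { folder.toURI().toURL() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Parent is null so only this folder (plus the JDK itself) is searched.
        try (URLClassLoader loader = new URLClassLoader(entries, null)) {
            return loader.getResource(resourceName) != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File conf = new File(args.length > 0 ? args[0] : "conf");
        System.out.println("heritrix.properties visible: "
                + visibleFrom(conf, "heritrix.properties"));
    }
}
```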

05:22:32.875 EVENT  Starting Jetty/4.2.23
05:22:32.937 WARN!! Delete existing temp dir C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\Jetty_127_0_0_1_8080__ for WebApplicationContext[/,jar:file:/D:/workspace/jcjcd/heritrixDemo/webapps/admin.war!/]
05:22:33.062 EVENT  Started WebApplicationContext[/,Heritrix Console]
05:22:33.156 EVENT  Started SocketListener on 127.0.0.1:8080
05:22:33.156 EVENT  Started org.mortbay.jetty.Server@1f6f0bf
Heritrix version: @VERSION@
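
The log reports a SocketListener on 127.0.0.1:8080. A quick way to confirm that something is accepting connections there is a plain socket connect, sketched below. The class ConsoleProbe is a hypothetical helper, and a successful connect only shows that the port is open, not that the Heritrix console is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Heritrix: tests whether a TCP port accepts connections.
class ConsoleProbe {
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Host and port as shown in the log output above.
        System.out.println("Console reachable: " + isListening("127.0.0.1", 8080, 2000));
    }
}
```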


This completes the Heritrix configuration in Eclipse.

Now we can create a job for testing.


1. Open http://127.0.0.1:8080 in your browser and log in with the user name and password set in the configuration file.
2. Next, create a job: select Jobs in the navigation menu, then under Create New Job select With defaults.

3. Fill in the name, the description, and the URL to crawl.
4. Select Modules. Here we save the crawl results as a mirror of the site, whereas the default output is compressed. Under Writers, remove org.archive.crawler.writer.ARCWriterProcessor and add org.archive.crawler.writer.MirrorWriterProcessor in its place.
5. Select Settings. Many items can be set on this page, such as the maximum number of threads and the timeouts.
Two items under http-headers must be set:
user-agent: Mozilla/5.0 (compatible; heritrix/@VERSION@ +PROJECT_URL_HERE)
from: CONTACT_EMAIL_ADDRESS_HERE

Here I simply replaced @VERSION@ with the Heritrix version,
changed PROJECT_URL_HERE to http:// followed by the local IP address,
and replaced CONTACT_EMAIL_ADDRESS_HERE with an arbitrary email address. Once the configuration above is complete, select Submit job.
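
The placeholder substitution in step 5 can be sketched as plain string replacement. The class HeaderTemplate is hypothetical, the exact spacing of the template (heritrix/@VERSION@ +PROJECT_URL_HERE) is an assumption, and the values in main are examples only.

```java
// Hypothetical helper, not part of Heritrix: fills in the header placeholders from step 5.
class HeaderTemplate {
    // Template spacing is an assumption; copy the exact string from the Settings page.
    static final String USER_AGENT_TEMPLATE =
            "Mozilla/5.0 (compatible; heritrix/@VERSION@ +PROJECT_URL_HERE)";

    static String fillUserAgent(String version, String projectUrl) {
        return USER_AGENT_TEMPLATE
                .replace("@VERSION@", version)
                .replace("PROJECT_URL_HERE", projectUrl);
    }

    // True if any of the three placeholders is still present in the value.
    static boolean hasPlaceholders(String value) {
        return value.contains("@VERSION@")
                || value.contains("PROJECT_URL_HERE")
                || value.contains("CONTACT_EMAIL_ADDRESS_HERE");
    }

    public static void main(String[] args) {
        String userAgent = fillUserAgent("1.14.4", "http://127.0.0.1");
        System.out.println(userAgent);
        System.out.println("Placeholders left: " + hasPlaceholders(userAgent));
    }
}
```

Checking for leftover placeholders before submitting the job catches the case where one of the two headers was forgotten.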





6. Go to the Console and click Start to begin the crawl job.
When the crawl has finished, the results can be found in the jobs folder under the project.
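
To see what the crawl wrote, the jobs folder can be walked recursively. The sketch below uses only java.nio.file. The class CrawlOutputLister is a hypothetical helper, and it makes no assumption about how Heritrix names the subfolders inside jobs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Heritrix: lists every regular file under a folder.
class CrawlOutputLister {
    static List<Path> listFiles(Path jobsDir) {
        if (!Files.isDirectory(jobsDir)) {
            return Collections.emptyList();
        }
        try (Stream<Path> paths = Files.walk(jobsDir)) {
            return paths.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the jobs folder under the current directory.
        Path jobs = Paths.get(args.length > 0 ? args[0] : "jobs");
        for (Path file : listFiles(jobs)) {
            System.out.println(file);
        }
    }
}
```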

Source: http://www.codeweblog.com/to-build-heritrix-in-eclipse/
