Web Information Architecture: Assignment 1

hanyuanbo

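The requirements are as follows:

Topic: crawler and graph link analysis
1. Using Heritrix: install and configure Heritrix, then crawl the specified site: http://www.ccer.pku.edu.cn/
2. Heritrix code analysis: following the web crawler architecture from Week 2, locate these two parts of the crawler in Heritrix:
     isUrlVisited, politeness
    and analyze how they are implemented.
3. Graph link analysis of the collected web data: answer the following questions and describe the method used.
     How many pages does the site have?
     What do the in-degree and out-degree distributions look like?
     Which are the 10 most important pages?
Deliverable: a short technical report describing how the assignment was completed.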

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1. Configuring Heritrix
See my blog post at http://hanyuanbo.iteye.com/blog/777451
2. Analysis of isUrlVisited and politeness:

isUrlVisited
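isUrlVisited decides, when a link is about to enter the waiting queue, whether that link has already been crawled. If it has, the link is not queued for processing; otherwise it is.
To understand this, start with the structure that stores the URLs already crawled. Heritrix uses Berkeley DB internally. Berkeley DB is an open-source embedded database that stores data as key/value pairs, much like a persistent hash table, and provides applications with scalable, high-performance, transaction-protected data management. The database runs in the same address space as the application, so database operations need no inter-process communication. All operations go through a set of API calls, so there is no query language (such as SQL) to parse and no execution plan to generate, which greatly improves efficiency. It also copes with multi-threaded access and very large data volumes.
The main classes involved in storing URLs are in the org.archive.crawler.util package; their inheritance relationships are shown in the figure below: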

When creating a crawl job, the user can choose one of these filters. The default is BdbUriUniqFilter, and it is also the only one used during the Heritrix crawl.

The URLs already processed are stored in a Berkeley DB Database named alreadySeen.

protected transient Database alreadySeen = null;

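To save space, alreadySeen stores not the URL itself but a 64-bit fingerprint of it. To preserve URL locality, fingerprints are computed separately for the scheme-plus-host prefix (hostPlusScheme in the code) and for the whole URL, and the 24-bit host fingerprint is then combined with the 40-bit URL fingerprint to form the final 64-bit fingerprint. The fingerprint is computed in the createKey function, shown below: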

 /**
     * Create fingerprint.
     * Public access so test code can access createKey.
     * @param uri URI to fingerprint.
     * @return Fingerprint of passed <code>url</code>.
     */
    public static long createKey(CharSequence uri) {
        String url = uri.toString();
        int index = url.indexOf(COLON_SLASH_SLASH);
        if (index > 0) {
            index = url.indexOf('/', index + COLON_SLASH_SLASH.length());
        }
        CharSequence hostPlusScheme = (index == -1)? url: url.subSequence(0, index);
        long tmp = FPGenerator.std24.fp(hostPlusScheme);
        return tmp | (FPGenerator.std40.fp(url) >>> 24);
    }
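
To make the bit layout concrete, here is a minimal sketch of the same 24 + 40 split. It is not Heritrix code: FPGenerator is replaced by a plain 64-bit FNV-1a hash (an assumption made only so the example is self-contained, and not a Rabin fingerprint), and the class name FingerprintSketch is hypothetical.

```java
class FingerprintSketch {
    // Stand-in for FPGenerator (assumption): 64-bit FNV-1a, not a Rabin fingerprint.
    static long fnv1a64(CharSequence s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Same prefix rule as createKey: everything before the first '/' after "://".
    static String hostPlusScheme(String url) {
        int index = url.indexOf("://");
        if (index > 0) {
            index = url.indexOf('/', index + 3);
        }
        return (index == -1) ? url : url.substring(0, index);
    }

    static long createKey(String url) {
        // Top 24 bits come from the scheme-plus-host prefix.
        long hostBits = fnv1a64(hostPlusScheme(url)) & 0xFFFFFF0000000000L;
        // Low 40 bits come from the whole URL.
        long urlBits = fnv1a64(url) >>> 24;
        return hostBits | urlBits;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(createKey("http://www.ccer.pku.edu.cn/cn/x")));
        System.out.println(Long.toHexString(createKey("http://www.ccer.pku.edu.cn/cn/y")));
    }
}
```

Keys for URLs on the same host share their top 24 bits, so they sort next to each other in the database, which is the locality the text refers to.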

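The setAdd function adds a uri to the database. It returns false if the uri is already present and true otherwise. The key code is shown below (the comments reflect my own understanding):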

protected boolean setAdd(CharSequence uri) {
        DatabaseEntry key = new DatabaseEntry();
        LongBinding.longToEntry(createKey(uri), key);// convert the uri's fingerprint from long to a DatabaseEntry so the Database can store it
        long started = 0;
        
        OperationStatus status = null;
        try {
            if (logger.isLoggable(Level.INFO)) {
                started = System.currentTimeMillis();
            }
            status = alreadySeen.putNoOverwrite(null, key, ZERO_LENGTH_ENTRY);// insert only if the key is absent; status tells whether it had been seen before
            if (logger.isLoggable(Level.INFO)) {
                aggregatedLookupTime +=
                    (System.currentTimeMillis() - started);
            }
        } catch (DatabaseException e) {
            logger.severe(e.getMessage());
        }
        if (status == OperationStatus.SUCCESS) {
            count++;
            if (logger.isLoggable(Level.INFO)) {
                final int logAt = 10000;
                if (count > 0 && ((count % logAt) == 0)) {
                    logger.info("Average lookup " +
                        (aggregatedLookupTime / logAt) + "ms.");
                    aggregatedLookupTime = 0;
                }
            }
        }
        if(status == OperationStatus.KEYEXIST) {// already seen
            return false;
        } else {
            return true;
        }
    }
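
The insert-if-absent contract of setAdd can be illustrated without Berkeley DB. The sketch below is an in-memory analogue only: a HashSet stands in for the alreadySeen Database, and String.hashCode stands in for the 64-bit fingerprint (both are assumptions for illustration; the class name AlreadySeenSketch is hypothetical).

```java
import java.util.HashSet;
import java.util.Set;

class AlreadySeenSketch {
    // In-memory stand-in for the Berkeley DB "alreadySeen" database.
    private final Set<Long> alreadySeen = new HashSet<>();

    // Placeholder key (assumption); Heritrix stores the fingerprint from createKey.
    private static long key(String uri) {
        return uri.hashCode();
    }

    /** Returns true if the uri was new and has now been recorded, false if it was seen before. */
    boolean setAdd(String uri) {
        // Set.add plays the role of putNoOverwrite: it only inserts absent keys.
        return alreadySeen.add(key(uri));
    }

    public static void main(String[] args) {
        AlreadySeenSketch seen = new AlreadySeenSketch();
        System.out.println(seen.setAdd("http://www.ccer.pku.edu.cn/"));
        System.out.println(seen.setAdd("http://www.ccer.pku.edu.cn/"));
    }
}
```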

politeness
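(1) One connection to a server at a time
Politeness in Heritrix is implemented mainly in the Frontier: it opens only one connection to a given server at a time and makes sure URIs are processed at a controlled rate, so the crawled server is not overloaded.
The crawler traverses breadth-first and keeps the URLs waiting to be crawled in FIFO queues. Because of page locality, adjacent URLs in a single queue are likely to share a host name, so crawling them back to back would put a heavy load on that server. If instead the URLs are kept in many queues, each holding URLs with the same host name, and only one URL from each queue may be crawled at any time, the problem is avoided.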
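
The per-host queue idea can be sketched in a few lines. This is a simplified illustration, not the WorkQueueFrontier implementation: the class PerHostQueues and its methods schedule, next and finished are hypothetical names, and the host is taken from java.net.URI.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class PerHostQueues {
    // One FIFO queue per host name, in order of first appearance.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    // Hosts that currently have a fetch in flight.
    private final Set<String> busyHosts = new HashSet<>();

    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    void schedule(String url) {
        queues.computeIfAbsent(hostOf(url), h -> new ArrayDeque<>()).add(url);
    }

    /** Returns the next URL whose host has no fetch in flight, or null if none is available. */
    String next() {
        for (Map.Entry<String, Deque<String>> e : queues.entrySet()) {
            if (!busyHosts.contains(e.getKey()) && !e.getValue().isEmpty()) {
                busyHosts.add(e.getKey());
                return e.getValue().poll();
            }
        }
        return null;
    }

    /** Marks the fetch of url as done so its host can be contacted again. */
    void finished(String url) {
        busyHosts.remove(hostOf(url));
    }
}
```

With two URLs queued for one host and one for another, the second URL of the first host is only handed out after finished has been called for the first.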

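In Heritrix the queue of URLs sharing a host name is implemented by WorkQueue: one WorkQueue is one queue of URLs with the same host name. Heritrix also keeps several other queues, declared as follows (in org.archive.crawler.frontier.WorkQueueFrontier.java):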

/** All known queues.
     */
    protected transient ObjectIdentityCache<String,WorkQueue> allQueues = null; 
    // of classKey -> ClassKeyQueue

  /**
     * Set up the various queues-of-queues used by the frontier. Override
     * in implementing subclasses to reduce or eliminate risk of queues
     * growing without bound. 
     */
    protected void initQueuesOfQueues() {
        // small risk of OutOfMemoryError: if 'hold-queues' is false,
        // readyClassQueues may grow in size without bound
        readyClassQueues = new LinkedBlockingQueue<String>();
        // risk of OutOfMemoryError: in large crawls, 
        // inactiveQueues may grow in size without bound
        inactiveQueues = new LinkedBlockingQueue<String>();
        // risk of OutOfMemoryError: in large crawls with queue max-budgets, 
        // inactiveQueues may grow in size without bound
        retiredQueues = new LinkedBlockingQueue<String>();
        // small risk of OutOfMemoryError: in large crawls with many 
        // unresponsive queues, an unbounded number of snoozed queues 
        // may exist
        snoozedClassQueues = Collections.synchronizedSortedSet(new TreeSet<WorkQueue>());
    }

Initialization in the subclass BdbFrontier is as follows:

public void initialize(CrawlController c)
    throws FatalConfigurationException, IOException {
        this.controller = c;
        // fill in anything from a checkpoint recovery first (because
        // usual initialization will skip initQueueOfQueues in checkpoint)
        if (c.isCheckpointRecover()) {
            // If a checkpoint recover, copy old values from serialized
            // instance into this Frontier instance. Do it this way because 
            // though its possible to serialize BdbFrontier, its currently not
            // possible to set/remove frontier attribute plugging the
            // deserialized object back into the settings system.
            // The below copying over is error-prone because its easy
            // to miss a value.  Perhaps there's a better way?  Introspection?
            BdbFrontier f = null;
            try {
                f = (BdbFrontier)CheckpointUtils.
                    readObjectFromFile(this.getClass(),
                        c.getCheckpointRecover().getDirectory());
            } catch (FileNotFoundException e) {
                throw new FatalConfigurationException("Failed checkpoint " +
                    "recover: " + e.getMessage());
            } catch (IOException e) {
                throw new FatalConfigurationException("Failed checkpoint " +
                    "recover: " + e.getMessage());
            } catch (ClassNotFoundException e) {
                throw new FatalConfigurationException("Failed checkpoint " +
                    "recover: " + e.getMessage());
            }

            this.nextOrdinal = f.nextOrdinal;
            this.totalProcessedBytes = f.totalProcessedBytes;
            this.liveDisregardedUriCount = f.liveDisregardedUriCount;
            this.liveFailedFetchCount = f.liveFailedFetchCount;
            this.processedBytesAfterLastEmittedURI =
                f.processedBytesAfterLastEmittedURI;
            this.liveQueuedUriCount = f.liveQueuedUriCount;
            this.liveSucceededFetchCount = f.liveSucceededFetchCount;
            this.lastMaxBandwidthKB = f.lastMaxBandwidthKB;
            this.readyClassQueues = f.readyClassQueues;
            this.inactiveQueues = reinit(f.inactiveQueues,"inactiveQueues");// restore inactiveQueues
            this.retiredQueues = reinit(f.retiredQueues,"retiredQueues");// restore retiredQueues
            this.snoozedClassQueues = f.snoozedClassQueues;// restore snoozedClassQueues
            this.inProcessQueues = f.inProcessQueues;
            super.initialize(c);
            wakeQueues();
        } else {
            // perform usual initialization 
            super.initialize(c);
        }
    }

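readyClassQueues holds the keys of the queues that are ready to be crawled;
inactiveQueues holds the keys of all inactive URL queues;
retiredQueues holds the keys of URL queues that will no longer be activated;
snoozedClassQueues holds all snoozed URL queues, sorted by wake time.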
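
Ordering snoozed queues by wake time makes it cheap to find the ones that are due. The sketch below shows the idea with a PriorityQueue. This is an assumption for illustration: Heritrix itself uses a synchronized TreeSet of WorkQueue objects, and the names SnoozedQueuesSketch, snooze and wake are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class SnoozedQueuesSketch {
    private static final class Snoozed {
        final String queueKey;
        final long wakeTimeMs;

        Snoozed(String queueKey, long wakeTimeMs) {
            this.queueKey = queueKey;
            this.wakeTimeMs = wakeTimeMs;
        }
    }

    // Earliest wake time first.
    private final PriorityQueue<Snoozed> snoozed =
        new PriorityQueue<>(Comparator.comparingLong((Snoozed s) -> s.wakeTimeMs));

    void snooze(String queueKey, long wakeTimeMs) {
        snoozed.add(new Snoozed(queueKey, wakeTimeMs));
    }

    /** Removes and returns the keys of all queues whose wake time is at or before nowMs. */
    List<String> wake(long nowMs) {
        List<String> ready = new ArrayList<>();
        while (!snoozed.isEmpty() && snoozed.peek().wakeTimeMs <= nowMs) {
            ready.add(snoozed.poll().queueKey);
        }
        return ready;
    }
}
```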

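A worker thread takes the first URL from a queue in readyClassQueues, or from a queue in snoozedClassQueues whose wake time has arrived, downloads the corresponding document, and then removes that URL from its queue. Every newly discovered URL must be assigned to a queue. First, the URL's host name is used to look up the queue for that host, and a new queue is created if none exists. Next, the frontier checks whether the queue is already in its active lifecycle and marks it so if not. If the queue has to stay inactive, or the number of active queues exceeds the configured threshold, the queue is placed in inactiveQueues; otherwise it goes into readyClassQueues.
Heritrix also has a number of parameters that limit how often a server is accessed: max-delay-ms, the longest wait, defaults to 30 seconds; min-delay-ms, the minimum wait before reconnecting to the same server, defaults to 3 seconds; and delay-factor, the multiple of the previous connection's duration to wait before reconnecting to the same server, defaults to 5.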
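
One way to read these three settings together is as a clamp: wait delay-factor times the duration of the last fetch, but never less than min-delay-ms and never more than max-delay-ms. The sketch below encodes that reading with the default values quoted above. The class and method names are hypothetical, and the exact formula inside Heritrix is an assumption here.

```java
class PolitenessDelaySketch {
    static final long MIN_DELAY_MS = 3_000;   // min-delay-ms default quoted above
    static final long MAX_DELAY_MS = 30_000;  // max-delay-ms default quoted above
    static final double DELAY_FACTOR = 5.0;   // delay-factor default quoted above

    /** Wait time before contacting the same server again, given how long the last fetch took. */
    static long delayMs(long lastFetchDurationMs) {
        long scaled = (long) (lastFetchDurationMs * DELAY_FACTOR);
        return Math.max(MIN_DELAY_MS, Math.min(MAX_DELAY_MS, scaled));
    }

    public static void main(String[] args) {
        for (long fetchMs : new long[] {100, 1_000, 10_000}) {
            System.out.println(fetchMs + " ms fetch -> wait " + delayMs(fetchMs) + " ms");
        }
    }
}
```

Under this reading a slow server is automatically given more breathing room, while a very fast one still gets at least the minimum pause.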

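(2) robots.txt
robots.txt, known as the robots exclusion protocol, is placed in a site's root directory. The file declares which parts of the site robots should not visit, or restricts search engines to indexing only specified content. It is a gentlemen's agreement: a crawler can ignore it, but honoring it is the polite thing to do.
Heritrix handles robots.txt during the preprocessing stage. It wraps the allow and disallow rules for each user-agent in a RobotsDirectives object and stores the whole robots.txt in a Robotstxt object.
Heritrix has five policies for handling robots.txt, all encapsulated in RobotsHonoringPolicy:
Classic: obey the first set of robots.txt directives that applies to the current user-agent.
Ignore: ignore robots.txt entirely.
Custom: obey a specific, operator-entered set of robots.txt directives for a given host.
Most-favored: obey the most liberal restrictions offered (if any crawler is allowed to fetch a page, fetch it).
Most-favored-set: given a set of user-agent patterns, obey the most liberal restriction offered to any of them.

With the Most-favored or Most-favored-set policy, the crawler can optionally masquerade as another user agent.
The RobotsExclusionPolicy class contains the method Heritrix finally uses to apply robots.txt: disallows decides whether a userAgent may access a given url. Its behavior follows entirely from the robots.txt policy the user selected when creating the crawl job.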
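
The difference between the classic and most-favored policies is easier to see on a toy parser. The sketch below is a simplified illustration, not the Heritrix classes: it handles only User-agent and Disallow lines with plain prefix matching, and the names RobotsSketch, disallows and mostFavoredDisallows are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class RobotsSketch {
    // user-agent token (lower case) -> Disallow path prefixes, in file order
    private final Map<String, List<String>> disallowsByAgent = new LinkedHashMap<>();

    RobotsSketch(String robotsTxt) {
        List<String> currentAgents = new ArrayList<>();
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.replaceFirst("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines share one record.
                if (!lastWasAgent) currentAgents = new ArrayList<>();
                String agent = value.toLowerCase(Locale.ROOT);
                currentAgents.add(agent);
                disallowsByAgent.computeIfAbsent(agent, a -> new ArrayList<>());
                lastWasAgent = true;
            } else {
                if (field.equals("disallow") && !value.isEmpty()) {
                    for (String agent : currentAgents) disallowsByAgent.get(agent).add(value);
                }
                lastWasAgent = false;
            }
        }
    }

    /** Classic-style: use the first record whose token occurs in userAgent, else the "*" record. */
    boolean disallows(String userAgent, String path) {
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, List<String>> e : disallowsByAgent.entrySet()) {
            String token = e.getKey();
            if (!token.equals("*") && !token.isEmpty() && ua.contains(token)) {
                return matches(e.getValue(), path);
            }
        }
        return matches(disallowsByAgent.getOrDefault("*", new ArrayList<String>()), path);
    }

    /** Most-favored-style: the path is blocked only if every record blocks it. */
    boolean mostFavoredDisallows(String path) {
        for (List<String> prefixes : disallowsByAgent.values()) {
            if (!matches(prefixes, path)) return false;
        }
        return !disallowsByAgent.isEmpty();
    }

    private static boolean matches(List<String> prefixes, String path) {
        for (String p : prefixes) {
            if (path.startsWith(p)) return true;
        }
        return false;
    }
}
```

For a file that blocks /private/ for "*" but not for a named bot, the classic check blocks an unnamed crawler while the most-favored check lets the page through, because at least one agent is allowed to fetch it.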

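This is reflected in the source code as follows. The files
RobotsDirectives.java
Robotstxt.java
RobotsHonoringPolicy.java
RobotsExclusionPolicy.java
are all in the org.archive.crawler.datamodel package, and each class's Javadoc comment can be read directly in its source file: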

/**
 * Represents the directives that apply to a user-agent (or set of
 * user-agents)
 */
public class RobotsDirectives implements Serializable{
...
}

/**
 * Utility class for parsing and representing 'robots.txt' format 
 * directives, into a list of named user-agents and map from user-agents 
 * to RobotsDirectives. 
 */
public class Robotstxt implements Serializable{
...
}

/**
 * RobotsHonoringPolicy represent the strategy used by the crawler 
 * for determining how robots.txt files will be honored. 
 *
 * Five kinds of policies exist:
 * <dl>
 * <dt>classic:</dt>
 *   <dd>obey the first set of robots.txt directives that apply to your 
 *   current user-agent</dd>
 * <dt>ignore:</dt>
 *   <dd>ignore robots.txt directives entirely</dd>
 * <dt>custom:</dt>
 *   <dd>obey a specific operator-entered set of robots.txt directives 
 *   for a given host</dd>
 * <dt>most-favored:</dt>
 *   <dd>obey the most liberal restrictions offered (if *any* crawler is 
 *   allowed to get a page, get it)</dd>
 * <dt>most-favored-set:</dt>
 *   <dd>given some set of user-agent patterns, obey the most liberal 
 *   restriction offered to any</dd>
 * </dl>
 *
 * The two last ones has the opportunity of adopting a different user-agent 
 * to reflect the restrictions we've opted to use.
 *
 */
public class RobotsHonoringPolicy  extends ModuleType{
...
}

/**
 * RobotsExclusionPolicy represents the actual policy adopted with 
 * respect to a specific remote server, usually constructed from 
 * consulting the robots.txt, if any, the server provided. 
 * 
 * (The similarly named RobotsHonoringPolicy, on the other hand, 
 * describes the strategy used by the crawler to determine to what
 * extent it respects exclusion rules.)
 * 
 * The expiration of policies after a suitable amount of time has
 * elapsed since last fetch is handled outside this class, in 
 * CrawlServer itself. 
 * 
 * TODO: refactor RobotsHonoringPolicy to be a class-per-policy, and 
 * then see if a CrawlServer with a HonoringPolicy and a RobotsTxt
 * makes this mediating class unnecessary. 
 * 
 * @author gojomo
 *
 */
public class RobotsExclusionPolicy implements Serializable{
...
}

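3. Based on an analysis of the local mirror files of the crawled site, I wrote the following code to obtain the answers (the results may not be exact, but they should be roughly right).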

package com.analysis.sishendaili;

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StringUtil {

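	/**
	 * Uses a regular expression to collect the link URLs found in <a href ...> or <A href ...> tags in a file's content.
	 * mailto and javascript links are dropped because they are not page URLs.
	 * @param content the page content as one string
	 * @return the set of distinct link targets
	 */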
	public static Set<String> getURLs(String content){
		Set<String> allURLs = new HashSet<String>();
		String regex = "<[aA] href=\"[^\"]+";// regular expression for outgoing links
		Matcher matcher = Pattern.compile(regex).matcher(content);
		while(matcher.find()){
			String ahref = matcher.group();
			int index = ahref.indexOf("\"");
			if(index > 0 && !ahref.toLowerCase().contains("mailto") && !ahref.toLowerCase().contains("javascript")){// skip mailto and javascript <a href...> links
				String url = ahref.substring(index+1);
				url = StringUtil.trimLastSlash(url);
				allURLs.add(url);
			}
		}
		
		return allURLs;
	}
	
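	/**
	 * Strips a trailing slash so the URL can be looked up in the map.
	 * Some links have one and some do not even though they refer to the same URL, so it is always removed before comparing.
	 * @param origin the URL to normalize
	 * @return the URL without a trailing slash or backslash
	 */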
	public static String trimLastSlash(String origin){
		int length = origin.length();
		if(origin.endsWith("\\") || origin.endsWith("/")){
			return origin.substring(0, length-1);
		}else{
			return origin;
		}
	}
	
	public static void main(String[] args) {
		String filename = "jobs\\ccer3-20101019015958086\\mirror\\www.ccer.pku.edu.cn\\cn\\facultySecondClassId=207.asp";
		String content = FileUtil.getDiskFileContentInOneLine(filename);
		Set<String> allURLs = getURLs(content);
		System.out.println(allURLs.size());
		for(String url : allURLs){
			System.out.println(url);
		}
	}
}
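
getURLs returns each href exactly as written in the page, so a relative link and an absolute link to the same page count as different strings. One possible refinement, which is not part of the original analysis, is to resolve every href against the URL of the page it came from. The sketch below uses java.net.URI for this; the class name LinkResolveSketch is hypothetical.

```java
import java.net.URI;

class LinkResolveSketch {
    /** Resolves href against the URL of the page it appeared on; falls back to the raw href. */
    static String resolve(String pageUrl, String href) {
        try {
            return URI.create(pageUrl).resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            // hrefs with illegal characters (for example unescaped Chinese text) are kept as-is
            return href;
        }
    }

    public static void main(String[] args) {
        String page = "http://www.ccer.pku.edu.cn/cn/index.asp";
        System.out.println(resolve(page, "news.asp?id=1"));
        System.out.println(resolve(page, "/cn/"));
        System.out.println(resolve(page, "http://www.pku.edu.cn/"));
    }
}
```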

package com.analysis.sishendaili;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class FileUtil {

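	/**
	 * Reads the content of the given file, joining all lines without separators.
	 * @param filename path of the file to read
	 * @return the file content as a single line
	 */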
	public static String getDiskFileContentInOneLine(String filename) {
		StringBuffer sb = new StringBuffer();
		BufferedReader reader = null;
		try {
			reader = new BufferedReader(new FileReader(new File(filename)));
			String line = "";
			while((line = reader.readLine()) != null){
				sb.append(line);
			}
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally{
			if(reader != null){
				try {
					reader.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
		return sb.toString();
	}
	
	public static String getDiskFileContentWithLines(String filename){
		StringBuffer sb = new StringBuffer();
		BufferedReader reader = null;
		try {
			reader = new BufferedReader(new FileReader(new File(filename)));
			String line = "";
			while((line = reader.readLine()) != null){
				sb.append(line + "\n");
			}
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally{
			if(reader != null){
				try {
					reader.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
		return sb.toString();
	}
	
	public static void main(String[] args) {
		String filename = "jobs\\ccer3-20101019015958086\\mirror\\www.ccer.pku.edu.cn\\cn\\facultySecondClassId=207.asp";
		String content = FileUtil.getDiskFileContentInOneLine(filename);
		System.out.println(content);
	}

}

package com.analysis.sishendaili;

public class Convert {
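	/**
	 * Chinese text taken from downloaded pages can appear garbled.
	 * This function converts it back to the original Chinese characters
	 * by re-decoding the ISO-8859-1 bytes as gb2312.
	 * Encodings involved: utf-8, ISO-8859-1, gb2312, gbk
	 * @param str the garbled string
	 * @return the decoded string, or "" if the conversion fails
	 */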
	public static String convert(String str) {
		String result = "";
		try {
			result = new String(str.getBytes("ISO-8859-1"), "gb2312");
		} catch (Exception ex) {
			result = "";
		}
		return result;
	}

	public static void main(String[] args) {
		String msg = "Resultkeyword=2005Äê4ÔÂ19ÈÕÐ£ÄÚË«Ñ§Î»»®¿¨½»·ÑÇé¿ö.asp";
		String result = Convert.convert(msg);
		System.out.println(result);//Resultkeyword=2005年4月19日校内双学位划卡交费情况.asp
		
		msg = "ÑÐ¾¿ÉúÐÂ¿Î£º¾¼Ã³É³¤¹ØÁ¬ÑÐ¾¿£¨8.26¸üÐÂ£©";
		result = Convert.convert(msg);
		System.out.println(result);
	}
}
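
Why the conversion works can be shown as a round trip: GB2312 bytes that were wrongly decoded as ISO-8859-1 keep their byte values, so re-encoding as ISO-8859-1 and decoding as GB2312 recovers the text. The sketch below demonstrates this on the single character 年 (written as \u5e74 so the source file encoding does not matter). The class name MojibakeSketch is hypothetical, and the example assumes the JDK in use ships the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeSketch {
    // Assumes the runtime provides GB2312 (it is an optional charset).
    static final Charset GB2312 = Charset.forName("GB2312");

    /** Simulates the damage: GB2312 bytes read as if they were ISO-8859-1. */
    static String garble(String original) {
        return new String(original.getBytes(GB2312), StandardCharsets.ISO_8859_1);
    }

    /** Undoes the damage, as Convert.convert does. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), GB2312);
    }

    public static void main(String[] args) {
        String garbled = garble("\u5e74");
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals("\u5e74"));
    }
}
```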

package com.analysis.sishendaili;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FilenameFilter;
import java.io.IOException;
import java.io.PrintStream;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Analysis {

	public static String dotASPQues = ".asp?";
	public static String baseDir = "jobs\\ccer3-20101019015958086\\mirror\\www.ccer.pku.edu.cn\\cn\\";

	/**
	 * Heritrix rewrote page URLs when saving the downloaded pages.
	 * For names that start with one of the strings below, it dropped ".asp?" and appended ".asp" at the end.
	 */
	@SuppressWarnings("serial")
	public static Set<String> items = new HashSet<String>(){{
		this.add("faculty");
		this.add("news");
		this.add("rdlist");
		this.add("ReadNews");
		this.add("readview");
		this.add("Result");
		this.add("review");
		this.add("SecondClass");
		this.add("SmallClass");
		this.add("Special_News");
		this.add("Special");
	}};
	
	
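	/**
	 * Maps each unprocessed page name (as saved on disk) one-to-one to its processed (restored) name.
	 * The pages http://www.pku.edu.cn/ and http://www.pku.edu.cn/cn/ were also added.
	 * After processing there are 18901 pages in total.
	 */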
	public static Map<String,String> map = new HashMap<String,String>();
	
	public static Map<String,Integer> out = new HashMap<String,Integer>();
	
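	// in is initialized with every page under ccer. During processing, URLs that are linked from
	// ccer pages but are not ccer pages themselves are added as well; they matter to ccer even
	// though they are not part of the site.
	public static Map<String,Integer> in = new HashMap<String,Integer>();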
	
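	// LinkedHashMap keeps insertion order, so the pages are written out from highest in-degree down
	public static Map<String,Integer> mostImportant10Pages = new java.util.LinkedHashMap<String,Integer>();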
	
	/**
	 * Lists all matching files under the given path.
	 * @param path directory to scan
	 * @return names of the files that look like pages
	 */
	public static String[] getAllPages(String path){
		File file = new File(path);
		String[] pages = file.list(new FilenameFilter(){
			public boolean accept(File file, String name) {
				return name.endsWith("asp");// files ending in asp turned out to be the pages
			}
		});
		return pages;
	}
	
	/**
	 * Restores a file name saved by Heritrix to the original URL.
	 * @param url the saved (rewritten) name
	 * @return the original URL form
	 */
	public static String toRightFormat(String url){
		for(String item : items){
			if(url.startsWith(item) && url.contains("=")){
				int index = url.indexOf(item);
				int pos = index + item.length();
				int length = url.length();
				url = url.substring(0, pos) + dotASPQues + url.substring(pos, length - 4);// length - 4 drops the trailing ".asp"
				break;
			}
		}
		return url;
	}
	
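	/**
	 * First pass over all pages under the path directory.
	 * Heritrix renamed the crawled pages when saving them, so computing in- and out-degree requires restoring the original names.
	 * Some names are also garbled and need to be converted.
	 * Takes about 2 seconds.
	 * Run only once, since its sole purpose is to list the page names in the correct (original) format.
	 * @throws Exception
	 */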
	public static void processAllPages() throws Exception{
		String path = baseDir;
		System.setOut(new PrintStream(new File("analysis\\allPages.txt")));
		
		String[] pages = Analysis.getAllPages(path);
		for(int i=0;i<pages.length;i++){
			String url = Convert.convert(pages[i]);
			url = toRightFormat(url);
			System.out.println(url);
		}
	}
	
	/**
	 * Fills map and in with the restored URL (absolute or relative) of every page.
	 */
	public static void initialize2Maps(){
		String path = baseDir;
		String[] pages = Analysis.getAllPages(path);
		for(int i=0;i<pages.length;i++){
			String url = Convert.convert(pages[i]);
			url = toRightFormat(url);
			url = StringUtil.trimLastSlash(url);
			map.put(pages[i],url);
			in.put(url, 0);
		}
	}
	
	/**
	 * Main entry point for computing in-degree and out-degree.
	 */
	public static void process() throws Exception{
		initialize2Maps();
		for(String file : map.keySet()){
			String key = map.get(file);
			String filename = baseDir + file;
			String content = FileUtil.getDiskFileContentInOneLine(filename);
			if(content != null && !content.trim().equals("")){
				Set<String> allURLs = StringUtil.getURLs(content);
				
				out.put(key, allURLs.size());// the out-degree is the number of distinct links on the page
				
				for(String url : allURLs){// update the in-degree of every link target
					if(in.containsKey(url)){
						int du = in.get(url);
						in.put(url, ++du);
					}else{
						in.put(url, 1);
					}
				}
			}
		}
		
		getMostImportant10Pages();
		map_in_out_mostImportant10Pages_toDisk();
	}
	
	/**
	 * Post-processing once in.txt is available.
	 * Producing in.txt takes a long time, so later runs recompute the ranking directly from that file.
	 * @throws Exception
	 */
	public static void getMostImportant10PagesAfter() throws Exception{
		String filename = "analysis\\in.txt";
		Map<String,Integer> _in = new HashMap<String,Integer>();
		BufferedReader reader = null;
		try {
			reader = new BufferedReader(new FileReader(new File(filename)));
			String line = "";
			while((line = reader.readLine()) != null){
				String[] _map = line.split("\t\t");
				_in.put(_map[0], Integer.parseInt(_map[1]));
			}
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally{
			if(reader != null){
				try {
					reader.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
		
		for(int i=0;i<13;i++){
			int maxDu = -1;
			String maxKey = "";
			for(String key : _in.keySet()){
				int du = _in.get(key);
				if((du >= maxDu) && !mostImportant10Pages.containsKey(key)){
					maxKey = key;
					maxDu = du;
				}
			}
			mostImportant10Pages.put(maxKey, maxDu);
		}
		
		System.setOut(new PrintStream(new File("analysis\\mostImportant10Pages.txt")));
		for(String key : mostImportant10Pages.keySet()){
			int value = mostImportant10Pages.get(key);
			System.out.println(key + "\t\t" + value);
		}
		
	}
	
	/**
	 * The 10 pages with the highest in-degree.
	 */
	public static void getMostImportant10Pages(){
		for(int i=0;i<10;i++){
			int maxDu = -1;
			String maxKey = "";
			for(String key : in.keySet()){
				int du = in.get(key);
				if((du >= maxDu) && !mostImportant10Pages.containsKey(key)){
					maxKey = key;
					maxDu = du;
				}
			}
			mostImportant10Pages.put(maxKey, maxDu);
		}
	}
	
	public static void map_in_out_mostImportant10Pages_toDisk() throws Exception{
		System.setOut(new PrintStream(new File("analysis\\wangyi_map.txt")));
		for(String key : map.keySet()){
			String value = map.get(key);
			System.out.println(key + "\t\t" + value);
		}
		
		System.setOut(new PrintStream(new File("analysis\\wangyi_out.txt")));
		for(String key : out.keySet()){
			int value = out.get(key);
			System.out.println(key + "\t\t" + value);
		}
		
		System.setOut(new PrintStream(new File("analysis\\wangyi_in.txt")));
		for(String key : in.keySet()){
			int value = in.get(key);
			System.out.println(key + "\t\t" + value);
		}
		
		System.setOut(new PrintStream(new File("analysis\\wangyi_mostImportant10Pages.txt")));
		for(String key : mostImportant10Pages.keySet()){
			int value = mostImportant10Pages.get(key);
			System.out.println(key + "\t\t" + value);
		}
	}
	
	public static void main(String[] args){
		try {
			process();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}

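The four files above were placed in the Heritrix project; running Analysis is enough. After about half an hour of processing it produces an analysis folder containing four files: in.txt, out.txt, map.txt and mostImportant10Pages.txt.
4. A screenshot of the crawl report follows: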
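
The in and out maps hold one degree per page, while the assignment asks for the degree distribution. A minimal sketch of how such maps could be turned into a histogram (degree to number of pages) and a sorted top-k list is given below. The class DegreeStatsSketch and its methods are hypothetical and were not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DegreeStatsSketch {
    /** Returns degree -> number of pages with that degree, sorted by degree. */
    static Map<Integer, Integer> distribution(Map<String, Integer> degrees) {
        Map<Integer, Integer> hist = new TreeMap<>();
        for (int d : degrees.values()) {
            hist.merge(d, 1, Integer::sum);
        }
        return hist;
    }

    /** Returns the k entries with the largest degree, highest first. */
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> degrees, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(degrees.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries.subList(0, Math.min(k, entries.size()));
    }
}
```

Calling distribution(in) and distribution(out) after process() would give the two distributions, and topK(in, 10) would give the same ranking as getMostImportant10Pages with a single sort instead of ten scans.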