Studying the Frontier in Heritrix

lionsadness



Blog category: Heritrix study

[Figure: Frontier diagram]

(1) BdbFrontier, the link factory: initQueue() initializes the pending queue.

public class BdbFrontier extends WorkQueueFrontier implements Serializable {
    /** All URIs waiting to be crawled. */
    protected transient BdbMultipleWorkQueues pendingUris;

    // Initializes pendingUris; this method is abstract in the parent class.
    protected void initQueue() throws IOException {
        try {
            this.pendingUris = createMultipleWorkQueues();
        } catch(DatabaseException e) {
            throw (IOException)new IOException(e.getMessage()).initCause(e);
        }
    }

    private BdbMultipleWorkQueues createMultipleWorkQueues()
    throws DatabaseException {
        return new BdbMultipleWorkQueues(this.controller.getBdbEnvironment(),
            this.controller.getBdbEnvironment().getClassCatalog(),
            this.controller.isCheckpointRecover());
    }

    protected BdbMultipleWorkQueues getWorkQueues() {
        return pendingUris;
    }

    // ... (rest of the class omitted)
}
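The split between an abstract initQueue() in the parent and a storage-specific implementation in the subclass can be sketched in isolation. The following is a minimal illustration, not Heritrix code: SketchFrontier, InMemoryFrontier and StorageException are hypothetical names, and an in-memory deque stands in for the Berkeley DB store.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a storage-layer failure such as DatabaseException.
class StorageException extends Exception {
    StorageException(String message) { super(message); }
}

// Parent class: decides when the queue is initialized, not how.
abstract class SketchFrontier {
    void initialize() throws IOException {
        initQueue();
    }

    protected abstract void initQueue() throws IOException;
}

// Subclass: supplies the concrete storage, as BdbFrontier does for pendingUris.
class InMemoryFrontier extends SketchFrontier {
    protected Deque<String> pendingUris;
    private final boolean failOnCreate; // hypothetical switch to simulate a failure

    InMemoryFrontier(boolean failOnCreate) {
        this.failOnCreate = failOnCreate;
    }

    @Override
    protected void initQueue() throws IOException {
        try {
            this.pendingUris = createQueue();
        } catch (StorageException e) {
            // Translate the storage error into IOException, keeping the cause.
            throw new IOException(e.getMessage(), e);
        }
    }

    private Deque<String> createQueue() throws StorageException {
        if (failOnCreate) {
            throw new StorageException("cannot open store");
        }
        return new ArrayDeque<>();
    }
}
```

Callers only ever see IOException, so the parent class stays independent of whichever storage library the subclass picks.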

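(2) next(): supplies a link to a processing thread. All of Heritrix's processing threads (ToeThread) obtain their links by calling this method.

This is the next() method of WorkQueueFrontier.

*Note: WorkQueueFrontier's next() actually calls WorkQueue's peek(). WorkQueue's peek() is in turn implemented by BdbWorkQueue's peekItem(). BdbWorkQueue's peekItem() calls BdbFrontier's getWorkQueues() to obtain the BdbMultipleWorkQueues, which is the pending queue, and then calls BdbMultipleWorkQueues' get(), which uses getNextNearestItem() to fetch a link from the pending queue. The queue the link came from is then added to the in-process queues.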
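The delegation chain in the note can be sketched with simplified stand-ins. This is an illustration only: SharedStore, ChainFrontier, ChainQueue and StoreBackedQueue are hypothetical names, and a map of in-memory deques replaces the Berkeley DB store.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Stand-in for BdbMultipleWorkQueues: one shared store holding every queue's items.
class SharedStore {
    private final Map<String, Deque<String>> itemsByQueue = new HashMap<>();

    void put(String queueKey, String uri) {
        itemsByQueue.computeIfAbsent(queueKey, k -> new ArrayDeque<>()).addLast(uri);
    }

    // Returns the head item of the named queue without removing it, or null.
    String get(String queueKey) {
        Deque<String> q = itemsByQueue.get(queueKey);
        return (q == null) ? null : q.peekFirst();
    }
}

// Stand-in for the frontier, which owns the shared store.
class ChainFrontier {
    private final SharedStore pendingUris = new SharedStore();

    SharedStore getWorkQueues() {
        return pendingUris;
    }

    // next() asks the ready queue, which asks the store via the frontier.
    String next(ChainQueue readyQ) {
        return readyQ.peek(this);
    }
}

// Stand-in for WorkQueue: peek() is fixed, peekItem() is left to subclasses.
abstract class ChainQueue {
    protected final String key;

    ChainQueue(String key) {
        this.key = key;
    }

    String peek(ChainFrontier frontier) {
        return peekItem(frontier);
    }

    protected abstract String peekItem(ChainFrontier frontier);
}

// Stand-in for BdbWorkQueue: holds only a key and reads items from the shared store.
class StoreBackedQueue extends ChainQueue {
    StoreBackedQueue(String key) {
        super(key);
    }

    @Override
    protected String peekItem(ChainFrontier frontier) {
        return frontier.getWorkQueues().get(key);
    }
}
```

In this sketch the queue object itself holds no items. It is a named view onto the shared store, which is why the frontier has to be passed into peek().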

public CrawlURI next() throws InterruptedException, EndedException {
    while (true) { // loop left only by explicit return or exception
        long now = System.currentTimeMillis();

        // Do common checks for pause, terminate, bandwidth-hold
        preNext(now);

        synchronized(readyClassQueues) {
            int activationsNeeded = targetSizeForReadyQueues() - readyClassQueues.size();
            while(activationsNeeded > 0 && !inactiveQueues.isEmpty()) {
                activateInactiveQueue();
                activationsNeeded--;
            }
        }

        WorkQueue readyQ = null;
        Object key = readyClassQueues.poll(DEFAULT_WAIT,TimeUnit.MILLISECONDS);
        if (key != null) {
            readyQ = (WorkQueue)this.allQueues.get(key);
        }
        if (readyQ != null) {
            while(true) { // loop left by explicit return or break on empty
                CrawlURI curi = null;
                synchronized(readyQ) {
                    /* Take one URI; ultimately it comes from pendingUris
                     * (the pending queue) in the subclass BdbFrontier. */
                    curi = readyQ.peek(this);
                    if (curi != null) {
                        // check if curi belongs in different queue
                        String currentQueueKey = getClassKey(curi);
                        if (currentQueueKey.equals(curi.getClassKey())) {
                            // curi was in right queue, emit
                            noteAboutToEmit(curi, readyQ);
                            // add the queue to the in-process queues
                            inProcessQueues.add(readyQ);
                            return curi; // hand the URI to the caller
                        }
                        // URI's assigned queue has changed since it
                        // was queued (eg because its IP has become
                        // known). Requeue to new queue.
                        curi.setClassKey(currentQueueKey);
                        readyQ.dequeue(this); // remove it from this queue
                        decrementQueuedCount(1);
                        curi.setHolderKey(null);
                        // curi will be requeued to true queue after lock
                        // on readyQ is released, to prevent deadlock
                    } else {
                        // readyQ is empty and ready: it's exhausted
                        // release held status, allowing any subsequent
                        // enqueues to again put queue in ready
                        readyQ.clearHeld();
                        break;
                    }
                }
                if(curi!=null) {
                    // complete the requeuing begun earlier
                    sendToQueue(curi);
                }
            }
        } else {
            // ReadyQ key wasn't in all queues: unexpected
            if (key != null) {
                logger.severe("Key "+ key + " in readyClassQueues but not allQueues");
            }
        }

        if(shouldTerminate) {
            // skip subsequent steps if already on last legs
            throw new EndedException("shouldTerminate is true");
        }

        if(inProcessQueues.size()==0) {
            // Nothing was ready or in progress or imminent to wake; ensure
            // any piled-up pending-scheduled URIs are considered
            this.alreadyIncluded.requestFlush();
        }
    }
}
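The first part of next(), topping up the ready set from the inactive queues and then polling with a timeout, can be isolated as follows. This is a simplified sketch: ReadyQueueSketch, topUpReady and pollReady are hypothetical names, the target size is a fixed constructor argument instead of being computed by targetSizeForReadyQueues(), and queue keys are plain strings.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class ReadyQueueSketch {
    private final BlockingQueue<String> readyClassQueues = new LinkedBlockingQueue<>();
    private final Queue<String> inactiveQueues = new ArrayDeque<>();
    private final int targetReady; // hypothetical fixed target for the ready set

    ReadyQueueSketch(int targetReady) {
        this.targetReady = targetReady;
    }

    void addInactive(String queueKey) {
        synchronized (readyClassQueues) {
            inactiveQueues.add(queueKey);
        }
    }

    // Promote inactive keys until the ready set reaches the target size.
    void topUpReady() {
        synchronized (readyClassQueues) {
            int activationsNeeded = targetReady - readyClassQueues.size();
            while (activationsNeeded > 0 && !inactiveQueues.isEmpty()) {
                readyClassQueues.add(inactiveQueues.remove());
                activationsNeeded--;
            }
        }
    }

    // Wait up to waitMillis for a ready key; null means nothing became ready.
    String pollReady(long waitMillis) throws InterruptedException {
        topUpReady();
        return readyClassQueues.poll(waitMillis, TimeUnit.MILLISECONDS);
    }

    int readyCount() {
        return readyClassQueues.size();
    }

    int inactiveCount() {
        synchronized (readyClassQueues) {
            return inactiveQueues.size();
        }
    }
}
```

The timed poll is what lets the surrounding while (true) loop come back around to re-check pause and terminate conditions even when no queue is ready.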

(3) schedule(CandidateURI caURI): puts caURI into the pending queue, which is really the Berkeley Database managed by BdbMultipleWorkQueues and used to store the links waiting to be crawled.

// Add a URL to the pending queue
public void schedule(CandidateURI caUri) {
    // Canonicalization may set forceFetch flag. See
    // #canonicalization(CandidateURI) javadoc for circumstance.
    String canon = canonicalize(caUri);
    if (caUri.forceFetch()) {
        alreadyIncluded.addForce(canon, caUri);
    } else {
        alreadyIncluded.add(canon, caUri);
    }
}
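Note that schedule() does not touch the pending queue directly; it hands the canonical form to alreadyIncluded. A rough sketch of that idea follows, on the assumption that the filter forwards a URI to the queue only when it is new or when a fetch is forced. ScheduleSketch and its canonicalize() rule (dropping the fragment) are hypothetical simplifications, not the Heritrix canonicalization rules.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ScheduleSketch {
    // Canonical forms of every URI seen so far.
    private final Set<String> alreadyIncluded = new HashSet<>();
    // Stand-in for the pending queue.
    final List<String> pending = new ArrayList<>();

    // Hypothetical canonicalization: drop any "#fragment" suffix.
    static String canonicalize(String uri) {
        int hash = uri.indexOf('#');
        return (hash >= 0) ? uri.substring(0, hash) : uri;
    }

    // Enqueue the URI if its canonical form is new, or always when forced.
    void schedule(String uri, boolean forceFetch) {
        String canon = canonicalize(uri);
        boolean isNew = alreadyIncluded.add(canon);
        if (forceFetch || isNew) {
            pending.add(uri);
        }
    }
}
```

For example, scheduling "http://example.com/a#top" and then "http://example.com/a" enqueues only the first, because both share one canonical form; a forced schedule of the same URI is enqueued again.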

(4) BdbUriUniqFilter: this is in effect a filter. It checks whether a link about to enter the pending queue has already been seen, so that links that were already crawled are not queued again.

// Record the URL's key in the already-seen database.
// Returns true if the key was newly added, false if it already existed.
protected boolean setAdd(CharSequence uri) {
    DatabaseEntry key = new DatabaseEntry();
    LongBinding.longToEntry(createKey(uri), key);
    long started = 0;

    OperationStatus status = null;
    try {
        if (logger.isLoggable(Level.INFO)) {
            started = System.currentTimeMillis();
        }
        // Insert into the database unless the key is already present
        status = alreadySeen.putNoOverwrite(null, key, ZERO_LENGTH_ENTRY);
        if (logger.isLoggable(Level.INFO)) {
            aggregatedLookupTime += (System.currentTimeMillis() - started);
        }
    } catch (DatabaseException e) {
        logger.severe(e.getMessage());
    }
    if (status == OperationStatus.SUCCESS) {
        count++;
        if (logger.isLoggable(Level.INFO)) {
            final int logAt = 10000;
            if (count > 0 && ((count % logAt) == 0)) {
                logger.info("Average lookup " +
                    (aggregatedLookupTime / logAt) + "ms.");
                aggregatedLookupTime = 0;
            }
        }
    }
    // If the key already existed, return false
    if(status == OperationStatus.KEYEXIST) {
        return false; // not added
    } else {
        return true;
    }
}
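The essential move in setAdd() is an insert-if-absent on a 64-bit key derived from the URI. An in-memory analogue is sketched below, with ConcurrentHashMap.putIfAbsent playing the role of putNoOverwrite. SeenSetSketch and its fingerprint() (an FNV-1a-style hash over the characters) are assumptions made for illustration; they are not what createKey() does in Heritrix.

```java
import java.util.concurrent.ConcurrentHashMap;

class SeenSetSketch {
    private final ConcurrentHashMap<Long, Boolean> alreadySeen = new ConcurrentHashMap<>();

    // Hypothetical 64-bit fingerprint: FNV-1a-style hash over the chars.
    static long fingerprint(CharSequence uri) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < uri.length(); i++) {
            h ^= uri.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Returns true if the URI's key was newly added, false if already present.
    boolean setAdd(CharSequence uri) {
        // putIfAbsent returns null only when the key was not present before.
        return alreadySeen.putIfAbsent(fingerprint(uri), Boolean.TRUE) == null;
    }

    // Number of distinct keys recorded so far.
    long count() {
        return alreadySeen.size();
    }
}
```

Storing only a fingerprint keeps the set compact, at the cost that two different URIs with the same fingerprint would be treated as one.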

(5) finished(CrawlURI cURI): marks a link that has been processed as finished.

---------------------------------------------------------------------------------------------------------------

Addendum: