Search Engine Robot Research Report

banditjava


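Judging from this article's analysis of Googlebot, Googlebot seems to analyse a site's structure and size first and only then plan its crawl, which makes its behaviour genuinely interesting. Yahoo's robot seems to refresh on a monthly cycle, fetching new pages and indexing them; it appears to be going for volume and does not seem to analyse the pages any further. MSNbot seems, overall, to lag slightly behind its two competitors.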

Introduction

In the previous edition - Binary Search Tree 2 - a large scale experiment on search engine behaviour was staged with more than two billion different web pages. This experiment lasted exactly one year, until April 13th. In this period the three major search engines requested more than one million pages of the tree, from more than a hundred thousand different URLs. The home page of drunkmenworkhere.org grew from 1.6 kB to over 4 MB due to the visit log and the comment spam displayed there.

This edition presents the results of the experiment.

Setup

2,147,483,647 web pages ('nodes') were numbered and arranged in a binary search tree. In such a tree, the branch to the left of each node contains only values less than the node's value, while the right branch contains only values higher than the node's value. So the leftmost node in this tree has value 1 and the rightmost node has value 2,147,483,647.

The depth of the tree is the number of nodes you have to traverse from the root to the most remote leaf. Since you can arrange 2ⁿ⁺¹ - 1 numbers in a tree of depth n, the resulting tree has a depth of 30 (2³¹ = 2,147,483,648). The value at the root of the tree is 1,073,741,824 (2³⁰).
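The numbering described above can be sketched in a few lines of Python. This is only an illustration, assuming the tree is the perfectly balanced binary search tree over the values 1 to 2³¹ - 1; the helper names (`level`, `parent`, `children`) are invented here and are not taken from the experiment's own code.

```python
ROOT = 1 << 30               # 1,073,741,824, the value at the root
MAX_LEVEL = 30               # level of the most remote leaves
NODE_COUNT = (1 << 31) - 1   # 2,147,483,647 nodes in total


def level(value):
    """Level of a node: the root is at level 0, the leaves at level 30."""
    lowest_bit = value & -value
    return MAX_LEVEL - (lowest_bit.bit_length() - 1)


def parent(value):
    """Value of the parent node, or None for the root."""
    if value == ROOT:
        return None
    lowest_bit = value & -value
    # Clear the lowest set bit, then set the next higher bit.
    return (value - lowest_bit) | (lowest_bit << 1)


def children(value):
    """(left, right) child values, or None for a leaf."""
    half = (value & -value) >> 1
    if half == 0:
        return None
    return value - half, value + half


print(level(ROOT), level(1), level(NODE_COUNT))
print(parent(1), children(ROOT))
```

Under this assumption, a node's level follows directly from the position of the lowest set bit of its value, so no tree has to be stored in memory.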

For each page the traffic of the three major search bots (Yahoo! Slurp, Googlebot and msnbot) was monitored over a period of one year (between 2005-4-13 and 2006-4-13).

To make the content of each page more interesting for the search engines, the value of each node is written out in American English (short scale) and each page request from a search bot is displayed in reverse chronological order. To enrich the zero-content even more, a comment box was added to each page (it was removed on 2006-4-13). These measures were improvements over the initial Binary Search Tree, which uses inconvenient long URLs.

Every node shows an image of three trees. Each tree in the image visualises which nodes were crawled by each search engine. Each line in the image represents a node; the number of times a search bot visited the node determines the length of the line. The tree images below are modified large versions of the original image, without the very long root node and with disconnected (wild) branches.

Overall results

From the start Yahoo! Slurp was by far the most active search bot. In one year it requested more than one million pages and crawled more than a hundred thousand different nodes. Although this is a large number, it still is only 0.0049% of all nodes. The overall statistics of all bots are shown in the table below.

Overall statistics by search engine

	Yahoo!	Google	MSN
total number of pageviews	1,030,396	20,633	4,699
number of nodes crawled	105,971	7,556	1,390
percentage of tree crawled	0.0049%	0.00035%	0.000065%
number of indexed nodes	120,000	554	1
indexed/crawled ratio	113.23%	7.33%	0.07%
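The derived rows of the table can be recomputed from the raw counts. The sketch below is only a cross-check; the dictionary layout and the `derived` helper are illustrative choices, and the extra `views_per_node` figure is not a row of the original table.

```python
TOTAL_NODES = 2**31 - 1  # 2,147,483,647 nodes in the tree

# Raw counts copied from the table above.
stats = {
    "Yahoo!": {"pageviews": 1_030_396, "crawled": 105_971, "indexed": 120_000},
    "Google": {"pageviews": 20_633, "crawled": 7_556, "indexed": 554},
    "MSN": {"pageviews": 4_699, "crawled": 1_390, "indexed": 1},
}


def derived(row):
    """Recompute the percentage rows from the raw counts of one bot."""
    return {
        "percent_of_tree": 100 * row["crawled"] / TOTAL_NODES,
        "indexed_per_crawled": 100 * row["indexed"] / row["crawled"],
        "views_per_node": row["pageviews"] / row["crawled"],
    }


for name, row in stats.items():
    d = derived(row)
    print(f"{name:7s} {d['percent_of_tree']:.6f}% of tree, "
          f"{d['indexed_per_crawled']:.2f}% indexed/crawled, "
          f"{d['views_per_node']:.1f} pageviews per crawled node")
```

Small differences in the last digit against the table can come from rounding versus truncation.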

The growth of the number of pageviews and of the number of crawled nodes over the year the experiment lasted is shown in figures 1 and 2. The way the bots crawled the tree is visualised in detail with the animations for each bot in the sections below.

Fig. 1 - The cumulative number of pageviews by the search bots in time.

Fig. 2 - The cumulative number of nodes crawled by the search bots in time.

The graph below (fig. 3) shows how many nodes of each level of the tree were crawled by the bots (on a logarithmic scale). The root of the tree is at level 0, while the most remote nodes (e.g. node 1) are at level 30. Since there are 2ⁿ nodes at level n (there is only 1 root and there are 2³⁰ nodes at level 30), crawling the entire tree would result in a straight line.

Fig. 3 - The number of nodes crawled after 1 year, grouped by node level.

Google closely follows this straight line until it breaks down after level 12. Most nodes at level 12 or less were crawled (5524 out of 8191), but only very few nodes at higher levels were crawled by Googlebot. MSN shows similar behaviour, but breaks down much earlier, at level 9 (656 out of 1023 nodes were crawled). Yahoo, however, does not break down. At high levels it gradually fails to request all nodes.
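The totals quoted for levels 0 to 12 and 0 to 9 follow from the 2ⁿ nodes per level. A minimal sketch of that arithmetic, with function names chosen here for illustration:

```python
import math


def nodes_at_level(n):
    """A full binary tree has 2**n nodes at level n (root = level 0)."""
    return 2**n


def nodes_through_level(n):
    """Total number of nodes at levels 0..n, i.e. 2**(n + 1) - 1."""
    return 2**(n + 1) - 1


def coverage(crawled, n):
    """Fraction of the nodes at levels 0..n that were crawled."""
    return crawled / nodes_through_level(n)


# On a logarithmic scale a fully crawled tree gives a straight line:
# log2 of the node count equals the level number.
print([math.log2(nodes_at_level(n)) for n in range(5)])

print(f"Google, levels 0-12: {coverage(5524, 12):.1%}")
print(f"MSN, levels 0-9: {coverage(656, 9):.1%}")
```

By this measure both bots covered roughly two thirds of the nodes above their respective break-down level.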

The nodes at high levels that were crawled by Yahoo were requested quite often compared to the other bots: at levels 14 to 30 each page was requested 10 times on average (see fig. 4).

Fig. 4 - The average number of pageviews per node after 1 year, grouped by node level.

Yahoo! Slurp

large version (4273x3090, 1.5MB)
animated version over 1 year (2005-04-13 - 2006-04-13, 13MB)
animated version of the first 2 hours (2006-04-14 00:40:00-02:40:00, 2.2MB)

Fig. 5 - The Yahoo! Slurp tree.

Yahoo! Slurp was the first search engine to discover Binary Search Tree 2. In the first hours after discovery it crawled the tree vigorously, at a speed of over 2.3 nodes per second (see the short animation). The first day it crawled approximately 30,000 nodes.

In the following month Slurp's activity was low, but after exactly one month it requested all pages it had visited before, for the second time. In the animation you can see the size of the tree double on 2005-05-14. This phenomenon is repeated a month later: on 2005-06-13 the tree grows to three times its original size. The number of pageviews is then almost 90,000 while the number of crawled nodes still is 30,000. Figure 6 shows this stepwise increment in the number of pageviews during the first months.

Fig. 6 - The cumulative number of pageviews by Yahoo! Slurp in time.

After four months Slurp requested a large number of 'new' nodes, for the first time since the initial round. It simply requested all URLs it had. Since it had already indexed 30,000 pages, each of which links to two pages at a deeper level, it requested 60,000 pages at the end of August (the number of pageviews jumps from 100,000 to 160,000 in fig. 6) and it doubled the number of nodes it had crawled (see fig. 7).

After 5 months Yahoo! Slurp started requesting nodes more regularly. It still had periods of 'discovery' (e.g. after 10 months).

Fig. 7 - The cumulative number of nodes crawled by Yahoo! Slurp in time.

Yahoo reported 120,000 pages in its index (current value). This may seem impossible since it only visited 105,971 nodes, but every node is available on two different domain names: www.drunkmenworkhere.org and drunkmenworkhere.org.

Note: the query submitted to Google and MSN yielded 35,600 pages on Yahoo. Yahoo is the only search engine that returns results with the query used above.

Googlebot

large version (4067x4815, 180kB)
animated version (2005-04-13 - 2006-04-13, 1.2MB)

Fig. 8 - The Googlebot tree.

In comparison with Yahoo's tree, Google's tree looks more like a natural tree. This is because Google visited nodes at deeper levels less frequently than their parent nodes. Yahoo only visited the nodes at the first three levels more frequently, while Google did so for the first 12 levels (see fig. 4).

The form of the tree follows from Google's PageRank algorithm. PageRank is defined as follows:

"We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) "

Since most nodes in the tree are not linked to by other sites, the PageRank of a node can be calculated with this formula (ignoring links in the comments):

PR(node) = 0.15 + 0.85 (PR(parent) + PR(left child) + PR(right child))/3

The only unknown when applying this formula iteratively is the PageRank of the root node of the tree. Since this node was the homepage of drunkmenworkhere.org for a year, a high rank may be assumed. The calculated PageRank tree (fig. 9) shows proportions similar to those of Googlebot's real tree, so the frequency of visiting a page seems to be related to the PageRank of a page.
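The iterative calculation can be sketched as follows. This is not the code behind fig. 9, only a minimal illustration under stated assumptions: all nodes at the same level are taken to share one PageRank (by symmetry), the root is pinned at 100, the missing children of the leaves contribute 0, and the simplified three-neighbour formula above is applied at every level. The function name and the number of sweeps are arbitrary choices.

```python
DAMPING = 0.85


def pagerank_by_level(depth=17, root_rank=100.0, sweeps=200):
    """Approximate PageRank per level of a binary tree with a fixed root rank.

    ranks[l] stands for the rank of every node at level l (symmetry
    assumption); the values are refined in place, sweep after sweep.
    """
    ranks = [root_rank] + [1.0] * depth
    for _ in range(sweeps):
        for lvl in range(1, depth + 1):
            parent_rank = ranks[lvl - 1]
            # Assumption: leaves have no children, so that term is 0.
            child_rank = ranks[lvl + 1] if lvl < depth else 0.0
            ranks[lvl] = (1 - DAMPING) + DAMPING * (parent_rank + 2 * child_rank) / 3
    return ranks


for lvl, rank in enumerate(pagerank_by_level()):
    print(f"level {lvl:2d}: {rank:.3f}")
```

With the root pinned at a high value, the computed rank drops quickly over the first few levels, which is in line with the shape described for fig. 9.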

Fig. 9 - A binary tree of depth 17 visualising calculated PageRank as length of each line, when the PageRank of the root node is set to 100.

The animation of the Googlebot tree shows some interesting erratic behaviour that cannot be explained with PageRank.

The rightmost branch

From the start Googlebot crawled more nodes on the right hand side of the tree. On 2005-07-04 it tried to visit the rightmost node, i.e. the node with the highest value. After selecting the right branch for 20 levels, starting at the root, Googlebot stopped. This produced the arc at the right end of the tree.

Searching node 1

On 2005-06-30 Googlebot visited node 1, the leftmost node. It did not crawl the path from the root to this node, so how did it find the page? Did it guess the URL or did it follow some external link?
A few hours later, Googlebot crawled node 2, which is linked as a parent node by node 1. These two nodes are displayed as a tiny dot in the animation on 2005-06-30, floating above the left branch. Then, a week later, on 2005-07-06 (two days after the attempt to find the rightmost node), between 06:39:39 and 06:39:59 Googlebot found the path to these disconnected nodes by visiting the 24 missing nodes in 20 seconds. It started at the root and found its way up to node 2, without selecting a right branch. In the large version of the Googlebot tree, this path is clearly visible. The nodes halfway along the path were not requested a second time and are represented by thin short line segments, hence the steep curve.
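The path in question is an ordinary binary search from the root. The sketch below assumes the perfectly balanced numbering over 1 to 2³¹ - 1 (root 2³⁰); `path_from_root` is a name invented here for illustration.

```python
ROOT = 1 << 30


def path_from_root(target):
    """Nodes visited and branches taken ('L'/'R') when searching for target."""
    if not 1 <= target < (1 << 31):
        raise ValueError("target must be between 1 and 2**31 - 1")
    path, moves = [], []
    node, step = ROOT, ROOT >> 1
    while node != target:
        path.append(node)
        if target < node:
            node -= step   # left child
            moves.append("L")
        else:
            node += step   # right child
            moves.append("R")
        step >>= 1
    path.append(node)
    return path, moves


path, moves = path_from_root(2)
print(len(path), set(moves))
```

Under this numbering the path from the root to node 2 contains 30 nodes and consists of left branches only. The 24 missing nodes mentioned above would then be the part of this path that had not been crawled before; that reading is an inference, not a figure from the visit log.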

Yahoo-like subtree

On 2005-07-23 Google suddenly spent some hours crawling 600 new nodes near node 1073872896. Most of these nodes were never visited again.
This subtree is the reason the number of nodes crawled by Googlebot, grouped by level, increases again from level 18 to level 30 in fig. 3.

Over the last six months Googlebot requested pages at a fixed rate (about 260 pages per month, fig. 10). Like Yahoo! Slurp it seems to alternate between periods of discovery (see fig. 11) and periods of refreshing its cache.

Fig. 10 - The cumulative number of pageviews by Googlebot in time.

Fig. 11 - The cumulative number of nodes crawled by Googlebot in time.

Google returned 554 results when searching for nodes. The first nodes reported by Google are nodes 1 and 2, which are very deep inside the tree, at levels 30 and 29 respectively. Their higher rank is also reflected in the curve shown above (Searching node 1), which indicates a high number of pageviews. They probably appear first because of their short URLs. The other nodes on the first result page are all at level 4, probably because the first three levels are penalised because of comment spam. The current number of results can be checked here.

2008-10-13 15:35

