


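Judging from this article's analysis, Googlebot appears to first assess a site's structure and size and only then plan its crawl, which makes its behaviour quite interesting. Yahoo's robot seems to refresh on a monthly cycle, crawling and indexing new pages; it appears to aim at winning on volume and does not seem to analyse pages any further. MSNbot seems, overall, to lag slightly behind its two competitors.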



In the previous edition, Binary Search Tree 2, a large-scale experiment on search engine behaviour was staged with more than two billion different web pages. The experiment lasted exactly one year, until April 13th. In this period the three major search engines requested more than one million pages of the tree, from more than one hundred thousand different URLs. The home page of drunkmenworkhere.org grew from 1.6 kB to over 4 MB due to the visit log and the comment spam displayed there. (Please credit the source when reposting: blog.csdn.net/uoyevoli www.farproc.com)

This edition presents the results of the experiment.






2,147,483,647 web pages ('nodes') were numbered and arranged in a binary search tree. In such a tree, the branch to the left of each node contains only values less than the node's value, while the right branch contains only values greater than the node's value. So the leftmost node in this tree has value 1 and the rightmost node has value 2,147,483,647.


The depth of the tree is the number of nodes you have to traverse from the root to the most remote leaf. Since you can arrange $2^{n+1} - 1$ numbers in a tree of depth n, the resulting tree has a depth of 30 ($2^{31}$ = 2,147,483,648). The value at the root of the tree is 1073741824 ($2^{30}$).
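The numbering described above can be explored with a few bit operations. The sketch below is not part of the original experiment: it assumes the tree is laid out exactly as described (root $2^{30}$, values 1 to $2^{31} - 1$), and the names `level`, `children`, and `parent` are made up for illustration.

```python
ROOT = 2 ** 30           # value at the root of the tree
MAX_NODE = 2 ** 31 - 1   # value of the rightmost node


def level(value):
    """Distance from the root: 0 for the root, 30 for the deepest nodes."""
    lowest_bit = value & -value              # lowest set bit of the node value
    return 30 - (lowest_bit.bit_length() - 1)


def children(value):
    """Return (left, right) child values, or None for a leaf."""
    step = (value & -value) // 2             # half the lowest set bit
    if step == 0:                            # odd values are leaves
        return None
    return value - step, value + step


def parent(value):
    """Return the parent value, or None for the root."""
    if value == ROOT:
        return None
    lowest_bit = value & -value
    return (value - lowest_bit) | (lowest_bit << 1)


print(level(ROOT), children(ROOT))   # 0 (536870912, 1610612736)
print(level(1), parent(1))           # 30 2
```

Under these assumptions node 1 sits at level 30 and its parent, node 2, at level 29.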


For each page, the traffic of the three major search bots (Yahoo! Slurp, Googlebot and msnbot) was monitored over a period of one year (between 2005-4-13 and 2006-4-13).


To make the content of each page more interesting for the search engines, the value of each node is written out in American English (short scale) and each page request from a search bot is displayed in reverse chronological order. To enrich the zero-content even more, a comment box was added to each page (it was removed on 2006-4-13). These measures were improvements over the initial Binary Search Tree, which uses inconveniently long URLs.


Every node shows an image of three trees. Each tree in the image visualises which nodes were crawled by one of the search engines. Each line in the image represents a node; the number of times a search bot visited the node determines the length of the line. The tree images below are modified large versions of the original image, without the very long root node and with disconnected (wild) branches.



Overall results

From the start Yahoo! Slurp was by far the most active search bot. In one year it requested more than one million pages and crawled more than one hundred thousand different nodes. Although this is a large number, it is still only 0.0049% of all nodes. The overall statistics of all bots are shown in the table below.


Overall statistics by search engine

| | Yahoo! | Google | MSN |
|---|---|---|---|
| Total number of pageviews | 1,030,396 | 20,633 | 4,699 |
| Number of nodes crawled | 105,971 | 7,556 | 1,390 |
| Percentage of tree crawled | 0.0049% | 0.00035% | 0.000065% |
| Number of indexed nodes | 120,000 | 554 | 1 |
| Indexed/crawled ratio | 113.23% | 7.33% | 0.07% |
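The two derived rows of the table follow from the raw counts and the size of the tree. The snippet below is a small check rather than anything from the original article: the counts are copied from the table, while the helper name `derived` is an assumption made for illustration. The results agree with the published figures to within rounding.

```python
TOTAL_NODES = 2 ** 31 - 1   # number of pages in the tree

# (pageviews, nodes crawled, nodes indexed) per search engine, from the table
stats = {
    "Yahoo!": (1_030_396, 105_971, 120_000),
    "Google": (20_633, 7_556, 554),
    "MSN": (4_699, 1_390, 1),
}


def derived(nodes_crawled, nodes_indexed):
    """Return (percentage of tree crawled, indexed/crawled ratio in percent)."""
    pct_crawled = 100 * nodes_crawled / TOTAL_NODES
    indexed_ratio = 100 * nodes_indexed / nodes_crawled
    return pct_crawled, indexed_ratio


for bot, (views, crawled, indexed) in stats.items():
    pct, ratio = derived(crawled, indexed)
    print(f"{bot}: {pct:.6f}% of tree crawled, indexed/crawled {ratio:.2f}%")
```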

The growth of the number of pageviews and the number of crawled nodes over the year the experiment lasted is shown in figures 1 and 2. The way the bots crawled the tree is visualised in detail with the animations for each bot in the sections below.




Fig. 1 - The cumulative number of pageviews by the search bots over time.

Fig. 2 - The cumulative number of nodes crawled by the search bots over time.
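The two curves measure different things: every request adds a pageview, but only the first request for a node adds a crawled node. A minimal sketch of that bookkeeping follows; the log format (date, node value) and the function name `cumulative_counts` are hypothetical, since the article does not describe how its visit log was stored.

```python
def cumulative_counts(requests):
    """Turn a date-sorted list of (date, node) requests into running totals.

    Returns a list of (date, cumulative pageviews, cumulative distinct nodes).
    """
    seen = set()        # nodes requested at least once so far
    pageviews = 0
    series = []
    for date, node in requests:
        pageviews += 1
        seen.add(node)
        series.append((date, pageviews, len(seen)))
    return series


# Made-up sample: node 5 is requested twice, node 7 once.
sample_log = [("2005-04-13", 5), ("2005-04-13", 7), ("2005-05-14", 5)]
print(cumulative_counts(sample_log)[-1])   # ('2005-05-14', 3, 2)
```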

The graph below (fig. 3) shows how many nodes at each level of the tree were crawled by the bots (on a logarithmic scale). The root of the tree is at level 0, while the most remote nodes (e.g. node 1) are at level 30. Since there are $2^n$ nodes at level n (there is only 1 root and there are $2^{30}$ nodes at level 30), crawling the entire tree would result in a straight line.


Fig. 3 - The number of nodes crawled after 1 year, grouped by node level.

Google closely follows this straight line until it breaks down after level 12. Most nodes at level 12 or less were crawled (5524 out of 8191), but only very few nodes at deeper levels were crawled by Googlebot. MSN shows similar behaviour but breaks down much earlier, at level 9 (656 out of 1023 nodes were crawled). Yahoo, however, does not break down; at deep levels it gradually fails to request all nodes.
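The totals 8191 and 1023 are the number of nodes from the root down to levels 12 and 9 respectively, i.e. $2^{13} - 1$ and $2^{10} - 1$. A short check of these counts and of the resulting coverage; the function names are made up for illustration.

```python
def nodes_up_to(level):
    """Number of nodes at levels 0..level of a complete binary tree."""
    return 2 ** (level + 1) - 1


def coverage(crawled, level):
    """Fraction of the nodes at levels 0..level that were crawled."""
    return crawled / nodes_up_to(level)


print(nodes_up_to(12), nodes_up_to(9))   # 8191 1023
print(f"Google: {coverage(5524, 12):.1%}, MSN: {coverage(656, 9):.1%}")
```

So both bots crawled roughly two thirds of the nodes above their respective cut-off levels.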


The deep nodes that Yahoo did crawl were requested quite often compared to the other bots: at levels 14 to 30 each page was requested 10 times on average (see fig. 4).



Fig. 4 - The average number of pageviews per node after 1 year, grouped by node level.



Yahoo! Slurp

Fig. 5 - The Yahoo! Slurp tree.


Yahoo! Slurp was the first search engine to discover Binary Search Tree 2. In the first hours after discovery it crawled the tree vigorously, at a speed of over 2.3 nodes per second (see the short animation). On the first day it crawled approximately 30,000 nodes.


In the following month Slurp's activity was low, but after exactly one month it requested all pages it had visited before for a second time. In the animation you can see the size of the tree double on 2005-05-14. This phenomenon was repeated a month later: on 2005-06-13 the tree grew to three times its original size. The number of pageviews was then almost 90,000 while the number of crawled nodes was still 30,000. Figure 6 shows this stepwise increment in the number of pageviews during the first months.


Fig. 6 - The cumulative number of pageviews by Yahoo! Slurp over time.

After four months Slurp requested a large number of 'new' nodes, for the first time since the initial round. It simply requested all URLs it had. Since it had already indexed 30,000 pages, each of which links to two pages at a deeper level, it requested 60,000 pages at the end of August (the number of pageviews jumps from 100,000 to 160,000 in fig. 6) and it doubled the number of nodes it had crawled (see fig. 7).



After 5 months Yahoo! Slurp started requesting nodes more regularly. It still had periods of 'discovery' (e.g. after 10 months).


Fig. 7 - The cumulative number of nodes crawled by Yahoo! Slurp over time.

Yahoo reported 120,000 pages in its index (current value). This may seem impossible since it only visited 105,971 nodes, but every node is available on two different domain names: www.drunkmenworkhere.org and drunkmenworkhere.org.

Note: the query submitted to Google and MSN yielded 35,600 pages on Yahoo. Yahoo is the only search engine that returns results with the query used above.



Googlebot

Fig. 8 - The Googlebot tree.


In comparison with Yahoo's tree, Google's tree looks more like a natural tree. This is because Google visited nodes at deeper levels less frequently than their parent nodes. Yahoo only visited the nodes at the first three levels more frequently, while Google did so for the first 12 levels (see fig. 4).



The form of the tree follows from Google's PageRank algorithm. PageRank is defined as follows:

"We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) "

Since most nodes in the tree are not linked to by other sites, the PageRank of a node can be calculated with this formula (ignoring links in the comments):

PR(node) = 0.15 + 0.85 (PR(parent) + PR(left child) + PR(right child))/3




The only unknown when applying this formula iteratively is the PageRank of the root node of the tree. Since this node was the homepage of drunkmenworkhere.org for a year, a high rank may be assumed. The calculated PageRank tree (fig. 9) shows proportions similar to those of Googlebot's real tree, so the frequency with which a page is visited seems to be related to its PageRank.
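The iterative calculation can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the root is held fixed at 100 as in fig. 9, nodes are stored by position in a heap-style list (index 0 is the root, the children of index i are 2i+1 and 2i+2), children missing at the bottom of the truncated tree contribute nothing, and the function name, starting value of 1.0 and number of sweeps are arbitrary choices.

```python
def pagerank_tree(depth=17, root_rank=100.0, d=0.85, sweeps=100):
    """Apply PR(node) = (1-d) + d*(PR(parent) + PR(left) + PR(right))/3
    repeatedly to every node except the root, whose rank stays fixed."""
    n = 2 ** (depth + 1) - 1          # nodes in a complete tree of this depth
    pr = [1.0] * n
    pr[0] = root_rank
    for _ in range(sweeps):
        new = pr[:]                   # update all nodes from the previous sweep
        for i in range(1, n):
            total = pr[(i - 1) // 2]  # parent
            left, right = 2 * i + 1, 2 * i + 2
            if left < n:
                total += pr[left]
            if right < n:
                total += pr[right]
            new[i] = (1 - d) + d * total / 3
        pr = new
    return pr


ranks = pagerank_tree(depth=10)
# Rank of the first node on each of the top four levels.
print([round(ranks[2 ** k - 1], 2) for k in range(4)])
```

A depth of 10 keeps the example fast; the depth of 17 used for fig. 9 runs the same way but takes noticeably longer in pure Python. The rank falls off quickly with each level below the root, which is the tapering shape visible in the figure.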


Fig. 9 - A binary tree of depth 17 visualising calculated PageRank as the length of each line, when the PageRank of the root node is set to 100.



The animation of the Googlebot tree shows some interesting erratic behaviour that cannot be explained with PageRank.


The rightmost branch
From the start Googlebot crawled more nodes on the right-hand side of the tree. On 2005-07-04 it tried to visit the rightmost node, i.e. the node with the highest value. After selecting the right branch for 20 levels, starting at the root, Googlebot stopped. This produced the arc at the right end of the tree.

 Googlebot's rightmost branch

Searching node 1
On 2005-06-30 Googlebot visited node 1, the leftmost node. It did not crawl the path from the root to this node, so how did it find the page? Did it guess the URL or did it follow some external link?
A few hours later, Googlebot crawled node 2, which is linked as a parent node by node 1. These two nodes are displayed as a tiny dot in the animation on 2005-06-30, floating above the left branch. Then, a week later, on 2005-07-06 (two days after the attempt to find the rightmost node), between 06:39:39 and 06:39:59, Googlebot found the path to these disconnected nodes by visiting the 24 missing nodes in 20 seconds. It started at the root and found its way up to node 2 without selecting a right branch. In the large version of the Googlebot tree, this path is clearly visible. The nodes halfway along the path were not requested a second time and are represented by thin short line segments, hence the steep curve.
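The route from the root to any node is an ordinary binary search on the node value. The sketch below assumes the layout described earlier in the article (root $2^{30}$, values 1 to $2^{31} - 1$); the function name `path_from_root` is made up for illustration.

```python
ROOT = 2 ** 30
MAX_NODE = 2 ** 31 - 1


def path_from_root(target):
    """Values visited by a binary search from the root down to target."""
    if not 1 <= target <= MAX_NODE:
        raise ValueError("target is not a node of the tree")
    node, step = ROOT, ROOT // 2
    path = [node]
    while node != target:
        # Go left for smaller targets, right for larger ones.
        node = node - step if target < node else node + step
        step //= 2
        path.append(node)
    return path


to_node_2 = path_from_root(2)
print(len(to_node_2), to_node_2[:3], to_node_2[-1])
# 30 [1073741824, 536870912, 268435456] 2
```

Every step on the way to node 2 is a left turn, consistent with the observation that Googlebot reached it without selecting a right branch; the path to the rightmost node consists of right turns only.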

Googlebot's path to node 1
Yahoo-like subtree
On 2005-07-23 Google suddenly spent some hours crawling 600 new nodes near node 1073872896. Most of these nodes were never visited again.
This subtree is the reason the number of nodes crawled by Googlebot, grouped by level, increases again from level 18 to level 30 in fig. 3.

Googlebot's subtree

Over the last six months Googlebot requested pages at a fixed rate (about 260 pages per month, fig. 10). Like Yahoo! Slurp, it seems to alternate between periods of discovery (see fig. 11) and periods of refreshing its cache.


Fig. 10 - The cumulative number of pageviews by Googlebot over time.

Fig. 11 - The cumulative number of nodes crawled by Googlebot over time.

Google returned 554 results when searching for nodes. The first nodes reported by Google are nodes 1 and 2, which are very deep inside the tree, at levels 30 and 29 respectively. Their higher rank is also reflected in the curve shown above (Searching node 1), which indicates a high number of pageviews. They probably appear first because of their short URLs. The other nodes on the first result page are all at level 4, probably because the first three levels are penalised for comment spam. The current number of results can be checked here.




