Building Nutch: Open Source Search


A case study in writing an open source search engine


Search engines are as critical to Internet use as any other part of the network infrastructure, but they differ from other components in two important ways. First, their internal workings are secret, unlike, say, the workings of the DNS (domain name system). Second, they hold political and cultural power, as users increasingly rely on them to navigate online content.

When so many rely on services whose internals are closely guarded, the possibilities for honest mistakes, let alone abuse, are worrisome. Further, keeping search-engine algorithms secret means that further advances in the area become less likely. Much relevant research is kept behind corporate walls, and useful methods remain largely unknown.

To address these problems, we started the Nutch software project, an open source search engine that anyone is free to download, modify, and run, either as an internal intranet search engine or as a public Web search service. As you may have just read in Anna Patterson's "Why Writing Your Own Search Engine Is Hard," writing a search engine is not easy, so this article focuses on Nutch's technical challenges; of course, we hope Nutch will offer improvements in both the technical and social spheres. By enabling more people to run search engines, and by making the code open, we hope search algorithms will become as transparent as their importance demands.

TECHNICAL CHALLENGES

Much of the challenge in designing a search engine is making it scale. Writing a Web crawler that can download a handful of pages is straightforward, but writing one that can regularly download the Web's nearly 5 billion pages is much harder.


Further, a search engine must be able to process queries efficiently. Requirements vary widely with site popularity: a search engine may receive anywhere from less than one to hundreds of searches per second.

Finally, unlike many software projects, search engines can have high ongoing costs. They may require lots of hardware that consumes lots of Internet bandwidth and electricity. We discuss deployment costs in more detail in the next section, but for now it's helpful to keep in mind a few ideas:

  • The cost of one part of the search engine scales with the size of the document collection. The collection might be very small when Nutch is searching a single intranet, but could be as large as the Web itself.
  • Another part of the search engine scales with the size of the query load. Each query takes a certain amount of time to process and consumes some bandwidth.

With these two factors in mind, we've designed a system that can easily distribute the work of both fetching and query processing over a set of standard machines.

Figure 1 shows the system's components.

WebDB. WebDB is a persistent custom database that tracks every known page and relevant link. It maintains a small set of facts about each, such as the last-crawled date. WebDB is meant to exist for a long time, across many months of operation.

Since WebDB knows when each URL was last fetched, it can easily generate a set of fetchlists. These lists contain every URL we're interested in downloading. WebDB splits the overall workload into several lists, one for each fetcher process. URLs are distributed almost randomly; the one constraint is that all the URLs for a single domain land in the same fetchlist, so a single fetcher process handles that domain and can obey politeness constraints.
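
To make the idea concrete, here is a minimal sketch of that host-based split in Java. The class and method names are ours, not Nutch's actual API; the point is simply that hashing on the host name keeps a domain's URLs together:

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: split the URLs that are due for fetching into N
    // fetchlists, hashing on the host name so that every URL from one
    // domain lands in the same list and is handled by a single fetcher.
    public class FetchlistPartitioner {

        public static List<List<String>> partition(List<String> dueUrls, int numFetchers) {
            List<List<String>> fetchlists = new ArrayList<>();
            for (int i = 0; i < numFetchers; i++) {
                fetchlists.add(new ArrayList<>());
            }
            for (String url : dueUrls) {
                String host = URI.create(url).getHost();
                if (host == null) {
                    host = url;                  // malformed URL: fall back to the raw string
                }
                int bucket = Math.floorMod(host.hashCode(), numFetchers);
                fetchlists.get(bucket).add(url); // same host always hashes to the same bucket
            }
            return fetchlists;
        }
    }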

The fetchers consume the fetchlists and start downloading from the Internet. The fetchers are "polite," meaning they don't overload a single site with requests, and they observe the Robots Exclusion Protocol. (This allows Web-site owners to mark parts of the site as off-limits to automated clients such as our fetcher.) Otherwise, the fetcher blindly marches down the fetchlist, writing down the resulting downloaded text.
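
The politeness rule itself is easy to state: never contact a host again until a minimum interval has elapsed since the previous request to it. The rough sketch below illustrates that rule; the interval value and the single-threaded structure are assumptions for the example, not Nutch's actual fetcher code:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the politeness rule: wait until a minimum interval has
    // passed since the last request to the same host. Assumes one fetcher
    // thread per fetchlist; a real fetcher would also honor robots.txt.
    public class PolitenessThrottle {

        private final long minDelayMillis;
        private final Map<String, Long> lastRequest = new HashMap<>();

        public PolitenessThrottle(long minDelayMillis) {
            this.minDelayMillis = minDelayMillis;    // e.g. new PolitenessThrottle(5000)
        }

        public void waitForTurn(String host) throws InterruptedException {
            long earliest = lastRequest.getOrDefault(host, 0L) + minDelayMillis;
            long now = System.currentTimeMillis();
            if (earliest > now) {
                Thread.sleep(earliest - now);        // back off until this host is due again
            }
            lastRequest.put(host, System.currentTimeMillis());
        }
    }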

Fetchers output WebDB updates and Web content. The updates tell WebDB about pages that have appeared or disappeared since the last fetch attempt. The Web content is used to generate the searchable index that users will actually query.

Note that the WebDB-fetch cycle is designed to repeat forever, maintaining an up-to-date image of the Web graph.

Indexing and Querying. Once we have the Web content, Nutch can get ready to process queries. The indexer uses the content to generate an inverted index of all terms and all pages. We divide the document set into a set of index segments, each of which is fed to a single searcher process.
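
An inverted index, in its simplest form, maps each term to the list of documents that contain it. The toy version below sketches only that core structure; real indexes, such as the Lucene indexes Nutch builds, also record term positions and frequencies so that phrase queries and ranking are possible:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal inverted index: each term maps to the list of document ids
    // in which it appears.
    public class InvertedIndex {

        private final Map<String, List<Integer>> postings = new HashMap<>();

        public void addDocument(int docId, String text) {
            for (String term : text.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
                if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                    docs.add(docId);          // record each document only once per term
                }
            }
        }

        public List<Integer> lookup(String term) {
            return postings.getOrDefault(term.toLowerCase(), List.of());
        }
    }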

We can thus distribute the current set of index segments over an arbitrary number of searcher processes, allowing us to scale easily with the query load. Further, we can copy an index segment to multiple machines and run a searcher over each one; this provides additional scaling capacity as well as reliability in case one or more of the searcher machines fail.

Each searcher also draws upon the Web content from earlier, so it can provide a cached copy of any Web page.

Finally, a pool of Web servers handles interactions with users and contacts the searchers for results. Each Web server interacts with many different searchers to learn about the entire document set. In this way, each Web server acts simultaneously as an HTTP server and as a Nutch-search client.

Web servers contain very little state and can easily be replicated to handle increased load. They need to be told only about the existing pool of searcher machines. The only state they do maintain is a list of which searcher processes are available at any given time; if a given segment's searcher fails, the Web server queries a different one instead.
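
A sketch of that front-end logic follows. The Searcher interface and Hit type here are hypothetical stand-ins for Nutch's actual search-client classes; the point is only the per-segment fan-out, the fallback to a replica, and the final merge by score:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hypothetical client-side handle on a remote searcher process.
    interface Searcher {
        List<Hit> search(String query, int numHits) throws Exception;
    }

    record Hit(String url, float score) {}

    // Sketch of the front end's query fan-out: ask one live searcher per
    // index segment, fall back to a replica on failure, then merge by score.
    public class SearchFrontend {

        // segment name -> replicas holding that segment (assumed layout)
        private final Map<String, List<Searcher>> segments;

        public SearchFrontend(Map<String, List<Searcher>> segments) {
            this.segments = segments;
        }

        public List<Hit> search(String query, int numHits) {
            List<Hit> merged = new ArrayList<>();
            for (List<Searcher> replicas : segments.values()) {
                for (Searcher s : replicas) {
                    try {
                        merged.addAll(s.search(query, numHits));
                        break;                    // this segment answered; move on
                    } catch (Exception e) {
                        // searcher down: try the next replica of the same segment
                    }
                }
            }
            merged.sort((a, b) -> Float.compare(b.score(), a.score()));
            return merged.subList(0, Math.min(numHits, merged.size()));
        }
    }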

Quality. Returning high-quality results is, of course, the most important challenge Nutch must meet. If it cannot find relevant pages as well as commercial engines do, Nutch isn't much use. But how can it ever compete with large, paid engineering staffs?

  • First, we believe high-quality search is a slowing target. By some measures of quality, the gap between the best search engine and its competitors has narrowed considerably. After several years of intense focus on search results, anecdotal evidence suggests gains in quality are harder to find. The everyday search user will find lots of new features on the various engines, but real differences in results quality are close to imperceptible.
  • Second, although much search work takes place behind corporate walls, there is still a fair amount of public academic work. Many of the techniques that search engines use were discovered by IR (information retrieval) researchers in the 1970s. Some people have tried to tie IR in with advances in language understanding. With the advent of the Web, many different groups experimented with link-driven methods. We think there should be more public research, but there is already a good amount to draw upon.
  • Third, we expect that Nutch will be able to incorporate academic advances faster than any other engine can. We think researchers and engineers will find Nutch very appealing. If it becomes the easiest platform for researchers to experiment on, taking advantage of the results should be extremely simple.
  • Finally, we'll rely on the traditional advantages of open source projects. More people from more places should work on Nutch, which means faster bug finding, more ideas, and better implementations. In the long term, a worldwide shared effort supported by research at a number of institutions should eventually be able to surpass the private efforts of any company.

Once an open source search solution is as good as or better than proprietary implementations, there should be little reason for companies to use anything but the open source version. It will be cheaper to maintain and will work just as well.

Spam. A high search ranking can be extremely valuable to a Web-site owner—so valuable that many sites try to "spam" search engines with specially formulated content in an effort to raise their rankings. As with e-mail spam, the spammer can benefit at a heavy cost to everyday users.

How does this work in practice? Search engines tend to use a well-known set of guidelines to measure a page's relevance to a given query. For example, all other things being equal, a page that contains the word parrot 10 times is more about parrots than a page that has the word just once. A page with lots of incoming links from other sites is more important than a page with fewer incoming links.
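
As a toy illustration, these two signals can be combined into a single number; the weights are arbitrary and chosen purely for the example, not taken from any real engine:

    // Toy ranking score built only from the two signals just described: how
    // many times the page contains the query term, and how many pages link
    // to it. Note that a page can inflate either number, which is exactly
    // the opening spammers exploit.
    public class NaiveScorer {

        public static double score(int termFrequency, int inLinkCount) {
            return 1.0 * termFrequency + 2.0 * inLinkCount;
        }

        public static void main(String[] args) {
            System.out.println(score(10, 50));   // mentions "parrot" 10 times, 50 in-links
            System.out.println(score(1, 5));     // one mention, 5 in-links: ranked lower
        }
    }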


That means it can be fairly easy to trick a naive search engine. Want to make sure every parrot lover finds your page? Repeat the word parrot 600 times somewhere on your page. Want to raise your page's in-link count? Pay a type of site known as a "link farm" to add thousands of links aimed at your page.

Of course, the consequence is that search results can become choked with sites that are not truly relevant, but have "gamed" the system successfully. Good search engines don't want their results to become useless, so they do everything possible to detect these spam tricks. Spammers, in turn, modify their tricks to avoid detection. The result is an arms race between search engine and spammer.

Here are some well-known spam techniques, along with methods to defeat them:

  • Web sites write documents that contain long repetitions of certain words. Search engines counter by eliminating terms that appear consecutively more than a certain number of times (a sketch of this check appears after this list).
  • Web sites do the same trick, but intersperse the repeated term along with good-looking intervening text. Search engines counter by checking whether the statistical distribution of the words in the document matches the typical English-language profile. If it's too far afield, the site is marked as a spammer.
  • Web sites that want high rankings regardless of query put spurious "invisible" text on the page. Say the site offers a page about electronics, all rendered on a white background. The very same page might contain a long essay about, say, Britney Spears, all rendered in white text. Users won't see it, but the search engine will. Search engines counter by computing the visible portion of the HTML and tossing the rest, or even by penalizing pages that use any invisible text.
  • Web sites read the "User-Agent" request header to identify the type of client asking for a page. If the client is a piece of desktop browser software, the Web site returns regular content. If the client is a search engine's crawler, the Web site returns different content that contains thousands of repetitions of parrot. Search engines fight against this by penalizing sites that return substantially different content to different browser types.
  • Web sites use link farms to add to incoming link count. Search engines find link farms by looking for statistically unusual link structures. The link farms are thrown away before computing link counts. Pages that participate in the farm may also be penalized.
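
As an example of the first counter-technique in the list, a detector for long consecutive repetitions can be sketched in a few lines; the threshold here is an arbitrary choice for illustration:

    // Flag documents where the same term repeats consecutively more than an
    // allowed number of times.
    public class RepetitionFilter {

        private static final int MAX_CONSECUTIVE = 3;   // illustrative threshold

        public static boolean looksLikeSpam(String text) {
            String[] terms = text.toLowerCase().split("\\W+");
            int run = 1;
            for (int i = 1; i < terms.length; i++) {
                run = terms[i].equals(terms[i - 1]) ? run + 1 : 1;
                if (run > MAX_CONSECUTIVE) {
                    return true;    // e.g. "parrot parrot parrot parrot ..."
                }
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(looksLikeSpam("parrot parrot parrot parrot parrot"));  // true
            System.out.println(looksLikeSpam("my parrot likes other parrots"));       // false
        }
    }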

Some of these methods may rely on secrecy for their effectiveness, so some people ask how an open source engine could possibly handle spam. With full disclosure of code, won't a search engine lose the fight?

It's true that Nutch code won't hold any secrets. But these secrets are brittle anyway—spammers don't take long to defeat the latest defense. If search has to rely on secrecy to beat spam, the spammers will probably win.

In the world of e-mail spam, at least, the days of simple methods to defeat spammers seem to be over. Many of the latest techniques to defeat e-mail spam are statistics driven. With such methods, even intimate knowledge of the source code may not help spammers much. Although people may be reluctant to use such probabilistic spam detectors on e-mail for fear of deleting a single good message, the massive redundancy of Web information means false positives are not so great a tragedy.
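
To see why, consider a toy statistics-driven scorer in the spirit of those e-mail filters. The per-word probabilities would be learned from labeled examples, so publishing the code reveals the method but not the data it has learned; the probability table here is a hypothetical stand-in:

    import java.util.Map;

    // Toy probabilistic spam score: combine learned per-word spam
    // probabilities into a single estimate for the whole text.
    public class StatisticalSpamScorer {

        private final Map<String, Double> spamProbability;   // word -> P(spam | word), learned elsewhere

        public StatisticalSpamScorer(Map<String, Double> spamProbability) {
            this.spamProbability = spamProbability;
        }

        public double score(String text) {
            double logOdds = 0.0;
            for (String word : text.toLowerCase().split("\\W+")) {
                double p = spamProbability.getOrDefault(word, 0.5);  // unknown words are neutral
                logOdds += Math.log(p / (1.0 - p));
            }
            return 1.0 / (1.0 + Math.exp(-logOdds));   // combined probability the text is spam
        }
    }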

Alternatively, the answer may lie in an analogy to cryptography. It has taken a long time for people to learn the counterintuitive notion that the most secure cryptographic systems are those that have the most public scrutiny. Most people who look at these systems are well motivated and work to improve them rather than to defeat them. They find problems before they can be exploited.

The analogy may be flawed, but it can't be tested without transparency. Nutch is currently the best shot at enabling some form of public review for defeating search engine spam.

 
