A New Generation of Clustering Search Engines

kongshanxuelin


Blog category: Core code snippets

Search engine, Ajax, Algorithm, Google, Baidu

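Today's search engines, including Baidu, Google, Soso and Yahoo, all offer general-purpose search. Imagine how nice it would be if the search results were categorized automatically: a search for "Ajax", for example, would be grouped into categories automatically, as in the figure below: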

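Open-source projects of this kind already exist. Carrot2 is one of them, and it is very easy to use. However, clustering Chinese text differs considerably from clustering English text, so most of the effort goes into the clustering algorithm for Chinese. Carrot2's official site: http://project.carrot2.org/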
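One well-known reason Chinese needs different handling is that written Chinese has no spaces between words, so a clustering algorithm cannot obtain terms by splitting on whitespace. The sketch below is not Carrot2 code and not the approach used in this post; it is a minimal illustration, assuming overlapping two-character terms (bigrams) are an acceptable stand-in for real word segmentation. The class name `BigramTokenizer` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of Carrot2).
class BigramTokenizer {

    // Approximates "is a Chinese character" by the CJK Unified Ideographs block.
    static boolean isCjk(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    // Runs of non-CJK letters/digits become one lower-cased term;
    // runs of CJK characters become overlapping two-character terms.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                int start = i;
                while (i < text.length() && isCjk(text.charAt(i))) {
                    i++;
                }
                String run = text.substring(start, i);
                if (run.length() == 1) {
                    terms.add(run); // a lone character is kept as a term
                } else {
                    for (int j = 0; j + 1 < run.length(); j++) {
                        terms.add(run.substring(j, j + 2));
                    }
                }
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < text.length()
                        && Character.isLetterOrDigit(text.charAt(i))
                        && !isCjk(text.charAt(i))) {
                    i++;
                }
                terms.add(text.substring(start, i).toLowerCase(Locale.ROOT));
            } else {
                i++; // skip whitespace and punctuation
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Ajax data mining")); // [ajax, data, mining]
        System.out.println(tokenize("Ajax搜索引擎"));      // [ajax, 搜索, 索引, 引擎]
    }
}
```

Bigrams over-generate (索引 is emitted even though the intended words are 搜索 and 引擎), which is one reason a dictionary-based or statistical segmenter is usually preferred in practice.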

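Search engines are gradually segmenting the market. There are already a number of vertical search engines, "human flesh search" (which is really about researching how people are related), Google's lifestyle search, and so on. Search products are indeed slowly moving toward more human-centered design.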

The source code of a document clustering example that ships with Carrot2 is as follows:

        try {
            /*
             * Initialize local controller. Normally you'd run this only once
             * for an entire application (controller is thread safe).
             */
            final LocalController controller = initLocalController();

            /*
             * Once we have a controller we can run queries. Change the query
             * to something that is relevant to the data in your index.
             */
            
            // Data for clustering: each entry holds a document title and
            // a URL; the URL is passed as the document body below.
            String [][] documents = new String [] [] {
                { "Data Mining - Wikipedia", "http://en.wikipedia.org/wiki/Data_mining" },
                { "KD Nuggets", "http://www.kdnuggets.com/" },
                { "The Data Mine", "http://www.the-data-mine.com/" },
                { "DMG", "http://www.dmg.org/" },
                { "Data Mining", "http://www.gr-fx.com/graf-fx.htm" },
                { "Data Mining Benchmarking Association (DMBA)", "http://www.dmbenchmarking.com/" },
                { "Data Mining", "http://www.computerworld.com/databasetopics/businessintelligence/datamining" },
                { "National Center for Data Mining (NCDM) - University of Illinois at Chicago", "http://www.ncdm.uic.edu/" },
            };
            
            // Although the query will not be used to fetch any data, if the data
            // that you're submitting for clustering is a response to some
            // search engine-like query, please provide it, as the clustering
            // algorithm may use it to improve the clustering quality.
            final String query = "data mining";
            
            // The documents are provided for clustering in the 
            // PARAM_SOURCE_RAW_DOCUMENTS parameter, which should point to
            // a List of RawDocuments.
            List<RawDocumentSnippet> documentList = new ArrayList<RawDocumentSnippet>(documents.length);
            for (int i = 0; i < documents.length; i++)
            {
                documentList.add(new RawDocumentSnippet(
                    Integer.valueOf(i),  // unique id of the document, can be a plain sequence id
                    documents[i][0], // document title
                    documents[i][1], // document body
                    "dummy://" + i,  // URL (not required for clustering)
                    0.0f)            // document score, can be 0.0 
                );
            }
            
            final HashMap params = new HashMap();
            params.put(
                ArrayInputComponent.PARAM_SOURCE_RAW_DOCUMENTS,
                    documentList);
            final ProcessingResult pResult = controller.query("direct-feed-lingo", query, params);
            final ArrayOutputComponent.Result result = (ArrayOutputComponent.Result) pResult.getQueryResult();

            /*
             * Once we have the buffered snippets and clusters, we can display
             * them somehow. We'll reuse the simple text-dumping method
             * available in {@link Example}.
             */
            Example.displayResults(result);
        } catch (Exception e) {
            // There shouldn't be any, but just in case.
            System.err.println("An exception occurred: " + e.toString());
            e.printStackTrace();
        }
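The example above hands the actual grouping to the Carrot2 process named "direct-feed-lingo". To make the idea of automatically categorized results concrete without Carrot2, here is a deliberately naive sketch that groups result titles by terms they share. It is not the algorithm Carrot2 uses; the class `TermGrouping`, the sample titles, and the "at least two titles share a term" rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy grouping, for illustration only (not Carrot2's algorithm).
class TermGrouping {

    // Groups titles under every non-query term that occurs in at least two titles.
    // Titles that fall into no group are collected under "(other)".
    static Map<String, List<String>> group(List<String> titles, String query) {
        Set<String> queryTerms = new HashSet<>(
                Arrays.asList(query.toLowerCase(Locale.ROOT).split("\\s+")));

        // term -> titles containing it (TreeMap keeps labels in alphabetical order)
        Map<String, List<String>> byTerm = new TreeMap<>();
        for (String title : titles) {
            Set<String> seen = new HashSet<>();
            for (String term : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                // Skip very short terms, the query itself, and repeats within one title.
                if (term.length() < 3 || queryTerms.contains(term) || !seen.add(term)) {
                    continue;
                }
                byTerm.computeIfAbsent(term, k -> new ArrayList<>()).add(title);
            }
        }

        Map<String, List<String>> groups = new LinkedHashMap<>();
        Set<String> grouped = new HashSet<>();
        for (Map.Entry<String, List<String>> e : byTerm.entrySet()) {
            if (e.getValue().size() >= 2) {
                groups.put(e.getKey(), e.getValue());
                grouped.addAll(e.getValue());
            }
        }

        List<String> other = new ArrayList<>();
        for (String title : titles) {
            if (!grouped.contains(title)) {
                other.add(title);
            }
        }
        if (!other.isEmpty()) {
            groups.put("(other)", other);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up titles for the query "ajax".
        List<String> titles = Arrays.asList(
                "Ajax in Action book",
                "Head First Ajax book",
                "Ajax Software company",
                "Ajax Consulting company",
                "Ajax tutorial");
        group(titles, "ajax").forEach((label, docs) -> System.out.println(label + ": " + docs));
    }
}
```

With the made-up titles in `main`, the groups are labeled `book`, `company` and `(other)`. A real clustering algorithm has to do much more than this, such as choosing readable multi-word labels and dealing with synonyms.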


2008-10-13 08:44

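#9 Jatula 2008-10-20

Anyone who has actually worked with this kind of thing knows what automatic classification leads to: the data is accurate, but the program has to do the splitting and a lot of manual intervention is needed, so it ends up semi-automatic and not very practical. It is fine for a small system, but a large one needs careful thought. Besides, for a search engine, volume and speed come first. So the idea is good, but it is not practical.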

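#8 jiyanliang 2008-10-15

Let me share my own view.
This is actually somewhat similar to semantic search, or at least fairly close to it.
The first step in semantic search is modeling. What do we model with? At present, ontologies are the most common choice.
There are many languages for describing ontologies, but what they have in common is support for inference.
The clustering search described here can be seen as the product of combining different ontologies.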

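#7 稻香麦甜 2008-10-15

I feel this kind of classification only suits people who search with a specific goal in mind. Some ordinary users don't even know what their own search keywords are, so I think that keyword SNS is still very effective!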

#6 beyondsky 2008-10-15

Clustering algorithms

#5 yajie 2008-10-15

May I ask how you put the Google ads on your page?

#4 firstlight 2008-10-14

How is this new? Look at the papers on the website; this has been around for a long time.

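#3 kongshanxuelin 2008-10-14

tanguojun wrote:

Even the best technology needs a market; technology without a market is just empty talk!

This technology can improve the user experience, so it can't be said to have no market at all. People would probably like a search for "Ajax" to have its results categorized automatically, for example into Ajax books, Ajax companies and so on, which makes the search results more targeted.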

#2 tanguojun 2008-10-14

Even the best technology needs a market; technology without a market is just empty talk!

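#1 emarket 2008-10-14

Kid, this is fine to play with on your own, but automatic classification like this is too impractical. The key point is that there is no market :)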
