A Simple Implementation of an Inverted Index

eriol



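Let's start with an example.

Suppose there are three documents, file1, file2, and file3, with the following contents:

file1 (word1, word2, word3, word4, ...)

file2 (worda, wordb, wordc, wordd, ...)

file3 (word1, worda, word3, wordd, ...)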

The inverted index built from them looks like this:

word1 (file1, file3)

word2 (file1)

word3 (file1, file3)

worda (file2, file3)

....
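
The mapping above can be sketched in a few lines of Java. This is only a minimal illustration, separate from the program presented later: the class name FileIndexSketch, the method build, and the in-memory documents are all assumptions made up for this example, and the words are placeholders like those in the listing.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: builds word -> set of file names from in-memory documents.
class FileIndexSketch {

    // docs maps a file name to the list of words it contains.
    static Map<String, Set<String>> build(Map<String, List<String>> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            for (String word : doc.getValue()) {
                // Create the set of files on first sight of the word, then record this file.
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("file1", Arrays.asList("word1", "word2", "word3", "word4"));
        docs.put("file2", Arrays.asList("worda", "wordb", "wordc", "wordd"));
        docs.put("file3", Arrays.asList("word1", "worda", "word3", "wordd"));

        // Print each word followed by the files that contain it.
        build(docs).forEach((word, files) -> System.out.println(word + " " + files));
    }
}
```

Sorted collections (TreeMap, TreeSet) are used here only so that the printed order is predictable; a HashMap and HashSet would work just as well for lookups.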

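The idea behind an inverted index is simple: the words in the files serve as keys, and each word is mapped to the files that contain it. You can also store additional information, such as how many times the word occurs in each file. The inverted index is one of the most fundamental concepts in search engines, and nearly every search engine uses one.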
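
As a rough sketch of the frequency idea, each word can map to a second map from file name to occurrence count instead of a plain set of files. The names FrequencyIndexSketch, add, and frequency below are hypothetical, chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: word -> (file name -> number of occurrences in that file).
class FrequencyIndexSketch {

    private final Map<String, Map<String, Integer>> index = new TreeMap<>();

    // Record every word of one document, incrementing its per-file count.
    void add(String fileName, List<String> words) {
        for (String word : words) {
            index.computeIfAbsent(word, k -> new TreeMap<>())
                 .merge(fileName, 1, Integer::sum);
        }
    }

    // Occurrences of word in fileName, or 0 if the word never appears there.
    int frequency(String word, String fileName) {
        Map<String, Integer> counts = index.get(word);
        if (counts == null) {
            return 0;
        }
        return counts.getOrDefault(fileName, 0);
    }

    public static void main(String[] args) {
        FrequencyIndexSketch idx = new FrequencyIndexSketch();
        idx.add("file1", Arrays.asList("Shanghai", "love", "Shanghai"));
        idx.add("file2", Arrays.asList("love"));

        System.out.println("Shanghai in file1: " + idx.frequency("Shanghai", "file1"));
        System.out.println("Shanghai in file2: " + idx.frequency("Shanghai", "file2"));
    }
}
```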

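Below is my simple implementation of an inverted index. It indexes a text file line by line; then, for each word in a piece of input text, it reports the line numbers on which the word appears and how many times it occurs on each of those lines.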

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;


public class InvertedIndex {
	
	// word -> (line number -> number of occurrences on that line)
	private final Map<String, Map<Integer, Integer>> index = new HashMap<>();
	
	public void createIndex(String filePath) {
		index.clear();

		// try-with-resources closes the reader even if reading fails part-way
		try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
			String text;
			int line = 1;
			while ((text = reader.readLine()) != null) {
				// split on runs of whitespace so repeated spaces do not yield empty words
				for (String word : text.trim().split("\\s+")) {
					if (word.isEmpty()) {
						continue; // blank line
					}
					// TreeMap keeps the line numbers in ascending order for printing
					index.computeIfAbsent(word, k -> new TreeMap<>())
							.merge(line, 1, Integer::sum);
				}
				line++;
			}
		} catch (IOException e) {
			System.err.println("error reading file " + filePath + ": " + e.getMessage());
		}
	}
	
	public void find(String str) {
		for (String word : str.trim().split("\\s+")) {
			StringBuilder sb = new StringBuilder();
			Map<Integer, Integer> lines = index.get(word);
			if (lines != null) {
				sb.append("word: ").append(word).append(" in ");
				// each entry is printed as: line <number> [<count>]
				for (Map.Entry<Integer, Integer> e : lines.entrySet()) {
					sb.append("line ").append(e.getKey())
							.append(" [").append(e.getValue()).append("] , ");
				}
			} else {
				sb.append("word: ").append(word).append(" not found");
			}
			System.out.println(sb);
		}
	}
	
	public static void main(String[] args) {
		InvertedIndex index = new InvertedIndex();
		index.createIndex("news.txt");
		index.find("I love Shanghai today");
	}
}

Here the input file news.txt contains:

I am eriol
I live in Shanghai and I love Shanghai
I also love travelling
life in Shanghai
is beautiful

The output is:

word: I in line 1 [1] , line 2 [2] , line 3 [1] , 
word: love in line 2 [1] , line 3 [1] , 
word: Shanghai in line 2 [2] , line 4 [1] , 
word: today not found
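
Note that the lookup is an exact string match, so a query such as shanghai (lowercase) or a word with trailing punctuation would be reported as not found. One possible refinement, which is not part of the program above, is to normalize text the same way before indexing and before querying. The sketch below assumes a hypothetical helper named tokenize and one arbitrary choice of rules: lowercase everything and split on anything that is not a letter or digit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a normalizing tokenizer; the rules are an assumption.
class TokenizerSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Lowercase first, then split on runs of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // split can return an empty first element when the text starts with a separator
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I live in Shanghai, and I love Shanghai!"));
    }
}
```

With such a helper, both createIndex and find would call tokenize instead of splitting on spaces, so that Shanghai, shanghai, and Shanghai! all refer to the same index entry.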


2011-09-06 20:33