Counting the Chinese characters used in an article and how many times each one is repeated

ylq365

Blog categories: Java, counting Chinese characters

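Let's go straight to the code.

There are three classes: ChineseEntity, an object that holds a Chinese character together with its repeat count; MyComparator, used to sort the array; and Test111, the main class.

First, the main class, Test111: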

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class Test111 {

	/**
	 * @param args
	 */
	public static void main(String[] args) throws Exception {
		Map<String, String> chineseMap = new HashMap<String, String>();
		ArrayList<ChineseEntity> arrayList = new ArrayList<ChineseEntity>();
		String filePath = "/Users/alecyan/Downloads/37a60a51236722724f8e863791988205/[历史]《历史是个什么玩意儿》1.2.3.txt";
		File file = new File(filePath);
		BufferedInputStream fis = new BufferedInputStream(new FileInputStream(
				file));
		BufferedReader reader = new BufferedReader(new InputStreamReader(fis,
				"gb2312"), 5 * 1024 * 1024);
		String line = null;

		while ((line = reader.readLine()) != null) {
			for (int i = 0; i < line.length(); i++) {
				char chr1 = line.charAt(i);
				// Han range \u4e00-\u9fa5 (19968-40869). A char never exceeds 65535, so a
				// larger upper bound would also count fullwidth punctuation such as '，' (U+FF0C).
				if (chr1 >= 0x4E00 && chr1 <= 0x9FA5) {
					String newStr = Character.toString(chr1);
					String value = chineseMap.get(newStr);
					if (value == null) {
						// First occurrence of this character
						chineseMap.put(newStr, "1");
					} else {
						int iCount = Integer.parseInt(value);
						iCount = iCount + 1;
						chineseMap.put(newStr, Integer.toString(iCount));
					}
				}
			}
		}
		reader.close();
		Set<String> keySet = chineseMap.keySet();
		if (keySet != null && keySet.size() > 0) {
			for (String key : keySet) {
				if (chineseMap.get(key) != null) {
					int iCount = Integer.parseInt(chineseMap.get(key));
					ChineseEntity chineseEntity = new ChineseEntity();
					chineseEntity.setChineseStr(key);
					chineseEntity.setiCount(iCount);
					arrayList.add(chineseEntity);
				}
			}
		}
		Object[] chineseEntityArray = arrayList.toArray();
		java.util.Arrays.sort(chineseEntityArray, new MyComparator());
		if (chineseEntityArray != null && chineseEntityArray.length > 0) {
			System.out.println("total chinese character:" + chineseEntityArray.length);
			for (int i = 0; i < chineseEntityArray.length; i++) {
				ChineseEntity chineseEntity = (ChineseEntity) chineseEntityArray[i];
				System.out.println(chineseEntity.getChineseStr() + "---重复次数---"
						+ chineseEntity.getiCount());
			}
		}
	}
}
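
As a hedged alternative to the counting loop in Test111, the minimal sketch below walks the text by Unicode code point and keeps only characters whose script is Han, which leaves out fullwidth punctuation such as "，". The class name HanCounter and the method countHan are hypothetical names introduced for illustration. Using Character.UnicodeScript instead of the fixed \u4e00-\u9fa5 range is a swapped-in technique, not the author's method, and it accepts a wider set of characters than that range.

```java
import java.util.HashMap;
import java.util.Map;

public class HanCounter {

    // Hypothetical helper: counts how often each Han character occurs in text.
    static Map<String, Integer> countHan(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Iterate by code point so characters outside the BMP stay in one piece.
        text.codePoints()
            .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
            .forEach(cp -> counts.merge(new String(Character.toChars(cp)), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // Sample text: U+4E2D U+56FD, a fullwidth comma (U+FF0C), U+4E2D U+6587, then " abc".
        Map<String, Integer> counts = countHan("\u4e2d\u56fd\uff0c\u4e2d\u6587 abc");
        System.out.println(counts.get("\u4e2d"));         // 2
        System.out.println(counts.containsKey("\uff0c")); // false
    }
}
```

Storing the counts as Integer values also avoids the round trip through String that the HashMap<String, String> version needs on every increment.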

The ChineseEntity class:

public class ChineseEntity{
	private String chineseStr;
	private int iCount;
	public String getChineseStr() {
		return chineseStr;
	}
	public void setChineseStr(String chineseStr) {
		this.chineseStr = chineseStr;
	}
	public int getiCount() {
		return iCount;
	}
	public void setiCount(int iCount) {
		this.iCount = iCount;
	}
}

The MyComparator class:

import java.util.Comparator;

public class MyComparator implements Comparator<Object> {
	public int compare(Object obj1, Object obj2) {
		ChineseEntity u1 = (ChineseEntity) obj1;
		ChineseEntity u2 = (ChineseEntity) obj2;
		if (u1.getiCount() > u2.getiCount()) {
			return -1;
		} else if (u1.getiCount() < u2.getiCount()) {
			return 1;
		} else {
			// Fall back on String's own ordering:
			// when the counts are equal, sort by the character string itself.
			return u1.getChineseStr().compareTo(u2.getChineseStr());
		}

	}
}
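
The same ordering (count descending, then character ascending) can also be written with the typed Comparator factory methods available since Java 8, which removes the casts from Object. This is a minimal sketch, not the author's code: the nested Entry class is a hypothetical stand-in for ChineseEntity so that the example compiles on its own, and SortSketch and sorted are likewise illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortSketch {

    // Hypothetical stand-in for ChineseEntity, kept minimal for the example.
    static final class Entry {
        final String chineseStr;
        final int iCount;

        Entry(String chineseStr, int iCount) {
            this.chineseStr = chineseStr;
            this.iCount = iCount;
        }
    }

    // Higher counts first; ties are broken by String's natural ordering.
    static final Comparator<Entry> BY_COUNT_DESC =
            Comparator.comparingInt((Entry e) -> e.iCount)
                      .reversed()
                      .thenComparing(e -> e.chineseStr);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Entry> sorted(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(BY_COUNT_DESC);
        return copy;
    }

    public static void main(String[] args) {
        List<Entry> result = sorted(Arrays.asList(
                new Entry("b", 2), new Entry("a", 2), new Entry("c", 5)));
        for (Entry e : result) {
            System.out.println(e.chineseStr + " " + e.iCount);
        }
        // prints: c 5, a 2, b 2 (one per line)
    }
}
```

Sorting a typed list this way also makes the intermediate Object[] array unnecessary.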

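Here we ran the program on an e-book; part of the output is shown below. Each line gives a character followed by its count (the separator 重复次数 means "repeat count"). The first row is the fullwidth comma "，" (U+FF0C) rather than a Chinese character: these results come from a run whose range check accepted every char from 19968 upward, so fullwidth punctuation was counted as well.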

total chinese character:3485
，---重复次数---18034
的---重复次数---7939
是---重复次数---5017
一---重复次数---3663
了---重复次数---3656
不---重复次数---3198
国---重复次数---3072
就---重复次数---2710
人---重复次数---2632
这---重复次数---2325
个---重复次数---2295
有---重复次数---1954
大---重复次数---1720
上---重复次数---1699
中---重复次数---1693
他---重复次数---1685
在---重复次数---1581
你---重复次数---1580
来---重复次数---1481
以---重复次数---1402
我---重复次数---1370
都---重复次数---1360
为---重复次数---1356
后---重复次数---1224
子---重复次数---1145
天---重复次数---1124
朝---重复次数---1102
时---重复次数---1066
么---重复次数---1041
到---重复次数---1034
说---重复次数---1018
地---重复次数---1004


2013-08-30 16:16

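Comment 1, hanmiao, 2013-08-31

This counts not only the occurrences of Chinese characters but also the occurrences of fullwidth characters. Also, the GB2312 character set is somewhat smaller than GB18030; I wonder whether this code would still work correctly if the character encoding were simply changed to GB18030?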
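
On the encoding question raised in this comment, here is a minimal sketch, assuming the running JDK ships both the GB2312 and GB18030 charsets (neither is among the charsets every Java platform is required to support, so Charset.forName may throw UnsupportedCharsetException). It checks that Han characters encoded as GB2312 decode to the same characters under GB18030, which suggests that for a GB2312 file, changing the charset name passed to InputStreamReader would not alter what the counting loop sees. The class name EncodingCheck and the method transcode are hypothetical.

```java
import java.nio.charset.Charset;

public class EncodingCheck {

    // Encodes text with one charset and decodes the resulting bytes with another.
    static String transcode(String text, String encodeWith, String decodeWith) {
        byte[] bytes = text.getBytes(Charset.forName(encodeWith));
        return new String(bytes, Charset.forName(decodeWith));
    }

    public static void main(String[] args) {
        String sample = "\u6c49\u5b57"; // U+6C49 U+5B57, the word "hanzi"
        String decoded = transcode(sample, "GB2312", "GB18030");
        System.out.println(sample.equals(decoded));
    }
}
```

A file that really uses GB18030's four-byte sequences can contain characters outside the Basic Multilingual Plane, which Java represents as surrogate pairs; a char-by-char loop would see those as two separate units.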
