metaphy

An Interesting Example: Counting English Word Frequencies

 
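This program counts how many times each word appears in an English document or novel; the code below uses the English edition of "Les Miserables" as its example. Two points deserve attention: how to split a line into words with a regular expression, and the fact that a HashMap cannot be sorted by value directly, so its entries are first copied into a list of sortable objects: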

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.regex.Pattern;

public class EnglishWordsStatics {
	public static final String EN_FOLDER_FILE = "C:/resources/Books/English/Les Miserables.txt";
	public static final String OUTPUT = "C:/resources/Books/English/Les Miserables - Words.txt";

	private HashMap<String, Integer> result = new HashMap<String, Integer>();
	private int total = 0;

	/**
	 * Count the words in one English novel.
	 * 
	 * @param file the plain-text file to read
	 * @throws IOException if the file cannot be read
	 */
	public void handleOneFile(File file) throws IOException {
		if (file == null)
			throw new NullPointerException();

		BufferedReader in = new BufferedReader(new FileReader(file));
		String line;

		// Split on runs of: space , ? ; . ! " ' | digits : ` - ( ) [ ]
		Pattern pattern = Pattern
				.compile("[ ,?;.!\"'|[0-9]:`\\-\\(\\)\\[\\]]+");

		while ((line = in.readLine()) != null) {
			line = line.toLowerCase();
			String[] words = pattern.split(line);

			for (String word : words) {
				if (word.length() > 0) {
					total++;
					// Insert the word with count 1, or add 1 to its existing count (Java 8+)
					result.merge(word, 1, Integer::sum);
				}
			}
		}
		in.close();
		System.out.println("Total words: " + total);
		System.out.println("Total different words: " + result.size());
	}

	/**
	 * Sort the words by count, highest first, and write them to OUTPUT.
	 * 
	 * @throws IOException if the output file cannot be written
	 */
	public void saveResult() throws IOException {
		// Sorting
		List<Node> list = new ArrayList<Node>();
		for (String word : result.keySet()) {
			Node p = new Node(word, result.get(word));
			list.add(p);
		}

		Collections.sort(list);

		// try-with-resources closes the writer even if a write fails (Java 7+)
		try (FileWriter fw = new FileWriter(new File(OUTPUT))) {
			for (Node p : list) {
				fw.write(p.getWord() + "\t" + p.getNum() + "\n");
			}
		}
		System.out.println ("Done");
	}

	/**
	 * @param args
	 */
	public static void main(String[] args) throws IOException {
		EnglishWordsStatics ews = new EnglishWordsStatics();
		ews.handleOneFile(new File(EN_FOLDER_FILE));
		ews.saveResult();
	}
}

/**
 * Pairs a word with its count so that the entries can be sorted by count.
 * 
 */
class Node implements Comparable<Node> {
	private String word;
	private int num;

	public Node() {
	}

	public Node(String word, int num) {
		super();
		this.word = word;
		this.num = num;
	}

	public String getWord() {
		return word;
	}

	public int getNum() {
		return num;
	}

	@Override
	public int compareTo(Node o) {
		// Descending by count; Integer.compare avoids the overflow risk of subtraction
		return Integer.compare(o.getNum(), num);
	}
}
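
The listing above works around "a HashMap cannot be sorted by value" by copying every entry into a `Node` and sorting the list. As a rough alternative, assuming Java 8 or later, the map's own entries can be copied into a list and sorted with a comparator, which makes the helper class unnecessary. The class name `SortByValueSketch`, the method `sortByCount`, and the alphabetical tie-break are illustrative choices of this sketch, not part of the original program:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortByValueSketch {

    /**
     * Returns the map's entries ordered by count, highest first.
     * Words with equal counts are ordered alphabetically (an assumption of
     * this sketch; the original program leaves tie order unspecified).
     */
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        // A HashMap has no order of its own, so sort a copy of its entries.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3);
        counts.put("of", 2);
        counts.put("and", 2);

        // Same tab-separated layout as the output file in the post.
        for (Map.Entry<String, Integer> e : sortByCount(counts)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The explicit tie-break makes the output reproducible from run to run, which helps when comparing two result files.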


Console output:
Total words: 607563
Total different words: 22882
Done


Partial output (word, count):
the	43538
of	21107
and	15865
a	15365
to	14663
in	11813
he	10280
was	9251
that	8413
it	7026
his	6813
had	6564
is	6504
which	5506
with	4737
on	4714
at	4292
this	4208
not	3981
i	3910
you	3768
one	3500
as	3447
for	3129
him	3118
have	2919
there	2869
her	2767
who	2676
all	2606
she	2605
by	2604
from	2568
be	2484
are	2258
an	2249
they	2236
but	2187
s	2141
man	2107
no	2058
were	1962
what	1932
said	1879
been	1601
marius	1471
when	1429
we	1407
their	1323
two	1284
jean	1275
so	1262
will	1258
me	1207
my	1206
more	1198
himself	1155
valjean	1154
them	1126
has	1122
would	1114
these	1097
then	1097
into	1058
like	1055
out	1047
did	1046
little	1034
cosette	1033
m	1005
very	976
its	969
up	965
or	955
do	952
other	940
old	939
than	930
day	869
only	837
some	830
good	830
made	823
time	795
nothing	794
those	779
your	765
if	752
without	739
could	727
de	725
rue	720
first	681
about	678
well	665
where	663
father	638
men	638
say	635
here	631
now	608
should	592
moment	591
over	585
come	582
hand	576
see	573
through	571
any	570
eyes	566
am	561
know	560
even	559
same	551
us	549
after	549
still	546
thenardier	544
great	543
just	538
thought	534
must	533
before	530
once	514
under	511
upon	509
door	508
three	499
being	493
people	491
child	490
how	489
book	487
house	487
head	482
let	480
sort	478
again	474
young	474
go	473
every	472
night	471
each	471
longer	469
javert	465
light	463
right	460
name	458
paris	458
woman	455
can	454
such	446
way	445
place	444
long	443
life	443
back	438
went	431
saint	430
seemed	424
never	421
called	420
four	417
took	416
take	400
seen	397
t	395
years	389
something	389
chapter	388
air	384
left	382
love	381
whom	381
make	380
monsieur	377
though	377
god	373
point	371
mother	368
whole	367
might	367
most	367
between	366
may	363
shall	361
does	358
voice	352
street	352
last	352
almost	351
much	350
down	348
our	348
turned	346
own	342
thing	341
having	340
towards	338
passed	336
face	336
everything	334
always	329
poor	329
soul	327
against	327
order	327
felt	322
off	320
hundred	320
bishop	320
side	318
replied	315
la	314
things	313
certain	312
word	312
away	312
gavroche	312
wall	311
another	308
behind	307
because	307
few	306
hour	306
going	306
room	306
barricade	302
taken	299
five	299
francs	297
too	297
fact	297
black	296
saw	296
fauchelevent	296
put	294
while	291
heard	291
came	290
found	290
heart	284
end	282
enjolras	282
entered	282
madeleine	281
near	281
why	280
themselves	269
madame	268
bed	267
dead	265
sometimes	265
words	265
yes	261
white	260
ah	259
evening	259
girl	253
death	252
six	252
garden	252
le	251
mind	250
itself	249
since	248
thus	248
morning	247
began	246
remained	246
open	245
also	245
gillenormand	244
nor	241
beneath	240
many	240
children	239
half	237
second	237
think	236
table	236
opened	235
set	235
don	233
get	232
terrible	231
hands	231
full	231
done	228
herself	228
large	228
become	228
world	225
anything	224
feet	224
both	223
human	223
person	222
water	219
arms	217
work	217
alone	217
sewer	214
fantine	213
far	211
whose	210
fell	210
idea	210
courfeyrac	209
o	208
police	207
twenty	204
days	204
matter	199
give	199
above	199
already	198
added	198
returned	196
window	194
exclaimed	193
thousand	193
possible	191
corner	190
france	190
earth	190
later	188
however	188
held	187
d	186
knew	186
front	186
louis	186
age	185
less	185
round	183
case	183
speak	182
sir	181
fire	181
tell	180
among	180
yet	180
clock	179
true	179
cold	178
revolution	178
grave	177
lost	176
saying	176
des	174
resumed	173
glance	173
women	173
l	172
part	172
silence	171
look	170
became	170
jondrette	170
rather	169
arm	169
manner	168
new	168
stood	167
sister	167
nevertheless	166
pass	166
iron	165
stone	165
low	164
appeared	164
caught	163
reached	162
oh	162
perhaps	162
raised	162
hair	162
convent	161
read	161
war	160
grand	160
du	159
society	159
beheld	158
fall	158
placed	157
wine	156
shadow	154
happy	154
forth	154
form	154
within	153
making	152
small	152
ground	152
turn	151
state	151
hours	151
nature	151
following	151
grandfather	151
darkness	151
coat	150
joy	149
chamber	149
presence	148
suddenly	148
find	148
myself	147
road	147
letter	147
shop	147
live	147
eye	146
fine	146
foot	146
law	145
paper	145
sight	145
napoleon	145
close	144
smile	144
closed	144
times	143
trees	143
moreover	142
th	142
walls	142
reader	142
seized	142
neither	141
gave	141
quarter	141
history	140
short	140
ancient	139
battle	139
asked	139
beginning	139
king	139
course	138
red	138
present	138
better	138
told	138
third	138
want	137
brought	137
question	137
ever	137
streets	137
piece	137
others	136
rose	136
lay	136
during	136
continued	136
given	136
looked	136
along	136
century	135
knows	134
sound	134
pocket	134
taking	134
rest	134
force	134
enter	134
money	133
direction	133
understand	132
waterloo	132
formed	132
call	131
perceived	131
necessary	130
able	129
strange	129
around	129
melancholy	129
english	129
return	128
sun	128
thou	128
year	128
seated	128
public	127
daughter	127
single	126
mysterious	126
bottom	126
filled	126
gazed	125
floor	125
dark	125
boulevard	124
ten	124
beside	124
cried	124
bourgeois	123
whether	123
die	123
cast	123
visible	123
past	122
seven	122
convict	122
country	121
impossible	121
mayor	120
cut	120
guard	120
hardly	120
appearance	120
shadows	119
laid	119
charming	119
hole	118
means	118
town	118
probably	118
gloomy	117
drew	117
broken	117
pontmercy	117
disappeared	117
blood	117
profound	117
french	117
galleys	116
morrow	116
mademoiselle	116
nearly	116
epoch	116
makes	116
except	116
doubt	115
happiness	115
sous	115
received	115
often	115
together	114
general	114
living	114
mabeuf	114
post	114
least	114
followed	113
comes	113
cannot	113
outside	113
bad	113
says	113
stones	112
leblanc	112
eight	112
paid	112
arrived	112
beautiful	112
houses	112
movement	112
lived	112
re	111
cross	111
known	111
truth	111
depths	111
step	110
hear	110
carriage	110
flowers	109
immense	109
gone	109
lighted	108
progress	108
ideas	108
bread	107
evil	107
girls	107
mouth	107
brother	106
steps	106
sword	106
quite	106
social	106
escape	106
hideous	106
liberty	105
recognized	105
carried	105
army	105
caused	105
certainly	105
pretty	104
hold	104
mingled	104
attention	104
spot	104
effect	103
combeferre	103
thirty	103
slang	103
fallen	103
coming	103
future	103
ago	103
wish	102
pay	102
heaven	102
need	102
shot	102
really	102
family	102
struck	101
passing	101
until	101
below	101
midst	101
months	101
horse	101
city	99
wife	99
conscience	99
loved	99
friends	98
line	98
teeth	98
yourself	97
duty	97
soon	97
breath	97
chair	97
served	97
bent	96
enough	96
sign	96
justice	96
unknown	96
grantaire	96
body	95
seems	95
distance	95
frightful	94
thoughts	94
remain	94
candle	94
high	94
sleep	94
although	93
hat	93
produced	93
covered	93
singular	93
fellow	93
forty	93
wind	92
eponine	92
moments	92
instant	92
simple	92
fifteen	92
further	92
fear	92
secret	92
peace	92
understood	91
insurrection	91
presented	91
bit	91
slowly	91
ll	90
walked	90
occasion	90
formidable	90
doctor	90
gaze	90
square	90
top	90
becomes	90
porter	90
allowed	90
brow	90
glass	89
rendered	89
sad	89
blind	89
husband	89
souls	89
montparnasse	89
horrible	89
windows	89
according	88
monseigneur	88
son	88
pale	88
leave	88
halted	87
enormous	87
succeeded	87
minutes	87
stranger	87
dressed	86
vague	86
ran	86
either	86
power	86
serious	86
uttered	86
tone	86
laugh	86
none	86
forest	86
obliged	86
blue	85
spring	85
sombre	85
use	85
heads	85
touched	84
existed	84
home	84
pavement	84
view	84
despair	84
petit	84
forms	84
prison	84
reply	84
june	83
sky	83
sur	83
doing	83
knees	83
middle	82
hope	82
fixed	82
colonel	82
watch	82
haste	81
killed	81
care	81
misery	81
cannon	81
noise	81
real	81
names	81
prisoner	80
eat	80
bossuet	80
several	80
letters	80
burst	80
spoke	80
youth	80
big	80
destiny	80
tree	80
crime	79
church	79
gentleman	79
fashion	79
deal	79
address	79
rich	79
lower	79
entering	79
asleep	79
vast	78
hence	78
perfectly	78
honest	78
composed	78
standing	78
concealed	78
master	78
resembled	78
service	78
whence	78
sure	77
motionless	77
gun	77
stars	77
number	77
winter	77
civilization	77
terror	77
amid	77
besides	77
magloire	77
chimney	77
honor	77
thrust	77
forced	77
thinking	77
walk	76
baron	76
et	76
chance	76
reason	76
deserted	76
gloom	76
begun	76
school	76
paces	76
neck	76
emperor	76
affair	76
seeing	76
rain	76
ideal	76
speaking	75
latter	75
traversed	75
inn	75
everywhere	75
persons	75
cry	75
lofty	75
beyond	75
march	75
wild	75
feel	75
respect	75
montfermeil	75
got	74
paused	74
holy	74
subject	74
beings	74
else	74
court	74
dawn	74
fault	74
whatever	74
priest	74
aside	74
mass	74
turning	73
peculiar	73
wrong	73
creature	73
rope	73
worthy	73
tholomyes	73
wore	73
shouted	73
race	73
drawing	72
space	72
opening	72
fifty	72
horses	72
sou	72
divine	72
gate	72
shoes	72
double	72
wounded	72
breast	72
spirit	72
free	72
recognize	72
waiting	72
walking	71
change	71
thanks	71
written	71
lines	71
soldier	71
box	71
coffin	71
stared	71
pronounced	70
play	70
account	70
listened	70
bench	70
gentle	70
passage	70
silver	70
evidently	70
memory	70
situation	70
addressed	70
dream	70
kept	70
named	70
possessed	69
key	69
erect	69
behold	69
pity	69
green	69
building	69
cap	69
fresh	69
sainte	68
run	68
bare	68
departure	68
preceding	68
cart	68
mean	68
tried	68
narrow	68
picpus	68
keep	68
soldiers	68
ill	67
obscure	67
angle	67
cloud	67
wellington	67
talking	67
ended	67
finished	67
approached	67
condemned	67
month	67
existence	67
virtue	66
story	66
distant	66
habit	66
quitted	66
attack	66
object	66
wood	66
complete	66
immediately	66
shut	66
sent	66
absolute	65
lightning	65
supreme	65
etc	65
sweet	65
dog	65
dropped	65
noticed	65
revery	65
calm	65
listen	65
believe	65
entrance	65
wrath	65
heavy	65
bore	64
obscurity	64
crowd	64
abyss	64
finally	64
ask	64
rags	64
shoulders	63
pure	63
flight	63
takes	63
goes	63
thither	63
happened	63
died	63
doors	63
emerged	63
advanced	63
twilight	63
fatal	63
gamin	63
deep	63
effort	63
horror	63
stupid	63
committed	63
demanded	63
prioress	63
possession	63
plumet	63
advance	62
sense	62
fifth	62
blow	62
instinct	62
bring	62
best	62
roof	62
daylight	62
revolt	62
purpose	62
merely	62
questions	62
linen	62
aunt	62
conscious	62
tomb	62
gold	62
note	62
attitude	62
encountered	62
field	61
descended	61
england	61
ourselves	61
talk	61
flung	61
suffering	61
action	61
faubourg	61
rise	61
yellow	61
absolutely	61
lies	61
merry	61
required	61
illuminated	61
cure	61
seem	61
self	61
exists	61
repeated	60
ear	60
across	60
mentioned	60
hall	60
falling	60
occupied	60
infinite	60
straw	60
smoke	60
straight	60
branch	60
philosophy	60
cause	60
observed	60
lips	60
pistol	59
holding	59
horizon	59
knowing	59
violent	59
former	59
maire	59
indescribable	59
hung	59
bridge	58
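
Entries such as `s`, `t`, `don`, `ll` and `re` in the list above come from the split pattern itself: the apostrophe is one of its delimiters, so a contraction like "don't" is counted as `don` and `t`. The sketch below compares the post's pattern with one possible alternative that matches words instead of splitting on delimiters. The class name `ApostropheTokenizerSketch`, the `WORD` pattern, and the sample sentence are assumptions made for illustration; the sketch only handles the plain ASCII apostrophe, and unlike the original it drops every non-letter character:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApostropheTokenizerSketch {

    // The delimiter pattern used in the post (the apostrophe is a delimiter).
    static final Pattern ORIGINAL_SPLIT =
            Pattern.compile("[ ,?;.!\"'|[0-9]:`\\-\\(\\)\\[\\]]+");

    // Hypothetical alternative: runs of letters, optionally joined by apostrophes.
    static final Pattern WORD = Pattern.compile("\\p{L}+(?:'\\p{L}+)*");

    /** Splits a line the way the post does, dropping empty tokens. */
    static List<String> splitLikeThePost(String line) {
        List<String> words = new ArrayList<>();
        for (String w : ORIGINAL_SPLIT.split(line.toLowerCase())) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /** Collects words by matching, so that "don't" stays one token. */
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(line.toLowerCase());
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "Don't stop, M. Madeleine!";
        System.out.println(splitLikeThePost(line)); // [don, t, stop, m, madeleine]
        System.out.println(tokenize(line));         // [don't, stop, m, madeleine]
    }
}
```

Whether contractions should count as one word or two depends on the purpose of the statistics, so this is a design choice rather than a correction.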
