This thread has been rated as a good post
Posted: 2011-04-30
Last edited: 2011-05-02
This is my first post on ITEYE. The code below is Peter Norvig's spelling corrector:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
def words(text): return re.findall('[a-z]+', text.lower())
Returns all the words in the text, lowercased (Google search is case-insensitive too; distinguishing case would make things far too complicated).

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model
Returns a dict mapping key=word to value=occurrence count (similar to a Hashtable in Java).

NWORDS = train(words(file('big.txt').read()))

That is, it tallies how many times each word in big.txt occurs. A quick sketch of these pieces working together is below.
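A minimal sketch of words and train in action; the sample sentence is made up. Note that the defaultdict baseline of 1 acts as add-one smoothing, so even unseen words get a nonzero count:

sample = 'The quick brown fox jumps over the lazy dog. The end.'
model = train(words(sample))
print(model['the'])     # 4: baseline 1 + three occurrences ('The', 'the', 'The')
print(model['fox'])     # 2: baseline 1 + one occurrence
print(model['zebra'])   # 1: never seen, but the default keeps it nonzero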
def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

That is, edits1 produces the following four kinds of candidates, each exactly one edit away from the input (the post uses 'iteye' as its example; a sketch follows below).
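A hypothetical illustration of the four edit types for 'iteye'; the strings in the comments are hand-picked members of each list:

word = 'iteye'
# deletes:    drop one letter           -> 'teye', 'itye', 'itey', ...
# transposes: swap adjacent letters     -> 'tieye', 'ietye', 'iteey', ...
# replaces:   change one letter         -> 'ateye', 'itaye', ...
# inserts:    add one letter anywhere   -> 'aiteye', 'itzeye', 'iteyes', ...
# For a word of length n this yields n + (n-1) + 26n + 26(n+1) = 54n + 25
# strings (295 for n = 5) before the set collapses duplicates.
print(len(edits1(word)))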
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
known_edits2 returns the words that are exactly two edits away from the input and that also appear in the training corpus. The reason for filtering against NWORDS right away is that the raw distance-2 set is enormous, while only a tiny fraction of it consists of real words; a sketch of the blow-up follows.
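An illustrative sketch of why the known-word filter matters. The counts are approximate and depend on your big.txt; the 54n + 25 figure comes from counting the four list comprehensions in edits1:

w = 'something'                            # n = 9 letters
e1 = edits1(w)
e2 = set(y for x in e1 for y in edits1(x))
print(len(e1))            # at most 54*9 + 25 = 511 candidates at distance 1
print(len(e2))            # on the order of 10**5 candidates at distance 2
print(len(known(e2)))     # only a handful of these are words in NWORDS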
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Here Peter makes a blunt, straightforward assumption about priority: a word already in the dictionary beats any candidate one edit away, which in turn beats any candidate two edits away (falling back to the input itself). Within the winning tier, the most frequent word in NWORDS wins.
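Example corrections, as reported in Norvig's article (assuming NWORDS was trained on his big.txt):

print(correct('speling'))    # -> 'spelling' (one edit away)
print(correct('korrecter'))  # -> 'corrector' (two edits away)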
Peter Norvig wrote:
The answer is that P(c|w) is already conflating two factors, and it is easier to separate the two out and deal with them explicitly.
Consider the misspelled word w="thew" and the two candidate corrections c="the" and c="thaw". Which has a higher P(c|w)? Well, "thaw" seems good because the only change is "a" to "e", which is a small change. On the other hand, "the" seems good because "the" is a very common word, and perhaps the typist's finger slipped off the "e" onto the "w". The point is that to estimate P(c|w) we have to consider both the probability of c and the probability of the change from c to w anyway, so it is cleaner to formally separate the two factors.
In other words, directly comparing P(c|w) across different candidates c is hard, because we have no way of knowing exactly how the word came to be mistyped; the clean way out is Bayes' rule, sketched below.
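For reference, the decomposition the quote argues for is Bayes' rule; P(w) is the same for every candidate c, so it drops out of the argmax:

    argmax_c P(c|w) = argmax_c P(w|c) P(c) / P(w) = argmax_c P(w|c) P(c)

P(c) is the language model, estimated here by the NWORDS counts; P(w|c) is the error model, which this code approximates with the crude known / edits1 / edits2 priority inside correct.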
A while ago I visited the Shanghai Museum. Looking at those exquisitely crafted jade pieces, to my own surprise the first thing that came to mind was code like this.
I had long wanted to print cool code on a T-shirt. Summer is finally here, so I've decided to print this code on mine, and Peter has agreed:
MPx, xphone, money, pretty girls: all of these will eventually rot away. Only art and love are immortal.
Anyone who wants to print this code too can ask me for the artwork, haha; happy to share it with everyone. Have a good summer, and happy programming!
Posted: 2011-05-05

Honestly, it's impossible not to be disappointed. I put real thought into this post, and it got not a single reply!
Posted: 2011-05-05
Last edited: 2011-05-05

Lowly commoners like me can't grasp the great master's essence, but since the master keeps calling out, this humble commoner has no choice but to come out and pay homage.
Posted: 2011-05-05

The general board isn't nearly as lively as it used to be; nothing to be done about it.
Posted: 2011-05-10

dennis_zane wrote: The general board isn't nearly as lively as it used to be; nothing to be done about it.

One reason the crowd is gone is that there are hardly any good posts like this one anymore, and hardly any decent technical posts at all.
Posted: 2011-05-10

Impressive... a really good technical post.
Bump.
Posted: 2011-05-10

Thanks. Though if you search the web for 'python spell check', this code is all that turns up everywhere. I wonder how well it actually performs?

I ended up building a spell-check feature of my own; it should be a bit more capable than this one. Mine can run an English spell check over a passage of text or over a URL. Give it a try?

The URL is:
http://www.ueseo.org/spellcheck/links/
Feedback and discussion welcome.
codeincoffee wrote: (quoting the original post above in full)
Posted: 2011-05-10

A real classic; I'm thinking about learning Python now too.
Posted: 2011-05-10

Yeah, it really is a classic article.
Posted: 2011-05-10

Excellent article; I hope more people get to see it. I especially like the OP's style.