This thread has been rated as a good post
Posted: 2011-04-30
Last edited: 2011-05-02
This is my first post on ITEYE. The code below is Peter Norvig's spelling corrector:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
def words(text): return re.findall('[a-z]+', text.lower())
Returns all the words in the text, lowercased (Google search is case-insensitive too; distinguishing case would make things far too complicated).

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model
Returns a dict mapping key=word to value=occurrence count (similar to a Hashtable in Java).

NWORDS = train(words(file('big.txt').read()))

That is, it tallies how many times each word in big.txt occurs. A quick sketch of these pieces working together is below.
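A minimal sketch of words and train in action; the sample sentence is made up. Note that the defaultdict baseline of 1 acts as add-one smoothing, so even unseen words get a nonzero count:

sample = 'The quick brown fox jumps over the lazy dog. The end.'
model = train(words(sample))
print(model['the'])     # 4: baseline 1 + three occurrences ('The', 'the', 'The')
print(model['fox'])     # 2: baseline 1 + one occurrence
print(model['zebra'])   # 1: never seen, but the default keeps it nonzero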
def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

That is, edits1 produces the following four kinds of candidates, each exactly one edit away from the input (the post uses 'iteye' as its example; a sketch follows below).
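A hypothetical illustration of the four edit types for 'iteye'; the strings in the comments are hand-picked members of each list:

word = 'iteye'
# deletes:    drop one letter           -> 'teye', 'itye', 'itey', ...
# transposes: swap adjacent letters     -> 'tieye', 'ietye', 'iteey', ...
# replaces:   change one letter         -> 'ateye', 'itaye', ...
# inserts:    add one letter anywhere   -> 'aiteye', 'itzeye', 'iteyes', ...
# For a word of length n this yields n + (n-1) + 26n + 26(n+1) = 54n + 25
# strings (295 for n = 5) before the set collapses duplicates.
print(len(edits1(word)))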
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
known_edits2 returns the words that are exactly two edits away from the input and that also appear in the training corpus. The reason for filtering against NWORDS right away is that the raw distance-2 set is enormous, while only a tiny fraction of it consists of real words; a sketch of the blow-up follows.
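An illustrative sketch of why the known-word filter matters. The counts are approximate and depend on your big.txt; the 54n + 25 figure comes from counting the four list comprehensions in edits1:

w = 'something'                            # n = 9 letters
e1 = edits1(w)
e2 = set(y for x in e1 for y in edits1(x))
print(len(e1))            # at most 54*9 + 25 = 511 candidates at distance 1
print(len(e2))            # on the order of 10**5 candidates at distance 2
print(len(known(e2)))     # only a handful of these are words in NWORDS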
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Here Peter makes a blunt, straightforward assumption about priority: a word already in the dictionary beats any candidate one edit away, which in turn beats any candidate two edits away (falling back to the input itself). Within the winning tier, the most frequent word in NWORDS wins.
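Example corrections, as reported in Norvig's article (assuming NWORDS was trained on his big.txt):

print(correct('speling'))    # -> 'spelling' (one edit away)
print(correct('korrecter'))  # -> 'corrector' (two edits away)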
Peter Norvig wrote:
The answer is that P(c|w) is already conflating two factors, and it is easier to separate the two out and deal with them explicitly.
Consider the misspelled word w="thew" and the two candidate corrections c="the" and c="thaw". Which has a higher P(c|w)? Well, "thaw" seems good because the only change is "a" to "e", which is a small change. On the other hand, "the" seems good because "the" is a very common word, and perhaps the typist's finger slipped off the "e" onto the "w". The point is that to estimate P(c|w) we have to consider both the probability of c and the probability of the change from c to w anyway, so it is cleaner to formally separate the two factors.
In other words, directly comparing P(c|w) across different candidates c is hard, because we have no way of knowing exactly how the word came to be mistyped; the clean way out is Bayes' rule, sketched below.
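For reference, the decomposition the quote argues for is Bayes' rule; P(w) is the same for every candidate c, so it drops out of the argmax:

    argmax_c P(c|w) = argmax_c P(w|c) P(c) / P(w) = argmax_c P(w|c) P(c)

P(c) is the language model, estimated here by the NWORDS counts; P(w|c) is the error model, which this code approximates with the crude known / edits1 / edits2 priority inside correct.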
A while ago I visited the Shanghai Museum. Looking at those exquisitely crafted jade pieces, to my own surprise the first thing that came to mind was code like this.
I had long wanted to print cool code on a T-shirt. Summer is finally here, so I've decided to print this code on mine, and Peter has agreed:
MPx, xphone, money, pretty girls: all of these will eventually rot away. Only art and love are immortal.
Anyone who wants to print this code too can ask me for the artwork, haha; happy to share it with everyone. Have a good summer, and happy programming!
Posted: 2011-05-05

Honestly, it's impossible not to be disappointed. I put real thought into this post, and it got not a single reply!
Posted: 2011-05-05
Last edited: 2011-05-05

Lowly commoners like me can't grasp the great master's essence, but since the master keeps calling out, this humble commoner has no choice but to come out and pay homage.
Posted: 2011-05-05

The general board isn't nearly as lively as it used to be; nothing to be done about it.
Posted: 2011-05-10

dennis_zane wrote: The general board isn't nearly as lively as it used to be; nothing to be done about it.

One reason the crowd is gone is that there are hardly any good posts like this one anymore, and hardly any decent technical posts at all.
Posted: 2011-05-10

Impressive... a really good technical post.
Bump.
Posted: 2011-05-10

Thanks. Though if you search the web for 'python spell check', this code is all that turns up everywhere. I wonder how well it actually performs?

I ended up building a spell-check feature of my own; it should be a bit more capable than this one. Mine can run an English spell check over a passage of text or over a URL. Give it a try?

The URL is:
http://www.ueseo.org/spellcheck/links/
Feedback and discussion welcome.
codeincoffee wrote: (quoting the original post above in full)
Posted: 2011-05-10

A real classic; I'm thinking about learning Python now too.
Posted: 2011-05-10

Yeah, it really is a classic article.
Posted: 2011-05-10

Excellent article; I hope more people get to see it. I especially like the OP's style.