OCR Study Notes

san_yun


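I recently did some research on CAPTCHA recognition, mainly from the OCR angle, and this post records a summary. The reference articles listed at the end already explain CAPTCHA recognition in detail. Building an OCR pipeline is not hard; the hard part is raising the recognition rate. The basic workflow is: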

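1. Original image

2. Preprocessing (noise removal)

3. Standardization (grayscale conversion, binarization, normalization)

4. Image segmentation (I find this the hardest step; there are many algorithms, such as the vertical projection histogram, KNN, and color filling)

5. Feature extraction

6. Machine learning

7. Recognition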
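
As an illustration of step 4, here is a minimal sketch of segmentation by vertical projection, assuming the characters do not touch each other. The function name `split_columns` and the `min_width` noise filter are hypothetical, not part of any library: the idea is only that a column with no foreground pixels marks a gap between characters.

```python
def split_columns(projection, min_width=1):
    """Return (start, end) column ranges, end exclusive, of non-empty runs.

    projection[x] is the number of foreground pixels in column x.
    Runs narrower than min_width are treated as noise and dropped.
    """
    segments = []
    start = None
    for x, count in enumerate(projection):
        if count > 0 and start is None:
            start = x                       # a character begins here
        elif count == 0 and start is not None:
            if x - start >= min_width:
                segments.append((start, x))
            start = None                    # an empty column ends the run
    # Close a run that reaches the right edge of the image
    if start is not None and len(projection) - start >= min_width:
        segments.append((start, len(projection)))
    return segments


# Two glyphs separated by empty columns, plus a 1-pixel-wide speck of noise
projection = [0, 3, 5, 4, 0, 0, 2, 6, 6, 1, 0, 1, 0]
print(split_columns(projection, min_width=2))  # [(1, 4), (6, 10)]
```

Each returned range can be used as the left and right edges of a crop box. This approach breaks down when characters overlap or are joined, which is why segmentation is the hard step.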

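In short, OCR is a very interesting research topic. It draws on a great deal of work in computer graphics and image processing, machine learning, and neural networks, and it makes a good concrete problem for studying machine learning. A ready-made handwriting sample set, MNIST, is available online to experiment with. The attachment also includes a paper on the vector space model (VSM), which clearly explains how to compute the similarity between two objects.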
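
The attached paper is not reproduced here, so the following is only a sketch under an assumption: that similarity is measured as the cosine of the angle between two feature vectors, the measure most commonly associated with the vector space model. The function name `cosine_similarity` and the sample vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                      # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors, e.g. per-column black pixel counts of two glyphs
glyph_a = [0, 3, 5, 3, 0]
glyph_b = [0, 6, 10, 6, 0]              # same shape, twice as "heavy"
print(cosine_similarity(glyph_a, glyph_b))
```

Because the cosine depends only on direction, two glyphs with the same shape but different stroke weight score close to 1, while vectors with no overlap score 0.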

0. Basic PIL API usage:

# -*- coding: utf-8 -*-
from PIL import Image, ImageDraw

path = "/home/yunpeng/test4/data/4399/simple/8.png"

im = Image.open(path)
print('img info:', im.format, im.size, im.mode)
im = im.convert('L')  # 8-bit grayscale

# Binarize: pixels brighter than 90 become white (255), the rest black (0)
im = im.point(lambda p: 255 if p > 90 else 0)

# Trim left and right: keep only the columns between the first and the
# last column that contain a dark pixel
width, height = im.size
px = im.load()
cols = [x for x in range(width)
        if any(px[x, y] < 200 for y in range(height))]
left, right = cols[0], cols[-1]  # fails if the image has no dark pixel
im = im.crop((left, 0, right + 1, height))  # the right edge of a crop box is exclusive

# Vertical projection: number of black pixels in each column
width, height = im.size
px = im.load()
ps = [sum(1 for y in range(height) if px[x, y] == 0) for x in range(width)]

# Draw the projection as a bar chart on a 200x200 canvas,
# 4 canvas pixels per black pixel
scale = 4
image = Image.new('RGB', (200, 200), (255, 255, 255))
draw = ImageDraw.Draw(image)
for k, count in enumerate(ps):
    source = (k, 199)                          # bar base on the bottom row
    target = (k, max(0, 199 - count * scale))  # bar top, clamped to the canvas
    draw.line([source, target], fill=(100, 100, 100), width=1)

image.show()
im.show()

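1. What is a gray-level transformation?

In Photoshop, a gray-level transformation can boost the R, G and B channels by any ratio and then mix them. A tonal change applied to a black-and-white image is called a gray-level transformation, and so is a color change applied to a color image.
Take the linear transformation as an example.
It can be written as a linear function: g(x,y) = a' + (b'-a')/(b-a) × (f(x,y) - a)
f(x,y) is the gray level of the pixel at (x,y) in the original image, and g(x,y) is its gray level after the transformation.
[a,b] is the gray range of the original image, and [a',b'] is the gray range of the transformed image.
Applying this linear function to the R, G and B components separately enhances a single color; the components are then mixed for output.
If b'-a' > b-a, the gray range of the image widens, i.e. the contrast increases and the image looks sharper.
If b'-a' < b-a, the gray range of the image narrows, i.e. the contrast decreases.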
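
The linear mapping from [a, b] to [a', b'] described above can be sketched in a few lines. The function name `linear_stretch`, the clamping of out-of-range input, and the sample range 50..200 are assumptions made for illustration.

```python
def linear_stretch(value, a, b, a_new=0, b_new=255):
    """Map a gray level from [a, b] to [a_new, b_new] linearly."""
    if b == a:
        raise ValueError("source range must not be empty")
    value = min(max(value, a), b)       # clamp input outside [a, b] (an assumption)
    return a_new + (b_new - a_new) * (value - a) / (b - a)


# Stretch a low-contrast image whose gray levels lie in 50..200 to 0..255.
# Precomputing all 256 outputs gives a lookup table.
lut = [round(linear_stretch(v, 50, 200)) for v in range(256)]
print(lut[50], lut[125], lut[200])
```

With PIL, a 256-entry table like `lut` can be applied to a mode 'L' image with `im.point(lut)`. Here b'-a' = 255 is larger than b-a = 150, so the contrast increases.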

PS: In PIL, im.convert('L') converts an image to grayscale.

2. What is a histogram?

A histogram counts, for each color (or gray) value, how many pixels in the image have that value.
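
A minimal sketch of that definition for 8-bit gray levels, written in plain Python so the counting is explicit. The function name `gray_histogram` and the sample pixel list are hypothetical.

```python
def gray_histogram(pixels, levels=256):
    """Count how many pixels have each gray level in 0..levels-1."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    return counts


pixels = [0, 0, 255, 128, 255, 255]     # a made-up 6-pixel image
hist = gray_histogram(pixels)
print(hist[0], hist[128], hist[255])    # 2 1 3
```

For a mode 'L' image, PIL's `im.histogram()` returns the same kind of 256-entry list without an explicit loop.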

References:

Computing and displaying a histogram with PIL

3. How do I install tesseract?

References:

Installing tesseract on Ubuntu for OCR

Cracking website CAPTCHAs with tesseract-ocr

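4. References

Python Imaging Library (PIL): basic concepts and library overview
http://www.cnblogs.com/wei-li/archive/2012/04/19/2443281.html
http://www.cnblogs.com/wei-li/archive/2012/04/19/2456725.html
http://iysm.net/?tag=pil

Image processing with Python:
http://blog.csdn.net/gzlaiyonghao/article/details/1852726

Computing image similarity ("Python Can Do It Too", part 1)
http://blog.csdn.net/gzlaiyonghao/article/details/2325027

Detecting pornographic images in 10 lines of code ("Python Can Do It Too", part 2)
http://blog.csdn.net/gzlaiyonghao/article/details/3166735

Recognizing handwritten digits with a BP artificial neural network ("Python Can Do It Too", part 3)
http://blog.csdn.net/gzlaiyonghao/article/details/7109898

A discussion of algorithms for large-scale similar-image recognition (fairly shallow)
http://caocao.iteye.com/blog/149776

Implementing filters with PIL (1): sketch and pencil-drawing effects
http://blog.sina.com.cn/s/blog_5eeb1e2f0101axvi.html

Image processing: the Hough transform (line detection algorithm)
http://blog.csdn.net/jia20003/article/details/7724530

Simple image processing in Python (the most detailed; posts 1-16, covering thinning, the Fourier transform, and more)
http://www.cnblogs.com/xianglan/category/272764.html

A worked example of image CAPTCHA recognition with ImageMagick + tesseract-ocr (relatively high recognition rate):
http://blog.csdn.net/mlks_2008/article/details/8052782

How to train tesseract-ocr:
http://www.lixin.me/blog/2012/05/26/29536

Learning OCR and some tests with tesseract:
http://blog.csdn.net/viewcode/article/details/7784600

Notes on recognizing one site's CAPTCHA (removing the background color):
http://blog.csdn.net/bh20077/article/details/7041280

Cracking simple CAPTCHAs with imagemagick and tesseract-ocr (ruby):
http://hooopo.iteye.com/blog/993538

Building neural networks with Python (IBM; a Hopfield network can reconstruct distorted patterns and remove noise)
http://www.ibm.com/developerworks/cn/linux/l-neurnet/

Weaknesses of common CAPTCHAs and CAPTCHA recognition
http://drops.wooyun.org/tips/141

A general algorithm for removing interference lines from text images:
http://wenku.baidu.com/view/63bac64f2b160b4e767fcfed.html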

Decoding CAPTCHA’s:
http://www.boyter.org/decoding-captchas/

===================================================================
Tesseract OCR training and recognition summary:
http://miphol.com/muse/2013/06/tesseract-ocr-1.html
http://miphol.com/muse/2013/05/tesseract-ocr.html

Tesseract-OCR character recognition: sample training (using the jTessBoxEditor tool; fairly detailed)
http://blog.csdn.net/firehood_/article/details/8433077

Getting started with the Tesseract-OCR engine
http://blog.csdn.net/xiaochunyong/article/details/7193744

Official Tesseract configuration
http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html

Recognizing image CAPTCHAs with joined characters
http://wenku.baidu.com/view/343c200c581b6bd97f19ead9.html

Research on recognition techniques for CAPTCHAs with distorted and joined characters
http://wenku.baidu.com/view/45896630580216fc700afd16.html

-----------------------------------------------------------------------
wiki:
http://zh.wikipedia.org/zh-cn/captcha
http://en.wikipedia.org/wiki/Image_segmentation

Python Module for Mean Shift Image Segmentation:
http://code.google.com/p/pymeanshift/

Taobao CAPTCHA:
http://pin.aliyun.com/get_img?identity=taoquan.taobao.com&sessionid=1381293634479

CAPTCHA recognition tool: tesseract (the most detailed)
http://hilojack.sinaapp.com/?p=866

How to recognize advanced CAPTCHAs, from 鬼仔's Blog (the most advanced)
http://huaidan.org/archives/2085.html

A brief look at OCR with Tesseract:
http://www.cnblogs.com/brooks-dotnet/archive/2010/10/05/1844203.html

Summary of how to use tesseract-ocr:
http://hyhx2008.github.io/tesseract-ocrshi-yong-fang-fa-zong-jie.html

The open-source OCR engine Tesseract
http://hi.baidu.com/lifulinghan/item/b59af9eb1d92282d5a7cfb69

Cracking website CAPTCHAs with tesseract-ocr
http://grunt1223.iteye.com/blog/904313

breaking weak captcha in slightly more than 26 lines of groovy-code
http://www.kellyrob99.com/blog/2010/03/14/breaking-weak-captcha-in-slightly-more-than-26-lines-of-groovy-code/

Detailed usage of tesseract-ocr 3.02 (training the language data)
http://www.cnblogs.com/huyulin/p/3305563.html

Training and using tesseract-ocr3
http://www.cnblogs.com/zcsor/archive/2011/02/21/1959555.html

tesseract java api
http://stackoverflow.com/questions/13974645/using-tesseract-from-java

tesseract python api
http://code.google.com/p/pytesser/
https://github.com/rosarior/pytesser
https://code.google.com/p/pytesser/wiki/README

Recognizing CAPTCHAs: what is your success rate?
http://aoingl.iteye.com/blog/1389232
http://ptlogin.4399.com/ptlogin/captcha.do?captchaId=captchaReq011404b815f6235726
http://www.andrew.cmu.edu/user/ericwu/parch/finalreport.html

[1] L. von Ahn, M. Blum and J. Langford. Telling Humans and Computers Apart Automatically[J], Comm. of the ACM, 46(Aug. 2003), 57-60.
[2] K. Chellapilla, K. Larson, P. Simard and M. Czerwinski, Building Segmentation Based Human-friendly Human Interaction Proofs[C], 2nd Int’l Workshop on Human Interaction Proofs, Springer-Verlag, LNCS 3517, 2005.
[3] J. Yan and A. S. El Ahmad. Usability of CAPTCHAs - Or, Usability issues in CAPTCHA design[C], the fourth Symposium on Usable Privacy and Security, Pittsburgh, USA, July 2008.
[4] K. Chellapilla, K. Larson, P. Simard, M. Czerwinski, Computers beat humans at single character recognition in reading-based Human Interaction Proofs[C], In 2nd Conference on Email and Anti-Spam (CEAS’05), 2005.
[5] J. Yan and A. S. El Ahmad. A Low-cost Attack on a Microsoft CAPTCHA[C], 15th ACM Conference on Computer and Communications Security (CCS’08). Virginia, USA, Oct 27-31, 2008. ACM Press. 543-554.
[6] Microsoft Corporation. Human Interaction Proof (HIP) - Technical and Market Overview[J], 2006. Accessed in Jan 2011.
[7] J. Yan and A. S. El Ahmad. Breaking Visual CAPTCHAs with Naive Pattern Recognition Algorithms[C], in Proc. of the 23rd Annual Computer Security Applications Conference (ACSAC’07). FL, USA, Dec 2007. IEEE computer society. 279-291.
[8] G. Mori and J. Malik. Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA[C], IEEE Conference on Computer Vision and Pattern Recognition (CVPR'03), Vol 1, June 2003, 134-141.
[9] G. Moy, N. Jones, C. Harkless and R. Potter. Distortion estimation techniques in solving visual CAPTCHAs[C], IEEE CVPR, 2004.
[10] K. Chellapilla and P. Simard, Using Machine Learning to Break Visual Human Interaction Proofs[M], Neural Information Processing Systems (NIPS), MIT Press, 2004.
[11] L. von Ahn, M. Blum, N. J. Hopper, and J. Langford, CAPTCHA: Using hard AI problems for security[C]. Eurocrypt’2003.
[12] W. Zhang, J. Sun, and X. Tang. Cat head detection - how to effectively exploit shape and texture features[C]. In Proc. ECCV 2008, Part IV, LNCS 5305 (2008), 802–816.
[13] P. Golle. Machine learning attacks against the Asirra CAPTCHA[C]. In ACM CCS’2008, 535-542.
[14] http://recaptcha.net/learnmore.html, 2012-10-19.
[15] Elie Bursztein, Matthieu Martin, and John C. Mitchell. Text-based CAPTCHA strengths and weaknesses[C]. 18th ACM conference, 2011.
[16] Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum, 2008. reCAPTCHA: Human-Based Character Recognition via Web Security Measures[J]. Science, 321(5895):1465-1468.
[17] Li Ying. Web CAPTCHA generation and recognition[D]. Nanjing University of Science and Technology, graduate thesis, 2008.
[18] Zeidenberg, Matthew. Neural Networks in Artificial Intelligence[M]. Ellis Horwood Limited, 1990. ISBN 0-13-612185-3.
[19] Zhang Shuya, Zhao Yiming, Zhao Xiaoyu, et al. Research on character recognition methods for authentication codes[J]. Journal of Ningbo University (Natural Science and Engineering Edition), 2007, 12(4): 429-433.
[20] Pan Dafu, Wang Bo. A digit CAPTCHA recognition method based on outer contours[J]. Microcomputer Information (Measurement, Control and Automation), 2007, 23(9-1): 0256-0258.
[21] Jia Leilei, Chen Xihua, Xiong Chuan. Fuzzy recognition of CAPTCHAs[J]. Journal of Xichang College (Natural Science Edition), 2010, 24(1): 60-62.