Training a CAPTCHA recognizer: training tesseract

lineageII

Google thread

Simplest steps to train tesseract

Reference

http://groups.google.com/group/tesseract-ocr/browse_thread/thread/983317066a5acbd1/58ccdd7c1da5884e?lnk=gst&q=train#58ccdd7c1da5884e

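1. Collect CAPTCHA images. Binarize all of them and remove noise pixels, then use Photoshop to merge them into a single image (as in the attached picture) and convert that image to TIFF format, e.g. scan.tif.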
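The binarize and denoise part of step 1 can be sketched in a few lines. This is only an illustration on a plain list-of-lists grayscale image, not the procedure the author used: the fixed threshold of 128, the "no ink neighbour" noise rule, and the function names are all assumptions made for the example.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale values, 0 (black) to 255 (white)


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Map each pixel to 0 (ink) or 255 (background) with a fixed threshold."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


def remove_isolated_pixels(binary: Image) -> Image:
    """Clear ink pixels that have no ink pixel among their 8 neighbours."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 0:
                continue  # background pixel, nothing to clean
            has_ink_neighbour = any(
                binary[ny][nx] == 0
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbour:
                cleaned[y][x] = 255  # a lone speck is treated as noise
    return cleaned
```

Real CAPTCHAs usually need a threshold and noise rule tuned to the specific image style; the point here is only that both operations happen before the images are merged into scan.tif.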

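2. Generate the box file

Run "tesseract scan.tif scan batch.nochop makebox". This produces a text file, scan.txt. Correct any misrecognized characters in it, then rename scan.txt to scan.box. (bbtesseract can be used for this step instead; download it from http://code.google.com/p/bbtesseract/downloads/list)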
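Correcting the box file by hand amounts to editing the first field of a line. The sketch below assumes the box format of one line per character, "char left bottom right top" separated by spaces; the helper names are hypothetical, and any extra trailing field is simply ignored.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line: str) -> Box:
    """Parse one box line; fields beyond the first five are ignored."""
    char, left, bottom, right, top = line.split()[:5]
    return Box(char, int(left), int(bottom), int(right), int(top))


def parse_box_file(text: str) -> List[Box]:
    """Parse every non-blank line of a box file."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


def replace_char(boxes: List[Box], index: int, char: str) -> List[Box]:
    """Return a copy of the list with the character of one box corrected."""
    fixed = list(boxes)
    fixed[index] = fixed[index]._replace(char=char)
    return fixed


def format_boxes(boxes: List[Box]) -> str:
    """Serialize boxes back to the one-line-per-character text form."""
    return "".join(
        f"{b.char} {b.left} {b.bottom} {b.right} {b.top}\n" for b in boxes
    )
```

For example, if the second character was recognized as "8" but is really "B", `replace_char(boxes, 1, "B")` followed by `format_boxes` gives the text to save as scan.box.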

3. Start training tesseract

Run "tesseract scan.tif junk nobatch box.train". This generates the file scan.tr.

4. Clustering

Run "mftraining scan.tr". This generates the files "inttemp", "pffmtable" and "Microfeat" (not used).

Run "cntraining scan.tr". This generates the file "normproto".

5. Compute the character set

Run "unicharset_extractor scan.box". This generates the file "unicharset".

6. Dictionary data

This step can be skipped by copying the dictionary files from another language.

Create two UTF-8 text files, "frequent_words_list" and "words_list".
Neither file should contain duplicate words.
Run "wordlist2dawg frequent_words_list freq-dawg"
Run "wordlist2dawg words_list word-dawg"
This generates two files, "freq-dawg" and "word-dawg".
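Since the word lists must be UTF-8 and free of duplicates, a small helper can prepare them before wordlist2dawg is run. This is a sketch only; the function names are hypothetical, and stripping whitespace and dropping blank entries are assumptions beyond what the steps state.

```python
from pathlib import Path
from typing import Iterable, List


def unique_words(words: Iterable[str]) -> List[str]:
    """Drop blanks and duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            result.append(word)
    return result


def write_word_list(path, words: Iterable[str]) -> int:
    """Write one word per line as UTF-8 and return how many were written."""
    cleaned = unique_words(words)
    Path(path).write_text("".join(w + "\n" for w in cleaned), encoding="utf-8")
    return len(cleaned)
```

Calling `write_word_list("words_list", candidates)` and `write_word_list("frequent_words_list", common)` would produce the two input files named in this step.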

7. Putting it all together
All you need to do now is collect all 8 files and rename them with a
language prefix (for example "eng.").
The files "eng.DangAmbigs" and "eng.user-words" may be empty.
If you do create "eng.DangAmbigs", every character in it must also
appear in "scan.box".
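The renaming in step 7 can be scripted. The sketch below copies the six generated files under a language prefix and creates the two optional files empty when they are missing. The function name and the `work_dir` and `target_dir` parameters are hypothetical; where the prefixed files must finally live depends on your tesseract installation and is not covered here.

```python
import shutil
from pathlib import Path
from typing import List

# Files produced by the training and dictionary steps.
REQUIRED = ["inttemp", "pffmtable", "normproto", "unicharset",
            "freq-dawg", "word-dawg"]
# Files that are allowed to be empty.
MAY_BE_EMPTY = ["DangAmbigs", "user-words"]


def collect_language_files(work_dir, target_dir, lang: str = "eng") -> List[Path]:
    """Copy the 8 files to target_dir as '<lang>.<name>' and return the paths."""
    work, target = Path(work_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    collected = []
    for name in REQUIRED + MAY_BE_EMPTY:
        src = work / name
        dst = target / f"{lang}.{name}"
        if src.exists():
            shutil.copyfile(src, dst)
        elif name in MAY_BE_EMPTY:
            dst.touch()  # an empty file is acceptable for these two
        else:
            raise FileNotFoundError(f"missing training output: {src}")
        collected.append(dst)
    return collected
```

Raising on a missing required file is a deliberate choice in this sketch: an incomplete set of language files is easier to diagnose at copy time than when recognition is run.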

8. Try it
Run "tesseract scan.tif output -l eng"
The result is written to "output.txt".

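Quick steps

1. Collect CAPTCHA images. Binarize all of them and remove noise pixels, then use Photoshop to merge them into a single image (as in the attached picture) and convert that image to TIFF format, e.g. scan.tif.

2. Generate the box file

3. Copy all files from tesseract's training directory into the directory that contains tesseract.exe, and create a batch file in that same directory with the following commands: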

tesseract scan.tif junk nobatch box.train
mftraining scan.tr
cntraining scan.tr
unicharset_extractor scan.box

After the batch file runs, the generated files that matter are inttemp, normproto, pffmtable and unicharset.
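The same four commands can be driven from a script instead of a batch file. The sketch below only assembles the command lines listed above and runs them in order; the function names and the `stem` parameter are hypothetical, and it assumes the training executables are on the PATH or in the current directory.

```python
import subprocess
from typing import List


def training_commands(stem: str = "scan") -> List[List[str]]:
    """Build the four training commands for '<stem>.tif' and '<stem>.box'."""
    return [
        ["tesseract", f"{stem}.tif", "junk", "nobatch", "box.train"],
        ["mftraining", f"{stem}.tr"],
        ["cntraining", f"{stem}.tr"],
        ["unicharset_extractor", f"{stem}.box"],
    ]


def run_training(stem: str = "scan") -> None:
    """Run each command in order, stopping at the first failure."""
    for command in training_commands(stem):
        subprocess.run(command, check=True)
```

With `check=True`, a failing step raises `subprocess.CalledProcessError` right away, so later steps never run against missing or stale output.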

2008-12-02 17:22

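#4 luohoufu 2009-05-02

I don't understand what the purpose of doing this is.

#3 diddyrock 2009-02-21

v861 wrote:

A question: can tesseract recognize Chinese? How would that be done? Let's compare notes: wanglm@live.cn. Thanks!

Yes, it can. There is cntraining; write your own training file and follow the steps on the wiki.

#2 v861 2009-02-17

A question: can tesseract recognize Chinese? How would that be done? Let's compare notes: wanglm@live.cn
Thanks!

#1 mslk 2008-12-14

tesseract really is a great tool and well worth studying.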
