
extract captcha image


Decoding CAPTCHA's
extract captcha image
OCR (Optical Character Recognition) is pretty accurate these days and can easily read printed text.
rails ocr
ruby ocr
break google captcha


http://stackoverflow.com/search?q=rails+ocr
http://www.wausita.com/captcha/

-----------------------------------------------------------

1. tesseract-x.xx.tar.gz contains all the source code.

2. tesseract-2.xx.<lang>.tar.gz contains the Tesseract 2 language data files for <lang>. You need at least one of these or Tesseract 2 will not work.

3. <lang>.traineddata.gz contains the Tesseract 3 language data file for <lang>. You need at least one of these or Tesseract 3 will not work.

4. Note that tesseract-2.04.tar.gz unpacks to the tesseract-2.04 directory.
tesseract-2.01.<lang>.tar.gz unpacks to the tessdata directory, which belongs
inside your tesseract-2.04 directory. It is therefore best to download the
language packs into your tesseract-2.04 directory, so you can use "unpack here"
or equivalent. You can unpack as many of the language packs as you care to, as
they all contain different files. Note that if you are using make install, you
should unpack your language data to your source tree before you run make
install. If you unpack them as root to the destination directory of make
install, the user ids and access permissions might be messed up.


If they are not already installed, you need the following libraries (Ubuntu):

sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
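sudo apt-get install zlibg-dev

E: Unable to locate package zlibg-dev
The package name above is misspelled. There is no need to build zlib from source; the correct package is zlib1g-dev:
sudo apt-get install zlib1g-dev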

You also need to install Leptonica. There is an apt-get package (libleptonica-dev on Debian and Ubuntu), or you can build it from the sources at http://www.leptonica.org/:

download Leptonica from http://www.leptonica.org/source/leptonlib-1.67.tar.gz
tar zxvf leptonlib-1.67.tar.gz

The instructions in the Leptonica README are clear, but basically it is the usual

./configure
make
sudo make install
sudo ldconfig

Now back to Tesseract. Download the source from svn:
svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
or get the tesseract-3.00.tar.gz package from the download page:

http://code.google.com/p/tesseract-ocr/downloads/list

The usual build process applies:

./runautoconf
./configure
make
sudo make install

sudo vi /etc/profile
vi ~/.bashrc

gunzip FileName.gz

   1. Download the language data file (e.g. 'wget http://tesseract-ocr.googlecode.com/files/eng.traineddata.gz')
   2. Decompress it ('gzip -d eng.traineddata.gz')
   3. Move it to the installation's tessdata directory (e.g. 'mv eng.traineddata $TESSDATA_PREFIX' if TESSDATA_PREFIX is defined)
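
Steps 2 and 3 can be wrapped in a small shell function. This is only a sketch: the function name install_traineddata and its two arguments are made up for illustration, and it assumes the .gz file has already been downloaded. It uses gzip -dc so the downloaded archive is left in place.

```shell
# Sketch: unpack a downloaded <lang>.traineddata.gz into a tessdata directory.
# install_traineddata is a hypothetical helper, not part of Tesseract.
install_traineddata() {
    src=$1     # e.g. eng.traineddata.gz
    dest=$2    # e.g. /usr/local/share/tessdata
    mkdir -p "$dest" || return 1
    # basename strips the directory and the .gz suffix: eng.traineddata
    gzip -dc "$src" > "$dest/$(basename "$src" .gz)"
}
```

For example, install_traineddata eng.traineddata.gz /usr/local/share/tessdata (run with sufficient permissions for the target directory).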


You may still get an error when trying to run tesseract:
$ tesseract foo.png bar

tesseract: error while loading shared libraries: libtesseract_api.so.3: cannot open shared object file: No such file or directory
You need to update the cache for the runtime linker. The following should get you up and running:
$ sudo ldconfig

--------------------------------------------------
copy eng.traineddata  to /usr/local/share/tessdata
pwd
/usr/local/share/tessdata
ls
configs  eng.traineddata  tessconfigs
-------------------------------------------------
tesseract digit only
improve tesseract digits  accuracy
use tesseract to get plain ascii text out of the bitmap.


curl 'http://www.stc.gov.cn/search/image_code.asp?rnd=0.7641146600113322' > /home/simon/Desktop/weizh/ca.jpg

tesseract ca.bmp outputbase -l eng
more outputbase.txt

tesseract ca.bmp outputbase nobatch digits
more outputbase.txt

Only JPG worked here:
curl 'http://www.stc.gov.cn/search/image_code.asp?rnd=0.7641146600111234' > ca.jpg
tesseract ca.jpg outputbase nobatch digits
cat outputbase.txt


Reloading /etc/profile

source ~/.profile
$ source /etc/profile

Settings in ~/.profile override those in /etc/profile. You can also use ~/.bash_profile in your home directory to customize your bash shell's profile.

Basically, if you need to load shell variables from any file, just run the .
(dot) command, followed by a space and the path to the file. Use an absolute
path (or at least one containing a slash), because a bare file name is looked
up in PATH. Be careful which file you load variables from, because you might
overwrite some important environment variables and your system could become
unstable.
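
As a small illustration of the dot command, the sketch below writes one variable to a temporary file and loads it into the current shell. The function name load_vars_demo and the variable DEMO_PREFIX are made up for this example.

```shell
# Sketch: load a variable from a file with the . (dot) command.
# load_vars_demo and DEMO_PREFIX are hypothetical names used only here.
load_vars_demo() {
    vars_file=$(mktemp) || return 1     # mktemp normally prints an absolute path
    printf 'DEMO_PREFIX=/usr/local/share/\n' > "$vars_file"
    . "$vars_file"                      # runs the file in the current shell
    rm -f "$vars_file"
    printf '%s\n' "$DEMO_PREFIX"        # the variable is now set in this shell
}
```

Because the file is executed in the current shell rather than in a child process, the variable is still set after the dot command returns.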

$ tesseract wenzhou.jpeg outputbase -l eng
Error openning data file /usr/local/sharetessdata/eng.traineddata
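=> Workaround: cp eng.traineddata to /usr/local/sharetessdata (note the missing slash between "share" and "tessdata" in the path tesseract looks for)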
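
The path in the error message looks like a prefix and "tessdata/" joined without a separator, which is what you would get if TESSDATA_PREFIX were set to /usr/local/share without a trailing slash. The sketch below only mimics that string concatenation; that Tesseract 3.00 builds the path this way is an assumption drawn from the error message, and both function names are made up for illustration.

```shell
# Assumption: the data path is built as "$TESSDATA_PREFIX" + "tessdata/" + file.
# data_file_path and with_trailing_slash are hypothetical helpers.
data_file_path() {
    # $1 = prefix, $2 = language code
    printf '%stessdata/%s.traineddata\n' "$1" "$2"
}

with_trailing_slash() {
    # Append a slash only when the prefix does not already end in one.
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}
```

If that assumption holds, exporting TESSDATA_PREFIX=/usr/local/share/ (with the trailing slash) in /etc/profile or ~/.bashrc would let tesseract find /usr/local/share/tessdata/eng.traineddata without the extra copy.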


cd /home/simon/Desktop/weizh
curl 'http://117.36.53.122:9081/wfcx/servlet/ValidateCodeServlet?t=1304472587796' > xian.png
tesseract xian.png out /usr/local/share/tessdata/tessconfigs/nobatch /usr/local/share/tessdata/configs/digits

Querying with the recognized code returned the error page below, presumably because the code was submitted outside the session that the captcha image belonged to:


<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<script>
alert("验证码错误!"); // "Incorrect verification code!"
window.close();
</script>
</head>
</html>

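Save the session cookie while fetching the captcha, then send it back with the query so that both requests belong to the same session:

curl --cookie-jar newcookies.txt 'http://117.36.53.122:9081/wfcx/servlet/ValidateCodeServlet?t=1304494360513'  > xian.png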

curl --cookie newcookies.txt 'http://117.36.53.122:9081/wfcx/query.do?actiontype=vioSurveil&vcode=2148&hpzl=02&hphm=AUL695&tj=CLSBDH&tj_val=LFV2A11GX93178557'

tesseract xian.png out /usr/local/share/tessdata/tessconfigs/nobatch /usr/local/share/tessdata/configs/digits



-----------------------------------

cd /usr/local/sharetessdata:
eng.traineddata


/usr/local/share/tessdata:
chi_sim.traineddata 
configs 
eng.traineddata 
tessconfigs


-----------------------------------

$ sudo apt-get install imagemagick
$ dpkg -l |grep imagemagick
imagemagick                                                 
imagemagick-doc                           

$ convert
$ whereis convert
$ which convert
$ convert -compress none -depth 8 -alpha off zhejiang.gif zhejiang.tif

Enlarging the image can improve OCR accuracy.
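
One way to enlarge the image is ImageMagick's -resize option, combined with the same convert options used for the GIF to TIFF conversion. A minimal sketch: the function name enlarge_for_ocr, the grayscale step, and the default scale of 300% are assumptions rather than values from these notes, so tune the factor per image.

```shell
# Sketch: upscale an image and write an uncompressed 8-bit grayscale TIFF.
# enlarge_for_ocr is a hypothetical helper; the 300% default is a guess.
enlarge_for_ocr() {
    # $1 = input image, $2 = output TIFF, $3 = scale in percent (optional)
    convert "$1" -resize "${3:-300}%" -colorspace Gray \
        -compress none -depth 8 -alpha off "$2"
}
```

For example, enlarge_for_ocr zhejiang.gif zhejiang.tif, followed by tesseract zhejiang.tif out nobatch digits.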

I believe the real challenge in applying OCR to plate recognition is
that plate images are "too dirty" compared to paper documents.
There are frames, skews, uneven shadows, etc. You have to do your own
work to split the plate into separate chars and feed them to the OCR
engine. I don't think tesseract itself can handle this automatically
given the raw image. But I believe it will do pretty well once you get
the binarized separate chars. Basically, plate recognition is more an
image processing problem than an OCR problem.

You can use the grammar as a post-processing step to make corrections.
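
For a digits-only code the "grammar" is very simple: a fixed number of digits. A minimal sketch of such a post-processing step, assuming the OCR text is piped through the shell. The function names, the look-alike table (O to 0, l to 1, and so on) and the code length in the usage line are illustrative assumptions, not something tesseract provides.

```shell
# Sketch: correct OCR output using the rule "the code consists only of digits".
# fix_digits and is_valid_code are hypothetical helpers.
fix_digits() {
    # Map common look-alike letters to digits, then drop everything
    # that is neither a digit nor a newline (spaces, punctuation, ...).
    tr 'OoIlSB' '001158' | tr -cd '0-9\n'
}

is_valid_code() {
    # $1 = candidate code, $2 = expected length
    case "$1" in
        ''|*[!0-9]*) return 1 ;;    # empty or contains a non-digit
    esac
    [ "${#1}" -eq "$2" ]
}
```

For example, code=$(fix_digits < out.txt) followed by is_valid_code "$code" 4 tells you whether the result is worth submitting or whether a fresh image should be fetched instead.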


To convert the PDF I used ImageMagick's convert application. Below is the set of commands that I use (tesseract appends .txt to the output base name, so the text ends up in out.txt):
convert -density 288 src.pdf -colorspace Gray -depth 8 -alpha off tmp.tif
tesseract tmp.tif out

how to eliminate noise
