SOLR: tika with OCR engine

ylzhj02

Blog category: Solr

I want to extract the text content of a JPEG image, not just its metadata.

The following test class does this:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.ocr.TesseractOCRParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class JpegParse {

    public static void main(final String[] args) throws IOException, SAXException, TikaException {
        File file = new File("/path/to/menu.jpg");

        // Collects the text that the OCR engine extracts from the image.
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        TesseractOCRConfig config = new TesseractOCRConfig();
        // OCR language handed to Tesseract.
        config.setLanguage("chi");
        // Parent directory of tessdata; the tesseract command must be linked here too.
        config.setTesseractPath("/path/to/tesseract-ocr");
        pcontext.set(TesseractOCRConfig.class, config);

        TesseractOCRParser jpegParser = new TesseractOCRParser();
        pcontext.set(TesseractOCRParser.class, jpegParser);

        // try-with-resources closes the stream even if parsing fails.
        try (FileInputStream inputstream = new FileInputStream(file)) {
            jpegParser.parse(inputstream, handler, metadata, pcontext);
        }

        System.out.println("Metadata of the document:");
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
        System.out.println("Contents of the document:" + handler.toString());
    }
}

Note:

config.setTesseractPath("/path/to/tesseract-ocr");

The path must be the parent directory that contains the tessdata directory.

The tesseract command must also be linked into this directory:

# ln -s /usr/local/bin/tesseract /path/to/tesseract-ocr
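
The layout described in the note can be checked before the parser runs. The class below is only a rough sketch, not part of Tika: TesseractLayoutCheck and check are hypothetical names, it assumes a Unix-like system where the linked command is called tesseract, and it assumes the language data is stored as tessdata/<language>.traineddata, which is the usual Tesseract convention.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tika): verifies the directory layout
// that the note above asks for.
public class TesseractLayoutCheck {

    // Returns one message per problem found; an empty list means the layout looks usable.
    static List<String> check(Path tesseractDir, String language) {
        List<String> problems = new ArrayList<>();

        // The configured path must be the parent of the tessdata directory.
        Path tessdata = tesseractDir.resolve("tessdata");
        if (!Files.isDirectory(tessdata)) {
            problems.add("missing directory: " + tessdata);
        }

        // Assumption: language data lives in tessdata/<language>.traineddata.
        Path trained = tessdata.resolve(language + ".traineddata");
        if (!Files.isRegularFile(trained)) {
            problems.add("missing language data: " + trained);
        }

        // The tesseract command (or a symlink to it) must sit in the same directory.
        // Both checks follow symbolic links.
        Path command = tesseractDir.resolve("tesseract");
        if (!Files.isRegularFile(command) || !Files.isExecutable(command)) {
            problems.add("missing or non-executable command: " + command);
        }
        return problems;
    }

    public static void main(String[] args) {
        // Defaults mirror the placeholder values used in the test class above.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/path/to/tesseract-ocr");
        String language = args.length > 1 ? args[1] : "chi";

        List<String> problems = check(dir, language);
        if (problems.isEmpty()) {
            System.out.println("Layout looks fine: " + dir);
        } else {
            for (String problem : problems) {
                System.out.println(problem);
            }
        }
    }
}
```

Running it with the same directory and language that are passed to setTesseractPath and setLanguage prints one line per missing piece, which makes an incomplete setup easier to spot than an empty OCR result.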

References

https://wiki.apache.org/tika/TikaOCR

http://www.kaiyuanba.cn/html/1/131/227/7891.htm

2015-06-12 15:03