TextExtract(1)Tika Basic

sillycat

1. Introduction
Apache Tika extracts text and metadata from many different file formats, including audio, video, images and text documents.
The Tika distribution includes tika-app, a single jar that bundles the library with a GUI and a command-line tool.

Command-line interface + GUI
Language identifier + Tika Facade + MIME Type
Parser
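
To make the "MIME Type" piece above more concrete, here is a minimal sketch of magic-byte sniffing, one of the signals a type detector can use. This is not Tika's implementation; the class name MagicSniffer and the method sniff are made up for illustration, and only three well-known file signatures are checked.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Toy magic-byte sniffer (hypothetical helper, NOT Tika's detector).
class MagicSniffer {

    // Well-known leading bytes of three common formats.
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    // Guesses a MIME type from the first bytes of the data.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        if (startsWith(head, ZIP)) return "application/zip";
        return "application/octet-stream"; // generic fallback for unknown data
    }

    // True if data begins with exactly the bytes in magic.
    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java MagicSniffer.java /path/to/file
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            byte[] head = new byte[8];
            int n = Math.max(in.read(head), 0); // read() returns -1 on an empty file
            System.out.println(sniff(Arrays.copyOf(head, n)));
        }
    }
}
```

A real detector has to go further than this. For example, many office formats are ZIP containers, so the leading bytes alone cannot tell them apart.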

There are 3 files:
http://mirrors.sonic.net/apache/tika/tika-server-1.10.jar
http://apache.mirrors.hoobly.com/tika/tika-app-1.10.jar
http://ftp.wayne.edu/apache/tika/tika-1.10-src.zip
The source code is managed by Maven, so I can build it directly.
> mvn clean install -DskipTests=true

The GUI can be started from the command line or by double-clicking the tika-app jar.
> java -jar tika-app-1.10.jar --gui

Then we can open files and switch between views to see the different content extracted from them.

2. Try the Packages in Java Code
The simplest Java code to fetch the content of a file:
package com.sillycat.resumeparse;

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-resume.pdf";

    public static void main(String[] args) {
        // Create a Tika instance with the default configuration
        Tika tika = new Tika();
        // Parse all given files and print out the extracted text content
        String text = null;
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        System.out.print(text);
    }
}

Fetch the Metadata and Identify the Language
package com.sillycat.resumeparse;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-duffy.pdf";

    public static void main(String[] args) {
        Tika tika = new Tika();
        String text = null;
        Parser parser = new AutoDetectParser();
        // -1 disables the default 100,000-character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        ParseContext context = new ParseContext();
        Metadata metadata = new Metadata();

        // fetch the content
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.print(text);

        // fetch the metadata; try-with-resources closes the stream
        try (FileInputStream in = new FileInputStream(file)) {
            parser.parse(in, handler, metadata, context);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.println(handler.toString());

        // print every metadata field found in the document
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }

        // identify the language from the body text collected above;
        // parsing again into the same handler would only duplicate the text
        LanguageIdentifier object = new LanguageIdentifier(handler.toString());
        System.out.println("Language name :" + object.getLanguage());
    }
}
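
For intuition about the language identification step, here is a toy sketch that compares character trigram profiles. My understanding is that Tika's identifier is also n-gram based, but this is not its implementation. The class TrigramGuesser, its methods, and the two tiny reference sentences are assumptions made up for illustration; a real identifier relies on profiles built from much larger text.

```java
import java.util.HashMap;
import java.util.Map;

// Toy trigram language guesser (hypothetical, NOT Tika's LanguageIdentifier).
class TrigramGuesser {

    // Counts overlapping 3-character sequences in the lower-cased text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram profiles (0 means nothing shared).
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the key of the reference sample whose profile is closest to the text.
    static String guess(String text, Map<String, String> samples) {
        Map<String, Integer> target = profile(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, String> e : samples.entrySet()) {
            double score = similarity(target, profile(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Tiny made-up reference samples, far too small for real use.
        Map<String, String> samples = new HashMap<>();
        samples.put("en", "the quick brown fox jumps over the lazy dog and the cat");
        samples.put("de", "der schnelle braune fuchs springt ueber den faulen hund und die katze");
        System.out.println(guess("the cat and the dog", samples)); // prints "en"
    }
}
```

Short inputs share few trigrams with any profile, so guesses on a handful of words are unreliable. That is one reason to feed the identifier the full extracted body text, as the code above does.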

References:
https://tika.apache.org/
https://github.com/luohuazju/sillycat-resume-parse
http://itindex.net/detail/41933-apache-tika-%E9%80%9A%E7%94%A8

Books:
Tika in Action

http://m.yiibai.com/tika/tika_content_extraction.html

2015-10-13 23:39