TextExtract(1)Tika Basic
1. Introduction
Tika supports many different file formats, including audio, video, image and text files.
The Tika bundle ships tika-app, a single runnable jar that provides both a GUI and a command-line tool.
The main pieces are:
Command-line interface + GUI
Language identifier + Tika facade + MIME type detection
Parser
There are 3 files to download:
http://mirrors.sonic.net/apache/tika/tika-server-1.10.jar
http://apache.mirrors.hoobly.com/tika/tika-app-1.10.jar
http://ftp.wayne.edu/apache/tika/tika-1.10-src.zip
The source code is managed by Maven, so I can build it directly.
> mvn clean install -DskipTests=true
tika-app can be started from the command line or by double-clicking the jar.
> java -jar tika-app-1.10.jar --gui
Then we can choose files and switch views to see the different content Tika extracts from them.
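Besides the GUI, the same jar can be driven from the command line, and that can be scripted from Java as well. The following is only a minimal sketch, assuming tika-app's --text option (plain text output) and a java executable on the PATH; the class name TikaAppCli and its two helper methods are made up for this example and are not part of Tika.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

class TikaAppCli {

    // Builds the command line that asks tika-app for plain text output.
    // jarPath and documentPath are supplied by the caller (hypothetical helper).
    static List<String> textCommand(String jarPath, String documentPath) {
        return Arrays.asList("java", "-jar", jarPath, "--text", documentPath);
    }

    // Starts the command and returns everything it prints (stdout and stderr merged).
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        Process process = builder.start();
        StringBuilder output = new StringBuilder();
        // Reads with the platform default charset, which is an assumption about the output
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        process.waitFor();
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: TikaAppCli <tika-app jar> <document>");
            return;
        }
        System.out.print(run(textCommand(args[0], args[1])));
    }
}
```

For example, running it with tika-app-1.10.jar and a PDF path as the two arguments should print the text that tika-app extracts from that PDF.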
2. Try the Packages in Java Code
The simplest Java code to fetch the content of a file:
package com.sillycat.resumeparse;

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-resume.pdf";

    public static void main(String[] args) {
        // Create a Tika instance with the default configuration
        Tika tika = new Tika();
        // Parse the given file and print out the extracted text content
        String text = null;
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        System.out.print(text);
    }
}
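The example above handles a single file. If there is a whole directory of resumes, the same call can be applied file by file. Below is a rough sketch of that loop, not something from the original project: BatchExtract, TextExtractor and extractAll are hypothetical names, and the extraction step is passed in as a callback so the loop itself needs nothing beyond the JDK.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

class BatchExtract {

    // Hypothetical callback: anything that turns a file into text.
    interface TextExtractor {
        String extract(Path file) throws Exception;
    }

    // Applies the extractor to every regular file directly inside dir.
    // Returns file name -> extracted text, sorted by file name.
    // A failure on one file is recorded as null so the remaining files still run.
    static Map<String, String> extractAll(Path dir, TextExtractor extractor) throws IOException {
        Map<String, String> results = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip sub-directories and other non-files
                }
                String name = file.getFileName().toString();
                try {
                    results.put(name, extractor.extract(file));
                } catch (Exception e) {
                    results.put(name, null);
                }
            }
        }
        return results;
    }
}
```

With Tika on the classpath, the callback could be written as p -> tika.parseToString(p.toFile()), reusing the Tika instance from the example above.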
Fetch the Metadata and Identify the Language
package com.sillycat.resumeparse;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-duffy.pdf";

    public static void main(String[] args) {
        Parser parser = new AutoDetectParser();
        // -1 disables the default write limit of 100,000 characters
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        // A single parse fills both the handler (content) and the metadata.
        // Parsing again into the same handler would append the text a second time.
        try (InputStream stream = new FileInputStream(file)) {
            parser.parse(stream, handler, metadata, context);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
            return;
        }

        // fetch the content
        String text = handler.toString();
        // System.out.print(text);

        // fetch the meta
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }

        // identify language
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        System.out.println("Language name: " + identifier.getLanguage());
    }
}
References:
https://tika.apache.org/
https://github.com/luohuazju/sillycat-resume-parse
http://itindex.net/detail/41933-apache-tika-%E9%80%9A%E7%94%A8
Books:
Tika in Action
http://m.yiibai.com/tika/tika_content_extraction.html