gushuizerotoone

First Contact with UIMA

http://uima.apache.org/downloads/releaseDocs/2.2.2-incubating/docs/html/overview_and_setup/overview_and_setup.html#ugr.ovv.eclipse_setup
1.2 How to use the documentation
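Artifact: text, audio, video, images, and so on.
Analysis Engines (AEs): analyze artifacts.

Analysis results: the output produced by AEs; they are metadata about the original artifact. Depending on what you want to analyze, they take the form of a series of statements, for example: "Bush" denotes a person; the topic of the document is "Bush and Golf".

Type: a pre-defined term, such as person or the topic of the document. Types describe the kinds of results you want the AE to produce.

Annotation type: a type whose begin and end features record positions in the document.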

tightly-coupled: running in the same process
loosely-coupled: running in separate processes or even on different machines

step 2:
Different component analytics solve different parts of the analysis task; for example, one analyzes person names and another analyzes relationships between persons. These component analytics need to be easy to assemble.

Annotators: the core of an AE. Annotators do the actual analysis work; inside an annotator, developers define the analysis they want to perform.

CAS (Common Analysis Structure): represents analysis results.

Component Descriptors (CD): written in XML; a descriptor contains metadata describing the component, its identity, structure and behavior.

Delegate analysis engines: the internal AEs specified in an aggregate are also called the delegate analysis engines.

Analysis Engine Assembler: the development role associated with building an aggregate from delegate AEs.

Collection Reader: its job is to connect to and iterate through a source collection, acquiring documents and initializing CASes for analysis.

CAS Consumers: Their job is to do the final CAS processing. A CAS Consumer may be implemented, for example, to index CAS contents in a search engine.

Collection Processing Engine (CPE): is an aggregate component that specifies a "source to sink" flow from a Collection Reader through a set of analysis engines and then to a set of CAS Consumers.

CPE Descriptors: CPEs are specified by XML files called CPE Descriptors.
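
The "source to sink" flow of a CPE can be pictured with a small plain-Java sketch. ToyCas and ToyCpe are hypothetical names made up for this note and are not UIMA classes; a real CPE is declared in a CPE Descriptor rather than wired by hand, so treat this only as an illustration of the order of the steps.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a CAS: the document text plus a list of results.
class ToyCas {
    final String documentText;
    final List<String> results = new ArrayList<>();

    ToyCas(String documentText) {
        this.documentText = documentText;
    }
}

// Hypothetical stand-in for a CPE: reader -> analysis engines -> consumers.
class ToyCpe {
    // Creates one ToyCas per document, runs every engine in order,
    // then hands the same ToyCas to every consumer. Returns the document count.
    static int run(Iterator<String> reader,
                   List<Consumer<ToyCas>> engines,
                   List<Consumer<ToyCas>> consumers) {
        int processed = 0;
        while (reader.hasNext()) {
            ToyCas cas = new ToyCas(reader.next());
            for (Consumer<ToyCas> engine : engines) {
                engine.accept(cas);
            }
            for (Consumer<ToyCas> consumer : consumers) {
                consumer.accept(cas);
            }
            processed++;
        }
        return processed;
    }
}
```

The reader here is just an Iterator over document strings, which mirrors the "connect to and iterate through a source collection" role described above.
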
2. UIMA is a bridge between unstructured information and structured information. Its role is to analyze the content of unstructured information and produce structured information.

3. For each person found in the body of a document, the AE would create a Person object in the CAS and link it to the span of text where the person was mentioned in the document.

4. UIMA Context: can ensure that different annotators working together in an aggregate flow share the same instance of an external file.

5. A simple or primitive UIMA Analysis Engine (AE) contains a single annotator. AEs, however, may be defined to contain other AEs organized in a workflow.

6. Users of this AE need not know how it is constructed internally but only need its name and its published input requirements and output types. These must be declared in the aggregate AE's descriptor. Aggregate AE descriptors declare the components they contain and a flow specification. The flow specification defines the order in which the internal component AEs should be run.
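
A rough sketch of the aggregate idea in plain Java: an aggregate exposes the same interface as a primitive engine, holds named delegates, and runs them in the order given by a flow. ToyEngine and ToyAggregateEngine are hypothetical names for illustration only; in UIMA the delegates and the flow are declared in the aggregate's XML descriptor, not built in code like this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical engine interface; the List<String> stands in for a CAS.
interface ToyEngine {
    void process(List<String> cas);
}

// Hypothetical aggregate: callers see only process(), not the delegates inside.
class ToyAggregateEngine implements ToyEngine {
    private final Map<String, ToyEngine> delegates = new LinkedHashMap<>();
    private final List<String> flow = new ArrayList<>();

    // Declares a delegate under a key.
    ToyAggregateEngine delegate(String key, ToyEngine engine) {
        delegates.put(key, engine);
        return this;
    }

    // Declares the order in which the delegates run.
    ToyAggregateEngine flow(String... keys) {
        for (String key : keys) {
            if (!delegates.containsKey(key)) {
                throw new IllegalArgumentException("Unknown delegate: " + key);
            }
            flow.add(key);
        }
        return this;
    }

    @Override
    public void process(List<String> cas) {
        for (String key : flow) {
            delegates.get(key).process(cas);
        }
    }
}
```

Because ToyAggregateEngine is itself a ToyEngine, one aggregate can be used as a delegate of another, which is the nesting described in item 5.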

7. We refer to the development role associated with building an aggregate from delegate AEs as the Analysis Engine Assembler.

8. The UIMA framework implementation has two built-in flow implementations: one that supports a linear flow between components, and one with conditional branching based on the language of the document. It also supports user-provided flow controllers.
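
To make the difference between the two built-in flows concrete, here is a simplified plain-Java sketch. ToyDelegate and ToyFlows are hypothetical names, and the language rule shown (skip a delegate that does not list the document language) is only an approximation; the real language-based flow may take more into account than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical delegate description: a key plus the languages it handles.
class ToyDelegate {
    final String key;
    final Set<String> languages; // empty set means "any language" in this sketch

    ToyDelegate(String key, String... languages) {
        this.key = key;
        this.languages = new HashSet<>(Arrays.asList(languages));
    }

    boolean supports(String language) {
        return languages.isEmpty() || languages.contains(language);
    }
}

class ToyFlows {
    // Linear flow: every delegate runs, in declaration order.
    static List<String> linear(List<ToyDelegate> delegates) {
        List<String> steps = new ArrayList<>();
        for (ToyDelegate d : delegates) {
            steps.add(d.key);
        }
        return steps;
    }

    // Language-based flow: only delegates that support the document language run.
    static List<String> byLanguage(List<ToyDelegate> delegates, String documentLanguage) {
        List<String> steps = new ArrayList<>();
        for (ToyDelegate d : delegates) {
            if (d.supports(documentLanguage)) {
                steps.add(d.key);
            }
        }
        return steps;
    }
}
```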

9. The application then decides what to do with the returned CAS. There are many possibilities. For instance the application could: display the results, store the CAS to disk for post processing, extract and index analysis results as part of a search or database application etc.

10. An Analysis Engine (AE) may contain a single annotator (this is referred to as a Primitive AE), or it may be a composition of others and therefore contain multiple annotators (this is referred to as an Aggregate AE).

Annotators produce their analysis results in the form of typed Feature Structures, which are simply data structures that have a type and a set of (attribute, value) pairs.
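
As a mental model, a feature structure can be sketched as a type name plus a map of (attribute, value) pairs, where a value may itself be another feature structure. ToyFeatureStructure is a hypothetical class written for this note and is not how the CAS stores data internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a typed feature structure: a type and named features.
class ToyFeatureStructure {
    final String type;
    private final Map<String, Object> features = new LinkedHashMap<>();

    ToyFeatureStructure(String type) {
        this.type = type;
    }

    // Sets a feature; the value may be a primitive wrapper, a String,
    // or another ToyFeatureStructure.
    ToyFeatureStructure set(String feature, Object value) {
        features.put(feature, value);
        return this;
    }

    Object get(String feature) {
        return features.get(feature);
    }
}
```

For example, a "Topic" structure could carry an "about" feature whose value is a "Person" structure with a "name" feature of "Bush".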

All feature structures, including annotations, are represented in the UIMA Common Analysis Structure(CAS).

UIMA provides a native Java interface to the CAS called the JCas. The JCas represents each feature structure as a Java object.

Keep in mind that the CAS can represent arbitrary types of feature structures, and feature structures can refer to other feature structures.

UIMA defines basic primitive types such as Boolean, Byte, Short, Integer, Long, Float, and Double, as well as Arrays of these primitive types. UIMA also defines the built-in types TOP, which is the root of the type system, analogous to Object in Java; FSArray, which is an array of Feature Structures (i.e. an array of instances of TOP); and Annotation.

The built-in Annotation type declares three fields (called Features in CAS terminology). The features begin and end store the character offsets of the span of text to which the annotation refers. The feature sofa (Subject of Analysis) indicates which document the begin and end offsets point into.
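
The three features can be illustrated with a tiny plain-Java model. ToySofa and ToyAnnotation are hypothetical classes, not the UIMA Annotation type; the sketch only shows how begin and end act as character offsets into the text of one particular subject of analysis.

```java
// Hypothetical subject of analysis: an id plus the document text.
class ToySofa {
    final String id;
    final String text;

    ToySofa(String id, String text) {
        this.id = id;
        this.text = text;
    }
}

// Hypothetical annotation with the three features: sofa, begin, end.
class ToyAnnotation {
    final String type;
    final ToySofa sofa;
    final int begin; // offset of the first character of the span
    final int end;   // offset just past the last character of the span

    ToyAnnotation(String type, ToySofa sofa, int begin, int end) {
        if (begin < 0 || end < begin || end > sofa.text.length()) {
            throw new IllegalArgumentException("Offsets out of range: " + begin + ", " + end);
        }
        this.type = type;
        this.sofa = sofa;
        this.begin = begin;
        this.end = end;
    }

    // The span of text the annotation refers to.
    String coveredText() {
        return sofa.text.substring(begin, end);
    }
}
```

With the text "Bush plays golf.", a Person annotation with begin 0 and end 4 covers the string "Bush".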

Annotator implementations all implement a standard interface (AnalysisComponent), having several methods, the most important of which are: 1) initialize, 2) process, 3) destroy.

initialize is called by the framework once when it first creates an instance of the annotator class. process is called once per item being processed. destroy may be called by the application when it is done using your annotator. There is a default implementation of this interface for annotators using the JCas, called JCasAnnotator_ImplBase, which has implementations of all required methods except for the process method.
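
The call order can be sketched without the framework. Everything below is a hypothetical stand-in (ToyAnnotator, ToyAnnotatorBase, CallRecordingAnnotator, ToyLifecycle): the base class plays the role of a default implementation that leaves only process to the developer, and the driver plays the role of the framework or application.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface with the three lifecycle methods.
interface ToyAnnotator {
    void initialize();
    void process(String documentText);
    void destroy();
}

// Hypothetical base class: default no-op initialize and destroy,
// so a subclass only has to provide process.
abstract class ToyAnnotatorBase implements ToyAnnotator {
    @Override
    public void initialize() { }

    @Override
    public void destroy() { }
}

// Example annotator that records the calls it receives.
class CallRecordingAnnotator extends ToyAnnotatorBase {
    final List<String> calls = new ArrayList<>();

    @Override
    public void initialize() {
        calls.add("initialize");
    }

    @Override
    public void process(String documentText) {
        calls.add("process:" + documentText);
    }

    @Override
    public void destroy() {
        calls.add("destroy");
    }
}

class ToyLifecycle {
    // initialize once, process once per item, destroy when finished.
    static void runAll(ToyAnnotator annotator, List<String> documents) {
        annotator.initialize();
        try {
            for (String document : documents) {
                annotator.process(document);
            }
        } finally {
            annotator.destroy();
        }
    }
}
```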

We call annotation.addToIndexes() to add the new annotation to the indexes maintained in the CAS. By default, the CAS implementation used for analysis of text documents keeps an index of all annotations in their order from beginning to end of the document. Subsequent annotators or applications use the indexes to iterate over the annotations.

If you don't add the instance to the indexes, it cannot be retrieved by down-stream annotators.
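
A small plain-Java sketch of why indexing matters. ToySpan and ToyAnnotationIndex are hypothetical; the ordering used here (begin ascending, longer span first on ties) is an assumption chosen to approximate document order, not a statement of the exact rule UIMA applies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical annotation-like span.
class ToySpan {
    final int begin;
    final int end;
    final String label;

    ToySpan(int begin, int end, String label) {
        this.begin = begin;
        this.end = end;
        this.label = label;
    }
}

// Hypothetical index: only spans that were added can be iterated later.
class ToyAnnotationIndex {
    private final List<ToySpan> indexed = new ArrayList<>();

    void addToIndexes(ToySpan span) {
        indexed.add(span);
    }

    // Returns the indexed spans from the beginning to the end of the document.
    List<ToySpan> iterate() {
        List<ToySpan> sorted = new ArrayList<>(indexed);
        sorted.sort((a, b) -> a.begin != b.begin
                ? Integer.compare(a.begin, b.begin)
                : Integer.compare(b.end, a.end));
        return sorted;
    }
}
```

A span that is created but never passed to addToIndexes simply does not show up in iterate(), which is the situation the note above warns about.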

On the Capabilities page, we define our annotator's inputs and outputs, in terms of the types in the type system.

Although capabilities come in sets, having multiple sets is deprecated; here we're just using one set. The RoomNumberAnnotator is very simple. It requires no input types, as it operates directly on the document text -- which is supplied as a part of the CAS initialization.

UIMA allows annotators to declare configuration parameters in their descriptors. The descriptor also specifies default values for the parameters, though these can be overridden at runtime.

The initialize method is a good place to read configuration parameter values.
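
The default-plus-override behaviour can be sketched in plain Java. ToyContext and ToyPatternAnnotator are hypothetical, and the parameter name "Pattern" is made up for this example; the sketch only shows descriptor defaults being replaced by runtime overrides and the value being read once in initialize.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical context: descriptor defaults, with runtime overrides applied on top.
class ToyContext {
    private final Map<String, Object> values = new HashMap<>();

    ToyContext(Map<String, ?> descriptorDefaults, Map<String, ?> runtimeOverrides) {
        values.putAll(descriptorDefaults);
        values.putAll(runtimeOverrides); // overrides win over defaults
    }

    Object getConfigParameterValue(String name) {
        return values.get(name);
    }
}

// Hypothetical annotator that reads its configuration once, in initialize.
class ToyPatternAnnotator {
    private Pattern pattern;

    void initialize(ToyContext context) {
        // "Pattern" is an assumed parameter name for this sketch.
        String regex = (String) context.getConfigParameterValue("Pattern");
        pattern = Pattern.compile(regex);
    }

    boolean matches(String documentText) {
        return pattern.matcher(documentText).find();
    }
}
```

Compiling the regular expression in initialize rather than in the per-document method means the work is done once per annotator instance.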

The UIMA framework ensures that an Annotator instance is called by only one thread at a time. An instance never has to worry about running some method on one thread, and then asynchronously being called using another thread. When multiple threading is wanted, for performance, multiple instances of the Annotator are created, each one running on just one thread.
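
The "one instance per thread" idea can be sketched with plain Java threads. ToyCounterAnnotator and ToyInstancePool are hypothetical and do not reflect how the framework schedules work; the point is only that each instance keeps unsynchronized state safely because a single thread ever touches it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical annotator with plain, unsynchronized state.
class ToyCounterAnnotator {
    int processed; // safe only because one thread owns this instance

    void process(String documentText) {
        processed++;
    }
}

class ToyInstancePool {
    // Starts one thread per annotator instance; the threads share only the work queue.
    static int processAll(List<String> documents, int threads) {
        Queue<String> queue = new ConcurrentLinkedQueue<>(documents);
        List<ToyCounterAnnotator> instances = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            ToyCounterAnnotator annotator = new ToyCounterAnnotator();
            instances.add(annotator);
            Thread worker = new Thread(() -> {
                String document;
                while ((document = queue.poll()) != null) {
                    annotator.process(document);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // join makes each worker's updates visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for workers", e);
            }
        }
        int total = 0;
        for (ToyCounterAnnotator annotator : instances) {
            total += annotator.processed;
        }
        return total;
    }
}
```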
