
penn tree bank 6/6

11 This use of pseudo-attachment is identical to its original use in Church's parser (Church 1980).
12 Contact the Linguistic Data Consortium, 441 Williams Hall, University of Pennsylvania, Philadelphia
PA 19104-605, or send e-mail to ldc@unagi.cis.upenn.edu for more information.
Mitchell P. Marcus et al.
Building a Large Annotated Corpus of English
Table 4
Penn Treebank (as of 11/92).

Description                      Tagged for       Skeletally
                                 Part-of-Speech   Parsed
                                 (Tokens)         (Tokens)
Dept. of Energy abstracts               231,404      231,404
Dow Jones Newswire stories            3,065,776    1,061,166
Dept. of Agriculture bulletins           78,555       78,555
Library of America texts                105,652      105,652
MUC-3 messages                          111,828      111,828
IBM Manual sentences                     89,121       89,121
WBUR radio transcripts                   11,589       11,589
ATIS sentences                           19,832       19,832
Brown Corpus, retagged                1,172,041    1,172,041
Total:                                4,885,798    2,881,188

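
The totals in Table 4 can be checked by summing each column. The following is a minimal sketch, not part of the original paper: the counts are transcribed from the table, the second column is taken to be the skeletally parsed token count (as the surrounding text suggests), and the names TABLE_4, pos_total, and parsed_total are illustrative.

```python
# Token counts transcribed from Table 4 as (POS-tagged, skeletally parsed).
# Treating the second column as "skeletally parsed" is an assumption based on
# the surrounding text.
TABLE_4 = {
    "Dept. of Energy abstracts": (231_404, 231_404),
    "Dow Jones Newswire stories": (3_065_776, 1_061_166),
    "Dept. of Agriculture bulletins": (78_555, 78_555),
    "Library of America texts": (105_652, 105_652),
    "MUC-3 messages": (111_828, 111_828),
    "IBM Manual sentences": (89_121, 89_121),
    "WBUR radio transcripts": (11_589, 11_589),
    "ATIS sentences": (19_832, 19_832),
    "Brown Corpus, retagged": (1_172_041, 1_172_041),
}

# Sum each column independently.
pos_total = sum(pos for pos, _ in TABLE_4.values())
parsed_total = sum(parsed for _, parsed in TABLE_4.values())

print(f"POS-tagged tokens:        {pos_total:,}")
print(f"Skeletally parsed tokens: {parsed_total:,}")
```

Both sums agree with the "Total:" row. The only source whose two counts differ is the Dow Jones Newswire material, of which roughly a third had been skeletally parsed at the time.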
Some comments on the materials included:
    Department of Energy abstracts are scientific abstracts from a variety of
    disciplines.
    All of the skeletally parsed Dow Jones Newswire materials are also
    available as digitally recorded read speech as part of the DARPA
    WSJ-CSR1 corpus, available through the Linguistic Data Consortium.
    The Department of Agriculture materials include short bulletins on such
    topics as when to plant various flowers and how to can various
    vegetables and fruits.
    The Library of America texts are 5,000-10,000 word passages, mainly
    book chapters, from a variety of American authors including Mark
    Twain, Henry Adams, Willa Cather, Herman Melville, W. E. B. Dubois,
    and Ralph Waldo Emerson.
    The MUC-3 texts are all news stories from the Federal News Service
    about terrorist activities in South America. Some of these texts are
    translations of Spanish news stories or transcripts of radio broadcasts.
    They are taken from training materials for the Third Message
    Understanding Conference.
    The Brown Corpus materials were completely retagged by the Penn
    Treebank project starting from the untagged version of the Brown
    Corpus (Francis 1964).
    The IBM sentences are taken from IBM computer manuals; they are
    chosen to contain a vocabulary of 3,000 words, and are limited in length.
    The ATIS sentences are transcribed versions of spontaneous sentences
    collected as training materials for the DARPA Air Travel Information
    System project.
Computational Linguistics
Volume 19, Number 2
The entire corpus has been tagged for POS information, at an estimated error rate
of approximately 3%. The POS-tagged versions of the Library of America texts and the
Department of Agriculture bulletins have been corrected twice (each by a different
annotator), and the corrected files were then carefully adjudicated; we estimate the
error rate of the adjudicated version at well under 1%. Using a version of PARTS
retrained on the entire preliminary corpus and adjudicating between the output of the
retrained version and the preliminary version of the corpus, we plan to reduce the
error rate of the final version of the corpus to approximately 1%. All the skeletally
parsed materials have been corrected once, except for the Brown materials, which have
been quickly proofread an additional time for gross parsing errors.
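
The double-correction and adjudication procedure described above can be pictured as follows. This is only an illustrative sketch under stated assumptions, not the project's actual tooling: the function name disagreements, the example sentence, and the two annotators' tag sequences are all hypothetical. The idea is simply that tokens on which two independently corrected versions differ are the ones an adjudicator needs to inspect.

```python
def disagreements(tags_a, tags_b):
    """Return the token positions where two annotators' tags differ.

    Both sequences must cover the same tokens in the same order.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(tags_a, tags_b)) if a != b]


# Hypothetical example: two annotators correct the same three-token sentence.
tokens = ["The", "can", "rusted"]
annotator_a = ["DT", "NN", "VBD"]
annotator_b = ["DT", "MD", "VBD"]

to_adjudicate = disagreements(annotator_a, annotator_b)
agreement = 1 - len(to_adjudicate) / len(tokens)

for i in to_adjudicate:
    print(f"token {i} ({tokens[i]!r}): {annotator_a[i]} vs {annotator_b[i]}")
print(f"agreement: {agreement:.2%}")
```

Only the flagged positions need a human decision; tokens on which both annotators agree are accepted as they stand, which is why adjudication can be much cheaper than a third full pass.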
5.2 Future Directions
A large number of research efforts, both at the University of Pennsylvania and
elsewhere, have relied on the output of the Penn Treebank Project to date. A few examples
already in print: a number of projects investigating stochastic parsing have used either
the POS-tagged materials (Magerman and Marcus 1990; Brill et al. 1990; Brill 1991) or
the skeletally parsed corpus (Weischedel et al. 1991; Pereira and Schabes 1992). The
POS-tagged corpus has also been used to train a number of different POS taggers,
including Meteer, Schwartz, and Weischedel (1991), and the skeletally parsed corpus has
been used in connection with the development of new methods to exploit intonational
cues in disambiguating the parsing of spoken sentences (Veilleux and Ostendorf 1992).
The Penn Treebank has been used to bootstrap the development of lexicons for
particular applications (Robert Ingria, personal communication) and is being used as a source
of examples for linguistic theory and psychological modelling (e.g. Niv 1991). To aid
in the search for specific examples of grammatical phenomena using the Treebank,
Richard Pito has developed tgrep, a tool for very fast context-free pattern matching
against the skeletally parsed corpus, which is available through the Linguistic Data
Consortium.
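
To give a feel for what searching a bracketed corpus for a grammatical pattern involves, here is a minimal sketch. It is not tgrep and does not use tgrep's pattern syntax; the functions parse, subtrees, find, and words, and the example sentence, are assumptions made for illustration. The sketch reads one skeletal bracketing (words as bare leaves) and finds every constituent with a given label that immediately dominates a constituent with another label.

```python
import re


def parse(bracketed):
    """Parse a bracketed string into nested (label, children) tuples.

    Leaves are plain word strings. Assumes well-formed, labeled brackets.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # a word
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1  # consume ")"
        return (label, children)

    return read()


def subtrees(tree):
    """Yield every constituent (non-leaf node) in the tree, top-down."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1]:
            yield from subtrees(child)


def find(tree, parent, child):
    """Constituents labeled `parent` that immediately dominate a `child`."""
    return [
        t for t in subtrees(tree)
        if t[0] == parent
        and any(isinstance(c, tuple) and c[0] == child for c in t[1])
    ]


def words(tree):
    """Return the words covered by a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for c in tree[1] for w in words(c)]


tree = parse("(S (NP (NP the book) (PP on (NP the table))) (VP fell))")
matches = [" ".join(words(t)) for t in find(tree, "NP", "PP")]
print(matches)
```

A real search tool would index the corpus and support a much richer set of structural relations; this sketch checks only immediate dominance on a single tree.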
    While the Treebank is being widely used, the annotation scheme employed has a
variety of limitations. Many otherwise clear argument/adjunct relations in the corpus
are not indicated because of the current Treebank's essentially context-free
representation. For example, there is at present no satisfactory representation for sentences in
which complement noun phrases or clauses occur after a sentential level adverb. Either
the adverb is trapped within the VP, so that the complement can occur within the VP
where it belongs, or else the adverb is attached to the S, closing off the VP and forcing
the complement to attach to the S. This "trapping" problem serves as a limitation for
groups that currently use Treebank material semiautomatically to derive lexicons for
particular applications. For most of these problems, however, solutions are possible
on the basis of mechanisms already used by the Treebank Project. For example, the
pseudo-attachment notation can be extended to indicate a variety of crossing
dependencies. We have recently begun to use this mechanism to represent various kinds
of dislocations, and the Treebank annotators themselves have developed a detailed
proposal to extend pseudo-attachment to a wide range of similar phenomena.
    A variety of inconsistencies in the annotation scheme used within the Treebank
have also become apparent with time. The annotation schemes for some syntactic
categories should be unified to allow a consistent approach to determining
predicate-argument structure. To take a very simple example, sentential adverbs attach under
VP when they occur between auxiliaries and predicative ADJPs, but attach under S
when they occur between auxiliaries and VPs. These structures need to be regularized.
    As the current Treebank has been exploited by a variety of users, a significant
number have expressed a need for forms of annotation richer than provided by the
project's first phase. Some users would like a less skeletal form of annotation of surface
grammatical structure, expanding the essentially context-free analysis of the current
Penn Treebank to indicate a wide variety of noncontiguous structures and
dependencies. A wide range of Treebank users now strongly desire a level of annotation that
makes explicit some form of predicate-argument structure. The desired level of
representation would make explicit the logical subject and logical object of the verb, and
would indicate, at least in clear cases, which subconstituents serve as arguments of
the underlying predicates and which serve as modifiers.
    During the next phase of the Treebank project, we expect to provide both a richer
analysis of the existing corpus and a parallel corpus of predicate-argument structures.
This will be done by first enriching the annotation of the current corpus, and then
automatically extracting predicate-argument structure, at the level of distinguishing
logical subjects and objects, and distinguishing arguments from adjuncts for clear
cases. Enrichment will be achieved by automatically transforming the current Penn
Treebank into a level of structure close to the intended target, and then completing
the conversion by hand.
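
As a rough picture of what extracting predicate-argument structure from a skeletal parse might look like in the clearest cases, consider the sketch below. It is an assumption-laden illustration, not the procedure planned by the project: the tuple-based tree format, the function names first_child, words, and predicate_argument, and the example clause are all hypothetical, and the heuristic handles only simple active clauses by reading arguments off surface positions.

```python
def first_child(node, label):
    """Return the first child constituent of `node` with this label, or None."""
    for child in node[1]:
        if isinstance(child, tuple) and child[0] == label:
            return child
    return None


def words(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]


def predicate_argument(clause):
    """Naive extraction for a simple active clause (S (NP ...) (VP verb (NP ...))).

    Subject = first NP under S; predicate = first bare word under VP;
    object = first NP under VP. Anything more complex is out of scope.
    """
    vp = first_child(clause, "VP")
    if vp is None:
        return None
    subject = first_child(clause, "NP")
    obj = first_child(vp, "NP")
    verb = next((c for c in vp[1] if isinstance(c, str)), None)
    return {
        "predicate": verb,
        "subject": " ".join(words(subject)) if subject else None,
        "object": " ".join(words(obj)) if obj else None,
    }


# Hypothetical skeletal parse, written as nested (label, children) tuples.
clause = (
    "S",
    [
        ("NP", ["the", "annotator"]),
        ("VP", ["corrected", ("NP", ["the", "parse"])]),
    ],
)
result = predicate_argument(clause)
print(result)
```

In a passive clause the surface subject is the logical object, so a position-based heuristic of this kind would mislabel it. Cases like that are one reason the annotation needs to be enriched before extraction can be automated.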
Acknowledgments
The work reported here was partially
supported by DARPA grant
No. N0014-85-K0018, by DARPA and
AFOSR jointly under grant
No. AFOSR-90-0066 and by ARO grant
No. DAAL 03-89-C0031 PRI. Seed money
was provided by the General Electric
Corporation under grant No. J01746000. We
gratefully acknowledge this support. We
would also like to acknowledge the
contribution of the annotators who have
worked on the Penn Treebank Project:
Florence Dong, Leslie Dossey, Mark
Ferguson, Lisa Frank, Elizabeth Hamilton,
Alissa Hinckley, Chris Hudson, Karen Katz,
Grace Kim, Robert MacIntyre, Mark Parisi,
Britta Schasberger, Victoria Tredinnick and
Matt Waters; in addition, Rob Foye, David
Magerman, Richard Pito and Steven Shapiro
deserve our special thanks for their
administrative and programming support.
We are grateful to AT&T Bell Labs for
permission to use Kenneth Church's PARTS
part-of-speech labeler and Donald Hindle's
Fidditch parser. Finally, we would like to
thank Sue Marcus for sharing with us her
statistical expertise and providing the
analysis of the time data of the experiment
reported in Section 3. The design of that
experiment is due to the first two authors;
they alone are responsible for its
shortcomings.