penn tree bank 5/n

always errs on the side of caution. Since determining the correct attachment point of prepositional phrases, relative clauses, and adverbial modifiers almost always requires extrasyntactic information, Fidditch pursues the very conservative strategy of always leaving such constituents unattached, even if only one attachment point is syntactically possible. However, Fidditch does indicate its best guess concerning a fragment's attachment site by the fragment's depth of embedding. Moreover, it attaches prepositional phrases beginning with of if the preposition immediately follows a noun; thus, tale of … and boatload of … are parsed as single constituents, while first of … is not. Since Fidditch lacks a large verb lexicon, it cannot decide whether some constituents serve as adjuncts or arguments and hence leaves subordinate clauses such as infinitives as separate fragments. Note further that Fidditch creates adjective phrases only when it determines that more than one lexical item belongs in the ADJP. Finally, as is well known, the scope of conjunctions and other coordinate structures can only be determined given the richest forms of contextual information; here again, Fidditch simply turns out a string of tree fragments around any conjunction. Because all decisions within Fidditch are made locally, all commas (which often signal conjunction) must disrupt the input into separate chunks.
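As a rough illustration of the of heuristic just described, here is a toy reconstruction in Python (ours, not Fidditch's actual code; the POS tags follow Penn conventions, and the function name is invented):

    # Toy reconstruction of the of-attachment heuristic: an of-PP is
    # attached only when the preposition immediately follows a noun.
    def of_pp_attaches(tagged):
        """tagged: list of (word, POS) pairs. Returns True if the
        heuristic would attach the of-PP to the preceding word."""
        for i in range(1, len(tagged)):
            word, _ = tagged[i]
            _, prev_pos = tagged[i - 1]
            if word.lower() == "of":
                return prev_pos.startswith("NN")  # NN, NNS, NNP, NNPS
        return False

    print(of_pp_attaches([("tale", "NN"), ("of", "IN")]))   # True
    print(of_pp_attaches([("first", "JJ"), ("of", "IN")]))  # False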
    The original design of the Treebank called for a level of syntactic analysis comparable to the skeletal analysis used by the Lancaster Treebank, but a limited experiment was performed early in the project to investigate the feasibility of providing greater levels of structural detail. While the results were somewhat unclear, there was evidence that annotators could maintain a much faster rate of hand correction if the parser output was simplified in various ways, reducing the visual complexity of the tree representations and eliminating a range of minor decisions. The key results of this experiment were as follows:
  • Annotators take substantially longer to learn the bracketing task than the POS tagging task, with substantial increases in speed occurring even after two months of training.
  • Annotators can correct the full structure provided by Fidditch at an average speed of approximately 375 words per hour after three weeks and 475 words per hour after six weeks.
  • Reducing the output from the full structure shown in Figure 3 to a more skeletal representation similar to that used by the Lancaster UCREL Treebank Project increases annotator productivity by approximately 100-200 words per hour.
  • It proved to be very difficult for annotators to distinguish between a verb's arguments and adjuncts in all cases. Allowing annotators to ignore this distinction when it is unclear (attaching constituents high) increases productivity by approximately 150-200 words per hour. Informal examination of later annotation showed that forced distinctions cannot be made consistently.
    As a result of this experiment, the originally proposed skeletal representation was
adopted, without a forced distinction between arguments and adjuncts. Even after
extended training, performance varies markedly by annotator, with speeds on the task
of correcting skeletal structure without requiring a distinction between arguments and
adjuncts ranging from approximately 750 words per hour to well over 1,000 words
per hour after three or four months' experience. The fastest annotators work in bursts
of well over 1,500 words per hour alternating with brief rests. At an average rate
of 750 words per hour, a team of five part-time annotators annotating three hours a
day should maintain an output of about 2.5 million words a year of "treebanked"
sentences, with each sentence corrected once.
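(For concreteness: 750 words per hour × 3 hours per day × 5 annotators is about 11,250 words per day, which, assuming roughly 220 working days a year, comes to about 2.5 million words.)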
    It is worth noting that experienced annotators can proofread previously corrected
material at very high speeds. A parsed subcorpus of over 1 million words was recently
proofread at an average speed of approximately 4,000 words per annotator per hour.
At this rate of productivity, annotators are able to find and correct gross errors in
parsing, but do not have time to check, for example, whether they agree with all
prepositional phrase attachments.
((S (NP (ADJP Battle-tested industrial)
        managers)
    (? here)
    (? always)
    (VP buck))
 (? (PP up
        (NP nervous newcomers)))
 (? (PP with
        (NP the tale
            (PP of
                (NP the
                    (ADJP first))))))
 (? (PP of
        (NP their countrymen)))
 (? (S (NP *)
       to
       (VP visit
           (NP Mexico))))
 (? ,)
 (? (NP a boatload
        (PP of
            (NP warriors))
        (VP blown
            (? ashore)
            (NP 375 years))))
 (? ago)
 (? .))
Figure 4
Sample bracketed text: after simplification, before correction.
    The process that creates the skeletal representations to be corrected by the annotators simplifies and flattens the structures shown in Figure 3 by removing POS tags, nonbranching lexical nodes, and certain phrasal nodes, notably NBAR. The output of the first automated stage of the bracketing task is shown in Figure 4.
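A minimal sketch of this kind of flattening, using nltk.Tree as a stand-in for the project's own tools (the function name simplify is invented here; removal of phrasal nodes such as NBAR is omitted for brevity):

    # Collapse nonbranching lexical nodes such as (NNS managers) into the
    # bare word, which also removes the POS tags from the tree.
    from nltk import Tree

    def simplify(t):
        if isinstance(t, str):  # a bare token
            return t
        kids = [simplify(k) for k in t]
        if len(kids) == 1 and isinstance(kids[0], str):
            return kids[0]      # (NNS managers) -> managers
        return Tree(t.label(), kids)

    full = Tree.fromstring(
        "(NP (ADJP (JJ Battle-tested) (JJ industrial)) (NNS managers))")
    print(simplify(full))       # (NP (ADJP Battle-tested industrial) managers)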
    Annotators correct this simplified structure using a mouse-based interface. Their primary job is to "glue" fragments together, but they must also correct incorrect parses and delete some structure. Single mouse clicks perform the following tasks, among others; the interface correctly reindents the structure whenever necessary. (A toy sketch of the first two operations follows the list.)
  • Attach constituents labeled ?. This is done by pressing down the appropriate mouse button on or immediately after the ?, moving the mouse onto or immediately after the label of the intended parent, and releasing the mouse. Attaching constituents automatically deletes their ? label.
  • Promote a constituent up one level of structure, making it a sibling of its current parent.
  • Delete a pair of constituent brackets.
  • Create a pair of brackets around a constituent. This is done by typing a constituent tag and then sweeping out the intended constituent with the mouse. The tag is checked to assure that it is a legal label.
  • Change the label of a constituent. The new tag is checked to assure that it is legal.
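The sketch below models the first two operations on nltk.Tree objects (an illustration of the described behavior, not the actual annotation tool; the function names are ours):

    # Attach a "?" fragment under a parent, stripping the "?" label, and
    # promote a constituent to be a sibling of its current parent.
    from nltk import Tree

    def attach(fragment, parent):
        parent.append(fragment[0] if fragment.label() == "?" else fragment)

    def promote(grandparent, p, c):
        """Move child c of grandparent[p] up one level, placing it
        immediately after its former parent."""
        grandparent.insert(p + 1, grandparent[p].pop(c))

    vp = Tree.fromstring("(VP buck)")
    attach(Tree.fromstring("(? (PP up (NP nervous newcomers)))"), vp)
    print(vp)  # (VP buck (PP up (NP nervous newcomers)))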
((S (NP Battle-tested industrial managers
        here)
    always
    (VP buck
        up
        (NP nervous newcomers)
        (PP with
            (NP the tale
                (PP of
                    (NP (NP the
                            (ADJP first
                                  (PP of
                                      (NP their countrymen)))
                            (S (NP *)
                               to
                               (VP visit
                                   (NP Mexico))))
                        (NP (NP a boatload
                                (PP of
                                    (NP (NP warriors)
                                        (VP-1 blown
                                              ashore
                                              (ADVP (NP 375 years)
                                                    ago)))))
                            (VP-1 *pseudo-attach*))))))))
 .)
Figure 5
Sample bracketed text: after correction.
    The bracketed text after correction is shown in Figure 5. The fragments are now
connected together into one rooted tree structure. The result is a skeletal analysis in
that much syntactic detail is left unannotated. Most prominently, all internal structure
of the NP up through the head and including any single-word post-head modifiers is
left unannotated.
    As noted above in connection with POS tagging, a major goal of the Treebank project is to allow annotators to indicate only structure of which they are certain. The Treebank provides two notational devices to ensure this goal: the X constituent label and so-called "pseudo-attachment." The X constituent label is used if an annotator is sure that a sequence of words is a major constituent but is unsure of its syntactic category; in such cases, the annotator simply brackets the sequence and labels it X. The second notational device, pseudo-attachment, has two primary uses. On the one hand, it is used to annotate what Kay has called permanent predictable ambiguities, allowing an annotator to indicate that a structure is globally ambiguous even given the surrounding context (annotators always assign structure to a sentence on the basis of its context). An example of this use of pseudo-attachment is shown in Figure 5, where the participial phrase blown ashore 375 years ago modifies either warriors or boatload, but there is no way of settling the question: both attachments mean exactly the same thing. In the case at hand, the pseudo-attachment notation indicates that the annotator of the sentence thought that VP-1 is most likely a modifier of warriors, but that it is also possible that it is a modifier of boatload. A second use of pseudo-attachment is to allow annotators to represent the "underlying" position of extraposed elements; in addition to being attached in its superficial position in the tree, the extraposed constituent is pseudo-attached within the constituent to which it is semantically related. Note that except for the device of pseudo-attachment, the skeletal analysis of the Treebank is entirely restricted to simple context-free trees.
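To make the extraposition use concrete, here is a constructed sketch (the sentence, index, and exact label conventions are all invented for illustration, not quoted from the corpus) for a sentence like A man arrived who knew the answer, with the relative clause attached where it surfaces and pseudo-attached inside the subject NP to which it is semantically related:

    ((S (NP a man
            (S-1 *pseudo-attach*))
        (VP arrived)
        (S-1 who knew the answer))
     .)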
    The reader may have noticed that the ADJP brackets in Figure 4 have vanished in
Figure 5. For the sake of the overall efficiency of the annotation task, we leave all ADJP
brackets in the simplified structure, with the annotators expected to remove many
of them during annotation. The reason for this is somewhat complex, but provides
a good example of the considerations that come into play in designing the details
of annotation methods. The first relevant fact is that Fidditch only outputs ADJP
brackets within NPs for adjective phrases containing more than one lexical item. To
be consistent, the final structure must contain ADJP nodes for all adjective phrases
within NPs or for none; we have chosen to delete all such nodes within NPs under
normal circumstances. (This does not affect the use of the ADJP tag for predicative
adjective phrases outside of NPs.) In a seemingly unrelated guideline, all coordinate
structures are annotated in the Treebank; such coordinate structures are represented
by Chomsky-adjunction when the two conjoined constituents bear the same label.
This means that if an NP contains coordinated adjective phrases, then an ADJP tag
will be used to tag that coordination, even though simple ADJPs within NPs will not
bear an ADJP tag. Experience has shown that annotators can delete pairs of brackets
extremely quickly using the mouse-based tools, whereas creating brackets is a much
slower operation. Because the coordination of adjectives is quite common, it is more
efficient to leave in ADJP labels, and delete them if they are not part of a coordinate
structure, than to reintroduce them if necessary.
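A constructed example of the contrast (invented here, following the conventions just described): after correction, a simple premodified NP would carry no ADJP brackets, while a coordination of adjective phrases would keep them, Chomsky-adjoined under a single ADJP:

    (NP battle-tested newcomers)

    (NP (ADJP (ADJP battle-tested)
              and
              (ADJP nervous))
        newcomers)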
5. Progress to Date
5.1 Composition and Size of Corpus
Table 4 shows the output of the Penn Treebank project at the end of its first phase. All the materials listed in Table 4 are available on CD-ROM to members of the Linguistic Data Consortium. About 3 million words of POS-tagged material and a small sampling of skeletally parsed text are available as part of the first Association for Computational Linguistics/Data Collection Initiative CD-ROM, and a somewhat larger subset of materials is available on cartridge tape directly from the Penn Treebank Project. For information, contact the first author of this paper or send e-mail to treebank@unagi.cis.upenn.edu.