Penn Tree Bank 4/n
4. Bracketing
4.1 Basic Methodology
The methodology for bracketing the corpus is completely parallel to that for tagging:
hand correction of the output of an errorful automatic process. Fidditch, a deterministic
parser developed by Donald Hindle first at the University of Pennsylvania and
subsequently at AT&T Bell Labs (Hindle 1983, 1989), is used to provide an initial parse of
the material. Annotators then hand correct the parser's output using a mouse-based
interface implemented in GNU Emacs Lisp. Fidditch has three properties that make it
ideally suited to serve as a preprocessor to hand correction:
  • Fidditch always provides exactly one analysis for any given sentence, so
    that annotators need not search through multiple analyses.
  • Fidditch never attaches any constituent whose role in the larger structure
    it cannot determine with certainty. In cases of uncertainty, Fidditch
    chunks the input into a string of trees, providing only a partial structure
    for each sentence.
  • Fidditch has rather good grammatical coverage, so that the grammatical
    chunks that it does build are usually quite accurate.
    Because of these properties, annotators do not need to rebracket much of the
parser's output, a relatively time-consuming task. Rather, the annotators' main task
is to "glue" together the syntactic chunks produced by the parser. Using a mouse-based
interface, annotators move each unattached chunk of structure under the node to which
it should be attached. Notational devices allow annotators to indicate uncertainty
concerning constituent labels, and to indicate multiple attachment sites for ambiguous
modifiers. The bracketing process is described in more detail in Section 4.3.
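The "gluing" step described above can be sketched in miniature. The snippet below is a toy model only, assuming a nested-list tree representation; it is not the actual Emacs Lisp interface, and the function and variable names are illustrative:

```python
# Toy model of the annotators' "gluing" step: Fidditch emits a string of
# tree fragments, and correction mostly means moving each unattached
# fragment (labeled "?") under the node where it belongs.
# Trees are nested lists: [label, child1, child2, ...]; leaves are strings.

def attach(target, fragment):
    """Attach an unattached fragment under the target node, dropping
    the placeholder "?" label Fidditch gives uncertain constituents."""
    if fragment[0] == "?" and len(fragment) == 2:
        fragment = fragment[1]  # unwrap (? (XP ...)) -> (XP ...)
    target.append(fragment)
    return target

# Two fragments, as Fidditch might leave them:
s = ["S", ["NP", "managers/NNS"], ["VP", "buck/VBP"]]
pp = ["?", ["PP", "up/RP", ["NP", "newcomers/NNS"]]]

# The annotator decides the PP belongs under the VP:
attach(s[2], pp)
```

After the call, the PP chunk sits under the VP node, mirroring the mouse-driven move-under-node operation the text describes.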
4.2 The Syntactic Tagset
Table 3 shows the set of syntactic tags and null elements that we use in our skeletal
bracketing. More detailed information on the syntactic tagset and guidelines concerning
its use are to be found in Santorini and Marcinkiewicz (1991).
    Although different in detail, our tagset is similar in delicacy to that used by the
Lancaster Treebank Project, except that we allow null elements in the syntactic
annotation. Because of the need to achieve a fairly high output per hour, it was decided
not to require annotators to create distinctions beyond those provided by the parser.
Our approach to developing the syntactic tagset was highly pragmatic and strongly
influenced by the need to create a large body of annotated material given limited
human resources. Despite the skeletal nature of the bracketing, however, it is possible to
make quite delicate distinctions when using the corpus by searching for combinations
of structures. For example, an SBAR containing the word to immediately before the
VP will necessarily be infinitival, while an SBAR containing a verb or auxiliary with a
Mitchell P. Marcus et al.
Building a Large Annotated Corpus of English
Table 3
The Penn Treebank syntactic tagset.
ADJP     Adjective phrase
ADVP     Adverb phrase
NP       Noun phrase
PP       Prepositional phrase
S        Simple declarative clause
SBAR     Clause introduced by subordinating conjunction or 0 (see below)
SBARQ    Direct question introduced by wh-word or wh-phrase
SINV     Declarative sentence with subject-aux inversion
SQ       Subconstituent of SBARQ excluding wh-word or wh-phrase
VP       Verb phrase
WHADVP   Wh-adverb phrase
WHNP     Wh-noun phrase
WHPP     Wh-prepositional phrase
X        Constituent of unknown or uncertain category

Null elements
*        "Understood" subject of infinitive or imperative
0        Zero variant of that in subordinate clauses
T        Trace; marks position where moved wh-constituent is interpreted
NIL      Marks position where preposition is interpreted in pied-piping contexts
tense feature will necessarily be tensed. To take another example, so-called that-clauses
can be identified easily by searching for SBARs containing the word that or the null
element 0 in initial position.
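The kind of structural search described above (e.g. finding that-clauses as SBARs whose first element is that or the null element 0) can be sketched as follows. The parser below is an ad hoc illustration over toy skeletal bracketings, not an official Treebank tool:

```python
import re

def parse(s):
    """Parse a bracketed string like "(SBAR 0 (S ...))" into nested lists
    of the form [label, child1, child2, ...]."""
    tokens = re.findall(r"\(|\)|[^()\s]+", s)
    def read(i):
        assert tokens[i] == "("
        node = [tokens[i + 1]]          # constituent label
        i += 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = read(i)      # nested constituent
                node.append(child)
            else:
                node.append(tokens[i])  # word or null element
                i += 1
        return node, i + 1
    return read(0)[0]

def subtrees(t):
    """Yield every constituent in the tree, top-down."""
    yield t
    for child in t[1:]:
        if isinstance(child, list):
            yield from subtrees(child)

def is_that_clause(t):
    # An SBAR whose first element is "that" or the null complementizer 0.
    return isinstance(t, list) and t[0] == "SBAR" and t[1] in ("that", "0")

tree = parse("(S (NP he) (VP said (SBAR 0 (S (NP she) (VP left)))))")
hits = [t for t in subtrees(tree) if is_that_clause(t)]
```

The same subtree scan, with a different predicate, would pick out infinitival SBARs (those containing to immediately before the VP) or tensed ones.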
    As can be seen from Table 3, the syntactic tagset used by the Penn Treebank in-
cludes a variety of null elements, a subset of the null elements introduced by Fidditch.
While it would be expensive to insert null elements entirely by hand, it has not proved
overly onerous to maintain and correct those that are automatically provided. We have
chosen to retain these null elements because we believe that they can be exploited in
many cases to establish a sentence's predicate-argument structure; at least one recipient
of the parsed corpus has used it to bootstrap the development of lexicons for
particular NLP projects and has found the presence of null elements to be a considerable
aid in determining verb transitivity (Robert Ingria, personal communication). While
these null elements correspond more directly to entities in some grammatical theories
than in others, it is not our intention to lean toward one or another theoretical view in
producing our corpus. Rather, since the representational framework for grammatical
structure in the Treebank is a relatively impoverished flat context-free notation, the
easiest mechanism for including information about predicate-argument structure, if only
indirectly, is to allow the parse tree to contain explicit null elements.
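As a toy illustration of the transitivity point (the helper below is hypothetical and is not Ingria's actual procedure): a trace left by a moved wh-object still marks the verb's object position, so the verb counts as transitive even though no overt NP follows it.

```python
# Trees are nested lists: [label, children...]; "T" is the trace null element.

def vp_has_object(vp):
    """True if a VP node contains an overt NP object or a trace T."""
    return any(c == "T" or (isinstance(c, list) and c[0] == "NP")
               for c in vp[1:])

# "What did you see?" -- the object of "saw" has moved, leaving a trace:
wh_question = ["SBAR", ["WHNP", "what/WP"],
               ["S", ["NP", "you/PRP"],
                     ["VP", "saw/VBD", "T"]]]

vp = wh_question[2][2]
assert vp_has_object(vp)  # the trace reveals the moved object
```

Without the null element, a surface scan of this sentence would wrongly suggest that saw can occur without an object.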
4.3 Sample Bracketing Output
Below, we illustrate the bracketing process for the first sentence of our sample text.
Figure 3 shows the output of Fidditch (modified slightly to include our POS tags).
    As Figure 3 shows, Fidditch leaves very many constituents unattached, labeling
them as "?", and its output is perhaps better thought of as a string of tree fragments
than as a single tree structure. Fidditch only builds structure when this is possible for
a purely syntactic parser without access to semantic or pragmatic information, and it
Computational Linguistics, Volume 19, Number 2
((S
   (NP (NBAR (ADJP (ADJ "Battle-tested/JJ")
                   (ADJ "industrial/JJ"))
             (NPL "managers/NNS")))
   (? (ADV "here/RB"))
   (? (ADV "always/RB"))
   (AUX (TNS *))
   (VP (VPRES "buck/VBP")))
 (? (PP (PREP "up/RP")
        (NP (NBAR (ADJ "nervous/JJ")
                  (NPL "newcomers/NNS")))))
 (? (PP (PREP "with/IN")
        (NP (DART "the/DT")
            (NBAR (N "tale/NN"))
            (PP of/PREP
                (NP (DART "the/DT")
                    (NBAR (ADJP (ADJ "first/JJ"))))))))
 (? (PP of/PREP
        (NP (PROS "their/PP$")
            (NBAR (NPL "countrymen/NNS")))))
 (? (S (NP (PRO *))
       (AUX to/TNS)
       (VP (V "visit/VB")
           (NP (PNP "Mexico/NNP")))))
 (? (MID ",/,"))
 (? (NP (IART "a/DT")
        (NBAR (N "boatload/NN"))
        (PP of/PREP
            (NP (NBAR (NPL "warriors/NNS"))))
        (VP (VPPRT "blown/VBN")
            (? (ADV "ashore/RB"))
            (NP (NBAR (CARD "375/CD")
                      (NPL "years/NNS"))))))
 (? (ADV "ago/RB"))
 (? (FIN "./.")))
Figure 3
Sample bracketed text: full structure provided by Fidditch.