Building a Large Annotated Corpus of
English: The Penn Treebank
Mitchell P. Marcus
University of Pennsylvania
Beatrice Santorini
Northwestern University
Mary Ann Marcinkiewicz
University of Pennsylvania
1. Introduction
There is a growing consensus that significant, rapid progress can be made in both text
understanding and spoken language understanding by investigating those phenom-
ena that occur most centrally in naturally occurring unconstrained materials and by
attempting to automatically extract information about language from very large cor-
pora. Such corpora are beginning to serve as important research tools for investigators
in natural language processing, speech recognition, and integrated spoken language
systems, as well as in theoretical linguistics. Annotated corpora promise to be valu-
able for enterprises as diverse as the automatic construction of statistical models for
the grammar of the written and the colloquial spoken language, the development of
explicit formal theories of the differing grammars of writing and speech, the investi-
gation of prosodic phenomena in speech, and the evaluation and comparison of the
adequacy of parsing models.
In this paper, we review our experience with constructing one such large annotated
corpus, the Penn Treebank, a corpus¹ consisting of over 4.5 million words of American
English. During the first three-year phase of the Penn Treebank Project (1989-1992), this
corpus has been annotated for part-of-speech (POS) information. In addition, over half
of it has been annotated for skeletal syntactic structure. These materials are available
to members of the Linguistic Data Consortium; for details, see Section 5.1.
The paper is organized as follows. Section 2 discusses the POS tagging task. After
outlining the considerations that informed the design of our POS tagset and pre-
senting the tagset itself, we describe our two-stage tagging process, in which text
is first assigned POS tags automatically and then corrected by human annotators.
Section 3 briefly presents the results of a comparison between entirely manual and
semi-automated tagging, with the latter being shown to be superior on three counts:
speed, consistency, and accuracy. In Section 4, we turn to the bracketing task. Just as
with the tagging task, we have partially automated the bracketing task: the output of
1 A distinction is sometimes made between a corpus as a carefully structured set of
materials gathered together to jointly meet some design principles, and a collection,
which may be much more opportunistic in construction. We acknowledge that from this
point of view, the raw materials of the Penn Treebank form a collection.
© 1993 Association for Computational Linguistics
Computational Linguistics
the POS tagging phase is automatically parsed and simplified to yield a skeletal syn-
tactic representation, which is then corrected by human annotators. After presenting
the set of syntactic tags that we use, we illustrate and discuss the bracketing process. In
particular, we will outline various factors that affect the speed with which annotators
are able to correct bracketed structures, a task that, not surprisingly, is considerably
more difficult than correcting POS-tagged text. Finally, Section 5 describes the com-
position and size of the current Treebank corpus, briefly reviews some of the research
projects that have relied on it to date, and indicates the directions that the project is
likely to take in the future.
2. Part-of-Speech Tagging
2.1 A Simplified POS Tagset for English
The POS tagsets used to annotate large corpora in the past have traditionally been
fairly extensive. The pioneering Brown Corpus distinguishes 87 simple tags (Francis
1964; Francis and Kucera 1982) and allows the formation of compound tags; thus, the
contraction I'm is tagged as PPSS+BEM (PPSS for "non-third person nominative personal
pronoun" and BEM for "am, 'm").² Subsequent projects have tended to elaborate
the Brown Corpus tagset. For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses
about 135 tags, the Lancaster UCREL group about 165 tags, and the London-Lund Cor-
pus of Spoken English 197 tags. The rationale behind developing such large, richly
articulated tagsets is to approach "the ideal of providing distinct codings for all classes
of words having distinct grammatical behaviour" (Garside, Leech, and Sampson 1987,
p. 167).
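As a minimal illustration of how compound tags of the kind just described decompose into simple tags, consider the following Python sketch. The function names and the gloss table are hypothetical and introduced only for this example; the table contains just the two glosses quoted above, not the full Brown Corpus inventory, and the sketch assumes that "+" is the only separator used in compound tags.

```python
# Glosses for the two simple tags quoted in the text; this is an
# illustrative fragment, not the full Brown Corpus tag inventory.
BROWN_GLOSSES = {
    "PPSS": "non-third person nominative personal pronoun",
    "BEM": "am, 'm",
}


def split_compound_tag(tag):
    """Split a compound tag such as 'PPSS+BEM' into its simple tags.

    Assumes '+' is the only separator; a simple tag is returned
    as a one-element list.
    """
    return [part for part in tag.split("+") if part]


def describe(tag):
    """Pair each simple tag in `tag` with its gloss, if one is known."""
    return [(part, BROWN_GLOSSES.get(part, "unknown"))
            for part in split_compound_tag(tag)]


if __name__ == "__main__":
    # The contraction I'm carries the compound tag PPSS+BEM.
    for simple_tag, gloss in describe("PPSS+BEM"):
        print(simple_tag, "=", gloss)
```

Treating a compound tag as a sequence of simple tags keeps the set of distinct symbols close to the simple-tag inventory, whereas treating each compound as an atomic symbol enlarges the tagset.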
2.1.1 Recoverability. Like the tagsets just mentioned, the Penn Treebank tagset is based
on that of the Brown Corpus. However, the stochastic orientation of the Penn Treebank
and the resulting concern with sparse data led us to modify the Brown Corpus
tagset by paring it down considerably. A key strategy in reducing the tagset was to
eliminate redundancy by taking into account both lexical and syntactic information.
Thus, whereas many POS tags in the Brown Corpus tagset are unique to a particular
lexical item, the Penn Treebank tagset strives to eliminate such instances of lexical re-
dundancy. For instance, the Brown Corpus distinguishes five different forms for main
verbs: the base form is tagged VB, and forms with overt endings are indicated by
appending D for past tense, G for present participle/gerund, N for past participle,
and Z for third person singular present. Exactly the same paradigm is recognized for
have, but have (regardless of whether it is used as an auxiliary or a main verb) is as-
signed its own base tag HV. The Brown Corpus further distinguishes three forms of
do: the base form (DO), the past tense (DOD), and the third person singular present
(DOZ),⁴ and eight forms of be: the five forms distinguished for regular verbs as well
as the irregular forms am (BEM), are (BER), and was (BEDZ). By contrast, since the
distinctions between the forms of VB on the one hand and the forms of BE, DO, and
HV on the other are lexically recoverable, they are eliminated in the Penn Treebank,
as shown in Table 1.⁵
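The collapsing of lexically recoverable verb tags described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's actual conversion procedure: the function name and tables are hypothetical, the regular cases simply replace the lexeme-specific stems BE, DO, and HV with VB while keeping the suffix, and the targets for the irregular tags (BEM and BER to VBP, BEDZ to VBD) are assumptions based on the tagset's treatment of present and past tense forms. The sketch does not attempt the separate distinction between base forms and non-third person singular present forms.

```python
# Lexeme-specific base tags named in the text: be, do, and have.
LEXICAL_STEMS = ("BE", "DO", "HV")

# Suffixes of the regular paradigm: none (base form), D (past tense),
# G (present participle/gerund), N (past participle),
# Z (third person singular present).
REGULAR_SUFFIXES = {"", "D", "G", "N", "Z"}

# Irregular Brown tags for forms of "be". The target tags here are
# assumptions of this sketch, not values taken from Table 1.
IRREGULAR = {"BEM": "VBP", "BER": "VBP", "BEDZ": "VBD"}


def collapse_brown_verb_tag(tag):
    """Map a lexeme-specific Brown verb tag onto a general verb tag.

    Tags that do not belong to the BE/DO/HV paradigms are returned
    unchanged.
    """
    if tag in IRREGULAR:
        return IRREGULAR[tag]
    for stem in LEXICAL_STEMS:
        suffix = tag[len(stem):]
        if tag.startswith(stem) and suffix in REGULAR_SUFFIXES:
            return "VB" + suffix
    return tag


if __name__ == "__main__":
    for brown_tag in ("HV", "HVD", "DOZ", "BEN", "BEM", "BEDZ", "NN"):
        print(brown_tag, "->", collapse_brown_verb_tag(brown_tag))
```

The mapping loses no information in practice because the word form itself identifies the lexeme: a token tagged VBD whose form is "had" can only be a form of have, which is what makes the distinction lexically recoverable.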
2 Counting both simple and compound tags, the Brown Corpus tagset contains 187 tags.
3 A useful overview of the relation of these and other tagsets to each other and to the
Brown Corpus tagset is given in Appendix B of Garside, Leech, and Sampson (1987).
4 The gerund and the participle of do are tagged VBG and VBN in the Brown Corpus,
respectively, presumably because they are never used as auxiliary verbs in American English.
5 The irregular present tense forms am and are are tagged as VBP in the Penn Treebank (see
Section 2.1.3), just like any other non-third person singular present tense form.