Mitchell P Marcus et al.
Building a Large Annotated Corpus of English
Table 1
Elimination of lexically recoverable distinctions.
sing/VB      be/VB      do/VB      have/VB
sings/VBZ    is/VBZ     does/VBZ   has/VBZ
sang/VBD     was/VBD    did/VBD    had/VBD
singing/VBG  being/VBG  doing/VBG  having/VBG
sung/VBN     been/VBN   done/VBN   had/VBN
A second example of lexical recoverability concerns those words that can precede
articles in noun phrases. The Brown Corpus assigns a separate tag to pre-qualifiers
(quite, rather, such), pre-quantifiers (all, half, many, nary) and both. The Penn Treebank,
on the other hand, assigns all of these words to a single category PDT (predeterminer).
Further examples of lexically recoverable categories are the Brown Corpus categories
PPL (singular reflexive pronoun) and PPLS (plural reflexive pronoun), which we
collapse with PRP (personal pronoun), and the Brown Corpus category RN (nominal
adverb), which we collapse with RB (adverb).
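The collapsing described above can be sketched as a lookup table. This is a minimal Python sketch, not the authors' tooling: the Brown tag names PPL, PPLS and RN and the word lists come from the text, while the function names and the (word, tag) representation are illustrative assumptions.

```python
# Brown Corpus tags that the text says are collapsed into Penn Treebank tags.
BROWN_TO_PENN = {
    "PPL": "PRP",   # singular reflexive pronoun -> personal pronoun
    "PPLS": "PRP",  # plural reflexive pronoun -> personal pronoun
    "RN": "RB",     # nominal adverb -> adverb
}

# Word classes named in the text; all of them are tagged PDT in the Penn tagset.
PRE_QUALIFIERS = {"quite", "rather", "such"}
PRE_QUANTIFIERS = {"all", "half", "many", "nary"}


def collapse(brown_tag):
    """Map a Brown tag to its Penn tag; tags not listed pass through unchanged."""
    return BROWN_TO_PENN.get(brown_tag, brown_tag)


def recover_predeterminer_class(word, penn_tag):
    """Recover the finer Brown-style class of a PDT token from the word itself.

    Returns None when the token is not a PDT or the word is not in the lists.
    """
    if penn_tag != "PDT":
        return None
    w = word.lower()
    if w in PRE_QUALIFIERS:
        return "pre-qualifier"
    if w in PRE_QUANTIFIERS:
        return "pre-quantifier"
    if w == "both":
        return "both"
    return None
```

Because the finer class is a function of the word alone, no information is lost by storing only PDT.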
Beyond reducing lexically recoverable distinctions, we also eliminated certain POS
distinctions that are recoverable with reference to syntactic structure. For instance, the
Penn Treebank tagset does not distinguish subject pronouns from object pronouns
even in cases where the distinction is not recoverable from the pronoun's form, as
with you, since the distinction is recoverable on the basis of the pronoun's position
in the parse tree in the parsed version of the corpus. Similarly, the Penn Treebank
tagset conflates subordinating conjunctions with prepositions, tagging both categories
as IN. The distinction between the two categories is not lost, however, since
subordinating conjunctions can be recovered as those instances of IN that precede clauses,
whereas prepositions are those instances of IN that precede noun phrases or
prepositional phrases. We would like to emphasize that the lexical and syntactic recoverability
inherent in the POS-tagged version of the Penn Treebank corpus allows end users to
employ a much richer tagset than the small one described in Section 2.2 if the need
arises.
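The recovery rule for IN can be illustrated with a small Python sketch. It assumes a parse is available as nested (label, children) tuples and that clauses, noun phrases and prepositional phrases carry the labels S/SBAR, NP and PP; the tree encoding, the label sets and the function name are assumptions made for illustration.

```python
# A constituent is (label, children); a leaf is (tag, word).
# The label sets below are assumed names for clause and phrase constituents.
CLAUSE_LABELS = {"S", "SBAR"}
PHRASE_LABELS = {"NP", "PP"}


def classify_in(siblings, index):
    """Classify the IN node at siblings[index] by the constituent that follows it."""
    if siblings[index][0] != "IN":
        raise ValueError("node at index is not tagged IN")
    if index + 1 >= len(siblings):
        return "unknown"
    next_label = siblings[index + 1][0]
    if next_label in CLAUSE_LABELS:
        return "subordinating conjunction"
    if next_label in PHRASE_LABELS:
        return "preposition"
    return "unknown"


# "after the game": IN followed by a noun phrase
pp = [("IN", "after"), ("NP", [("DT", "the"), ("NN", "game")])]
# "after the game ended": IN followed by a clause
sbar = [
    ("IN", "after"),
    ("S", [("NP", [("DT", "the"), ("NN", "game")]), ("VP", [("VBD", "ended")])]),
]

print(classify_in(pp, 0))    # preposition
print(classify_in(sbar, 0))  # subordinating conjunction
```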
2.1.2 Consistency. As noted above, one reason for eliminating a POS tag such as RN
(nominal adverb) is its lexical recoverability. Another important reason for doing so is
consistency. For instance, in the Brown Corpus, the deictic adverbs there and now are
always tagged RB (adverb), whereas their counterparts here and then are inconsistently
tagged as RB (adverb) or RN (nominal adverb), even in identical syntactic contexts,
such as after a preposition. It is clear that reducing the size of the tagset reduces the
chances of such tagging inconsistencies.
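One way to look for this kind of inconsistency is to group tokens by word and left context and report those that received more than one tag. The Python sketch below assumes sentences are lists of (word, tag) pairs and uses the preceding tag as the context; the sample sentences are invented for illustration and are not Brown Corpus data.

```python
from collections import defaultdict


def inconsistent_taggings(tagged_sentences):
    """Return {(previous_tag, word): tags} for words tagged more than one way
    after the same preceding tag."""
    seen = defaultdict(set)
    for sentence in tagged_sentences:
        prev_tag = "<s>"  # marker for sentence start
        for word, tag in sentence:
            seen[(prev_tag, word.lower())].add(tag)
            prev_tag = tag
    return {key: tags for key, tags in seen.items() if len(tags) > 1}


# Invented sample: "here" after a preposition is tagged two different ways.
sample = [
    [("from", "IN"), ("here", "RB")],
    [("from", "IN"), ("here", "RN")],
    [("from", "IN"), ("there", "RB")],
]
print(inconsistent_taggings(sample))
```

A preceding-tag context is only a rough stand-in for "identical syntactic contexts"; a parsed corpus would allow a stricter comparison.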
2.1.3 Syntactic Function. A further difference between the Penn Treebank and the
Brown Corpus concerns the significance accorded to syntactic context. In the Brown
Corpus, words tend to be tagged independently of their syntactic function.[6] For
instance, in the phrase the one, one is always tagged as CD (cardinal number), whereas
in the corresponding plural phrase the ones, ones is always tagged as NNS (plural
common noun), despite the parallel function of one and ones as heads of the noun phrase.
By contrast, since one of the main roles of the tagged version of the Penn Treebank
corpus is to serve as the basis for a bracketed version of the corpus, we encode a
word's syntactic function in its POS tag whenever possible. Thus, one is tagged as NN
(singular common noun) rather than as CD (cardinal number) when it is the head of
a noun phrase. Similarly, while the Brown Corpus tags both as ABX (pre-quantifier,
double conjunction), regardless of whether it functions as a prenominal modifier (both
the boys), a postnominal modifier (the boys both), the head of a noun phrase (both of
the boys) or part of a complex coordinating conjunction (both boys and girls), the Penn
Treebank tags both differently in each of these syntactic contexts: as PDT
(predeterminer), RB (adverb), NNS (plural common noun) and CC (coordinating
conjunction), respectively.
6 An important exception is there, which the Brown Corpus tags as EX (existential there) when it is used
as a formal subject and as RB (adverb) when it is used as a locative adverb. In the case of there, we did
not pursue our strategy of tagset reduction to its logical conclusion, which would have implied tagging
existential there as NN (common noun).
Computational Linguistics, Volume 19, Number 2
There is one case in which our concern with tagging by syntactic function has led
us to bifurcate Brown Corpus categories rather than to collapse them: namely, in the
case of the uninflected form of verbs. Whereas the Brown Corpus tags the bare form
of a verb as VB regardless of whether it occurs in a tensed clause, the Penn Treebank
tagset distinguishes VB (infinitive or imperative) from VBP (non-third person singular
present tense).
2.1.4 Indeterminacy. A final difference between the Penn Treebank tagset and all other
tagsets we are aware of concerns the issue of indeterminacy: both POS ambiguity in
the text and annotator uncertainty. In many cases, POS ambiguity can be resolved with
reference to the linguistic context. So, for instance, in Katharine Hepburn's witty line
"Grant can be outspoken - but not by anyone I know," the presence of the by-phrase forces
us to consider outspoken as the past participle of a transitive derivative of speak
(outspeak) rather than as the adjective outspoken. However, even given explicit criteria
for assigning POS tags to potentially ambiguous words, it is not always possible to
assign a unique tag to a word with confidence. Since a major concern of the Treebank
is to avoid requiring annotators to make arbitrary decisions, we allow words to be
associated with more than one POS tag. Such multiple tagging indicates either that
the word's part of speech simply cannot be decided or that the annotator is unsure
which of the alternative tags is the correct one. In principle, annotators can tag a word
with any number of tags, but in practice, multiple tags are restricted to a small number
of recurring two-tag combinations: JJ|NN (adjective or noun as prenominal modifier),
JJ|VBG (adjective or gerund/present participle), JJ|VBN (adjective or past participle),
NN|VBG (noun or gerund), and RB|RP (adverb or particle).
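A reader of the tagged corpus has to allow for these multiple tags. The Python sketch below assumes tokens are written as word/TAG, as in Table 1, with alternatives separated by a vertical bar (for example JJ|NN); the function names are illustrative and the sample tokens are made up.

```python
def parse_token(token):
    """Split 'word/TAG' or 'word/TAG1|TAG2' into (word, [tags]).

    The split is on the last slash, so words that contain a slash stay intact.
    """
    word, sep, tags = token.rpartition("/")
    if not sep or not word or not tags:
        raise ValueError(f"not a word/TAG token: {token!r}")
    return word, tags.split("|")


def is_indeterminate(token):
    """True when the annotator left more than one tag on the token."""
    return len(parse_token(token)[1]) > 1


print(parse_token("sings/VBZ"))         # ('sings', ['VBZ'])
print(parse_token("outspoken/JJ|VBN"))  # ('outspoken', ['JJ', 'VBN'])
print(is_indeterminate("outspoken/JJ|VBN"))  # True
```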
2.2 The POS Tagset
The Penn Treebank tagset is given in Table 2. It contains 36 POS tags and 12 other
tags (for punctuation and currency symbols). A detailed description of the guidelines
governing the use of the tagset is available in Santorini (1990).[7]
2.3 The POS Tagging Process
The tagged version of the Penn Treebank corpus is produced in two stages, using a
combination of automatic POS assignment and manual correction.
7 In versions of the tagged corpus distributed before November 1992, singular proper nouns, plural
proper nouns, and personal pronouns were tagged as "NP," "NPS," and "PP," respectively. The current
tags "NNP," "NNPS," and "PRP" were introduced in order to avoid confusion with the syntactic tags
"NP" (noun phrase) and "PP" (prepositional phrase) (see Table 3).
Table 2
The Penn Treebank POS tagset
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition/subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol (mathematical or scientific)
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund/present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd ps. sing. present
32. VBZ Verb, 3rd ps. sing. present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb
37. # Pound sign
38. $ Dollar sign
39. . Sentence-final punctuation
40. , Comma
41. : Colon, semi-colon
42. ( Left bracket character
43. ) Right bracket character
44. " Straight double quote
45. ` Left open single quote
46. `` Left open double quote
47. ' Right close single quote
48. '' Right close double quote
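For programmatic use, the 36 POS tags of Table 2 can be held in a dictionary. The Python sketch below copies the tag names and descriptions from the table (including PP$ as it is listed there) and leaves out the 12 punctuation and currency tags; the names PENN_POS_TAGS and describe are illustrative assumptions, as is the vertical-bar notation for multiple tags.

```python
# The 36 POS tags of Table 2, in table order (punctuation and currency tags omitted).
PENN_POS_TAGS = {
    "CC": "Coordinating conjunction",
    "CD": "Cardinal number",
    "DT": "Determiner",
    "EX": "Existential there",
    "FW": "Foreign word",
    "IN": "Preposition/subordinating conjunction",
    "JJ": "Adjective",
    "JJR": "Adjective, comparative",
    "JJS": "Adjective, superlative",
    "LS": "List item marker",
    "MD": "Modal",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "NNPS": "Proper noun, plural",
    "PDT": "Predeterminer",
    "POS": "Possessive ending",
    "PRP": "Personal pronoun",
    "PP$": "Possessive pronoun",  # spelled as in Table 2
    "RB": "Adverb",
    "RBR": "Adverb, comparative",
    "RBS": "Adverb, superlative",
    "RP": "Particle",
    "SYM": "Symbol (mathematical or scientific)",
    "TO": "to",
    "UH": "Interjection",
    "VB": "Verb, base form",
    "VBD": "Verb, past tense",
    "VBG": "Verb, gerund/present participle",
    "VBN": "Verb, past participle",
    "VBP": "Verb, non-3rd ps. sing. present",
    "VBZ": "Verb, 3rd ps. sing. present",
    "WDT": "wh-determiner",
    "WP": "wh-pronoun",
    "WP$": "Possessive wh-pronoun",
    "WRB": "wh-adverb",
}


def describe(tag):
    """Return one description per alternative in a tag such as 'VBZ' or 'JJ|NN'.

    Raises KeyError for a tag that is not among the 36 POS tags.
    """
    return [PENN_POS_TAGS[t] for t in tag.split("|")]


print(len(PENN_POS_TAGS))  # 36
print(describe("VBZ"))     # ['Verb, 3rd ps. sing. present']
print(describe("JJ|NN"))   # ['Adjective', 'Noun, singular or mass']
```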