penn tree bank 3/n

2.3.1 Automated Stage. During the early stages of the Penn Treebank project, the
initial automatic POS assignment was provided by PARTS (Church 1988), a stochastic
algorithm developed at AT&T Bell Labs. PARTS uses a modified version of the Brown
Corpus tagset close to our own and assigns POS tags with an error rate of 3-5%. The
output of PARTS was automatically tokenized8 and the tags assigned by PARTS were
automatically mapped onto the Penn Treebank tagset. This mapping introduces about
4% error, since the Penn Treebank tagset makes certain distinctions that the PARTS
tagset does not.9 A sample of the resulting tagged text, which has an error rate of
7-9%, is shown in Figure 1.
    More recently, the automatic POS assignment is provided by a cascade of stochastic
and rule-driven taggers developed on the basis of our early experience. Since these
taggers are based on the Penn Treebank tagset, the 4% error rate introduced as an
artefact of mapping from the PARTS tagset to ours is eliminated, and we obtain error
rates of 2-6%.
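The tokenization step mentioned above can be pictured with a small sketch. It follows only the convention given in the accompanying footnote (children's becomes "children" + "'s", won't becomes "wo-" + "n't"). The function names `split_clitics` and `tokenize` are hypothetical, and the handling of every contraction other than the two quoted examples is an assumption rather than the project's actual tokenizer. Punctuation attached to a word is not handled.

```python
def split_clitics(word):
    """Split a negative contraction (n't) or a final 's off a word.

    Illustrative only: the rules beyond the two examples quoted in the
    text are assumptions.
    """
    lower = word.lower()
    if lower.endswith("n't") and len(word) > 3:
        stem = word[:-3]
        # The text writes the truncated stem of "won't" as "wo-".
        if stem.lower() == "wo":
            stem += "-"
        return [stem, word[-3:]]
    if lower.endswith("'s") and len(word) > 2:
        return [word[:-2], word[-2:]]
    return [word]


def tokenize(text):
    """Whitespace-split the text, then split clitics off each word."""
    tokens = []
    for word in text.split():
        tokens.extend(split_clitics(word))
    return tokens


print(tokenize("children's toys won't break"))
```

Each resulting token can then receive its own tag, which is what makes compound tags unnecessary.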
2.3.2 Manual Correction Stage. The result of the first, automated stage of POS tagging
is given to annotators to correct. The annotators use a mouse-based package written
in GNU Emacs Lisp, which is embedded within the GNU Emacs editor (Lewis et al.
1990). The package allows annotators to correct POS assignment errors by positioning
the cursor on an incorrectly tagged word and then entering the desired correct tag
(or sequence of multiple tags). The annotators' input is automatically checked against
the list of legal tags in Table 2 and, if valid, appended to the original word-tag pair,
separated from it by an asterisk. Appending the new tag rather than replacing the old
tag allows us to easily identify recurring errors at the automatic POS assignment stage.
We believe that the confusion matrices that can be extracted from this information
should also prove useful in designing better automatic taggers in the future. The result
of this second stage of POS tagging is shown in Figure 2. Finally, in the distribution
version of the tagged corpus, any incorrect tags assigned at the first, automatic stage
are removed.
    Annotators learn the POS tagging task in under a month (at 15 hours a
week), and annotation speeds after a month exceed 3,000 words per hour.
8 In contrast to the Brown Corpus, we do not allow compound tags of the sort illustrated above for I'm.
Rather, contractions and the Anglo-Saxon genitive of nouns are automatically split into their
component morphemes, and each morpheme is tagged separately. Thus, children's is tagged
  "children/NNS 's/POS," and won't is tagged "wo-/MD n't/RB."
9 The two largest sources of mapping error are that the PARTS tagset distinguishes neither infinitives
from non-third person singular present tense forms of verbs, nor prepositions from particles in cases
like run up a hill and run up a bill.
Computational Linguistics
Volume 19, Number 2
    Battle-tested/NNP industrial/JJ managers/NNS here/RB
always/RB buck/VB up/IN nervous/JJ newcomers/NNS with/IN the/DT tale/NN
of/IN the/DT first/JJ of/IN their/PP$ countrymen/NNS to/TO visit/VB
Mexico/NNP ,/, a/DT boatload/NN of/IN samurai/NNS warriors/NNS
blown/VBN ashore/RB 375/CD years/NNS ago/RB ./.
    ``/`` From/IN the/DT beginning/NN ,/, it/PRP took/VBD a/DT man/NN
with/IN extraordinary/JJ qualities/NNS to/TO succeed/VB in/IN Mexico/NNP ,/,
''/'' says/VBZ Kimihide/NNP Takimura/NNP ,/, president/NN of/IN Mitsui/NNS
group/NN 's/POS Kensetsu/NNP Engineering/NNP Inc./NNP unit/NN ./.
Figure 1
Sample tagged text: before correction
    Battle-tested/NNP*/JJ industrial/JJ managers/NNS here/RB
always/RB buck/VB*/VBP up/IN*/RP nervous/JJ newcomers/NNS with/IN
the/DT tale/NN of/IN the/DT first/JJ of/IN their/PP$ countrymen/NNS to/TO
visit/VB Mexico/NNP ,/, a/DT boatload/NN of/IN samurai/NNS*/FW
warriors/NNS blown/VBN ashore/RB 375/CD years/NNS ago/RB ./.
    ``/`` From/IN the/DT beginning/NN ,/, it/PRP took/VBD a/DT man/NN
with/IN extraordinary/JJ qualities/NNS to/TO succeed/VB in/IN Mexico/NNP ,/,
''/'' says/VBZ Kimihide/NNP Takimura/NNP ,/, president/NN of/IN
Mitsui/NNS*/NNP group/NN 's/POS Kensetsu/NNP Engineering/NNP Inc./NNP
unit/NN ./.
Figure 2
Sample tagged text: after correction
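The asterisk convention of the manual correction stage lends itself to simple processing. The sketch below reads tokens of the form word/TAG or word/TAG*/NEWTAG, counts (automatic tag, corrected tag) pairs as raw material for a confusion matrix, and produces a "distribution" form that keeps only the final tag. The function names are hypothetical, and the code assumes a single corrected tag per token (the text also allows a sequence of tags, which this sketch does not handle).

```python
from collections import Counter


def parse_token(token):
    """Return (word, automatic_tag, final_tag) for one tagged token."""
    if "*/" in token:
        head, corrected = token.rsplit("*/", 1)
    else:
        head, corrected = token, None
    # rsplit keeps any "/" inside the word itself, e.g. "./." -> (".", ".")
    word, auto_tag = head.rsplit("/", 1)
    final_tag = corrected if corrected is not None else auto_tag
    return word, auto_tag, final_tag


def confusion_counts(tagged_text):
    """Count how often each automatic tag was corrected to each other tag."""
    counts = Counter()
    for token in tagged_text.split():
        _, auto_tag, final_tag = parse_token(token)
        if auto_tag != final_tag:
            counts[(auto_tag, final_tag)] += 1
    return counts


def distribution_version(tagged_text):
    """Drop superseded automatic tags, keeping only word/final_tag."""
    kept = []
    for token in tagged_text.split():
        word, _, final_tag = parse_token(token)
        kept.append(f"{word}/{final_tag}")
    return " ".join(kept)


sample = "always/RB buck/VB*/VBP up/IN*/RP nervous/JJ newcomers/NNS"
print(confusion_counts(sample))
print(distribution_version(sample))
```

Because the original tag is kept until the distribution step, the same file serves both as the corrected corpus and as a record of the automatic tagger's errors.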
3. Two Modes of Annotation: An Experiment
To determine how to maximize the speed, inter-annotator consistency, and accuracy of
POS tagging, we performed an experiment at the very beginning of the project to
compare two alternative modes of annotation. In the first annotation mode ("tagging"),
annotators tagged unannotated text entirely by hand; in the second mode
("correcting"), they verified and corrected the output of PARTS, modified as described
above.
Mitchell P. Marcus et al.
Building a Large Annotated Corpus of English
This experiment showed that manual tagging took about twice as long as correcting,
with about twice the inter-annotator disagreement rate and an error rate that was
about 50% higher.
    Four annotators, all with graduate training in linguistics, participated in the
experiment. All completed a training sequence consisting of 15 hours of correcting
followed by 6 hours of tagging. The training material was selected from a variety of
nonfiction genres in the Brown Corpus. All the annotators were familiar with GNU Emacs
at the outset of the experiment. Eight 2,000-word samples were selected from the Brown
Corpus, two each from four different genres (two fiction, two nonfiction), none of which
any of the annotators had encountered in training. The texts for the correction task
were automatically tagged as described in Section 2.3. Each annotator first manually
tagged four texts and then corrected four automatically tagged texts. Each annotator
completed the four genres in a different permutation.
    A repeated measures analysis of annotation speed with annotator identity, genre,
and annotation mode (tagging vs. correcting) as classification variables showed a
significant annotation mode effect (p = .05). No other effects or interactions were
significant. The average speed for correcting was more than twice as fast as the average
speed for tagging: 20 minutes vs. 44 minutes per 1,000 words. (Median speeds per
1,000 words were 22 vs. 42 minutes.)
    A simple measure of tagging consistency is inter-annotator disagreement rate, the
rate at which annotators disagree with one another over the tagging of lexical tokens,
expressed as a percentage: the raw number of such disagreements divided by the number
of words in a given text sample. For a given text and n annotators, there are
$\binom{n}{2} = n(n-1)/2$ disagreement ratios (one for each possible pair of annotators).
Mean inter-annotator disagreement was 7.2% for the tagging task and 4.1% for the
correcting task (with medians 7.2% and 3.6%, respectively). Upon examination, a
disproportionate amount of disagreement in the correcting case was found to be caused
by one text that contained many instances of a cover symbol for chemical and other
formulas. In the absence of an explicit guideline for tagging this case, the annotators
had made different decisions on what part of speech this cover symbol represented.
When this text is excluded from consideration, mean inter-annotator disagreement for
the correcting task drops to 3.5%, with the median unchanged at 3.6%.
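The pairwise disagreement measure described above can be sketched as follows. The function names and the toy tag sequences are hypothetical and are not data from the experiment. The sketch assumes all annotators tagged the same tokenized text, so that their tag sequences align position by position.

```python
from itertools import combinations
from statistics import mean, median


def disagreement_rate(tags_a, tags_b):
    """Percentage of token positions on which two annotators disagree."""
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must cover the same tokens")
    differing = sum(a != b for a, b in zip(tags_a, tags_b))
    return 100.0 * differing / len(tags_a)


def pairwise_disagreement(annotations):
    """Map each unordered annotator pair to its disagreement rate."""
    return {
        (x, y): disagreement_rate(annotations[x], annotations[y])
        for x, y in combinations(sorted(annotations), 2)
    }


# Toy example: four annotators, one five-token text (made-up tags).
annotations = {
    "A": ["DT", "NN", "VBD", "IN", "NN"],
    "B": ["DT", "NN", "VBD", "RP", "NN"],
    "C": ["DT", "NN", "VBN", "IN", "NN"],
    "D": ["DT", "NN", "VBD", "IN", "NN"],
}
rates = pairwise_disagreement(annotations)
print(len(rates))             # number of annotator pairs
print(mean(rates.values()))   # mean disagreement, in percent
print(median(rates.values())) # median disagreement, in percent
```

With four annotators this yields six ratios per text, which is why a single unusual text can noticeably shift the mean while leaving the median almost unchanged.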
    Consistency, while desirable, tells us nothing about the validity of the annotators'
corrections. We therefore compared each annotator's output not only with the output
of each of the others, but also with a benchmark version of the eight texts. This
benchmark version was derived from the tagged Brown Corpus by (1) mapping the
original Brown Corpus tags onto the Penn Treebank tagset and (2) carefully
hand-correcting the revised version in accordance with the tagging conventions in force
at the time of the experiment. Accuracy was then computed as the rate of disagreement
between each annotator's results and the benchmark version, so lower figures are
better. The mean disagreement with the benchmark was 5.4% for the tagging task
(median 5.7%) and 4.0% for the correcting task (median 3.4%). Excluding the same text
as above gives a revised mean for the correcting task of 3.4%, with the median
unchanged.
    We obtained a further measure of the annotators' accuracy by comparing their
error rates to the rates at which the raw output of Church's PARTS program
(appropriately modified to conform to the Penn Treebank tagset) disagreed with the
benchmark version. The mean disagreement rate between PARTS and the benchmark version was
9.6%, while the corrected version had a mean disagreement rate of 5.4%, as noted
above. The annotators were thus reducing the error rate by about 4.2 percentage points.
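The reduction just quoted is the difference between two percentages. As a worked illustration, the sketch below computes that absolute difference from the two figures given in the text, along with the corresponding relative reduction. The relative figure is not stated in the source and is included only as arithmetic on the reported numbers. The function name `error_reduction` is hypothetical.

```python
def error_reduction(before_pct, after_pct):
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = before_pct - after_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


# Figures reported in the text: PARTS output vs. the corrected version.
absolute, relative = error_reduction(9.6, 5.4)
print(f"absolute reduction: {absolute:.1f} percentage points")
print(f"relative reduction: roughly {relative:.0f}% of the original errors")
```

On these figures the annotators removed a little under half of the disagreements with the benchmark.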