Chapter 1. Introduction
diagnostic tools for the evaluation of exam prompts, using the corpus analyses as
a baseline.
In the first stage of the project, we constructed the T2K-SWAL Corpus, which
was designed to represent both spoken and written university registers, as well
as the major academic disciplines (e.g., humanities, natural sciences, social
sciences) and academic levels (lower division, upper division, graduate). The
corpus included both academic registers, such as lectures, textbooks, and course
reading packets, and institutional registers, such as university catalogs, course
syllabi, and service encounters on campus. The corpus is described in detail in
Chapter 2.
In the second stage of the project, we analyzed the linguistic patterns of
variation in the corpus, considering differences associated with register,
discipline, and level. All linguistic features included in Biber (1988) were
analyzed, as well as many additional grammatical features from the LGSWE (Biber
et al. 1999). In addition, we carried out extensive analyses of vocabulary
distribution and lexical bundles. The procedures for these analyses are described
in Chapters 2 and 3, while the results of the analyses are covered in Chapters 4–8.
In the final stage of the project, we shifted our attention to the development
of diagnostic tools. These tools analyze the linguistic characteristics of a
particular text and assess the extent to which that text is representative of a
target register. For example, an exam author might want to evaluate the
representativeness of a new text constructed as an upper division science lecture,
or assess whether a certain textbook passage is representative of the textbook
category overall. The tools present the linguistic characteristics of the target
register as the baseline for comparison, and then they analyze the linguistic
characteristics of the selected text in relation to that baseline. These
diagnostic tools are described in Biber et al. (2004).
?.?Overview of the present book
The present book builds on the research efforts in the T2K-SWAL Project to
provide a broad linguistic description of university language. Rather than focus
on academic research articles and other stereotypically academic registers, the
book analyzes a wide range of registers encountered by students in university
life. These registers fall into two general categories: (1) educational language,
and (2) task management language.
Educational language includes all spoken and written registers that relate to
teaching or learning. Educational language can be primarily teacher-centered
(e.g., classroom lectures and textbooks) or co-constructed by teachers and
students (e.g., lab sessions, office hours, and study groups). The focus of the
present investigation is on the language that students encounter in the
university, rather than the language that students actually produce. As a result,
the study excludes registers like student term papers or student class
presentations.
Task management language occurs in situations where students are told how to
successfully complete a university education: registers like course syllabi,
university catalogs, program brochures, and classroom management talk (e.g.,
discussions of course requirements). Some registers, like office hours, will
normally include both educational language and task management language.
The central goal of the book is simple: to provide a relatively comprehensive
linguistic description of the range of university registers, surveying the
distinctive linguistic characteristics of each register. These linguistic
descriptions include vocabulary distributions, semantic categories of words,
extended lexical expressions (‘lexical bundles’), grammatical features, and more
complex syntactic constructions. The linguistic patterns are interpreted in
functional terms, resulting in an overall characterization of the typical kinds
of language that students encounter in university registers: academic and
non-academic; spoken and written.
The following chapter provides a detailed description of the T2K-SWAL Corpus
and introduces the methods used for the linguistic analyses. The remainder of the
book, then, is organized according to different types of linguistic research
questions. These descriptions begin with the study of vocabulary distributions
(Chapter 3) and grammatical characteristics (Chapter 4). The following two
chapters then focus on more specific aspects of language use: the linguistic
expression of stance (Chapter 5) and the use of lexical bundles in university
registers (Chapter 6). Chapter 7 takes a different perspective, presenting the
results of a Multi-Dimensional analysis that describes the overall patterns of
linguistic variation among university registers and academic disciplines.
Finally, Chapter 8 synthesizes these linguistic descriptions, providing an
overall description of the distinctive characteristics of each register.
Note
1. In earlier Multi-Dimensional studies (e.g., Biber 1988), I use the term
‘genre’ instead of ‘register’ as a general cover term for situationally-defined
varieties.
Chapter 2. The Spoken and Written Academic Language (T2K-SWAL) Corpus
Chapter co-authors: Susan Conrad, Randi Reppen, Pat Byrd, Marie Helt
The descriptions of university language in this book grew out of the TOEFL 2000
Spoken and Written Academic Language (T2K-SWAL) Project (see Biber et al. 2004).
As explained in the last chapter, that project was sponsored by Educational
Testing Service to carry out a comprehensive linguistic analysis of university
registers, with the ultimate goal of determining whether the language used in the
TOEFL exam tasks is representative of actual language use in universities.
The first stage of the project was to construct the TOEFL 2000 Spoken and
Written Academic Language Corpus (T2K-SWAL Corpus). We designed the T2K-SWAL
Corpus to be relatively large (2.7 million words) as well as representative of
the range of spoken and written registers that students encounter in U.S.
universities and of the major academic disciplines (e.g., humanities, natural
sciences) and academic levels (lower division, upper division, and graduate). The
corpus included both academic registers, such as lectures, textbooks, and course
reading packets, and institutional registers, such as university catalogs, course
syllabi, and service encounters. We did not include more general registers, such
as fiction, newspapers, or casual conversation. Although these registers are used
on campus, they are not university-specific registers. We also did not include
e-mail correspondence between instructors and students or electronic postings by
students as part of course work. Although these registers deserve study in the
future, they were not part of the focus for the T2K-SWAL project.
A detailed description of the T2K-SWAL Corpus is given in Biber et al. (2004;
also available on-line at www.ets.org/ell/research/toeflmonograph.html). The
following sections summarize the major aspects of design and construction.

Table 2.1 Composition of the T2K-SWAL Corpus

Register                    # of texts    # of words
Spoken:
  Class sessions                   176     1,248,800
  Classroom management*           (40)        39,300
  Labs/In-class groups              17        88,200
  Office hours                      11        50,400
  Study groups                      25       141,100
  Service encounters                22        97,700
  Total speech               251 (+40)     1,665,500
Written:
  Textbooks                         87       760,600
  Course packs                      27       107,200
  Course management                 21        52,400
  Institutional writing             37       151,500
  Total writing                    172     1,071,700
TOTAL CORPUS                       423     2,737,200

* Classroom management texts are extracted from the “class session” tapes, so
they are not included in the total tape counts.

2.1 Design and construction of the T2K-SWAL Corpus
The register categories chosen for the T2K-SWAL corpus were sampled from across
the range of spoken and written activities associated with academic life,
including classroom teaching, office hours, study groups, on-campus service
encounters, textbooks, course packs, and institutional written materials (e.g.,
university catalogs, brochures). The depth of sampling for each register category
reflects our assessment of its relative availability and importance; for example,
there are many more different texts and total words for class sessions and
textbooks than for office hours and course packs. Table 2.1 shows the overall
composition of the corpus by register category.
Data collection focused on capturing naturally-occurring discourse. One major
concern that we needed to address was that the presence of researchers in spoken
settings was likely to be intrusive and therefore result in somewhat artificial
discourse. As a result, we employed participants who already worked or studied in
the settings where we wanted to collect data. They carried tape recorders and
recorded speech as it occurred spontaneously. We obtained high quality, natural
interactions using this approach; the major disadvantage was that we did not
observe the interactions first-hand and thus were not able to obtain detailed
information about the setting and participants.
For the spoken corpus, we used students as our primary participants, recruiting
them to record classroom teaching, study groups, and other academic
conversations. Student participants recorded the class sessions and study groups
that they were involved in during a two week period, keeping a log of speech
events and participants to the extent that it was practical. We also recruited
faculty members to help with the recording of office hours, and university staff
for service encounters.
The collection of texts from class sessions was designed to include a range of
teaching styles, as measured by the extent of interactiveness. Three levels of
interactiveness are distinguished for classroom teaching:
- Low interactiveness: fewer than 10 turns per 1,000 words (i.e., average length
  of turn longer than 100 words per turn): 54 class sessions; 337,800 words
- Medium interactiveness: between 10 and 25 turns per 1,000 words (i.e., average
  length of turn between 40–100 words per turn): 64 class sessions; 448,400 words
- High interactiveness: more than 25 turns per 1,000 words (i.e., average length
  of turn shorter than 40 words per turn): 75 class sessions; 550,900 words
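
The three interactiveness levels above reduce to a simple threshold rule on the
turn rate. The following is a minimal sketch of that rule, not the project's own
software; the function names are hypothetical, and treating exactly 10 and
exactly 25 turns per 1,000 words as "medium" is an assumption based on the
wording "between 10 and 25".

```python
def turns_per_thousand(num_turns: int, num_words: int) -> float:
    """Turn rate normalized to 1,000 words of transcript."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return 1000.0 * num_turns / num_words


def interactiveness_level(num_turns: int, num_words: int) -> str:
    """Classify a class session by the thresholds given in the text.

    Assumption: the boundary values 10 and 25 count as 'medium'.
    """
    rate = turns_per_thousand(num_turns, num_words)
    if rate < 10:
        return "low"
    if rate <= 25:
        return "medium"
    return "high"


if __name__ == "__main__":
    # A hypothetical 6,000-word lecture with 45 speaker turns:
    # 7.5 turns per 1,000 words, i.e. about 133 words per turn.
    print(interactiveness_level(45, 6000))
```

The equivalence with average turn length follows directly: 10 turns per 1,000
words is 100 words per turn, and 25 turns per 1,000 words is 40 words per turn.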
Service encounters were recorded in locations where students regularly interact
with university staff conducting the business of the university. We distinguished
two major types of service encounters: for regular commerce (coffee shop,
university book store, copy shop) and for other university services (student
business services, academic department offices, the library reference desk, the
front desk in a dormitory, the media center). We collected 22 tapes at these
locations; these represent 97,700 words and hundreds of individual service
encounters.1
For classroom teaching and textbooks, we sampled texts from six major
disciplines (Business, Education, Engineering, Humanities, Natural Science,
Social Science) and three levels of education (lower division undergraduate,
upper division undergraduate, graduate). Table 2.2 shows the breakdown of texts
by discipline.
Recognizing the existence of systematic variation within each of these
high-level disciplines, the corpus design targeted specific sub-disciplines
(e.g., accounting, anthropology, astronomy; see Biber et al. 2004, Tables 6 and
7). Rather than aiming for an exhaustive sampling of sub-disciplines, we
collected texts from specific sub-disciplines within each major discipline
(represented by at least 3 text samples). While these distinctions will enable
register comparisons at a more specific level in future research, the analyses in
the present book are restricted to the major disciplinary categories.
Course packs include written texts of several types: lecture notes, study
guides, and detailed descriptions of assignments or experimental procedures
written by the instructor, in addition to photocopies of published journal
articles and book chapters. Similar to the sampling procedures used for
textbooks, course packs were collected from all the major disciplines and a range
of the sub-disciplines.2

Table 2.2 Breakdown of classroom teaching and textbooks by discipline

Discipline            # of texts    # of words
Classroom teaching
  Business                    36       236,400
  Education                   16       137,200
  Engineering                 30       171,300
  Humanities                  31       248,600
  Natural Science             25       160,800
  Social Science              38       294,400
Textbooks
  Business                    15       116,200
  Education                    6        50,100
  Engineering                  9        72,000
  Humanities                  18       164,100
  Natural Science             18       145,200
  Social Science              21       213,000

Table 2.3 Breakdown of texts within institutional writing

Category                                                # of computer files    # of words
Academic program brochures                                                7        22,500
University catalogs: academic program descriptions                       10        27,400
University catalogs: admissions, requirements, etc.                       9        52,500
Student handbooks                                                         9        43,800
University magazine articles                                              2         2,700

Finally, the category of ‘institutional written material’ attempted to represent
the range of miscellaneous campus-related written texts that students encounter.
Many of these texts are among the first material that a prospective student
receives from a university, either through paper copy or on the Web:
informational brochures about student services and academic programs, university
catalogs (including both discussion of general requirements and specific
programs), etc. Although not often considered ‘academic discourse’, written
material of this type is ubiquitous on campus and required reading for the
prospective student attempting to navigate the maze of university requirements
and services. Many of these texts are very short (e.g., from academic program
brochures), so in some cases we include multiple texts in a single computer file.
Table 2.3 displays the breakdown of texts within institutional writing.

Table 2.4 Breakdown of spoken texts by university

University                          # of texts    # of words
Northern Arizona University                140       787,700
Georgia State University                    56       369,200
Iowa State University                       49       275,400
Cal State University, Sacramento            34       222,800

We collected spoken texts at four academic sites (Northern Arizona University,
Iowa State University, California State University at Sacramento, Georgia State
University). Table 2.4 shows the breakdown of transcribed texts by university.
(Many additional texts were tape recorded but could not be transcribed within the
scope of the project.) Written materials were collected from all four
universities, with the exception of course packs. Because there was little
variation in the types of texts included in course packs at the four
universities, we collected these texts only at Iowa State University.
Although we did not achieve full demographic/institutional representativeness,
we aimed to avoid obvious skewing for these factors. Thus, the corpus materials
were collected from four major regions in the U.S.: west coast, rocky mountain
west, mid-west, and the deep south. Further, we collected materials from four
different types of academic institutions: a teacher’s college (California State,
Sacramento), a mid-size regional university (Northern Arizona University), an
urban research university (Georgia State), and a Research 1 university with a
national reputation in agriculture and engineering (Iowa State). The collection
procedures were approved by the Human Subjects Review Boards at all the
universities. The amount collected from each written text conformed to copyright
laws as interpreted by legal advisors at Educational Testing Service.
2.2 Transcription, scanning, and editing of texts in the T2K-SWAL Corpus
All texts in the corpus are coded with a header to identify content area and
register. Spoken texts were transcribed using a consistent transcription
convention (see Edwards & Lampert 1993), and to the extent possible speakers were
distinguished and some demographic information supplied in the header for each
speaker (e.g., their status as instructor vs. student). Conventional spellings
were used for all words except the following: OK, cuz, yup, nope, mm, mhm, um,
uh. Grammatical dysfluencies were transcribed exactly as they occurred.
Written texts were scanned to disk or copied from websites. All texts were
edited to ensure accuracy in scanning.
2.3 Grammatical tagging and tag-editing
All texts in the T2K-SWAL Corpus were grammatically annotated using an automatic
grammatical “tagger” (a computer program developed and revised over a 10 year
period by Biber). The tagger is designed to identify a large number of linguistic
features in written and spoken (transcribed) texts. It has various rules built in
for the tokenization of words (e.g., contractions are separated and treated as
two words, multi-word prepositions or subordinators are marked with ditto tags,
phrasal verbs are identified as such, etc.). However, it does not have rules to
disambiguate punctuation marks (especially ‘.’, which can be used as
sentence-final punctuation and for a wide range of abbreviations).
The tagset incorporates an extended version of the CLAWS tagset (see Garside,
Leech, & Sampson 1987). For example, the CLAWS VBN tag (past participle) is
extended by several tags that identify grammatical function, such as perfect
aspect verb, passive voice verb (further distinguishing among finite BY-passives,
finite agentless passives, and non-finite post-nominal modifiers), and participial
adjectives.
The tagger has four major components: a simple ‘look-up’ component for closed
classes and multi-word fixed phrases (e.g., identifying sequences of words as
fixed multi-word prepositions); a probabilistic component for individual words
(e.g., considering the probabilities for the word ‘abstract’ occurring as a noun,
verb, or adjective); a probabilistic component to compare the likelihood of each
possible tag sequence (working on a four-word window); and a rule-based
component.
The tagger uses a number of on-line dictionaries. For example, one dictionary
lists common content words, identifies their possible part-of-speech categories
(e.g., the word ‘run’ would be listed as a verb and a noun), and records the
probability for each of those part-of-speech categories. Other dictionaries store
multi-word grammatical units (e.g., ‘such as’, ‘that is’, ‘for example’) or other
lists of words with a specific grammatical function (e.g., all verbs that can
control a that-clause).
The probabilities used by the tagger were originally computed from a
distributional analysis of the LOB (Lancaster-Oslo-Bergen) Corpus. For example,
‘book’ and ‘runs’ are both noun-verb ambiguities, but ‘book’ has a very high
likelihood of being a noun (99% in the LOB expository genres), while ‘runs’ has a
high likelihood of being a verb (74%). Separate dictionaries were compiled for
expository/informational discourse versus non-expository discourse, to reflect
the differing lexical and grammatical preferences of the two. For example, many
past participial forms (e.g., ‘admitted’, ‘expected’) are much more likely to
function as past tense or perfect aspect verbs in fiction and other kinds of
non-informational discourse, while they are much more likely to function as
passive constructions in exposition. Many noun/verb ambiguities (e.g., ‘trust’,
‘rule’) are much more likely to occur as verbs in non-informational discourse and
as nouns in exposition. Among the function words, some preposition/subordinator
ambiguities (e.g., ‘before’, ‘as’) are more likely to occur as subordinators in
non-informational discourse, but more likely to function as prepositions in
exposition.

Table 2.5 Sentences from tagged texts

University textbook              Class session
The^ati++++                      I^pp1a+pp1+++
dissolved^jj+atrb++xvbn+         want^vb++++
components^nns++++               you^pp2+pp2+++
that^tht+rel+subj++              to^to++++
precipitate^vb++++               have^vbi+hv+vrb++
to^to++++                        two^cd++++
form^vbi++++                     books^nns++++
these^dt+dem+++                  for^in++++
rocks^nns++++                    the^ati++++
are^vb+ber+aux++                 class^nn++++
decomposed^vpsv++agls+xvbnx      .^.+clp+++
from^in++++
pre-existing^jj+atrb++xvbg+
rocks^nns++++
and^cc++++
minerals^nns++++
.^.+clp+++

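
The register-specific lexical probabilities discussed above can be pictured as a
nested look-up table. The sketch below is illustrative only and is not the
tagger's actual data structure: the dictionary layout and function name are
hypothetical, the two headline figures (99% noun for ‘book’, 74% verb for ‘runs’)
are the ones reported in the text, and the complementary values are assumed on
the basis that both words are described as two-way noun-verb ambiguities.

```python
from typing import Optional, Tuple

# Hypothetical layout: word -> discourse type -> {part of speech: probability}.
# Only the 0.99 and 0.74 figures come from the text; 0.01 and 0.26 are assumed
# complements for a two-way noun/verb ambiguity.
LEXICAL_PROBS = {
    "book": {"expository": {"noun": 0.99, "verb": 0.01}},
    "runs": {"expository": {"verb": 0.74, "noun": 0.26}},
}


def most_likely_tag(word: str, discourse_type: str) -> Optional[Tuple[str, float]]:
    """Return the most probable part of speech for a word, or None if unlisted.

    This models only the per-word probabilistic component; the tagger described
    in the text also weighs tag sequences and applies rules.
    """
    by_type = LEXICAL_PROBS.get(word.lower(), {})
    probs = by_type.get(discourse_type)
    if not probs:
        return None
    best = max(probs, key=probs.get)
    return best, probs[best]


if __name__ == "__main__":
    for w in ("book", "runs", "trust"):
        print(w, most_likely_tag(w, "expository"))
```

Keeping a separate inner table per discourse type mirrors the point made in the
text that the same word form can have different preferred tags in exposition and
in non-informational discourse.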
Tagged texts are produced in a vertical format: the running text appears in the
left-hand column, and the tags associated with each word are given to the right
(beginning with the delimiter ‘^’). Table 2.5 shows examples of tagged sentences
from a university textbook and a classroom session. The first tag field
identifies the major part of speech for each word; for example, jj marks an
adjective, and nn marks a noun. The remaining tag fields identify particular
grammatical functions or larger syntactic structures. For example, atrb in Field
2 marks an adjective as ‘attributive’. The tag sequence tht+rel+subj++ is used to
characterize the function word ‘that’ functioning as a ‘relative pronoun’, where
the gap in the following relative clause is in ‘subject’ position.
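
A line in this vertical format can be split mechanically into the word and its
tag fields. The sketch below assumes, from the examples in Table 2.5, that the
word and tags are separated by the first ‘^’ and that fields are separated by
‘+’, with empty strings for unused fields; the class and function names are
hypothetical.

```python
from typing import NamedTuple, Tuple


class TaggedWord(NamedTuple):
    word: str
    fields: Tuple[str, ...]  # tag fields in order; "" means the field is unused


def parse_tagged_line(line: str) -> TaggedWord:
    """Split one vertical-format line such as 'that^tht+rel+subj++'."""
    word, sep, tag_string = line.strip().partition("^")
    if not sep:
        raise ValueError(f"no '^' delimiter in line: {line!r}")
    return TaggedWord(word, tuple(tag_string.split("+")))


if __name__ == "__main__":
    tw = parse_tagged_line("dissolved^jj+atrb++xvbn+")
    print(tw.word)       # the running-text word
    print(tw.fields[0])  # Field 1: major part of speech
    print(tw.fields[1])  # Field 2: e.g. 'atrb' for an attributive adjective
```

Because Field 1 is `fields[0]`, the "Field 2" of the text corresponds to
`fields[1]` in this zero-indexed representation.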
After the texts in the corpus were grammatically coded by the automatic tagger,
the codes (or ‘tags’) were edited using an interactive tag-checker. While this
step is labor-intensive and extremely time consuming, it assured a high degree of
accuracy for the final annotated corpus (see Biber, Conrad, & Reppen 1998;
Methodology Boxes 4 & 5). We paid special attention to words that are
multi-functional and hard to disambiguate automatically, including ‘that’, WH
words, the form ’s, and past participles when they are not in main clauses (e.g.,
passive verbs as postnominal modifiers). We also checked the tagging of words
that were not in the dictionaries.
Tagging the corpus made it possible to conduct a series of more sophisticated
analyses than would have been possible with an untagged corpus. Using the
grammatical tags, further coding and categorizing of words and structures was
undertaken in order to facilitate the linguistic analyses of the corpus (see
Appendix A).
2.4 Overview of linguistic analyses
The primary goal of the present book is to provide a relatively comprehensive
linguistic description of university registers. Thus, the descriptions are based
on as wide a range of linguistic characteristics as possible, including any
linguistic features that have obvious functional associations (since these should
be important indicators of the differences among registers).
In selecting linguistic features for analysis, I relied on several previous
corpus-based studies. The descriptions incorporate many of the analytical
distinctions used in the Longman Grammar of Spoken and Written English (Biber et
al. 1999; referred to as LGSWE below). These were especially important for the
semantic categories, lexico-grammatical associations, and the analysis of lexical
bundles. The descriptions also include all linguistic features used in previous
Multi-Dimensional register studies (see especially Biber 1988: Appendix II): 67
different linguistic features identified from a survey of previous research on
speech and writing. Finally, the descriptions include several analyses of
vocabulary features, motivated by a survey of previous research on vocabulary use
in academic language.
Computer programs were developed for each type of linguistic analysis, using the
tagged version of the T2K-SWAL Corpus (see Section 2.3 above). The tagged corpus
was useful even for the vocabulary analyses, because the grammatical tags made it
possible to distinguish among the different uses of a single word form when it
occurs with different parts of speech (e.g., ‘measure’ used as a noun vs. verb).
However, the tagged corpus was more important for the grammatical/syntactic
analyses, since the distribution of those features could not have been accurately
analyzed in an untagged corpus.
Some computer programs performed straightforward counts of features that had
been identified in the tagging procedures (e.g., simple counts of nouns or
adjectives). Other programs were developed for more specific analyses of
linguistic features, such as the distribution of specific syntactic constructions
in particular lexico-grammatical contexts. For example, that-complement clauses
were analyzed for each of the major syntactic types (e.g., controlled by a verb,
adjective, or noun) and for the major semantic classes of the controlling word
(e.g., mental verbs or likelihood adjectives). Lexico-grammatical analyses at
these more specific levels allow much more insightful descriptions of register
differences than general analyses.
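
The "straightforward counts" mentioned above can be sketched as a tally over the
first tag field of each line in a vertical-format text. This is a minimal
illustration, not one of the project's programs: the function name is
hypothetical, the sample lines are the class-session sentence from Table 2.5, and
treating every first-field tag that starts with "nn" as a noun is an assumption.
Counting (word, tag) pairs also shows how a single word form can be separated by
part of speech.

```python
from collections import Counter
from typing import Iterable, Tuple

# Class-session sentence from Table 2.5, one word per line.
SAMPLE = [
    "I^pp1a+pp1+++", "want^vb++++", "you^pp2+pp2+++", "to^to++++",
    "have^vbi+hv+vrb++", "two^cd++++", "books^nns++++", "for^in++++",
    "the^ati++++", "class^nn++++", ".^.+clp+++",
]


def count_features(tagged_lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count first-field tags, and (lowercased word, tag) pairs."""
    pos_counts: Counter = Counter()
    word_pos_counts: Counter = Counter()
    for line in tagged_lines:
        word, sep, tags = line.strip().partition("^")
        if not sep:
            continue  # skip lines without a tag delimiter (e.g. headers)
        pos = tags.split("+")[0]
        pos_counts[pos] += 1
        word_pos_counts[(word.lower(), pos)] += 1
    return pos_counts, word_pos_counts


if __name__ == "__main__":
    pos_counts, word_pos_counts = count_features(SAMPLE)
    # Assumption: tags beginning with "nn" (nn, nns) are nouns.
    nouns = sum(n for tag, n in pos_counts.items() if tag.startswith("nn"))
    print("nouns:", nouns)
    print("class as nn:", word_pos_counts[("class", "nn")])
```

The more specific lexico-grammatical analyses described in the text would need
the later tag fields and the surrounding context as well, which this sketch does
not attempt.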
Appendix A provides a more detailed description of the procedures used for some
of the linguistic feature categories. The following chapters show how university
registers employ different constellations of features to achieve their particular
communicative goals.
Notes
1. Classroom management talk occurs at the beginning and end of class sessions,
to discuss course requirements, expectations, and past student performance.
Office hours are individual meetings between a student and faculty member, for
advising purposes or for tutoring/mentoring on academic content. Study groups are
meetings with two or more students who are discussing course assignments and
content.
2. Written course management includes syllabi (196 syllabi totaling ca. 34,000
words) and course assignments (162 texts totaling ca. 18,500 words). The main
communicative purpose of these texts is similar to classroom management, namely
communicating requirements and expectations about a course or particular
assignment.
diagnostic tools for the evaluation of exam prompts,using the corpus analyses as
a baseline.
In the first stage of the project,we constructed the T2K-SWAL Corpus,which
was designed to represent both spoken and written university registers,as well
as the major academic disciplines(e.g.,humanities,natural sciences,social sci-
ences)and academic levels(lower division,upper division,graduate).The corpus
included both academic registers,such as lectures,textbooks,and course reading
packets,and institutional registers,such as university catalogs,course syllabi,and
service encounters on campus.The corpus is described in detail in Chapter 2.
In the second stage of the project,we analyzed the linguistic patterns of varia-
tion in the corpus,considering differences associated with register,discipline,and
level.All linguistic features included in Biber(1988)were analyzed,as well as many
additional grammatical features from the LGSWE(Biber et al.1999).In addition,
we carried out extensive analyses of vocabulary distribution and lexical bundles.
The procedures for these analyses are described in Chapters 2 and 3,while the
results of the analyses are covered in Chapters 4–8.
In the final stage of the project we shifted our attention to the development
of diagnostic tools.These tools analyze the linguistic characteristics of a particular
text and assess the extent to which that text is representative of a target register.
For example,an exam author might want to evaluate the representativeness of a
new text constructed as an upper division science lecture,or assess whether a cer-
tain textbook passage is representative of the textbook category overall.The tools
present the linguistic characteristics of the target register as the baseline for com-
parison,and then they analyze the linguistic characteristics of the selected text in
relation to that baseline.These diagnostic tools are described in Biber et al.(2004).
?.?Overview of the present book
The present book builds on the research efforts in the T2K-SWAL Project to pro-
vide a broad linguistic description of university language.Rather than focus on
academic research articles and other stereotypically academic registers,the book
analyzes a wide range of registers encountered by students in university life.These
registers fall into two general categories:(1)educational language,and(2)task
management language.
Educational language includes all spoken and written registers that relate to
teaching or learning.Educational language can be primarily teacher-centered(e.g.,
classroom lectures and textbooks)or co-constructed by teachers and students
(e.g.,lab sessions,office hours,and study groups).The focus of the present investi-
gation is on the language that students encounter in the university,rather than the??University Language
language that students actually produce.As a result,the study excludes registers
like student term papers or student class presentations.
Task management language occurs in situations where students are told how to
successfully complete a university education:registers like course syllabi,university
catalogs,program brochures,and classroom management talk(e.g.,discussions of
course requirements).Some registers,like office hours,will normally include both
educational language and task management language.
The central goal of the book is simple:to provide a relatively comprehensive
linguistic description of the range of university registers,surveying the distinctive
linguistic characteristics of each register.These linguistic descriptions include vo-
cabulary distributions,semantic categories of words,extended lexical expressions
(‘lexical bundles’),grammatical features,and more complex syntactic construc-
tions.The linguistic patterns are interpreted in functional terms,resulting in an
overall characterization of the typical kinds of language that students encounter in
university registers:academic and non-academic;spoken and written.
The following chapter provides a detailed description of the T2K-SWAL Cor-
pus and introduces the methods used for the linguistic analyses.The remainder
of the book,then,is organized according to different types of linguistic research
questions.These descriptions begin with the study of vocabulary distributions
(Chapter 3)and grammatical characteristics(Chapter 4).The following two chap-
ters then focus on more specific aspects of language use:the linguistic expression
of stance(Chapter 5)and the use of lexical bundles in university registers(Chap-
ter 6).Chapter 7 takes a different perspective,presenting the results of a Multi-
Dimensional analysis that describes the overall patterns of linguistic variation
among university registers and academic disciplines.Finally,Chapter 8 synthesizes
these linguistic descriptions,providing an overall description of the distinctive
characteristics of each register.
Note
?.In earlier Multi-Dimensional studies(e.g.,Biber 1988),I use the term genre instead of register
as a general cover term for situationally-defined varieties.chapter?
The Spoken and Written Academic Language
(T2K-SWAL)Corpus
Chapter co-authors:Susan Conrad,Randi Reppen,Pat Byrd,
Marie Helt
The descriptions of university language in this book grew out of the TOEFL 2000 Spoken and Written Academic Language (T2K-SWAL) Project (see Biber et al. 2004). As explained in the last chapter, that project was sponsored by Educational Testing Service to carry out a comprehensive linguistic analysis of university registers, with the ultimate goal of determining whether the language used in the TOEFL exam tasks is representative of actual language use in universities.
The first stage of the project was to construct the TOEFL 2000 Spoken and Written Academic Language Corpus (T2K-SWAL Corpus). We designed the T2K-SWAL Corpus to be relatively large (2.7 million words) as well as representative of the range of spoken and written registers that students encounter in U.S. universities and of the major academic disciplines (e.g., humanities, natural sciences) and academic levels (lower division, upper division, and graduate). The corpus included both academic registers, such as lectures, textbooks, and course reading packets, and institutional registers, such as university catalogs, course syllabi, and service encounters. We did not include more general registers, such as fiction, newspapers, or casual conversation. Although these registers are used on campus, they are not university-specific registers. We also did not include e-mail correspondence between instructors and students or electronic postings by students as part of course work. Although these registers deserve study in the future, they were not part of the focus for the T2K-SWAL project.
A detailed description of the T2K-SWAL Corpus is given in Biber et al. (2004; also available on-line at www.ets.org/ell/research/toeflmonograph.html). The following sections summarize the major aspects of design and construction.
Table 2.1 Composition of the T2K-SWAL Corpus
Register                     # of texts     # of words
Spoken:
  Class sessions                    176      1,248,800
  Classroom management*            (40)         39,300
  Labs/In-class groups               17         88,200
  Office hours                       11         50,400
  Study groups                       25        141,100
  Service encounters                 22         97,700
  Total speech                251 (+40)      1,665,500
Written:
  Textbooks                          87        760,600
  Course packs                       27        107,200
  Course management                  21         52,400
  Institutional writing              37        151,500
  Total writing                     172      1,071,700
TOTAL CORPUS                        423      2,737,200
* Classroom management texts are extracted from the “class session” tapes, so they are not included in the total tape counts.
2.1 Design and construction of the T2K-SWAL Corpus
The register categories chosen for the T2K-SWAL corpus were sampled from across the range of spoken and written activities associated with academic life, including classroom teaching, office hours, study groups, on-campus service encounters, textbooks, course packs, and institutional written materials (e.g., university catalogs, brochures). The depth of sampling for each register category reflects our assessment of its relative availability and importance; for example, there are many more different texts and total words for class sessions and textbooks than for office hours and course packs. Table 2.1 shows the overall composition of the corpus by register category.
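As a quick consistency check on Table 2.1, the register rows can be summed and compared with the printed subtotals. The sketch below copies the figures from the table; the variable and function names are illustrative only and are not part of the original project.

```python
# (number of texts, number of words) per register, copied from Table 2.1.
# Classroom management is entered with 0 texts because its 40 texts are
# extracted from the class-session tapes and excluded from the text totals.
SPOKEN = {
    "Class sessions": (176, 1_248_800),
    "Classroom management": (0, 39_300),
    "Labs/In-class groups": (17, 88_200),
    "Office hours": (11, 50_400),
    "Study groups": (25, 141_100),
    "Service encounters": (22, 97_700),
}
WRITTEN = {
    "Textbooks": (87, 760_600),
    "Course packs": (27, 107_200),
    "Course management": (21, 52_400),
    "Institutional writing": (37, 151_500),
}


def totals(registers):
    """Return (total texts, total words) for a dict of register counts."""
    n_texts = sum(texts for texts, _ in registers.values())
    n_words = sum(words for _, words in registers.values())
    return n_texts, n_words


spoken_total = totals(SPOKEN)
written_total = totals(WRITTEN)
corpus_total = (spoken_total[0] + written_total[0],
                spoken_total[1] + written_total[1])

print(spoken_total, written_total, corpus_total)
```

The three tuples agree with the "Total speech", "Total writing", and "TOTAL CORPUS" rows of the table.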
Data collection focused on capturing naturally-occurring discourse. One major concern that we needed to address was that the presence of researchers in spoken settings was likely to be intrusive and therefore result in somewhat artificial discourse. As a result, we employed participants who already worked or studied in the settings where we wanted to collect data. They carried tape recorders and recorded speech as it occurred spontaneously. We obtained high quality, natural interactions using this approach; the major disadvantage was that we did not observe the interactions first-hand and thus were not able to obtain detailed information about the setting and participants.
For the spoken corpus, we used students as our primary participants, recruiting them to record classroom teaching, study groups, and other academic conversations. Student participants recorded the class sessions and study groups that they were involved in during a two week period, keeping a log of speech events and participants to the extent that it was practical. We also recruited faculty members to help with the recording of office hours, and university staff for service encounters.
The collection of texts from class sessions was designed to include a range of teaching styles, as measured by the extent of interactiveness. Three levels of interactiveness are distinguished for classroom teaching:
Low interactiveness: fewer than 10 turns per 1,000 words (i.e., average length of turn longer than 100 words per turn): 54 class sessions; 337,800 words
Medium interactiveness: between 10 and 25 turns per 1,000 words (i.e., average length of turn between 40–100 words per turn): 64 class sessions; 448,400 words
High interactiveness: more than 25 turns per 1,000 words (i.e., average length of turn shorter than 40 words per turn): 75 class sessions; 550,900 words
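The three interactiveness levels above reduce to a simple threshold rule on the rate of turns per 1,000 words. The following is a minimal sketch of that rule, not the project's own software; the function names are hypothetical, and treating the boundary values 10 and 25 as "medium" is an assumption, since the text does not say how exact boundary cases were handled.

```python
def turns_per_1000_words(n_turns: int, n_words: int) -> float:
    """Rate of speaker turns, normalized to 1,000 words of transcript."""
    return 1000.0 * n_turns / n_words


def mean_turn_length(n_turns: int, n_words: int) -> float:
    """Average number of words per turn (the inverse view of the same rate)."""
    return n_words / n_turns


def interactiveness(n_turns: int, n_words: int) -> str:
    """Classify a class session as 'low', 'medium', or 'high' interactiveness.

    Assumption: the boundary rates 10 and 25 are counted as 'medium'.
    """
    rate = turns_per_1000_words(n_turns, n_words)
    if rate < 10:
        return "low"
    if rate <= 25:
        return "medium"
    return "high"


# A 10,000-word session with 50 turns has 5 turns per 1,000 words,
# i.e. an average turn of 200 words.
print(interactiveness(50, 10_000), mean_turn_length(50, 10_000))
```

The two descriptions of each level are equivalent because the average turn length is 1,000 divided by the turn rate: 10 turns per 1,000 words corresponds to 100 words per turn, and 25 turns to 40 words per turn.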
Service encounters were recorded in locations where students regularly interact with university staff conducting the business of the university. We distinguished two major types of service encounters: for regular commerce (coffee shop, university book store, copy shop) and for other university services (student business services, academic department offices, the library reference desk, the front desk in a dormitory, the media center). We collected 22 tapes at these locations; these represent 97,700 words and hundreds of individual service encounters (see Note 1).
For classroom teaching and textbooks, we sampled texts from six major disciplines (Business, Education, Engineering, Humanities, Natural Science, Social Science) and three levels of education (lower division undergraduate, upper division undergraduate, graduate). Table 2.2 shows the breakdown of texts by discipline.
Recognizing the existence of systematic variation within each of these high-level disciplines, the corpus design targeted specific sub-disciplines (e.g., accounting, anthropology, astronomy; see Biber et al. 2004, Tables 6 and 7). Rather than aiming for an exhaustive sampling of sub-disciplines, we collected texts from specific sub-disciplines within each major discipline (represented by at least 3 text samples). While these distinctions will enable register comparisons at a more specific level in future research, the analyses in the present book are restricted to the major disciplinary categories.
Table 2.2 Breakdown of classroom teaching and textbooks by discipline
Discipline               # of texts     # of words
Classroom teaching
  Business                       36        236,400
  Education                      16        137,200
  Engineering                    30        171,300
  Humanities                     31        248,600
  Natural Science                25        160,800
  Social Science                 38        294,400
Textbooks
  Business                       15        116,200
  Education                       6         50,100
  Engineering                     9         72,000
  Humanities                     18        164,100
  Natural Science                18        145,200
  Social Science                 21        213,000
Course packs include written texts of several types: lecture notes, study guides, and detailed descriptions of assignments or experimental procedures written by the instructor, in addition to photocopies of published journal articles and book chapters. Similar to the sampling procedures used for textbooks, course packs were collected from all the major disciplines and a range of the sub-disciplines (see Note 2).
Table 2.3 Breakdown of texts within institutional writing
Category                                                # of computer files     # of words
Academic program brochures                                                7         22,500
University catalogs: academic program descriptions                       10         27,400
University catalogs: admissions, requirements, etc.                       9         52,500
Student handbooks                                                         9         43,800
University magazine articles                                              2          2,700
Finally, the category of ‘institutional written material’ attempted to represent the range of miscellaneous campus-related written texts that students encounter. Many of these texts are among the first material that a prospective student receives from a university, either through paper copy or on the Web: informational brochures about student services and academic programs, university catalogs (including both discussion of general requirements and specific programs), etc. Although not often considered ‘academic discourse’, written material of this type is ubiquitous on campus and required reading for the prospective student attempting to navigate the maze of university requirements and services. Many of these texts are very short (e.g., from academic program brochures), so in some cases we include multiple texts in a single computer file. Table 2.3 displays the breakdown of texts within institutional writing.
Table 2.4 Breakdown of spoken texts by university
University                            # of texts     # of words
Northern Arizona University                  140        787,700
Georgia State University                      56        369,200
Iowa State University                         49        275,400
Cal State University, Sacramento              34        222,800
We collected spoken texts at four academic sites (Northern Arizona University, Iowa State University, California State University at Sacramento, Georgia State University). Table 2.4 shows the breakdown of transcribed texts by university. (Many additional texts were tape recorded but could not be transcribed within the scope of the project.) Written materials were collected from all four universities, with the exception of course packs. Because there was little variation in the types of texts included in course packs at the four universities, we collected these texts only at Iowa State University.
Although we did not achieve full demographic/institutional representativeness, we aimed to avoid obvious skewing for these factors. Thus, the corpus materials were collected from four major regions in the U.S.: west coast, rocky mountain west, mid-west, and the deep south. Further, we collected materials from four different types of academic institutions: a teacher’s college (California State, Sacramento), a mid-size regional university (Northern Arizona University), an urban research university (Georgia State), and a Research 1 university with a national reputation in agriculture and engineering (Iowa State). The collection procedures were approved by the Human Subjects Review Boards at all the universities. The amount collected from each written text conformed to copyright laws as interpreted by legal advisors at Educational Testing Service.
2.2 Transcription, scanning, and editing of texts in the T2K-SWAL Corpus
All texts in the corpus are coded with a header to identify content area and register. Spoken texts were transcribed using a consistent transcription convention (see Edwards & Lampert 1993), and to the extent possible speakers were distinguished and some demographic information supplied in the header for each speaker (e.g., their status as instructor vs. student). Conventional spellings were used for all words except the following: OK, cuz, yup, nope, mm, mhm, um, uh. Grammatical dysfluencies were transcribed exactly as they occurred.
Written texts were scanned to disk or copied from websites. All texts were edited to ensure accuracy in scanning.
2.3 Grammatical tagging and tag-editing
All texts in the T2K-SWAL Corpus were grammatically annotated using an automatic grammatical “tagger” (a computer program developed and revised over a 10-year period by Biber). The tagger is designed to identify a large number of linguistic features in written and spoken (transcribed) texts. It has various rules built in for the tokenization of words (e.g., contractions are separated and treated as two words, multi-word prepositions or subordinators are marked with ditto tags, phrasal verbs are identified as such, etc.). However, it does not have rules to disambiguate punctuation marks (especially ‘.’, which can be used as sentence-final punctuation and for a wide range of abbreviations).
The tagset incorporates an extended version of the CLAWS tagset (see Garside, Leech, & Sampson 1987). For example, the CLAWS VBN tag (past participle) is extended by several tags that identify grammatical function, such as perfect aspect verb, passive voice verb (further distinguishing among finite BY-passives, finite agentless passives, and non-finite post-nominal modifiers), and participial adjectives.
The tagger has four major components: a simple ‘look-up’ component for closed classes and multi-word fixed phrases (e.g., identifying sequences of words as fixed multi-word prepositions); a probabilistic component for individual words (e.g., considering the probabilities for the word abstract occurring as a noun, verb, or adjective); a probabilistic component to compare the likelihood of each possible tag sequence (working on a four-word window); and a rule-based component.
The tagger uses a number of on-line dictionaries. For example, one dictionary lists common content words, identifies their possible part-of-speech categories (e.g., the word run would be listed as a verb and a noun), and records the probability for each of those part-of-speech categories. Other dictionaries store multi-word grammatical units (e.g., such as, that is, for example) or other lists of words with a specific grammatical function (e.g., all verbs that can control a that-clause).
The probabilities used by the tagger were originally computed from a distributional analysis of the LOB (Lancaster-Oslo-Bergen) Corpus. For example, book and runs are both noun-verb ambiguities, but book has a very high likelihood of being a noun (99% in the LOB expository genres), while runs has a high likelihood of being a verb (74%). Separate dictionaries were compiled for expository/informational discourse versus non-expository discourse, to reflect the differing lexical and grammatical preferences of the two. For example, many past participial forms (e.g., admitted, expected) are much more likely to function as past tense or perfect aspect verbs in fiction and other kinds of non-informational discourse, while they are much more likely to function as passive constructions in exposition. Many noun/verb ambiguities (e.g., trust, rule) are much more likely to occur as verbs in non-informational discourse and as nouns in exposition. Among the function words, some preposition/subordinator ambiguities (e.g., before, as) are more likely to occur as subordinators in non-informational discourse, but more likely to function as prepositions in exposition.
Table 2.5 Sentences from tagged texts
university textbook               class session
The^ati++++                       I^pp1a+pp1+++
dissolved^jj+atrb++xvbn+          want^vb++++
components^nns++++                you^pp2+pp2+++
that^tht+rel+subj++               to^to++++
precipitate^vb++++                have^vbi+hv+vrb++
to^to++++                         two^cd++++
form^vbi++++                      books^nns++++
these^dt+dem+++                   for^in++++
rocks^nns++++                     the^ati++++
are^vb+ber+aux++                  class^nn++++
decomposed^vpsv++agls+xvbnx       .^.+clp+++
from^in++++
pre-existing^jj+atrb++xvbg+
rocks^nns++++
and^cc++++
minerals^nns++++
.^.+clp+++
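The word-level probabilistic component described above can be pictured as a lookup in a dictionary of per-word tag probabilities, kept separately for each discourse type. The sketch below is purely illustrative and is not Biber's tagger: the data structure and function name are hypothetical, only the 99% (book as noun) and 74% (runs as verb) figures come from the text, and the complementary values assume that noun and verb are the only two candidate tags. No figures are given for non-expository discourse, so that dictionary is left empty rather than invented.

```python
from typing import Dict, Optional

# Hypothetical miniature dictionaries: word -> {part of speech: probability}.
# 0.99 and 0.74 are the figures quoted in the text; 0.01 and 0.26 are simply
# their complements under the two-tag assumption stated above.
LEXICON: Dict[str, Dict[str, Dict[str, float]]] = {
    "expository": {
        "book": {"noun": 0.99, "verb": 0.01},
        "runs": {"verb": 0.74, "noun": 0.26},
    },
    "non-expository": {},  # no probabilities are reported in the text
}


def most_likely_tag(word: str, discourse: str = "expository") -> Optional[str]:
    """Return the most probable part of speech for a word, or None if unknown.

    This models only the word-level probabilities; the tagger described in
    the text also weighs tag sequences and applies rules.
    """
    candidates = LEXICON.get(discourse, {}).get(word.lower())
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


print(most_likely_tag("book"), most_likely_tag("runs"))
```

Keeping one dictionary per discourse type mirrors the point made in the text: the same word form can have different preferred tags in exposition and in non-informational discourse.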
Tagged texts are produced in a vertical format: the running text appears in the left-hand column, and the tags associated with each word are given to the right (beginning with the delimiter ‘^’). Table 2.5 shows examples of tagged sentences from a university textbook and a classroom session. The first tag field identifies the major part of speech for each word; for example, jj marks an adjective, and nn marks a noun. The remaining tag fields identify particular grammatical functions or larger syntactic structures. For example, atrb in Field 2 marks an adjective as ‘attributive’. The tag sequence tht+rel+subj++ is used to characterize the function word that functioning as a ‘relative pronoun’, where the gap in the following relative clause is in ‘subject’ position.
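To make the vertical format concrete, here is a minimal sketch of how one line of a tagged text might be split into its word and tag fields. It assumes only what the description states: the word comes first, ‘^’ starts the tags, and ‘+’ separates the fields. The class and function names are hypothetical, and the sketch does not handle a word that itself contains ‘^’.

```python
from typing import List, NamedTuple


class TaggedWord(NamedTuple):
    word: str
    fields: List[str]  # fields[0] is the major part of speech (Field 1)


def parse_tagged_line(line: str) -> TaggedWord:
    """Split a line such as 'that^tht+rel+subj++' into word and tag fields."""
    word, _, tag_string = line.strip().partition("^")
    return TaggedWord(word, tag_string.split("+"))


# Two lines taken from the textbook sentence in Table 2.5.
relative = parse_tagged_line("that^tht+rel+subj++")
adjective = parse_tagged_line("dissolved^jj+atrb++xvbn+")

print(relative.word, relative.fields[:3])    # word and its first three fields
print(adjective.fields[0], adjective.fields[1])  # major POS and Field 2
```

Empty strings in the field list correspond to tag fields left blank for that word.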
After the texts in the corpus were grammatically coded by the automatic tagger, the codes (or ‘tags’) were edited using an interactive tag-checker. While this step is labor-intensive and extremely time consuming, it assured a high degree of accuracy for the final annotated corpus (see Biber, Conrad, & Reppen 1998; Methodology Boxes 4 & 5). We paid special attention to words that are multi-functional and hard to disambiguate automatically, including that, WH words, the form ’s, and past participles when they are not in main clauses (e.g., passive verbs as postnominal modifiers). We also checked the tagging of words that were not in the dictionaries.
Tagging the corpus made it possible to conduct a series of more sophisticated analyses than would have been possible with an untagged corpus. Using the grammatical tags, further coding and categorizing of words and structures was undertaken in order to facilitate the linguistic analyses of the corpus (see Appendix A).
2.4 Overview of linguistic analyses
The primary goal of the present book is to provide a relatively comprehensive linguistic description of university registers. Thus, the descriptions are based on as wide a range of linguistic characteristics as possible, including any linguistic features that have obvious functional associations (since these should be important indicators of the differences among registers).
In selecting linguistic features for analysis, I relied on several previous corpus-based studies. The descriptions incorporate many of the analytical distinctions used in the Longman Grammar of Spoken and Written English (Biber et al. 1999; referred to as LGSWE below). These were especially important for the semantic categories, lexico-grammatical associations, and the analysis of lexical bundles. The descriptions also include all linguistic features used in previous Multi-Dimensional register studies (see especially Biber 1988: Appendix II): 67 different linguistic features identified from a survey of previous research on speech and writing. Finally, the descriptions include several analyses of vocabulary features, motivated by a survey of previous research on vocabulary use in academic language.
Computer programs were developed for each type of linguistic analysis, using the tagged version of the T2K-SWAL Corpus (see Section 2.3 above). The tagged corpus was useful even for the vocabulary analyses, because the grammatical tags made it possible to distinguish among the different uses of a single word form when it occurs with different parts of speech (e.g., measure used as a noun vs. verb). However, the tagged corpus was more important for the grammatical/syntactic analyses, since the distribution of those features could not have been accurately analyzed in an untagged corpus.
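The advantage of a tagged corpus for vocabulary work can be illustrated with a small sketch that counts word forms separately for each part of speech. This is an assumption-laden illustration rather than one of the project's programs: the function name is hypothetical, and the three input lines are invented for the example (they follow the word^field+field format shown in Table 2.5, using the nn and vb tags that appear there).

```python
from collections import Counter
from typing import Iterable, Tuple


def count_word_pos(tagged_lines: Iterable[str]) -> Counter:
    """Count (lower-cased word, major part of speech) pairs in tagged lines."""
    counts: Counter = Counter()
    for line in tagged_lines:
        word, sep, tag_string = line.strip().partition("^")
        if not sep:  # skip lines without a tag delimiter
            continue
        major_pos = tag_string.split("+")[0]
        counts[(word.lower(), major_pos)] += 1
    return counts


# Invented example lines, not taken from the corpus.
sample = ["measure^nn++++", "measure^vb++++", "Measure^nn++++"]
word_pos_counts = count_word_pos(sample)

for (word, pos), n in sorted(word_pos_counts.items()):
    print(word, pos, n)
```

In an untagged corpus the three tokens would collapse into a single count for measure; with tags, the noun and verb uses are kept apart.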
Some computer programs performed straightforward counts of features that had been identified in the tagging procedures (e.g., simple counts of nouns or adjectives). Other programs were developed for more specific analyses of linguistic features, such as the distribution of specific syntactic constructions in particular lexico-grammatical contexts. For example, that-complement clauses were analyzed for each of the major syntactic types (e.g., controlled by a verb, adjective, or noun) and for the major semantic classes of the controlling word (e.g., mental verbs or likelihood adjectives). Lexico-grammatical analyses at these more specific levels allow much more insightful descriptions of register differences than general analyses.
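The simplest kind of program mentioned above, a straightforward count of nouns or adjectives, might look like the following sketch. It reads the first tag field of each line of the textbook sentence in Table 2.5. The grouping of tags into classes is an assumption: the text glosses only jj (adjective) and nn (noun), and treating nns as a noun tag follows common CLAWS-style usage rather than anything stated in the chapter. The more specific lexico-grammatical analyses would need additional tag fields and word lists that are not described here.

```python
from collections import Counter

# The textbook sentence from Table 2.5, one tagged word per line
# (the last field of 'decomposed' is reproduced as printed in the table).
TEXTBOOK_SENTENCE = """\
The^ati++++
dissolved^jj+atrb++xvbn+
components^nns++++
that^tht+rel+subj++
precipitate^vb++++
to^to++++
form^vbi++++
these^dt+dem+++
rocks^nns++++
are^vb+ber+aux++
decomposed^vpsv++agls+xvbnx
from^in++++
pre-existing^jj+atrb++xvbg+
rocks^nns++++
and^cc++++
minerals^nns++++
.^.+clp+++""".splitlines()


def major_class(first_field: str) -> str:
    """Map a first-field tag to a broad class (assumed grouping, see lead-in)."""
    if first_field.startswith("nn"):
        return "noun"
    if first_field.startswith("jj"):
        return "adjective"
    return "other"


class_counts = Counter(
    major_class(line.partition("^")[2].split("+")[0])
    for line in TEXTBOOK_SENTENCE
)
print(class_counts["noun"], class_counts["adjective"])
```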
Appendix A provides a more detailed description of the procedures used for some of the linguistic feature categories. The following chapters show how university registers employ different constellations of features to achieve their particular communicative goals.
Notes
1. Classroom management talk occurs at the beginning and end of class sessions, to discuss course requirements, expectations, and past student performance. Office hours are individual meetings between a student and faculty member, for advising purposes or for tutoring/mentoring on academic content. Study groups are meetings with two or more students who are discussing course assignments and content.
2. Written course management includes syllabi (196 syllabi totaling ca. 34,000 words) and course assignments (162 texts totaling ca. 18,500 words). The main communicative purpose of these texts is similar to classroom management, namely communicating requirements and expectations about a course or particular assignment.