Contents
Tools: Machine Translation, POS Taggers, NP chunking, Sequence models, Parsers, Semantic Parsers/SRL, NER, Coreference, Language models, Concordances, Summarization, Other
Corpora: Large collections, Particular languages, Treebanks, Discourse, WSD, Literature, Acquisition
SGML/XML
Dictionaries
Lexical/morphological resources
Courses, Syllabi, and other Educational Resources
Mailing lists
Other stuff on the Web: General, IR, IE/Wrappers, People, Societies
Instructions
Building a baseline statistical phrase MT system
Wonderful pages about how to download a bunch of tools and some data
and put them together to build a very competent baseline statistical MT system:
NAACL 2006 WMT or 2009 WMT.
Freely downloadable
EGYPT system
System from 1999 JHU workshop. Mainly of historical interest.
GIZA++
and
mkcls
Franz Och. C++. GPL.
Thot
Phrase-based model building kit
Phramer
An Open-Source Java Statistical Phrase-Based MT Decoder
Moses
A new open-source
phrase-based MT decoder with functionality beyond Pharaoh.
Syntax Augmented Machine
Translation via Chart Parsing
Andreas Zollmann and Ashish Venugopal
Free, but getting them requires hassle
Pharaoh
decoder
Philip Koehn, ISI.
MTTK
Machine Translation Tool Kit. Deng and Byrne.
Freely downloadable
Stanford POS
tagger
Loglinear tagger in Java (by Kristina Toutanova)
hunpos
An HMM tagger with pre-compiled models available for English and Hungarian. A
reimplementation of TnT (see below) in OCaml.
Runs on Linux, Mac OS X, and Windows.
MBT: Memory-based Tagger
Based on TiMBL
TreeTagger
A decision tree based tagger from the University of Stuttgart
(Helmut Schmid). It's
language independent, but comes complete with parameter files for
English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian,
and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions.
Binary distribution only.) Page has links to sites where you can run it online.
SVMTool
POS Tagger based on SVMs (uses SVMlight). LGPL.
ACOPOST
(formerly
ICOPOST)
Open source C taggers originally written by Ingo Schröder.
Implements maximum entropy, HMM trigram, and
transformation-based learning. C source available
under GNU public license.
MXPOST: Adwait Ratnaparkhi's Maximum Entropy part of speech tagger
Java POS tagger. A sentence
boundary detector (MXTERMINATOR) is also included. Original version was
only JDK1.1; later version worked with JDK1.3+. Class files, not source.
fnTBL
A fast and flexible implementation of Transformation-Based
Learning in C++. Includes a POS tagger, but also NP chunking
and general chunking models.
mu-TBL
An implementation of a Transformation-based Learner (a la Brill),
usable for POS tagging and other things by Torbjörn Lager. Web
demo also available. Prolog.
YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++
open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS
tagger for an end user.)
QTAG
Part of speech tagger
An HMM-based Java POS tagger from Birmingham U. (Oliver Mason).
English and German parameter files. [Java class files, not source.]
The TOSCA/LOB tagger
.
Currently available for MS-DOS only. But the decision to make this
famous system available is very interesting from an historical
perspective, and for software sharing in academia more generally.
LOB tag set.
The venerable Brill's Transformation-based learning Tagger
A symbolic tagger, written in C. It's no longer available from a
canonical location, but you might find a version from the
Wikipedia page
or you could try a reimplementation such
as fnTBL
.
Original Xerox Tagger
A Common Lisp HMM tagger available by ftp.
Lingua-EN-Tagger
Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version
0.11. (A bigram HMM tagger.)
Free, but require registration
TATOO
The ISSCO tagger. HMM tagger. Need to register to download.
PoSTech Korean
morphological analyzer and tagger
Online registration.
TnT - A Statistical
Part-of-Speech Tagger
Trainable for various languages, comes with English and German
pre-compiled models. Runs on Solaris and Linux.
Usable by email or on the web, but not distributed freely
Memory-based tagger
From ILK group, Catholic University Brabant (Jakub Zavrel/Walter
Daelemans). Does Dutch, English, Spanish, Swedish, Slovene.
Other MBL
demos
are also available.
Birmingham tagger
Accepts only plain ASCII
email message
contents. The tagset used
is similar to the Brown/LOB/Penn set.
CLAWS tagger
The UCREL CLAWS tagger is available for trial use on the web. (It's
limited to 300 words though -- this site is more of an advertisement for
licensing the real thing -- available as software for Suns or as a paid
service.) You can also find info on
CLAWS tagsets
,
though that page doesn't seem to link to the
C7 tagset
.
The
AMALGAM tagger
The AMALGAM
Project
also has various other useful resources, in particular a web
guide to different tag sets in common use
. The tagging is actually
done by a (retrained) version of the Brill tagger (q.v.).
Xerox
XRCE MLTT Part Of Speech Taggers
Tags any of 14 languages (European and Arabic), online on the web.
Portuguese taggers on the web: Projecto
Natura
and a QTAG adaptation
.
Not free
Lingsoft
Lingsoft
in Finland has (symbolic)
analysis tools for many European languages. More information can be
obtained by emailing info@lingsoft.fi
. There
is an online demo
.
Conexor
Conexor
in Finland has
demonstrations of EngCG-style taggers and parsers, for English, Swedish,
and Spanish.
Xerox
Xerox
has
morphological analyzers and taggers for many languages.
There are demos
of some of their tools on the web.
More information can be
obtained by contacting
Daniella Russo
.
Infogistics
Infogistics
, an
Edinburgh spinoff has a tagging and NP/Verb group chunker
available commercially, including an evaluation version.
No longer available
LT POS and LT TTT
The Edinburgh Language Technology Group tagger and text tokenizer (and
sentence splitter) were binary-only Solaris tools which no longer seem to
be available.
Downloadable
YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++
open source. Won
CoNLL 2000 shared task. (Less automatic than a specialized POS
tagger for an end user.)
Mark
Greenwood's Noun Phrase Chunker
A Java reimplementation of Ramshaw and Marcus (1995).
fnTBL
A fast and flexible implementation of Transformation-Based
Learning in C++. Includes a POS tagger, but also NP chunking
and general chunking models.
Downloadable
CRF++
Generic CRF-based model in C++. Open source. By the author of YamCha.
Carafe
Generic CRF-based sequence models in OCaml. Open source. By Ben
Wellner.
FreeLing
A large
suite of language analyzers. Written in C++.
Covers text preprocessing, morphology, NER, POS tagging, parsing.
Information on available probabilistic parsers can be found on the
FSNLP: probabilistic parsing
links page.
Downloadable
ASSERT
PropBank semantic roles (and opinions, etc.) by Sameer Pradhan.
Shalmaneser
FrameNet-based by Katrin Erk.
Tree
Kernels in SVMlight
by Alessandro Moschitti.
A general package, but it
has particularly been used for SRL.
Downloadable
Stanford Named
Entity Recognizer
A Java Conditional Random Field sequence model with trained models
for Named Entity Recognition. Java. GPL. By Jenny Finkel.
LingPipe
Tools include statistical named-entity recognition, a heuristic sentence
boundary detector, and a heuristic within-document coreference
resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++
open source. Won
CoNLL 2000 shared task. (Less automatic than a specialized POS
tagger for an end user.)
BART
A Beautiful Anaphora Resolution Toolkit. Java. By Yannick
Versley and many others. Apache with GPL components.
Guitar
Java. GPL.
Downloadable
IRSTLM Toolkit
Compatible with SRILM, suitable for very large language models. LGPL.
By Marcello Federico, Nicola Bertoldi et al.
CMU-Cambridge
Statistical Language Modeling toolkit
Downloadable, but requires registration
The SRI Language
Modeling toolkit
by Andreas Stolcke is another good system for
building language models, freely available for research purposes.
Not yet classified
Lextools
A package of tools for creating weighted finite-state
transducers (WFST) from high-level linguistic descriptions.
Lextools binaries are available free for non-commercial use
at: http://www.research.att.com/sw/tools/lextools/
.
Supported platforms are: linux (i686), sgi (mips2) and sun4.
Lextools is built on top of, and requires, the AT&T WFST
toolkit (version 3.6), available free for non-commercial use
from: http://www.research.att.com/sw/tools/fsm/
Wordsmith Tools (Mike Scott)
The thing to get if you are working in the Windows world.
A prototype Java
Summarisation applet (System Quirk)
MEAD
A public domain portable multi-document summarization
system. (Dragomir Radev and others.)
Downloadable
Tilburg University's TiMBL
Tilburg's Memory Based Learner by Walter Daelemans et al. A general
nearest-neighbour-based machine learning package, but optimized for statistical NLP
applications.
Time
Expression taggers
TIMEX2 standard taggers (site at Mitre).
NLTK
An open source Python package for NLP application development, with
tools for tokenization, POS tagging, and parsing, by Ed Loper and Steven Bird.
Ted Pedersen's code
Ngram Statistics
Package: Perl code that implements: Fisher's exact test, the
likelihood ratio, Pearson's chi squared test,
the Dice Coefficient, and Mutual Information; Duluth Senseval-2 word
sense disambiguation systems; Senseval-1 data in Senseval-2
format; various other WSD datasets in Senseval formats, and
semantic distances derived via WordNet.
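To give a rough feel for the kind of association measures named above, here is a minimal Python sketch of two of them, the Dice coefficient and (pointwise) mutual information, computed from bigram counts. This is not code from the Ngram Statistics Package, which is written in Perl; the function names, the choice of the pointwise variant of mutual information, and the toy counts are all assumptions made for illustration.

```python
import math


def dice(n11, n1p, np1):
    """Dice coefficient for a bigram (w1, w2).

    n11: count of the bigram w1 w2
    n1p: count of bigrams whose first word is w1
    np1: count of bigrams whose second word is w2
    """
    return 2.0 * n11 / (n1p + np1)


def pmi(n11, n1p, np1, npp):
    """Pointwise mutual information (log base 2) for a bigram.

    npp is the total number of bigrams in the corpus.
    """
    return math.log2((n11 * npp) / (n1p * np1))


# Toy counts, invented for this example only.
n11, n1p, np1, npp = 10, 20, 20, 1000
print(dice(n11, n1p, np1))
print(pmi(n11, n1p, np1, npp))
```

Both measures grow as the two words co-occur more often than their individual frequencies would suggest. Pointwise mutual information is known to overrate rare pairs, which is one reason packages of this kind offer several tests side by side.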
ISIP
tools
The main aim is a publicly available speech recognition
system (alpha release available), but along the way there are also
toolkits for discrete HMMs and statistical decision trees, and
for various aspects of signal processing.
Mem
. A Perl
implementation of Generalized and Improved Iterative Scaling
by Hugo WL ter Doest.
Automorphology
A system (for Windows) for automatically learning the morphological
forms of words in a corpus by John Goldsmith.
Wordnet
Wordnet is available by
ftp
,
compiled for a variety of machine types. For money, one can also get
EuroWordNet
for various
European languages,
an Italian/English/Spanish MultiWordNet
and there's now a site for
Global Wordnet
.
(See also Mappings
between WordNet versions
and
Perl
WordNet-Similarity module
by Ted Pedersen, and
WordNet Domains
(coarse-grained sense topic classifications).)
Penn XTAG project
A wide-coverage tree-adjoining grammar written in a mixture of C
and Common Lisp. Also includes a large coverage morphological
analyzer. Now includes more tools such as TCL/Tk tree viewer.
Dan Melamed's
Assorted Tools
A collection of various tools including a simulated annealing program, a
post-processor for English stemming for the Penn XTAG morphology
system, Good-Turing smoothing software, general text processing tools,
text statistics tools and bitext geometry tools (mainly written in Perl 5).
MULTEXT
Constructing corpora and tools for processing multilingual corpora.
Contact: Jean Veronis veronis@univ-aix.fr
. Some stuff
including a multilingual text editor is downloadable.
MULTEXT EAST
has parallel versions
of Orwell's 1984 available free (upon registration) for a number
of Central European languages.
Naive
Bayes algorithm
Software from the Rainbow/Libbow software package that implements
several algorithms for text categorization, including naive Bayes,
TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell's ML text.
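For readers who have not met naive Bayes text categorization, the following is a small self-contained Python sketch of the multinomial variant with add-one smoothing. It is not taken from Rainbow/Libbow (which is a C package); the function names `train_naive_bayes` and `classify`, the toy documents, and the decision to ignore words unseen in training are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Collect counts from docs, a list of (label, tokens) pairs."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for label, tokens in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this example only.
TOY_DOCS = [
    ("sports", "the team won the match".split()),
    ("sports", "a great goal in the game".split()),
    ("finance", "the bank raised interest rates".split()),
    ("finance", "stock market prices fell".split()),
]
model = train_naive_bayes(TOY_DOCS)
print(classify(model, "the team scored a goal".split()))
```

Working in log space avoids numerical underflow on long documents, and the smoothing keeps a single unseen word/class pair from zeroing out a class.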
HDDI
Text Data Mining API from Lehigh University.
Emdros: a text database engine
for linguistic analysis and research
Chasen
Japanese morphological analyzer. Descendant of JUMAN.
Free, but require registration
Stuttgart's
IMS
Corpus Workbench (CWB)
A workbench for full-text retrieval from large corpora (with a query
language and corpus indexing). Includes the Corpus Query
Processor (CQP) and xkwic.
Available free for research groups (currently only as Solaris 1/2 or
Linux binaries), on signing a license agreement.
Gate
University of Sheffield's General Architecture for Text
Engineering. Primarily an Information Extraction system.
MITRE's
Alembic Workbench
A workbench for the development of tagged corpora. Includes a
tagger based on Brill's TBL approach.
SNoW
SNoW is a learning program that can be used as a general purpose
multi-class classifier and is specifically tailored for learning in
the presence of a very large number of features. The learning
architecture is a sparse network of linear units over a pre-defined
or incrementally acquired feature space (Dan Roth).
Unsure
INTEX
a finite-state transducer analysis system for English, French, and
Italian that runs under NextStep. Contact:
Max Silberztein silberz@ladl.jussieu.fr
The PennTools
page collects information on a variety of NLP systems, many of which are
available externally.
LDC (Linguistic
Data Consortium)
and its
catalogue by year
.
Email: ldc@ldc.upenn.edu
. Provides the largest range of
corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey.
CDs can be purchased individually; institutions can become members and
receive discounts on CDs. There's an
LDC Online
service for
searches over the web (mainly intended for members, but there are samplers
available).
European Language
Resources Association
and its catalogue
.
Distribution agency is ELDA
.
Rapidly growing collection of materials in European languages.
ICAME
(International Computer Archive of Modern English)
Sells various corpora (including
Brown and London-Lund). Information on corpora on
the web
, by sending the
message
help
to fileserv@nora.hd.uib.no
, by ftp to
nora.hd.uib.no
.
Also,
manuals
for
these corpora.
Reuters @
NIST
Reuters corpora are now distributed by NIST.
TRACTOR
TELRI Research Archive of Computational Tools and Resources.
Corpora, many multilingual, in European community languages. Small fee
for joining in order to be able to get corpora (unless you have
contributed corpora).
CLR (Consortium for Lexical
Research)
Email: lexical@nmsu.edu
. Focuses more on language
processing tools and lexicons, but does have some corpora. As of Feb 1996,
you can get most of their stuff by anonymous ftp to clr.nmsu.edu
.
Their catalog
is
available as a postscript file.
OTA (Oxford Text Archive)
Provides mainly literary texts. Has a bright new web
site. Email:
info@ota.ahds.ac.uk
.
Most materials are available on the web or by anonymous ftp to
ota.ox.ac.uk
.
Some require negotiations with the providers.
Leipzig Corpora Collection
Sentence collections in MySQL database for 17 mainly European languages.
BNC (British National Corpus)
A 100 million word corpus of British English. You
can search it online from their simple web
interface
or via View
, a much
better interface by Mark Davies, and there is an
index to
genres
by David Lee. And now, an XML edition
.
European Corpus
Initiative Multilingual Corpus I (ECI/MCI)
A 98 million word corpus, covering most of the major
European languages, as well as Turkish, Japanese, Russian, Chinese, and
Malay. Cheap. Need to sign a license agreement, available at the
WWW site. Also available from the LDC.
Survey of English Usage
At the Department of English Language and
Literature at University College London. Includes the
British part of
ICE
,
the International
Corpus of English
project. Now available
tagged, and parsed for function. 83,419 sentences. Includes ICECUP,
dedicated retrieval software. Also, Diachronic
Corpus of Present-Day Spoken English
(800,000 words, tagged and
parsed, half from ICE-GB and half from London-Lund).
International Corpus of English (ICE)
Million word collections of English from various world Englishes: ICE-NZ,
ICE-HK, ICE-East Africa, etc. Several
of them are downloadable from this site.
Corpora
held by Lancaster University
This link provides its own annotations.
The European Language
Activity Network
Promises a uniform query language for accessing corpora in all EU
languages -- but isn't quite there yet.
Talkbank
.
Rich video and transcripts.
English
English language corpora available from the sites above are not repeated
here.
Corpora by Geoffrey Sampson's team
The
SUSANNE corpus
and the
CHRISTINE
corpus
(SUSANNE markup of a speech corpus).
Michigan Corpus of Academic
Spoken English (MICASE)
.
1.7 million words from 1997-2001.
Penn-Helsinki Parsed Corpus of
Middle English
A syntactically annotated corpus of the Middle English prose
samples in the Helsinki Corpus of Historical English, with
additions. 1.3 million words. $200.
Corpus of Professional, Spoken
American-English (CPSA)
2 million words from faculty and committee meetings and White House
press conferences (50K word sample free on internet).
Lancaster Parsed Corpus
Dialogue Diversity
Corpus
(Bill Mann)
American National
Corpus
Chinese
English language corpora available from the sites above are not repeated
here.
The Lancaster Corpus
of Mandarin Chinese (LCMC)
By Tony McEnery and Richard Xiao. Distinguished by being a balanced
corpus, and freely available.
Multilingual
JRC-Acquis
A parallel corpus of EU documents across all member states.
8 million words or more in each of 20 languages.
EMILLE/CIIL
Monolingual written corpus data for 14 South
Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri,
Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu).
Orthographically transcribed spoken data and parallel
corpus data for five South Asian languages (Bengali, Gujarati, Hindi,
Punjabi and Urdu). In addition, the parallel corpus contains the English
originals from which the translations stored in the corpus were derived.
All data in the corpus is CES and Unicode compliant. The EMILLE corpus
totals some 94 million words. Downloadable.
OPUS
An open source parallel corpus, aligned, in many languages, based on
free Linux etc. manuals.
World
Health Organization Computer Assisted Translation page
.
Also includes a good selection of links on Computer Assisted
Translation. (See also the
copyright page
.)
Searchable
Canadian Hansard French-English parallel texts (1986-1993)
From the
Laboratoire
de Recherche Appliquée en Linguistique Informatique,
Université de Montréal
European Union web server
Parallel text in all EU languages. (In particular try
European legislation
.)
TELRI CD-ROMs
Parallel and other text in Central and Eastern European languages.
Bosnian
The Oslo Corpus
of Bosnian Texts
.
Czech
Parallel
Czech-English
Literature translations in Czech and English
Czech National Corpus project:
SYN2000
100 million words of contemporary Czech.
French
Association des Bibliophiles
Universels
Various French literary works.
American and
French Research
on the Treasury of the French Language (ARTFL)
150 million word corpus of various genres of French. You have to be a
member to use it (but membership is fairly cheap).
German
COSMAS
Corpus
Large (over a billion words!) online-searchable German and Austrian
corpora. This is the publicly available part of the 1.85
billion word
Mannheimer Corpus
Collection
NEGRA
Corpus
Saarland University Syntactically Annotated Corpus of German
Newspaper Texts. 20,000 sentences, tagged, and with syntactic
structures. Available free of charge to academics.
Russian
Russian National Corpus
150 million words, 5 million words POS-tagged, some in dependency
treebank.
Library of
Russian Internet Libraries
Various literary works.
Slovene
Slovene-English parallel corpus
1 M words, free to download + on-line concordances.
Coming soon: Slovene reference
corpus of 100 M words
Spanish and Portuguese
TychoBrahe
Parsed Corpus of Historical Portuguese
Over a million words of
Portuguese from different historical periods, some of it
morphologically analyzed/tagged. Free.
Information about Mark
Davies' collection of (mainly historical) Spanish and Portuguese corpora.
It's not clear what their availability is.
The CUMBRE corpus. Contact Professor
Aquilino Sánchez
The CRATER Spanish corpus
(morphosyntactically tagged telecommunication
manuals) is available by ftp.
Corpus
resources for Portuguese
In total about 70 million words, available free, from various
sources (newswire, etc.)
Folha de S. Paulo newspaper
4 annual CDROMs with full text.
COMPARA
Portuguese-English parallel corpus. (In general, various resources
at the Linguateca site.)
See also under ELRA, above.
Swedish
Spraakdata
, Department
of Swedish, Göteborg University.
Has various searchable part of speech
tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some
material in Zimbabwean languages.
Name | Language | Size | Availability | Comments
Penn Treebank | US English | 2 million + words | Available (distributed by LDC) | 1 million WSJ, 1 million speech, surface syntax (1970s TG)
BLLIP WSJ corpus | US English | 30 million words | Available (distributed by LDC) | WSJ newswire. Automatically parsed, not hand checked. Same structure as Penn Treebank, except for some additional coreference marking
ICE-GB | UK English | 1 million words (83,394 sentences) | Available; c. 500 pounds | British part of ICE, the International Corpus of English project. Tagged and parsed for function. Half spoken material.
NEGRA Corpus | German | 20,000 sentences | Available free of charge to academics on completion of license agreement. | Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Tagged, and with syntactic structures.
TIGER corpus | German | 700,000 words | Available free of charge for research purposes on completion of license agreement. | German newspaper text (Frankfurter Rundschau). Semi-automatically parsed. They also have a good treebank search tool, TIGERSearch.
Alpino Dependency Treebank | Dutch | 150,000 words | Freely downloadable | Assorted subcorpora. By far the largest is the full cdbl (newspaper) part of the Eindhoven corpus.
The Prague Dependency Treebank 1.0 | Czech | 500,000 words | Free on completion of license agreement (available through LDC). | Analyzed at the levels of parts of speech and syntactic functions (and, in the future, semantic roles) in a dependency framework. Text from newspapers and weekly magazines.
TUT: Turin University Treebank | Italian | 2,400 sentences | Free download. | Morphological analysis and dependency analysis. Penn Treebank translation. Civil law and newspaper texts.
Bulgarian Treebank | Bulgarian | n/a | POS-tagged texts and dependency analyses are available (some are free on the web, others via a license agreement) | An under-construction Bulgarian HPSG treebank.
Penn Chinese Treebank | Chinese | 100,000 words | Available (LDC) | Based on Xinhua news articles. 1980s-style GB syntax.
Danish Dependency Treebank 1.0 | Danish | 100,000 words | Available free under the GPL. | Built on a portion of the Parole corpus.
Floresta Sintá(c)tica | Portuguese | 168,000 words hand-corrected; 1,000,000 words automatically parsed | Hand corrected part is free web download; automatically parsed part available through email contact | Text from CETEMPúblico corpus. Phrase structure and dependency representations. Available in several formats, including Penn Treebank format.
Talbanken05 | Swedish | 300,000 words | Free download | Resurrects and modernizes an early treebank from the 1970s.
Verbmobil
Tübingen
: under construction treebanked corpus of German,
English, and Japanese sentences from Verbmobil (appointment
scheduling) data
Syntactic Spanish Database (SDB)
University of Santiago de Compostela. 160,000 clauses / 1.5 million words.
CKIP Chinese
Treebank (Taiwan)
. Based on Academia Sinica corpus.
(There's also a
100
sentence Chinese treebank
at U. Maryland.)
LDC Korean
Treebank
.
Dublin-Essex
Treebank project
Deriving Linguistic Resources from Treebanks.
CSTBank
:
Cross-document Structure Theory: marking sentence functional
relationships across related documents.
The Senseval web site
Has a
comprehensive selection of resources for WSD, including a good
list of WSD
data resources
, but not yet the
new SEMCOR
.
Ted Pedersen's code
Includes various WSD systems.
SenseClusters
Open source package for unsupervised discovery of word senses by clustering
together instances of a word (or words) that are used in similar contexts
in raw text, supporting a wide range of clustering techniques based on
both context vectors and similarity matrices, and including links to
SVDPACKC and CLUTO. Ted Pedersen and Amruta Purandare.
Evocation
WordNet synset similarity judgments
Judgments on how similar the meanings of synsets are and how common
they are in the BNC from Jordan Boyd-Graber.
There are now quite large collections of online literature, available in
various languages (though the majority are in English, of course). Below
are pointers to some of the main collections:
Entirely or mainly English
Alex: A Catalogue
of Electronic Texts on the Internet
Seems to have one of the largest collections. Searching and browsing
facilities through gopher menus. Many languages.
Wiretap Electronic Text Archive
Extensive and good quality. Still in the gopher age, though.
The On-line Books
Page
The index here only covers books in English, but there are lots of
links to other collections of material in all languages.
Project Gutenberg
The oldest and largest project to get out of copyright literature
online, freely available. (Or see the mirror,
Sailor's Project
Gutenberg site
.)
The Electronic Text
Center of the University of Virginia
Large collection of SGML text, mainly in English, but also in other
major languages.
Center for Electronic Texts in the
Humanities
Princeton/Rutgers collaboration. They didn't have it together with
their web site when I stopped by, but they may soon.
Oxford Electronic Text Library Editions
Available from
Oxford University Press, 200 Madison Ave, NY, NY 10016 212-679-7300.
The Complete Works of Jane Austen is $95.00, and is reviewed in
Computers and the Humanities
, 28:4-5 (Aug/Oct, 1994), 317-321.
Coreference
annotated texts
From University of Wolverhampton (R. Mitkov, C. Barbu et al.).
CHILDES database
.
Database of child language transcriptions in English and many other
languages. Texts are also available
by ftp
. Certain
usage requirements. Manuals and programs for accessing the data (the
CLAN concordancer) are also available online. Now in Unicode XML.
Robin Cover's SGML/XML
Web Page
This is a wonderful compendium of information on SGML and XML, including
information on
the Text Encoding Initiative (TEI)
. This document is also a guide to
many text collections (ones usi