Most of the tools below are open source, released under licenses such as the GPL or Apache; read each tool's license statement carefully before using it.
I. Information Retrieval
1. Lemur/Indri
The Lemur Toolkit for Language Modeling and Information Retrieval
http://www.lemurproject.org/
Indri:
Lemur's latest search engine
2. Lucene/Nutch
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Lucene is a top-level Apache open-source project, released under the Apache License 2.0 and written entirely in Java, with ports to Perl, C/C++, .NET, and other languages.
http://lucene.apache.org/
http://www.nutch.org/
3. WGet
GNU Wget is a free software package for retrieving files using HTTP,
HTTPS and FTP, the most widely-used Internet protocols. It is a
non-interactive commandline tool, so it may easily be called from
scripts, cron jobs, terminals without X-Windows support, etc.
http://www.gnu.org/software/wget/wget.html
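Because Wget is non-interactive, calling it from a script mostly comes down to building an argument list. The following is a minimal sketch, assuming `wget` is installed and on the PATH; the helper name `build_wget_command` and the output file name are made up for illustration. It only builds and prints the command, so nothing is downloaded unless the last line is uncommented.

```python
import shlex
import subprocess  # only needed if the command is actually run


def build_wget_command(url, output_path, retries=3):
    """Return an argv list for a quiet, non-interactive wget download."""
    return [
        "wget",
        "--quiet",              # no progress output, suitable for cron jobs
        f"--tries={retries}",   # retry a few times on transient failures
        "-O", output_path,      # write the document to this file
        url,
    ]


cmd = build_wget_command(
    "http://www.gnu.org/software/wget/wget.html", "wget.html"
)
print(shlex.join(cmd))
# To really fetch the page (assumes wget is installed):
# subprocess.run(cmd, check=True)
```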
II. Natural Language Processing
1. EGYPT: A Statistical Machine Translation Toolkit
http://www.clsp.jhu.edu/ws99/projects/mt/
Includes four tools, GIZA among them.
2. GIZA++ (Statistical Machine Translation)
http://www.fjoch.com/GIZA++.html
GIZA++ is an extension of the program GIZA (part of the SMT toolkit
EGYPT) which was developed by the Statistical Machine Translation team
during the summer workshop in 1999 at the Center for Language and
Speech Processing at Johns-Hopkins University (CLSP/JHU). GIZA++
includes a lot of additional features. The extensions of GIZA++ were
designed and written by Franz Josef Och.
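Franz Josef Och has worked at Aachen University in Germany, at ISI (the Information Sciences Institute of the University of Southern California), and at Google. GIZA++ now has a Windows port and supports IBM Models 1-5 well.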
3. PHARAOH (Statistical Machine Translation)
http://www.isi.edu/licensed-sw/pharaoh/
a beam search decoder for phrase-based statistical machine translation models
4. OpenNLP:
http://opennlp.sourceforge.net/
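Includes more than 20 tools, Maxent among them.
btw: these SMT tools all seem to like Egypt-related names, such as GIZA, PHARAOH, and Cairo. Och developed GIZA++ while at ISI, and PHARAOH was also developed by someone from ISI, Philipp Koehn, so the connections really are tangled.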
5. MINIPAR by Dekang Lin (Univ. of Alberta, Canada)
MINIPAR is a broad-coverage parser for the English language. An
evaluation with the SUSANNE corpus shows that MINIPAR achieves about
88% precision and 80% recall with respect to dependency relationships.
MINIPAR is very efficient: on a Pentium II 300 with 128MB memory, it
parses about 300 words per second.
The binary can be downloaded for free after filling out a form.
http://www.cs.ualberta.ca/~lindek/minipar.htm
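To make the precision and recall figures quoted for MINIPAR concrete, here is a small sketch of how the two measures are computed over dependency relationships. The triples below are invented for illustration; they are not MINIPAR output and not taken from the SUSANNE corpus.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted dependency triples against gold ones."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (head, dependent, relation) triples for one sentence.
gold = {
    ("saw", "John", "subj"),
    ("saw", "dog", "obj"),
    ("dog", "the", "det"),
    ("dog", "big", "mod"),
    ("saw", "yesterday", "mod"),
}
predicted = {
    ("saw", "John", "subj"),
    ("saw", "dog", "obj"),
    ("dog", "the", "det"),
    ("dog", "yesterday", "mod"),  # wrong head, so it counts against precision
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

Three of the four predicted triples are correct (precision 0.75), and three of the five gold triples were found (recall 0.60).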
6. WordNet
http://wordnet.princeton.edu/
WordNet is an online lexical reference system whose design is inspired
by current psycholinguistic theories of human lexical memory. English
nouns, verbs, adjectives and adverbs are organized into synonym sets,
each representing one underlying lexical concept. Different relations
link the synonym sets.
WordNet was developed by the Cognitive Science Laboratory at Princeton
University under the direction of Professor George A. Miller (Principal
Investigator).
The latest WordNet version is 2.1 (for Windows & Unix-like OS), with binaries, source, and documentation available.
The online version of WordNet is at http://wordnet.princeton.edu/perl/webwn
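The synonym-set organization described above can be pictured with a toy data structure. This is only a sketch with a few hand-written entries; the names `Synset`, `SYNSETS`, `synsets_for`, and `hypernym_chain` are hypothetical and do not reflect WordNet's file format or any of its APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    name: str                                       # identifier of the concept
    words: list                                     # synonyms expressing the concept
    hypernyms: list = field(default_factory=list)   # names of more general synsets


# A tiny hand-made lexicon (illustrative entries only).
SYNSETS = {
    "dog.n": Synset("dog.n", ["dog", "domestic dog"], ["canine.n"]),
    "canine.n": Synset("canine.n", ["canine", "canid"], ["animal.n"]),
    "animal.n": Synset("animal.n", ["animal", "beast"]),
}


def synsets_for(word):
    """Names of all synsets that contain the given word."""
    return [s.name for s in SYNSETS.values() if word in s.words]


def hypernym_chain(name):
    """Follow the first hypernym link upward until a root synset is reached."""
    chain = []
    while SYNSETS[name].hypernyms:
        name = SYNSETS[name].hypernyms[0]
        chain.append(name)
    return chain


print(synsets_for("canid"))      # ['canine.n']
print(hypernym_chain("dog.n"))   # ['canine.n', 'animal.n']
```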
7. HowNet
http://www.keenage.com/
HowNet is an on-line common-sense knowledge base unveiling
inter-conceptual relations and inter-attribute relations of concepts as
connoting in lexicons of the Chinese and their English equivalents.
Developed by Zhendong Dong & Qiang Dong of CAS; it is a resource similar to WordNet.
8. Statistical Language Modeling Toolkit
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
The CMU-Cambridge Statistical Language Modeling toolkit is a suite of
UNIX software tools to facilitate the construction and testing of
statistical language models.
9. SRI Language Modeling Toolkit
www.speech.sri.com/projects/srilm/
SRILM is a toolkit for building and applying statistical language
models (LMs), primarily for use in speech recognition, statistical
tagging and segmentation. It has been under development in the SRI
Speech Technology and Research Laboratory since 1995.
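As a rough picture of what "building a statistical language model" means, here is a minimal bigram model with add-one smoothing. It is a sketch for illustration only: it does not use SRILM or the CMU-Cambridge toolkit, both of which implement far better smoothing methods, and the function names and training sentences are made up.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return contexts, bigrams, vocab


def bigram_prob(word, prev, contexts, bigrams, vocab):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


contexts, bigrams, vocab = train_bigram(["the cat sat", "the dog sat"])
print(bigram_prob("cat", "the", contexts, bigrams, vocab))  # 0.25
```

With six vocabulary items (including the boundary markers) and "the" seen twice as a context, P(cat | the) = (1 + 1) / (2 + 6) = 0.25.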
10. ReWrite Decoder
http://www.isi.edu/licensed-sw/rewrite-decoder/
The ISI ReWrite Decoder Release 1.0.0a by Daniel Marcu and Ulrich
Germann. It is a program that translates from one natural language into
another using statistical machine translation.
11. GATE (General Architecture for Text Engineering)
http://gate.ac.uk/
A Java Library for Text Engineering
III. Machine Learning
1. YASMET: Yet Another Small MaxEnt Toolkit (Statistical Machine Learning)
http://www.fjoch.com/YASMET.html
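Written by Franz Josef Och. In addition, the OpenNLP project includes a Java MaxEnt tool that estimates parameters with GIS; Zhang Le of Northeastern University (东北大学), currently studying in the UK, ported it to C++.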
2. LibSVM
Developed by Chih-Jen Lin of National Taiwan University (NTU); available in C++, Java, Perl, C#, and other languages.
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM is an integrated software for support vector classification,
(C-SVC, nu-SVC ), regression (epsilon-SVR, nu-SVR) and distribution
estimation (one-class SVM ). It supports multi-class classification.
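LIBSVM reads training data in a sparse text format: one example per line, a label followed by `index:value` pairs. The format itself is not described in the text above, so treat the following as a sketch of how such a line can be parsed; `parse_libsvm_line` is a hypothetical helper, not part of LIBSVM.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)  # features not listed are zero
    return label, features


label, features = parse_libsvm_line("+1 1:0.5 3:1.2")
print(label, features)  # 1.0 {1: 0.5, 3: 1.2}
```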
3. SVM Light
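Developed by Thorsten Joachims (now at Cornell) while he was at the University of Dortmund; it has become the best-known SVM package after LibSVM. Open source, written in C, and usable for ranking problems.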
http://svmlight.joachims.org/
4. CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
a software package for clustering low- and high-dimensional datasets
This package is distributed only as executables and libraries; the source code is not available for download.
5. CRF++
http://chasen.org/~taku/software/CRF++/
Yet Another CRF toolkit for segmenting/labelling sequential data
CRF (Conditional Random Fields) grew out of HMM/MEMM and is widely used in IE, IR, and NLP.
6. SVM Struct
http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
Like SVM Light, it was developed by Thorsten Joachims of Cornell.
SVMstruct is a Support Vector Machine (SVM) algorithm for predicting
multivariate outputs. It performs supervised learning by approximating
a mapping
h: X --> Y
using labeled training examples (x1,y1), ..., (xn,yn).
Unlike regular SVMs, however, which consider only univariate
predictions like in classification and regression, SVMstruct can
predict complex objects y like trees, sequences, or sets. Examples of
problems with complex outputs are natural language parsing, sequence
alignment in protein homology detection, and markov models for
part-of-speech tagging.
SVMstruct can be thought of as an API for implementing different kinds
of complex prediction algorithms. Currently, we have implemented the
following learning tasks:
SVMmulticlass: Multi-class classification. Learns to predict one of k
mutually exclusive classes. This is probably the simplest possible
instance of SVMstruct and serves as a tutorial example of how to use
the programming interface.
SVMcfg: Learns a weighted context free grammar from examples. Training
examples (e.g. for natural language parsing) specify the sentence along
with the correct parse tree. The goal is to predict the parse tree of
new sentences.
SVMalign: Learning to align sequences. Given examples of how sequence
pairs align, the goal is to learn the substitution matrix as well as
the insertion and deletion costs of operations so that one can predict
alignments of new sequences.
SVMhmm: Learns a Markov model from examples. Training examples (e.g.
for part-of-speech tagging) specify the sequence of words along with
the correct assignment of tags (i.e. states). The goal is to predict
the tag sequences for new sentences.
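For the last task, predicting a tag sequence from a Markov model is commonly done with Viterbi decoding. The sketch below shows that decoding step only, with hand-picked additive scores; it is not SVMhmm's code, the weights are invented rather than learned, and all names (`viterbi`, `START`, `TRANS`, `EMIT`, `UNKNOWN`) are assumptions made for the example.

```python
UNKNOWN = -10.0  # assumed score for a (tag, word) pair with no weight


def viterbi(words, tags, start, trans, emit):
    """Return the highest-scoring tag sequence under additive scores."""
    best = [{t: start[t] + emit[t].get(words[0], UNKNOWN) for t in tags}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            # best previous tag for reaching tag t at this position
            prev = max(tags, key=lambda p: best[-1][p] + trans[p][t])
            scores[t] = best[-1][prev] + trans[prev][t] + emit[t].get(word, UNKNOWN)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # trace back from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.0, "NOUN": -1.0, "VERB": -2.0}
TRANS = {
    "DET": {"DET": -5.0, "NOUN": 0.0, "VERB": -5.0},
    "NOUN": {"DET": -3.0, "NOUN": -2.0, "VERB": 0.0},
    "VERB": {"DET": -1.0, "NOUN": -1.0, "VERB": -4.0},
}
EMIT = {
    "DET": {"the": 0.0},
    "NOUN": {"dog": 0.0, "barks": -2.0},
    "VERB": {"barks": 0.0, "dog": -3.0},
}

print(viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT))
# ['DET', 'NOUN', 'VERB']
```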
IV. Misc:
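1. Notepad++: an open-source editor that supports keyword highlighting for dozens of languages, including C#, Perl, and CSS; its features are comparable to those of recent versions of UltraEdit and Visual Studio .NET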
http://notepad-plus.sourceforge.net
2. WinMerge: compares text content, finding the differences between two versions of a program
winmerge.sourceforge.net/
3. OpenPerlIDE: an open-source Perl editor with built-in compilation and line-by-line debugging
open-perl-ide.sourceforge.net/
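ps: the best editor I have come across is still VS .NET. There is a +/- sign in front of each function for expand/collapse, it supports block copy/cut/paste, ctrl+c/ctrl+x/ctrl+v can act on a whole line at once, ctrl+k+c/ctrl+k+u comment/uncomment multiple lines, and there is more...... Visual Studio .NET is really kool :D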
4. Berkeley DB
http://www.sleepycat.com/
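Berkeley DB is not a relational database; it is described as an embedded database: in client/server terms, its client and server share a single address space. Because databases originally grew out of file systems, it is more like a dictionary-style database of key-value pairs. The database files are persisted to disk, so it is not limited by memory size. BDB has a variant, Berkeley DB XML, which is an XML database (storing data in the form of XML files?). BDB has been embedded in products by companies including Microsoft, Google, HP, Ford, and Motorola.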
Berkeley DB (libdb) is a programmatic toolkit that provides embedded
database support for both traditional and client/server applications.
It includes b+tree, queue, extended linear hashing, fixed, and
variable-length record access methods, transactions, locking, logging,
shared memory caching, database recovery, and replication for highly
available systems. DB supports C, C++, Java, PHP, and Perl APIs.
It turns out that at a basic level Berkeley DB is just a very high
performance, reliable way of persisting dictionary style data
structures - anything where a piece of data can be stored and looked up
using a unique key. The key and the value can each be up to 4 gigabytes
in length and can consist of anything that can be crammed in to a
string of bytes, so what you do with it is completely up to you. The
only operations available are "store this value under this key", "check
if this key exists" and "retrieve the value for this key" so
conceptually it's pretty simple - the complicated stuff all happens
under the hood.
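The three operations named above (store, check, retrieve) can be tried out with Python's standard `dbm.dumb` module. This is a stand-in chosen only because it ships with Python; it is not Berkeley DB and has none of its transactions, locking, or replication. The key names and file name are made up.

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_store")  # hypothetical database name

    # "c" creates the database if it does not exist yet.
    with dbm.dumb.open(path, "c") as db:
        db[b"user:42"] = b"alice"       # store this value under this key
        exists = b"user:42" in db       # check if this key exists
        missing = b"user:99" in db
        value = db[b"user:42"]          # retrieve the value for this key

    # Reopen read-only to show that the data was persisted to disk.
    with dbm.dumb.open(path, "r") as db:
        persisted = db[b"user:42"]

print(exists, missing, value, persisted)
```

Keys and values are plain byte strings, which matches the "anything that can be crammed in to a string of bytes" description.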
case study:
Ask Jeeves uses Berkeley DB to provide an easy-to-use tool for searching the Internet.
Microsoft uses Berkeley DB for the Groove collaboration software
AOL uses Berkeley DB for search tool meta-data and other services.
Hitachi uses Berkeley DB in its directory services server product.
Ford uses Berkeley DB to authenticate partners who access Ford's Web applications.
Hewlett Packard uses Berkeley DB in several products, including storage, security and wireless software.
Google uses Berkeley DB High Availability for Google Accounts.
Motorola uses Berkeley DB to track mobile units in its wireless radio network products.
5. LaTeX
LATEX, written as LaTeX in plain text, is a document preparation system for the TeX typesetting program.
It offers programmable desktop publishing features and extensive
facilities for automating most aspects of typesetting and desktop
publishing, including numbering and cross-referencing, tables and
figures, page layout, bibliographies, and much more. LaTeX was
originally written in 1984 by Leslie Lamport and has become the
dominant method for using TeX—few people write in plain TeX anymore.
The current version is LaTeX2ε.
The Chinese distribution can be found at http://www.ctex.org
http://learn.tsinghua.edu.cn:8080/2001315450/comp.html by Wang Yin (王垠)
6. EditPlus
http://www.editplus.com/
EditPlus is an Internet-ready 32-bit text editor, HTML editor and
programmers editor for Windows. While it can serve as a good
replacement for Notepad, it also offers many powerful features for Web
page authors and programmers.
The current latest version of EditPlus is 2.21; the BrE and AmE spell checkers have to be downloaded and installed as separate packages.
7. GVim: Vi IMproved
http://www.vim.org/index.php
Vim is an advanced text editor that seeks to provide the power of the
de-facto Unix editor 'Vi', with a more complete feature set. It's
useful whether you're already using vi or using a different editor.
Users of Vim 5 should consider upgrading to Vim 6, which is greatly
enhanced since Vim 5. Vim is often called a "programmer's editor," and
so useful for programming that many consider it an entire IDE. It's not
just for programmers, though. Vim is perfect for all kinds of text
editing, from composing email to editing configuration files.
Ordinary Windows users can download it from this link: ftp://ftp.vim.org/pub/vim/pc/gvim64.exe
8. Cygwin: GNU + Cygnus + Windows
http://www.cygwin.com/
Cygwin is a Linux-like environment for Windows. It consists of two
parts: A DLL (cygwin1.dll) which acts as a Linux API emulation layer
providing substantial Linux API functionality. A collection of tools,
which provide Linux look and feel.
9. MinGW: Minimalistic GNU for Windows
http://www.mingw.org/
MinGW: A collection of freely available and freely distributable
Windows specific header files and import libraries combined with GNU
toolsets that allow one to produce native Windows programs that do not
rely on any 3rd-party C runtime DLLs.
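Both are used to compile and port Unix/Linux software on Windows. Cygwin emulates a POSIX-compliant layer on top of the Windows system (the library file is cygwin1.dll), whereas MinGW uses Windows' own library (msvcrt.dll) to implement some functionality that conforms to the POSIX spec and is not fully POSIX-compliant. MinGW is actually a branch of Cygwin; because it does not implement an emulation layer for the Linux API, its overhead is somewhat lower than Cygwin's.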
10. CutePDF Writer
http://www.cutepdf.com
Portable Document Format (PDF) is the de facto standard for the secure
and reliable distribution and exchange of electronic documents and
forms around the world. CutePDF Writer (formerly CutePDF Printer) is
the free version of commercial PDF creation software. CutePDF Writer
installs itself as a "printer subsystem". This enables virtually any
Windows applications (must be able to print) to create professional
quality PDF documents - with just a push of a button!
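Compared with Acrobat, one big advantage is that it is free. Charts and formulas in Word documents also generally convert well: what you see is what you get, haha. A ps2pdf converter may be required; the site provides a download link.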
11. R
http://www.r-project.org/
R is a language and environment for statistical computing and graphics.
It is a GNU project which is similar to the S language and environment
which was developed at Bell Laboratories (formerly AT&T, now Lucent
Technologies) by John Chambers and colleagues. R can be considered as a
different implementation of S. There are some important differences,
but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear
modelling, classical statistical tests, time-series analysis,
classification, clustering, ...) and graphical techniques, and is
highly extensible. The S language is often the vehicle of choice for
research in statistical methodology, and R provides an Open Source
route to participation in that activity.
One of R's strengths is the ease with which well-designed
publication-quality plots can be produced, including mathematical
symbols and formulae where needed. Great care has been taken over the
defaults for the minor design choices in graphics, but the user retains
full control.
R is available as Free Software under the terms of the Free Software
Foundation's GNU General Public License in source code form. It
compiles and runs on a wide variety of UNIX platforms and similar
systems (including FreeBSD and Linux), Windows and MacOS.
R is statistical software similar to MATLAB; both are used in scientific computing. The difference is that R is open source :)
12. Judy
http://judy.sourceforge.net/
Judy arrays are fast, sometimes even faster than a hash table. And because Judy
arrays are a type of trie, they consume much less memory than hash tables.
Roughly speaking, it is similar to a highly-optimised 256-ary trie data
structure.
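To illustrate the "256-ary trie" comparison, here is a plain byte-keyed trie in which each node branches on one byte of the key. It is a sketch of the general idea only: real Judy arrays use compressed, adaptive node layouts that this code does not attempt, and the class name `ByteTrie` is made up.

```python
class ByteTrie:
    """A simple trie keyed by bytes; each node has up to 256 children."""

    def __init__(self):
        self.children = {}      # sparse stand-in for a 256-slot array
        self.value = None
        self.has_value = False

    def insert(self, key: bytes, value):
        node = self
        for byte in key:        # iterating bytes yields ints in 0..255
            node = node.children.setdefault(byte, ByteTrie())
        node.value = value
        node.has_value = True

    def get(self, key: bytes, default=None):
        node = self
        for byte in key:
            node = node.children.get(byte)
            if node is None:
                return default
        return node.value if node.has_value else default


trie = ByteTrie()
trie.insert(b"judy", 1)
trie.insert(b"jug", 2)
print(trie.get(b"judy"), trie.get(b"jug"), trie.get(b"ju"))  # 1 2 None
```

The keys "judy" and "jug" share the nodes for their common prefix "ju", which is the property that lets tries save memory on keys with shared prefixes.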