Most of the tools below are open source, released under licenses such as the GPL or Apache. Read each tool's license statement carefully before use.
I. Information Retrieval
1. Lemur/Indri
The Lemur Toolkit for Language Modeling and Information Retrieval
http://www.lemurproject.org/
Indri:
Lemur's latest search engine
2. Lucene/Nutch
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Lucene is a top-level Apache open-source project, released under the Apache License 2.0 and written entirely in Java, with ports to Perl, C/C++, .NET and other languages.
http://lucene.apache.org/
http://www.nutch.org/
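To make the idea of a full-text search library more concrete, here is a minimal Python sketch of an inverted index, the kind of data structure such engines are generally built around. It is not Lucene's API: the function names `build_index` and `search` and the sample documents are made up for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return sorted ids of documents containing every query token (AND query)."""
    tokens = query.lower().split()
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)

# Hypothetical sample documents, for illustration only
docs = {
    1: "Lucene is a text search engine library",
    2: "Wget retrieves files over HTTP",
    3: "Nutch is a web search engine",
}
index = build_index(docs)
print(search(index, "search engine"))
```

A real engine adds tokenization rules, ranking and on-disk index formats on top of this lookup.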
3. Wget
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
http://www.gnu.org/software/wget/wget.html
II. Natural Language Processing
1. EGYPT: A Statistical Machine Translation Toolkit
http://www.clsp.jhu.edu/ws99/projects/mt/
Includes four tools, GIZA among them.
2. GIZA++ (Statistical Machine Translation)
http://www.fjoch.com/GIZA++.html
GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT) which was developed by the Statistical Machine Translation team during the summer workshop in 1999 at the Center for Language and Speech Processing at Johns-Hopkins University (CLSP/JHU). GIZA++ includes a lot of additional features. The extensions of GIZA++ were designed and written by Franz Josef Och.
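Franz Josef Och worked successively at RWTH Aachen University in Germany, ISI (the Information Sciences Institute at the University of Southern California) and Google. A Windows port of GIZA++ is now available, and it has good support for IBM Models 1-5.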
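IBM Model 1 is the simplest of the IBM alignment models. The sketch below runs a few EM iterations on a toy parallel corpus to estimate word translation probabilities t(f|e). It is a simplified illustration, not GIZA++ code: it omits the NULL word, and the function name `train_ibm_model1` and the three sentence pairs are assumptions chosen for the example.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """Estimate t(f|e) with EM; corpus is a list of (e_tokens, f_tokens) pairs."""
    f_vocab = {f for _, fs in corpus for f in fs}
    # Uniform initialisation of the translation table t[(f, e)]
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) co-occurrences
        total = defaultdict(float)  # expected count of e
        # E-step: distribute each f over the e words in the same sentence pair
        for es, fs in corpus:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise expected counts into probabilities
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Toy corpus (hypothetical): English tokens paired with German tokens
corpus = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "book"], ["ein", "buch"]),
]
t = train_ibm_model1(corpus)
for pair in [("das", "the"), ("haus", "house"), ("buch", "book"), ("ein", "a")]:
    print(pair, round(t[pair], 3))
```

Even on three sentence pairs, the probabilities move toward the intuitive word pairings as the iterations proceed.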
3. PHARAOH (Statistical Machine Translation)
http://www.isi.edu/licensed-sw/pharaoh/
a beam search decoder for phrase-based statistical machine translation models
4. OpenNLP:
http://opennlp.sourceforge.net/
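Includes more than 20 tools, such as Maxent.
btw: SMT tools tend to be given Egypt-related names, such as GIZA, PHARAOH and Cairo. Och developed GIZA++ while at ISI, and PHARAOH was written by Philipp Koehn, also from ISI, so the relationships are quite tangled.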
5. MINIPAR by Dekang Lin (Univ. of Alberta, Canada)
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient: on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
The binary can be downloaded free of charge after filling out a form.
http://www.cs.ualberta.ca/~lindek/minipar.htm
6. WordNet
http://wordnet.princeton.edu/
WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
WordNet was developed by the Cognitive Science Laboratory at Princeton University under the direction of Professor George A. Miller (Principal Investigator).
The latest WordNet release is 2.1 (for Windows and Unix-like OSes), available as binaries, source and documentation.
The online version of WordNet is at http://wordnet.princeton.edu/perl/webwn
7. HowNet
http://www.keenage.com/
HowNet is an on-line common-sense knowledge base unveiling inter-conceptual relations and inter-attribute relations of concepts as connoting in lexicons of the Chinese and their English equivalents.
Developed by Zhendong Dong and Qiang Dong of CAS, it is something similar to WordNet.
8. Statistical Language Modeling Toolkit
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
The CMU-Cambridge Statistical Language Modeling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models.
9. SRI Language Modeling Toolkit
www.speech.sri.com/projects/srilm/
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995.
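As a rough illustration of what a statistical language model computes, the following Python sketch estimates bigram probabilities with add-one (Laplace) smoothing from a toy corpus. It is neither the CMU-Cambridge toolkit nor SRILM, both of which offer more refined smoothing methods. The function names and training sentences are assumptions made for the example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Hypothetical training data
sentences = ["the cat sat", "the dog sat", "the cat ran"]
unigrams, bigrams = train_bigram(sentences)
print(bigram_prob(unigrams, bigrams, "the", "cat"))
print(bigram_prob(unigrams, bigrams, "the", "dog"))
```

Smoothing keeps unseen word pairs from receiving zero probability, which matters once the model is applied to new text.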
10. ReWrite Decoder
http://www.isi.edu/licensed-sw/rewrite-decoder/
The ISI ReWrite Decoder Release 1.0.0a by Daniel Marcu and Ulrich Germann. It is a program that translates from one natural language into another using statistical machine translation.
11. GATE (General Architecture for Text Engineering)
http://gate.ac.uk/
A Java Library for Text Engineering
III. Machine Learning
1. YASMET: Yet Another Small MaxEnt Toolkit (Statistical Machine Learning)
http://www.fjoch.com/YASMET.html
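Written by Franz Josef Och. In addition, the OpenNLP project includes a Java MaxEnt tool that estimates parameters with GIS (Generalized Iterative Scaling). It was ported to C++ by Zhang Le of Northeastern University (China), who is currently studying in the UK.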
2. LibSVM
Developed by Chih-Jen Lin of National Taiwan University (NTU), with versions in C++, Java, Perl, C# and other languages.
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification.
3. SVM Light
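Developed by Thorsten Joachims (now at Cornell) while he was at the University of Dortmund, it has become the best-known SVM package after LibSVM. It is open source, written in C, and can be used for ranking problems.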
http://svmlight.joachims.org/
4. CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
a software package for clustering low- and high-dimensional datasets
This package is distributed only as executables and a library. The source code is not available for download.
5. CRF++
http://chasen.org/~taku/software/CRF++/
Yet Another CRF toolkit for segmenting/labelling sequential data
CRFs (Conditional Random Fields) grew out of HMMs and MEMMs and are widely used in IE, IR and NLP.
6. SVM Struct
http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
Like SVM Light, it was developed by Thorsten Joachims of Cornell.
SVMstruct is a Support Vector Machine (SVM) algorithm for predicting multivariate outputs. It performs supervised learning by approximating a mapping
h: X --> Y
using labeled training examples (x1,y1), ..., (xn,yn).
Unlike regular SVMs, however, which consider only univariate predictions like in classification and regression, SVMstruct can predict complex objects y like trees, sequences, or sets. Examples of problems with complex outputs are natural language parsing, sequence alignment in protein homology detection, and markov models for part-of-speech tagging.
SVMstruct can be thought of as an API for implementing different kinds of complex prediction algorithms. Currently, we have implemented the following learning tasks:
SVMmulticlass: Multi-class classification. Learns to predict one of k mutually exclusive classes. This is probably the simplest possible instance of SVMstruct and serves as a tutorial example of how to use the programming interface.
SVMcfg: Learns a weighted context free grammar from examples. Training examples (e.g. for natural language parsing) specify the sentence along with the correct parse tree. The goal is to predict the parse tree of new sentences.
SVMalign: Learning to align sequences. Given examples of how sequence pairs align, the goal is to learn the substitution matrix as well as the insertion and deletion costs of operations so that one can predict alignments of new sequences.
SVMhmm: Learns a Markov model from examples. Training examples (e.g. for part-of-speech tagging) specify the sequence of words along with the correct assignment of tags (i.e. states). The goal is to predict the tag sequences for new sentences.
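Predicting a tag sequence with a learned Markov-style model is commonly done with Viterbi decoding. The sketch below shows that search step in Python with hand-set additive scores. It is not the SVMstruct or SVMhmm programming interface: the function `viterbi`, the score tables `emit` and `trans`, and their values are invented for illustration, whereas a trained model would supply such scores.

```python
def viterbi(words, tags, emit, trans):
    """Return the highest-scoring tag sequence under additive scores.

    emit[(tag, word)] and trans[(prev_tag, tag)] hold scores;
    missing entries count as 0.0.
    """
    if not words:
        return []
    # best[i][tag] = (score of the best path ending in tag at position i, backpointer)
    best = [{t: (emit.get((t, words[0]), 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + trans.get((p, t), 0.0)) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + emit.get((t, word), 0.0), prev)
        best.append(column)
    # Follow backpointers from the best final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Hypothetical scores for a three-word sentence
tags = ["DET", "NOUN", "VERB"]
emit = {("DET", "the"): 2.0, ("NOUN", "dog"): 2.0,
        ("VERB", "barks"): 1.0, ("NOUN", "barks"): 1.0}
trans = {("DET", "NOUN"): 1.0, ("NOUN", "VERB"): 1.0, ("NOUN", "NOUN"): -1.0}
print(viterbi(["the", "dog", "barks"], tags, emit, trans))
```

Dynamic programming keeps the search linear in sentence length rather than enumerating every possible tag sequence.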
IV. Misc:
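1. Notepad++: an open-source editor with keyword support for dozens of languages, including C#, Perl and CSS. Its features rival those of recent versions of UltraEdit and Visual Studio .NET.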
http://notepad-plus.sourceforge.net
2. WinMerge: compares text content, for example to find the differences between two versions of a program.
winmerge.sourceforge.net/
3. OpenPerlIDE: an open-source Perl editor with built-in compilation and line-by-line debugging.
open-perl-ide.sourceforge.net/
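ps: the best editor I have come across is still VS .NET. It shows +/- markers in front of each function for expand/collapse, supports region copy/cut/paste, lets Ctrl+C/Ctrl+X/Ctrl+V act on a whole line at once, and comments/uncomments multiple lines with Ctrl+K+C/Ctrl+K+U, and there is more... Visual Studio .NET is really kool :D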
4. Berkeley DB
http://www.sleepycat.com/
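Berkeley DB is not a relational database. It is described as an embedded database: in client/server terms, its client and server share one address space. Because databases originally grew out of file systems, it is more like a dictionary-style database of key-value pairs. The database file is persisted to disk, so it is not limited by memory size. BDB has a sub-product, Berkeley DB XML, which is an XML database (does it store data in the form of XML files?). BDB has been embedded in products by companies including Microsoft, Google, HP, Ford and Motorola.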
Berkeley DB (libdb) is a programmatic toolkit that provides embedded database support for both traditional and client/server applications. It includes b+tree, queue, extended linear hashing, fixed, and variable-length record access methods, transactions, locking, logging, shared memory caching, database recovery, and replication for highly available systems. DB supports C, C++, Java, PHP, and Perl APIs.
It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary style data structures - anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed in to a string of bytes, so what you do with it is completely up to you. The only operations available are "store this value under this key", "check if this key exists" and "retrieve the value for this key" so conceptually it's pretty simple - the complicated stuff all happens under the hood.
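The three operations described above (store, check, retrieve) can be sketched with Python's standard `dbm` module, which offers a similar persistent key-value interface. This is a stand-in for illustration and does not use Berkeley DB itself. The function name `demo`, the file name `tools` and the stored entry are assumptions.

```python
import dbm
import os
import tempfile

def demo(path):
    """Exercise the three basic operations on a persistent key-value file."""
    # "c" opens the database for reading and writing, creating it if needed
    with dbm.open(path, "c") as db:
        db[b"lucene"] = b"http://lucene.apache.org/"  # store this value under this key
    # Reopen read-only to show that the data was persisted to disk
    with dbm.open(path, "r") as db:
        exists = b"lucene" in db   # check if this key exists
        missing = b"nutch" in db   # a key that was never stored
        value = db[b"lucene"]      # retrieve the value for this key
    return exists, missing, value

with tempfile.TemporaryDirectory() as tmp:
    print(demo(os.path.join(tmp, "tools")))
```

Keys and values are plain byte strings, so any serialization format can be layered on top.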
case study:
Ask Jeeves uses Berkeley DB to provide an easy-to-use tool for searching the Internet.
Microsoft uses Berkeley DB for the Groove collaboration software
AOL uses Berkeley DB for search tool meta-data and other services.
Hitachi uses Berkeley DB in its directory services server product.
Ford uses Berkeley DB to authenticate partners who access Ford's Web applications.
Hewlett Packard uses Berkeley DB in several products, including storage, security and wireless software.
Google uses Berkeley DB High Availability for Google Accounts.
Motorola uses Berkeley DB to track mobile units in its wireless radio network products.
5. R
http://www.r-project.org/
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
The R statistics software is similar to MATLAB: both are used in scientific computing.
Reposted from: http://kapoc.blogdriver.com/kapoc/1268927.html