bbsunchen

Introducing a bioinformatics toolkit


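While skimming papers on bioinformatics software recently, I noticed that there are a lot of bioinformatics toolkits. Here I introduce one of them, Bow. Some of the others I could not open, but there are still plenty of toolkits for SVMs and similar methods.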

For example: SVM light http://svmlight.joachims.org/

PASBio http://research.nii.ac.jp/~collier/projects/PASBio/

ROSTLAB http://rostlab.org/cms/index.php?id=94

http://nlp.stanford.edu/downloads/lex-parser.shtml

Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering

Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).

The library and its front-ends were designed and written by Andrew McCallum, with some contributions from several graduate and undergraduate students.

The name of the library rhymes with `low', not `cow'.

About the library

The library provides facilities for:

  • Recursively descending directories, finding text files.
  • Finding `document' boundaries when there are multiple documents per file.
  • Tokenizing a text file, according to several different methods.
  • Including N-grams among the tokens.
  • Mapping strings to integers and back again, very efficiently.
  • Building a sparse matrix of document/token counts.
  • Pruning vocabulary by word counts or by information gain.
  • Building and manipulating word vectors.
  • Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
  • Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.
  • Scoring queries for retrieval or classification.
  • Writing all data structures to disk in a compact format.
  • Reading the document/token matrix from disk in an efficient, sparse fashion.
  • Performing test/train splits, and automatic classification tests.
  • Operating in server mode, receiving and answering queries over a socket.
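
To make a few of these facilities concrete, here is a minimal Python sketch of three ideas from the list: mapping strings to integers and back, building sparse document/token counts, and Laplace smoothing of word probabilities for a naive Bayes classifier. It is not libbow's C API and is not taken from the Bow source; the names (`Vocabulary`, `tokenize`, `train`, `classify`), the simplistic tokenizer and the toy documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


class Vocabulary:
    """Two-way mapping between token strings and integer ids."""

    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = []

    def add(self, token):
        # Assign the next free id the first time a token is seen.
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.id_to_token)
            self.id_to_token.append(token)
        return self.token_to_id[token]


def tokenize(text):
    # Simplistic tokenizer (an assumption): lowercase alphabetic runs.
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs, vocab):
    """Return per-class sparse token counts and per-class document counts."""
    class_counts = defaultdict(Counter)  # label -> {token id: count}
    doc_counts = Counter()               # label -> number of documents
    for label, text in labeled_docs:
        doc_counts[label] += 1
        for token in tokenize(text):
            class_counts[label][vocab.add(token)] += 1
    return class_counts, doc_counts


def classify(text, class_counts, doc_counts, vocab):
    """Pick the class with the highest Laplace-smoothed log posterior."""
    # Tokens never seen in training are ignored.
    ids = [vocab.token_to_id[t] for t in tokenize(text) if t in vocab.token_to_id]
    total_docs = sum(doc_counts.values())
    vocab_size = len(vocab.id_to_token)
    best_label, best_score = None, -math.inf
    for label, counts in class_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for i in ids:
            # Laplace smoothing: add one to every token count.
            score += math.log((counts[i] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    ("bio", "protein sequence alignment and gene expression"),
    ("bio", "gene annotation of the protein family"),
    ("nlp", "parsing sentences with a statistical language model"),
    ("nlp", "language model smoothing for text retrieval"),
]
vocab = Vocabulary()
class_counts, doc_counts = train(docs, vocab)
predicted = classify("smoothing a language model", class_counts, doc_counts, vocab)
```

Storing counts in a `Counter` per class keeps the matrix sparse: only tokens that actually occur take up space, and a missing entry reads as zero.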

The library does not:

  • Have English parsing or part-of-speech tagging facilities.
  • Do smoothing across N-gram models.
  • Claim to be finished.
  • Have good documentation.
  • Claim to be bug-free.

It is known to compile on most UNIX systems, including Linux, Solaris, SunOS, IRIX and HP-UX. Over a year ago, it compiled on Windows NT (with a GNU build environment); it no longer does, but probably could with small fixes. Patches to the code are most welcome. It is developed on a Linux system.

The code conforms to the GNU coding standards. It is released under the GNU Library General Public License (LGPL).

Citation

You are welcome to use the code under the terms of the licence for research or commercial purposes; however, please acknowledge its use with a citation:

 

   McCallum, Andrew Kachites.  "Bow: A toolkit for statistical language
   modeling, text retrieval, classification and clustering."
   http://www.cs.cmu.edu/~mccallum/bow.  1996.

Here is a BibTeX entry:

 

   @unpublished{McCallumLibbow,
      author = "Andrew Kachites McCallum",
      title = "Bow: A toolkit for statistical language modeling, 
               text retrieval, classification and clustering",
      note = "http://www.cs.cmu.edu/~mccallum/bow",
      year = 1996}

Obtaining the Source

Source code for the library can be downloaded from this directory. Different versions are indicated by eight-digit sequences that encode the year, month and day of the release. Thus, the most recent version is the one with the largest version number.
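
Because the version is a date written as year, month, day, the largest number is also the latest date, so picking the newest archive is a single comparison. A small Python sketch, assuming hypothetical file names of the form `bow-YYYYMMDD.tar.gz` (the actual names in the download directory may differ):

```python
import re
from datetime import datetime

# Hypothetical file names; the real directory listing may differ.
archives = ["bow-19980105.tar.gz", "bow-19971210.tar.gz", "bow-19980901.tar.gz"]


def version_date(name):
    """Extract the eight-digit version from a file name and parse it as a date."""
    match = re.search(r"(\d{8})", name)
    if match is None:
        raise ValueError(f"no eight-digit version in {name!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


# The archive with the latest date is the most recent version.
latest = max(archives, key=version_date)
```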

Unfortunately, I do not have time to help rainbow's many users with all their compilation and usage problems. Feel free to send me mail asking for help, but please do not expect that I will necessarily have time to respond. Most appreciated are bug reports accompanied by fixes.

Bow Library Front-Ends

Provided in the library source distribution, there are currently three executable programs based on the library.

  • Rainbow is an executable program that does document classification. While mostly designed for classification by naive Bayes, it also provides TFIDF/Rocchio, Probabilistic Indexing and K-nearest neighbor.
  • Arrow is an executable program that does document retrieval. It currently only performs simple TFIDF-based retrieval.
  • Crossbow is an executable program that does document clustering (and also classification).
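
As an illustration of what simple TFIDF-based retrieval involves, the following Python sketch ranks documents against a query by the cosine similarity of their TFIDF vectors. It is a generic textbook formulation, not Arrow's actual implementation; the weighting variant (raw term frequency times log(N/df)), the function names and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TFIDF vector (a dict) per document, plus the idf table."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()  # term -> number of documents containing it
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return document indices ordered from best to worst match."""
    vectors, idf = tfidf_vectors(docs)
    q_tf = Counter(query.lower().split())
    # Query terms that never occur in the corpus get zero weight.
    q_vec = {term: tf * idf.get(term, 0.0) for term, tf in q_tf.items()}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    # Sort by descending score, breaking ties by document index.
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


corpus = [
    "naive bayes document classification",
    "tfidf based document retrieval",
    "document clustering with mixtures",
]
ranking = rank("document retrieval", corpus)
```

In this toy corpus the word "document" appears in every text, so its idf is zero and it contributes nothing to the score; only "retrieval" separates the documents.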