Lucene: the core indexing classes

Blog category:

Lucene

IndexWriter: This class is the central component of the indexing process. It creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn’t let you read or search it. ...

2014-07-03 15:11
Views: 385
Comments (0)
Category: Open Source Software

Lucene: Search Engine Arch

Blog category:

Lucene

Components for indexing. Acquire content: the first step, at the bottom of figure 1.4, is to acquire content. This process, which involves using a crawler or spider, gathers and scopes the content that needs to be indexed. That may be trivial, for example, if you’re indexing a set of XML files ...

2014-07-03 11:54
Views: 536
Comments (0)
Category: Open Source Software

Hadoop: High Quality Blogs

Blog category:

Hadoop

http://www.cnblogs.com/zhangchaoyang/articles/2647905.html
http://blog.pureisle.net/archives/1618.html
http://www.csdn.net/article/2014-01-01/2817984-13-tools-let-hadoop-fly
http://blog.mortardata.com/post/82002488484/hadoop-weekly-april-7-2014
https://www.mapr.com/company/press-rel ...

2014-07-01 15:01
Views: 453
Comments (0)
Category: Open Source Software

Hadoop: Data Join

Blog category:

Hadoop

Reduce-side joining (repartitioned sort-merge join). Note: DataJoinReducerBase, on the other hand, is the workhorse of the datajoin package, and it simplifies our programming by performing a full outer join for us. Our reducer subclass only has to implement the combine() method to filter out u ...
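To make the reduce-side join idea a bit more concrete, here is a minimal plain-Python sketch. It is not the Hadoop datajoin API: `reduce_side_join`, `inner_join_combine`, and the sample records are all hypothetical names invented for illustration. The sketch groups tagged records by join key, builds every combination across the sources present for that key (the outer-join step), and lets a `combine` callback decide which combinations to keep. The filtering rule used here (drop any combination that lacks a record from one of the sources, which yields an inner join) is an assumption, since the excerpt above is cut off.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(tagged_records, sources, combine):
    """Join (tag, key, value) records on key; tag names the data source."""
    # "Shuffle" step: group the values by join key, then by source tag.
    groups = defaultdict(lambda: defaultdict(list))
    for tag, key, value in tagged_records:
        groups[key][tag].append(value)

    joined = []
    for key in sorted(groups):
        by_tag = groups[key]
        # Only the sources that actually have records for this key.
        present = [t for t in sources if t in by_tag]
        # Outer-join step: every combination across the present sources.
        for values in product(*(by_tag[t] for t in present)):
            row = combine(present, list(values))
            if row is not None:  # combine() returns None to drop a combination
                joined.append((key, row))
    return joined


def inner_join_combine(tags, values):
    """Assumed rule: keep a combination only if both sources contributed."""
    if len(tags) < 2:
        return None
    return ",".join(values)


records = [
    ("customers", "1", "Alice"),
    ("customers", "2", "Bob"),
    ("orders", "1", "book"),
    ("orders", "1", "pen"),
    ("orders", "3", "lamp"),
]
result = reduce_side_join(records, ["customers", "orders"], inner_join_combine)
```

With the sample data, only key "1" appears in both sources, so `result` holds the two rows for Alice; keys "2" and "3" reach `combine` but are dropped by it.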

2014-06-30 15:12
Views: 503
Comments (0)
Category: Open Source Software

Mahout: CVB

Blog category:

Mahout

Running cvb fails with the error: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable. Solution: the new LDA requires SequenceFile<IntWritable, VectorWritable> as input (the same disk format as DistributedRowMatrix), which you can get out of SequenceFile<Text, V ...

2014-06-19 18:32
Views: 986
Comments (0)
Category: Open Source Software

Regular Expressions

Blog category:

Java

References http://www.cnblogs.com/elleniou/archive/2012/07/31/2617312.html

2014-06-17 15:28
Views: 422
Comments (0)
Category: Open Source Software

HDFS: API Introduction

Blog category:

Hadoop

References http://blog.csdn.net/lastsweetop/article/details/9001467

2014-06-17 15:27
Views: 485
Comments (0)
Category: Open Source Software

Mahout: Integrate jcseg with Mahout seq2sparse

Blog category:

Mahout

Google global site URLs: https://github.com/justjavac/Google-IPs
JCSEG: http://www.oschina.net/p/jcseg
MMSEG: http://technology.chtsai.org/mmseg/
// convert the Maven project to an Eclipse project
# mvn eclipse:eclipse -DskipTests
// transfer text docs to seq docs
# mahout seqdirectory -c UTF-8 ...

2014-06-16 18:30
Views: 725
Comments (0)
Category: Open Source Software

jar commands

Blog category:

Java

// list the files in a jar
jar tf xx.jar
// update a file in a jar
jar uvf xx.jar newfile
jar uvf xx.jar org/google/newclassfile // newclassfile is in ./org/google/
// extract a jar
jar xf xx.jar
References
http://blog.163.com/yde1208@126/blog/static/958727092012101311253447/ ...

2014-06-16 17:57
Views: 520
Comments (0)
Category: Open Source Software

Mahout: CWSS

Blog category:

Mahout

jcseg
http://www.oschina.net/p/jcseg
http://technology.chtsai.org/mmseg/
scws
http://www.ftphp.com/scws/demo/v48.php
http://www.ftphp.com/scws/docs.php#instscws
http://www.350351.com/bianchengyuyan/PHP/203527.html
cwss
http://code.google.com/p/cwss/downloads/list
http://www. ...

2014-06-13 14:39
Views: 595
Comments (0)
Category: Open Source Software

Mahout: Batch and online clustering

Blog category:

Mahout

Online news clustering: Cluster one million articles, as shown below, and save the cluster centroids for all clusters. Periodically, for each new article, use canopy clustering to assign it to the cluster whose centroid is closest, based on a very small distance threshold. This ensures that ...
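The assignment step for a new article can be sketched in a few lines of plain Python. This is only an illustration, not Mahout's canopy implementation: `assign_to_cluster`, the Euclidean distance measure, the toy centroids, and the threshold value are all assumptions. The excerpt does not say what happens to an article that falls outside the threshold, so the sketch simply leaves it unassigned (returns `None`).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_cluster(vector, centroids, threshold):
    """Return the id of the closest saved centroid if it lies within
    `threshold`, otherwise None (the article stays unassigned here)."""
    best_id, best_dist = None, float("inf")
    for cluster_id, centroid in centroids.items():
        dist = euclidean(vector, centroid)
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id if best_dist <= threshold else None


# Hypothetical centroids saved from the batch clustering run.
centroids = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}

near = assign_to_cluster([0.9, 0.1], centroids, threshold=0.5)
far = assign_to_cluster([5.0, 5.0], centroids, threshold=0.5)
```

Here `near` lands in the "sports" cluster, while `far` is too distant from every saved centroid and is left unassigned.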

2014-06-13 10:47
Views: 493
Comments (0)
Category: Open Source Software

Mahout: Topic modeling using latent Dirichlet allocation (LDA)

Blog category:

Mahout

Introduction: To find these topics in a particular set of documents, we’d modify our clustering code to work with word vectors instead of the document vectors we’ve been using so far. A word vector is nothing but a vector for each word, where the features would be IDs of the other words that occur a ...

2014-06-12 14:46
Views: 1180
Comments (0)
Category: Open Source Software

Mahout: Dirichlet clustering

Blog category:

Mahout

Dirichlet clustering starts with a data set of points and a ModelDistribution. Think of ModelDistribution as a class that generates different models. You create an empty model and try to assign points to it. When this happens, the model crudely grows or shrinks its parameters to try and fit the data ...

2014-06-12 14:08
Views: 644
Comments (0)
Category: Open Source Software

Hadoop: Configuration 1

Blog category:

Hadoop

hadoop-env.sh: JAVA_HOME must be set on the namenode and the secondary namenodes; otherwise start-dfs.sh fails with errors.

2014-06-12 11:45
Views: 381
Comments (0)
Category: Open Source Software

Mahout: Fuzzy k-means clustering

Blog category:

Mahout

As the name says, the fuzzy k-means clustering algorithm does a fuzzy form of k-means clustering. Instead of the exclusive clustering in k-means, fuzzy k-means tries to generate overlapping clusters from the data set. In the academic community, it’s also known as the fuzzy c-means algorithm. You can ...
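To show what "overlapping clusters" means in practice, here is a small sketch of the standard fuzzy c-means membership formula, u_i = 1 / Σ_k (d_i / d_k)^(2/(m-1)), where d_i is the distance from a point to centroid i and m > 1 is the fuzziness factor. This is a plain-Python illustration, not Mahout's implementation; the function name `memberships`, the default m = 2, and the handling of zero distances are assumptions.

```python
def memberships(distances, m=2.0):
    """Degree of membership of one point in each cluster, given its
    distance to every centroid. m > 1 is the fuzziness factor."""
    # A point sitting exactly on one or more centroids belongs fully to them.
    zeros = sum(1 for d in distances if d == 0)
    if zeros:
        return [1.0 / zeros if d == 0 else 0.0 for d in distances]

    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# A point three times closer to the first centroid than to the second.
u = memberships([1.0, 3.0])
```

With m = 2 the point gets a membership of about 0.9 in the first cluster and about 0.1 in the second, so it belongs to both clusters to different degrees instead of exclusively to one. The memberships always sum to 1.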

2014-06-12 11:18
Views: 1147
Comments (0)
Category: Open Source Software
