Article list
IndexWriter  IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn’t let you read or search it. ...
Components for indexing  ACQUIRE CONTENT   The first step, at the bottom of figure 1.4, is to acquire content. This process, which involves using a crawler or spider, gathers and scopes the content that needs to be indexed. That may be trivial, for example, if you’re indexing a set of XML files ...
  http://www.cnblogs.com/zhangchaoyang/articles/2647905.html http://blog.pureisle.net/archives/1618.html http://www.csdn.net/article/2014-01-01/2817984-13-tools-let-hadoop-fly http://blog.mortardata.com/post/82002488484/hadoop-weekly-april-7-2014       https://www.mapr.com/company/press-rel ...

Hadoop: Data Join

Reduce-side joining / repartitioned sort-merge join   Note: DataJoinReducerBase, on the other hand, is the workhorse of the datajoin package, and it simplifies our programming by performing a full outer join for us. Our reducer subclass only has to implement the combine() method to filter out u ...
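The division of labor described in this excerpt can be illustrated with a small in-memory sketch. This is not the Hadoop datajoin API: `reduce_side_join`, `inner_combine`, and the sample records are hypothetical names made up for illustration, and a missing source is represented here by `None` as a simplification. The framework part groups tagged records by key and builds the full outer join; the user-supplied `combine` function decides which combinations to keep.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(tagged_records, sources, combine):
    """Simulate a reduce-side join over (tag, key, value) records.

    The "shuffle" groups records by join key; the "reducer" then forms
    every cross-source combination per key (a full outer join) and hands
    each combination to combine(), which may return None to drop it.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for tag, key, value in tagged_records:
        groups[key][tag].append(value)

    results = []
    for key in sorted(groups):
        by_tag = groups[key]
        # A source with no record for this key contributes a single None.
        columns = [by_tag.get(tag) or [None] for tag in sources]
        for values in product(*columns):
            joined = combine(key, dict(zip(sources, values)))
            if joined is not None:
                results.append(joined)
    return results


def inner_combine(key, values):
    """Keep only combinations where every source has a record (inner join)."""
    if any(v is None for v in values.values()):
        return None
    return (key, values["customers"], values["orders"])


records = [
    ("customers", 1, "Alice"),
    ("customers", 2, "Bob"),
    ("orders", 1, "book"),
    ("orders", 1, "pen"),
    ("orders", 3, "lamp"),
]
joined = reduce_side_join(records, ["customers", "orders"], inner_combine)
print(joined)
```

Swapping in a `combine` that never returns `None` would yield the unfiltered full outer join, including the unmatched records for keys 2 and 3.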

Mahout: CVB

When running cvb, an error occurs: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable. Solution: the new LDA requires SequenceFile<IntWritable, VectorWritable> as input (the same disk format as DistributedRowMatrix), which you can get out of SequenceFile<Text, V ...

Regular Expressions

    Blog category:
  • Java
                  References http://www.cnblogs.com/elleniou/archive/2012/07/31/2617312.html

HDFS: API Introduction

                  References http://blog.csdn.net/lastsweetop/article/details/9001467  
Google global sites URL https://github.com/justjavac/Google-IPs   JCSEG http://www.oschina.net/p/jcseg MMSEG http://technology.chtsai.org/mmseg/   //convert a Maven project to an Eclipse project #mvn eclipse:eclipse -DskipTests   //transfer text docs to seq docs #mahout seqdirectory -c UTF-8  ...

jar commands

    Blog category:
  • Java
//list files in jar jar tf xx.jar         //update file in jar jar uvf xx.jar newfile jar uvf xx.jar org.google.newclassfile  // newclassfile in ./org/google/   //extract jar jar xf xx.jar               References http://blog.163.com/yde1208@126/blog/static/958727092012101311253447/ ...

Mahout: CWSS

jcseg http://www.oschina.net/p/jcseg http://technology.chtsai.org/mmseg/       scws http://www.ftphp.com/scws/demo/v48.php http://www.ftphp.com/scws/docs.php#instscws http://www.350351.com/bianchengyuyan/PHP/203527.html     cwss http://code.google.com/p/cwss/downloads/list http://www. ...
Online news clustering Cluster one million articles, as shown below, and save the cluster centroids for all clusters.   Periodically, for each new article, use canopy clustering to assign it to the cluster whose centroid is closest, based on a very small distance threshold. This ensures that ...
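The assignment step for a new article can be sketched as a nearest-centroid lookup with a distance cutoff. This is a plain-Python illustration, not Mahout's canopy implementation: `assign_to_cluster`, the two-dimensional vectors, the Euclidean distance, and the threshold value are all assumptions. What to do with an article that falls outside every threshold is left open here, since the excerpt is truncated at that point.

```python
import math


def assign_to_cluster(vector, centroids, threshold):
    """Return the id of the closest centroid, or None if none is near enough.

    centroids maps a cluster id to its saved centroid vector.
    """
    best_id, best_dist = None, math.inf
    for cluster_id, centroid in centroids.items():
        d = math.dist(vector, centroid)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_id, best_dist = cluster_id, d
    # Only accept the match when it lies within the small distance threshold.
    if best_dist <= threshold:
        return best_id
    return None


# Hypothetical saved centroids from the initial batch clustering.
centroids = {"sports": (1.0, 0.0), "politics": (0.0, 1.0)}
near = assign_to_cluster((0.9, 0.1), centroids, threshold=0.2)
far = assign_to_cluster((0.5, 0.5), centroids, threshold=0.2)
print(near, far)
```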
Introduction To find these topics in a particular set of documents, we’d modify our clustering code to work with word vectors instead of the document vectors we’ve been using so far. A word vector is nothing but a vector for each word, where the features would be IDs of the other words that occur a ...
Dirichlet clustering starts with a data set of points and a ModelDistribution. Think of ModelDistribution as a class that generates different models. You create an empty model and try to assign points to it. When this happens, the model crudely grows or shrinks its parameters to try and fit the data ...
hadoop-env.sh: JAVA_HOME must be set on the namenode and the secondary namenodes; otherwise start-dfs.sh fails with errors.
As the name says, the fuzzy k-means clustering algorithm does a fuzzy form of k-means clustering. Instead of the exclusive clustering in k-means, fuzzy k-means tries to generate overlapping clusters from the data set. In the academic community, it’s also known as the fuzzy c-means algorithm. You can ...
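To make the idea of overlapping clusters concrete, here is a minimal sketch of the membership computation commonly used in fuzzy c-means, where a point's degree of membership in cluster i is 1 / sum_k (d_i / d_k)^(2/(m-1)). It is not Mahout's implementation: the function name `memberships`, the Euclidean distance, and the fuzziness value m = 2 are assumptions chosen for illustration.

```python
import math


def memberships(point, centroids, m=2.0):
    """Return the point's degree of membership in each cluster (sums to 1).

    m > 1 is the fuzziness factor; values closer to 1 give harder clusters.
    """
    dists = [math.dist(point, c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]


# The point is twice as far from the second centroid as from the first,
# so it belongs mostly, but not exclusively, to the first cluster.
u = memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)])
print(u)
```

In plain k-means the same point would be assigned only to the first cluster; here it keeps a smaller, nonzero share in the second.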