Article list
IndexWriter
IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn’t let you read or search it. ...
Components for indexing
ACQUIRE CONTENT
The first step, at the bottom of figure 1.4, is to acquire content. This process, which involves using a crawler or spider, gathers and scopes the content that needs to be indexed. That may be trivial, for example, if you’re indexing a set of XML files ...
http://www.cnblogs.com/zhangchaoyang/articles/2647905.html
http://blog.pureisle.net/archives/1618.html
http://www.csdn.net/article/2014-01-01/2817984-13-tools-let-hadoop-fly
http://blog.mortardata.com/post/82002488484/hadoop-weekly-april-7-2014
https://www.mapr.com/company/press-rel ...
Hadoop: Data Join
- Blog category:
- Hadoop
Reduce-side joining / repartitioned sort-merge join
Note: DataJoinReducerBase, on the other hand, is the workhorse of the datajoin package, and it simplifies our programming by performing a full outer join for us. Our reducer subclass only has to implement the combine() method to filter out u ...
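The division of labour described in the note can be illustrated with a minimal, framework-free sketch. It is not the Hadoop datajoin API: `reduce_side_join`, `inner_join_combine`, and the sample records are hypothetical names and data chosen for illustration. The sketch assumes the framework groups tagged records by join key, builds the cross product across sources, and hands each combination to a user-supplied combine function, which returns None to drop it.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(tagged_records, combine):
    """Join (key, source_tag, value) records; combine() decides what to keep."""
    # Stand-in for the shuffle: group values by join key, then by source tag.
    groups = defaultdict(lambda: defaultdict(list))
    for key, tag, value in tagged_records:
        groups[key][tag].append(value)

    joined = []
    for key in sorted(groups):
        by_tag = groups[key]
        tags = sorted(by_tag)
        # Cross product over the sources that actually have this key
        # (full outer join behaviour: keys missing from a source still appear).
        for values in product(*(by_tag[tag] for tag in tags)):
            result = combine(tags, list(values))
            if result is not None:
                joined.append((key, result))
    return joined


def inner_join_combine(tags, values):
    """Hypothetical combine(): keep only combinations covering both sources."""
    if len(tags) < 2:
        return None
    return ",".join(values)


records = [
    ("1", "customers", "Alice"),
    ("2", "customers", "Bob"),
    ("1", "orders", "book"),
    ("1", "orders", "pen"),
    ("3", "orders", "lamp"),
]
result = reduce_side_join(records, inner_join_combine)
print(result)
```

Swapping in a combine function that never returns None would keep the unmatched keys "2" and "3" as well, which corresponds to the full outer join.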
Mahout: CVB
- Blog category:
- Mahout
Running cvb fails with the following error:
org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
Solution:
The new LDA requires SequenceFile<IntWritable, VectorWritable> as input (the same disk format as DistributedRowMatrix), which you can get out of SequenceFile<Text, V ...
References
http://www.cnblogs.com/elleniou/archive/2012/07/31/2617312.html
HDFS: API Introduction
- Blog category:
- Hadoop
References
http://blog.csdn.net/lastsweetop/article/details/9001467
Google global sites url
https://github.com/justjavac/Google-IPs
JCSEG
http://www.oschina.net/p/jcseg
MMSEG
http://technology.chtsai.org/mmseg/
//convert a Maven project to an Eclipse project
#mvn eclipse:eclipse -DskipTests
//convert text docs to sequence files
#mahout seqdirectory -c UTF-8 ...
jar commands
- Blog category:
- Java
//list files in jar
jar tf xx.jar
//update file in jar
jar uvf xx.jar newfile
jar uvf xx.jar org/google/newclassfile // newclassfile is in ./org/google/; jar takes a file path, not a dotted package name
//extract jar
jar xf xx.jar
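Because a JAR file uses the ZIP format, the listing step can also be done from a script. The following is a small sketch using Python's standard `zipfile` module; `list_jar_entries` and the throwaway `demo.jar` are hypothetical names used only so the example runs without a real JAR.

```python
import os
import tempfile
import zipfile


def list_jar_entries(jar_path):
    """Return the entry names stored in a JAR, similar to `jar tf`."""
    with zipfile.ZipFile(jar_path) as jar:
        return jar.namelist()


# Build a tiny throwaway archive so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jar")
with zipfile.ZipFile(demo_path, "w") as archive:
    archive.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    archive.writestr("org/google/Demo.class", b"")

entries = list_jar_entries(demo_path)
print("\n".join(entries))
```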
References
http://blog.163.com/yde1208@126/blog/static/958727092012101311253447/ ...
Mahout: CWSS
- Blog category:
- Mahout
jcseg
http://www.oschina.net/p/jcseg
http://technology.chtsai.org/mmseg/
scws
http://www.ftphp.com/scws/demo/v48.php
http://www.ftphp.com/scws/docs.php#instscws
http://www.350351.com/bianchengyuyan/PHP/203527.html
cwss
http://code.google.com/p/cwss/downloads/list
http://www. ...
Online news clustering
Cluster one million articles, as shown below, and save the cluster centroids for all clusters.
Periodically, for each new article, use canopy clustering to assign it to the cluster whose centroid is closest, based on a very small distance threshold. This ensures that ...
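The assignment step can be sketched as a plain nearest-centroid lookup with a distance cutoff. This is a simplification and not Mahout's canopy implementation; `assign_to_cluster`, the two-dimensional vectors, and the threshold value are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_cluster(article_vector, centroids, threshold):
    """Return the id of the closest centroid, or None if it is too far away."""
    best_id, best_distance = None, float("inf")
    for cluster_id, centroid in centroids.items():
        distance = euclidean(article_vector, centroid)
        if distance < best_distance:
            best_id, best_distance = cluster_id, distance
    # A small threshold keeps articles that fit no saved cluster unassigned.
    return best_id if best_distance <= threshold else None


# Hypothetical saved centroids from the initial batch clustering.
centroids = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}
near = assign_to_cluster([0.9, 0.1], centroids, threshold=0.5)
far = assign_to_cluster([5.0, 5.0], centroids, threshold=0.5)
print(near, far)
```

An article that comes back as None could be held aside, for example as a candidate for a new cluster in the next full clustering run.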
Introduction
To find these topics in a particular set of documents, we’d modify our clustering code to work with word vectors instead of the document vectors we’ve been using so far. A word vector is nothing but a vector for each word, where the features would be IDs of the other words that occur a ...
Dirichlet clustering starts with a data set of points and a ModelDistribution. Think of ModelDistribution as a class that generates different models. You create an empty model and try to assign points to it. When this happens, the model crudely grows or shrinks its parameters to try and fit the data ...
hadoop-env.sh
JAVA_HOME must be set on the namenode and the secondary namenodes; otherwise start-dfs.sh fails with errors.
As the name says, the fuzzy k-means clustering algorithm does a fuzzy form of k-means clustering. Instead of the exclusive clustering in k-means, fuzzy k-means tries to generate overlapping clusters from the data set. In the academic community, it’s also known as the fuzzy c-means algorithm. You can ...
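To make the idea of overlapping clusters concrete, here is a minimal sketch of the standard fuzzy c-means membership formula, in which each point receives a degree of membership in every cluster and the degrees sum to 1. It is not Mahout's implementation; `fuzzy_memberships`, the fuzziness value `m=2.0`, and the sample points are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fuzzy_memberships(point, centroids, m=2.0):
    """Degree of membership of `point` in each cluster (m > 1 is the fuzziness)."""
    distances = [euclidean(point, centroid) for centroid in centroids]
    # A point sitting exactly on one or more centroids belongs fully to them.
    zeros = [i for i, d in enumerate(distances) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# The point is at distance 1 from the first centroid and 2 from the second.
memberships = fuzzy_memberships([1.0, 0.0], [[0.0, 0.0], [3.0, 0.0]])
print(memberships)
```

In exclusive k-means the same point would simply be assigned to the first cluster; here it keeps a smaller, nonzero membership in the second one as well.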