
Mahout: Introduction to clustering

 

Clustering a collection involves three things:

  • An algorithm
  • A notion of both similarity and dissimilarity
  • A stopping condition



 

Measuring the similarity of items

 

The most important issue in clustering is finding a function that quantifies the similarity between any two data points as a number.
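
In Mahout this function is usually one of the DistanceMeasure implementations: it takes two Vectors and returns a single double, where a smaller value means the points are more similar. A minimal sketch (the class name DistanceDemo is just for illustration):

package mia.clustering.ch07;

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceDemo {
	public static void main(String[] args) {
		// Two 2-D data points; in practice these could be TF-IDF vectors of documents.
		Vector p1 = new DenseVector(new double[] { 1, 2 });
		Vector p2 = new DenseVector(new double[] { 2, 2 });

		// Euclidean distance: sqrt((1 - 2)^2 + (2 - 2)^2) = 1.0
		double d = new EuclideanDistanceMeasure().distance(p1, p2);
		System.out.println("distance = " + d);
	}
}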

Euclidean distance

The most familiar choice: the straight-line distance between two points, computed as the square root of the sum of the squared differences of their coordinates.

TF-IDF

For text, documents are typically turned into numeric vectors using term frequency times inverse document frequency (TF-IDF) weighting, after which any of the distance measures described below can be applied.

 

Hello World: running a simple clustering example

There are three steps involved in preparing input data for the Mahout clustering algorithms:

  1. Preprocess the raw data.
  2. Convert the preprocessed data into vectors.
  3. Save the vectors in SequenceFile format as input for the algorithm.

The listing below creates vectors from a small array of nine two-dimensional points, writes them to a SequenceFile, seeds two initial clusters, runs k-means with KMeansDriver, and prints the resulting cluster assignments.

package mia.clustering.ch07;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.Kluster;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SimpleKMeansClustering {
	public static final double[][] points = { { 1, 1 }, { 2, 1 }, { 1, 2 },
			{ 2, 2 }, { 3, 3 }, { 8, 8 }, { 9, 8 }, { 8, 9 }, { 9, 9 } };

	public static void writePointsToFile(List<Vector> points, String fileName,
			FileSystem fs, Configuration conf) throws IOException {
		Path path = new Path(fileName);
		SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
				LongWritable.class, VectorWritable.class);
		long recNum = 0;
		VectorWritable vec = new VectorWritable();
		for (Vector point : points) {
			vec.set(point);
			writer.append(new LongWritable(recNum++), vec);
		}
		writer.close();
	}

	public static List<Vector> getPoints(double[][] raw) {
		List<Vector> points = new ArrayList<Vector>();
		for (int i = 0; i < raw.length; i++) {
			double[] fr = raw[i];
			Vector vec = new RandomAccessSparseVector(fr.length);
			vec.assign(fr);
			points.add(vec);
		}
		return points;
	}

	public static void main(String args[]) throws Exception {

		int k = 2;

		List<Vector> vectors = getPoints(points);

		File testData = new File("/home/zhaohj/hadoop/testdata/mahout/testdata");
		if (!testData.exists()) {
			testData.mkdirs();
		}
		testData = new File("/home/zhaohj/hadoop/testdata/mahout/testdata/points");
		if (!testData.exists()) {
			testData.mkdirs();
		}

		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
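		// Write the input vectors to a SequenceFile of (LongWritable, VectorWritable)
		// records, the on-disk input format the clustering job reads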
		writePointsToFile(vectors, "/home/zhaohj/hadoop/testdata/mahout/testdata/points/file1", fs, conf);

		Path path = new Path("/home/zhaohj/hadoop/testdata/mahout/testdata/clusters/part-00000");
		SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
				Text.class, Kluster.class);
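		// Seed k-means by writing the first k input points as the initial cluster centers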

		for (int i = 0; i < k; i++) {
			Vector vec = vectors.get(i);
			Kluster cluster = new Kluster(vec, i,
					new EuclideanDistanceMeasure());
			writer.append(new Text(cluster.getIdentifier()), cluster);
		}
		writer.close();

		// Run k-means: convergence delta 0.2, at most 30 iterations, classify the points
		// against the final clusters (runClustering = true) with a cluster classification
		// threshold of 0.001, and run as a MapReduce job rather than sequentially. The
		// distance measure (Euclidean here) is taken from the seed Kluster objects above.

		KMeansDriver.run(conf, 
				new Path("/home/zhaohj/hadoop/testdata/mahout/testdata/points"), 
				new Path("/home/zhaohj/hadoop/testdata/mahout/testdata/clusters"), 
				new Path("/home/zhaohj/hadoop/testdata/mahout/output"), 
				0.2, 
				30, 
				true,
				0.001, 
				false);
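
		// Read the clustered points back from the output directory and print each
		// vector together with the id of the cluster it was assigned to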

		SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(
				"/home/zhaohj/hadoop/testdata/mahout/output/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"),
				conf);

		IntWritable key = new IntWritable();
		WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
		while (reader.next(key, value)) {
			System.out.println(value.toString() + " belongs to cluster "
					+ key.toString());
		}
		reader.close();
	}

}
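
With k = 2 and this data set, the five points near (1, 1) to (3, 3) and the four points near (8, 8) to (9, 9) should end up in separate clusters; the loop at the end prints each clustered point followed by "belongs to cluster" and the cluster id.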

Exploring distance measures
 

Euclidean distance measure

The square root of the sum of the squared differences between the coordinates of the two points; this is the measure used in the k-means example above.

 
Squared Euclidean distance measure

The same as Euclidean distance but without the final square root, which makes it cheaper to compute while preserving the relative ordering of distances.


Manhattan distance measure

The sum of the absolute differences of the coordinates, i.e. the distance you would cover moving along a grid of city blocks.



Cosine distance measure

One minus the cosine of the angle between the two vectors. Note that this measure doesn't account for the lengths of the two vectors; all that matters is whether the points lie in the same direction from the origin.


Tanimoto distance measure/Jaccard’s distance measure

Unlike the cosine distance, the Tanimoto (Jaccard) distance takes both the angle between the vectors and their relative lengths into account.


Weighted distance measure

Mahout also provides a WeightedDistanceMeasure class, along with Euclidean and Manhattan implementations that use it. A weighted distance measure is an advanced feature that lets you assign a weight to each dimension, increasing or decreasing that dimension's influence on the computed distance. The weights for a WeightedDistanceMeasure need to be serialized to a file in Mahout's Vector format.
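
To get a feel for how these measures behave, the sketch below computes each of them for the same pair of points (the class name DistanceMeasureComparison is just for illustration; it assumes the distance classes in org.apache.mahout.common.distance used earlier in this post). The weighted measures are left out because they first need a weights Vector to be configured.

package mia.clustering.ch07;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;
import org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure;
import org.apache.mahout.common.distance.TanimotoDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceMeasureComparison {
	public static void main(String[] args) {
		Vector a = new DenseVector(new double[] { 1, 1 });
		Vector b = new DenseVector(new double[] { 3, 3 });

		DistanceMeasure[] measures = {
				new EuclideanDistanceMeasure(),        // sqrt(8), about 2.83
				new SquaredEuclideanDistanceMeasure(), // 8.0
				new ManhattanDistanceMeasure(),        // 4.0
				new CosineDistanceMeasure(),           // 0.0, same direction from the origin
				new TanimotoDistanceMeasure() };       // about 0.57, lengths differ

		for (DistanceMeasure measure : measures) {
			System.out.println(measure.getClass().getSimpleName() + ": "
					+ measure.distance(a, b));
		}
	}
}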

 

 

 

 

 


 
 
 

 

 

 

 
