jayghost

Converting Solr and Other Data into Mahout Vectors

 

Reference: http://mylazycoding.blogspot.com/2012/03/cluster-apache-solr-data-using-apache_13.html

 

 

Lately I have been working on integrating Apache Mahout algorithms with Apache Solr, and I have been able to integrate Solr with Mahout's classification and clustering algorithms. I will post a series of blogs on this integration. This post guides you through clustering your Solr data using Mahout's K-Means clustering algorithm.

Minimum Requirements:

  • Basic understanding of Apache Solr and Apache Mahout

  • Understanding of K-Means clustering

  • A running installation of Apache Solr and Apache Mahout on your system

Step 1 – Configure Solr & Index Data:

Before indexing sample data into Solr, make sure the fields you want to cluster are configured in schema.xml:

<field name="field_name" type="text" indexed="true" stored="true" termVectors="true" />
  • Add termVectors="true" to every field that should be clustered

  • Index some sample documents into Solr

Step 2 – Convert Lucene Index to Mahout Vectors


mahout lucene.vector --dir <PATH OF INDEXES> --output <OUTPUT VECTOR PATH> --field <field_name> --idField id --dictOut <OUTPUT DICTIONARY PATH> --norm 2


Step 3 – Run K-Means Clustering 

mahout kmeans -i <OUTPUT VECTOR PATH> -c <PATH TO CLUSTER CENTROIDS> -o <PATH TO OUTPUT CLUSTERS> -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering

Here:
  • k: number of clusters/value of K in K-Means clustering

  • x: maximum iterations

  • o: path to output clusters

  • ow: overwrite output directory

  • dm: classname of Distance Measure
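
The -dm option above selects a cosine distance measure. As a rough illustration of what such a measure computes, here is a minimal sketch in plain Java. It is not Mahout's CosineDistanceMeasure source; the class name CosineDistanceSketch and the handling of zero vectors are assumptions made for this example.

```java
// Illustrative sketch only: cosine distance between two dense vectors.
// This is not Mahout's implementation; it only shows the underlying idea.
final class CosineDistanceSketch {

    /** Returns 1 - cos(angle between a and b); values near 0 mean the same direction. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denom = Math.sqrt(normA) * Math.sqrt(normB);
        if (denom == 0.0) {
            // Assumption for this sketch: a zero vector is treated as maximally distant.
            return 1.0;
        }
        return 1.0 - dot / denom;
    }

    public static void main(String[] args) {
        double[] doc1 = {1.0, 2.0, 0.0};
        double[] doc2 = {2.0, 4.0, 0.0}; // same direction as doc1, different length
        double[] doc3 = {0.0, 0.0, 3.0}; // shares no terms with doc1
        System.out.println(distance(doc1, doc2)); // close to 0
        System.out.println(distance(doc1, doc3)); // 1.0
    }
}
```

Because the measure depends only on the angle between vectors, two documents with the same term proportions but different lengths end up close together, which is usually what you want for text.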

Step 4 – Analyze Cluster Output


mahout clusterdump -s <PATH TO OUTPUT CLUSTERS> -d <OUTPUT DICTIONARY PATH> -dt text -n 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir <PATH OF OUTPUT CLUSTERED POINTS> --output <PATH OF OUTPUT DIR>

Here:
  • s: Directory containing clusters

  • d: Path of dictionary from step #2

  • dt: Format of dictionary file

  • n: number of top terms

  • output: Path of generated clusters
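
To make the role of the dictionary and the -n option more concrete, the following sketch shows one way the top terms of a cluster could be derived from a centroid vector and a term dictionary. It is an assumption-laden illustration, not clusterdump's actual code: the class TopTermsSketch, the dictionary, and the centroid weights are all made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: label a cluster with the n heaviest terms of its centroid.
final class TopTermsSketch {

    /** Returns up to n dictionary terms ordered by descending centroid weight. */
    static List<String> topTerms(double[] centroid, String[] dictionary, int n) {
        List<Integer> indexes = new ArrayList<>();
        for (int i = 0; i < centroid.length; i++) {
            if (centroid[i] > 0.0) {
                indexes.add(i); // ignore terms that do not occur in the centroid
            }
        }
        indexes.sort(Comparator.comparingDouble((Integer i) -> centroid[i]).reversed());
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < Math.min(n, indexes.size()); i++) {
            terms.add(dictionary[indexes.get(i)]);
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] dictionary = {"apache", "cluster", "index", "vector"}; // hypothetical dictionary
        double[] centroid = {0.10, 0.70, 0.00, 0.45};                   // hypothetical centroid
        System.out.println(topTerms(centroid, dictionary, 2));          // [cluster, vector]
    }
}
```

The dictionary is what maps a vector position back to a readable term, which is why the dictionary path from step 2 has to be passed to clusterdump.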

Mahout Vectors from Lucene Term Vectors

For Mahout to create vectors from a Lucene index, the index must first contain term vectors. A term vector is a document-centric view of the terms and their frequencies (as opposed to the inverted index, which is a term-centric view) and is not on by default.
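
The difference between the two views can be sketched with plain Java maps. This is only a conceptual illustration using whitespace tokenization; it says nothing about how Lucene stores either structure, and the class name TermViewsSketch and the sample documents are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: plain Java maps standing in for Lucene's structures.
final class TermViewsSketch {

    /** Document-centric view: docId -> (term -> frequency), like a term vector. */
    static Map<String, Map<String, Integer>> termVectors(Map<String, String> docs) {
        Map<String, Map<String, Integer>> vectors = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Map<String, Integer> freqs = new TreeMap<>();
            for (String term : doc.getValue().toLowerCase().split("\\s+")) {
                if (!term.isEmpty()) {
                    freqs.merge(term, 1, Integer::sum);
                }
            }
            vectors.put(doc.getKey(), freqs);
        }
        return vectors;
    }

    /** Term-centric view: term -> list of docIds, like an inverted index. */
    static Map<String, List<String>> invertedIndex(Map<String, Map<String, Integer>> vectors) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : vectors.entrySet()) {
            for (String term : doc.getValue().keySet()) {
                index.computeIfAbsent(term, t -> new ArrayList<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc1", "solr indexes text");
        docs.put("doc2", "mahout clusters text text");
        Map<String, Map<String, Integer>> vectors = termVectors(docs);
        System.out.println(vectors);                // per-document term frequencies
        System.out.println(invertedIndex(vectors)); // per-term document lists
    }
}
```

Clustering needs the first view, because each document has to be turned into one vector of term weights.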

For this example, I’m going to use Solr’s example, located in <Solr Home>/example

In Solr, storing term vectors is as simple as setting termVectors="true" on the field in the schema, as in:

<field name="text" type="text" indexed="true" stored="true" termVectors="true"/>

For pure Lucene, you will need to set the TermVector option on during Field creation, as in:

Field fld = new Field("text", "foo", Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES);

From here, it's as simple as pointing Mahout's new shell script (try running <MAHOUT HOME>/bin/mahout for a full listing of its capabilities) at the index and letting it rip:

<MAHOUT HOME>/bin/mahout lucene.vector --dir <PATH TO INDEX>/example/solr/data/index/ --output /tmp/foo/part-out.vec --field title-clustering --idField id --dictOut /tmp/foo/dict.out --norm 2

A few things to note about this command:

  1. This outputs a single vector file, named part-out.vec, to the /tmp/foo directory
  2. It uses the title-clustering field.  If you want a combination of fields, then you will have to create a single “merged” field containing those fields.  Solr’s <copyField> syntax can make this easy.
  3. The idField is used to provide a label to the Mahout vector such that the output from Mahout’s algorithms can be traced back to the actual documents.
  4. The --dictOut option outputs the list of terms that are represented in the Mahout vectors.  Mahout uses an internal, sparse vector representation for text documents (dense vector representations are also available), so this file contains the "key" for making sense of the vectors later.  As an aside, if you ever have problems with Mahout, you can often share your vectors with the list and simply keep the dictionary to yourself, since it would be pretty difficult (not sure if it is impossible) to reverse engineer just the vectors.
  5. The --norm option tells Mahout how to normalize the vector.  For many Mahout applications, normalization is a necessary step for obtaining good results.  In this case, I am using the Euclidean norm (aka the 2-norm) to normalize the vector because I intend to cluster the documents using Euclidean distance.  Other approaches may require other norms.
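
As a rough illustration of what normalizing by a norm means, here is a minimal sketch of p-norm normalization in plain Java. It is not Mahout's code; the class name NormalizeSketch is hypothetical, and the sketch only handles p > 0 and the infinity norm.

```java
import java.util.Arrays;

// Illustrative sketch only: p-norm normalization of a dense vector.
final class NormalizeSketch {

    /** Divides each component by the vector's p-norm; p = +infinity uses the largest absolute value. */
    static double[] normalize(double[] v, double p) {
        if (!(p > 0.0)) {
            throw new IllegalArgumentException("this sketch only supports p > 0");
        }
        double norm = 0.0;
        if (Double.isInfinite(p)) {
            for (double x : v) {
                norm = Math.max(norm, Math.abs(x));
            }
        } else {
            for (double x : v) {
                norm += Math.pow(Math.abs(x), p);
            }
            norm = Math.pow(norm, 1.0 / p);
        }
        double[] out = new double[v.length];
        if (norm == 0.0) {
            return out; // an all-zero vector stays all-zero
        }
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] termWeights = {3.0, 4.0};
        // 2-norm: the result has Euclidean length 1.
        System.out.println(Arrays.toString(normalize(termWeights, 2)));
        // Infinity norm: the largest component becomes 1.
        System.out.println(Arrays.toString(normalize(termWeights, Double.POSITIVE_INFINITY)));
    }
}
```

After 2-norm normalization every document vector has unit length, so long and short documents with the same term mix become directly comparable.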

https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

Creating Vectors from Text

 

Introduction

For clustering documents it is usually necessary to convert the raw text into vectors that can then be consumed by the clustering Algorithms. These approaches are described below.

From Lucene

NOTE: Your Lucene index must be created with the same version of Lucene used in Mahout. Check Mahout's POM file to get the version number, otherwise you will likely get "Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Unknown format version: -11" as an error.

Mahout has utilities that allow one to easily produce Mahout Vector representations from a Lucene (and Solr, since they are the same) index.

For this, we assume you know how to build a Lucene/Solr index. For those who don't, it is probably easiest to get up and running using Solr as it can ingest things like PDFs, XML, Office, etc. and create a Lucene index. For those wanting to use just Lucene, see the Lucene website or check out Lucene In Action by Erik Hatcher, Otis Gospodnetic and Mike McCandless.

To get started, make sure you get a fresh copy of Mahout from SVN and are comfortable building it. It defines interfaces and implementations for efficiently iterating over a Data Source (it only supports Lucene currently, but should be extensible to databases, Solr, etc.) and produces a Mahout Vector file and term dictionary which can then be used for clustering. The main code for driving this is the Driver program located in the org.apache.mahout.utils.vectors package. The Driver program offers several input options, which can be displayed by specifying the --help option. Examples of running the Driver are included below:

Generating an output file from a Lucene Index

$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH TO DIRECTORY CONTAINING LUCENE INDEX> \
   --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> \
   <--max <Number of vectors to output>> <--norm {INF|integer >= 0}> <--idField <Name of the idField in the Lucene index>>

Create 50 Vectors from an Index

$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body \
    --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt --max 50

This uses the index specified by --dir and the body field in it and writes out the info to the output dir and the dictionary to dict.txt. It only outputs 50 vectors. If you don't specify --max, then all the documents in the index are output.

Normalize 50 Vectors from a Lucene Index using the L_2 Norm

$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body \
      --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt --max 50 --norm 2

From Directory of Text documents

Mahout has utilities to generate Vectors from a directory of text documents. Before creating the vectors, you need to convert the documents to SequenceFile format. SequenceFile is a Hadoop class which allows us to write arbitrary key,value pairs into it. The DocumentVectorizer requires the key to be a Text with a unique document id, and value to be the Text content in UTF-8 format.

You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.

Converting directory of documents to SequenceFile format

Mahout has a nifty utility that reads a directory path, including its sub-directories, and creates the SequenceFile in a chunked manner for us. The document id generated is <PREFIX><RELATIVE PATH FROM PARENT>/document.txt
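
To illustrate how an id of the form <PREFIX><RELATIVE PATH FROM PARENT> relates to a file on disk, here is a small sketch in plain Java. It does not reproduce Mahout's implementation: the class name DocumentIdSketch, the sample paths, and the "/" placed between prefix and relative path are assumptions for this example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch only: building a document id from a prefix and a relative path.
final class DocumentIdSketch {

    /** Builds an id from a prefix and the file's path relative to the parent directory. */
    static String documentId(String prefix, Path parentDir, Path file) {
        Path relative = parentDir.relativize(file);
        // Use forward slashes regardless of platform so ids stay stable across systems.
        String relativeUnix = relative.toString().replace('\\', '/');
        // Assumption: the prefix and the relative path are joined with a single "/".
        return prefix + "/" + relativeUnix;
    }

    public static void main(String[] args) {
        Path parent = Paths.get("docs");                // hypothetical input directory
        Path file = Paths.get("docs", "news", "a.txt"); // hypothetical document
        System.out.println(documentId("corpus", parent, file)); // corpus/news/a.txt
    }
}
```

Keeping the relative path in the id means cluster output can be traced back to the original file, including which sub-directory it came from.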

From the examples directory run

$MAHOUT_HOME/bin/mahout seqdirectory \
--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>

Creating Vectors from SequenceFile

Mahout_0.3

From the sequence file generated from the above step run the following to generate vectors.

$MAHOUT_HOME/bin/mahout seq2sparse \
-i <PATH TO THE SEQUENCEFILES> -o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED> \
<-wt <WEIGHTING METHOD USED> {tf|tfidf}> \
<-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> \
<-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT> org.apache.lucene.analysis.standard.StandardAnalyzer> \
<--minSupport <MINIMUM SUPPORT> 2> \
<--minDF <MINIMUM DOCUMENT FREQUENCY> 1> \
<--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> \
<--norm <REFER TO L_2 NORM ABOVE>{INF|integer >= 0}> \
<-seq <Create SequentialAccessVectors>{false|true required for running some algorithms(LDA,Lanczos)}>

--minSupport is the minimum frequency a word must have to be considered as a feature. --minDF is the minimum number of documents the word must appear in.
--maxDFPercent is the maximum value of the expression (document frequency of a word / total number of documents), given as a percentage, for the word to be considered a good feature. This helps remove high-frequency features such as stop words.
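
The combined effect of these three thresholds can be sketched as a simple vocabulary filter. This is a conceptual illustration in plain Java, not how seq2sparse is implemented (Mahout may apply the thresholds at different stages of its pipeline); the class name FeaturePruningSketch and the sample counts are made up.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only: pruning a vocabulary by total frequency and document frequency.
final class FeaturePruningSketch {

    /**
     * Keeps a term when its total count is at least minSupport, its document frequency
     * is at least minDF, and it occurs in at most maxDFPercent percent of the documents.
     */
    static Set<String> selectFeatures(Map<String, Integer> termCounts,
                                      Map<String, Integer> docFreqs,
                                      int numDocs, int minSupport, int minDF, int maxDFPercent) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            String term = e.getKey();
            int df = docFreqs.getOrDefault(term, 0);
            double dfPercent = 100.0 * df / numDocs;
            if (e.getValue() >= minSupport && df >= minDF && dfPercent <= maxDFPercent) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> termCounts = new TreeMap<>();
        Map<String, Integer> docFreqs = new TreeMap<>();
        termCounts.put("the", 40);    docFreqs.put("the", 10);    // appears in every document
        termCounts.put("cluster", 6); docFreqs.put("cluster", 4);
        termCounts.put("typo", 1);    docFreqs.put("typo", 1);    // seen only once
        // 10 documents, minSupport 2, minDF 1, maxDFPercent 99
        System.out.println(selectFeatures(termCounts, docFreqs, 10, 2, 1, 99)); // [cluster]
    }
}
```

Here "the" is dropped because it occurs in 100% of the documents, and "typo" is dropped because its total count is below the minimum support.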

Background

From a Database

TODO:

Other

Converting existing vectors to Mahout's format

If you are in the happy position to already own a document (as in: texts, images or whatever item you wish to treat) processing pipeline, the question arises of how to convert the vectors into the Mahout vector format. Probably the easiest way to go would be to implement your own Iterable<Vector> (called VectorIterable in the example below) and then reuse the existing VectorWriter classes:

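// SequenceFile.createWriter returns a SequenceFile.Writer, so wrap it in Mahout's SequenceFileVectorWriter (assumes the Mahout 0.3-era API).
SequenceFile.Writer seqWriter = SequenceFile.createWriter(filesystem, configuration, outfile, LongWritable.class, VectorWritable.class);
VectorWriter vectorWriter = new SequenceFileVectorWriter(seqWriter);
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);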