Reference: http://mylazycoding.blogspot.com/2012/03/cluster-apache-solr-data-using-apache_13.html
Minimum Requirement:
- Basic understanding of Apache Solr and Apache Mahout
- Understanding of K-Means clustering
- Apache Solr and Apache Mahout up and running on your system
Step 1: Index sample data into Solr
Before indexing sample data into Solr, make sure the fields you want to cluster are configured in schema.xml:
<field name="field_name" type="text" indexed="true" stored="true" termVectors="true" />
- Add termVectors="true" to every field that should be clustered
- Index some sample documents into Solr
Step 2: Convert the Lucene index into Mahout vectors
mahout lucene.vector --dir <PATH OF INDEXES> --output <OUTPUT VECTOR PATH> --field <field_name> --idField id --dictOut <OUTPUT DICTIONARY PATH> --norm 2
Step 3: Run K-Means clustering
mahout kmeans -i <OUTPUT VECTOR PATH> -c <PATH TO CLUSTER CENTROIDS> -o <PATH TO OUTPUT CLUSTERS> -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering
Here:
- k: number of clusters (the value of K in K-Means clustering)
- x: maximum number of iterations
- o: path to the output clusters
- ow: overwrite the output directory
- dm: class name of the distance measure
Step 4: Dump the clusters
mahout clusterdump -s <PATH TO OUTPUT CLUSTERS> -d <OUTPUT DICTIONARY PATH> -dt text -n 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir <PATH OF OUTPUT CLUSTERED POINTS> --output <PATH OF OUTPUT DIR>
Here:
- s: directory containing the clusters
- d: path of the dictionary from step #2
- dt: format of the dictionary file
- n: number of top terms
- output: path of the generated cluster dump
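To make the kmeans options above more concrete, here is a minimal Python sketch of K-Means with a cosine distance. It is not Mahout's implementation; the function names and the tiny dense vectors are illustrative assumptions. The parameters k and max_iter play the roles of -k and -x, and cosine_distance stands in for the class passed to -dm.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(points, k, max_iter, seed=0):
    """Cluster dense vectors; returns (assignment per point, centroids)."""
    rng = random.Random(seed)
    # Start from k randomly chosen input points (an assumption of this sketch).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged before reaching max_iter
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignment, centroids = kmeans(points, k=2, max_iter=10)
```

With these four points, the two vectors pointing along the first axis end up in one cluster and the two pointing along the second axis in the other.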
Mahout Vectors from Lucene Term Vectors
For Mahout to create vectors from a Lucene index, the index must first contain term vectors. A term vector is a document-centric view of the terms and their frequencies (as opposed to the inverted index, which is a term-centric view), and Lucene does not store it by default.
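The difference between the two views can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores anything; the whitespace tokenizer, the function names and the two sample documents are assumptions.

```python
from collections import Counter, defaultdict


def term_vectors(docs):
    """Document-centric view: doc id -> {term: frequency}."""
    return {
        doc_id: dict(Counter(text.lower().split()))
        for doc_id, text in docs.items()
    }


def inverted_index(docs):
    """Term-centric view: term -> {doc id: frequency}."""
    index = defaultdict(dict)
    for doc_id, vector in term_vectors(docs).items():
        for term, freq in vector.items():
            index[term][doc_id] = freq
    return dict(index)


docs = {"d1": "solr indexes text", "d2": "mahout clusters text text"}
vectors = term_vectors(docs)    # what Mahout needs, one entry per document
index = inverted_index(docs)    # what search needs, one entry per term
```

Both structures hold the same counts; only the lookup key differs, which is why the term-vector view has to be stored explicitly before Mahout can read one vector per document.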
For this example, I’m going to use Solr’s example, located in <Solr Home>/example
In Solr, storing term vectors is as simple as setting termVectors="true" on the field in the schema, as in:
<field name="text" type="text" indexed="true" stored="true" termVectors="true"/>
For pure Lucene, you will need to set the TermVector option during Field creation, as in:
Field fld = new Field("text", "foo", Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES);
From here, it’s as simple as pointing Mahout’s new shell script (try running <MAHOUT HOME>/bin/mahout for a full listing of its capabilities) at the index and letting it rip:
<MAHOUT HOME>/bin/mahout lucene.vector --dir <PATH TO INDEX>/example/solr/data/index/ --output /tmp/foo/part-out.vec --field title-clustering --idField id --dictOut /tmp/foo/dict.out --norm 2
A few things to note about this command:
- This outputs a single vector file, named part-out.vec, to the /tmp/foo directory
- It uses the title-clustering field. If you want a combination of fields, then you will have to create a single “merged” field containing those fields. Solr’s <copyField> syntax can make this easy.
- The idField is used to provide a label to the Mahout vector such that the output from Mahout’s algorithms can be traced back to the actual documents.
- The --dictOut option outputs the list of terms that are represented in the Mahout vectors. Mahout uses an internal, sparse vector representation for text documents (dense vector representations are also available), so this file contains the “key” for making sense of the vectors later. As an aside, if you ever have problems with Mahout, you can often share your vectors with the list and simply keep the dictionary to yourself, since it would be pretty difficult (not sure if it is impossible) to reverse engineer just the vectors.
- The --norm option tells Mahout how to normalize the vector. For many Mahout applications, normalization is a necessary process for obtaining good results. In this case, I am using the Euclidean length (aka the 2-norm) to normalize the vector because I intend to cluster the documents using the Euclidean distance similarity. Other approaches may require other norms.
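As a rough illustration of what normalizing by a norm means, the following Python sketch divides a sparse vector by its p-norm. It is not Mahout's code: the dict-based vector, the function name and the restriction to p >= 1 or infinity are assumptions made for this sketch.

```python
import math


def normalize(vector, norm=2):
    """Divide a sparse vector {index: weight} by its p-norm.

    norm may be any number >= 1 or math.inf (the maximum absolute weight).
    """
    if not vector:
        return {}
    if norm == math.inf:
        length = max(abs(w) for w in vector.values())
    else:
        length = sum(abs(w) ** norm for w in vector.values()) ** (1.0 / norm)
    if length == 0:
        return dict(vector)  # an all-zero vector cannot be scaled
    return {i: w / length for i, w in vector.items()}


doc = {0: 3.0, 5: 4.0}               # term index -> weight
unit = normalize(doc, 2)             # Euclidean length becomes 1
by_max = normalize(doc, math.inf)    # largest weight becomes 1
```

After 2-norm normalization every document has length 1, so long and short documents with the same term proportions end up at the same point, which is usually what a distance-based clustering wants.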
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
Creating Vectors from Text
Introduction
For clustering documents it is usually necessary to convert the raw text into vectors that can then be consumed by the clustering Algorithms. These approaches are described below.
From Lucene
NOTE: Your Lucene index must be created with the same version of Lucene that Mahout uses. Check Mahout's POM file for the version number; otherwise you will likely get an error such as: Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
Mahout has utilities that allow one to easily produce Mahout Vector representations from a Lucene (and Solr, since they are the same) index.
For this, we assume you know how to build a Lucene/Solr index. For those who don't, it is probably easiest to get up and running using Solr as it can ingest things like PDFs, XML, Office, etc. and create a Lucene index. For those wanting to use just Lucene, see the Lucene website or check out Lucene In Action by Erik Hatcher, Otis Gospodnetic and Mike McCandless.
To get started, make sure you get a fresh copy of Mahout from SVN and are comfortable building it. Mahout defines interfaces and implementations for efficiently iterating over a data source (it only supports Lucene currently, but should be extensible to databases, Solr, etc.) and produces a Mahout Vector file and a term dictionary, which can then be used for clustering. The main code for driving this is the Driver program located in the org.apache.mahout.utils.vectors package. The Driver program offers several input options, which can be displayed by specifying the --help option. Examples of running the Driver are included below:
Generating an output file from a Lucene Index
$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH TO DIRECTORY CONTAINING LUCENE INDEX> \
    --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> \
    <--max <Number of vectors to output>> <--norm {INF|integer >= 0}> <--idField <Name of the idField in the Lucene index>>
Create 50 Vectors from an Index
$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body \
    --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt --max 50
This uses the index specified by --dir and the body field in it and writes out the info to the output dir and the dictionary to dict.txt. It only outputs 50 vectors. If you don't specify --max, then all the documents in the index are output.
Normalize 50 Vectors from a Lucene Index using the L_2 Norm
$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body \
    --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt --max 50 --norm 2
From Directory of Text documents
Mahout has utilities to generate Vectors from a directory of text documents. Before creating the vectors, you need to convert the documents to SequenceFile format. SequenceFile is a Hadoop class which allows us to write arbitrary key/value pairs into it. The DocumentVectorizer requires the key to be a Text with a unique document id, and the value to be the Text content in UTF-8 format.
You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.
Converting directory of documents to SequenceFile format
Mahout has a nifty utility which reads a directory path, including its sub-directories, and creates the SequenceFile in a chunked manner for us. The document id generated is <PREFIX><RELATIVE PATH FROM PARENT>/document.txt
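The id scheme can be pictured with a short Python sketch that walks a directory tree and pairs each file with an id built from a prefix and the file's path relative to the parent directory. This only mimics the naming idea; it does not write a SequenceFile, and the function names and the "/" separator between prefix and relative path are assumptions.

```python
import os


def document_id(prefix, parent_dir, path):
    """Build an id from the prefix and the path relative to parent_dir."""
    relative = os.path.relpath(path, parent_dir).replace(os.sep, "/")
    return prefix + "/" + relative


def read_documents(parent_dir, prefix="", charset="utf-8"):
    """Yield (document id, text) for every file under parent_dir, recursively."""
    for root, _dirs, files in os.walk(parent_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, encoding=charset) as handle:
                yield document_id(prefix, parent_dir, path), handle.read()
```

The (id, text) pairs produced by read_documents correspond to the key and value that the SequenceFile holds for each document.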
From the examples directory run
$MAHOUT_HOME/bin/mahout seqdirectory \
    --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
    <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
    <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
    <-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
Creating Vectors from SequenceFile
Mahout_0.3
From the sequence file generated from the above step run the following to generate vectors.
$MAHOUT_HOME/bin/mahout seq2sparse \
    -i <PATH TO THE SEQUENCEFILES> -o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED> \
    <-wt <WEIGHTING METHOD USED> {tf|tfidf}> \
    <-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> \
    <-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT> org.apache.lucene.analysis.standard.StandardAnalyzer> \
    <--minSupport <MINIMUM SUPPORT> 2> \
    <--minDF <MINIMUM DOCUMENT FREQUENCY> 1> \
    <--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> \
    <--norm <REFER TO L_2 NORM ABOVE>{INF|integer >= 0}> \
    <-seq <Create SequentialAccessVectors>{false|true required for running some algorithms(LDA,Lanczos)}>
--minSupport is the minimum frequency a word must have to be considered as a feature. --minDF is the minimum number of documents the word must appear in.
--maxDFPercent is the maximum value of the expression (document frequency of a word / total number of documents), given as a percentage, for the word to be kept as a feature. This helps remove high-frequency features such as stop words.
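How the three thresholds interact can be sketched in Python as follows. This is a simplified illustration rather than Mahout's implementation: the function name is hypothetical, documents are assumed to be pre-tokenized lists, and the minimum support is assumed to be counted over the whole collection.

```python
from collections import Counter


def select_features(docs, min_support=2, min_df=1, max_df_percent=99):
    """Return the sorted terms that pass all three thresholds.

    docs is a list of token lists, one list per document.
    """
    total = Counter()  # occurrences of each term across the collection
    df = Counter()     # number of documents containing each term
    for tokens in docs:
        total.update(tokens)
        df.update(set(tokens))
    n = len(docs)
    return sorted(
        term for term in total
        if total[term] >= min_support
        and df[term] >= min_df
        and 100.0 * df[term] / n <= max_df_percent
    )


docs = [
    ["the", "solr", "index"],
    ["the", "mahout", "vector", "vector"],
    ["the", "solr", "vector"],
]
features = select_features(docs)
```

In this toy collection "the" appears in every document (100 percent, above the 99 percent cap) and is dropped like a stop word, while "index" and "mahout" occur only once and fail the minimum support.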
Background
- http://www.lucidimagination.com/search/document/3d8310376b6cdf6b/centroid_calculations_with_sparse_vectors#86a54dae9052d68c
- http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
From a Database
TODO:
Other
Converting existing vectors to Mahout's format
If you are in the happy position to already own a document (as in: texts, images or whatever item you wish to treat) processing pipeline, the question arises of how to convert the vectors into the Mahout vector format. Probably the easiest way to go would be to implement your own Iterable<Vector> (called VectorIterable in the example below) and then reuse the existing VectorWriter classes:
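// Assumes Mahout's SequenceFileVectorWriter (org.apache.mahout.utils.vectors.io); the value class may differ by Mahout version.
SequenceFile.Writer seqWriter = SequenceFile.createWriter(filesystem, configuration, outfile, LongWritable.class, VectorWritable.class);
VectorWriter vectorWriter = new SequenceFileVectorWriter(seqWriter);
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
vectorWriter.close();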