Spark Solr(1)Read Data from SOLR
I ran into a lot of jackson-databind version conflicts between Spark and Solr, so I cloned the spark-solr project and updated a few dependency versions there.
It is originally forked from https://github.com/LucidWorks/spark-solr
>git clone https://github.com/luohuazju/spark-solr
Only a few dependency versions are updated in pom.xml:
<modelVersion>4.0.0</modelVersion>
<groupId>com.lucidworks.spark</groupId>
<artifactId>spark-solr</artifactId>
- <version>3.4.0-SNAPSHOT</version>
+ <version>3.4.0.1</version>
<packaging>jar</packaging>
<name>spark-solr</name>
<description>Tools for reading data from Spark into Solr</description>
@@ -39,11 +39,10 @@
<java.version>1.8</java.version>
<spark.version>2.2.1</spark.version>
<solr.version>7.1.0</solr.version>
- <fasterxml.version>2.4.0</fasterxml.version>
+ <fasterxml.version>2.6.7</fasterxml.version>
<scala.version>2.11.8</scala.version>
<scala.binary.version>2.11</scala.binary.version>
<scoverage.plugin.version>1.1.1</scoverage.plugin.version>
- <fasterxml.version>2.4.0</fasterxml.version>
<MaxPermSize>128m</MaxPermSize>
</properties>
<repositories>
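Before (or instead of) patching, it can help to confirm which jackson versions Spark and spark-solr actually pull in. A quick check with the standard Maven dependency plugin, filtered to the jackson groupId (the exact conflicting artifacts may differ in your build):
>mvn dependency:tree -Dincludes=com.fasterxml.jackson.core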
Command to build the package:
>mvn clean compile install -DskipTests
After the build, I get a driver library installed locally with version <spark.solr.version>3.4.0.1</spark.solr.version>.
Set Up SOLR Spark Task
pom.xml to declare the dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.sillycat</groupId>
<artifactId>sillycat-spark-solr</artifactId>
<version>1.0</version>
<description>Fetch the Events from Kafka</description>
<name>Spark Streaming System</name>
<packaging>jar</packaging>
<properties>
<spark.version>2.2.1</spark.version>
<spark.solr.version>3.4.0.1</spark.solr.version>
</properties>
<dependencies>
<!-- spark framework -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- spark SOLR -->
<dependency>
<groupId>com.lucidworks.spark</groupId>
<artifactId>spark-solr</artifactId>
<version>${spark.solr.version}</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.3</version>
</dependency>
<!-- JUNIT -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.4.1</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>com.sillycat.sparkjava.SparkJavaApp</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>assemble-all</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
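With the assembly plugin bound to the package phase, a plain Maven build should produce the fat jar referenced in the run commands below (assuming the default final name for this artifact):
>mvn clean package
>ls target/sillycat-spark-solr-1.0-jar-with-dependencies.jar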
Here is the main implementation class, which connects to ZooKeeper and SOLR:
SeniorJavaFeedApp.java
package com.sillycat.sparkjava.app;

import java.util.List;

import org.apache.solr.common.SolrDocument;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

import com.lucidworks.spark.rdd.SolrJavaRDD;
import com.sillycat.sparkjava.base.SparkBaseApp;

public class SeniorJavaFeedApp extends SparkBaseApp {

    private static final long serialVersionUID = -1219898501920199612L;

    protected String getAppName() {
        return "SeniorJavaFeedApp";
    }

    public void executeTask(List<String> params) {
        SparkConf conf = this.getSparkConf();
        SparkContext sc = new SparkContext(conf);

        // ZooKeeper ensemble plus the chroot under which the SOLR cluster is registered
        String zkHost = "zookeeper1.us-east-1.elasticbeanstalk.com,zookeeper2.us-east-1.elasticbeanstalk.com,zookeeper3.us-east-1.elasticbeanstalk.com/solr/allJobs";
        String collection = "allJobs";
        String solrQuery = "expired: false AND title: Java* AND source_id: 4675";
        String keyword = "Architect";

        logger.info("Prepare the resource from " + solrQuery);
        JavaRDD<SolrDocument> rdd = this.generateRdd(sc, zkHost, collection, solrQuery);

        logger.info("Executing the calculation based on keyword " + keyword);
        List<SolrDocument> results = processRows(rdd, keyword);
        for (SolrDocument result : results) {
            logger.info("Found a job for you: " + result);
        }
        sc.stop();
    }

    private JavaRDD<SolrDocument> generateRdd(SparkContext sc, String zkHost, String collection, String solrQuery) {
        // Query all shards of the collection and expose the results as a JavaRDD
        SolrJavaRDD solrRDD = SolrJavaRDD.get(zkHost, collection, sc);
        JavaRDD<SolrDocument> resultsRDD = solrRDD.queryShards(solrQuery);
        return resultsRDD;
    }

    private List<SolrDocument> processRows(JavaRDD<SolrDocument> rows, String keyword) {
        // Keep only the documents whose title contains the keyword
        JavaRDD<SolrDocument> lines = rows.filter(new Function<SolrDocument, Boolean>() {
            private static final long serialVersionUID = 1L;

            public Boolean call(SolrDocument s) throws Exception {
                Object titleObj = s.getFieldValue("title");
                if (titleObj != null) {
                    String title = titleObj.toString();
                    if (title.contains(keyword)) {
                        return true;
                    }
                }
                return false;
            }
        });
        return lines.collect();
    }
}
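The SparkBaseApp base class is not listed in this post. A minimal sketch of what it could look like, assuming a log4j logger and a SparkConf that falls back to local mode when no master is given; the real class in the project may differ:
package com.sillycat.sparkjava.base;

import java.io.Serializable;
import java.util.List;

import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;

// Minimal sketch of the base class used above (assumption, not the original source):
// it only supplies the logger, the app name and a SparkConf for the subclasses.
public abstract class SparkBaseApp implements Serializable {

    private static final long serialVersionUID = 1L;

    // transient so the non-serializable logger is not shipped with closures
    protected transient Logger logger = Logger.getLogger(this.getClass());

    // Subclasses override this to name the Spark application
    protected String getAppName() {
        return "defaultSparkApp";
    }

    // Subclasses implement the actual task
    public abstract void executeTask(List<String> params);

    protected SparkConf getSparkConf() {
        SparkConf conf = new SparkConf().setAppName(getAppName());
        // When not launched through spark-submit, default to local mode
        if (!conf.contains("spark.master")) {
            conf.setMaster("local[2]");
        }
        return conf;
    }
}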
Here are the commands to run the Spark task locally and on the cluster.
#Run locally with java#
>java -jar target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp
>java -jar target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp
#Run binary locally with spark-submit#
>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp /Users/carl/work/sillycat/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp
>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp /Users/carl/work/sillycat/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp
#Run binary on Remote YARN Cluster#
>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp --master yarn-client /home/ec2-user/users/carl/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp
>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp --master yarn-client /home/ec2-user/users/carl/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp
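Note: in Spark 2.x the yarn-client master string is deprecated; the same submit can be expressed with --master yarn --deploy-mode client, for example:
>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp --master yarn --deploy-mode client /home/ec2-user/users/carl/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp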
References:
https://github.com/LucidWorks/spark-solr
https://lucidworks.com/2015/08/20/solr-spark-sql-datasource/
https://lucidworks.com/2016/08/16/solr-as-sparksql-datasource-part-ii/
Spark library
http://spark-packages.org/
Write to XML - stax
https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html#bnbgx
https://www.journaldev.com/892/how-to-write-xml-file-in-java-using-java-stax-api
Spark to S3
http://www.sparktutorials.net/Reading+and+Writing+S3+Data+with+Apache+Spark