Spark Solr(1)Read Data from SOLR

 

I ran into a lot of jackson data-binding version conflicts between Spark and Solr, so I cloned the spark-solr project and made some version updates there.

It was originally forked from https://github.com/LucidWorks/spark-solr
>git clone https://github.com/luohuazju/spark-solr

Only some dependency versions are updated in pom.xml:
<modelVersion>4.0.0</modelVersion>
   <groupId>com.lucidworks.spark</groupId>
   <artifactId>spark-solr</artifactId>
-  <version>3.4.0-SNAPSHOT</version>
+  <version>3.4.0.1</version>
   <packaging>jar</packaging>
   <name>spark-solr</name>
   <description>Tools for reading data from Spark into Solr</description>
@@ -39,11 +39,10 @@
     <java.version>1.8</java.version>
     <spark.version>2.2.1</spark.version>
     <solr.version>7.1.0</solr.version>
-    <fasterxml.version>2.4.0</fasterxml.version>
+    <fasterxml.version>2.6.7</fasterxml.version>
     <scala.version>2.11.8</scala.version>
     <scala.binary.version>2.11</scala.binary.version>
     <scoverage.plugin.version>1.1.1</scoverage.plugin.version>
-    <fasterxml.version>2.4.0</fasterxml.version>
     <MaxPermSize>128m</MaxPermSize>
   </properties>
   <repositories>

Command to build the package:
>mvn clean compile install -DskipTests

After building it, I get a driver installed in my local Maven repository, versioned as <spark.solr.version>3.4.0.1</spark.solr.version>.
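
If you would rather not maintain a fork, one possible alternative (a sketch I have not verified) is to pin the jackson artifacts in your own project's pom.xml with dependencyManagement, using the same 2.6.7 version the fork settles on:
<dependencyManagement>
    <dependencies>
        <!-- Hypothetical alternative to forking spark-solr: force a single
             jackson version (2.6.7, as in the fork above) across the tree -->
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>2.6.7</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.6.7</version>
        </dependency>
    </dependencies>
</dependencyManagement>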

Set Up SOLR Spark Task
pom.xml to bring in the dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                      http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.sillycat</groupId>
    <artifactId>sillycat-spark-solr</artifactId>
    <version>1.0</version>
    <description>Read data from SOLR with Spark</description>
    <name>Spark Solr System</name>
    <packaging>jar</packaging>

    <properties>
        <spark.version>2.2.1</spark.version>
        <spark.solr.version>3.4.0.1</spark.solr.version>
    </properties>

    <dependencies>
        <!-- spark framework -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- spark SOLR -->
        <dependency>
            <groupId>com.lucidworks.spark</groupId>
            <artifactId>spark-solr</artifactId>
            <version>${spark.solr.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.3</version>
        </dependency>
        <!-- JUNIT -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4.1</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.sillycat.sparkjava.SparkJavaApp</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>assemble-all</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
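
With this pom, running mvn clean package produces target/sillycat-spark-solr-1.0-jar-with-dependencies.jar, the fat jar used in the run commands below.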

Here is the main implementation class, which connects to ZooKeeper and queries SOLR:
SeniorJavaFeedApp.java
package com.sillycat.sparkjava.app;

import java.util.List;

import org.apache.solr.common.SolrDocument;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

import com.lucidworks.spark.rdd.SolrJavaRDD;
import com.sillycat.sparkjava.base.SparkBaseApp;

public class SeniorJavaFeedApp extends SparkBaseApp {

    private static final long serialVersionUID = -1219898501920199612L;

    protected String getAppName() {
        return "SeniorJavaFeedApp";
    }

    public void executeTask(List<String> params) {
        SparkConf conf = this.getSparkConf();
        SparkContext sc = new SparkContext(conf);

        // ZooKeeper connection string for the SolrCloud cluster, including the chroot path
        String zkHost = "zookeeper1.us-east-1.elasticbeanstalk.com,zookeeper2.us-east-1.elasticbeanstalk.com,zookeeper3.us-east-1.elasticbeanstalk.com/solr/allJobs";
        String collection = "allJobs";
        // Solr query: non-expired Java titles from source 4675
        String solrQuery = "expired: false AND title: Java* AND source_id: 4675";
        // Keyword for the Spark-side filter in processRows
        String keyword = "Architect";

        logger.info("Prepare the resource from " + solrQuery);
        JavaRDD<SolrDocument> rdd = this.generateRdd(sc, zkHost, collection, solrQuery);
        logger.info("Executing the calculation based on keyword " + keyword);

        List<SolrDocument> results = processRows(rdd, keyword);
        for (SolrDocument result : results) {
            logger.info("Found a job for you: " + result);
        }
        sc.stop();
    }

    private JavaRDD<SolrDocument> generateRdd(SparkContext sc, String zkHost, String collection, String solrQuery) {
        SolrJavaRDD solrRDD = SolrJavaRDD.get(zkHost, collection, sc);
        // Query across all shards of the collection
        JavaRDD<SolrDocument> resultsRDD = solrRDD.queryShards(solrQuery);
        return resultsRDD;
    }

    private List<SolrDocument> processRows(JavaRDD<SolrDocument> rows, String keyword) {
        // Keep only the documents whose title contains the keyword
        JavaRDD<SolrDocument> lines = rows.filter(new Function<SolrDocument, Boolean>() {
            private static final long serialVersionUID = 1L;

            public Boolean call(SolrDocument s) throws Exception {
                Object titleObj = s.getFieldValue("title");
                if (titleObj != null) {
                    String title = titleObj.toString();
                    if (title.contains(keyword)) {
                        return true;
                    }
                }
                return false;
            }
        });
        // collect() brings the matching documents back to the driver
        return lines.collect();
    }

}
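
The base class SparkBaseApp is not included in the post. Here is a minimal sketch of what it might look like, assuming log4j logging and a local[2] fallback master; apart from the member names actually used by SeniorJavaFeedApp, everything in it is an assumption:
package com.sillycat.sparkjava.base;

import java.io.Serializable;
import java.util.List;

import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;

public abstract class SparkBaseApp implements Serializable {

    private static final long serialVersionUID = 1L;

    // Logger shared with subclasses such as SeniorJavaFeedApp
    protected transient Logger logger = Logger.getLogger(this.getClass());

    // Name shown in the Spark UI; each subclass overrides this
    protected abstract String getAppName();

    // Entry point invoked by the dispatcher with the remaining arguments
    public abstract void executeTask(List<String> params);

    // Basic SparkConf; the master normally comes from spark-submit,
    // with local[2] as a fallback when running the fat jar directly
    protected SparkConf getSparkConf() {
        SparkConf conf = new SparkConf();
        conf.setAppName(getAppName());
        if (!conf.contains("spark.master")) {
            conf.setMaster("local[2]");
        }
        return conf;
    }
}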

Here is how to run the Spark task locally and on the cluster; the fat jar's entry class dispatches to the app class named in the first argument.
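
That entry class, com.sillycat.sparkjava.SparkJavaApp (the mainClass in the assembly manifest), is not shown in the post either. A minimal sketch, assuming it simply instantiates the SparkBaseApp subclass named in the first argument and passes the remaining arguments through:
package com.sillycat.sparkjava;

import java.util.Arrays;
import java.util.List;

import com.sillycat.sparkjava.base.SparkBaseApp;

public class SparkJavaApp {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: SparkJavaApp <app class name> [params...]");
            System.exit(1);
        }
        // Instantiate the app class named in the first argument, e.g.
        // com.sillycat.sparkjava.app.SeniorJavaFeedApp
        SparkBaseApp app = (SparkBaseApp) Class.forName(args[0]).newInstance();
        // Hand any remaining arguments to the task
        List<String> params = Arrays.asList(args).subList(1, args.length);
        app.executeTask(params);
    }
}
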
#Run the jar locally#

>java -jar target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp

>java -jar target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp

#Run binary locally with spark-submit#

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp /Users/carl/work/sillycat/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp /Users/carl/work/sillycat/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp

#Run binary on remote YARN cluster#
(Note: in Spark 2.x the yarn-client master value is deprecated; --master yarn --deploy-mode client is the preferred form, though yarn-client still works.)

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp --master yarn-client /home/ec2-user/users/carl/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp --master yarn-client /home/ec2-user/users/carl/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp



References:
https://github.com/LucidWorks/spark-solr
https://lucidworks.com/2015/08/20/solr-spark-sql-datasource/
https://lucidworks.com/2016/08/16/solr-as-sparksql-datasource-part-ii/

Spark library
http://spark-packages.org/

Write to XML - stax
https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html#bnbgx
https://www.journaldev.com/892/how-to-write-xml-file-in-java-using-java-stax-api
Spark to s3
http://www.sparktutorials.net/Reading+and+Writing+S3+Data+with+Apache+Spark