sillycat
Spark(8)Non Fat Jar/Cassandra Cluster Issue and Spark Version 1.3.1

 

1. Can We Upgrade to Java 8?
Fix the BouncyCastleProvider Problem
Visit https://www.bouncycastle.org/latest_releases.html and download the file bcprov-jdk15on-152.jar.
Place the file in this directory:
/usr/lib/jvm/java-8-oracle/jre/lib/ext

And then go to this directory
/usr/lib/jvm/java-8-oracle/jre/lib/security

Edit this file
sudo vi java.security

Add this line
security.provider.10=org.bouncycastle.jce.provider.BouncyCastleProvider

I should download this file
http://repo1.maven.org/maven2/org/bouncycastle/bcprov-jdk15%2b/1.46/bcprov-jdk15%2b-1.46.jar
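
The java.security edit above can be sketched as a small shell helper. This is only a sketch: add_security_provider is a hypothetical function name, and the index 10 is taken from the line in this post. On another JDK the index should be the next unused security.provider.N number. The demo runs against a scratch file, not the real JDK file.

```shell
# Append a security provider entry to a java.security file, unless the
# provider class is already registered there.
# add_security_provider is a hypothetical helper, not part of the JDK.
add_security_provider() {
  local file="$1" index="$2" class="$3"
  # Skip when the class already appears in a security.provider.N entry.
  if grep -Eq "^security\.provider\.[0-9]+=${class}\$" "$file"; then
    echo "already registered"
    return 0
  fi
  echo "security.provider.${index}=${class}" >> "$file"
  echo "added"
}

# Demo on a scratch file, so the real java.security stays untouched.
demo_file="$(mktemp)"
printf 'security.provider.1=sun.security.provider.Sun\n' > "$demo_file"
add_security_provider "$demo_file" 10 org.bouncycastle.jce.provider.BouncyCastleProvider
```

Against the real file, the call would need sudo and the path /usr/lib/jvm/java-8-oracle/jre/lib/security/java.security.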

Fix the JCE Problem
Download the file from here
http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html

Unzip the file and place the jars in this directory
/usr/lib/jvm/java-8-oracle/jre/lib/security


2. Fat Jar?
https://github.com/apache/spark/pull/288?
https://issues.apache.org/jira/browse/SPARK-1154
http://apache-spark-user-list.1001560.n3.nabble.com/Clean-up-app-folders-in-worker-nodes-td20889.html
https://spark.apache.org/docs/1.0.1/spark-standalone.html

Based on my understanding, we should keep building an assembly (fat) jar in Scala and submit the job to the master, which distributes the work across the Spark standalone cluster or the YARN cluster. The clients should not need any extra setup or jar dependencies.
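
As a rough illustration of submitting an assembly jar to the master, the helper below only prints a spark-submit command line and does not run it. build_submit_cmd is a hypothetical function, and the class name and jar path are placeholders, not names from the sillycat-spark project. The master host ubuntu-master comes from later in this post, and 7077 is the default standalone master port.

```shell
# Build (but do not run) a spark-submit command line for an assembly jar.
# build_submit_cmd is a hypothetical helper; class and jar are placeholders.
build_submit_cmd() {
  local master="$1" main_class="$2" jar="$3"
  shift 3
  printf 'spark-submit --master %s --class %s %s' "$master" "$main_class" "$jar"
  # Remaining arguments are passed through to the application itself.
  local arg
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf '\n'
}

# Standalone cluster example; review the printed command before running it.
build_submit_cmd spark://ubuntu-master:7077 com.example.Main target/app-assembly.jar
```

For a YARN cluster on Spark 1.3, the master argument would be yarn-client or yarn-cluster instead of a spark:// URL.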

3. Cluster Sync Issue in Cassandra 1.2.13
http://stackoverflow.com/questions/23345045/cassandra-cas-delete-does-not-work
http://wiki.apache.org/cassandra/DistributedDeletes

We need to use ntpd to keep the clocks in sync
https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/
https://ria101.wordpress.com/2011/02/08/cassandra-the-importance-of-system-clocks-avoiding-oom-and-how-to-escape-oom-meltdown/

In a Cassandra cluster, all the nodes perform write operations with a timestamp. If the system time differs across the cluster nodes, Cassandra can get into a weird state, and sometimes deletes and updates do not work.
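
A toy model shows why clock skew makes a delete look like it "does not work". It assumes simple last-write-wins resolution, where the mutation with the higher timestamp is kept. resolve_lww is a hypothetical helper written for this illustration, not Cassandra code, and the timestamps are made up.

```shell
# Toy last-write-wins resolution: of two mutations on the same cell, the one
# with the higher timestamp is kept. Ties go to the first argument here, which
# is a simplification; Cassandra has its own tie-breaking rules.
resolve_lww() {
  local ts_a="$1" op_a="$2" ts_b="$3" op_b="$4"
  if [ "$ts_b" -gt "$ts_a" ]; then
    echo "$op_b"
  else
    echo "$op_a"
  fi
}

# Node A's clock runs 5 seconds fast: it writes at real time 1000 ms but
# stamps the write 6000. Node B's clock is accurate: it deletes at real
# time 2000 ms and stamps the delete 2000.
resolve_lww 6000 "write from node A" 2000 "delete from node B"
```

The delete happened later in real time, yet the write survives because its timestamp is higher. With synchronized clocks the write would be stamped 1000 and the delete would win.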

4. Upgrade to 1.3.1 Version
https://spark.apache.org/docs/latest/

Download the Spark source file
> wget http://apache.cs.utah.edu/spark/spark-1.3.1/spark-1.3.1.tgz

Unzip the file and place the Spark directory in the working directory
> sudo ln -s /opt/spark-1.3.1 /opt/spark

My Java and Scala versions are as follows:
> java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

> scala -version
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL

Build the binary
> build/sbt clean
> build/sbt compile
The compile fails because of missing dependencies. I will not spend time on that and will download the binary directly instead.
> wget http://www.motorlogy.com/apache/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
Unzip it and add it to the classpath.

After that, my project sillycat-spark runs easily.

Simple Spark Cluster
Download the source file
> wget http://apache.cs.utah.edu/spark/spark-1.3.1/spark-1.3.1.tgz

Build the source
> build/sbt clean
> build/sbt compile
The build does not work on Ubuntu either, so I use the binary instead.
> wget http://www.motorlogy.com/apache/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz

Prepare Configuration
Go to the conf directory.
> cp spark-env.sh.template spark-env.sh
> cp slaves.template slaves

> cat slaves
# A Spark Worker will be started on each of the machines listed below.
ubuntu-dev1
ubuntu-dev2

> cat spark-env.sh
export SPARK_WORKER_MEMORY=768m
export SPARK_JAVA_OPTS="-Dbuild.env=lmm.sparkvm"
export USER=carl

Copy the same settings to all the slaves by running this on each slave
> scp -r ubuntu-master:/home/carl/tool/spark-1.3.1-hadoop2.6 ./
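
Instead of logging in to every slave, the copy could also be driven from the master by reading the slaves file. The sketch below is a variation on the command above, not what the post does: it pushes from the master rather than pulling on each slave, and it only prints the scp commands so they can be checked first. list_workers and print_sync_cmds are hypothetical helper names.

```shell
# Print the host names in a Spark slaves file, skipping comments and blanks.
list_workers() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Print (not run) one scp command per worker. Each command pushes the Spark
# directory from the master to the same parent path on the worker.
print_sync_cmds() {
  local slaves_file="$1" spark_dir="$2" host
  list_workers "$slaves_file" | while read -r host; do
    echo "scp -r $spark_dir $host:$(dirname "$spark_dir")/"
  done
}

# Demo with a scratch copy of the slaves file shown above.
demo_slaves="$(mktemp)"
printf '# A Spark Worker will be started on each of the machines listed below.\nubuntu-dev1\nubuntu-dev2\n' > "$demo_slaves"
print_sync_cmds "$demo_slaves" /home/carl/tool/spark-1.3.1-hadoop2.6
```

Piping the printed lines to sh would actually run them, provided passwordless SSH from the master to each worker is already set up.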

Run the script to start the standalone cluster
> sbin/start-all.sh

How to build
https://spark.apache.org/docs/1.1.0/building-with-maven.html
> mvn -DskipTests clean package
The build succeeds.

Build with YARN, Hive, and JDBC support
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests clean package

Go to directory
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests clean package install

Error Message:
[ERROR] Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project spark-assembly_2.10: Failed during scalastyle execution: Unable to find configuration file at location scalastyle-config.xml -> [Help 1]

Solution:
Copying [spark_root]/scalastyle-config.xml to [spark_root]/examples/scalastyle-config.xml solves the problem.
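
The workaround can be wrapped in a small helper that checks the source file exists before copying. copy_scalastyle_config is a hypothetical function name, and the module argument is an assumption that lets the same copy be tried for another module directory if the error shows up there too.

```shell
# Copy scalastyle-config.xml from the Spark source root into a module
# directory so the scalastyle plugin can find it there.
# copy_scalastyle_config is a hypothetical helper name.
copy_scalastyle_config() {
  local spark_root="$1" module="$2"
  if [ ! -f "$spark_root/scalastyle-config.xml" ]; then
    echo "missing $spark_root/scalastyle-config.xml" >&2
    return 1
  fi
  cp "$spark_root/scalastyle-config.xml" "$spark_root/$module/scalastyle-config.xml"
}

# Usage, with SPARK_ROOT pointing at the unpacked Spark source tree:
# copy_scalastyle_config "$SPARK_ROOT" examples
```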

> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Pbigtop-dist -DskipTests clean package
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests clean package install

Changes in Resolver.scala
var mavenLocal = Resolver.mavenLocal

I have set it up and it is running in batch mode on a single Spark cluster and on a YARN cluster. I will keep working on streaming mode and dynamic SQL.
All the core base code is in the project sillycat-spark now.

References:
Spark
http://sillycat.iteye.com/blog/1871204
http://sillycat.iteye.com/blog/1872478
http://sillycat.iteye.com/blog/2083193
http://sillycat.iteye.com/blog/2083194
http://sillycat.iteye.com/blog/2103288
http://sillycat.iteye.com/blog/2103457
http://sillycat.iteye.com/blog/2105430

Spark deployment
http://sillycat.iteye.com/blog/2166583
http://sillycat.iteye.com/blog/2167216
http://sillycat.iteye.com/blog/2183932

spark test
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
http://stackoverflow.com/questions/26170957/using-funsuite-to-test-spark-throws-nullpointerexception
http://blog.quantifind.com/posts/spark-unit-test/

spark docs
http://www.sparkexpert.com/
https://github.com/sujee81/SparkApps
http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/
http://dataunion.org/category/tech/spark-tech
http://dataunion.org/6308.html
http://endymecy.gitbooks.io/spark-programming-guide-zh-cn/content/spark-sql/README.html
http://zhangyi.farbox.com/post/access-postgresql-based-on-spark-sql

https://github.com/mkuthan/example-spark.git