Spark(8)Non Fat Jar/Cassandra Cluster Issue and Spark Version 1.3.1
1. Can We Upgrade to Java 8?
Fix the BouncyCastleProvider Problem
Visit https://www.bouncycastle.org/latest_releases.html, download the file bcprov-jdk15on-152.jar
Place the file in this directory
/usr/lib/jvm/java-8-oracle/jre/lib/ext
Then go to this directory
/usr/lib/jvm/java-8-oracle/jre/lib/security
Edit this file
> sudo vi java.security
Add this line, using the next unused provider index (10 in my case)
security.provider.10=org.bouncycastle.jce.provider.BouncyCastleProvider
I should download this file
http://repo1.maven.org/maven2/org/bouncycastle/bcprov-jdk15%2b/1.46/bcprov-jdk15%2b-1.46.jar
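The index in the security.provider line has to follow the providers already registered in java.security. A minimal sketch of how that index could be worked out, assuming bash; the helper name next_provider_line is hypothetical, and the demo runs on a temporary file rather than the real java.security:

```shell
# next_provider_line FILE CLASS
# Print a security.provider line for CLASS that uses the first index after the
# highest one already present in FILE (a java.security-style file).
next_provider_line() {
  local file="$1" class="$2" max
  max=$(sed -n 's/^security\.provider\.\([0-9][0-9]*\)=.*/\1/p' "$file" | sort -n | tail -n 1)
  echo "security.provider.$(( ${max:-0} + 1 ))=$class"
}

# Demo on a throwaway file so the real java.security is never touched.
demo=$(mktemp)
printf '%s\n' \
  'security.provider.1=sun.security.provider.Sun' \
  'security.provider.2=sun.security.rsa.SunRsaSign' > "$demo"
next_provider_line "$demo" org.bouncycastle.jce.provider.BouncyCastleProvider
rm -f "$demo"
```

The helper only prints the line; adding it to java.security is still a manual edit as described above.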
Fix the JCE Problem
Download the file from here
http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html
Unzip the file and place the jars in this directory
/usr/lib/jvm/java-8-oracle/jre/lib/security
2. Fat Jar?
https://github.com/apache/spark/pull/288?
https://issues.apache.org/jira/browse/SPARK-1154
http://apache-spark-user-list.1001560.n3.nabble.com/Clean-up-app-folders-in-worker-nodes-td20889.html
https://spark.apache.org/docs/1.0.1/spark-standalone.html
Based on my understanding, we should keep using the assembly jar built from the Scala project and submit the job to the master, which distributes the tasks to the Spark standalone cluster or the YARN cluster. The clients should not require any setup or jar dependencies.
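As a rough illustration of submitting one assembly jar to either kind of cluster, here is a dry-run sketch, assuming bash. The helper submit_cmd, the class com.example.Main, and the jar path are hypothetical placeholders; port 7077 is only the usual default for a standalone master. It prints the spark-submit command instead of running it:

```shell
# submit_cmd MASTER_URL MAIN_CLASS ASSEMBLY_JAR
# Print (not run) the spark-submit command line for one assembly jar.
submit_cmd() {
  local master="$1" main_class="$2" jar="$3"
  echo "spark-submit --class $main_class --master $master $jar"
}

# Hypothetical names: the class and the jar path are placeholders.
submit_cmd spark://ubuntu-master:7077 com.example.Main target/scala-2.10/app-assembly-1.0.jar
submit_cmd yarn-cluster com.example.Main target/scala-2.10/app-assembly-1.0.jar
```

Only the master URL changes between the two targets; the jar and the main class stay the same, which is the point of shipping a single assembly jar.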
3. Cluster Sync Issue in Cassandra 1.2.13
http://stackoverflow.com/questions/23345045/cassandra-cas-delete-does-not-work
http://wiki.apache.org/cassandra/DistributedDeletes
We need to use ntpd to keep the clocks in sync.
https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/
https://ria101.wordpress.com/2011/02/08/cassandra-the-importance-of-system-clocks-avoiding-oom-and-how-to-escape-oom-meltdown/
In a Cassandra cluster, every node stamps its write operations with a timestamp. If the system time differs across the cluster nodes, Cassandra can run into a weird state: sometimes a delete or an update does not take effect.
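To see why a skewed clock can make a delete disappear, here is a toy last-write-wins sketch, assuming bash. It is not Cassandra code: the "timestamp:value" cell format, the TOMBSTONE marker, and the resolve helper are made up for illustration, and the tie-breaking is simplified.

```shell
# resolve CELL_A CELL_B
# Each cell is "timestamp:value"; print the value with the larger timestamp.
# Simplification: on a tie the first argument wins here; real tie-breaking differs.
resolve() {
  local a="$1" b="$2"
  if [ "${a%%:*}" -ge "${b%%:*}" ]; then
    echo "${a#*:}"
  else
    echo "${b#*:}"
  fi
}

# Node A's clock runs 5 seconds fast: its write at real time 100 is stamped 105.
write_from_a="105:hello"
# Node B's clock is accurate: its delete at real time 102 is stamped 102.
delete_from_b="102:TOMBSTONE"

# The delete came later in real time but carries the smaller timestamp, so it loses.
resolve "$write_from_a" "$delete_from_b"   # prints: hello
```

With both clocks in sync the write would be stamped 100, the delete 102, and the tombstone would win as expected.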
4. Upgrade to 1.3.1 Version
https://spark.apache.org/docs/latest/
Download the Spark source file
> wget http://apache.cs.utah.edu/spark/spark-1.3.1/spark-1.3.1.tgz
Unzip it and link the Spark directory into the working directory
> sudo ln -s /opt/spark-1.3.1 /opt/spark
My Java and Scala versions are as follows:
> java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
> scala -version
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
Build the binary
> build/sbt clean
> build/sbt compile
The compile fails because of missing dependencies. I will not spend time on that; I will download the binary directly.
> wget http://www.motorlogy.com/apache/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
Unzip it and add it to the classpath.
Then my project sillycat-spark can easily run.
Simple Spark Cluster
Download the source file
> wget http://apache.cs.utah.edu/spark/spark-1.3.1/spark-1.3.1.tgz
Build the source
> build/sbt clean
> build/sbt compile
It does not build on Ubuntu either, so I use the binary instead.
> wget http://www.motorlogy.com/apache/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
Prepare Configuration
Go to the conf directory.
> cp spark-env.sh.template spark-env.sh
> cp slaves.template slaves
> cat slaves
# A Spark Worker will be started on each of the machines listed below.
ubuntu-dev1
ubuntu-dev2
> cat spark-env.sh
export SPARK_WORKER_MEMORY=768m
export SPARK_JAVA_OPTS="-Dbuild.env=lmm.sparkvm"
export USER=carl
Copy the same settings to all the slaves by running this on each of them
> scp -r ubuntu-master:/home/carl/tool/spark-1.3.1-hadoop2.6 ./
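With more than a couple of workers, the copy can be driven from the slaves file instead of being typed on each machine. A dry-run sketch, assuming bash and assuming the copy is pushed from the master rather than pulled on each slave as above; the helpers list_workers and print_copy_cmds are hypothetical, and the destination path is a guess based on the source path:

```shell
# list_workers FILE: print host names from a slaves file, skipping comments and blank lines.
list_workers() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# print_copy_cmds FILE SRC_DIR DEST_DIR: print one scp command per worker (dry run).
print_copy_cmds() {
  local host
  list_workers "$1" | while read -r host; do
    echo "scp -r $2 $host:$3"
  done
}

# Demo with a temporary slaves file that mirrors the one shown above.
demo=$(mktemp)
printf '%s\n' '# A Spark Worker will be started on each of the machines listed below.' \
  'ubuntu-dev1' 'ubuntu-dev2' > "$demo"
print_copy_cmds "$demo" /home/carl/tool/spark-1.3.1-hadoop2.6 /home/carl/tool/
rm -f "$demo"
```

The commands are only printed, so they can be reviewed before being run by hand or piped to a shell.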
Run the script to start the standalone cluster
> sbin/start-all.sh
How to build
https://spark.apache.org/docs/1.1.0/building-with-maven.html
> mvn -DskipTests clean package
The build succeeds.
Build with YARN, Hive, and JDBC support
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests clean package
Go to directory
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests clean package install
Error Message:
[ERROR] Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project spark-assembly_2.10: Failed during scalastyle execution: Unable to find configuration file at location scalastyle-config.xml -> [Help 1]
Solution:
Copying [spark_root]/scalastyle-config.xml to [spark_root]/examples/scalastyle-config.xml solves the problem.
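The copy can be wrapped in a small helper so that it is safe to repeat and can target whichever module the error names. A minimal sketch, assuming bash; the function name copy_scalastyle_config is hypothetical, and the demo uses a temporary directory in place of a real Spark checkout:

```shell
# copy_scalastyle_config SPARK_ROOT MODULE
# Copy the root scalastyle-config.xml into MODULE unless that module already has one.
copy_scalastyle_config() {
  local root="$1" module="$2"
  if [ ! -f "$root/$module/scalastyle-config.xml" ]; then
    cp "$root/scalastyle-config.xml" "$root/$module/"
  fi
}

# Demo in a throwaway directory that stands in for [spark_root].
demo=$(mktemp -d)
mkdir "$demo/examples"
echo '<scalastyle/>' > "$demo/scalastyle-config.xml"
copy_scalastyle_config "$demo" examples
ls "$demo/examples"
rm -rf "$demo"
```

Against a real checkout the call would be something like copy_scalastyle_config /opt/spark examples, assuming the source tree lives under /opt/spark.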
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Pbigtop-dist -DskipTests clean package
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests clean package install
Changes in Resolver.scala
var mavenLocal = Resolver.mavenLocal
I have it set up and running in batch mode on both the Spark standalone cluster and the YARN cluster. I will keep working on streaming mode and dynamic SQL.
All the core base code is in the sillycat-spark project now.
References:
Spark
http://sillycat.iteye.com/blog/1871204
http://sillycat.iteye.com/blog/1872478
http://sillycat.iteye.com/blog/2083193
http://sillycat.iteye.com/blog/2083194
http://sillycat.iteye.com/blog/2103288
http://sillycat.iteye.com/blog/2103457
http://sillycat.iteye.com/blog/2105430
Spark deployment
http://sillycat.iteye.com/blog/2166583
http://sillycat.iteye.com/blog/2167216
http://sillycat.iteye.com/blog/2183932
spark test
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
http://stackoverflow.com/questions/26170957/using-funsuite-to-test-spark-throws-nullpointerexception
http://blog.quantifind.com/posts/spark-unit-test/
spark docs
http://www.sparkexpert.com/
https://github.com/sujee81/SparkApps
http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/
http://dataunion.org/category/tech/spark-tech
http://dataunion.org/6308.html
http://endymecy.gitbooks.io/spark-programming-guide-zh-cn/content/spark-sql/README.html
http://zhangyi.farbox.com/post/access-postgresql-based-on-spark-sql
https://github.com/mkuthan/example-spark.git