Spark Streaming 2020(1)Investigation
On my local setup I have a Spark cluster with a Zeppelin Notebook, plus a 3-node Kafka cluster on rancher-home, rancher-worker1, and rancher-worker2.
Start Kafka Cluster
Start Zookeeper Cluster on 3 machines
> cd /opt/zookeeper
> /opt/zookeeper/bin/zkServer.sh start /opt/zookeeper/conf/zoo.cfg
Check the status on each node
> zkServer.sh status conf/zoo.cfg
Expected: 1 leader, 2 followers
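For reference, the ensemble part of zoo.cfg on all three machines would look roughly like the sketch below; the dataDir value is an assumption, and each machine additionally needs a matching myid file (1, 2, or 3) under dataDir:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper/data
clientPort=2181
server.1=rancher-home:2888:3888
server.2=rancher-worker1:2888:3888
server.3=rancher-worker2:2888:3888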
Start Kafka on all 3 machines
> cd /opt/kafka
> nohup /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties &
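The per-broker settings in server.properties should be along these lines; broker.id must be different on each machine, and log.dirs here is an assumption:
broker.id=1
listeners=PLAINTEXT://rancher-home:9092
log.dirs=/opt/kafka/logs
zookeeper.connect=rancher-home:2181,rancher-worker1:2181,rancher-worker2:2181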
Check the Producer and Consumer
> bin/kafka-console-producer.sh --broker-list rancher-home:9092,rancher-worker1:9092,rancher-worker2:9092 --topic cluster1
> bin/kafka-console-consumer.sh --bootstrap-server rancher-home:9092,rancher-worker1:9092,rancher-worker2:9092 --topic cluster1 --from-beginning
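The cluster1 topic used above was presumably created beforehand (unless auto-creation is enabled on the brokers); a sketch with the Kafka 2.x topic tool, where the partition count and replication factor are my assumptions:
> bin/kafka-topics.sh --create --bootstrap-server rancher-home:9092 --replication-factor 3 --partitions 3 --topic cluster1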
Go to sparkmaster_service and start it on rancher-home; go to sparkslave_service and start it on rancher-worker1 and rancher-worker2.
Check web console
http://rancher-home:8088/
Check which process is using port 8080 on CentOS 7
> netstat -nlp | grep 8080
tcp6 0 0 :::8080 :::* LISTEN 6206/java
> ps -ef | grep 6206
It seems that ZooKeeper is using port 8080; ZooKeeper 3.5+ binds its AdminServer there by default.
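If port 8080 is needed for Spark, the ZooKeeper AdminServer can be moved or disabled in zoo.cfg; a sketch using the standard 3.5+ settings:
# move the AdminServer off 8080
admin.serverPort=9090
# or turn it off entirely
admin.enableServer=false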
Package
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.12/2.4.4
https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients/2.4.0
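In pom.xml the two artifacts above translate to dependencies like the following sketch, with the versions taken straight from the links:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>2.4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.4.0</version>
</dependency>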
Kafka Version
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
One updated example
https://community.cloudera.com/t5/Support-Questions/Scala-Spark-Streaming-with-Kafka-Integration-in-Zeppelin-not/td-p/233507
It seems hard to get streaming working well in Zeppelin, so I will try to make it work in Java and Scala projects instead.
https://github.com/luohuazju/kiko-spark-java
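A minimal Java sketch of the consumer side, following the spark-streaming-kafka-0-10 integration guide; the class name, group id, and 5-second batch interval are my assumptions, not necessarily what the project uses:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("kafka-stream-sketch");
        // 5-second micro-batches (assumption)
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers",
                "rancher-home:9092,rancher-worker1:9092,rancher-worker2:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "spark-streaming-sketch"); // assumption
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", false);

        // Direct stream against the cluster1 topic used earlier
        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Collections.singletonList("cluster1"), kafkaParams));

        // Print the message values of each micro-batch
        stream.map(ConsumerRecord::value).print();

        jssc.start();
        jssc.awaitTermination();
    }
}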
Issue: HDFS is not recognized as a filesystem type; the error message is:
No FileSystem for scheme: hdfs
Solution:
https://www.cnblogs.com/justinzhang/p/4983673.html
https://www.edureka.co/community/3320/java-mapreduce-error-saying-no-filesystem-for-scheme-hdfs
https://brucebcampbell.wordpress.com/2014/12/11/fix-hadoop-hdfs-error-java-io-ioexception-no-filesystem-for-scheme-hdfs-at-org-apache-hadoop-fs-filesystem-getfilesystemclassfilesystem-java2385/
https://www.codelast.com/%E5%8E%9F%E5%88%9B-%E8%A7%A3%E5%86%B3%E8%AF%BB%E5%86%99hdfs%E6%96%87%E4%BB%B6%E7%9A%84%E9%94%99%E8%AF%AF%EF%BC%9Ano-filesystem-for-scheme-hdfs/
https://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file
My changes in pom.xml and settings are as follows:
<properties>
    <spark.version>2.4.4</spark.version>
    <hadoop.version>3.2.1</hadoop.version>
</properties>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;

SparkConf conf = this.getSparkConf();
SparkContext sc = new SparkContext(conf);
// Register the FileSystem implementations explicitly, in case the
// META-INF/services entries were lost when the fat jar was built
Configuration hadoopConf = sc.hadoopConfiguration();
hadoopConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
hadoopConf.set("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem");
So the plan is to solve the HDFS issue and upgrade the project libraries.
But after solving the HDFS problem, I found that because I run Hadoop in Docker, the Spark cluster has trouble reaching port 9866, which is dfs.datanode.address (the datanode data transfer port in Hadoop 3.x).
https://kontext.tech/column/hadoop/265/default-ports-used-by-hadoop-services-hdfs-mapreduce-yarn
https://www.stefaanlippens.net/hadoop-3-default-ports.html
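One commonly suggested client-side workaround when datanodes run inside Docker is to make the HDFS client connect via the datanode's hostname rather than its container-internal IP; a sketch to try against this setup, using the same hadoopConf as above:
// ask the namenode to return datanode hostnames instead of internal IPs
hadoopConf.set("dfs.client.use.datanode.hostname", "true");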
I may need to set up an HDFS cluster outside of Docker.
https://acadgild.com/blog/hadoop-3-x-installation-guide
https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/
References:
https://github.com/luohuazju/sillycat-spark
https://www.iteye.com/blog/sillycat-2215237
https://www.iteye.com/blog/sillycat-2406572
https://www.iteye.com/blog/sillycat-2370527
Third-party articles
Zeppelin streaming
https://www.cnblogs.com/luweiseu/p/8045863.html
https://henning.kropponline.de/2016/12/25/simple-spark-streaming-kafka-example-in-a-zeppelin-notebook/
https://juejin.im/post/5c997d9e5188252da22514e6
https://blog.csdn.net/qwemicheal/article/details/71082663