Spark 2017 BigData Update(3)Notebook Example
Zeppelin Tutorial/Basic Features(Spark)
The sample data is here: https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv
17 columns with about 4,522 records (age, job, marital, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome, y).
The data format is as follows:
"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"
33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"no"
Load the data into a table:
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// download the bank CSV and parallelize its lines into an RDD
val bankText = sc.parallelize(
  IOUtils.toString(
    new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
    Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// split each line on ";", skip the header row, strip the quotes, and convert to a DataFrame
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
  s => Bank(s(0).toInt,
    s(1).replaceAll("\"", ""),
    s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""),
    s(5).replaceAll("\"", "").toInt
  )
).toDF()
bank.registerTempTable("bank")
An RDD only knows about its items, not their column structure. A DataFrame knows both the items and the column schema, which is what allows us to query it with SQL. More on the difference: https://www.jianshu.com/p/c0181667daa0
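To see the difference, here is a minimal sketch continuing from the bank DataFrame and the bankText RDD created above (the exact output depends on the Spark version):
// the DataFrame carries a schema and named columns
bank.printSchema()
bank.select("age", "balance").show(3)
// the raw RDD only exposes untyped lines of text
bankText.take(2).foreach(println)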
Spark SQL
%sql
select age, count(1) value
from bank
where age < 100
group by age
order by age
Spark SQL with Parameters
%sql
select age, count(1) value
from bank
where age < ${maxAge=30}
group by age
order by age
Spark SQL with Select Parameters
%sql
select age, count(1) value
from bank
where marital="${marital=single,single|divorced|married}" and age < ${maxAge=30}
group by age
order by age
Python Example
https://distributesystem.wordpress.com/2016/04/12/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8/
Spark SQL - Hive-compatible interactive queries
Spark Streaming - real-time analysis of streaming data
MLlib - machine learning algorithms
GraphX - graph processing algorithms
Spark Driver - hosts the SparkContext
Spark Executor - runs the tasks assigned by the driver
Spark Streaming - StreamingContext -> DStream
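A minimal Scala sketch of how a StreamingContext produces a DStream; the socket source on localhost:9999 is a hypothetical example for illustration and is not part of the original notebook:
import org.apache.spark.streaming.{Seconds, StreamingContext}
// build a StreamingContext on top of the existing SparkContext, with 2-second micro-batches
val ssc = new StreamingContext(sc, Seconds(2))
// DStream of text lines read from a TCP socket (hypothetical source)
val lines = ssc.socketTextStream("localhost", 9999)
// word count over each micro-batch
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()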
Some Operations in Spark
map - applies a function to each item of the RDD and produces a new RDD
Example using Python:
%pyspark
x = sc.parallelize([1,2,3])
y = x.map(lambda x: (x, x**2))
print(x.collect())  # [1, 2, 3]
print(y.collect())  # [(1, 1), (2, 4), (3, 9)]
mapPartitions
Operates on the RDD one partition at a time:
%pyspark
x = sc.parallelize([1,2,3,4], 2)
def f(iterator): yield sum(iterator)
y = x.mapPartitions(f)
print(x.glom().collect())  # glom() groups the elements within each partition into a list
print(y.glom().collect())
Output:
[[1, 2], [3, 4]]
[[3], [7]]
Filter
%pyspark
# filter
x = sc.parallelize([1, 2, 3, 4, 5, 6])
y = x.filter(lambda x: x%2 == 1)  # filters out the even elements
print(x.collect())  # [1, 2, 3, 4, 5, 6]
print(y.collect())  # [1, 3, 5]
Distinct
%pyspark
x = sc.parallelize([1,1,2,2,3,3,4,4,5,5])
y = x.distinct()
print(x.collect())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print(y.collect())  # e.g. [2, 4, 1, 3, 5] (order may vary)
More Examples
https://spark.apache.org/docs/latest/rdd-programming-guide.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
References:
http://sillycat.iteye.com/blog/2405875
http://sillycat.iteye.com/blog/2406113
https://distributesystem.wordpress.com/2016/04/12/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8/
https://distributesystem.wordpress.com/2016/04/13/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8-%E7%AC%AC%E4%BA%8C%E5%A4%A9/