Spark 2017 BigData Update(3)Notebook Example
Zeppelin Tutorial/Basic Features(Spark)
The sample data is here: https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv
It has 17 columns and about 4522 records: age, job, marital, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome, y.
The data format is as follows:
"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"
33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"no"
Load the data into a table
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// download the csv and turn its lines into an RDD
val bankText = sc.parallelize(
  IOUtils.toString(
    new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
    Charset.forName("utf8")).split("\n"))

// only keep the columns we care about
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// split each line on ";", drop the header row, strip the quotes, and build a DataFrame
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
  s => Bank(s(0).toInt,
    s(1).replaceAll("\"", ""),
    s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""),
    s(5).replaceAll("\"", "").toInt
  )
).toDF()

// register the DataFrame as a temp table so %sql paragraphs can query it
// (on Spark 2.x, createOrReplaceTempView("bank") is the non-deprecated equivalent)
bank.registerTempTable("bank")
An RDD only knows about the individual items, not the column structure. A DataFrame knows both the items and the column structure (the schema).
https://www.jianshu.com/p/c0181667daa0
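A quick way to see that difference in Zeppelin, as a minimal sketch (it assumes the bank temp table registered in the Scala paragraph above is visible to the Python interpreter through the shared SQLContext):
%pyspark
# pull the registered temp table back as a DataFrame
df = sqlContext.table("bank")
df.printSchema()        # the DataFrame knows the column structure
print(df.rdd.take(3))   # the underlying RDD only knows the row items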
Spark SQL
%sql
select age, count(1) value
from bank
where age < 100
group by age
order by age
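The same aggregation can also be expressed with the DataFrame API instead of SQL; a minimal PySpark sketch, assuming the bank table registered above:
%pyspark
# same result as the %sql paragraph above, via the DataFrame API
bank_df = sqlContext.table("bank")
(bank_df.filter(bank_df.age < 100)
        .groupBy("age")
        .count()
        .orderBy("age")
        .show())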
Spark SQL with Parameters
${maxAge=30} is a Zeppelin dynamic form: it renders a text input with the default value 30, and the entered value is substituted into the query.
%sql
select age, count(1) value
from bank
where age < ${maxAge=30}
group by age
order by age
Spark SQL with Select Parameters
${marital=single,single|divorced|married} renders a drop-down form with the options single, divorced, and married, with single as the default.
%sql
select age, count(1) value
from bank
where marital="${marital=single,single|divorced|married}" and age < ${maxAge=30}
group by age
order by age
Python Example
https://distributesystem.wordpress.com/2016/04/12/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8/
Spark SQL - Hive-compatible interactive queries
Spark Streaming - real-time streaming data analysis
MLlib - machine learning algorithms
GraphX - graph processing algorithms
Spark Driver - hosts the SparkContext
Spark Executor - runs the tasks on the worker nodes
Spark Streaming - StreamingContext -> DStream (see the sketch below)
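A minimal PySpark Streaming sketch of that StreamingContext -> DStream flow; the socket source host/port and the word-count logic are only assumptions for illustration:
%pyspark
from pyspark.streaming import StreamingContext

# build a StreamingContext on top of the existing SparkContext, 10-second batches
ssc = StreamingContext(sc, 10)

# a DStream of text lines from a socket source (host and port are hypothetical)
lines = ssc.socketTextStream("localhost", 9999)

# count words in every batch
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

# ssc.start(); ssc.awaitTermination()   # uncomment to actually start the job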
Some Operations in Spark
Map - applies a function to each item in the RDD and generates a new RDD.
Example using Python:
%pyspark
x = sc.parallelize([1,2,3])
y = x.map(lambda x: (x, x**2))
print(x.collect())  # [1, 2, 3]
print(y.collect())  # [(1, 1), (2, 4), (3, 9)]
mapPartitions
Operates on the RDD one partition at a time.
%pyspark
x = sc.parallelize([1,2,3,4], 2)
def f(iterator): yield sum(iterator)
y = x.mapPartitions(f)
print(x.glom().collect()) # glom() groups the elements of each partition into a list
print(y.glom().collect())
Output:
[[1, 2], [3, 4]]
[[3], [7]]
Filter
%pyspark
# filter
x = sc.parallelize([1,2,3, 4, 5, 6])
y = x.filter(lambda x: x%2 == 1) # filters out even elements
print(x.collect())  # [1, 2, 3, 4, 5, 6]
print(y.collect())  # [1, 3, 5]
Distinct
%pyspark
x = sc.parallelize([1,1,2,2,3,3,4,4,5,5])
y = x.distinct()
print(x.collect())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print(y.collect())  # [2, 4, 1, 3, 5] (order may vary)
More Examples
https://spark.apache.org/docs/latest/rdd-programming-guide.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
References:
http://sillycat.iteye.com/blog/2405875
http://sillycat.iteye.com/blog/2406113
https://distributesystem.wordpress.com/2016/04/12/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8/
https://distributesystem.wordpress.com/2016/04/13/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8-%E7%AC%AC%E4%BA%8C%E5%A4%A9/