Spark 2017 BigData Update(3)Notebook Example

 

Zeppelin Tutorial/Basic Features(Spark)
The sample data is here: https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv
It has 17 columns and about 4,521 data records plus a header row (age, job, marital, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome, y).

The data format is as follows:
"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"
33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"no"

Load the data into a table:
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// load the bank data over HTTP and parallelize the lines into an RDD
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// split each line on ";", skip the header row, strip the quotes, and map to Bank
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt,
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
).toDF()
bank.registerTempTable("bank") // in Spark 2.x, createOrReplaceTempView("bank") is preferred

An RDD knows only its items, not the column structure. A DataFrame knows both the items and the column structure.
https://www.jianshu.com/p/c0181667daa0
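
To see the difference, we can read the table back and inspect it. A minimal sketch, assuming Zeppelin's shared interpreter mode so that the "bank" temp table registered in the Scala paragraph is visible to the pyspark sqlContext:

%pyspark

# the DataFrame knows column names and types
df = sqlContext.table("bank")
df.printSchema()
# the underlying RDD is just rows of items, with no schema of its own
print(df.rdd.take(2))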

Spark SQL
%sql
select age, count(1) value
from bank
where age < 100
group by age
order by age

Spark SQL with Parameters
%sql
select age, count(1) value
from bank
where age < ${maxAge=30}
group by age
order by age

Spark SQL with Select Parameters
%sql
select age, count(1) value
from bank
where marital="${marital=single,single|divorced|married}" and age < ${maxAge=30}
group by age
order by age
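
The ${maxAge=30} and ${marital=...} placeholders render Zeppelin dynamic forms with the given defaults. The same query can also be parameterized in code; a minimal sketch with an ordinary variable standing in for the form:

%pyspark

# plain-variable equivalent of the ${maxAge=30} dynamic form above
max_age = 30
sqlContext.sql(
    "select age, count(1) value from bank "
    "where age < %d group by age order by age" % max_age).show()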

Python Example
https://distributesystem.wordpress.com/2016/04/12/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8/

Spark SQL - Hive-compatible interactive queries
Spark Streaming - real-time streaming data analysis
MLlib - machine learning algorithms
GraphX - graph processing algorithms

Spark Driver - hosts the SparkContext
Spark Executor - runs the tasks scheduled by the driver

Spark Streaming - StreamingContext -> DStream
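
A minimal StreamingContext -> DStream sketch, assuming a text source is listening on localhost:9999 (for example, one started with `nc -lk 9999`):

%pyspark

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)                    # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)  # DStream of input lines
lines.pprint()                                   # print each micro-batch
# ssc.start(); ssc.awaitTermination()            # uncomment to actually run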

Some Operations in Spark
Map - applies a function to each item in the RDD, producing a new RDD.
Example using Python:
%pyspark

x = sc.parallelize([1,2,3])
y = x.map(lambda x: (x, x**2))
print(x.collect())   # [1, 2, 3]
print(y.collect())   # [(1, 1), (2, 4), (3, 9)]

mapPartitions
Operates on the RDD one partition at a time:
%pyspark

x = sc.parallelize([1,2,3,4], 2)
def f(iterator): yield sum(iterator)
y = x.mapPartitions(f)
print(x.glom().collect())  # glom() groups the elements of each partition into a list
print(y.glom().collect())

output
[[1, 2], [3, 4]]
[[3], [7]]
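
mapPartitions pays off when per-partition setup is expensive, since the setup runs once per partition instead of once per item. A sketch with a hypothetical stand-in for a costly resource:

%pyspark

def g(iterator):
    conn = object()  # stand-in for an expensive per-partition resource
    for item in iterator:
        yield item * 10  # reuse the resource for every item in the partition

x = sc.parallelize([1, 2, 3, 4], 2)
print(x.mapPartitions(g).collect())  # [10, 20, 30, 40]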

Filter
%pyspark

# filter
x = sc.parallelize([1,2,3, 4, 5, 6])
y = x.filter(lambda x: x%2 == 1)  # filters out even elements
print(x.collect())        # [1, 2, 3, 4, 5, 6]
print(y.collect())        # [1, 3, 5]

Distinct
%pyspark

x = sc.parallelize([1,1,2,2,3,3,4,4,5,5])
y = x.distinct()
print(x.collect())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print(y.collect())  # e.g. [2, 4, 1, 3, 5] - the order is not guaranteed

More Examples:
https://spark.apache.org/docs/latest/rdd-programming-guide.html
https://spark.apache.org/docs/latest/sql-programming-guide.html

References:
http://sillycat.iteye.com/blog/2405875
http://sillycat.iteye.com/blog/2406113

https://distributesystem.wordpress.com/2016/04/12/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8/
https://distributesystem.wordpress.com/2016/04/13/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8-%E7%AC%AC%E4%BA%8C%E5%A4%A9/
