1. Overview via WordCount
- Memory tips:
Job > Stage > RDD > Dependency; RDDs are linked by Dependencies.
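A minimal WordCount for reference (the input path and app name are illustrative). The comments note which RDD each step produces; the DAGScheduler splits this lineage into one ShuffleMapStage and one ResultStage, which is what the bullets below refer to:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///input.txt")   // HadoopRDD wrapped by a MapPartitionsRDD
      .flatMap(_.split("\\s+"))                     // MapPartitionsRDD
      .map(word => (word, 1))                       // MapPartitionsRDD
      .reduceByKey(_ + _)                           // ShuffledRDD (stage boundary)

    counts.collect().foreach(println)               // action: submits one job
    sc.stop()
  }
}
```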
2. Terms
- RDDs are associated via Dependency, i.e. a Dependency is a wrapper of an RDD.
A Stage holds its corresponding RDD; a Dependency also holds its parent RDD.
- A Stage is a wrapper of tasks that run the same function; it is also the unit scheduled by the DAGScheduler.
- job.numTasks = resultStage.numPartitions. If 'spark.default.parallelism' is not set, resultStage.numPartitions = ShuffledRDD.partitions = hadoopRDD.splits,
so resultStage.numPartitions is determined by ShuffledRDD#getDependencies() (see the sketch at the end of this section).
TODO check in page
- ShuffleMapStage.partitions = rdd.partitions (i.e. MapPartitionsRDD[3] here; similar to the ResultStage except for the 'spark.default.parallelism' case).
- The ShuffledRDD ends up in the ResultStage, not in the ShuffleMapStage.
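A quick way to sanity-check the partition claims above from a spark-shell session (a sketch, assuming the WordCount lineage from section 1 and that spark.default.parallelism is not set):

```scala
// Inspect how the result partitions follow the ShuffledRDD / hadoopRDD.
val lines  = sc.textFile("hdfs:///input.txt")         // hadoopRDD.splits decide this count
val pairs  = lines.flatMap(_.split("\\s+")).map((_, 1))
val counts = pairs.reduceByKey(_ + _)                 // ShuffledRDD

println(lines.getNumPartitions)   // = number of Hadoop input splits
println(pairs.getNumPartitions)   // unchanged: narrow (one-to-one) dependencies
println(counts.getNumPartitions)  // = result tasks of the job when parallelism is unset

// The lineage shows which RDD each Dependency wraps:
println(counts.toDebugString)
```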
3. Critical concepts
- spark.default.parallelism: determines how many result tasks are run; see PairRDDFunctions#reduceByKey() (and the config sketch at the end of this list).
- Each stage has one directly corresponding RDD.
- spark.cores.max: in standalone mode, this property limits the maximum number of cores used across the cluster.
- Does ShuffleDependency.shuffleId == ShuffleMapStage.id? Yes, i.e. ShuffleDependency : ShuffleMapStage = 1:1;
see shuffleToMapStage(shuffleDep.shuffleId) = stage in DAGScheduler.
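Both properties above can be set on the SparkConf before the context is created (a sketch; the master URL and values are purely illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")        // standalone mode (host is illustrative)
  .setAppName("WordCount")
  .set("spark.default.parallelism", "8")   // reduceByKey -> 8 result partitions / tasks
  .set("spark.cores.max", "4")             // cap on total cores used across the cluster

val sc = new SparkContext(conf)
```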