
[spark-src-core] 2.5 core concepts in Spark

 

1. Overview via WordCount


-memory tips:

Job > Stage > RDD > Dependency
RDDs are linked to each other by Dependencies.
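The containment chain above can be sketched in plain Python (this is only an illustrative model of the relationships, not Spark source; the class shapes are assumptions):

```python
# Illustrative model of Job > Stage > RDD > Dependency (not Spark source).
class Dependency:
    """A Dependency wraps the parent RDD it points back to."""
    def __init__(self, parent_rdd):
        self.rdd = parent_rdd  # the dependency contains the parent RDD

class RDD:
    def __init__(self, name, parent=None):
        self.name = name
        # RDDs are linked by Dependencies, not by direct parent references
        self.dependencies = [Dependency(parent)] if parent else []

class Stage:
    """A Stage holds its corresponding (last) RDD."""
    def __init__(self, rdd):
        self.rdd = rdd

class Job:
    def __init__(self, stages):
        self.stages = stages

# WordCount-style lineage: textFile -> flatMap -> map -> reduceByKey
hadoop   = RDD("HadoopRDD")
words    = RDD("MapPartitionsRDD(flatMap)", parent=hadoop)
pairs    = RDD("MapPartitionsRDD(map)", parent=words)
shuffled = RDD("ShuffledRDD(reduceByKey)", parent=pairs)

job = Job([Stage(pairs), Stage(shuffled)])  # ShuffleMapStage, ResultStage
```

Walking `job.stages[1].rdd.dependencies[0].rdd` recovers the parent chain, which is exactly the "RDDs are linked by Dependencies" point above.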

 

2. Terms

-RDDs are associated via Dependency, i.e. a Dependency is a wrapper around an RDD.

  A Stage contains its corresponding RDD; likewise, a Dependency contains its parent RDD.

-A Stage is a wrapper around tasks that run the same function; it is also the unit of scheduling used by the DAGScheduler.

-job.numTasks = resultStage.numPartitions. If 'spark.default.parallelism' is not set, resultStage.numPartitions = ShuffledRDD.partitions = HadoopRDD.splits,

  so resultStage.numPartitions is determined by ShuffledRDD#getDependencies().

 TODO check in page

-ShuffleMapStage.partitions = rdd.partitions (i.e. MapPartitionsRDD[3] here; similar to ResultStage, except in the 'spark.default.parallelism' case)

-The ShuffledRDD is placed into the ResultStage rather than into the ShuffleMapStage.
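The stage cut described in these bullets can be sketched as a backward walk over the lineage that cuts at shuffle boundaries, with the ShuffledRDD staying in the downstream stage. This is a hedged plain-Python sketch of the idea, not the real DAGScheduler logic:

```python
# Illustrative stage-splitting sketch (not the real DAGScheduler).
class RDD:
    def __init__(self, name, parent=None, shuffle=False):
        self.name, self.parent, self.shuffle = name, parent, shuffle

def split_into_stages(final_rdd):
    """Return stages as lists of RDD names, ResultStage last."""
    stages, current = [], []
    rdd = final_rdd
    while rdd is not None:
        current.append(rdd.name)
        # shuffle boundary: the ShuffledRDD itself stays in the downstream
        # stage; its parent starts a new (map) stage
        if rdd.shuffle:
            stages.append(list(reversed(current)))
            current = []
        rdd = rdd.parent
    stages.append(list(reversed(current)))
    return list(reversed(stages))

# WordCount-style lineage again
hadoop   = RDD("HadoopRDD")
words    = RDD("MapPartitionsRDD(flatMap)", hadoop)
pairs    = RDD("MapPartitionsRDD(map)", words)
shuffled = RDD("ShuffledRDD", pairs, shuffle=True)

stages = split_into_stages(shuffled)
# stages[0] is the ShuffleMapStage chain; stages[-1], the ResultStage,
# contains the ShuffledRDD — matching the last bullet above.
```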

 

 

Critical concepts

-spark.default.parallelism determines how many result tasks are run; see PairRDDFunctions#reduceByKey().

-Each stage has exactly one directly corresponding RDD.

-spark.cores.max: in standalone mode, this property limits the maximum number of cores the application can use across the cluster.

-ShuffleDependency.shuffleId == ShuffleMapStage.id? Yes, i.e. ShuffleDependency : ShuffleMapStage = 1:1;

  see shuffleToMapStage(shuffleDep.shuffleId) = stage
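The 1:1 bookkeeping above amounts to a registry keyed by shuffleId, in the spirit of the scheduler's shuffleToMapStage map. A minimal plain-Python sketch (names only mimic Spark's; this is not the real implementation):

```python
# Illustrative registry mirroring shuffleToMapStage (not Spark source).
class ShuffleDependency:
    _next_id = 0
    def __init__(self):
        # each dependency gets a unique shuffleId
        self.shuffle_id = ShuffleDependency._next_id
        ShuffleDependency._next_id += 1

class ShuffleMapStage:
    def __init__(self, stage_id):
        self.id = stage_id

shuffle_to_map_stage = {}

def get_or_create_map_stage(dep):
    """One ShuffleMapStage per ShuffleDependency, looked up by shuffleId."""
    if dep.shuffle_id not in shuffle_to_map_stage:
        shuffle_to_map_stage[dep.shuffle_id] = ShuffleMapStage(dep.shuffle_id)
    return shuffle_to_map_stage[dep.shuffle_id]
```

Asking for the stage of the same dependency twice returns the same object, which is the 1:1 invariant the bullet states.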

 

 
