1、Resilient Distributed Datasets (RDDs)
Immutable, partitioned collections of objects
Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
Can be cached in memory for efficient reuse (see the sketch below)
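A minimal Scala sketch of these properties, assuming a local SparkContext and a hypothetical input path: each transformation returns a new, immutable RDD, and cache() keeps the result in memory so later actions reuse it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // RDD created from data in stable storage (the path is hypothetical)
    val lines = sc.textFile("hdfs:///data/logs.txt")

    // Parallel transformations: each one returns a new, immutable RDD
    val errors  = lines.filter(_.contains("ERROR"))
    val byLevel = errors.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _)

    // Cache the result in memory for efficient reuse
    byLevel.cache()

    println(byLevel.count())            // first action computes and caches
    byLevel.take(10).foreach(println)   // second action reuses the cached partitions

    sc.stop()
  }
}
```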
2、RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions (see the sketch below)
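A small sketch of lineage, assuming the SparkContext `sc` and the hypothetical path from the previous example: `toDebugString` prints the chain of parent RDDs that Spark replays to rebuild a lost partition.

```scala
// Assumes the SparkContext `sc` from the previous sketch; the path is hypothetical.
val lines  = sc.textFile("hdfs:///data/logs.txt")
val errors = lines.filter(_.contains("ERROR"))
val counts = errors.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _)

// The lineage (chain of parent RDDs). If a partition of `counts` is lost,
// Spark recomputes only that partition by replaying this chain.
println(counts.toDebugString)
```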
3、Aggregations on many keys with the same WHERE clause
Up to 40x faster than Hive, because:
Not re-reading unused columns or filtered records
Avoiding repeated decompression
In-memory storage of deserialized objects (see the sketch after this list)
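A hedged RDD-level sketch of why the reuse pays off, assuming the `sc` from above and a hypothetical tab-separated page-view file: the rows matching the shared WHERE-style filter are cached once as deserialized objects, and each key-based aggregation reuses them instead of re-reading and decompressing the source.

```scala
// Assumes the SparkContext `sc` from the first sketch; path and schema are hypothetical.
case class View(date: String, country: String, url: String, userId: Long)

val views = sc.textFile("hdfs:///data/page_views.tsv")
  .map(_.split("\t"))
  .map(f => View(f(0), f(1), f(2), f(3).toLong))

// Apply the shared filter once and keep the matching rows in memory
// as deserialized objects.
val recent = views.filter(_.date == "2024-01-01").cache()

// Several aggregations on different keys, all served from the cached rows.
val byCountry = recent.map(v => (v.country, 1L)).reduceByKey(_ + _)
val byUrl     = recent.map(v => (v.url, 1L)).reduceByKey(_ + _)

byCountry.collect()   // first action scans storage, filters, and caches
byUrl.collect()       // reuses the in-memory objects; no re-read, no re-decompression
```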
4、Runs on Apache Mesos to share resources with Hadoop & other apps
Can read from any Hadoop input source (e.g. HDFS); see the sketch after this list
No changes to the Scala compiler
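A brief sketch of reading Hadoop input sources, assuming the `sc` from above and hypothetical HDFS paths: `textFile` covers plain text, and `newAPIHadoopFile` accepts any Hadoop InputFormat.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Plain text from HDFS (path is hypothetical)
val logs = sc.textFile("hdfs://namenode:8020/data/logs")

// Any Hadoop InputFormat through the generic entry point
val events = sc.newAPIHadoopFile(
  "hdfs://namenode:8020/data/events",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])

println(logs.count() + events.count())
```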
5、Spark scheduler
Dryad-like DAGs
Pipelines functions within a stage
Cache-aware work reuse & locality
Partitioning-aware to avoid shuffles (see the sketch below)
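A small sketch of the partitioning-aware point, assuming the `sc` from above and two toy pair RDDs: pre-partitioning and caching one side lets later joins against it avoid re-shuffling that side.

```scala
import org.apache.spark.HashPartitioner

val users  = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val events = sc.parallelize(Seq((1L, "click"), (1L, "view"), (2L, "click")))

// Hash-partition the reused side once and keep it cached; the scheduler
// remembers this partitioning.
val partitionedUsers = users.partitionBy(new HashPartitioner(8)).cache()

// Only `events` is shuffled to match the existing partitioning; repeated
// joins against `partitionedUsers` do not re-shuffle it.
val joined = partitionedUsers.join(events)
joined.collect().foreach(println)
```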