Spark Overview

 

1. Resilient Distributed Datasets (RDDs)

Immutable, partitioned collections of objects

Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage

Can be cached in memory for efficient reuse
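
A minimal Scala sketch of these points (the HDFS path and application name are made up for illustration): an RDD is created from data in stable storage, transformed in parallel, and cached so that later actions reuse the in-memory partitions.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-overview"))

    val lines  = sc.textFile("hdfs:///data/app.log")   // RDD created from stable storage
    val errors = lines.filter(_.contains("ERROR"))     // parallel transformation
    errors.cache()                                     // keep the partitions in memory

    println(errors.count())   // first action computes and caches the RDD
    println(errors.count())   // second action reuses the cached partitions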

 

2. RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions
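
The lineage is visible from the API; as a small sketch (reusing the SparkContext sc from the example above), toDebugString prints the chain of transformations that Spark would replay to rebuild a lost partition, rather than replicating the data.

    val counts = sc.textFile("hdfs:///data/app.log")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Prints the lineage graph: textFile -> flatMap -> map -> reduceByKey
    println(counts.toDebugString)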

 

 

3. Aggregations on many keys with the same WHERE clause

Roughly 40x faster than Hive, because of:

Not re-reading unused columns or filtered records

Avoiding repeated decompression

In-memory storage of deserialized objects
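
A rough RDD-level sketch of the same pattern (the file path, column indices, and filter condition are invented): the shared filter (the "WHERE clause") is applied once, the surviving records are cached as deserialized objects, and several per-key aggregations then reuse that cached data instead of re-reading and decompressing the source files.

    val records = sc.textFile("hdfs:///data/events.csv")
      .map(_.split(","))
      .filter(fields => fields(2) == "ok")   // the shared WHERE clause, applied once
      .cache()                               // deserialized objects kept in memory

    val byCountry = records.map(f => (f(0), 1L)).reduceByKey(_ + _)   // aggregation on one key
    val byDevice  = records.map(f => (f(1), 1L)).reduceByKey(_ + _)   // aggregation on another key

    println(byCountry.count())
    println(byDevice.count())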

 

 

4. Runs on Apache Mesos to share resources with Hadoop and other apps

Can read from any Hadoop input source (e.g. HDFS)

No changes to the Scala compiler (applications are plain Scala)
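
As a self-contained sketch, both points are configuration rather than code changes: the master URL selects the Mesos cluster manager, and sc.textFile accepts any Hadoop-readable URI (the ZooKeeper address and HDFS path below are placeholders).

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("mesos-example")
      .setMaster("mesos://zk://zk1:2181/mesos")   // hypothetical Mesos master found via ZooKeeper

    val sc = new SparkContext(conf)
    val logs = sc.textFile("hdfs://namenode:9000/logs/*")   // any Hadoop input source (HDFS here)
    println(logs.count())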

 

5. Spark scheduler

Dryad-like DAGs

Pipelines functions within a stage

Cache-aware work reuse and locality

Partitioning-aware to avoid shuffles
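
A small sketch of the partitioning-aware point (file layouts and the partition count are assumptions): pre-partitioning and caching one side of a join with a known partitioner lets the scheduler see that those keys are already placed, so only the other side is shuffled; the map following each textFile is pipelined with it inside a single stage.

    import org.apache.spark.HashPartitioner

    val users = sc.textFile("hdfs:///data/users.csv")
      .map { line => val f = line.split(","); (f(0), f(1)) }   // pipelined with textFile in one stage
      .partitionBy(new HashPartitioner(16))                    // fix a known partitioning of the keys
      .cache()                                                 // cache-aware reuse across later jobs

    val events = sc.textFile("hdfs:///data/events.csv")
      .map { line => val f = line.split(","); (f(0), f(2)) }

    val joined = users.join(events)   // users is already hash-partitioned, so only events is shuffled
    println(joined.count())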

 

 

 
