`

spark-basic demo from book 'learning spark'

 
阅读更多

  after a heavy cost time(primary at download huge number of jars),the first example from book 'learning spark' is run through.

 

  the source code is very simple

/**
 * Illustrates flatMap + countByValue for wordcount.
 */
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
    def main(args: Array[String]) {
      val master = args.length match {
        case x: Int if x > 0 => args(0)
        case _ => "local"
      }
      println("**spark-home :" + System.getenv("SPARK_HOME")) //null if not set
      val sc = new SparkContext(master, "WordCount", System.getenv("SPARK_HOME"))
      val input = args.length match {
        case x: Int if x > 1 => sc.textFile(args(1))
        case _ => sc.parallelize(List("pandas", "i like pandas")) //-else generate a list as input data
      }
      val words = input.flatMap(line => line.split(" "))
      args.length match {
        case x: Int if x > 2 => {
          val counts = words.map(word => (word, 1)).reduceByKey{case (x,y) => x + y} //-same as xxByKey((x,y)=> x+y)
          counts.saveAsTextFile(args(2))
        }
        case _ => { //-else count by words number
          val wc = words.countByValue()
          println(wc.mkString(","))
        }
      }
    }
}

 

and the project outline will figure like this:



 

  below is the running logs in local mode:

JHLinMacBook:learning-spark-src userxx$ spark-submit --verbose --class com.oreilly.learningsparkexamples.scala.WordCount target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar 
Using properties file: null
Parsed arguments:
  master                  local[*]
  deployMode              null
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               com.oreilly.learningsparkexamples.scala.WordCount
  primaryResource         file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
  name                    com.oreilly.learningsparkexamples.scala.WordCount
  childArgs               []
  jars                    null
  packages                null
  repositories            null
  verbose                 true

Spark properties used, including those specified through
 --conf and those from the properties file null:
  

    
Main class:
com.oreilly.learningsparkexamples.scala.WordCount
Arguments:

System properties:
SPARK_SUBMIT -> true
spark.app.name -> com.oreilly.learningsparkexamples.scala.WordCount
spark.jars -> file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
spark.master -> local[*]
Classpath elements:
file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar


/Users/userxx/Cloud/Spark/spark-1.4.1-bin-hadoop2.4
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/09/22 23:20:13 INFO SparkContext: Running Spark version 1.4.1
2015-09-22 23:20:13.411 java[918:1903] Unable to load realm info from SCDynamicStore
15/09/22 23:20:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/22 23:20:13 WARN Utils: Your hostname, JHLinMacBook resolves to a loopback address: 127.0.0.1; using 192.168.1.144 instead (on interface en0)
15/09/22 23:20:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/09/22 23:20:13 INFO SecurityManager: Changing view acls to: userxx
15/09/22 23:20:13 INFO SecurityManager: Changing modify acls to: userxx
15/09/22 23:20:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(userxx); users with modify permissions: Set(userxx)
15/09/22 23:20:14 INFO Slf4jLogger: Slf4jLogger started
15/09/22 23:20:14 INFO Remoting: Starting remoting
15/09/22 23:20:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.144:49613]
15/09/22 23:20:14 INFO Utils: Successfully started service 'sparkDriver' on port 49613.
15/09/22 23:20:14 INFO SparkEnv: Registering MapOutputTracker
15/09/22 23:20:14 INFO SparkEnv: Registering BlockManagerMaster
15/09/22 23:20:14 INFO DiskBlockManager: Created local directory at /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/blockmgr-acf8bb83-2c13-4df4-8301-30a8a78ebcc6
15/09/22 23:20:14 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/09/22 23:20:15 INFO HttpFileServer: HTTP File server directory is /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/httpd-38c60499-b2cf-4fe1-b94f-f3d04fcae1b3
15/09/22 23:20:15 INFO HttpServer: Starting HTTP Server
15/09/22 23:20:15 INFO Utils: Successfully started service 'HTTP file server' on port 49614.
15/09/22 23:20:15 INFO SparkEnv: Registering OutputCommitCoordinator
15/09/22 23:20:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/09/22 23:20:15 INFO SparkUI: Started SparkUI at http://192.168.1.144:4040
15/09/22 23:20:15 INFO SparkContext: Added JAR file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar at http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar with timestamp 1442935215534
15/09/22 23:20:15 INFO Executor: Starting executor ID driver on host localhost
15/09/22 23:20:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49616.
15/09/22 23:20:15 INFO NettyBlockTransferService: Server created on 49616
15/09/22 23:20:15 INFO BlockManagerMaster: Trying to register BlockManager
15/09/22 23:20:15 INFO BlockManagerMasterEndpoint: Registering block manager localhost:49616 with 265.4 MB RAM, BlockManagerId(driver, localhost, 49616)
15/09/22 23:20:15 INFO BlockManagerMaster: Registered BlockManager
15/09/22 23:20:16 INFO SparkContext: Starting job: countByValue at WordCount.scala:28
15/09/22 23:20:16 INFO DAGScheduler: Registering RDD 3 (countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO DAGScheduler: Got job 0 (countByValue at WordCount.scala:28) with 1 output partitions (allowLocal=false)
15/09/22 23:20:16 INFO DAGScheduler: Final stage: ResultStage 1(countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
15/09/22 23:20:16 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
15/09/22 23:20:16 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at countByValue at WordCount.scala:28), which has no missing parents
15/09/22 23:20:16 INFO MemoryStore: ensureFreeSpace(3072) called with curMem=0, maxMem=278302556
15/09/22 23:20:16 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KB, free 265.4 MB)
15/09/22 23:20:16 INFO MemoryStore: ensureFreeSpace(1755) called with curMem=3072, maxMem=278302556
15/09/22 23:20:16 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1755.0 B, free 265.4 MB)
15/09/22 23:20:16 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:49616 (size: 1755.0 B, free: 265.4 MB)
15/09/22 23:20:16 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/09/22 23:20:16 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/09/22 23:20:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1465 bytes)
15/09/22 23:20:16 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/09/22 23:20:16 INFO Executor: Fetching http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar with timestamp 1442935215534
15/09/22 23:20:17 INFO Utils: Fetching http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar to /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/userFiles-a8ac42d1-cf52-4428-8075-2090b6ed6c85/fetchFileTemp4094197159668194166.tmp
15/09/22 23:20:17 INFO Executor: Adding file:/private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/userFiles-a8ac42d1-cf52-4428-8075-2090b6ed6c85/learning-spark-examples_2.10-0.0.1.jar to class loader
15/09/22 23:20:17 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 879 bytes result sent to driver
15/09/22 23:20:17 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 300 ms on localhost (1/1)
15/09/22 23:20:17 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/09/22 23:20:17 INFO DAGScheduler: ShuffleMapStage 0 (countByValue at WordCount.scala:28) finished in 0.332 s
15/09/22 23:20:17 INFO DAGScheduler: looking for newly runnable stages
15/09/22 23:20:17 INFO DAGScheduler: running: Set()
15/09/22 23:20:17 INFO DAGScheduler: waiting: Set(ResultStage 1)
15/09/22 23:20:17 INFO DAGScheduler: failed: Set()
15/09/22 23:20:17 INFO DAGScheduler: Missing parents for ResultStage 1: List()
15/09/22 23:20:17 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at countByValue at WordCount.scala:28), which is now runnable
15/09/22 23:20:17 INFO MemoryStore: ensureFreeSpace(2304) called with curMem=4827, maxMem=278302556
15/09/22 23:20:17 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 265.4 MB)
15/09/22 23:20:17 INFO MemoryStore: ensureFreeSpace(1371) called with curMem=7131, maxMem=278302556
15/09/22 23:20:17 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1371.0 B, free 265.4 MB)
15/09/22 23:20:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:49616 (size: 1371.0 B, free: 265.4 MB)
15/09/22 23:20:17 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
15/09/22 23:20:17 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at countByValue at WordCount.scala:28)
15/09/22 23:20:17 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/09/22 23:20:17 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1245 bytes)
15/09/22 23:20:17 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/09/22 23:20:17 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/09/22 23:20:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 11 ms
15/09/22 23:20:17 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1076 bytes result sent to driver
15/09/22 23:20:17 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 84 ms on localhost (1/1)
15/09/22 23:20:17 INFO DAGScheduler: ResultStage 1 (countByValue at WordCount.scala:28) finished in 0.084 s
15/09/22 23:20:17 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
15/09/22 23:20:17 INFO DAGScheduler: Job 0 finished: countByValue at WordCount.scala:28, took 0.702187 s
pandas -> 2,i -> 1,like -> 1
15/09/22 23:20:17 INFO SparkContext: Invoking stop() from shutdown hook
15/09/22 23:20:17 INFO SparkUI: Stopped Spark web UI at http://192.168.1.144:4040
15/09/22 23:20:17 INFO DAGScheduler: Stopping DAGScheduler
15/09/22 23:20:17 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/09/22 23:20:17 INFO Utils: path = /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/blockmgr-acf8bb83-2c13-4df4-8301-30a8a78ebcc6, already present as root for deletion.
15/09/22 23:20:17 INFO MemoryStore: MemoryStore cleared
15/09/22 23:20:17 INFO BlockManager: BlockManager stopped
15/09/22 23:20:17 INFO BlockManagerMaster: BlockManagerMaster stopped
15/09/22 23:20:17 INFO SparkContext: Successfully stopped SparkContext
15/09/22 23:20:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/09/22 23:20:17 INFO Utils: Shutdown hook called
15/09/22 23:20:17 INFO Utils: Deleting directory /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2

   as u see, i use 'verbose' mode to submit this application,so certain details are more clear.

 

tips:

  one thing u should know is that command spark-submit parameters are ugly:

spark-submit [options] <app jar | python file> [app arguments]

   so the app jar should follow the options(if any),then app args.

   e.g. submit a wordcount app to standalone-client mode spark ensemble:

 spark-submit  --master spark://gzsw-02:7077 --class org.apache.spark.examples.JavaWordCount --executor-memory 600m --total-executor-cores 16 --verbose --deploy-mode client /home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/spark-examples-1.4.1-hadoop2.4.0.jar /home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/RELEASE 2 output-result

   param-2:minpartitions; output-result:output all the wc detailed result

 

supplement

  yep,spark can use do certain 'memory-computings',though,we have already used memory to do something before,but may be not to form some theorys yet.

  • 大小: 168.7 KB
0
3
分享到:
评论

相关推荐

    spark-hive-2.11和spark-sql-以及spark-hadoop包另付下载地址

    在标题"spark-hive-2.11和spark-sql-以及spark-hadoop包另付下载地址"中,我们关注的是Spark与Hive的特定版本(2.11)的集成,以及Spark SQL和Spark对Hadoop的支持。这里的2.11可能指的是Scala的版本,因为Spark是用...

    spark-3.1.2.tgz & spark-3.1.2-bin-hadoop2.7.tgz.rar

    Spark-3.1.2.tgz和Spark-3.1.2-bin-hadoop2.7.tgz是两个不同格式的Spark发行版,分别以tar.gz和rar压缩格式提供。 1. Spark核心概念: - RDD(弹性分布式数据集):Spark的基础数据结构,是不可变、分区的数据集合...

    编译的spark-hive_2.11-2.3.0和 spark-hive-thriftserver_2.11-2.3.0.jar

    spark-hive_2.11-2.3.0 spark-hive-thriftserver_2.11-2.3.0.jar log4j-2.15.0.jar slf4j-api-1.7.7.jar slf4j-log4j12-1.7.25.jar curator-client-2.4.0.jar curator-framework-2.4.0.jar curator-recipes-2.4.0....

    spark-assembly-1.5.2-hadoop2.6.0.jar

    《Spark编程核心组件:spark-assembly-1.5.2-hadoop2.6.0.jar详解》 在大数据处理领域,Spark以其高效、易用和灵活性脱颖而出,成为了许多开发者的首选框架。Spark-assembly-1.5.2-hadoop2.6.0.jar是Spark中的一个...

    spark-3.1.3-bin-without-hadoop.tgz

    这个"spark-3.1.3-bin-without-hadoop.tgz"压缩包是Spark的3.1.3版本,不含Hadoop依赖的二进制发行版。这意味着在部署时,你需要自行配置Hadoop环境,或者在不依赖Hadoop的环境中运行Spark。 Spark的核心特性包括...

    spark-2.4.8-bin-hadoop2.7.tgz

    安装和配置Spark 2.4.8时,你需要根据你的环境调整配置文件,如`spark-env.sh`或`spark-defaults.conf`,以适应你的Hadoop集群或本地环境。在使用Spark时,你可以通过`spark-submit`命令提交应用程序,或者直接在...

    spark-2.0.0-bin-hadoop2.6.tgz

    本资源是spark-2.0.0-bin-hadoop2.6.tgz百度网盘资源下载,本资源是spark-2.0.0-bin-hadoop2.6.tgz百度网盘资源下载

    apache-doris-spark-connector-2.3_2.11-1.0.1

    Spark Doris Connector(apache-doris-spark-connector-2.3_2.11-1.0.1-incubating-src.tar.gz) Spark Doris Connector Version:1.0.1 Spark Version:2.x Scala Version:2.11 Apache Doris是一个现代MPP分析...

    spark--bin-hadoop3-without-hive.tgz

    本压缩包“spark--bin-hadoop3-without-hive.tgz”提供了Spark二进制版本,针对Hadoop 3.1.3进行了编译和打包,这意味着它已经与Hadoop 3.x兼容,但不包含Hive组件。在CentOS 8操作系统上,这个版本的Spark已经被...

    spark-1.6.0-bin-hadoop2.6.tgz

    Spark-1.6.0-bin-hadoop2.6.tgz 是针对Linux系统的Spark安装包,包含了Spark 1.6.0版本以及与Hadoop 2.6版本兼容的构建。这个安装包为在Linux环境中搭建Spark集群提供了必要的组件和库。 **1. Spark基础知识** ...

    spark-2.4.7-bin-hadoop2.6.tgz

    在解压`spark-2.4.7-bin-hadoop2.6.tgz`后,您会得到一个名为`spark-2.4.7-bin-hadoop2.6`的目录,其中包括以下组件: - `bin/`:包含可执行文件,如`spark-submit`,`pyspark`,`spark-shell`等,用于启动和管理...

    spark-3.2.4-bin-hadoop3.2-scala2.13 安装包

    在本安装包“spark-3.2.4-bin-hadoop3.2-scala2.13”中,包含了用于运行Spark的核心组件以及依赖的Hadoop版本和Scala编程语言支持。以下是对这些关键组成部分的详细解释: 1. **Spark**: Spark的核心在于它的弹性...

    spark-3.1.2-bin-hadoop3.2.tgz

    5. **交互式Shell**:Spark提供了一个名为`spark-shell`的交互式环境,方便开发人员测试和调试代码。 **Spark与Hadoop 3.2的兼容性** Hadoop 3.2引入了许多新特性,如: 1. **多命名空间**:支持多个HDFS命名空间...

    spark-2.2.2-bin-hadoop2.7.tgz

    1. `bin`:存放可执行脚本,如`spark-submit`用于提交Spark应用,`spark-shell`提供交互式Shell环境。 2. `conf`:配置文件夹,存放默认配置模板,如`spark-defaults.conf`,用户可以根据需求自定义配置。 3. `jars`...

    spark-2.3.1-bin-hadoop2.7.zip

    - `bin`:包含Spark的可执行脚本,如`spark-shell`(Scala交互式环境)、`pyspark`(Python交互式环境)和`spark-submit`(提交Spark应用)等。 - `conf`:配置文件目录,其中`spark-defaults.conf`是默认配置,可以...

    spark-3.2.1-bin-hadoop2.7.tgz

    这个名为"spark-3.2.1-bin-hadoop2.7.tgz"的压缩包是Spark的一个特定版本,即3.2.1,与Hadoop 2.7版本兼容。在Linux环境下,这样的打包方式方便用户下载、安装和运行Spark。 Spark的核心设计理念是快速数据处理,...

    spark-3.2.0-bin-hadoop3.2.tgz

    这个压缩包"spark-3.2.0-bin-hadoop3.2.tgz"包含了Spark 3.2.0版本的二进制文件,以及针对Hadoop 3.2的兼容构建。 Spark的核心组件包括:Spark Core、Spark SQL、Spark Streaming、MLlib(机器学习库)和GraphX(图...

    spark-3.1.2-bin-hadoop2.7.tar

    spark-3.1.2-bin-hadoop2.7.tar

    spark-2.4.0-bin-hadoop2.7.tgz

    然后,你可以通过`spark-submit`命令提交Spark作业到集群,或者使用`pyspark`或`spark-shell`启动交互式环境。 在实际应用中,Spark常被用于大数据分析、实时数据处理、机器学习模型训练和图数据分析。由于其内存...

    spark-2.1.0-bin-without-hadoop版本的压缩包,直接下载到本地解压后即可使用

    在Ubuntu里安装spark,spark-2.1.0-bin-without-hadoop该版本直接下载到本地后解压即可使用。 Apache Spark 是一种用于大数据工作负载的分布式开源处理系统。它使用内存中缓存和优化的查询执行方式,可针对任何规模...

Global site tag (gtag.js) - Google Analytics