These Things About Spark (Part 1): Setting Up a Spark Development Environment on Windows

 

1. First, prepare the software to install
scala-2.10.4
Download: http://www.scala-lang.org/download/2.10.4.html
scala-SDK-4.4.1-vfinal-2.11-win32.win32.x86_64
Download: http://scala-ide.org/
spark-1.6.2-bin-hadoop2.6
Download: http://spark.apache.org/

You will also need a JDK, of course, but I won't cover that here.
After downloading scala-2.10.4, simply run the installer.

After downloading scala-SDK-4.4.1-vfinal-2.11-win32.win32.x86_64 and spark-1.6.2-bin-hadoop2.6, just extract them.

The extracted Scala IDE is essentially an Eclipse distribution, which everyone is already familiar with. Open the IDE, locate spark-assembly-1.6.2-hadoop2.6.0 in the lib folder of the extracted Spark package, and add it to the project's build path.
With that, the environment is ready.
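
(As an optional aside: if you would rather have a build tool resolve the dependency than add the assembly jar by hand, a minimal sbt definition along the following lines should also work. This is only a sketch under the assumption that you use sbt and pull spark-core 1.6.2 for Scala 2.10 from Maven Central; it is not the setup used in this article.)

// build.sbt (hypothetical alternative to adding spark-assembly-1.6.2-hadoop2.6.0 manually)
name := "spark-wordcount"

version := "0.1"

scalaVersion := "2.10.4"

// Spark 1.6.2 was built against Scala 2.10, so %% resolves to spark-core_2.10
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2"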
Now let's write a small test program to try it out:

package com.day1.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {

    // Run locally in a single thread; good enough for a quick IDE test.
    val conf = new SparkConf()
    conf.setAppName("My first Spark App")
    conf.setMaster("local")

    val sc = new SparkContext(conf)

    // Read the input file as a single partition.
    val lines = sc.textFile("D:\\test.txt", 1)

    // Split each line into words on spaces.
    val words = lines.flatMap { line => line.split(" ") }

    // Pair each word with an initial count of 1.
    val pairs = words.map { word => (word, 1) }

    // Sum the counts per word.
    val wordCounts = pairs.reduceByKey(_ + _)

    wordCounts.foreach(wordNumberPair => println(wordNumberPair._1 + ":" + wordNumberPair._2))

    sc.stop()
  }
}

Right-click the file and run it:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/07/18 19:07:04 INFO SparkContext: Running Spark version 1.6.2
16/07/18 19:07:13 INFO SecurityManager: Changing view acls to: qinchaofeng
16/07/18 19:07:13 INFO SecurityManager: Changing modify acls to: qinchaofeng
16/07/18 19:07:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(qinchaofeng); users with modify permissions: Set(qinchaofeng)
16/07/18 19:07:14 INFO Utils: Successfully started service 'sparkDriver' on port 50513.
16/07/18 19:07:14 INFO Slf4jLogger: Slf4jLogger started
16/07/18 19:07:14 INFO Remoting: Starting remoting
16/07/18 19:07:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.108.207:50526]
16/07/18 19:07:14 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 50526.
16/07/18 19:07:14 INFO SparkEnv: Registering MapOutputTracker
16/07/18 19:07:14 INFO SparkEnv: Registering BlockManagerMaster
16/07/18 19:07:14 INFO DiskBlockManager: Created local directory at C:\Users\qinchaofeng\AppData\Local\Temp\blockmgr-25c047c4-505d-4cfb-addd-d777e5a430d7
16/07/18 19:07:14 INFO MemoryStore: MemoryStore started with capacity 797.6 MB
16/07/18 19:07:14 INFO SparkEnv: Registering OutputCommitCoordinator
16/07/18 19:07:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/07/18 19:07:15 INFO SparkUI: Started SparkUI at http://192.168.108.207:4040
16/07/18 19:07:15 INFO Executor: Starting executor ID driver on host localhost
16/07/18 19:07:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50533.
16/07/18 19:07:15 INFO NettyBlockTransferService: Server created on 50533
16/07/18 19:07:15 INFO BlockManagerMaster: Trying to register BlockManager
16/07/18 19:07:15 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50533 with 797.6 MB RAM, BlockManagerId(driver, localhost, 50533)
16/07/18 19:07:15 INFO BlockManagerMaster: Registered BlockManager
16/07/18 19:07:15 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 153.6 KB)
16/07/18 19:07:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 167.5 KB)
16/07/18 19:07:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:50533 (size: 13.9 KB, free: 797.6 MB)
16/07/18 19:07:15 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:15
16/07/18 19:07:16 WARN : Your hostname, qinchaofeng1 resolves to a loopback/non-reachable address: fe80:0:0:0:0:5efe:c0a8:6ccf%12, but we couldn't find any external IP address!
16/07/18 19:07:24 INFO FileInputFormat: Total input paths to process : 1
16/07/18 19:07:24 INFO SparkContext: Starting job: foreach at WordCount.scala:23
16/07/18 19:07:24 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:19)
16/07/18 19:07:24 INFO DAGScheduler: Got job 0 (foreach at WordCount.scala:23) with 1 output partitions
16/07/18 19:07:24 INFO DAGScheduler: Final stage: ResultStage 1 (foreach at WordCount.scala:23)
16/07/18 19:07:24 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/07/18 19:07:24 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/07/18 19:07:24 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:19), which has no missing parents
16/07/18 19:07:24 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.1 KB, free 171.6 KB)
16/07/18 19:07:25 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 173.8 KB)
16/07/18 19:07:25 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:50533 (size: 2.3 KB, free: 797.6 MB)
16/07/18 19:07:25 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/07/18 19:07:25 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:19)
16/07/18 19:07:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/07/18 19:07:25 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2108 bytes)
16/07/18 19:07:25 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/07/18 19:07:25 INFO HadoopRDD: Input split: file:/D:/test.txt:0+214
16/07/18 19:07:25 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/07/18 19:07:25 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/07/18 19:07:25 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/07/18 19:07:25 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/07/18 19:07:25 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/07/18 19:07:25 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2253 bytes result sent to driver
16/07/18 19:07:25 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 156 ms on localhost (1/1)
16/07/18 19:07:25 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/07/18 19:07:25 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:19) finished in 0.156 s
16/07/18 19:07:25 INFO DAGScheduler: looking for newly runnable stages
16/07/18 19:07:25 INFO DAGScheduler: running: Set()
16/07/18 19:07:25 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/07/18 19:07:25 INFO DAGScheduler: failed: Set()
16/07/18 19:07:25 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount.scala:21), which has no missing parents
16/07/18 19:07:25 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.5 KB, free 176.3 KB)
16/07/18 19:07:25 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1583.0 B, free 177.9 KB)
16/07/18 19:07:25 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:50533 (size: 1583.0 B, free: 797.6 MB)
16/07/18 19:07:25 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/07/18 19:07:25 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount.scala:21)
16/07/18 19:07:25 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/07/18 19:07:25 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,NODE_LOCAL, 1894 bytes)
16/07/18 19:07:25 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/07/18 19:07:25 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/07/18 19:07:25 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 16 ms
avg_flow:1
avg_flow_fee:1
local_city_nm:1
cust_num:1
birthday:1
Gender:1
cust_nm:1
join_net_time:1
avg_call_fee:1
net_age:1
prov_code:1
cmpgn:1
buy_time:1
using_meal:1
chnl:1
arpu:1
term_brand:1
term_model:1
avg_call_times:1
flow_pkg:1
16/07/18 19:07:25 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver
16/07/18 19:07:25 INFO DAGScheduler: ResultStage 1 (foreach at WordCount.scala:23) finished in 0.062 s
16/07/18 19:07:25 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 62 ms on localhost (1/1)
16/07/18 19:07:25 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
16/07/18 19:07:25 INFO DAGScheduler: Job 0 finished: foreach at WordCount.scala:23, took 0.356812 s
16/07/18 19:07:25 INFO SparkUI: Stopped Spark web UI at http://192.168.108.207:4040
16/07/18 19:07:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/07/18 19:07:25 INFO MemoryStore: MemoryStore cleared
16/07/18 19:07:25 INFO BlockManager: BlockManager stopped
16/07/18 19:07:25 INFO BlockManagerMaster: BlockManagerMaster stopped
16/07/18 19:07:25 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/07/18 19:07:25 INFO SparkContext: Successfully stopped SparkContext
16/07/18 19:07:25 INFO ShutdownHookManager: Shutdown hook called
16/07/18 19:07:25 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/07/18 19:07:25 INFO ShutdownHookManager: Deleting directory C:\Users\qinchaofeng\AppData\Local\Temp\spark-c54f4cb1-7351-47d1-9fd4-fbdac0fae228

As you can see, the job ran successfully!
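
(The word:count lines mixed in with the INFO messages above are the output of the foreach call. As a small variation of my own, not part of the original program, you could also sort the counts by frequency and print only the most common words on the driver. The sketch below would go just before sc.stop() in the program above; sortBy and take are standard RDD operations available in Spark 1.6.)

// Hypothetical follow-up: sort by count on the cluster, then take the top 10 on the driver.
val topWords = wordCounts
  .sortBy(_._2, ascending = false)
  .take(10)

topWords.foreach { case (word, count) => println(word + ": " + count) }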

Source code

Articles in the "These Things About Spark" series:
These Things About Spark (Part 1): Setting Up a Spark Development Environment on Windows
These Things About Spark (Part 2): A Few Concepts
These Things About Spark (Part 3): Commonly Used Spark Transformations and Actions



