bit1129

[Spark 76] Saving Spark Computation Results to MySQL

Blog category: Spark

package spark.examples.db

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.{SparkConf, SparkContext}

object SparkMySQLIntegration {

  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkRDDCount").setMaster("local")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(List(("Tom", 31), ("Jack", 22), ("Mary", 25)))

    // Called once per partition, on the worker that holds that partition
    def func(iter: Iterator[(String, Int)]): Unit = {
//      Class.forName("com.mysql.jdbc.Driver ")
      var conn: Connection = null
      var pstmt: PreparedStatement = null
      try {
        val url = "jdbc:mysql://localhost:3306/person"
        val user = "root"
        val password = ""
        // Open the connection inside the foreachPartition function, so that it is opened on the worker
        conn = DriverManager.getConnection(url, user, password)
        // Prepare the statement once per partition and reuse it for every row
        val sql = "insert into TBL_PERSON(name, age) values (?, ?)"
        pstmt = conn.prepareStatement(sql)
        while (iter.hasNext) {
          val item = iter.next()
          println(item._1 + "," + item._2)
          pstmt.setString(1, item._1)
          pstmt.setInt(2, item._2)
          pstmt.executeUpdate()
        }
      } catch {
        case e: Exception => e.printStackTrace()
      } finally {
        if (pstmt != null) {
          pstmt.close()
        }
        if (conn != null) {
          conn.close()
        }
      }
    }
    data.foreachPartition(func)
    sc.stop()
  }

}

I ran into two pitfalls with this code:

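1. Following the usual habit of Java programmers using JDBC, I first registered the MySQL JDBC driver with Class.forName("com.mysql.jdbc.Driver "). In the Scala program this turned out to be unnecessary, and it even failed with a ClassNotFoundException, although com.mysql.jdbc.Driver was clearly on the classpath. Note that the string literal as quoted ends with a space, which on its own is enough to cause this exception. With a JDBC 4 driver the call is unnecessary in either language, because DriverManager discovers drivers on the classpath automatically.

So that line is commented out in the code.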
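
The effect of a stray space in a class name can be checked without a database. The following is a minimal sketch; the object name DriverNameCheck and the method canLoad are hypothetical helpers introduced for illustration, and java.lang.String stands in for the driver class so that nothing extra is needed on the classpath.

```scala
import scala.util.Try

// Hypothetical helper: reports whether a class name can be loaded by Class.forName.
object DriverNameCheck {
  def canLoad(className: String): Boolean =
    Try(Class.forName(className)).isSuccess

  def main(args: Array[String]): Unit = {
    // The exact name loads; the same name with a trailing space does not.
    println(canLoad("java.lang.String"))
    println(canLoad("java.lang.String "))
  }
}
```

If the same check fails for the driver name only when the space is present, the space is the cause rather than anything specific to Scala.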

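2. When I ran this code locally, it kept failing with a SQL syntax error near (?, ?). I read the statement over and over without finding anything wrong, and eventually discovered that I had written pstmt.executeUpdate(sql) instead of pstmt.executeUpdate(). IntelliJ IDEA reported nothing, because this is not a compile error: PreparedStatement inherits executeUpdate(String) from Statement, and that overload sends the raw SQL text, with its unbound ? placeholders, straight to the server.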
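
Why the compiler accepts the one-argument call can be seen with a little reflection. This is only an illustrative sketch; the object name InheritedExecuteUpdate is made up for the example, and it uses nothing beyond the standard java.sql interfaces.

```scala
import java.sql.{PreparedStatement, Statement}

// Hypothetical helper: looks up where executeUpdate(String) is declared.
object InheritedExecuteUpdate {
  // Name of the interface that declares executeUpdate(String),
  // resolved through PreparedStatement.
  val declaringInterface: String =
    classOf[PreparedStatement]
      .getMethod("executeUpdate", classOf[String])
      .getDeclaringClass
      .getName

  // Every PreparedStatement is also a Statement.
  val isSubtype: Boolean =
    classOf[Statement].isAssignableFrom(classOf[PreparedStatement])

  def main(args: Array[String]): Unit = {
    println(declaringInterface)
    println(isSubtype)
  }
}
```

Since the method is visible on every PreparedStatement, only the runtime SQL error reveals the mistake.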

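Best practices for writing Spark RDDs to MySQL and other storage systems

Writing a Spark RDD to a data store, whether a relational database such as MySQL or a NoSQL store such as MongoDB or HBase, puts considerable pressure on that store, because each partition of each RDD can hold a very large amount of data. The limited number of connections to the storage server therefore has to be used sparingly. Here are some best practices: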

You can write your own custom writer and call a transform on your RDD to write each element to a database of your choice, but there's a lot of ways to write something that looks like it would work, but does not work well in a distributed environment. Here are some things to watch out for:
A common naive mistake is to open a connection on the Spark driver program, and then try to use that connection on the Spark workers. The connection should be opened on the Spark worker, such as by calling foreachPartition and opening the connection inside that function.
Use partitioning to control the parallelism for writing to your data storage. Your data storage may not support too many concurrent connections.
Use batching for writing out multiple objects at a time if batching is optimal for your data storage.
Make sure your write mechanism is resilient to failures.
Writing out a very large dataset can take a long time, which increases the chance something can go wrong - a network failure, etc.
Consider utilizing a static pool of database connections on your Spark workers.
If you are writing to a sharded data storage, partition your RDD to match your sharding strategy. That way each of your Spark workers only connects to one database shard, rather than each Spark worker connecting to every database shard.
Be cautious when writing out so much data, and make sure you understand the distributed nature of Spark!
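
The "static pool" advice can be sketched roughly as follows. Everything here is an assumption made for illustration: SimplePool and MySqlPool are hypothetical names, the pool size of 4 is arbitrary, and connection validation and shutdown are left out. A real project would more likely rely on an existing connection pooling library. The idea is only that a Scala object is initialized once per JVM, so each executor ends up with its own pool.

```scala
import java.sql.{Connection, DriverManager}
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical minimal pool: creates at most maxSize resources and reuses them.
final class SimplePool[A](maxSize: Int, create: () => A) {
  private val idle = new LinkedBlockingQueue[A]()
  private var created = 0

  // Reuse an idle resource if there is one, create a new one while under
  // maxSize, otherwise block until another thread releases one.
  def borrow(): A = Option(idle.poll()).getOrElse {
    val mayCreate = synchronized {
      if (created < maxSize) { created += 1; true } else false
    }
    if (mayCreate) {
      try create()
      catch {
        // Creation failed: give the slot back before rethrowing.
        case e: Throwable => synchronized { created -= 1 }; throw e
      }
    } else idle.take()
  }

  def release(resource: A): Unit = idle.put(resource)

  // Borrow a resource, run f, and always return the resource to the pool.
  def withResource[B](f: A => B): B = {
    val resource = borrow()
    try f(resource) finally release(resource)
  }
}

// One pool per executor JVM. URL and credentials are the ones used in the
// example code; the size is an arbitrary assumption.
object MySqlPool {
  lazy val pool: SimplePool[Connection] = new SimplePool[Connection](
    4,
    () => DriverManager.getConnection("jdbc:mysql://localhost:3306/person", "root", "")
  )
}
```

Inside foreachPartition, the partition function would then call MySqlPool.pool.withResource { conn => ... } instead of opening and closing a connection itself. To cap the number of concurrent writers, the RDD can be reduced to fewer partitions first, for example data.coalesce(2).foreachPartition(func).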

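**Batching is mentioned above. It should be a very effective way to save connection resources: group multiple updates or inserts into one batch and send them to the storage engine over a single connection. The batch operations of MySQL and MongoDB are worth a closer look.**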
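
For MySQL, batching through JDBC might look like the sketch below. BatchWriter, writePartition, the connect parameter and the default batch size of 500 are all hypothetical choices made for this example; only addBatch and executeBatch are standard JDBC calls. The connection is passed in as a factory function so that it is still opened on the worker.

```scala
import java.sql.Connection

// Hypothetical helper: writes one partition using JDBC batches.
object BatchWriter {
  def writePartition(rows: Iterator[(String, Int)],
                     connect: () => Connection,
                     batchSize: Int = 500): Unit = {
    val conn = connect()
    try {
      val pstmt = conn.prepareStatement("insert into TBL_PERSON(name, age) values (?, ?)")
      try {
        // Split the partition into chunks of at most batchSize rows.
        rows.grouped(batchSize).foreach { chunk =>
          chunk.foreach { case (name, age) =>
            pstmt.setString(1, name)
            pstmt.setInt(2, age)
            pstmt.addBatch() // queue the row instead of sending it immediately
          }
          pstmt.executeBatch() // send the queued rows together
        }
      } finally pstmt.close()
    } finally conn.close()
  }
}
```

With the RDD from the example, this could be called as data.foreachPartition(iter => BatchWriter.writePartition(iter, () => DriverManager.getConnection(url, user, password))). How much a batch actually saves depends on the driver. With MySQL Connector/J, the rewriteBatchedStatements=true connection property is the setting usually enabled so that batched inserts are rewritten into multi-row statements.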


2015-02-21 19:37