MongoDB: Hadoop Integration 2

ylzhj02

Blog category: MongoDB

Prepare ENV

Download mongo-java-driver from http://central.maven.org/maven2/org/mongodb/mongo-java-driver

Compile mongo-hadoop-connector for Hadoop 2.3.0.

Edit build.gradle to set the Hadoop version to 2.3 and delete the related tasks that download dependencies, then build the jars:

#./gradlew jar

Distribute the jars above to the Hadoop cluster's nodes:

#cp core/build/libs/mongo-hadoop-core-1.2.1-SNAPSHOT-hadoop_2.3.jar hadoop-2.3.0/share/hadoop/common/lib

#cp mongo-java-driver-2.12.2.jar hadoop-2.3.0/share/hadoop/common/lib

Alternatively, add the two jars above to the project's lib directory and add them to the build path.

Note: the destination directory is hadoop-2.3.0/share/hadoop/common/lib, not hadoop-2.3.0/lib.
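
The distribution step can be sketched as a small script. This is only a dry run that prints the scp commands for review. It assumes passwordless SSH, that Hadoop is unpacked under the remote user's home directory, and that the nodes are called node1 and node2 (hypothetical names, not from the original setup).

```shell
# Dry-run sketch: print the scp commands that would copy both jars to each node.
# HADOOP_LIB is the destination used above; the node names are hypothetical.
HADOOP_LIB="hadoop-2.3.0/share/hadoop/common/lib"
JARS="core/build/libs/mongo-hadoop-core-1.2.1-SNAPSHOT-hadoop_2.3.jar mongo-java-driver-2.12.2.jar"

print_copy_commands() {
    # $@: host names of the cluster nodes
    for node in "$@"; do
        for jar in $JARS; do
            echo "scp $jar $node:$HADOOP_LIB/"
        done
    done
}

print_copy_commands node1 node2
```

Once the printed commands look right, they can be run from the mongo-hadoop source directory, for example by piping the output to sh.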

Run MongoDB Hadoop examples

#cd /path/to/mongodb-hadoop

#./gradlew historicalYield

The above command downloads Hadoop and installs it. However, I want to run the example on my existing Hadoop cluster.

To see what this command actually does, first find the files that define it:

#find . |xargs grep 'historicalYield' -sl

task historicalYield(dependsOn: 'configureCluster') << {
    exec() {
        commandLine "mongoimport", "-d", "mongo_hadoop", "-c", "yield_historical.in", "--drop",
                    "examples/treasury_yield/src/main/resources/yield_historical_in.json"
    }

    hadoop("examples/treasury_yield/build/libs/treasury_yield-${project(':core').version}-hadoop_${hadoop_version}.jar",
           "com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig", [
                "mongo.input.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.in",
                "mongo.output.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.out"
           ])
}

task configureCluster(dependsOn: ['copyFiles']) << {
}

task copyFiles(dependsOn: ['installHadoop', 'installHive']) << {
    def hadoopEtc
    def hadoopLib
    if (hadoop_version.startsWith("1")) {
        hadoopLib = "${hadoopHome}/lib"
        hadoopEtc = "${hadoopHome}/conf"
    } else {
        hadoopLib = "${hadoopHome}/share/hadoop/common"
        hadoopEtc = "${hadoopHome}/etc/hadoop"
    }

    println "Updating mongo jars"
    copy {
        from "core/build/libs/mongo-hadoop-core-${project(':core').version}-hadoop_${hadoop_version}.jar"
        into hadoopLib
        rename { "mongo-hadoop-core.jar" }
    }
    copy {
        from "hive/build/libs/mongo-hadoop-hive-${project(':core').version}-hadoop_${hadoop_version}.jar"
        into hiveHome + '/lib'
        rename { "mongo-hadoop-hive.jar" }
    }
    download {
        src "http://central.maven.org/maven2/org/mongodb/mongo-java-driver/${javaDriverVersion}/mongo-java-driver-${javaDriverVersion}.jar"
        dest "${hadoopLib}/mongo-java-driver.jar"
        onlyIfNewer true
    }
    println "Updating cluster configuration"
    copy {
        from 'clusterConfigs'
        into hadoopEtc
    }
}

def hadoop(jar, className, args) {
    def line = ["${hadoopHome}/bin/hadoop",
                "jar", jar, className,
                //Split settings
                "-Dmongo.input.split_size=8",
                "-Dmongo.job.verbose=true",
    ]
    args.each {
        line << "-D${it}"
    }
    println "Executing hadoop job:\n ${line.join(' \\\n\t')}"
    def hadoopEnv = [:]
    if (hadoop_version.startsWith("cdh")) {
        hadoopEnv.MAPRED_DIR = 'share/hadoop/mapreduce2'
    }
    exec() {
        environment << hadoopEnv
        commandLine line
    }
}

----------------------------

Now that the details behind the task are clear, we can run the example by hand.

1. Load the sample data into MongoDB

#mongoimport -d mongo_hadoop -c yield_historical.in --drop <examples/treasury_yield/src/main/resources/yield_historical_in.json
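
Before starting the job it may help to confirm that the import worked. The following is a minimal sketch, assuming the legacy mongo shell is on the PATH and MongoDB listens on localhost:27017 as in the URIs used here. It prints the number of documents in the input collection, or only shows the command if the shell is not installed.

```shell
# Sketch: count the documents that mongoimport loaded.
DB="mongo_hadoop"
COLL="yield_historical.in"
# getCollection() is used because the collection name contains a dot.
JS="print(db.getCollection('$COLL').count())"

if command -v mongo >/dev/null 2>&1; then
    mongo --quiet --eval "$JS" "$DB" || echo "mongo shell could not run the check"
else
    echo "mongo shell not found; the check would be: mongo --quiet --eval \"$JS\" $DB"
fi
```

Setting COLL to yield_historical.out runs the same check against the output collection after the Hadoop job has finished.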

2. Run the example

#cd mongo-hadoop/examples/treasury_yield/build/libs

#hadoop jar treasury_yield-1.2.1-SNAPSHOT-hadoop_2.3.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfigV2 -Dmongo.input.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.in -Dmongo.output.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.out -Dmongo.input.split_size=8 -Dmongo.job.verbose=true
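
The long command above follows the same pattern as the hadoop() helper in build.gradle: every key=value setting becomes a -D option after the class name. A small sketch of that assembly in shell is shown below. The function name build_hadoop_cmd is made up for illustration, and the script only prints the command instead of running it.

```shell
# Sketch: assemble the hadoop command line from a list of key=value settings,
# mirroring what the Gradle hadoop() helper does. build_hadoop_cmd is a
# hypothetical helper name; nothing is executed, the command is only printed.
JAR="treasury_yield-1.2.1-SNAPSHOT-hadoop_2.3.jar"
MAIN="com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfigV2"

build_hadoop_cmd() {
    # $@: settings in key=value form, each turned into a -D option
    cmd="hadoop jar $JAR $MAIN"
    for setting in "$@"; do
        cmd="$cmd -D$setting"
    done
    echo "$cmd"
}

build_hadoop_cmd \
    "mongo.input.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.in" \
    "mongo.output.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.out" \
    "mongo.input.split_size=8" \
    "mongo.job.verbose=true"
```

Keeping the settings in a list like this makes it easier to point the job at a different MongoDB host or collection without retyping the whole line.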



2014-06-09 16:11