sillycat
Data Solution 2019(4)Investigate my Data in Spark

 

Create an HDFS directory to store my CSV files
> hdfs dfs -mkdir hdfs://localhost:9000/meeting
Put the files there
> hdfs dfs -put ./meetings_only.csv hdfs://localhost:9000/meeting/meetings.csv
> hdfs dfs -put ./users_only.csv hdfs://localhost:9000/meeting/users.csv
> hdfs dfs -ls hdfs://localhost:9000/meeting/
Found 2 items
-rw-r--r--   1 carl supergroup   72937014 2019-02-25 15:41 hdfs://localhost:9000/meeting/meetings.csv
-rw-r--r--   1 carl supergroup   67428023 2019-02-25 15:42 hdfs://localhost:9000/meeting/users.csv
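The SQL query later in this post joins against `meetings` and `users` temp views, but only the JSON loads are shown. Here is a hedged sketch of how the meetings CSV uploaded above could be registered as the `meetings` view; the header and inferSchema options are assumptions about the file format, not from the original post:

```scala
// Hypothetical: register the uploaded meetings CSV as the "meetings" temp view.
// Assumes the CSV has a header row; inferSchema scans the data to guess types.
val meetingsCsvDF = sqlContext.read
    .option("header", true)
    .option("inferSchema", true)
    .csv("hdfs://localhost:9000/meeting/meetings.csv")
// Lowercase the column names, matching the pattern used elsewhere in this post
val meetingsDF = meetingsCsvDF.toDF(meetingsCsvDF.columns map(_.toLowerCase): _*)
meetingsDF.createOrReplaceTempView("meetings")
```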
After I restarted the Docker container, my HDFS data was gone. I had forgotten to map the data directory out of the container…..
Solution:
Map the HDFS data directories to the host (this is the docker run volume option, written here in Makefile syntax):
-v $(shell pwd)/hadoop/hdfs/namenode:/data/hadoop/hdfs/namenode -v $(shell pwd)/hadoop/hdfs/datanode:/data/hadoop/hdfs/datanode
Add this configuration to hdfs-site.xml so HDFS stores its data under the mapped directories
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hadoop/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hadoop/hdfs/datanode</value>
    </property>
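For context, a sketch of how the volume mappings fit into the docker run command; the image name and port list are assumptions for illustration, not from the original post:

```shell
# Hypothetical docker run showing the volume mappings in context.
# "my-hadoop-image" and the ports are placeholders, not the actual setup.
docker run -d --name hadoop \
    -p 9000:9000 -p 9870:9870 \
    -v $(pwd)/hadoop/hdfs/namenode:/data/hadoop/hdfs/namenode \
    -v $(pwd)/hadoop/hdfs/datanode:/data/hadoop/hdfs/datanode \
    my-hadoop-image
```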
Overwrite the file if it already exists, with -put -f
> hdfs dfs -put -f ./elasticsearch_meeting_uuids.txt hdfs://localhost:9000/meeting/meetings_search.csv
https://creativedata.atlassian.net/wiki/spaces/SAP/pages/53411954/Spark+Scala+-+Read+Write+files+from+HDFS
Load data from JSON
val meetingsearchsRawDF = sqlContext.read.json("hdfs://localhost:9000/meetings3/meetings_search.json")
val meetingsearchsDF = meetingsearchsRawDF.toDF(meetingsearchsRawDF.columns map(_.toLowerCase): _*)
meetingsearchsDF.printSchema()
meetingsearchsDF.createOrReplaceTempView("meetingsearchs")
Sometimes a JSON object in the file spans multiple lines; in that case, enable the multiLine option
val usersRawDF = sqlContext.read
    .option("multiLine", true)
    .json("hdfs://localhost:9000/users/dupedusers.json")
val usersDF = usersRawDF.toDF(usersRawDF.columns map(_.toLowerCase): _*)
usersDF.printSchema()
usersDF.createOrReplaceTempView("users")
Generate the Output File
val meetingsToDeleteDF = sqlContext.sql("""
SELECT
    'DELETED' as action,
    meetingsearchs.pinRequired as pinRequired,
    meetingsearchs.lastCallDate as lastCallDate
FROM meetingsearchs
LEFT OUTER JOIN meetings on meetingsearchs.uuid = meetings.uuid
LEFT OUTER JOIN users on meetingsearchs.owneruuid = users.uuid
WHERE meetings.uuid IS NULL and users.uuid IS NULL""")
meetingsToDeleteDF.show(2)
meetingsToDeleteDF.coalesce(1).write.csv("hdfs://localhost:9000/meeting3/uuids_to_delete.csv")
meetingsToDeleteDF.repartition(1).write.json("hdfs://localhost:9000/meetings3/meetings_to_delete.json")
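The same LEFT OUTER JOIN … IS NULL filter can also be expressed with the DataFrame API as two left-anti joins. A sketch, assuming a `meetingsDF` DataFrame registered for the meetings view (hypothetical here, since only the JSON loads are shown in this post):

```scala
// Sketch: DataFrame-API equivalent of the SQL anti-join above.
// Keeps only rows from meetingsearchs with no match in meetings or users.
import org.apache.spark.sql.functions.lit

val orphanedDF = meetingsearchsDF
    .join(meetingsDF, Seq("uuid"), "left_anti")
    .join(usersDF, col("owneruuid") === usersDF("uuid"), "left_anti")
    .withColumn("action", lit("DELETED"))
```

left_anti returns only the left side's columns, which is exactly what we want before writing the delete list.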
Fetch the File from HDFS to the Local Disk
hdfs dfs -get hdfs://localhost:9000/meeting2/uuids_to_delete.csv ./uuids_to_delete.csv
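Note that Spark's write.csv produces a directory containing part-* files, not a single CSV. If a single local file is needed, -getmerge can concatenate the parts; a sketch using the output path from above:

```shell
# write.csv created a directory of part-*.csv files;
# -getmerge concatenates them into one local file.
hdfs dfs -getmerge hdfs://localhost:9000/meeting3/uuids_to_delete.csv ./uuids_to_delete.csv
```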
There is still some issue with my Docker HDFS: my files are lost after a reboot and namenode format.
I saw error messages like the following:
mesg: ttyname failed: Inappropriate ioctl for device

References:
http://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
http://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-common/core-default.xml
https://stackoverflow.com/questions/2358402/where-hdfs-stores-files-locally-by-default
https://creativedata.atlassian.net/wiki/spaces/SAP/pages/53411954/Spark+Scala+-+Read+Write+files+from+HDFS
