sillycat
Data Solution 2019(4)Investigate my Data in Spark

 

Create an HDFS directory to store my CSV files
> hdfs dfs -mkdir hdfs://localhost:9000/meeting
Put the files there
> hdfs dfs -put ./meetings_only.csv hdfs://localhost:9000/meeting/meetings.csv
> hdfs dfs -put ./users_only.csv hdfs://localhost:9000/meeting/users.csv
> hdfs dfs -ls hdfs://localhost:9000/meeting/
Found 2 items
-rw-r--r--   1 carl supergroup   72937014 2019-02-25 15:41 hdfs://localhost:9000/meeting/meetings.csv
-rw-r--r--   1 carl supergroup   67428023 2019-02-25 15:42 hdfs://localhost:9000/meeting/users.csv
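The SQL query later in this post joins against `meetings` and `users` temp views, but only the JSON loads are shown. Here is a hedged sketch of how the meetings CSV uploaded above could be registered as the `meetings` view; the header and inferSchema options are assumptions about the file format, not from the original post:

```scala
// Hypothetical: register the uploaded meetings CSV as the "meetings" temp view.
// Assumes the CSV has a header row; inferSchema scans the data to guess types.
val meetingsCsvDF = sqlContext.read
    .option("header", true)
    .option("inferSchema", true)
    .csv("hdfs://localhost:9000/meeting/meetings.csv")
// Lowercase the column names, matching the pattern used elsewhere in this post
val meetingsDF = meetingsCsvDF.toDF(meetingsCsvDF.columns map(_.toLowerCase): _*)
meetingsDF.createOrReplaceTempView("meetings")
```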
After I restarted the Docker container, my HDFS data was gone. I had forgotten to map the data directory out of the container…..
Solution:
Map the HDFS data directories to the host (this is the docker run volume option, written here in Makefile syntax):
-v $(shell pwd)/hadoop/hdfs/namenode:/data/hadoop/hdfs/namenode -v $(shell pwd)/hadoop/hdfs/datanode:/data/hadoop/hdfs/datanode
Add this configuration to hdfs-site.xml so HDFS stores its data under the mapped directories
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hadoop/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hadoop/hdfs/datanode</value>
    </property>
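For context, a sketch of how the volume mappings fit into the docker run command; the image name and port list are assumptions for illustration, not from the original post:

```shell
# Hypothetical docker run showing the volume mappings in context.
# "my-hadoop-image" and the ports are placeholders, not the actual setup.
docker run -d --name hadoop \
    -p 9000:9000 -p 9870:9870 \
    -v $(pwd)/hadoop/hdfs/namenode:/data/hadoop/hdfs/namenode \
    -v $(pwd)/hadoop/hdfs/datanode:/data/hadoop/hdfs/datanode \
    my-hadoop-image
```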
Overwrite the file if it already exists, with -put -f
> hdfs dfs -put -f ./elasticsearch_meeting_uuids.txt hdfs://localhost:9000/meeting/meetings_search.csv
https://creativedata.atlassian.net/wiki/spaces/SAP/pages/53411954/Spark+Scala+-+Read+Write+files+from+HDFS
Load data from JSON
val meetingsearchsRawDF = sqlContext.read.json("hdfs://localhost:9000/meetings3/meetings_search.json")
val meetingsearchsDF = meetingsearchsRawDF.toDF(meetingsearchsRawDF.columns map(_.toLowerCase): _*)
meetingsearchsDF.printSchema()
meetingsearchsDF.createOrReplaceTempView("meetingsearchs")
Sometimes a JSON object in the file spans multiple lines; in that case, enable the multiLine option
val usersRawDF = sqlContext.read
    .option("multiLine", true)
    .json("hdfs://localhost:9000/users/dupedusers.json")
val usersDF = usersRawDF.toDF(usersRawDF.columns map(_.toLowerCase): _*)
usersDF.printSchema()
usersDF.createOrReplaceTempView("users")
Generate the Output File
val meetingsToDeleteDF = sqlContext.sql("""
SELECT
    'DELETED' as action,
    meetingsearchs.pinRequired as pinRequired,
    meetingsearchs.lastCallDate as lastCallDate
FROM meetingsearchs
LEFT OUTER JOIN meetings on meetingsearchs.uuid = meetings.uuid
LEFT OUTER JOIN users on meetingsearchs.owneruuid = users.uuid
WHERE meetings.uuid IS NULL and users.uuid IS NULL""")
meetingsToDeleteDF.show(2)
meetingsToDeleteDF.coalesce(1).write.csv("hdfs://localhost:9000/meeting3/uuids_to_delete.csv")
meetingsToDeleteDF.repartition(1).write.json("hdfs://localhost:9000/meetings3/meetings_to_delete.json")
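The same LEFT OUTER JOIN … IS NULL filter can also be expressed with the DataFrame API as two left-anti joins. A sketch, assuming a `meetingsDF` DataFrame registered for the meetings view (hypothetical here, since only the JSON loads are shown in this post):

```scala
// Sketch: DataFrame-API equivalent of the SQL anti-join above.
// Keeps only rows from meetingsearchs with no match in meetings or users.
import org.apache.spark.sql.functions.lit

val orphanedDF = meetingsearchsDF
    .join(meetingsDF, Seq("uuid"), "left_anti")
    .join(usersDF, col("owneruuid") === usersDF("uuid"), "left_anti")
    .withColumn("action", lit("DELETED"))
```

left_anti returns only the left side's columns, which is exactly what we want before writing the delete list.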
Fetch the File from HDFS to the Local Disk
hdfs dfs -get hdfs://localhost:9000/meeting2/uuids_to_delete.csv ./uuids_to_delete.csv
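Note that Spark's write.csv produces a directory containing part-* files, not a single CSV. If a single local file is needed, -getmerge can concatenate the parts; a sketch using the output path from above:

```shell
# write.csv created a directory of part-*.csv files;
# -getmerge concatenates them into one local file.
hdfs dfs -getmerge hdfs://localhost:9000/meeting3/uuids_to_delete.csv ./uuids_to_delete.csv
```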
There is still some issue with my Docker HDFS: my files are lost after a reboot and namenode format.
I saw error messages like the following:
mesg: ttyname failed: Inappropriate ioctl for device

References:
http://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
http://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-common/core-default.xml
https://stackoverflow.com/questions/2358402/where-hdfs-stores-files-locally-by-default
https://creativedata.atlassian.net/wiki/spaces/SAP/pages/53411954/Spark+Scala+-+Read+Write+files+from+HDFS
