Data Solution 2019(4) Investigate my Data in Spark
Create an HDFS directory to store my CSV files
> hdfs dfs -mkdir hdfs://localhost:9000/meeting
Put the files there
> hdfs dfs -put ./meetings_only.csv hdfs://localhost:9000/meeting/meetings.csv
> hdfs dfs -put ./users_only.csv hdfs://localhost:9000/meeting/users.csv
> hdfs dfs -ls hdfs://localhost:9000/meeting/
Found 2 items
-rw-r--r-- 1 carl supergroup 72937014 2019-02-25 15:41 hdfs://localhost:9000/meeting/meetings.csv
-rw-r--r-- 1 carl supergroup 67428023 2019-02-25 15:42 hdfs://localhost:9000/meeting/users.csv
After I restarted the Docker container, my HDFS data was gone. I had forgotten to map the data directories out of the container.
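Solution:
Map the NameNode and DataNode directories to the host with Docker volume options. The $(shell pwd) form below is Makefile syntax; in a plain shell, use $(pwd) instead.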
-v $(shell pwd)/hadoop/hdfs/namenode:/data/hadoop/hdfs/namenode -v $(shell pwd)/hadoop/hdfs/datanode:/data/hadoop/hdfs/datanode
Add this configuration to hdfs-site.xml so that HDFS stores its data in the mapped directories
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/hadoop/hdfs/datanode</value>
</property>
Overwrite an existing file with the -f option
> hdfs dfs -put -f ./elasticsearch_meeting_uuids.txt hdfs://localhost:9000/meeting/meetings_search.csv
https://creativedata.atlassian.net/wiki/spaces/SAP/pages/53411954/Spark+Scala+-+Read+Write+files+from+HDFS
Load data from JSON
val meetingsearchsRawDF = sqlContext.read.json("hdfs://localhost:9000/meetings3/meetings_search.json")
val meetingsearchsDF = meetingsearchsRawDF.toDF(meetingsearchsRawDF.columns map(_.toLowerCase): _*)
meetingsearchsDF.printSchema()
meetingsearchsDF.createOrReplaceTempView("meetingsearchs")
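The query further down joins against a `meetings` view, but these notes never show how it is created. A minimal sketch of one way to do it from a CSV file follows. It is an assumption that the CSV has a header row, and the sample columns (`UUID`, `OwnerUUID`) and the temporary local file are invented stand-ins for hdfs://localhost:9000/meeting/meetings.csv. It uses `spark.read` from a `SparkSession` where the notes use `sqlContext.read`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("load-meetings-csv")
  .getOrCreate()

// Hypothetical sample data standing in for the real meetings.csv on HDFS.
val sampleCsv = Files.createTempFile("meetings", ".csv")
Files.write(sampleCsv, "UUID,OwnerUUID\nm-1,u-1\nm-2,u-2\n".getBytes(StandardCharsets.UTF_8))

val meetingsRawDF = spark.read
  .option("header", "true") // assumption: the first row holds the column names
  .csv(sampleCsv.toUri.toString)

// Same lowercasing step that the notes apply to the JSON DataFrames.
val meetingsDF = meetingsRawDF.toDF(meetingsRawDF.columns.map(_.toLowerCase): _*)
meetingsDF.printSchema()
meetingsDF.createOrReplaceTempView("meetings")
```

To read the real file, replace the temporary path with the HDFS URL and adjust the `header` option to match the actual file.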
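Sometimes the JSON records span multiple lines (for example, a pretty-printed array of JSON objects) instead of one object per line; in that case, set the multiLine option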
val usersRawDF = sqlContext.read
.option("multiLine", true)
.json("hdfs://localhost:9000/users/dupedusers.json")
val usersDF = usersRawDF.toDF(usersRawDF.columns map(_.toLowerCase): _*)
usersDF.printSchema()
usersDF.createOrReplaceTempView("users")
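By default, Spark's JSON reader expects one complete JSON object per line, which is why the `multiLine` option is needed for pretty-printed files. The sketch below shows the option on a small local file; the field names and values are invented for illustration, and the temporary file stands in for the HDFS path used above.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("multiline-json")
  .getOrCreate()

// Hypothetical pretty-printed JSON array: each record spans several lines.
val prettyJson =
  """[
    |  {
    |    "UUID": "u-1",
    |    "Name": "alice"
    |  },
    |  {
    |    "UUID": "u-2",
    |    "Name": "bob"
    |  }
    |]""".stripMargin
val sampleJson = Files.createTempFile("users", ".json")
Files.write(sampleJson, prettyJson.getBytes(StandardCharsets.UTF_8))

// With multiLine enabled, the whole file is parsed as one JSON document
// and each object in the top-level array becomes a row.
val sampleUsersDF = spark.read
  .option("multiLine", true)
  .json(sampleJson.toUri.toString)
sampleUsersDF.show()
```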
Generate the Output File
val meetingsToDeleteDF = sqlContext.sql("""
SELECT
'DELETED' as action,
meetingsearchs.pinRequired as pinRequired,
meetingsearchs.lastCallDate as lastCallDate
FROM meetingsearchs
LEFT OUTER JOIN meetings on meetingsearchs.uuid = meetings.uuid
LEFT OUTER JOIN users on meetingsearchs.owneruuid = users.uuid
WHERE meetings.uuid IS NULL and users.uuid IS NULL""")
meetingsToDeleteDF.show(2)
meetingsToDeleteDF.coalesce(1).write.csv("hdfs://localhost:9000/meeting3/uuids_to_delete.csv")
meetingsToDeleteDF.repartition(1).write.json("hdfs://localhost:9000/meetings3/meetings_to_delete.json")
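The `LEFT OUTER JOIN ... WHERE ... IS NULL` pattern in the query keeps only the search records that have no matching meeting and no matching owner. A small sketch on in-memory data may make that easier to check. The rows and the `demo_*` view names are invented; only the column names `uuid` and `owneruuid` come from the notes.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("orphan-meetings")
  .getOrCreate()
import spark.implicits._

// Hypothetical search index entries: (uuid, owneruuid).
Seq(("s-1", "u-1"), ("s-2", "u-9"), ("s-3", "u-9"), ("s-4", "u-1"))
  .toDF("uuid", "owneruuid")
  .createOrReplaceTempView("demo_search")
// Meetings and users that still exist.
Seq("s-1", "s-3").toDF("uuid").createOrReplaceTempView("demo_meetings")
Seq("u-1").toDF("uuid").createOrReplaceTempView("demo_users")

// A row survives only when BOTH joins found no match:
// s-1 and s-3 still have a meeting, s-4 still has an owner, s-2 has neither.
val orphansDF = spark.sql("""
  SELECT 'DELETED' AS action, demo_search.uuid AS uuid
  FROM demo_search
  LEFT OUTER JOIN demo_meetings ON demo_search.uuid = demo_meetings.uuid
  LEFT OUTER JOIN demo_users ON demo_search.owneruuid = demo_users.uuid
  WHERE demo_meetings.uuid IS NULL AND demo_users.uuid IS NULL""")
orphansDF.show()
```

Because the two conditions are combined with `AND`, a search record whose owner still exists is kept out of the delete list even when its meeting is gone.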
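Download the Output Files from HDFS
Spark writes each output as a directory of part files, so this copies a directory rather than a single CSV file
> hdfs dfs -get hdfs://localhost:9000/meeting3/uuids_to_delete.csv ./uuids_to_delete.csv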
There is still an issue with HDFS in my Docker setup: my files are lost after a reboot and format.
I saw error messages such as the following:
mesg: ttyname failed: Inappropriate ioctl for device
References:
http://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
http://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-common/core-default.xml
https://stackoverflow.com/questions/2358402/where-hdfs-stores-files-locally-by-default
https://creativedata.atlassian.net/wiki/spaces/SAP/pages/53411954/Spark+Scala+-+Read+Write+files+from+HDFS