Data Solution(1)Prepare ENV to Parse CSV Data on Single Ubuntu
Java Version
> java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
Maven Version
> mvn --version
Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-24T13:41:47-05:00)
Prepare Protobuf
> git clone https://github.com/google/protobuf.git
> cd protobuf
> ./autogen.sh
Exception:
Can't exec "aclocal": No such file or directory at /usr/local/Cellar/autoconf/2.69/share/autoconf/Autom4te/FileUtils.pm line 326.
autoreconf: failed to run aclocal: No such file or directory
Possible Solution:
https://github.com/meritlabs/merit/issues/344
> brew install autoconf automake libtool berkeley-db4 pkg-config openssl boost boost-build libevent
Success this time
> ./autogen.sh
> ./configure --prefix=/Users/hluo/tool/protobuf-3.6.1
Run make and make install to place it in the working directory, and add that directory's bin to the PATH
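A minimal sketch of those two steps, assuming the --prefix used above and that its bin directory goes onto the PATH:
> make
> make install
# Assumed shell profile addition, matching the --prefix above
> export PATH=/Users/hluo/tool/protobuf-3.6.1/bin:$PATH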
Check Version
> protoc --version
libprotoc 3.6.1
Prepare CMake ENV
> wget https://github.com/Kitware/CMake/releases/download/v3.14.0-rc2/cmake-3.14.0-rc2.tar.gz
Unzip and go to the directory
> ./bootstrap
Then make and make install, check version
> cmake --version
cmake version 3.14.0-rc2
Get Hadoop Source Codes
> wget http://apache.osuosl.org/hadoop/common/hadoop-3.2.0/hadoop-3.2.0-src.tar.gz
Unzip and build
> mvn package -Pdist.native -DskipTests -Dtar
Haha, Exception
org.apache.maven.plugin.MojoExecutionException: protoc version is 'libprotoc 3.6.1', expected version is '2.5.0'
Solution: go back to the protobuf source directory and check out the 2.5.0 tag
> git checkout tags/v2.5.0
> ./autogen.sh
> ./configure --prefix=/home/carl/tool/protobuf-2.5.0
> protoc --version
libprotoc 2.5.0
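After make and make install, the 2.5.0 bin directory has to come before the 3.6.1 one on the PATH so the old protoc is picked up; a sketch using the prefix from above:
> export PATH=/home/carl/tool/protobuf-2.5.0/bin:$PATH
> which protoc
/home/carl/tool/protobuf-2.5.0/bin/protoc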
Build again
> mvn package -Pdist.native -DskipTests -Dtar
Read this document to figure out how to build
https://github.com/apache/hadoop/blob/trunk/BUILDING.txt
> mvn package -Pdist,native,docs -DskipTests -Dtar
Do not build the native package on macOS
> mvn package -Pdist,docs -DskipTests -Dtar
It still fails to build, the same as last time. I will just use the binary distribution instead.
> wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
Unzip the file and place it in the working directory
> cat etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
> cat etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Format the NameNode
> hdfs namenode -format
Set up SSH access on MAC
> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
> ssh localhost
Enable remote login first in System Preferences —> Sharing —> Remote Login
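If there is no key pair on the machine yet, generate one before the cat step above; a minimal sketch with the default file locations:
> ssh-keygen -t rsa
> chmod 600 ~/.ssh/authorized_keys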
Start HDFS
> sbin/start-dfs.sh
The web UI port numbers changed in Hadoop 3 (the NameNode UI is now on 9870); they are listed in the documentation:
https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-common/ClusterSetup.html#Installation
http://localhost:9870/dfshealth.html#tab-overview
Start YARN
> sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
Something went wrong here
> less hadoop-hluo-nodemanager-machluo.local.log
2019-02-20 22:23:40,483 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: NMWebapps failed to start.
Caused by: com.google.inject.ProvisionException: Unable to provision, see the following errors:
1) Error injecting constructor, java.lang.NoClassDefFoundError: javax/activation/DataSource
at org.apache.hadoop.yarn.server.nodemanager.webapp.JAXBContextResolver.<init>(JAXBContextResolver.java:52)
Solution:
https://salmanzg.wordpress.com/2018/02/20/webhdfs-on-hadoop-3-with-java-9/
> vi etc/hadoop/hadoop-env.sh
export HADOOP_OPTS="--add-modules java.activation"
That fails with: Module java.activation not found
Maybe it is because my locally installed Java is JDK 10 or JDK 11; the java.activation module was removed in JDK 11.
Let me try on my Ubuntu virtual machine instead.
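For anyone who wants to stay on a newer JDK, one workaround I have seen suggested (untested here; the jar version and the /opt/hadoop path are assumptions) is to put the standalone javax.activation jar on Hadoop's classpath instead of relying on --add-modules:
# Untested sketch: java.activation was removed in JDK 11, so supply it as a plain jar
> cd /opt/hadoop
> wget https://repo1.maven.org/maven2/com/sun/activation/javax.activation/1.2.0/javax.activation-1.2.0.jar -P share/hadoop/common/lib/
# Restart YARN so the NodeManager picks up the jar
> sbin/stop-yarn.sh
> sbin/start-yarn.sh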
Generate key pair if needed
> ssh-keygen
> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Add JAVA_HOME into hadoop-env.sh
> vi etc/hadoop/hadoop-env.sh
export JAVA_HOME=/opt/jdk
Start DFS
> hdfs namenode -format
> sbin/start-dfs.sh
http://ubuntu-master:9870/dfshealth.html#tab-overview
Start YARN
> sbin/start-yarn.sh
http://ubuntu-master:8088/cluster
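As a quick sanity check, jps (bundled with the JDK) can confirm the daemons are up; on this single node I would expect NameNode, DataNode and SecondaryNameNode from start-dfs.sh, plus ResourceManager and NodeManager from start-yarn.sh:
> jps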
Install Spark
> wget http://ftp.wayne.edu/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
Unzip the file and place it in the working directory
> cp conf/spark-env.sh.template conf/spark-env.sh
> vi conf/spark-env.sh
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
> echo $SPARK_HOME
/opt/spark
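For reference, a minimal sketch of the environment I assume in ~/.bashrc, following the /opt layout used above:
# Assumed ~/.bashrc additions for this single-node setup
export JAVA_HOME=/opt/jdk
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$PATH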
Try Shell
> MASTER=yarn bin/spark-shell
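To double-check that jobs really land on YARN, a sketch using the examples jar bundled with Spark 2.4.0 (the Scala 2.11 jar name is an assumption based on the standard binary layout):
> bin/spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.4.0.jar 10
The finished application should then show up on http://ubuntu-master:8088/cluster.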
Install Zeppelin
Download Binary
> wget http://apache.claz.org/zeppelin/zeppelin-0.8.1/zeppelin-0.8.1-bin-all.tgz
Unzip it, place it in the working directory, and prepare the configuration file
> cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh
export SPARK_HOME="/opt/spark"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop/"
> bin/zeppelin-daemon.sh start
Then we can visit the web app
http://ubuntu-master:8080/#/
I am running in local mode right now; the Spark UI is at
http://ubuntu-master:4040/jobs/
spark.master is 'local', which is why it runs on the local machine and not on the remote YARN cluster. We can easily change that on the interpreter settings page.
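Alternatively, the master can be set in the configuration file and the daemon restarted; a sketch, assuming Zeppelin 0.8.1 reads MASTER from zeppelin-env.sh:
> vi conf/zeppelin-env.sh
export MASTER=yarn-client
> bin/zeppelin-daemon.sh restart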
Put my file on the ubuntu-master HDFS
Check the root directory
> hdfs dfs -ls /
Create the directories
> hdfs dfs -mkdir /user
> hdfs dfs -mkdir /user/yiyi
Upload file
> hdfs dfs -put ./new-printing-austin.csv /user/yiyi/austin1.csv
After that we can see the file here
http://ubuntu-master:9870/explorer.html#/user/yiyi
> hdfs dfs -ls /user/yiyi/
Found 1 items
-rw-r--r-- 1 carl supergroup 105779 2019-02-21 12:44 /user/yiyi/austin1.csv
Other HDFS shell commands are documented here
https://hadoop.apache.org/docs/r1.0.4/cn/hdfs_shell.html
Change core-site.xml so the NameNode listens on 0.0.0.0 instead of localhost; then I can access HDFS from other machines
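A minimal sketch of what I assume that change looks like (restart DFS afterwards); an alternative would be to keep fs.defaultFS pointing at ubuntu-master and set dfs.namenode.rpc-bind-host to 0.0.0.0 in hdfs-site.xml instead:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://0.0.0.0:9000</value>
</property>
</configuration>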
> hdfs dfs -ls hdfs://ubuntu-master:9000/user/yiyi/
Found 1 items
-rw-r--r-- 1 carl supergroup 105779 2019-02-21 12:44 hdfs://ubuntu-master:9000/user/yiyi/austin1.csv
This code works pretty well in the Zeppelin notebook
// Load the raw CSV from HDFS, using the first row as the header and inferring column types
val companyRawDF = sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("hdfs://ubuntu-master:9000/user/yiyi/austin1.csv")
// Replace whitespace in column names with underscores so they are easier to reference in SQL
val companyDF = companyRawDF.columns.foldLeft(companyRawDF)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
companyDF.printSchema()
// Register a temporary view so the data can be queried with plain SQL
companyDF.createOrReplaceTempView("company")
sqlContext.sql("select businessId, title, company_name, phone, email, bbbRating, bbbRatingScore from company where bbbRating = 'A+' limit 10 ").show()
%sql
select bbbRatingScore, count(1) value
from company
where phone is not null
group by bbbRatingScore
order by bbbRatingScore
Security
https://makeling.github.io/bigdata/39395030.html
References:
https://spark.apache.org/
https://hadoop.apache.org/releases.html
https://spark.apache.org/docs/latest/index.html
https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-common/ClusterSetup.html#Installation
Some other documents
Spark 2017 BigData Update(2)CentOS Cluster
Spark 2017 BigData Update(3)Notebook Example
Spark 2017 BigData Update(4)Spark Core in JAVA
Spark 2017 BigData Update(5)Spark Streaming in Java