Data Solution(1)Prepare ENV to Parse CSV Data on Single Ubuntu

 

Java Version
> java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
Maven Version
> mvn --version
Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-24T13:41:47-05:00)
Prepare Protobuf
> git clone https://github.com/google/protobuf.git
> cd protobuf
> ./autogen.sh
Exception:
Can't exec "aclocal": No such file or directory at /usr/local/Cellar/autoconf/2.69/share/autoconf/Autom4te/FileUtils.pm line 326.
autoreconf: failed to run aclocal: No such file or directory
Possible Solution:
https://github.com/meritlabs/merit/issues/344
> brew install autoconf automake libtool berkeley-db4 pkg-config openssl boost boost-build libevent
Success this time
> ./autogen.sh
> ./configure --prefix=/Users/hluo/tool/protobuf-3.6.1
Run make and make install to place the tool in a working directory that is on PATH
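A minimal sketch of those steps, assuming the prefix given to configure above:
> make
> make install
> export PATH=/Users/hluo/tool/protobuf-3.6.1/bin:$PATH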
Check Version
> protoc --version
libprotoc 3.6.1
Prepare CMake ENV
> wget https://github.com/Kitware/CMake/releases/download/v3.14.0-rc2/cmake-3.14.0-rc2.tar.gz
Unzip and go to the directory
> ./bootstrap
Then run make and make install, and check the version
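A minimal sketch of those two steps, assuming the default /usr/local prefix (so make install may need sudo):
> make
> sudo make install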
> cmake --version
cmake version 3.14.0-rc2
Get Hadoop Source Codes
> wget http://apache.osuosl.org/hadoop/common/hadoop-3.2.0/hadoop-3.2.0-src.tar.gz
Unzip and build
> mvn package -Pdist.native -DskipTests -Dtar
Haha, Exception
org.apache.maven.plugin.MojoExecutionException: protoc version is 'libprotoc 3.6.1', expected version is '2.5.0'
Solution: go back to the protobuf source directory and check out the 2.5.0 tag
> git checkout tags/v2.5.0
> ./autogen.sh
> ./configure --prefix=/home/carl/tool/protobuf-2.5.0
Then make and make install again, and check the version
> protoc --version
libprotoc 2.5.0
Build again
> mvn package -Pdist.native -DskipTests -Dtar
Read this document to figure out how to build
https://github.com/apache/hadoop/blob/trunk/BUILDING.txt
> mvn package -Pdist,native,docs -DskipTests -Dtar
Do not build the native package on macOS
> mvn package -Pdist,docs -DskipTests -Dtar
It still fails to build like last time, so I will directly use the binary distribution instead.
> wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
Unzip the file and place it in the working directory
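A minimal sketch, assuming /opt/hadoop as the working directory (matching the HADOOP_CONF_DIR used later for Spark):
> tar zxvf hadoop-3.2.0.tar.gz
> sudo mv hadoop-3.2.0 /opt/hadoop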
> cat etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
> cat etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Format the disk
> hdfs namenode -format
Set up SSH access on macOS
> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
> ssh localhost
If the connection is refused, open System Preferences -> Sharing and enable Remote Login
Start HDFS
> sbin/start-dfs.sh
Note the new default web UI port numbers in Hadoop 3 (the NameNode UI is now on 9870):
https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-common/ClusterSetup.html#Installation
http://localhost:9870/dfshealth.html#tab-overview
Start YARN
> sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
Something went wrong here
> less hadoop-hluo-nodemanager-machluo.local.log
2019-02-20 22:23:40,483 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: NMWebapps failed to start.
Caused by: com.google.inject.ProvisionException: Unable to provision, see the following errors:
1) Error injecting constructor, java.lang.NoClassDefFoundError: javax/activation/DataSource
  at org.apache.hadoop.yarn.server.nodemanager.webapp.JAXBContextResolver.<init>(JAXBContextResolver.java:52)
Solution:
https://salmanzg.wordpress.com/2018/02/20/webhdfs-on-hadoop-3-with-java-9/
> vi etc/hadoop/hadoop-env.sh
export HADOOP_OPTS="--add-modules java.activation"
But this still fails: Module java.activation not found
Maybe it is because my locally installed Java is JDK 10 or JDK 11 (the java.activation module was removed in JDK 11).
Let me try on my Ubuntu virtual machine instead.
Generate key pair if needed
> ssh-keygen
> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Add JAVA_HOME to hadoop-env.sh
> vi etc/hadoop/hadoop-env.sh
export JAVA_HOME=/opt/jdk
Start DFS
> hdfs namenode -format
> sbin/start-dfs.sh
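To verify the daemons came up, jps from the JDK should list the HDFS processes, something like NameNode, DataNode and SecondaryNameNode on a single node:
> jps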
http://ubuntu-master:9870/dfshealth.html#tab-overview
Start YARN
> sbin/start-yarn.sh
http://ubuntu-master:8088/cluster
Install Spark
> wget http://ftp.wayne.edu/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
Unzip the file and place it in the working directory
> cp conf/spark-env.sh.template conf/spark-env.sh
> vi conf/spark-env.sh
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
> echo $SPARK_HOME
/opt/spark
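If SPARK_HOME is not set yet, a minimal sketch of exporting it, assuming a bash shell:
> vi ~/.bashrc
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH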
Try Shell
> MASTER=yarn bin/spark-shell
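Inside the shell, a quick way to confirm we are really on YARN is to check the master, which should print yarn:
scala> sc.master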
Install Zeppelin
Download Binary
> wget http://apache.claz.org/zeppelin/zeppelin-0.8.1/zeppelin-0.8.1-bin-all.tgz
Unzip and place it in the working directory, then prepare the configuration file
> cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh
export SPARK_HOME="/opt/spark"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop/"
> bin/zeppelin-daemon.sh start
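To double-check the daemon is running, the same script supports a status sub-command:
> bin/zeppelin-daemon.sh status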
Then we can visit the web app
http://ubuntu-master:8080/#/
I am running in local mode right now
http://ubuntu-master:4040/jobs/
spark.master is 'local', that is why it runs on the local machine and not on remote YARN; we can easily change that on the interpreter settings page, e.g. by setting the master property to yarn-client.
Put my File on ubuntu-master HDFS
Check the directory
> hdfs dfs -ls /
Create the directories
> hdfs dfs -mkdir /user
> hdfs dfs -mkdir /user/yiyi
Upload the file
> hdfs dfs -put ./new-printing-austin.csv /user/yiyi/austin1.csv
After that we can see the file here
http://ubuntu-master:9870/explorer.html#/user/yiyi
> hdfs dfs -ls /user/yiyi/
Found 1 items
-rw-r--r--   1 carl supergroup     105779 2019-02-21 12:44 /user/yiyi/austin1.csv
Other command-line references
https://hadoop.apache.org/docs/r1.0.4/cn/hdfs_shell.html
Change core-site.xml to listen on 0.0.0.0; then I can access HDFS from other machines
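A minimal sketch of that change, assuming the default port 9000 (restart HDFS after editing):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://0.0.0.0:9000</value>
</property>
</configuration>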
> hdfs dfs -ls hdfs://ubuntu-master:9000/user/yiyi/
Found 1 items
-rw-r--r--   1 carl supergroup     105779 2019-02-21 12:44 hdfs://ubuntu-master:9000/user/yiyi/austin1.csv

This code works pretty well in the Zeppelin notebook
// Read the raw CSV from HDFS, treating the first row as a header and inferring column types
val companyRawDF = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://ubuntu-master:9000/user/yiyi/austin1.csv")
// Replace whitespace in column names with underscores so they can be referenced in SQL
val companyDF = companyRawDF.columns.foldLeft(companyRawDF)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
companyDF.printSchema()
// Register the DataFrame as a temporary view and query it
companyDF.createOrReplaceTempView("company")
sqlContext.sql("select businessId, title, company_name, phone, email, bbbRating, bbbRatingScore from company where bbbRating = 'A+' limit 10").show()
%sql
select bbbRatingScore, count(1) value
from company
where phone is not null
group by bbbRatingScore
order by bbbRatingScore
Security
https://makeling.github.io/bigdata/39395030.html

References:
https://spark.apache.org/
https://hadoop.apache.org/releases.html
https://spark.apache.org/docs/latest/index.html
https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-common/ClusterSetup.html#Installation
Some other documents
Spark 2017 BigData Update(2)CentOS Cluster
Spark 2017 BigData Update(3)Notebook Example
Spark 2017 BigData Update(4)Spark Core in JAVA
Spark 2017 BigData Update(5)Spark Streaming in Java
