Data Solution 2019(9)CentOS Installation

 

Join two DataFrames and get the count

import org.apache.spark.sql.functions._

val addressCountDF = addressesRawDF.groupBy("user_id").agg(count("user_id").as("times"))

val userWithCountDF = usersRawDF.join(
  addressCountDF,
  usersRawDF("id") <=> addressCountDF("user_id"),
  "left"
)

userWithCountDF.select("id", "times").filter("times > 0").show(100)
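The same join-and-count pattern can be sketched with pandas (which this setup installs later); the DataFrames and values below are toy stand-ins for usersRawDF and addressesRawDF, not data from the Spark job:

```python
import pandas as pd

# toy stand-ins for usersRawDF and addressesRawDF
users = pd.DataFrame({"id": [1, 2, 3], "name": ["ann", "bob", "cai"]})
addresses = pd.DataFrame({"user_id": [1, 1, 3]})

# count addresses per user, mirroring groupBy("user_id").agg(count(...))
counts = addresses.groupby("user_id").size().reset_index(name="times")

# a left join keeps users with no addresses (their times becomes NaN)
joined = users.merge(counts, left_on="id", right_on="user_id", how="left")

# keep only users that have at least one address
result = joined.loc[joined["times"] > 0, ["id", "times"]]
```

As in the Spark version, the left join preserves users with zero addresses, and the `times > 0` filter then drops them (NaN comparisons are false).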

Adjust to the latest versions for Ubuntu
Prepare the downloads:
    wget http://apache-mirror.8birdsvideo.com/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz -P install/
    wget http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz -P install/
    wget http://www.gtlib.gatech.edu/pub/apache/zeppelin/zeppelin-0.8.2/zeppelin-0.8.2-bin-all.tgz -P install/

On a new Ubuntu machine
> sudo apt install make
> sudo snap install docker
> sudo groupadd docker
> sudo gpasswd -a $USER docker
> sudo usermod -aG docker $USER

Here is the Dockerfile
#Run a Hadoop/Spark/Zeppelin server side

#Prepare the OS
FROM            ubuntu:16.04
MAINTAINER      Yiyi Kang <yiyikangrachel@gmail.com>

ENV DEBIAN_FRONTEND noninteractive
ENV JAVA_HOME       /usr/lib/jvm/java-8-openjdk-amd64
ENV LANG            en_US.UTF-8
ENV LC_ALL          en_US.UTF-8

RUN apt-get -qq  update
RUN apt-get -qqy dist-upgrade

#Prepare the dependencies
RUN apt-get install -qy wget unzip vim
RUN apt-get install -qy iputils-ping

#Install OpenJDK 8
RUN apt-get update && \
    apt-get install -y --no-install-recommends locales && \
    locale-gen en_US.UTF-8 && \
    apt-get dist-upgrade -y && \
    apt-get install -qy openjdk-8-jdk

#Prepare for hadoop and spark
RUN apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa
RUN cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

#Prepare python env
RUN apt-get install -y git
RUN apt-get install -y build-essential zlib1g-dev libbz2-dev
RUN apt-get install -y libreadline6 libreadline6-dev sqlite3 libsqlite3-dev
RUN apt-get update --fix-missing
RUN apt-get install -y libssl-dev

RUN apt-get install -y software-properties-common vim
RUN add-apt-repository ppa:jonathonf/python-3.6
RUN apt-get update

RUN apt-get install -y build-essential python3.6 python3.6-dev python3-pip python3.6-venv
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN ln -s /usr/bin/pip3 /usr/bin/pip

#pandas
RUN pip install pandas
RUN pip install -U pandasql

RUN            mkdir /tool/
WORKDIR        /tool/
#add the software hadoop
ADD            install/hadoop-3.2.1.tar.gz /tool/
RUN            ln -s /tool/hadoop-3.2.1 /tool/hadoop
ADD            conf/core-site.xml /tool/hadoop/etc/hadoop/
ADD            conf/hdfs-site.xml /tool/hadoop/etc/hadoop/
ADD            conf/hadoop-env.sh /tool/hadoop/etc/hadoop/

#add the software spark
ADD            install/spark-2.4.4-bin-hadoop2.7.tgz /tool/
RUN            ln -s /tool/spark-2.4.4-bin-hadoop2.7 /tool/spark
ADD            conf/spark-env.sh /tool/spark/conf/

#add the software zeppelin
ADD            install/zeppelin-0.8.2-bin-all.tgz /tool/
RUN            ln -s /tool/zeppelin-0.8.2-bin-all /tool/zeppelin

#set up the app
EXPOSE  9000 9870 8080 4040
RUN     mkdir -p /app/
ADD     start.sh /app/
WORKDIR /app/
CMD    [ "./start.sh" ]

Try to Set Up on the Host Machine on CentOS 7
I manage JAVA there with jenv
Need a JAVA ENV with JDK 8, 11, and 12
> sudo yum install git
> git clone https://github.com/gcuisinier/jenv.git ~/.jenv
> echo 'export PATH="$HOME/.jenv/bin:$PATH"' >> ~/.bash_profile
> echo 'eval "$(jenv init -)"' >> ~/.bash_profile
> . ~/.bash_profile

Check version
> jenv --version
jenv 0.5.2-12-gdcbfd48

Download JDK 8, 11, 12 from Official website
https://www.oracle.com/technetwork/java/javase/downloads/jdk12-downloads-5295953.html
jdk-11.0.4_linux-x64_bin.tar.gz

Unzip all of these files, place them in the working directory, and link them into the /opt directory
> tar zxvf jdk-11.0.4_linux-x64_bin.tar.gz

> mv jdk-11.0.4 ~/tool/
> sudo ln -s /home/redis/tool/jdk-11.0.4 /opt/jdk-11.0.4

Add to JENV
> jenv add /opt/jdk-11.0.4

Check the installed versions
>  jenv versions
* system (set by /home/redis/.jenv/version)
  11
  11.0
  11.0.4
  oracle64-11.0.4

Try to set global to 11
> jenv global 11.0

> java -version
java version "11.0.4" 2019-07-16 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.4+10-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.4+10-LTS, mixed mode)

Prepare HADOOP
> wget http://apache-mirror.8birdsvideo.com/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
> tar zxvf hadoop-3.2.1.tar.gz
> mv hadoop-3.2.1 ~/tool/
> sudo ln -s /home/carl/tool/hadoop-3.2.1 /opt/hadoop-3.2.1
> sudo ln -s /opt/hadoop-3.2.1 /opt/hadoop

Site Configuration
> vi etc/hadoop/core-site.xml
<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://0.0.0.0:9000</value>
        </property>
</configuration>

HDFS site configuration
> vi etc/hadoop/hdfs-site.xml
<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
        <property>
                <name>dfs.permissions</name>
                <value>false</value>
        </property>
</configuration>
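Before starting HDFS it can help to sanity-check these XML files. A minimal sketch with Python's standard library parser; the inline string stands in for reading etc/hadoop/hdfs-site.xml with ET.parse:

```python
import xml.etree.ElementTree as ET

# inline copy of the hdfs-site.xml above; in practice use ET.parse(path)
hdfs_site = """<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
        <property>
                <name>dfs.permissions</name>
                <value>false</value>
        </property>
</configuration>"""

root = ET.fromstring(hdfs_site)
# map each <property> name to its value
props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
```

The same check works for core-site.xml, e.g. confirming fs.defaultFS points at hdfs://0.0.0.0:9000.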

Shell Command ENV
Check JAVA_HOME before we configure the file
> jenv doctor
[OK] No JAVA_HOME set
[OK] Java binaries in path are jenv shims
[OK] Jenv is correctly loaded

> jenv enable-plugin export
Restart the shell session
> jenv global 11.0
> java -version
java version "11.0.4" 2019-07-16 LTS
> echo $JAVA_HOME
/home/carl/.jenv/versions/11.0


> vi etc/hadoop/hadoop-env.sh
export JAVA_HOME="/home/carl/.jenv/versions/11.0"

SSH to my localhost, it prompts for a password
> ssh localhost

Generate the key pair
> ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

SSH to localhost successful
> ssh localhost
Last login: Thu Oct 24 16:12:36 2019 from localhost

Start HDFS
> cd /opt/hadoop
> bin/hdfs namenode -format
> sbin/start-dfs.sh

Visit the web UI
http://rancher-worker1:9870/dfshealth.html#tab-overview

Exceptions:
Failed to retrieve data from /webhdfs/v1/?op=LISTSTATUS: Server Error

Find this in the logs
> grep "Error" ./*
./hadoop-carl-namenode-rancher-worker1.log:2019-10-24 16:15:59,844 WARN org.eclipse.jetty.servlet.ServletHandler: Error for /webhdfs/v1/
./hadoop-carl-namenode-rancher-worker1.log:java.lang.NoClassDefFoundError: javax/activation/DataSource
./hadoop-carl-namenode-rancher-worker1.log: at com.sun.jersey.spi.inject.Errors.processWithErrors(Errors.java:193)

Solution:
https://github.com/highsource/jsonix-schema-compiler/issues/81
https://salmanzg.wordpress.com/2018/02/20/webhdfs-on-hadoop-3-with-java-9/
> vi etc/hadoop/hadoop-env.sh
export HADOOP_OPTS="--add-modules java.activation"
Not working at all. According to this page:
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions
Hadoop currently supports only JDK 8, so I need to switch back to JDK 8.
> jenv global 1.8
> java -version
java version "1.8.0_221"
> echo $JAVA_HOME
/home/carl/.jenv/versions/1.8
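A quick way to confirm the active JDK major version from the `java -version` output; a hedged sketch in which the sample line is hard-coded rather than captured from a subprocess:

```python
import re

# first stderr line printed by `java -version` under jenv global 1.8
line = 'java version "1.8.0_221"'

m = re.search(r'version "(\d+)(?:\.(\d+))?', line)
major = int(m.group(1))
# pre-JDK-9 releases report themselves as "1.x", so the real
# major version is the second field in that case
if major == 1:
    major = int(m.group(2))
```

The same parse handles modern version strings such as "11.0.4", where the first field is already the major version.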

Prepare Spark
> wget http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
> tar zxvf spark-2.4.4-bin-hadoop2.7.tgz
> mv spark-2.4.4-bin-hadoop2.7 ~/tool/spark-2.4.4
> sudo ln -s /home/carl/tool/spark-2.4.4 /opt/spark-2.4.4
> sudo ln -s /opt/spark-2.4.4 /opt/spark
> cd /opt/spark
> cp conf/spark-env.sh.template conf/spark-env.sh

> vi conf/spark-env.sh
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

Prepare PYTHON 3.7 ENV
Since we want to migrate everything to Python 3, install and prepare it
Install pyenv from the latest GitHub source
> git clone https://github.com/pyenv/pyenv.git ~/.pyenv

Add to the PATH
> vi ~/.bash_profile
PATH=$PATH:$HOME/.pyenv/bin
eval "$(pyenv init -)"
> . ~/.bash_profile

Check installation
>   pyenv -v
pyenv 1.2.14-8-g0e7cfc3

Check all the available versions; the latest are 3.7.5 and 3.8.0. Install these versions
https://www.python.org/downloads/

Some warning and possible dependencies
WARNING: The Python bz2 extension was not compiled. Missing the bzip2 lib?
WARNING: The Python readline extension was not compiled. Missing the GNU readline lib?
WARNING: The Python sqlite3 extension was not compiled. Missing the SQLite3 lib?

> sudo yum install bzip2-devel
> sudo yum install sqlite-devel
> sudo yum install readline-devel

>  pyenv install 3.8.0
>  pyenv install 3.7.5

>  pyenv versions
* system (set by /home/carl/.pyenv/version)
  3.7.5
  3.8.0

> pyenv global 3.8.0

>  python -V
Python 3.8.0

More Python Libraries
> pip install --upgrade pip
> pip install pandas
> pip install -U pandasql
Failed with
ModuleNotFoundError: No module named '_ctypes'

Solution:
> sudo yum install libffi-devel
That solves the problem; pandas and pandasql then install successfully.
> pip install pandas
> pip install -U pandasql

Prepare Zeppelin
> wget http://www.gtlib.gatech.edu/pub/apache/zeppelin/zeppelin-0.8.2/zeppelin-0.8.2-bin-all.tgz
> tar zxvf zeppelin-0.8.2-bin-all.tgz
> mv zeppelin-0.8.2-bin-all ~/tool/zeppelin-0.8.2
> sudo ln -s /home/carl/tool/zeppelin-0.8.2 /opt/zeppelin-0.8.2
> sudo ln -s /opt/zeppelin-0.8.2 /opt/zeppelin

Some Configuration for Zeppelin
> cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml
> cp conf/shiro.ini.template conf/shiro.ini
> cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh
Add users to the auth config
> vi conf/shiro.ini
[users]
carl = pass123, admin
kiko = pass123, admin
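Since shiro.ini is plain INI, the [users] block can be sanity-checked with Python's configparser; a small sketch with the entries above inlined as a string:

```python
import configparser

# inline copy of the [users] section above; in practice read the file
shiro = """[users]
carl = pass123, admin
kiko = pass123, admin
"""

cp = configparser.ConfigParser()
cp.read_string(shiro)

# in shiro.ini each value is "password, role1, role2, ..."
users = {name: [part.strip() for part in value.split(",")]
         for name, value in cp["users"].items()}
```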

Site configuration
> vi conf/zeppelin-site.xml
<property>
  <name>zeppelin.server.addr</name>
  <value>0.0.0.0</value>
  <description>Server binding address</description>
</property>
<property>
  <name>zeppelin.anonymous.allowed</name>
  <value>false</value>
  <description>Anonymous user allowed by default</description>
</property>

ENV configuration
> vi conf/zeppelin-env.sh
export SPARK_HOME="/opt/spark"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop/"

Start the Service
> bin/zeppelin.sh

> sudo bin/zeppelin-daemon.sh stop

> sudo bin/zeppelin-daemon.sh start

Visit these UI
http://rancher-worker1:9870/explorer.html#/
http://rancher-worker1:8080/#/

References:
Docker Permission
https://askubuntu.com/questions/477551/how-can-i-use-docker-without-sudo
JAVA_HOME
https://www.jenv.be/
https://github.com/jenv/jenv
https://stackoverflow.com/questions/28615671/set-java-home-to-reflect-jenv-java-version




