sillycat
Prediction(1)Data Collection

 

All the data are in JSON format in S3 buckets.

We can verify and view the JSON data with this online tool:
http://www.jsoneditoronline.org/

I did the implementation in Zeppelin, which is a really useful tool.

Some of the important code is as follows:
val date_pattern = "2015/08/{17,18,19,20}" //week1
//val date_pattern = "2015/08/{03,04,05,06,07,08,09}"   //week2
//val date_pattern = "2015/{07/27,07/28,07/29,07/30,07/31,08/01,08/02}"
//val date_pattern = "2015/07/29"

val clicks = sqlContext.jsonFile(s"s3n://mybucket/click/${date_pattern}/*/*")

This code expands the date pattern in the path and loads all the matching files.
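To see which paths a date pattern covers, the brace alternation can be expanded by hand. The following is a minimal sketch in plain Scala, not the Hadoop glob implementation itself: the object GlobExpand and its expand function are hypothetical names introduced here, nested braces are not handled, and the * wildcards are left untouched for the file system to resolve.

```scala
object GlobExpand {
  /** Expands the first {a,b,c} group and recurses on the remainder.
    * Assumption: braces are balanced and not nested. */
  def expand(pattern: String): List[String] = {
    val open = pattern.indexOf('{')
    if (open < 0) List(pattern) // no alternation left
    else {
      val close = pattern.indexOf('}', open)
      require(close > open, s"unbalanced braces in: $pattern")
      val prefix = pattern.substring(0, open)
      val suffix = pattern.substring(close + 1)
      val alternatives = pattern.substring(open + 1, close).split(",", -1).toList
      for {
        alt  <- alternatives
        tail <- expand(suffix)
      } yield prefix + alt + tail
    }
  }

  def main(args: Array[String]): Unit = {
    val datePattern = "2015/08/{17,18,19,20}"
    // Prints one path per day; each still ends in the /*/* wildcards.
    expand(s"s3n://mybucket/click/$datePattern/*/*").foreach(println)
  }
}
```

An alternative may itself contain a slash, which is why a pattern such as 2015/{07/27,08/01} can span two months.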

clicks.registerTempTable("clicks")
//applications.printSchema

This registers the data as a temporary table. The printSchema call (commented out here) prints the schema inferred from the JSON data.

val jobs = sc.textFile("s3n://mybucket/jobs/publishers/xxx.xml.gz")
import sqlContext.implicits._
val jobsDF = jobs.toDF()

This loads the gzip-compressed text file and converts it to a DataFrame.
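sc.textFile picks the decompression codec from the .gz extension and returns the content one line per record, so the XML arrives as individual lines rather than parsed elements. The sketch below illustrates the same line-by-line view with the JDK's gzip streams only, without Spark; the object GzipLines, its helper names, and the sample lines are assumptions made for illustration.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStreamReader}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object GzipLines {
  /** Compresses the given lines into an in-memory gzip payload. */
  def gzip(lines: Seq[String]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(buffer)
    try out.write(lines.mkString("\n").getBytes(UTF_8)) finally out.close()
    buffer.toByteArray
  }

  /** Decompresses a gzip payload and returns its content line by line. */
  def readLines(compressed: Array[Byte]): List[String] = {
    val reader = new BufferedReader(
      new InputStreamReader(new GZIPInputStream(new ByteArrayInputStream(compressed)), UTF_8))
    try Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
    finally reader.close()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample content standing in for the jobs feed.
    val payload = gzip(Seq("<jobs>", "  <job id=\"1\"/>", "</jobs>"))
    readLines(payload).foreach(println)
  }
}
```

A gzip file cannot be split, so a single large .gz input is read by one task; repartitioning after loading is one way to spread later work across the cluster.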

%sql
select SUBSTR(timestamp,0,10), job_id, count(*) from applications  group by SUBSTR(timestamp,0,10), job_id

The %sql interpreter lets us write SQL queries and displays the results below the paragraph as a table or graph.

val clickDF = sqlContext.sql("select SUBSTR(timestamp,0,10) as click_date, job_id, count(*) as count from clicks where SUBSTR(timestamp,0,10)='2015-08-20'  group by SUBSTR(timestamp,0,10), job_id")

import org.apache.spark.sql.functions._

val clickFormattedDF = clickDF.orderBy(asc("click_date"),desc("count"))

These commands run the query and sort the resulting DataFrame by date (ascending) and count (descending).
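What the query and the orderBy compute can be checked on a small in-memory sample. This is a sketch with plain Scala collections rather than Spark: the case classes Click and DailyCount, the function dailyCounts, and the sample timestamps are hypothetical, and it assumes (as the SQL does) that each timestamp starts with a yyyy-MM-dd date.

```scala
object ClickCounts {
  final case class Click(timestamp: String, jobId: String)
  final case class DailyCount(clickDate: String, jobId: String, count: Int)

  /** Counts clicks per (date, job) for one day, sorted by date ascending
    * and count descending, mirroring the SQL query plus orderBy. */
  def dailyCounts(clicks: Seq[Click], day: String): List[DailyCount] =
    clicks
      .map(c => (c.timestamp.take(10), c.jobId))   // first 10 chars = date part
      .filter { case (date, _) => date == day }    // WHERE ... = day
      .groupBy(identity)                            // GROUP BY date, job_id
      .map { case ((date, jobId), group) => DailyCount(date, jobId, group.size) }
      .toList
      .sortBy(r => (r.clickDate, -r.count))        // asc(date), desc(count)

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      Click("2015-08-20T10:00:00Z", "job-a"),
      Click("2015-08-20T11:30:00Z", "job-a"),
      Click("2015-08-20T12:15:00Z", "job-b"),
      Click("2015-08-19T09:00:00Z", "job-a")
    )
    dailyCounts(sample, "2015-08-20").foreach(println)
  }
}
```

Rows with equal counts come out in no particular order here, which matches the DataFrame version, where ties are not ordered either.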

val appFile = "s3n://mybucket/date_2015_08_20"
clickFormattedDF.printSchema
sc.parallelize(clickFormattedDF.collect, 1).saveAsTextFile(appFile)

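This writes the data back to S3. Because the rows are collected and re-parallelized into a single partition, the output is one part file.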

The status of the Hadoop cluster can be checked here:
http://localhost:9026/cluster

Once the Spark context has started, we can visit this URL to see the status of the Spark jobs:
http://localhost:4040/

References:
http://www.jsoneditoronline.org/
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
https://gist.github.com/bigsnarfdude/d9c0ceba1aa8c1cfa4e5
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.DataFrame