Prediction(3)Model - Decision Tree
Error Message:
[error] (run-main-0) java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Solution:
Add these variables to your environment:
export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxx
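Alternatively, the same credentials can be set through the fs.s3n properties named in the error, directly on the SparkContext's Hadoop configuration. A minimal sketch (the key values are placeholders):
// Sketch: set the s3n properties from the error message on the Hadoop
// configuration instead of using environment variables.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxxxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxxxxxxxx")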
Exception:
numClasses: Int = 1500
categoricalFeaturesInfo: scala.collection.immutable.Map[Int,Int] = Map(0 -> 57, 1 -> 29674)
impurity: String = gini
maxDepth: Int = 5
maxBins: Int = 30000
java.lang.IllegalArgumentException: requirement failed: RandomForest/DecisionTree given maxMemoryInMB = 256, which is too small for the given features. Minimum value = 340
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:187)
Solution:
http://stackoverflow.com/questions/31965611/how-to-increase-maxmemoryinmb-for-decisiontree
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree$
This code helps; a Strategy-based way to raise maxMemoryInMB is sketched after the block.
val categoricalFeaturesInfo2 = Map[Int, Int](0 -> 57, 1 -> 29674)
val impurity2 = "variance"
val maxDepth2 = 5
val maxBins2 = 30000
val model2 = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo2, impurity2,
  maxDepth2, maxBins2)
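If trainRegressor still hits the memory limit, maxMemoryInMB can be raised explicitly through the Strategy API linked above, in the same way the classification example in step 5 does. A minimal sketch reusing the values just defined:
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Variance
// numClasses is 0 for regression; maxMemoryInMB is the knob the exception complains about.
val strategy2 = new Strategy(Algo.Regression, Variance, maxDepth2, 0, maxBins = maxBins2, categoricalFeaturesInfo = categoricalFeaturesInfo2, maxMemoryInMB = 1024)
val model2b = DecisionTree.train(trainingData, strategy2)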
Core Code for Decision Tree Training
1. Import the Classes
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.{Gini, Impurity}
2. Use a Case Class to Name the DataFrame Columns
case class City(cityName:String, cityCode:Double)
// A mutable counter captured in a distributed map is unreliable: each partition
// increments its own copy, so city codes can collide. zipWithIndex assigns a
// unique, stable index to every distinct city instead.
val cities_df = sqlContext.sql("select distinct(city) from jobs")
  .map(_.getString(0))
  .zipWithIndex()
  .map { case (name, index) => City(name, index.toDouble + 1) }
  .toDF()
//cities_df.count()
cities_df.registerTempTable("cities")
3. Use a Path Pattern to Load the Data
val date = "{20,21,22,23,24,25,26}"
val clicksRaw_df = sqlContext.load(s"s3n://xx-prediction-engine/xxx/decision_tree_data/clicks/2015/09/$date/*/*", "json")
Filter and aggregate the data:
clicksRaw_df.registerTempTable("clicks_raw")
val clicks_df = sqlContext.sql("select sum(count_num) as count_num,job_id as job_id from clicks_raw group by job_id")
clicks_df.registerTempTable("clicks")
val jobsRaw_df = sqlContext.load(s"s3n://xxx-prediction-engine/predictData/decision_tree_data/jobs_with_num/2015/09/$date/*", "json")
jobsRaw_df.registerTempTable("jobs_raw")
val jobsall_df = sqlContext.sql("select sum(num) as num, id as id, city as city, industry as industry from jobs_raw group by id, city, industry ")
val jobindustry_df = jobsall_df.filter(jobsall_df("num") > 6)
jobindustry_df.registerTempTable("jobindustry")
val jobs_df = sqlContext.sql("select id as id, city as city, industry as industry from jobindustry where industry is not null and industry <> '' ")
jobs_df.registerTempTable("jobs")
val total = jobsRaw_df.count()
val total7days = jobs_df.count()
val total1 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num = 1").count()
val total2 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num = 2").count()
val total5 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num > 2 and c.count_num < 5").count()
val total10 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num > 4 and c.count_num <= 10").count() // the original condition (> 4 and > 10) reduced to just > 10; the 5-10 range is the likely intent
println("total jobs= " + total)
println("total 7 days jobs= " + total7days)
println("1 clicks = " + total1 + " " + total1 * 100 / total7days + "%")
println("2 clicks = " + total2 + " " + total2 * 100 / total7days + "%")
println("5 clicks = " + total5 + " " + total5 * 100 / total7days + "%")
println("10 clicks = " + total10 + " " + total10 * 100 / total7days + "%")
4. Prepare the LabeledPoint
val data = sqlContext.sql("select j.id, j.industry, c.count_num, cities.cityCode from jobs as j left join cities as cities on j.city = cities.cityName left join clicks as c on j.id = c.job_id ").map( row=>{
//0 - id
//1 - industry
//2 - count
//3 - cityCode
val label = row.get(2) match {
  case s: Long => s.toDouble
  case _ => 0.0
}
val industry = java.lang.Double.parseDouble(row.getString(1))
// The default must be 0.0 (a Double), not 0 (an Int): otherwise the two match
// branches unify to AnyVal and Vectors.dense will not accept the value.
val cityCode = row.get(3) match {
  case s: Double => s
  case _ => 0.0
}
val features = Vectors.dense(industry, cityCode)
LabeledPoint(label, features)
})
// randomSplit normalizes its weights, so Array(0.2, 0.1) produces a 2:1 split
// (roughly two thirds training data, one third test data).
val splits = data.randomSplit(Array(0.2, 0.1))
val (trainingData, testData) = (splits(0), splits(1))
5. Decision Tree Classification and Regression
// Train a DecisionTree model for classification.
// categoricalFeaturesInfo marks both features as categorical, with 57 and 29674
// distinct values respectively (an empty map would mean all features are continuous).
val numClasses1 = 4000
val categoricalFeaturesInfo1 = Map[Int, Int](0 -> 57, 1 -> 29674)
val impurity1 = "gini"
val maxDepth1 = 5
val maxBins1 = 30000
val strategy = new Strategy(Algo.Classification, Gini, maxDepth1, numClasses1, maxBins = maxBins1, categoricalFeaturesInfo = categoricalFeaturesInfo1, maxMemoryInMB = 1024)
val model1 = DecisionTree.train(trainingData, strategy)
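A quick usage sketch: a trained DecisionTreeModel predicts a label for any single feature vector (the feature values below are made up for illustration):
// Hypothetical feature vector: industry code 12.0, city code 345.0.
val sampleFeatures = Vectors.dense(12.0, 345.0)
val predictedLabel = model1.predict(sampleFeatures)
println("predicted label = " + predictedLabel)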
// Train a DecisionTree model for regression.
// The same categorical feature arities are reused; for regression the impurity must be "variance".
val categoricalFeaturesInfo2 = Map[Int, Int](0 -> 57, 1 -> 29674)
val impurity2 = "variance"
val maxDepth2 = 5
val maxBins2 = 30000
val model2 = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo2, impurity2,
maxDepth2, maxBins2)
6. Evaluate the Model
// Evaluate model on test instances and compute test error
val labelAndPreds2 = testData.map { point =>
val prediction = model2.predict(point.features)
(point.label, prediction)
}
labelAndPreds2.filter(x => x._1 > 0.0).take(5).foreach { case (label, prediction) =>
  println("label = " + label + " predict = " + prediction)
}
println("=============================================")
labelAndPreds2.filter(x => x._1 == 0.0).take(5).foreach { case (label, prediction) =>
  println("label = " + label + " predict = " + prediction)
}
// Exact-match error is a crude metric for a regression model; an MSE sketch follows below.
val testErr = labelAndPreds2.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned regression tree model:\n" + model2.toDebugString)
References:
Decision Tree
http://spark.apache.org/docs/latest/mllib-guide.html
Factorization Machines
http://blog.csdn.net/itplus/article/details/40536025
http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application