Prediction(3)Model - Decision Tree
Error Message:
[error] (run-main-0) java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Solution:
Add these to your environment:
export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxx
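The same s3n properties named in the error can also be set programmatically on the SparkContext's Hadoop configuration; a minimal sketch (the key values are placeholders):
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxxxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxxxxxxxx")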
Exception:
numClasses: Int = 1500
categoricalFeaturesInfo: scala.collection.immutable.Map[Int,Int] = Map(0 -> 57, 1 -> 29674)
impurity: String = gini
maxDepth: Int = 5
maxBins: Int = 30000
java.lang.IllegalArgumentException: requirement failed: RandomForest/DecisionTree given maxMemoryInMB = 256, which is too small for the given features. Minimum value = 340
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:187)
Solution:
http://stackoverflow.com/questions/31965611/how-to-increase-maxmemoryinmb-for-decisiontree
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree$
This code helps:
val categoricalFeaturesInfo2 = Map[Int, Int](0 -> 57, 1 -> 29674)
val impurity2 = "variance"
val maxDepth2 = 5
val maxBins2 = 30000
val model2 = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo2, impurity2, maxDepth2, maxBins2)
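If trainRegressor still hits the memory limit, the Strategy class linked above exposes maxMemoryInMB directly. A minimal sketch, assuming the Spark 1.3 MLlib API:
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Variance
// Regression strategy with a larger per-node aggregation budget
val regStrategy = new Strategy(Algo.Regression, Variance, maxDepth = 5, numClasses = 0, maxBins = 30000, categoricalFeaturesInfo = Map(0 -> 57, 1 -> 29674), maxMemoryInMB = 1024)
val regModel = DecisionTree.train(trainingData, regStrategy)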
Some Core Code Used During Decision Tree Training
1. Import the Classes
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.{Gini, Impurity}
2. Use a Case Class to Name the DataFrame Columns
case class City(cityName: String, cityCode: Double)
// A mutable counter incremented inside a distributed map runs per partition,
// so the codes it produces are unreliable; zipWithIndex assigns each distinct
// city a stable, unique code instead.
val cities_df = sqlContext.sql("select distinct(city) from jobs")
  .map(row => row.getString(0))
  .zipWithIndex()
  .map { case (name, idx) => City(name, idx + 1) }
  .toDF
//cities_df.count()
cities_df.registerTempTable("cities")
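To sanity-check the generated codes before joining, a hypothetical quick query (not part of the original flow):
sqlContext.sql("select cityName, cityCode from cities limit 5").show()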
3. Use a Path Pattern to Load the Data
val date = "{20,21,22,23,24,25,26}"
val clicksRaw_df = sqlContext.load(s"s3n://xx-prediction-engine/xxx/decision_tree_data/clicks/2015/09/$date/*/*", "json")
Filter and operate on the data:
clicksRaw_df.registerTempTable("clicks_raw")
val clicks_df = sqlContext.sql("select sum(count_num) as count_num,job_id as job_id from clicks_raw group by job_id")
clicks_df.registerTempTable("clicks")
val jobsRaw_df = sqlContext.load(s"s3n://xxx-prediction-engine/predictData/decision_tree_data/jobs_with_num/2015/09/$date/*", "json")
jobsRaw_df.registerTempTable("jobs_raw")
val jobsall_df = sqlContext.sql("select sum(num) as num, id as id, city as city, industry as industry from jobs_raw group by id, city, industry ")
val jobindustry_df = jobsall_df.filter(jobsall_df("num") > 6)
jobindustry_df.registerTempTable("jobindustry")
val jobs_df = sqlContext.sql("select id as id, city as city, industry as industry from jobindustry where industry is not null and industry <> '' ")
jobs_df.registerTempTable("jobs")
val total = jobsRaw_df.count()
val total7days = jobs_df.count()
val total1 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num = 1").count()
val total2 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num = 2").count()
val total5 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num > 2 and c.count_num < 5").count()
val total10 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num > 4 and c.count_num < 10").count() // upper bound assumed: the original "> 10" made the "> 4" clause redundant
println("total jobs= " + total)
println("total 7 days jobs= " + total7days)
println("1 clicks = " + total1 + " " + total1 * 100 / total7days + "%")
println("2 clicks = " + total2 + " " + total2 * 100 / total7days + "%")
println("5 clicks = " + total5 + " " + total5 * 100 / total7days + "%")
println("10 clicks = " + total10 + " " + total10 * 100 / total7days + "%")
4. Prepare the LabeledPoint
val data = sqlContext.sql("select j.id, j.industry, c.count_num, cities.cityCode from jobs as j left join cities as cities on j.city = cities.cityName left join clicks as c on j.id = c.job_id ").map( row => {
  // 0 - id, 1 - industry, 2 - count, 3 - cityCode
  // count_num is null for jobs with no clicks, so default the label to 0
  val label = row.get(2) match {
    case s: Long => s.toDouble
    case _ => 0.0
  }
  val industry = java.lang.Double.parseDouble(row.getString(1))
  // cityCode may be null after the left join; default to 0.0 so the match
  // result stays a Double (a bare 0 would widen the type to AnyVal)
  val cityCode = row.get(3) match {
    case s: Double => s
    case _ => 0.0
  }
  val features = Vectors.dense(industry, cityCode)
  LabeledPoint(label, features)
})
// randomSplit normalizes its weights, so Array(0.2, 0.1) yields a 2:1 split
val splits = data.randomSplit(Array(0.2, 0.1))
val (trainingData, testData) = (splits(0), splits(1))
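randomSplit also accepts a seed when the split needs to be reproducible; a hedged example with a more conventional 70/30 ratio and hypothetical names:
val Array(trainingData2, testData2) = data.randomSplit(Array(0.7, 0.3), seed = 11L)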
5. Decision Tree Classification and Regression
// Train a DecisionTree model for classification
// categoricalFeaturesInfo marks features 0 and 1 as categorical, with 57 and
// 29674 distinct values; an empty Map would mean all features are continuous.
val numClasses1 = 4000
val categoricalFeaturesInfo1 = Map[Int, Int](0 -> 57, 1 -> 29674)
val impurity1 = "gini"
val maxDepth1 = 5
val maxBins1 = 30000
val strategy = new Strategy(Algo.Classification, Gini, maxDepth1, numClasses1, maxBins = maxBins1, categoricalFeaturesInfo = categoricalFeaturesInfo1, maxMemoryInMB = 1024)
val model1 = DecisionTree.train(trainingData, strategy)
// Train a DecisionTree model for regression
// The same two features are treated as categorical here as well
val categoricalFeaturesInfo2 = Map[Int, Int](0 -> 57, 1 -> 29674)
val impurity2 = "variance"
val maxDepth2 = 5
val maxBins2 = 30000
val model2 = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo2, impurity2, maxDepth2, maxBins2)
6. Evaluate the Model
// Evaluate model on test instances and compute test error
val labelAndPreds2 = testData.map { point =>
val prediction = model2.predict(point.features)
(point.label, prediction)
}
labelAndPreds2.filter(x => x._1 > 0.0).take(5).foreach { case (label, prediction) =>
  println("label = " + label + " predict = " + prediction)
}
println("=============================================")
labelAndPreds2.filter(x => x._1 == 0.0).take(5).foreach { case (label, prediction) =>
  println("label = " + label + " predict = " + prediction)
}
val testErr = labelAndPreds2.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned regression tree model:\n" + model2.toDebugString)
Reference:
Decision Tree
http://spark.apache.org/docs/latest/mllib-guide.html
Factorization Machines
http://blog.csdn.net/itplus/article/details/40536025
http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application