Classification(3)Generate Features and Stem Adjust the Model System
1. Scala Operation
String Method - contains
scala> val longContent = "carl love to study python, scala"
longContent: String = carl love to study python, scala
scala> longContent.contains("python")
res0: Boolean = true
Map Merge Function
Run the console from the project directory, which already has the scalaz jar among its dependencies.
> sbt console
scala> import scalaz.Scalaz._
import scalaz.Scalaz._
scala>
scala> val m1 = Map("0"->0, "1" ->1)
m1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1)
scala> val m2 = Map("2"->2)
m2: scala.collection.immutable.Map[String,Int] = Map(2 -> 2)
scala> val m3 = m1 |+| m2
m3: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
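The keys of m1 and m2 above do not overlap, so the merge looks like a plain union. With scalaz, |+| on a Map[String,Int] also adds the values of keys that appear in both maps. The sketch below reproduces that behaviour in plain Scala, without scalaz; the name mergeSum is hypothetical and only for illustration.

```scala
object MapMerge {
  // Merge two maps; when a key exists in both, add the two values.
  // This mimics what scalaz's |+| does for Map[String, Int].
  def mergeSum(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }
}
```

For example, MapMerge.mergeSum(Map("a" -> 1), Map("a" -> 2)) gives Map(a -> 3), which is handy for combining per-partition term counts.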
Map Operation
scala> m3
res1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
scala> m3 - "0"
res2: scala.collection.immutable.Map[String,Int] = Map(1 -> 1, 2 -> 2)
Magic scalaz
https://github.com/scalaz/scalaz
Sliding
scala> (1 to 5).iterator.sliding(3).toList
res3: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))
List Operation
scala> List(1,2,3).zip(List("one","two","three"))
res8: List[(Int, String)] = List((1,one), (2,two), (3,three))
Run with Assembly Jar
./spark-submit --num-executors 2 --driver-memory 2G --class com.sillycat.jobs.GenerateFeatureMap ${path_to_jar}
Nice Configuration in build.sbt
// There's a problem with jackson 2.5+ with Spark 1.4.1
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)
When building the assembly jar, mark Spark Core and the related modules as provided so that they are not bundled into it.
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.4.1" % "provided", // Apache v2
"org.apache.spark" %% "spark-mllib" % "1.4.1" % "provided" // Apache v2
)
2. Detailed Operations
GenerateFeatureMap
step1. Load the job info from S3 (title and description only) and cache() it
step2. Place the title and description in an object, using a regex to find the title and description again
step3. Normalize the String
For title: toLower -> filter all HTML -> stripChars, only keep [a-zA-Z\d\-]
For description: toLower -> filter URL -> filter HTML -> stripChar -> stripNumber
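A minimal sketch of the two normalization pipelines in step3, in plain Scala. The object and function names are hypothetical, and the regexes are simple stand-ins for whatever filters the real job uses. One assumption is worth calling out: dropped characters are replaced with a space rather than deleted, so that words stay separated for the tokenizer in step4.

```scala
object Normalize {
  // Title: toLower -> filter HTML -> keep only letters, digits and '-'.
  def normalizeTitle(raw: String): String = {
    val lower  = raw.toLowerCase
    val noHtml = lower.replaceAll("""<[^>]*>""", " ")
    // After toLowerCase, [a-zA-Z\d\-] reduces to [a-z\d\-].
    val kept   = noHtml.replaceAll("""[^a-z\d\-]""", " ")
    kept.trim.replaceAll("""\s+""", " ")
  }

  // Description: toLower -> filter URL -> filter HTML -> stripChar -> stripNumber.
  def normalizeDescription(raw: String): String = {
    val lower  = raw.toLowerCase
    val noUrl  = lower.replaceAll("""https?://\S+""", " ")
    val noHtml = noUrl.replaceAll("""<[^>]*>""", " ")
    // Assumption: stripChar and stripNumber together leave letters and '-'.
    val kept   = noHtml.replaceAll("""[^a-z\-]""", " ")
    kept.trim.replaceAll("""\s+""", " ")
  }
}
```

With these definitions, Normalize.normalizeTitle("<b>Senior</b> Big-Data Engineer (C++)!") yields "senior big-data engineer c".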
step4. Tokenize the String
We predefine a list of two-word and three-word phrases and store it in a text file.
For Title:
Find the phrases in the string that appear in the predefined list.
Convert the string to a list of words and phrases.
e.g. big data software engineer -> big, data, software, engineer, big data, software engineer
("big data" and "software engineer" are both in the predefined list)
For description:
Find the phrases in the string that appear in the predefined list.
Remove stop words, using a predefined stop-word list.
Apply the Porter Stemming Algorithm (https://github.com/dlwh/epic, PorterStemmer.scala).
Convert the string to a list of words and phrases.
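The tokenization in step4 can be sketched with sliding windows of 2 and 3 words, checking each window against the phrase list. This is only an illustration under assumptions: the names are hypothetical, the phrase list and stop-word list are passed in as in-memory sets rather than loaded from a text file, and Porter stemming is left out because it depends on an external class.

```scala
object Tokenize {
  // All 2-word and 3-word windows of `words` that are in the phrase list.
  private def matchedPhrases(words: List[String], phrases: Set[String]): List[String] =
    List(2, 3).flatMap { n =>
      words.sliding(n).filter(_.length == n).map(_.mkString(" ")).toList
    }.filter(phrases.contains)

  private def splitWords(text: String): List[String] =
    text.split("""\s+""").filter(_.nonEmpty).toList

  // Title: every single word, followed by every matched phrase.
  def tokenizeTitle(text: String, phrases: Set[String]): List[String] = {
    val words = splitWords(text)
    words ++ matchedPhrases(words, phrases)
  }

  // Description: phrases are matched first, then stop words are dropped
  // from the single words (stemming would follow here; omitted in this sketch).
  def tokenizeDescription(text: String, phrases: Set[String], stopWords: Set[String]): List[String] = {
    val words = splitWords(text)
    words.filterNot(stopWords.contains) ++ matchedPhrases(words, phrases)
  }
}
```

Calling Tokenize.tokenizeTitle("big data software engineer", Set("big data", "software engineer")) returns List(big, data, software, engineer, big data, software engineer), matching the example above.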
step5. Calculate IDF
TF-IDF: http://sillycat.iteye.com/blog/2231432
The document frequency DF(t, D) is the number of documents that contain term t.
|D| is the total number of documents in the corpus.
IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
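To make the smoothed formula concrete, here is a small sketch that computes the document frequency and then log((N + 1) / (DF + 1)) for each term, where N is the number of documents. It runs on local collections rather than on a Spark RDD, and the object and function names are hypothetical.

```scala
object Idf {
  // docs: one token list per document.
  // Returns term -> log((N + 1) / (DF(term) + 1)), N = number of documents.
  def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
    val n = docs.size
    // Count each term at most once per document.
    val df: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    df.map { case (t, d) => t -> math.log((n + 1.0) / (d + 1.0)) }
  }
}
```

A term that occurs in every document gets log(1) = 0, so it carries no weight; rarer terms get larger values.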
step6. Save File on S3
key, index, IDF
3. Classifier Model Training
step1. Load the featureMap, which was pre-calculated in the previous operation
step2. Binary Feature Extractor
step3. Load List of Jobs
step4. Train Minor
step5. Train Arbitrator
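The post does not show the binary feature extractor, so the following is only a guess at its usual meaning: each feature is 1.0 if the term occurs in the document and 0.0 otherwise, regardless of how often it occurs. The sketch assumes a hypothetical featureIndex map from term to a dense index in 0 until featureIndex.size, taken from the key and index columns saved in step6.

```scala
object BinaryFeatures {
  // Build a dense 0/1 vector: position i is 1.0 when the term mapped to
  // index i occurs at least once in `tokens`. Unknown terms are ignored.
  def extract(tokens: Seq[String], featureIndex: Map[String, Int]): Array[Double] = {
    val vec = Array.fill(featureIndex.size)(0.0)
    tokens.foreach { t =>
      featureIndex.get(t).foreach(i => vec(i) = 1.0)
    }
    vec
  }
}
```

With a large vocabulary a sparse vector would be a better fit than a dense array; the dense form is used here only to keep the sketch short.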
4. Classification System
MajorGroupClassificationSystem
MinorGroupClassificationSystem
References:
http://sillycat.iteye.com/blog/2230117
http://sillycat.iteye.com/blog/2231432
http://www.scalanlp.org/