Classification(2)NLP and Classifier Implementation
1. Generate the FeatureMap
NLP - Natural Language Processing
Remove noise: strip HTML tags and drop stop words (for example, "of" and "a" in English; 的 and 啊 in Chinese).
Stemming: reduce inflected forms to a common stem (for example, "stopped" -> "stop").
NLP for Chinese
https://github.com/xpqiu/fnlp/
NLP for English
Stanford
http://nlp.stanford.edu/software/index.shtml
http://nlp.stanford.edu/software/corenlp.shtml
http://nlp.stanford.edu/software/segmenter.shtml
http://nlp.stanford.edu/software/tagger.shtml
http://nlp.stanford.edu/software/CRF-NER.shtml
http://nlp.stanford.edu/software/lex-parser.shtml
http://nlp.stanford.edu/software/classifier.shtml
apache NLP
http://opennlp.apache.org/
Remove Stop Word
One source of stop words:
https://raw.githubusercontent.com/muhammad-ahsan/WebSentiment/master/mit-Stopwords.txt
PorterStemmer
Reduces inflected words to a common stem, for example 'stopping' -> 'stop'. (Mapping 'ate' -> 'eat' is lemmatization, which a plain Porter stemmer does not do.)
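The preprocessing steps above (tag stripping, stop-word removal, stemming) can be sketched in plain Scala. The helper below is a hypothetical illustration with a crude suffix-stripper, not the real Porter algorithm:

```scala
// Minimal preprocessing sketch: lowercase, strip HTML tags, drop stop
// words, then strip a few common suffixes (a toy stand-in for Porter).
object Preprocess {
  val stopWords = Set("of", "a", "the", "in", "and")

  // Collapse a doubled final consonant, e.g. "stopp" -> "stop".
  def undouble(s: String): String =
    if (s.length >= 2 && s.last == s(s.length - 2)) s.init else s

  def crudeStem(w: String): String =
    if (w.endsWith("ing") && w.length > 5) undouble(w.dropRight(3))
    else if (w.endsWith("ed") && w.length > 4) undouble(w.dropRight(2))
    else if (w.endsWith("s") && w.length > 3) w.dropRight(1)
    else w

  def tokens(text: String): List[String] =
    text.toLowerCase
      .replaceAll("<[^>]*>", " ")     // remove HTML tags
      .split("[^a-z]+").toList        // split on non-letters
      .filter(w => w.nonEmpty && !stopWords.contains(w))
      .map(crudeStem)
}

println(Preprocess.tokens("<p>The stopped words of a test</p>"))
// -> List(stop, word, test)
```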
coalesce function in Spark
Decreases the number of partitions in the RDD to numPartitions.
TF-IDF
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
Term Frequency- Inverse Document Frequency
Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t,d) is the number of times that term t appears in document d.
The document frequency DF(t,D) is the number of documents that contains term t.
Inverse document frequency is a numerical measure of how much information a term provides:
IDF(t,D) = log ((|D| + 1) / (DF(t, D) + 1))
|D| is the total number of documents in the corpus.
DF: term (String) -> document count (Int)
IDF: term (String) -> log value (Double)
IDFsWithIndex: term (String) -> (log value (Double), feature index)
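The TF/IDF definitions above can be worked through on a toy corpus; the corpus and terms below are made up for illustration, and IDF uses the smoothed formula from the text, log((|D| + 1) / (DF(t, D) + 1)):

```scala
// Toy corpus: each document is a list of tokens.
val corpus: List[List[String]] = List(
  List("spark", "mllib", "tfidf"),
  List("spark", "classifier"),
  List("nlp", "classifier", "spark")
)

// TF(t, d): number of times term t appears in document d.
def tf(term: String, doc: List[String]): Double = doc.count(_ == term).toDouble

// DF(t, D): number of documents containing term t.
def df(term: String): Int = corpus.count(_.contains(term))

// Smoothed IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)).
def idf(term: String): Double = math.log((corpus.size + 1.0) / (df(term) + 1.0))

// TFIDF(t, d, D) = TF(t, d) * IDF(t, D).
def tfidf(term: String, doc: List[String]): Double = tf(term, doc) * idf(term)

val idfSpark = idf("spark")   // in every document: log(4/4) = 0
val idfNlp   = idf("nlp")     // in one document:   log(4/2) = log 2
println(f"IDF(spark) = $idfSpark%.4f, IDF(nlp) = $idfNlp%.4f")
```

A term appearing in every document (like "spark" here) gets IDF 0, so it contributes nothing to the feature vector; rare terms get the highest weight.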
2. Generate Training Data
Zeppelin can load a jar from a remote repository:
z.load("com.amazonaws:aws-java-sdk:1.10.4.1")
Amazon S3 Operation
import java.io.File
import com.amazonaws.services.s3._
import com.amazonaws.services.s3.model._
import com.amazonaws.services.s3.transfer.TransferManager

/**
 * Upload a file to S3 via TransferManager, reusing the passed-in client.
 */
def uploadToS3(client: AmazonS3Client, bucket: String, key: String, file: File): Unit = {
  val tm = new TransferManager(client)
  val upload = tm.upload(bucket, key, file)
  upload.waitForCompletion()
}

/**
 * Read a file's contents from S3 as a single string.
 */
def readFileContentsFromS3(client: AmazonS3Client, bucket: String, key: String): String = {
  val getObjectRequest = new GetObjectRequest(bucket, key)
  val responseHeaders = new ResponseHeaderOverrides()
  responseHeaders.setCacheControl("No-cache")
  getObjectRequest.setResponseHeaders(responseHeaders)
  val objectStream = client.getObject(getObjectRequest).getObjectContent()
  try scala.io.Source.fromInputStream(objectStream).getLines().mkString("\n")
  finally objectStream.close()
}
FeatureMap and Job
FeatureMap reads the feature files.
Job parses the raw XML data into objects and extracts their features (GetFeatures).
BinaryFeatureExtractor
Local Vector
Vectors.sparse(size, sortedElems)
Calculate the binary labels and upload them to S3.
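Building the local sparse vector can be sketched without Spark on the classpath; the feature map below is hypothetical, and the sorted (index, value) pairs are exactly what MLlib's Vectors.sparse expects:

```scala
// Hypothetical feature map: token -> feature index (built in step 1).
val featureMap: Map[String, Int] = Map("spark" -> 0, "nlp" -> 3, "classifier" -> 7)
val vectorSize = 10

// Binary features: 1.0 if the token is present, indices sorted ascending.
def binaryFeatures(tokens: Seq[String]): Seq[(Int, Double)] =
  tokens.distinct
    .flatMap(featureMap.get)     // drop tokens outside the feature map
    .sorted
    .map(idx => (idx, 1.0))

val elems = binaryFeatures(Seq("nlp", "spark", "unknown"))
// With Spark available this would become: Vectors.sparse(vectorSize, elems)
println(elems)   // -> List((0,1.0), (3,1.0))
```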
TFFeatureExtractor
TFIDFFeatureExtractor
TFIDF(t,d,D) = TF(t,d)*IDF(t,D)
3. Classifier
UniformFoldingMechanism
Validation code:
val msg = (positive, negative) match {
  case _ if folds <= 0 =>
    s"Invalid number of folds ($folds); Must be a positive integer."
  case _ if negative.isEmpty || positive.isEmpty =>
    "Insufficient number of samples " +
      s"(# positive: ${positive.size}, # negative: ${negative.size})!"
  case _ if positive.size < folds =>
    s"Insufficient number of positive samples (${positive.size}); " +
      s"Must be >= number of folds ($folds)!"
  case _ if negative.size < folds =>
    s"Insufficient number of negative samples (${negative.size}); " +
      s"Must be >= number of folds ($folds)!"
  case _ =>
    ""
}

isNullOrEmpty(msg) match {
  case false =>
    logger.error("Fold validation failed!")
    Some(new RuntimeException(msg))
  case true =>
    logger.info("Fold validation succeeded!")
    None
}
Merge the data and format them.
KFoldCrossValidator
Generate the TrainableSVM -> TrainedSVM
Validate -> ModelMetrics
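The uniform folding step validated above can be sketched in plain Scala (a hypothetical helper, not the project's actual UniformFoldingMechanism): shuffle positives and negatives separately, then deal them round-robin into k folds so every fold keeps roughly the same class balance.

```scala
import scala.util.Random

// Uniform k-fold split: each fold gets ~1/k of the positives and ~1/k of
// the negatives, so class balance is preserved across folds.
def uniformFolds[A](positive: Seq[A], negative: Seq[A], folds: Int,
                    seed: Long = 42L): Seq[Seq[A]] = {
  require(folds > 0, s"Invalid number of folds ($folds)")
  require(positive.size >= folds && negative.size >= folds,
    "Insufficient samples per class")
  val rnd = new Random(seed)
  def deal(xs: Seq[A]): Seq[Seq[A]] = {
    val shuffled = rnd.shuffle(xs)
    (0 until folds).map { i =>
      shuffled.zipWithIndex.collect { case (x, j) if j % folds == i => x }
    }
  }
  val (p, n) = (deal(positive), deal(negative))
  (0 until folds).map(i => p(i) ++ n(i))
}

val split = uniformFolds(positive = 1 to 6, negative = 101 to 106, folds = 3)
println(split.map(_.size))   // -> Vector(4, 4, 4): 2 positives + 2 negatives each
```

K-fold cross validation then trains on k-1 folds and validates on the held-out fold, rotating through all k folds.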
Scala Tips:
1. String Tail and Init
scala> val s = "123456"
s: String = 123456
scala> val s1 = s.tail
s1: String = 23456
scala> val s2 = s.init
s2: String = 12345
2. Tuple2
scala> val stuff = (42, "fish")
stuff: (Int, String) = (42,fish)
scala> stuff.getClass
res2: Class[_ <: (Int, String)] = class scala.Tuple2
scala>
scala> stuff._1
res3: Int = 42
scala> stuff._2
res4: String = fish
3. Scala Shuffle
scala> util.Random.shuffle(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
res8: List[Int] = List(7, 1, 3, 9, 5, 8, 2, 6, 4)
scala> util.Random.shuffle(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
res9: List[Int] = List(5, 1, 2, 6, 9, 4, 8, 7, 3)
4. Scala Grouped
scala> List(1,2,3,4,5,6,7,8,9,10,11,12,13).grouped(4).toList
res11: List[List[Int]] = List(List(1, 2, 3, 4), List(5, 6, 7, 8), List(9, 10, 11, 12), List(13))
5. Scala List Zip
scala> List(1,2,3).zip(List("one","two","three"))
res12: List[(Int, String)] = List((1,one), (2,two), (3,three))
scala> List(1,2,3).zip(List("one","two","three", "four"))
res13: List[(Int, String)] = List((1,one), (2,two), (3,three))
6. List Operation
scala> val s1 = List(1, 2, 3, 4, 5, 6, 7).splitAt(3)
s1: (List[Int], List[Int]) = (List(1, 2, 3),List(4, 5, 6, 7))
scala> val t1 = s1._1.last
t1: Int = 3
scala> val t2 = s1._1.init
t2: List[Int] = List(1, 2)
scala> val t2 = s1._2
t2: List[Int] = List(4, 5, 6, 7)
References:
http://www.fnlp.org/archives/4231
Examples:
http://www.cnblogs.com/linlu1142/p/3292982.html
http://fuhao-987.iteye.com/blog/891697