Classification(2)NLP and Classifier Implementation

 

1. Generate the FeatureMap
NLP - Natural Language Processing
remove the noise: strip HTML tags and remove stop words (for example 'of' and 'a' in English, 的 and 啊 in Chinese)
stem the words (for example, change 'stopped' to 'stop')

NLP for Chinese
https://github.com/xpqiu/fnlp/

NLP for English
Stanford
http://nlp.stanford.edu/software/index.shtml
http://nlp.stanford.edu/software/corenlp.shtml
http://nlp.stanford.edu/software/segmenter.shtml
http://nlp.stanford.edu/software/tagger.shtml
http://nlp.stanford.edu/software/CRF-NER.shtml
http://nlp.stanford.edu/software/lex-parser.shtml
http://nlp.stanford.edu/software/classifier.shtml

Apache OpenNLP
http://opennlp.apache.org/

Remove Stop Words
One source for stop words:
https://raw.githubusercontent.com/muhammad-ahsan/WebSentiment/master/mit-Stopwords.txt
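
A minimal stop-word filtering sketch in Scala, assuming the list above has been downloaded to a local file; the file path and the sample tokens are placeholders:

import scala.io.Source

// Load the stop-word list (one word per line) into a Set for fast lookup.
val stopWords: Set[String] =
  Source.fromFile("mit-Stopwords.txt").getLines().map(_.trim.toLowerCase).filter(_.nonEmpty).toSet

// Drop stop words from an already tokenized document.
val tokens = Seq("the", "cat", "sat", "on", "the", "mat")
val filtered = tokens.filterNot(t => stopWords.contains(t.toLowerCase))
// filtered: List(cat, sat, mat)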

PorterStemmer
strips suffixes to reduce words to their stems, for example 'stopped' -> 'stop' and 'running' -> 'run'
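
A minimal stemming sketch; it assumes OpenNLP's opennlp.tools.stemmer.PorterStemmer (from the Apache OpenNLP project linked above) is on the classpath:

import opennlp.tools.stemmer.PorterStemmer

val stemmer = new PorterStemmer()
// The Porter algorithm strips suffixes rather than doing dictionary lookups.
stemmer.stem("running")   // run
stemmer.stem("stopped")   // stop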

coalesce function in Spark
decreases the number of partitions in the RDD to numPartitions.
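
For example, assuming a running SparkContext sc:

// Shrink a 100-partition RDD down to 10 partitions; unlike repartition,
// coalesce avoids a full shuffle when only decreasing the partition count.
val rdd = sc.parallelize(1 to 1000, 100)
val fewer = rdd.coalesce(10)
fewer.partitions.length   // 10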

TF-IDF
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
Term Frequency- Inverse Document Frequency

Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t,d) is the number of times that term t appears in document d.

The document frequency DF(t,D) is the number of documents that contains term t.

Inverse document frequency is a numerical measure of how much information a term provides:
IDF(t,D) = log ((|D| + 1) / (DF(t, D) + 1))
|D| is the total number of documents in the corpus.

DF: String -> Int
IDF: String -> Double (the log value)
IDFSwithIndex: String -> (Double, Index)
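
A minimal TF-IDF sketch using the RDD-based MLlib API from the link above (HashingTF computes TF(t,d), IDF computes IDF(t,D)); the two sample documents and the SparkContext sc are placeholders:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Each document is a sequence of cleaned, stop-word-filtered, stemmed terms.
val documents: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("spark", "svm", "classifier"),
  Seq("spark", "tf", "idf")))

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)   // TF(t,d)

tf.cache()                                             // IDF makes two passes over the data
val idf = new IDF().fit(tf)                            // IDF(t,D)
val tfidf: RDD[Vector] = idf.transform(tf)             // TF(t,d) * IDF(t,D)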

2. Generate Training Data

It seems that Zeppelin can load jars from a remote repository:
z.load("com.amazonaws:aws-java-sdk:1.10.4.1")

Amazon S3 Operation
import java.io.File
import com.amazonaws.services.s3._
import com.amazonaws.services.s3.model._
import com.amazonaws.services.s3.transfer.TransferManager
import com.amazonaws.services.s3.transfer.Upload

/**
*  Upload a file to S3
*/
def uploadToS3(client: AmazonS3Client, bucket: String, key: String, file: File): Unit = {
    val tm = new TransferManager(client)  // reuse the provided S3 client instead of creating one with default credentials
    val upload = tm.upload(bucket, key, file)
    upload.waitForCompletion()
}

/**
*  Read a file's contents from S3
*/
def readFileContentsFromS3(client: AmazonS3Client, bucket: String, key: String): String = {
    val getObjectRequest = new GetObjectRequest(bucket, key)
    val responseHeaders = new ResponseHeaderOverrides()
    responseHeaders.setCacheControl("No-cache")
    getObjectRequest.setResponseHeaders(responseHeaders)

    val objectStream = client.getObject(getObjectRequest).getObjectContent()
    scala.io.Source.fromInputStream(objectStream).getLines().mkString("\n")
}
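
A usage sketch for the two helpers above; the credentials, bucket and key are placeholders (in practice the credentials would come from the environment or an IAM role):

import java.io.File
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client

val client = new AmazonS3Client(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"))

uploadToS3(client, "my-bucket", "features/feature-map.txt", new File("/tmp/feature-map.txt"))
val contents = readFileContentsFromS3(client, "my-bucket", "features/feature-map.txt")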

FeatureMap and Job
FeatureMap reads the feature files.
Job parses the raw XML data into objects and extracts the features (GetFeatures).

BinaryFeatureExtractor
Local vector: Vectors.sparse(size, sortedElems)
Calculate the binary labels and upload them to S3.
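
A small sketch of building such a sparse binary vector with MLlib; the feature-space size and the indices (positions of the FeatureMap features that occur in the document) are made up for illustration:

import org.apache.spark.mllib.linalg.Vectors

// Feature space of size 10; features 1, 4 and 7 are present, all others absent.
val sortedElems = Seq((1, 1.0), (4, 1.0), (7, 1.0))
val binaryVector = Vectors.sparse(10, sortedElems)
// binaryVector: (10,[1,4,7],[1.0,1.0,1.0])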

TFFeatureExtractor

TFIDFFeatureExtractor
TFIDF(t,d,D) = TF(t,d)*IDF(t,D)
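
The same definitions as a small standalone sketch, counting TF and DF directly over an in-memory corpus just to make the formulas concrete:

// d is one document (a sequence of terms), docs is the corpus D.
def tf(t: String, d: Seq[String]): Double = d.count(_ == t).toDouble
def df(t: String, docs: Seq[Seq[String]]): Int = docs.count(_.contains(t))
def idf(t: String, docs: Seq[Seq[String]]): Double =
  math.log((docs.size + 1.0) / (df(t, docs) + 1.0))
def tfidf(t: String, d: Seq[String], docs: Seq[Seq[String]]): Double =
  tf(t, d) * idf(t, docs)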

3. Classifier

UniformFoldingMechanism
validation code block:
    val msg = (positive, negative) match {
      case _ if folds <= 0 =>
        s"Invalid number of folds ($folds); Must be a positive integer."
      case _ if negative.isEmpty || positive.isEmpty =>
        "Insufficient number of samples " +
        s"(# positive: ${positive.size}, # negative: ${negative.size})!"
      case _ if positive.size < folds =>
        s"Insufficient number of positive samples (${positive.size}); " +
        s"Must be >= number of folds ($folds)!"
      case _ if negative.size < folds =>
        s"Insufficient number of negative samples (${negative.size}); " +
        s"Must be >= number of folds ($folds)!"
      case _ =>
        ""
    }

    isNullOrEmpty(msg) match {
      case false =>
        logger.error("Fold validation failed!")
        Some(new RuntimeException(msg))
      case true =>
        logger.info("Fold validation succeeded!")
        None
    }

Merge the data and format them.

KFoldCrossValidator
Generate the TrainableSVM -> TrainedSVM
Validate -> ModelMetrics
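
A rough sketch of the same idea with plain MLlib (not the actual KFoldCrossValidator): MLUtils.kFold splits the labeled data, SVMWithSGD is trained on each training split, and BinaryClassificationMetrics scores the held-out split; the fold count and iteration count are placeholders:

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

def crossValidate(data: RDD[LabeledPoint], folds: Int, numIterations: Int): Double = {
  val splits = MLUtils.kFold(data, folds, 42)          // Array of (training, validation) pairs
  val aucs = splits.map { case (training, validation) =>
    val model = SVMWithSGD.train(training, numIterations)
    model.clearThreshold()                             // emit raw scores instead of 0/1 labels
    val scoreAndLabels = validation.map(p => (model.predict(p.features), p.label))
    new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
  }
  aucs.sum / aucs.length                               // mean AUC across the folds
}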


Scala Tips:
1. String Tail and Init
scala> val s = "123456"
s: String = 123456

scala> val s1 = s.tail
s1: String = 23456

scala> val s2 = s.init
s2: String = 12345

2. Tuple2
scala> val stuff = (42, "fish")
stuff: (Int, String) = (42,fish)

scala> stuff.getClass
res2: Class[_ <: (Int, String)] = class scala.Tuple2

scala>

scala> stuff._1
res3: Int = 42

scala> stuff._2
res4: String = fish

3. Scala Shuffle
scala> util.Random.shuffle(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
res8: List[Int] = List(7, 1, 3, 9, 5, 8, 2, 6, 4)

scala> util.Random.shuffle(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
res9: List[Int] = List(5, 1, 2, 6, 9, 4, 8, 7, 3)

4. Scala Grouped
scala> List(1,2,3,4,5,6,7,8,9,10,11,12,13).grouped(4).toList
res11: List[List[Int]] = List(List(1, 2, 3, 4), List(5, 6, 7, 8), List(9, 10, 11, 12), List(13))

5. Scala List Zip
scala> List(1,2,3).zip(List("one","two","three"))
res12: List[(Int, String)] = List((1,one), (2,two), (3,three))
scala> List(1,2,3).zip(List("one","two","three", "four"))
res13: List[(Int, String)] = List((1,one), (2,two), (3,three))

6. List Operation
scala> val s1 = List(1, 2, 3, 4, 5, 6, 7).splitAt(3)
s1: (List[Int], List[Int]) = (List(1, 2, 3),List(4, 5, 6, 7))

scala> val t1 = s1._1.last
t1: Int = 3

scala> val t2 = s1._1.init
t2: List[Int] = List(1, 2)

scala> val t2 = s1._2
t2: List[Int] = List(4, 5, 6, 7)

References:
http://www.fnlp.org/archives/4231

example
http://www.cnblogs.com/linlu1142/p/3292982.html
http://fuhao-987.iteye.com/blog/891697
