Classification(1)Find Phrases from String

 

1. Find Important Phrases in All the Content
Start my Local Zeppelin
> bin/zeppelin-daemon.sh start

My local Zeppelin connects to a YARN cluster running in VirtualBox, so I first need to start VirtualBox and the ubuntu-master, ubuntu-dev1 and ubuntu-dev2 machines.

How to Load Jar
z.load("org.scalaz:scalaz-core_2.10:7.2.0-M2")

How to Connect to S3
val rdd = sc.textFile("s3n://sillycat/jobs.csv")

How to Add a Custom Jar to Zeppelin
In the file zeppelin-env.sh:
export ZEPPELIN_JAVA_OPTS="-Dspark.jars=/home/spark-seed-assembly-0.0.1.jar,/home/classifier-assembly-1.0.jar"

The README.md Format Helps a Lot
# Classification System #

### What is this repository for? ###

* NLP and classification

### How do I get set up? (TODO)###

* Summary of set up

Special Character in HTML
http://www.degraeve.com/reference/specialcharacters.php

Really Nice Code to Filter the Characters
IncludetextMunging.scala
IncludeTextMungingSpec.scala

Get Phrases from One String
/**
 * Counts phrases using a sliding window.
 *
 * Example:
 * In:  getPhrasesInTitle(Job("foo foo foo foo foo foo", ""), 2)
 * Out: Map(foo foo -> 5)
 *
 * In:  getPhrasesInTitle(Job("foo foo foo foo foo foo bar foo", ""), 2)
 * Out: Map(foo foo -> 5, foo bar -> 1, bar foo -> 1)
 */
def getPhrasesInTitle(job: Job, numWordsInPhrase: Int): Map[String, Int] = {
    // The "" -> 0 seed entry only fixes the accumulator type; it is removed before returning.
    val phrases = job.title.split(" ").sliding(numWordsInPhrase).foldLeft(Map("" -> 0)) {
        (phraseCounts: Map[String, Int], phrase: Array[String]) =>
            // A title shorter than the window yields a single partial window; skip it.
            if (phrase.size == numWordsInPhrase) {
                val str = phrase.mkString(" ")
                phraseCounts + (str -> (phraseCounts.getOrElse(str, 0) + 1))
            } else {
                phraseCounts
            }
    }
    phrases - ""
}
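The same count can also be written with groupBy instead of a fold. The sketch below is only an alternative formulation, not the code from the project: the Job case class is a hypothetical stand-in (only its title is used), and phrasesByGroup is a name made up for illustration.

```scala
// Hypothetical stand-in for the Job type used above; only `title` matters here.
case class Job(title: String, description: String)

object PhraseCount {
  // Counts every run of `n` consecutive words in the job title.
  def phrasesByGroup(job: Job, n: Int): Map[String, Int] =
    job.title.split(" ")
      .sliding(n)
      .filter(_.length == n)        // drop the partial window of a too-short title
      .map(_.mkString(" "))         // join each window back into one phrase
      .toList
      .groupBy(identity)            // phrase -> all of its occurrences
      .map { case (phrase, hits) => phrase -> hits.size }
}
```

Because nothing is seeded into the result, there is no "" entry to remove at the end.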

One Map Operation
scala> val m1 = Map("" -> 0, "s1" -> 1)
scala> val m2 = m1 - ""
m2: scala.collection.immutable.Map[String,Int] = Map(s1 -> 1)
scala> val m3 = m2 - "s1"
m3: scala.collection.immutable.Map[String,Int] = Map()

Merge Map
http://stackoverflow.com/questions/20047080/scala-merge-map
http://www.nimrodstech.com/scala-map-merge/

Then merge the maps with scalaz: map1 |+| map2. For Map[String, Int] the values of keys that appear in both maps are added together.

https://github.com/scalaz/scalaz
How to add scalaz-core to your classpath
https://keramida.wordpress.com/2013/12/02/using-sbt-to-experiment-with-new-scala-libraries/

Directly on the Command Line
> wget http://central.maven.org/maven2/org/scalaz/scalaz-core_2.10/7.1.3/scalaz-core_2.10-7.1.3.jar
> scala -cp scalaz-core_2.10-7.1.3.jar
scala> import scalaz.Scalaz._
scala> val k1 = Map( "key"->1, "key22"->3)
k1: scala.collection.immutable.Map[String,Int] = Map(key -> 1, key22 -> 3)
scala> val k2 = Map( "key1"->11, "key122"->13)
k2: scala.collection.immutable.Map[String,Int] = Map(key1 -> 11, key122 -> 13)
scala> val k3 = k1 |+| k2
k3: scala.collection.immutable.Map[String,Int] = Map(key1 -> 11, key122 -> 13, key -> 1, key22 -> 3)

Or put the jars in one directory and load them all (quote the wildcard so the shell does not expand it):
> scala -cp "lib/*"

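The Whole Flow of Phrase Finding
Each item is turned into a map of phrase counts, and the per-item maps are then merged into one:
item = "foo foo foo foo" --> a phrase-count map shaped like Map("foo foo" -> 4, "ok hello" -> 3)
items.map(item => getPhrasesInTitle(item, numWordsInPhrase)).reduce(_ |+| _)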
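The flow above can be sketched end to end without scalaz. This is a minimal sketch under stated assumptions: items are plain title strings rather than Job objects, merge is a hand-written stand-in for |+| on Map[String, Int], and the names PhraseFlow, countPhrases, merge and countAll are made up for illustration.

```scala
object PhraseFlow {
  type Counts = Map[String, Int]

  // Phrase counts for one title, using a sliding window of `n` words.
  def countPhrases(title: String, n: Int): Counts =
    title.split(" ").sliding(n).filter(_.length == n)
      .foldLeft(Map.empty[String, Int]) { (acc, words) =>
        val phrase = words.mkString(" ")
        acc + (phrase -> (acc.getOrElse(phrase, 0) + 1))
      }

  // Plain-Scala stand-in for |+|: counts of keys present in both maps are summed.
  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, v)) => acc + (k -> (acc.getOrElse(k, 0) + v)) }

  // Map every title to its counts, then combine the per-title maps.
  def countAll(titles: Seq[String], n: Int): Counts =
    titles.map(countPhrases(_, n)).foldLeft(Map.empty[String, Int])(merge)
}
```

For example, PhraseFlow.countAll(Seq("foo foo foo foo", "ok hello foo foo"), 2) gives "foo foo" -> 4, "ok hello" -> 1 and "hello foo" -> 1. The sketch uses foldLeft with an empty map instead of reduce so that an empty list of titles returns an empty map rather than throwing.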


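Scala Skill Tips
1. How to use _
var className: ClassName = _
initializes the field to the default value of its type (null for reference types, 0 for numeric types, false for Boolean). For a reference type it is therefore similar to
var className: ClassName = null
This form is only allowed for fields of a class or object, not for local variables.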
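A small sketch of the default values that the underscore initializer produces; the Holder class and its field names are made up for illustration.

```scala
object DefaultInit {
  class Holder {
    var name: String = _     // reference type: starts as null
    var count: Int = _       // numeric type: starts as 0
    var ready: Boolean = _   // Boolean: starts as false
  }
}
```

Creating new DefaultInit.Holder and reading the three fields shows null, 0 and false respectively.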

2. foldLeft (/:), foldRight (:\) and fold
val numbers = List(5,1,3,3)
numbers.fold(0) { (z, i) =>
     z+i
}
fold starts with the initial value 0 and adds the first element, giving 5. That result is then added to the next element, and so on through the list, for a final result of 12.
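To see the intermediate results, scanLeft can be used in place of the fold: it applies the same operation but keeps every accumulator value along the way. A minimal sketch (the object and value names are made up for illustration):

```scala
object FoldTrace {
  val numbers = List(5, 1, 3, 3)

  // scanLeft keeps each intermediate accumulator that a left fold would produce.
  val steps: List[Int] = numbers.scanLeft(0)(_ + _)   // List(0, 5, 6, 9, 12)

  // fold returns only the last of those values.
  val total: Int = numbers.fold(0)(_ + _)             // 12
}
```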

Another Use Case
class Foo(val name: String, val age: Int, val sex: Symbol)
object Foo {
     def apply(name: String, age: Int, sex: Symbol) = new Foo(name, age, sex)
}

val fooList = Foo("Carl", 33, 'male) :: Foo("Kiko", 23, 'female) :: Nil
val stringList = fooList.foldLeft(List[String]()) { (z, f) =>
     val title = f.sex match {
          case 'male => "Mr."
          case 'female => "Ms."
     }
     z :+ s"$title ${f.name}, ${f.age}"
}      // stringList(0) is "Mr. Carl, 33"

foldLeft processes the elements starting from the left, foldRight starts from the right, and fold makes no promise about the order, so its operation should be associative.
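The difference in direction only shows up with an operation where order matters, such as subtraction. A minimal sketch (the object and value names are made up for illustration):

```scala
object FoldOrder {
  val xs = List(1, 2, 3)

  // foldLeft nests to the left: ((0 - 1) - 2) - 3
  val left: Int = xs.foldLeft(0)(_ - _)    // -6

  // foldRight nests to the right: 1 - (2 - (3 - 0))
  val right: Int = xs.foldRight(0)(_ - _)  // 2
}
```

With an associative and commutative operation such as + all three variants give the same result, which is why fold is safe to use there.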

3. Iterator.sliding
sliding[B >: A](size: Int, step: Int): size is the number of elements in each window, step is how far the window moves each time (1 if omitted).
scala> (1 to 5).iterator.sliding(3).toList
res0: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))

scala> (1 to 5).iterator.sliding(4, 3).toList
res1: List[Seq[Int]] = List(List(1, 2, 3, 4), List(4, 5))

scala> (1 to 5).iterator.sliding(4, 3).withPartial(false).toList
res2: List[Seq[Int]] = List(List(1, 2, 3, 4))




References:
scala underscore
http://stackoverflow.com/questions/8000903/what-are-all-the-uses-of-an-underscore-in-scala
foldLeft
http://hongjiang.info/foldleft-and-foldright/
http://www.iteblog.com/archives/1228
sliding
http://daily-scala.blogspot.com/2009/11/iteratorsliding.html
http://hongjiang.info/scala-counting-reduplicated-character/