Classification(1) Find Phrases from String
1. Find Important Phrases in All the Content
Start My Local Zeppelin
> bin/zeppelin-daemon.sh start
My local Zeppelin connects to a YARN cluster running in VirtualBox, so I first need to start VirtualBox and the machines ubuntu-master, ubuntu-dev1 and ubuntu-dev2.
How to Load a Jar
z.load("org.scalaz:scalaz-core_2.10:7.2.0-M2")
How to Connect to S3
val rdd = sc.textFile("s3n://sillycat/jobs.csv")
How to Add a Custom Jar to Zeppelin
In the file zeppelin-env.sh:
export ZEPPELIN_JAVA_OPTS="-Dspark.jars=/home/spark-seed-assembly-0.0.1.jar,/home/classifier-assembly-1.0.jar"
A README.md Format Will Help a Lot
# Classification System #
### What is this repository for? ###
* NLP and classification
### How do I get set up? (TODO) ###
* Summary of set up
Special Characters in HTML
http://www.degraeve.com/reference/specialcharacters.php
Really Nice Code to Filter the Characters
Include TextMunging.scala
Include TextMungingSpec.scala
Get Phrases from One String
/**
 * Counts phrases using a sliding window.
 *
 * Example:
 * In: getPhrasesInTitle(Job("foo foo foo foo foo foo", ""), 2)
 * Out: Map(foo foo -> 5)
 *
 * In: getPhrasesInTitle(Job("foo foo foo foo foo foo bar foo", ""), 2)
 * Out: Map(foo foo -> 5, foo bar -> 1, bar foo -> 1)
 */
def getPhrasesInTitle(job: Job, numWordsInPhrase: Int) = {
  // The seed entry "" -> 0 only fixes the accumulator type; it is removed before returning.
  val phrases = job.title.split(" ").sliding(numWordsInPhrase).foldLeft(Map("" -> 0)) {
    (phraseCounts: Map[String, Int], phrase: Array[String]) =>
      if (phrase.size == numWordsInPhrase) {
        val str = phrase.mkString(" ")
        val count = phraseCounts.getOrElse(str, 0) + 1
        phraseCounts + (str -> count)
      } else {
        // A title shorter than the window yields one partial window; skip it.
        phraseCounts
      }
  }
  phrases - ""
}
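A minimal usage sketch of the function above. The real Job class is not shown in this post, so the case class here is a hypothetical stand-in, and the field name description is a guess. The function is repeated as a compact variant that starts from an empty map instead of seeding and removing the "" key, so the snippet can run on its own.

```scala
object PhraseCountDemo {
  // Hypothetical stand-in for the real Job class, which is not shown in this post.
  case class Job(title: String, description: String)

  // Compact variant of the sliding-window count: starts from an empty map.
  def getPhrasesInTitle(job: Job, numWordsInPhrase: Int): Map[String, Int] =
    job.title.split(" ").sliding(numWordsInPhrase).foldLeft(Map.empty[String, Int]) {
      (counts, window) =>
        if (window.length == numWordsInPhrase) {
          val phrase = window.mkString(" ")
          counts + (phrase -> (counts.getOrElse(phrase, 0) + 1))
        } else counts // title shorter than the window: nothing to count
    }

  def main(args: Array[String]): Unit = {
    println(getPhrasesInTitle(Job("foo foo foo foo foo foo bar foo", ""), 2))
    println(getPhrasesInTitle(Job("foo", ""), 2)) // title too short for a 2-word phrase
  }
}
```

A one-word title produces an empty map because the only window is shorter than the requested phrase length.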
One Map Operation: Removing a Key
scala> val m1 = Map("" -> 0, "s1" -> 1)
scala> val m2 = m1 - ""
m2: scala.collection.immutable.Map[String,Int] = Map(s1 -> 1)
scala> val m3 = m2 - "s1"
m3: scala.collection.immutable.Map[String,Int] = Map()
Merge Map
http://stackoverflow.com/questions/20047080/scala-merge-map
http://www.nimrodstech.com/scala-map-merge/
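With Scalaz on the classpath, merge the maps with map1 |+| map2. For Map[String, Int], values stored under the same key are added together.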
https://github.com/scalaz/scalaz
How to Add scalaz-core to Your Classpath
https://keramida.wordpress.com/2013/12/02/using-sbt-to-experiment-with-new-scala-libraries/
Directly on the Command Line
> wget http://central.maven.org/maven2/org/scalaz/scalaz-core_2.10/7.1.3/scalaz-core_2.10-7.1.3.jar
> scala -cp scalaz-core_2.10-7.1.3.jar
scala> import scalaz.Scalaz._
scala> val k1 = Map( "key"->1, "key22"->3)
k1: scala.collection.immutable.Map[String,Int] = Map(key -> 1, key22 -> 3)
scala> val k2 = Map( "key1"->11, "key122"->13)
k2: scala.collection.immutable.Map[String,Int] = Map(key1 -> 11, key122 -> 13)
scala> val k3 = k1 |+| k2
k3: scala.collection.immutable.Map[String,Int] = Map(key1 -> 11, key122 -> 13, key -> 1, key22 -> 3)
Or put the jars in one directory and pass that directory as a classpath wildcard (quoted, so the shell does not expand it):
> scala -cp "lib/*"
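The Whole Flow of Phrase Finding
Each item is turned into a map of phrase counts, for example
item = "foo foo foo foo" -> Map("foo foo" -> 3)
and the per-item maps are then merged into one:
items.map(item => getPhrasesInTitle(item, 2)).reduce(_ |+| _)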
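The flow can also be sketched without Scalaz. This is only an illustration under a few assumptions: titles are plain strings rather than Job objects, and phraseCounts and mergeCounts are hypothetical helper names. mergeCounts is a plain-Scala substitute for |+| on Map[String, Int], summing the counts of keys that appear in both maps.

```scala
object PhraseFlowDemo {
  // Count every n-word phrase in one title.
  def phraseCounts(title: String, n: Int): Map[String, Int] =
    title.split(" ").sliding(n)
      .filter(_.length == n) // drop the partial window of a too-short title
      .map(_.mkString(" "))
      .foldLeft(Map.empty[String, Int]) { (counts, phrase) =>
        counts + (phrase -> (counts.getOrElse(phrase, 0) + 1))
      }

  // Plain-Scala substitute for Scalaz's |+| on Map[String, Int].
  def mergeCounts(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (phrase, count)) =>
      acc + (phrase -> (acc.getOrElse(phrase, 0) + count))
    }

  def main(args: Array[String]): Unit = {
    val titles = List("foo foo foo foo", "ok hello", "foo foo ok hello")
    // Map each title to its phrase counts, then merge all maps into one.
    val total = titles.map(phraseCounts(_, 2)).reduce(mergeCounts)
    println(total)
  }
}
```

Because merging is associative, the same map and reduce shape carries over to a Spark RDD of titles.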
Scala Tips
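1. How to use _
var className: ClassName = _
is similar to
var className: ClassName = null
As the initializer of a var field, _ means the default value of the type: null for reference types, 0 for numeric types and false for Boolean.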
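A small sketch of the default initializer, assuming Scala 2 as used in this post. The Holder class and its fields are made up for illustration. The _ initializer is only allowed on var fields of a class, not on local variables.

```scala
object DefaultInitDemo {
  // Hypothetical class; each field gets the default value of its type.
  class Holder {
    var name: String = _   // reference type: starts as null
    var count: Int = _     // numeric type: starts as 0
    var ready: Boolean = _ // Boolean: starts as false
  }

  def main(args: Array[String]): Unit = {
    val h = new Holder
    println((h.name, h.count, h.ready))
  }
}
```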
2. foldLeft (/:), foldRight (:\) and fold
val numbers = List(5,1,3,3)
numbers.fold(0) { (z, i) =>
  z + i
}
fold starts from the initial value 0 and adds the first element of the list, giving 5. That result is then added to the next element, and so on until the list is exhausted, giving 12.
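To see the intermediate results of the fold, scanLeft can be used as a tracing aid. It is not part of the original example; it simply keeps every accumulator value that a left-to-right fold would pass along.

```scala
object FoldTraceDemo {
  val numbers = List(5, 1, 3, 3)

  // Every intermediate accumulator, starting with the initial value 0.
  val steps: List[Int] = numbers.scanLeft(0)(_ + _)

  // The final result only.
  val total: Int = numbers.fold(0)(_ + _)

  def main(args: Array[String]): Unit = {
    println(steps) // List(0, 5, 6, 9, 12)
    println(total) // 12
  }
}
```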
Another Use Case
class Foo(val name: String, val age: Int, val sex: Symbol)
object Foo {
  def apply(name: String, age: Int, sex: Symbol) = new Foo(name, age, sex)
}
val fooList = Foo("Carl", 33, 'male) :: Foo("Kiko", 23, 'female) :: Nil
// Build a list of display strings, one per Foo, from left to right.
val stringList = fooList.foldLeft(List[String]()) { (z, f) =>
  val title = f.sex match {
    case 'male => "Mr."
    case 'female => "Ms."
  }
  z :+ s"$title ${f.name}, ${f.age}"
} // stringList(0) is "Mr. Carl, 33"
foldLeft starts from the left and foldRight starts from the right. fold makes no guarantee about the order, so its operator should be associative.
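The difference in direction is easiest to see with an operator that is not associative. The sketch below wraps each step in parentheses so the nesting shows the evaluation order. The seed "z" and the letters are arbitrary.

```scala
object FoldOrderDemo {
  val letters = List("a", "b", "c")

  // foldLeft: the accumulator comes first and grows from the left.
  val left: String = letters.foldLeft("z")((acc, s) => "(" + acc + s + ")")

  // foldRight: the accumulator comes second and grows from the right.
  val right: String = letters.foldRight("z")((s, acc) => "(" + s + acc + ")")

  def main(args: Array[String]): Unit = {
    println(left)  // (((za)b)c)
    println(right) // (a(b(cz)))
  }
}
```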
3. Iterator.sliding
sliding[B >: A](size: Int, step: Int): size is the number of elements in each window, step is the distance between the starts of two consecutive windows (1 by default).
scala> (1 to 5).iterator.sliding(3).toList
res0: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))
scala> (1 to 5).iterator.sliding(4, 3).toList
res1: List[Seq[Int]] = List(List(1, 2, 3, 4), List(4, 5))
scala> (1 to 5).iterator.sliding(4, 3).withPartial(false).toList
res2: List[Seq[Int]] = List(List(1, 2, 3, 4))
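sliding is also available directly on collections such as arrays, which is what the phrase counting relies on. There is no withPartial in that case, so an input shorter than the window yields a single short window that has to be filtered out. A small sketch, where ngrams is a hypothetical helper name:

```scala
object SlidingDemo {
  // All n-word phrases of a text, in order of appearance.
  def ngrams(text: String, n: Int): List[String] =
    text.split(" ").sliding(n)
      .filter(_.length == n) // drop the partial window of a too-short input
      .map(_.mkString(" "))
      .toList

  // What sliding returns when the input is shorter than the window.
  val shortWindows: List[List[String]] =
    Array("hello").sliding(2).map(_.toList).toList

  def main(args: Array[String]): Unit = {
    println(ngrams("new york city jobs", 2)) // List(new york, york city, city jobs)
    println(shortWindows)                    // List(List(hello))
    println(ngrams("hello", 2))              // List()
  }
}
```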
References:
scala underscore
http://stackoverflow.com/questions/8000903/what-are-all-the-uses-of-an-underscore-in-scala
foldLeft
http://hongjiang.info/foldleft-and-foldright/
http://www.iteblog.com/archives/1228
sliding
http://daily-scala.blogspot.com/2009/11/iteratorsliding.html
http://hongjiang.info/scala-counting-reduplicated-character/
Reposted from sillycat.iteye.com/blog/2230117