Spark 2017 BigData Update (3) Notebook Example

Zeppelin Tutorial / Basic Features (Spark)
The sample data is here: https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv
It has 17 columns and about 4,521 records: age, job, marital, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome, y.

The data format is as follows:
"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"
33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"no"

Load the data into a table
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// download the CSV over HTTP and parallelize its lines into an RDD
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// split each line on ";", drop the header row, strip the quotes, and build Bank rows
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt,
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
).toDF()
// registerTempTable is deprecated in Spark 2.x; createOrReplaceTempView replaces it
bank.registerTempTable("bank")

An RDD only knows its items, not their columns. A DataFrame knows both the items and the column structure (the schema), as the sketch below shows.
https://www.jianshu.com/p/c0181667daa0
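
Since Spark 2.0 the same DataFrame can be built directly from the CSV file, with Spark inferring the schema instead of us parsing lines by hand. A minimal sketch, assuming Spark 2.x (where Zeppelin exposes a SparkSession as spark) and a local copy of the file; the path /tmp/bank.csv is hypothetical:

// assumes Spark 2.x and a local copy of the file (hypothetical path)
val bankDF = spark.read
    .option("header", "true")      // the first line holds the column names
    .option("sep", ";")            // the file is semicolon-separated
    .option("inferSchema", "true") // let Spark derive the column types
    .csv("/tmp/bank.csv")

bankDF.printSchema()               // shows the inferred column structure
bankDF.createOrReplaceTempView("bank")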

Spark SQL
%sql
select age, count(1) value
from bank
where age < 100
group by age
order by age
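
The same aggregation can also be written against the DataFrame directly, without %sql. A minimal sketch in Scala, reusing the bank DataFrame built above:

// DataFrame API equivalent of the %sql query above
val byAge = bank
    .filter("age < 100")
    .groupBy("age")
    .count()           // adds a count column, like count(1)
    .orderBy("age")

byAge.show()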

Spark SQL with Parameters
%sql
select age, count(1) value
from bank
where age < ${maxAge=30}
group by age
order by age

Spark SQL with Select Parameters
%sql
select age, count(1) value
from bank
where marital="${marital=single,single|divorced|married}" and age < ${maxAge=30}
group by age
order by age
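
These dynamic forms can also be created programmatically through Zeppelin's ZeppelinContext (the z object available in Spark paragraphs). A minimal sketch, assuming Zeppelin's z.input and z.select APIs:

// textbox form with default value "30"
val maxAge = z.input("maxAge", "30")

// dropdown form: default value plus (value, displayed label) pairs
val marital = z.select("marital", "single", Seq(
    ("single", "single"),
    ("divorced", "divorced"),
    ("married", "married")))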

Python Example
https://distributesystem.wordpress.com/2016/04/12/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8/

Spark SQL - interactive queries, including Hive-compatible queries
Spark Streaming - real-time streaming data analysis
MLlib - machine learning algorithms
GraphX - graph processing algorithms

Spark Driver - hosts the SparkContext and schedules the work
Spark Executor - runs the tasks on the worker nodes

Spark Streaming - StreamingContext -> DStream (see the sketch below)
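
A minimal word-count sketch of the StreamingContext -> DStream flow in Scala, assuming a text source on localhost:9999 (hypothetical; e.g. started with nc -lk 9999):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// build a StreamingContext on the existing SparkContext, with 1-second batches
val ssc = new StreamingContext(sc, Seconds(1))

// DStream of text lines read from a socket source (hypothetical host/port)
val lines = ssc.socketTextStream("localhost", 9999)

// classic word count over each 1-second micro-batch
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until the stream is stopped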

Some Operations in Spark
Map - applies a function to each item in the RDD, producing a new RDD.
Example using Python:
%pyspark

x = sc.parallelize([1, 2, 3])
y = x.map(lambda x: (x, x**2))
print(x.collect())   # [1, 2, 3]
print(y.collect())   # [(1, 1), (2, 4), (3, 9)]

mapPartitions
Operates on the RDD one partition at a time
%pyspark

x = sc.parallelize([1,2,3,4], 2)
def f(iterator): yield sum(iterator)
y = x.mapPartitions(f)
print(x.glom().collect())  # glom() groups the elements of each partition into a list
print(y.glom().collect())

Output:
[[1, 2], [3, 4]]
[[3], [7]]

Filter
%pyspark

# filter keeps only the elements for which the predicate is true
x = sc.parallelize([1, 2, 3, 4, 5, 6])
y = x.filter(lambda x: x % 2 == 1)  # keeps odd elements, drops even ones
print(x.collect())        # [1, 2, 3, 4, 5, 6]
print(y.collect())        # [1, 3, 5]

Distinct
%pyspark

x = sc.parallelize([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
y = x.distinct()
print(x.collect())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print(y.collect())  # [2, 4, 1, 3, 5] (order is not guaranteed)

More Examples
https://spark.apache.org/docs/latest/rdd-programming-guide.html
https://spark.apache.org/docs/latest/sql-programming-guide.html

References:
http://sillycat.iteye.com/blog/2405875
http://sillycat.iteye.com/blog/2406113

https://distributesystem.wordpress.com/2016/04/12/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8/
https://distributesystem.wordpress.com/2016/04/13/python-spark-%E4%B8%A4%E5%A4%A9%E5%85%A5%E9%97%A8-%E7%AC%AC%E4%BA%8C%E5%A4%A9/
