mapreduce limitations
Suited only to one-pass ("single trip") batch computation
Combining and nesting operations is difficult to express
Iterations cannot be expressed
========
Replication, serialization, and disk I/O make mapreduce slow
Complex applications, stream computing, and interactive queries suffer because mapreduce lacks efficient data sharing and is slow
======
Each iteration requires replication and disk I/O
Interactive queries and online processing require disk I/O
======== spark goal
Keep more data in memory to improve performance
Extend the mapreduce model to better support two common classes of applications: 1, iterative algorithms (machine learning, graph processing) 2, interactive data mining
Improve programmability: 1, APIs in multiple languages 2, less code
======
spark components
Spark SQL, Spark Streaming (real-time), GraphX, MLlib (machine learning)
======
Spark can run in several modes:
In its standalone cluster mode
On Hadoop YARN
On Apache Mesos
On Kubernetes
In the cloud
==========
Data sources:
1, local files: file:///opt/httpd/logs/access_log
2, Amazon S3
3, Hadoop Distributed File System (HDFS)
4, HBase, Cassandra, etc.
===========
spark cluster
============
spark workflow
First create a SparkContext object (1, it tells Spark how and where to access the cluster; 2, it connects to different kinds of cluster managers, e.g. YARN, Mesos, or Spark's own standalone manager)
Then the cluster manager allocates resources
Finally Spark executors run the computation and read the data blocks
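The startup sequence above can be sketched in Java. This is a minimal sketch, assuming the spark-core dependency is on the classpath; the class name, method name, and app name are mine, and "local[*]" stands in for a real cluster manager URL:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkContextDemo {
    // Builds a SparkContext, lets the cluster manager allocate resources,
    // and runs a tiny job on the resulting executors.
    static long countSample() {
        // 1. Tell Spark how and where to access the cluster. "local[*]" runs
        //    an in-process master on all cores; on a real cluster this would
        //    be a YARN, Mesos, or standalone master URL.
        SparkConf conf = new SparkConf()
                .setAppName("ContextDemo")   // placeholder app name
                .setMaster("local[*]");
        // 2. Creating the context connects to the cluster manager,
        //    which allocates executors.
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // 3. Executors run the tasks of this small job and read the data.
            return sc.parallelize(Arrays.asList(1, 2, 3)).count();
        } finally {
            sc.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println(countSample());
    }
}
```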
==============
worker nodes and executors
Worker nodes are the machines that run executors (1, one worker per machine or per JVM process; 2, each worker can spawn multiple executors)
Executors run tasks (1, each runs in a child JVM; 2, executes one or more tasks in a thread pool)
=========
Solution: Resilient Distributed Datasets (RDDs)
=========
RDD operations
transformation: returns a new RDD; functions include map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, join
action: evaluates and returns a value; when an action method is called on an RDD, all queued data-processing is computed at once and the resulting value is returned; includes
reduce, collect, count, first, take, countByKey, foreach, saveAsTextFile
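The same lazy-transformation versus eager-action split exists in plain java.util.stream (intermediate vs. terminal operations), which makes a runnable stand-in when no Spark cluster is at hand. This sketch is an analogy, not Spark itself; the counter shows that nothing executes until the terminal call:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        // Intermediate ops (like RDD transformations) only build a pipeline.
        Stream<Integer> pipeline = data.stream()
                .map(x -> { calls.incrementAndGet(); return x * 2; })
                .filter(x -> x > 4);
        System.out.println("calls before terminal op: " + calls.get()); // 0
        // The terminal op (like an RDD action) triggers the whole pipeline.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("result: " + result);           // [6, 8, 10]
        System.out.println("calls after: " + calls.get());  // 5
    }
}
```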
============
How to use RDDs
1, create an RDD from a data source (1, from an existing collection such as a list or array; 2, by transforming another RDD; 3, from an external system such as HDFS)
2, apply transformations to the RDD
3, apply actions to the RDD
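The three steps can be put together as a word count, the classic example. A hedged sketch assuming the spark-core dependency is on the classpath; the class name, `count` helper, and sample lines are mine:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    static List<Tuple2<String, Integer>> count(List<String> lines) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("WordCount").setMaster("local[*]"));
        try {
            return sc.parallelize(lines)                                   // 1, RDD from a collection
                    .flatMap(l -> Arrays.asList(l.split(" ")).iterator())  // 2, transformations (lazy)
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKey(Integer::sum)
                    .collect();                                            // 3, action triggers evaluation
        } finally {
            sc.stop();
        }
    }

    public static void main(String[] args) {
        count(Arrays.asList("to be or", "not to be"))
                .forEach(t -> System.out.println(t._1 + ": " + t._2));
    }
}
```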
=======
Creating an RDD
From HDFS, text files, Amazon S3, HBase, sequence files, and other Hadoop input formats
(// create an RDD from a file
JavaRDD<String> distFile = sc.textFile("data.txt", 4); // split the RDD into four partitions
)
(// create an RDD from a collection
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
)
========