Spark learning record - 1

MapReduce limitations

Only suited to "one-pass" computation

Hard to combine and nest operators

Cannot express iterative computation

========

Replication, serialization, and disk I/O make MapReduce slow

Complex applications, stream computing, and interactive queries are all slow because MapReduce lacks efficient data sharing

======

Every iteration requires replication and disk I/O

Interactive queries and online processing require disk I/O

======== Spark goals

Keep more data in memory to improve performance

Extend the MapReduce model to better support two common classes of analytics applications: 1, iterative algorithms (machine learning, graph processing) 2, interactive data mining

Improve programmability: 1, APIs that integrate multiple libraries 2, less code

======

Spark components

Spark SQL, Spark Streaming (real-time), GraphX, MLlib (machine learning)

======

It can run in several modes (a small sketch follows this list):

In its standalone cluster mode

On Hadoop YARN

On Apache Mesos

On Kubernetes

In the cloud
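
A minimal sketch (not from the original note) of how the mode is picked through the master URL passed to SparkConf; the host names and ports below are placeholders:

(// choosing a deployment mode via the master URL; needs org.apache.spark.SparkConf
SparkConf conf = new SparkConf().setAppName("mode-demo");
conf.setMaster("local[*]");                  // run locally, one worker thread per core
// conf.setMaster("spark://host:7077");      // standalone cluster mode (placeholder host)
// conf.setMaster("yarn");                   // Hadoop YARN
// conf.setMaster("mesos://host:5050");      // Apache Mesos (placeholder host)
// conf.setMaster("k8s://https://host:443"); // Kubernetes (placeholder API server address)
)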

==========

Data sources (a reading example follows this list):

1, local files, e.g. file:///opt/httpd/logs/access_log

2, Amazon S3

3, Hadoop Distributed File System (HDFS)

4, HBase, Cassandra, etc.
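
A small illustrative sketch, assuming an existing JavaSparkContext named sc (as in the later examples); the HDFS URI and S3 bucket are placeholders, and the s3a scheme assumes the Hadoop AWS connector is on the classpath (HBase and Cassandra are read through their own connectors rather than textFile):

(// reading text data from different sources; paths other than the local one are placeholders
JavaRDD<String> local = sc.textFile("file:///opt/httpd/logs/access_log");    // local file
JavaRDD<String> hdfs  = sc.textFile("hdfs://namenode:9000/logs/access_log"); // HDFS
JavaRDD<String> s3    = sc.textFile("s3a://my-bucket/logs/access_log");      // Amazon S3
)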

===========

Spark cluster

============

Spark workflow

First, create a SparkContext object (1, it tells Spark how and where to access the cluster; 2, it connects to different types of cluster managers, e.g. YARN, Mesos, or Spark's own standalone manager)

Then the cluster manager allocates resources

Finally, Spark executors run the computation and read the data blocks (a minimal sketch of this flow follows)
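
A minimal sketch of that flow, assuming local mode and a placeholder input file; it needs imports for SparkConf, JavaSparkContext, and JavaRDD from the Spark Java API:

(// workflow sketch: create a context, let the cluster manager allocate resources, run on executors
SparkConf conf = new SparkConf().setAppName("workflow-demo").setMaster("local[*]"); // local mode for illustration
JavaSparkContext sc = new JavaSparkContext(conf);   // 1. SparkContext: how and where to access the cluster
JavaRDD<String> lines = sc.textFile("data.txt");    // 2-3. executors read the blocks and run the computation
System.out.println("line count: " + lines.count()); // count() is an action, so the job actually runs here
sc.stop();
)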

==============

Worker nodes and executors

Worker nodes are the machines that run executors (1, each worker is a JVM process; 2, each worker can spawn multiple executors)

Executors run tasks (1, each runs in a child JVM; 2, executes one or more tasks in a thread pool); an executor-sizing example follows
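
Purely as an illustration (the numbers here are arbitrary, not values from the original note), executor count and size are usually requested through configuration:

(// ask the cluster manager for 4 executors, each with 2 task threads and 2 GB of memory
SparkConf conf = new SparkConf()
        .setAppName("executor-config-demo")
        .set("spark.executor.instances", "4")
        .set("spark.executor.cores", "2")
        .set("spark.executor.memory", "2g");
)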

 

 =========

Solution: Resilient Distributed Datasets (RDDs)

=========

RDD operations (a short transformation-vs-action example follows this list)

transformation: returns a new RDD; functions include map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, join

action: evaluates and returns a value; when an action method is called on an RDD, all of the queued data-processing queries are computed at once and the result value is returned; functions include

  reduce, collect, count, first, take, countByKey, foreach, saveAsTextFile
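
A short sketch of the difference, assuming an existing JavaSparkContext named sc and java.util.Arrays imported:

(// transformation vs. action
JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
JavaRDD<Integer> evens = nums.filter(x -> x % 2 == 0); // transformation: lazy, just returns a new RDD
long howMany = evens.count();                          // action: triggers the computation, returns 2
)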

============

How to use RDDs (an end-to-end word-count sketch follows this list)

1, create an RDD from a data source (1, from an existing collection such as a list or array; 2, by transforming another RDD; 3, or from another storage system such as HDFS)

2, apply transformations to the RDD

3, apply actions to the RDD
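
A minimal end-to-end sketch of the three steps as a word count, written against the Spark 2.x Java API; it assumes an existing JavaSparkContext named sc, imports for JavaPairRDD, scala.Tuple2, and java.util.Arrays, and placeholder input/output paths:

(// 1. create an RDD, 2. transform it, 3. run an action
JavaRDD<String> lines = sc.textFile("data.txt");                    // step 1: create from a data source
JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator()) // step 2: transformations
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);
counts.saveAsTextFile("counts-out");                                // step 3: action, writes the result
)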

=======

Creating an RDD

From HDFS, text files, Amazon S3, HBase, SequenceFiles, and other Hadoop input formats

(// create an RDD from a file
JavaRDD<String> distFile = sc.textFile("data.txt", 4); // split the RDD into 4 partitions
)

 

(// create an RDD from a collection
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
)

 

========

 
