Day 7: Hadoop offline batch data analysis; Spark

[Spark]
* Environment configuration: install Spark - Local mode - OK
* Spark learning

@Scala environment:
1. Interactive shell: start with spark-shell (the default SparkContext `sc` comes built in).
   Command learning - test case:
   1) WordCount (a runnable spark-shell sketch follows after these notes):
      textFile("input"): read the data under the local folder "input";
      flatMap(_.split(" ")): flattening operation - split each line into words on the space delimiter;
      map((_, 1)): operate on each element - map every word to a (word, 1) tuple;
      reduceByKey(_ + _): aggregate by key, adding the values;
      collect: collect the data to the Driver and display it.

*** RDD:
1. Knowing RDD:
   Concepts:
   - A distributed collection of objects; in essence a read-only, partitioned record set. Each RDD is divided into partitions, each partition is a fragment of the data, and different partitions of one RDD may be stored on different nodes of the cluster, so it can be computed in parallel across nodes - a resilient distributed dataset.
   - RDD provides a highly constrained shared-memory model ????
   - RDD provides a rich set of common operations on the data.
   Operations:
   - Transformation: understood as read-only; creation: (parent RDD) -> (child RDD) - RDD in, RDD out; a "parent-child" dependency exists, i.e. a correspondence between parent and child RDD partitions.
   - Action: understood as - RDD in, value out.
   Official terms:
   - Partition: a logical slice of the data.
   - Operator: a series of operations (Transformations & Actions); goal - turn one RDD into another RDD.
   - Dependency: when one RDD is turned into another, the two are linked;
     narrow dependency: one-to-one correspondence between the partitions of the RDDs;
     wide dependency: each partition of the downstream RDD depends on every partition of the upstream RDD (also called the parent RDD) - a many-to-many relationship.
     (narrow vs. wide - understood as: classified by the correspondence between partitions)
   - Cache: purpose: make an RDD easy to reuse.
   - Checkpoint (see the cache/checkpoint sketch after these notes):
     problem: long-running iterative applications build up a long lineage; once an error occurs in a later iteration, the very long lineage must be replayed to rebuild the data, which hurts performance;
     purpose: fault tolerance;
     implementation: save the data to persistent storage and cut off the lineage; later reads take the data directly from the checkpoint.
   - Division of work: Application, Job, Stage, Task; Application -> Job -> Stage -> Task, each level a 1-to-n relationship.
     Stage division (1-2):
     1) can be derived from the DAG of dependencies;
     2) analyze the DAG in reverse: wherever there is a wide dependency, cut a stage boundary;
     3) task sets: each stage represents a set of related tasks with no shuffle dependencies between them; each task is handed by the task scheduler to an Executor on a worker node to execute.
2. RDD practice: lambda expressions; ???????????????
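The WordCount steps above, chained together as they would be typed in spark-shell. A minimal sketch, assuming an "input" folder of text files under the current working directory; `sc` is the SparkContext that spark-shell provides.

```scala
// Run inside spark-shell (sc is provided); "input" is an assumed local folder of text files.
val lines  = sc.textFile("input")          // read the files under the local folder "input"
val words  = lines.flatMap(_.split(" "))   // flatten: split each line into words on spaces
val pairs  = words.map((_, 1))             // map each word to a (word, 1) tuple
val counts = pairs.reduceByKey(_ + _)      // aggregate by key, summing the 1s
counts.collect().foreach(println)          // collect to the Driver and print
```

The same chain can also be typed as one line: sc.textFile("input").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().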
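A small sketch of cache vs. checkpoint as described in the notes above: cache keeps the RDD around for reuse while the lineage is preserved, checkpoint writes it to persistent storage and cuts the lineage. The checkpoint directory name and the sample data are assumptions.

```scala
// Sketch only: directory name and data are illustrative.
sc.setCheckpointDir("checkpoint-dir")       // where checkpointed data will be persisted

val nums    = sc.parallelize(1 to 1000, 4)  // an RDD with 4 partitions
val squares = nums.map(n => n * n)

squares.cache()                             // cache: keep in memory for reuse (lineage is kept)
squares.checkpoint()                        // checkpoint: save to storage and cut off the lineage

squares.count()                             // the first action materializes the cache and the checkpoint
println(squares.toDebugString)              // the lineage now reads from the checkpointed data
```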
(This is so hard...)

tuple <Key, Value>; RDD: the Map + Reduce idea.

RDD Transformations (a quick operator tour follows after these notes):
- flatMap (merge many small collections into one large collection); map; reduce; coalesce(numPartitions);
- Value type (partitions.size: view the number of partitions of an RDD):
  map(); mapPartitions(); mapPartitionsWithIndex(); flatMap(); glom(); groupBy(); filter(); sample(withReplacement, fraction, seed); distinct([numTasks]); repartition(numPartitions); sortBy(func, [ascending], [numTasks]);
- Two-Value type interactions (source RDD & parameter RDD):
  union(otherDataset); subtract(otherDataset); intersection(otherDataset); cartesian(otherDataset) - Cartesian product -> produces a series of tuples <a, b>; zip(otherDataset);
- Key-Value type:
  partitionBy(); groupByKey(); reduceByKey(func, [numTasks]); aggregateByKey(); ? foldByKey(); combineByKey(); ? sortByKey([ascending], [numTasks]); mapValues(); join(otherDataset, [numTasks]); cogroup(otherDataset, [numTasks]);

RDD Actions:
reduce(func); collect(); count(); first(); take(n); takeOrdered(n); aggregate; fold(num)(func); saveAsTextFile(path); saveAsSequenceFile(path); saveAsObjectFile(path); countByKey(); foreach(func);

RDD summary:
1. What does it solve? Efficient computation (over large data sets).
2. How does it achieve that? ??

2. Spark standalone applications (note: Java, Scala, and Python are supported) - a minimal Scala/sbt sketch follows after these notes:
* Scala: Method 1 - build by hand: the Scala build tool sbt + set up the project directory structure + the core code file -> package into a jar; Method 2 - via an IDE:
* Java: Method 1 - build by hand: the Java build tool Maven + set up a Maven project -> package into a jar; Method 2 - via an IDE:

@Python environment:
0. Configure the pyspark-related files;
1. Interactive shell: start with pyspark.

[New] Question 1: after transferring files to Linux /usr/local/..., commands only work when prefixed with sudo.
Solution: change the file owner to the current user: chown -R Kouch:Kouch ****
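A quick tour of a few of the operators listed above, as they could be tried in spark-shell. The sample data is made up; results are shown in comments and their ordering may differ from run to run.

```scala
// Value type
val rdd = sc.parallelize(List("spark", "hadoop", "spark", "hive"), 2)
println(rdd.partitions.size)                       // view the number of partitions -> 2

// Key-Value type
val pairs = rdd.map((_, 1))
pairs.reduceByKey(_ + _).collect()                 // Array((spark,2), (hadoop,1), (hive,1))
pairs.groupByKey().mapValues(_.sum).collect()      // same counts via groupByKey + mapValues

val other = sc.parallelize(List(("spark", "fast"), ("hive", "sql")))
pairs.join(other).collect()                        // inner join on the key

// Two-Value interaction between a source RDD and a parameter RDD
val a = sc.parallelize(1 to 5)
val b = sc.parallelize(3 to 8)
a.union(b).distinct().sortBy(identity).collect()   // Array(1, 2, 3, 4, 5, 6, 7, 8)

// Actions return values to the Driver instead of producing a new RDD
a.reduce(_ + _)                                    // 15
a.take(3)                                          // Array(1, 2, 3)
a.count()                                          // 5
```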
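For the standalone-application route, here is a minimal Scala sketch of the same WordCount packaged with sbt. The file path, object name, and any version numbers are assumptions and have to match the locally installed Spark and Scala.

```scala
// src/main/scala/WordCount.scala  (path and object name are assumed)
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Unlike spark-shell, a standalone app builds its own SparkContext.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    sc.textFile("input")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```

With a build.sbt that declares the spark-core dependency (e.g. libraryDependencies += "org.apache.spark" %% "spark-core" % "<your Spark version>"), `sbt package` produces the jar, which is then run with `spark-submit --class WordCount <path-to-jar>`.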