spark notes 01

Day7 


Hadoop: offline batch data analysis;

spark 



[spark]

 * environment configuration:
    install Spark - Local mode is enough
 * spark learning
    @Scala environment:
        1 shell interactive environment
            start: spark-shell; (enters the Scala REPL by default):
            command learning:
                test case:
                    1 WordCount (see the sketch after these steps):
                        textFile("input"): read the data in the local folder "input";
                        flatMap(_.split(" ")): flattening operation, split each line on the space delimiter and map it into words;
                        map((_, 1)): operate on each element, mapping every word to a tuple;
                        reduceByKey(_ + _): aggregate by key, adding the values;
                        collect: collect the data to the Driver and display it on the terminal.
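                        A minimal sketch of this WordCount typed into spark-shell, assuming a local folder "input" containing text files:

                            // spark-shell provides `sc` (the SparkContext) by default
                            val lines  = sc.textFile("input")          // read every file under the local folder "input"
                            val words  = lines.flatMap(_.split(" "))   // flatten: split each line on spaces into words
                            val pairs  = words.map((_, 1))             // map each word to the tuple (word, 1)
                            val counts = pairs.reduceByKey(_ + _)      // aggregate by key: add up the 1s per word
                            counts.collect()                           // collect the result to the Driver and print it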

                *** RDD:
                     1 understanding RDD:
                        concept:
                            a distributed collection of objects;
                            in essence a read-only, partitioned collection of records; each RDD can be divided into multiple partitions,
                            each partition is a fragment of the data set,
                            and the different partitions of one RDD may be stored on different nodes in the cluster,
                            so it can be computed in parallel on different nodes of the cluster;
                            resilient distributed dataset;
                                RDD provides a highly constrained shared memory model ????;
                                RDD provides a rich set of common operations on the data;
                            transformation: understanding
                            read-only
                        understanding of the operations:
                            Transformation: from one (parent RDD) to one (child RDD) - input an RDD, output an RDD; a specific "parent-child" dependency exists: a correspondence between the parent and child RDD partitions;
                            Action: understanding - input an RDD, output a value;
                        official terminology:
                            
                            Partition:
                                a logical division of the data
                            operator:
                                understanding: the set of operations, Transformations & Actions;
                                      purpose - turn one RDD into another RDD;
                            
                            dependency:
                                when one RDD is turned into another RDD, the two are linked;
                                narrow dependency: a one-to-one correspondence between the RDDs' partitions;
                                wide dependency: each partition of the downstream (child) RDD may depend on every partition of the upstream RDD (also called the parent RDD); a many-to-many relationship;
                               (narrow vs. wide - understood by classifying the correspondence between partitions; see the sketch below)
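                                A small spark-shell sketch of how the two kinds of dependency show up in a lineage; toDebugString prints the chain and the shuffle boundary:

                                    val nums   = sc.parallelize(1 to 100, 4)                    // 4 partitions
                                    val mapped = nums.map(_ * 2)                                // narrow: child partition i depends only on parent partition i
                                    val byKey  = mapped.map(n => (n % 3, n)).reduceByKey(_ + _) // wide: a child partition may read from every parent partition (shuffle)
                                    println(byKey.toDebugString)                                // the ShuffledRDD in the output marks the wide dependency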

                            cache:
                                purpose: make an RDD easy to reuse;
                            
                            checkpoint
                                problem: for long-running iterative applications the lineage gets long; once an error occurs in a later iteration,
                                      the very long lineage has to be rebuilt, which hurts performance
                                purpose: fault tolerance
                                implementation: save the data to persistent storage and cut off the lineage; later reads take the data directly from the checkpoint (see the sketch below);
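                                A sketch of cache vs. checkpoint in spark-shell (the checkpoint directory name is an assumption; a local path works in Local mode):

                                    sc.setCheckpointDir("checkpoint-dir")   // where checkpointed data is persisted
                                    val pairs = sc.textFile("input").flatMap(_.split(" ")).map((_, 1))
                                    pairs.cache()                           // keep in memory for reuse; the lineage is kept
                                    pairs.checkpoint()                      // save to persistent storage and cut off the lineage
                                    pairs.count()                           // the first action materialises both
                                    pairs.reduceByKey(_ + _).collect()      // later jobs read from the cache / checkpoint, not from the input files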
                            
                            the division of tasks
                                Application, Job, Stage, Task;
                                Application -> Job -> Stage -> Task: each layer is a 1-to-n relationship
                                stage division (steps 1-2)
                                     1 the framework builds a DAG according to the dependencies;
                                     2 reverse analysis: wherever there is a wide dependency, the stage is cut open
                                     3 task sets: each stage represents a group of tasks with no shuffle dependencies between one another;
                                              each task is distributed by the task scheduler to an Executor on a worker node (Worker Node) to execute (see the sketch below)
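                                To make the 1-to-n layers concrete, a sketch using the WordCount pipeline from above:

                                    // Application: the running spark-shell session (or one submitted program)
                                    // Stage 0: textFile -> flatMap -> map  (narrow dependencies, pipelined together)
                                    // Stage 1: starts at reduceByKey, because the wide dependency forces a shuffle
                                    val wc = sc.textFile("input").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
                                    wc.collect()   // Action -> submits 1 Job with 2 stages; each stage runs one task per partition
                                    // the web UI at http://localhost:4040 shows the Job / Stage / Task breakdown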
                    2 RDD practice:
                        
                        lambda expressions; ??????????????? this is so hard
                    
                        tuple: <key, value>;
                        
                        RDD: embodies the Map + Reduce idea;
                        RDD Transformations:
                        flatMap (combine many small collections into one large collection)
                        map:
                        reduce the number of partitions:
                                coalesce(numPartitions);
                        
                            Value type (see the sketch after this list)
                                partitions.size: view the number of partitions of the RDD
                                map();
                                mapPartitions();
                                mapPartitionsWithIndex();
                                flatMap();
                                glom();
                                groupBy();
                                filter();
                                sample(withReplacement, fraction, seed);
                                distinct([numTasks]);
                                repartition(numPartitions);
                                sortBy(func, [ascending], [numTasks]);
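                                A sketch of a few of these single-value transformations in spark-shell (inputs chosen only for illustration):

                                    val rdd = sc.parallelize(1 to 10, 4)
                                    rdd.partitions.size                              // 4
                                    rdd.map(_ * 2).collect()                         // 2, 4, ..., 20
                                    rdd.mapPartitionsWithIndex((i, it) => it.map(x => (i, x))).collect()   // tag each value with its partition index
                                    rdd.glom().collect()                             // one Array per partition
                                    rdd.groupBy(_ % 2).collect()                     // grouped by the value the function returns
                                    rdd.filter(_ % 2 == 0).collect()                 // 2, 4, 6, 8, 10
                                    rdd.sample(false, 0.5, 1L).collect()             // withReplacement = false, fraction = 0.5, seed = 1
                                    rdd.distinct().collect()
                                    rdd.repartition(2).partitions.size               // 2
                                    rdd.sortBy(x => x, ascending = false).collect()  // 10, 9, ..., 1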
                            double-Value type interactions: source RDD & parameter RDD (sketch after this list)
                                union(otherDataset);
                                subtract(otherDataset);
                                intersection(otherDataset);
                                cartesian(otherDataset); Cartesian product -> generates a series of tuples <a, b>
                                zip(otherDataset);
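                                A sketch of the two-RDD operations (source RDD a, parameter RDD b):

                                    val a = sc.parallelize(List(1, 2, 3, 4), 2)
                                    val b = sc.parallelize(List(3, 4, 5, 6), 2)
                                    a.union(b).collect()          // 1, 2, 3, 4, 3, 4, 5, 6  (duplicates are kept)
                                    a.subtract(b).collect()       // 1, 2                    (in a but not in b)
                                    a.intersection(b).collect()   // 3, 4
                                    a.cartesian(b).collect()      // all tuples (x, y): 16 of them here
                                    a.zip(b).collect()            // pairs by position: (1,3), (2,4), (3,5), (4,6)
                                                                  // zip needs the same number of partitions and of elements per partition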
                            Key-Value type (sketch after this list)
                                partitionBy();
                                groupByKey();
                                reduceByKey(func, [numTasks]);
                                aggregateByKey(); ?
                                foldByKey();
                                combineByKey(); ?
                                sortByKey([ascending], [numTasks]);
                                mapValues();
                                join(otherDataset, [numTasks]);
                                cogroup(otherDataset, [numTasks]);
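                                A sketch of a few of the Key-Value transformations in spark-shell:

                                    val kv = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3), ("c", 4)), 2)
                                    kv.reduceByKey(_ + _).collect()                // (a,4), (b,2), (c,4)
                                    kv.groupByKey().mapValues(_.sum).collect()     // same result, but shuffles every value
                                    kv.foldByKey(0)(_ + _).collect()               // like reduceByKey with a zero value
                                    kv.aggregateByKey(0)((acc, v) => math.max(acc, v), _ + _).collect()   // per-partition max, then summed across partitions
                                    kv.sortByKey().collect()
                                    kv.mapValues(_ * 10).collect()
                                    val other = sc.parallelize(List(("a", "x"), ("b", "y")))
                                    kv.join(other).collect()                       // (a,(1,x)), (a,(3,x)), (b,(2,y))
                                    kv.cogroup(other).collect()                    // per key: (values from kv, values from other)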
                        RDD Action (see the sketch after this list)
                            reduce(func);
                            collect();
                            count();
                            first();
                            take(n);
                            takeOrdered(n);
                            aggregate;
                            fold(num)(func);
                            saveAsTextFile(path);
                            saveAsSequenceFile(path);
                            saveAsObjectFile(path);
                            countByKey();
                            foreach(func);
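                            A sketch of these actions in spark-shell (the output path is hypothetical):

                                val nums = sc.parallelize(List(5, 3, 8, 1), 2)
                                nums.reduce(_ + _)                 // 17
                                nums.collect()                     // Array(5, 3, 8, 1)
                                nums.count()                       // 4
                                nums.first()                       // 5
                                nums.take(2)                       // Array(5, 3)
                                nums.takeOrdered(2)                // Array(1, 3)
                                nums.fold(0)(_ + _)                // 17 (reduce with a zero value)
                                nums.aggregate(0)(_ + _, _ + _)    // 17 (the accumulator may even have a different type)
                                nums.saveAsTextFile("out-text")    // one part-file per partition
                                nums.map(n => (n % 2, n)).countByKey()   // Map(1 -> 3, 0 -> 1)
                                nums.foreach(println)              // runs on the executors, not on the Driver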
                    RDD summary:
                        1 what does it solve?
                            efficient computation (over large data sets);
                        2 how is it achieved?
                            ?
                            
        
        2 Spark standalone applications (note: Java, Scala, and Python are supported)
             * Scala:
                Method 1 - build manually:
                    the Scala build/packaging tool sbt + the project directory structure + the core code file -> packaged into a jar (see the sketch below);
                Method 2 - with an IDE:
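                    A minimal sketch of the sbt route (file names, paths, and versions here are assumptions, not a fixed recipe):

                        // src/main/scala/WordCount.scala  -- the core code file of the standalone application
                        import org.apache.spark.{SparkConf, SparkContext}

                        object WordCount {
                          def main(args: Array[String]): Unit = {
                            val conf = new SparkConf().setAppName("WordCount")   // the master URL is supplied by spark-submit
                            val sc   = new SparkContext(conf)
                            sc.textFile("input")
                              .flatMap(_.split(" "))
                              .map((_, 1))
                              .reduceByKey(_ + _)
                              .saveAsTextFile("output")
                            sc.stop()
                          }
                        }

                    With a build.sbt that declares spark-core as a dependency, "sbt package" builds the jar, which can then be run with something like: spark-submit --class WordCount --master local[*] <path-to-the-jar>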

             * Java:
                Method 1 - build manually:
                    the Java packaging tool Maven + build a Maven project -> packaged into a jar;
                Method 2 - with an IDE:
                    
                
        
        
    @Python environment:
        0 configure the pyspark-related files;
         1 shell interactive environment
            start: pyspark;
            
            
            


[ new ] 

Question 1: after transferring files to /usr/local/... on Linux, commands on them only work when prefixed with sudo;
Solution: change the owner of the files: chown -R Kouch:Kouch (the current user) ****;

 


Origin www.cnblogs.com/floakss/p/11525910.html