Spark study notes, chapter 4: the PySpark modules and their basic principles (1)

PySpark is the Python API for Spark; it provides the interfaces for writing and submitting data-processing jobs in Python. PySpark can be roughly divided into five main modules:

  1. The pyspark module. This is the most fundamental of the modules; it implements the most basic APIs for writing Spark jobs. It contains the following:

    SparkContext: the main entry point for writing Spark programs
    RDD: the resilient distributed dataset, the most important abstraction inside Spark
    Broadcast: a broadcast variable reused across the individual tasks
    Accumulator: an add-only accumulator; every task can add to it, and the additions are finally aggregated globally
    SparkConf: a configuration object for Spark settings such as resources, number of cores, and submit mode
    SparkFiles: an API for accessing files
    StorageLevel: provides fine-grained cache and persistence levels for data
    TaskContext: an experimental API for obtaining context information about the running task
    
  2. The pyspark.sql module. This is a higher-level module built on top of the RDD architecture; it provides SQL support and includes the following:

     SparkSession: the main entry point of Spark SQL; internally it still calls SparkContext
     DataFrame: a distributed, structured dataset; its computations are ultimately translated into computations on RDDs
     Column: a column in a DataFrame
     Row: a row in a DataFrame
     GroupedData: methods for aggregating data
     DataFrameNaFunctions: methods for handling missing data
     DataFrameStatFunctions: methods for computing statistics
     functions: built-in functions usable on DataFrames
     types: the available data types
     Window: support for window functions
    
  3. The pyspark.streaming module. This module processes streaming data, received either directly over the network or from external messaging middleware such as Flume or Kafka, for real-time stream processing. The received data is wrapped into DStreams, and internally a DStream is in fact a sequence of RDDs. PySpark's support for streaming data is not complete; the native Scala and Java APIs are better. This chapter will still cover the most important principles, though. This section contains the following:

     - the principle and process of receiving data
     - receiving data over the network
     - receiving data from Kafka
    
  4. The pyspark.ml module. This module is for machine learning; it implements many machine learning algorithms, covering classification, regression, clustering, and recommendation. These are the most important machine learning algorithms we will cover. pyspark.ml has now become the main machine learning module, and its internal implementation is based on the DataFrame.

  5. The pyspark.mllib module. This module also does machine learning, but underneath it uses RDDs, which leave less room for performance optimization, so the newest machine learning algorithm APIs are now implemented on top of DataFrames. This module still contains many useful machine learning algorithms, though, and is worth exploring where appropriate.


Origin blog.csdn.net/weixin_41521681/article/details/104742857