[2020/1/24] Winter self-learning progress report 5

  This is my fifth progress report. I'm putting together the RDD programming and SparkSQL work I hammered out over these past few days (not going home for New Year's Eve is somewhat boring).


  In this one I'd like to go over, in general terms, all the areas I don't yet understand.

  First up is the RDD.

  As a distributed data structure, the RDD feels abstract to me. Even though all of its operations follow a functional-programming style, it's hard to truly appreciate what RDD transformations and actions do with the data, so I studied how RDDs are constructed and operated on in order to better understand the role the RDD plays in Spark.

  RDD (Resilient Distributed Dataset) means "resilient distributed dataset".

  RDD features:

  1. It is a read-only, partitioned collection of records;
  2. It is a collection with fault tolerance;
  3. It can only be created through deterministic operations (transformations) on data in stable storage or on other RDDs;
  4. It can be distributed across the nodes of a cluster, and supports functional-style set operations and various parallel operations.
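The features above can be sketched with a toy, in-process model. This is NOT real Spark code, just a minimal illustration of read-only partitions, new RDDs being created only by deterministic transformations, and per-partition functional operations; all names here are made up.

```python
# A toy model of an RDD -- not real Spark internals.
class ToyRDD:
    def __init__(self, partitions):
        # Partitions are stored as tuples, i.e. read-only records (feature 1).
        self.partitions = [tuple(p) for p in partitions]

    def map(self, f):
        # A transformation: a deterministic operation that builds a NEW
        # ToyRDD from this one instead of mutating it (feature 3).
        return ToyRDD([[f(x) for x in part] for part in self.partitions])

    def collect(self):
        # An action: gather every partition's records into one list.
        # In real Spark each partition could be processed on a
        # different node in parallel (feature 4).
        return [x for part in self.partitions for x in part]

rdd = ToyRDD([[1, 2], [3, 4], [5, 6]])   # 3 partitions
doubled = rdd.map(lambda x: x * 2)       # original rdd is untouched
print(rdd.collect())      # [1, 2, 3, 4, 5, 6]
print(doubled.collect())  # [2, 4, 6, 8, 10, 12]
```

Note that `map` leaves the original `rdd` unchanged; that immutability is what makes recomputation-based fault tolerance possible.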

  The name also points to another characteristic of the RDD: resilience ("elasticity").

    1. Efficient Lineage-based fault tolerance (if the nth step fails, it recovers from the (n-1)th step: "lineage" fault tolerance);
    2. If a Task fails, it is automatically retried a certain number of times (4 by default);
    3. If a Stage fails, it is also automatically retried a certain number of times (only the failed stage of the computation is rerun, and only the failed data shards are recomputed);
    4. Flexible data scheduling: the DAG, the Tasks, and resource management are independent of one another;
    5. Checkpointing (checkpoint);
    6. Automatic switching of data between memory and disk storage.
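Point 1 above, lineage fault tolerance, can be sketched as follows. This is a toy model, not Spark's actual implementation: each RDD just remembers its parent and the transformation that produced it, so a lost partition can be recomputed from the parent instead of being restored from a replica.

```python
# A toy sketch of Lineage-based fault tolerance -- not Spark internals.
class LineageRDD:
    def __init__(self, partitions, parent=None, op=None):
        self.partitions = partitions  # an entry of None means "lost"
        self.parent = parent          # lineage: where the data came from
        self.op = op                  # lineage: how it was derived

    def map(self, f):
        kids = [[f(x) for x in p] for p in self.partitions]
        return LineageRDD(kids, parent=self, op=f)

    def get_partition(self, i):
        if self.partitions[i] is None:          # partition lost on some node
            src = self.parent.get_partition(i)  # walk back along the lineage
            self.partitions[i] = [self.op(x) for x in src]  # recompute
        return self.partitions[i]

base = LineageRDD([[1, 2], [3, 4]])
mapped = base.map(lambda x: x + 10)
mapped.partitions[1] = None            # simulate losing partition 1
print(mapped.get_partition(1))         # recovered by recomputation: [13, 14]
```

Only the lost partition is recomputed; partition 0 is never touched, which mirrors point 3 (only failed data shards are recalculated).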

  After a general understanding of the features, we need to understand the structure (i.e., how those features come about).

  The structure diagram below, taken from the web, shows the relationships and attributes of an RDD:

 

  It is worth mentioning that an RDD does not store data internally.

  An RDD is only an abstract dataset; its internal partitions do not store the actual data. The Partition class contains an index member that indicates the partition's number within the RDD. The RDD's number plus the partition number uniquely determines the block number of that partition's data, and through the interface provided by the underlying data-storage layer, the corresponding data can be fetched from the storage medium (such as HDFS or memory).
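That lookup idea can be sketched like this. Again a toy model, not Spark's real block manager: a Partition holds only an index, the pair (RDD number, partition number) forms a block id, and the actual records live in a separate storage layer.

```python
# Toy sketch: partitions are metadata; data is fetched by block id.
class Partition:
    def __init__(self, index):
        self.index = index            # only a partition number, no data

class StorageLayer:
    """Stand-in for the underlying storage (e.g. HDFS or memory)."""
    def __init__(self):
        self.blocks = {}
    def put(self, rdd_id, part_index, records):
        # (rdd_id, part_index) uniquely identifies the block.
        self.blocks[(rdd_id, part_index)] = records
    def get(self, rdd_id, part_index):
        return self.blocks[(rdd_id, part_index)]

storage = StorageLayer()
storage.put(rdd_id=0, part_index=0, records=["a", "b"])
storage.put(rdd_id=0, part_index=1, records=["c", "d"])

parts = [Partition(0), Partition(1)]   # the RDD holds these, not the data
print(storage.get(0, parts[1].index))  # ['c', 'd'], fetched by block id
```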

  Next up is the DataFrame in SparkSQL.

  SparkSQL serves as the data warehouse within Spark; it can even be faster than plain Spark code and handle higher computational complexity, which means you can even use its libraries directly for sophisticated machine-learning algorithms and graph computation (this also shows that the DataFrame resembles some of the concepts in Python's deep-learning/data-analysis world).

  From a structural point of view, the biggest difference of Spark's DataFrame is that it is inherently distributed. You can simply think of a DataFrame as a distributed table, of the following form:

Name   | Age | Tel
-------+-----+-----
String | Int | Long
String | Int | Long
String | Int | Long
...    | ... | ...
String | Int | Long
String | Int | Long
String | Int | Long

  An RDD, by contrast, is shaped as follows:

Person
Person
Person
...
Person
Person
Person

  A DataFrame is a distributed data container that looks more like a two-dimensional table in a traditional database: besides the data itself, it also keeps track of the data's structural information, i.e. the schema. Like Hive, the DataFrame also supports nested data types (struct, array, and map). From the standpoint of API usability, the DataFrame API provides a set of high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API.
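The two shapes above can be contrasted with a small sketch. This is not real SparkSQL: the "RDD" side is just a list of opaque Person records, while the "DataFrame" side also carries a schema, so a relational-style filter can refer to columns by name; all names here are illustrative.

```python
# Toy contrast: records-of-objects (RDD style) vs rows-plus-schema
# (DataFrame style). Not real Spark code.
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    tel: int

people = [Person("Ann", 35, 1001), Person("Bob", 17, 1002)]

# RDD style: functional code that must know the Person class's fields.
adults_rdd = [p for p in people if p.age >= 18]

# DataFrame style: rows plus a schema describing each column's name/type,
# so the filter can be expressed against the column name, like SQL's
# "WHERE Age >= 18".
schema = [("Name", str), ("Age", int), ("Tel", int)]
rows = [(p.name, p.age, p.tel) for p in people]
age_col = [name for name, _ in schema].index("Age")
adults_df = [r for r in rows if r[age_col] >= 18]

print([p.name for p in adults_rdd])  # ['Ann']
print(adults_df)                     # [('Ann', 35, 1001)]
```

The point is that the schema lets the engine (and the user) see inside each row, which is exactly what makes the relational API higher-level than the functional one.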

Origin www.cnblogs.com/limitCM/p/12232468.html