Getting Started with Big Data

1. Typical high-performance computer software stack

2. Characteristics of big data processing platforms and applications

MPI requires all resources to remain available for normal operation, which makes fault tolerance difficult, and it is efficient only when the system is homogeneous.

Big data platforms must run on cheap hardware, so the software must provide automatic fault tolerance and automatic load balancing; that is, it must support scalability.


1. MapReduce programming model

a. Borrows concepts from functional programming languages

b. Users only need to write serial Map and Reduce functions:

map(in_key, in_value) ->
    (out_key, intermediate_value) list

reduce(out_key, intermediate_value list) ->
    out_value list
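
As a concrete illustration, here is a minimal serial word-count sketch in Scala that follows these two signatures (the toy driver standing in for the framework's grouping/shuffle phase is an assumption for illustration, not part of a real MapReduce runtime):

    // Map: emit (word, 1) for every word in a document.
    def map(inKey: String, inValue: String): List[(String, Int)] =
      inValue.split("\\s+").filter(_.nonEmpty).toList.map(word => (word, 1))

    // Reduce: sum the counts collected for one word.
    def reduce(outKey: String, values: List[Int]): List[Int] =
      List(values.sum)

    // Toy driver standing in for the framework's grouping/shuffle phase.
    val docs    = List(("doc1", "big data big compute"), ("doc2", "big graphs"))
    val grouped = docs.flatMap { case (k, v) => map(k, v) }.groupBy(_._1)
    val counts  = grouped.map { case (w, ps) => (w, reduce(w, ps.map(_._2)).head) }
    // counts: Map(big -> 3, data -> 1, compute -> 1, graphs -> 1)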

c. Fault tolerance is easier to achieve, since a failed task can simply be recomputed; load balancing and support for heterogeneous systems are also easy to achieve

d. The mainstream big data platform: the Apache Hadoop ecosystem

e. MapReduce transfers data between stages through files, resulting in poor performance due to data replication, serialization, and disk I/O


        If the data can be kept in memory, processing is 10-100x faster than a disk-based solution.
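
This observation is what motivates the in-memory platform introduced next. A minimal sketch, assuming an existing SparkContext `sc`:

    // Keep a dataset in memory across repeated uses (assumes a SparkContext `sc`).
    val nums   = sc.parallelize(1 to 10000000)
    val scaled = nums.map(_ * 2).cache()   // mark for in-memory caching
    scaled.count()   // first action computes the data and caches it in memory
    scaled.sum()     // later actions read the cached data instead of recomputing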

 

2. In-memory big data analytics platform - Spark


1). Spark's memory abstraction

a. RDD (Resilient Distributed Datasets)

        - Based on collections of data, not individual data items

        - Produced by deterministic coarse-grained operations (map, filter, join, etc.)

        - Once generated, the data cannot be modified (immutable)

        - To modify data, you must transform an existing dataset to generate a new one (see the sketch below)
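
A minimal sketch of these properties, assuming a running SparkContext `sc` (the file name is a placeholder): each transformation yields a new RDD and leaves its parent untouched.

    // Every coarse-grained operation derives a new, immutable RDD.
    val lines  = sc.textFile("events.log")             // RDD[String] (placeholder file)
    val errors = lines.filter(_.contains("ERROR"))     // new RDD; `lines` is unchanged
    val pairs  = errors.map(line => (line.split("\t")(0), 1))  // yet another new RDD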

b. RDD supports efficient fault tolerance

        Because a dataset is generated deterministically and never changes afterwards:

        - Lost data can be recovered by recomputation

        - Only the lineage (the generation process of the RDD) needs to be remembered, so one log entry covers a whole dataset, and the overhead is negligible when no failure occurs
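
Continuing the sketch above: Spark's `toDebugString` prints an RDD's lineage, which is exactly the "generation process" that must be remembered for recovery (the output shown is approximate and varies by Spark version):

    // The lineage is the fault-tolerance "log": one entry per transformation.
    println(pairs.toDebugString)
    // (2) MapPartitionsRDD[3] at map ...
    //  |  MapPartitionsRDD[2] at filter ...
    //  |  events.log MapPartitionsRDD[1] at textFile ...
    // A lost partition of `pairs` is rebuilt by replaying filter and map
    // on the corresponding partition of the input.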



2). Limitations of Spark - data model level

        Big data applications: often need partial (fine-grained) data updates

        Spark: provides only read-only data objects


        Because Spark's data model is built on coarse-grained, read-only RDDs, every fine-grained update requires an RDD transformation, which copies a large amount of data and results in low processing efficiency.
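
For instance (an illustrative sketch reusing the assumed `pairs` RDD from above; the key "host42" is hypothetical): updating the count of a single key still scans and rewrites the whole dataset, because the only way to "modify" an RDD is to derive a new one.

    // "Updating" one record on an RDD still touches every record.
    val updated = pairs.map { case (key, count) =>
      if (key == "host42") (key, count + 1)  // the one logical update we wanted
      else (key, count)                      // every other record is copied through
    }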

 

3). Limitations of Spark - implementation level

a. Spark is written in Scala and runs on the JVM

b. The in-memory data representation is redundant and occupies a large amount of memory

c. Memory allocation and garbage collection incur large overheads

3. Design philosophy of big data systems



4. Problems

        For a big data system, should performance or scalability come first?

1. The importance of performance

a. Why is fault tolerance important?

        Many nodes and long runtimes make failures likely

b. If performance can be significantly improved

        Fewer nodes (possibly just one) and shorter runtimes

        More expensive fault-tolerance techniques, such as checkpointing, become affordable because failures are less likely

A performance-first big data platform can have both performance and fault tolerance (provided performance is much higher).
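
A minimal sketch of this trade-off in Spark, assuming an existing SparkContext `sc` (the checkpoint directory is a placeholder): checkpointing writes the dataset to stable storage, which is expensive per run, but it truncates the lineage so recovery no longer replays the whole computation.

    sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder directory
    val data      = sc.parallelize(1 to 1000000)
    val processed = data.map(_ * 2).filter(_ % 3 == 0)
    processed.checkpoint()  // mark for materialization to stable storage
    processed.count()       // the first action triggers the (expensive) checkpoint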

 

2. Many big data problems are limited in size

a. Population

        About 10 billion people; a social network over them is on the order of 10 TB (roughly 10^10 vertices times ~100 edges each times ~10 bytes per edge)

b. Number of products

        1,000,000

c. Moore's Law is driving exponential growth in computing power, memory size, and I/O bandwidth

Today's limited-scale big data problems will become tomorrow's small data problems.

 

3. The importance of graph data

        Graphs can express rich data and relationships

        - Internet connections

        - Web links

        - Social relationships

        - Protein interactions

        - People and people, people and companies, people and products

4. Graph computation and analysis

        - PageRank

        - Shortest paths

        - Connected components

        - Maximal independent set

        - Minimum spanning tree

        - Bayesian belief propagation

5. A programming model based on graph abstraction

        - The basic elements of a graph are vertices and edges

        - The most intuitive programming abstraction is therefore also based on vertices and edges

                - Computation happens on the vertices

                - Communication happens along the edges (see the sketch below)
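
A minimal vertex-centric sketch of this "compute on vertices, communicate along edges" abstraction, written against Spark GraphX's Pregel-style API (the max-value-propagation task is chosen for illustration and is not from the source):

    import org.apache.spark.graphx._

    // Each vertex keeps the largest value it has seen; values travel along edges.
    def propagateMax(graph: Graph[Long, Int]): Graph[Long, Int] =
      graph.pregel(Long.MinValue)(
        // computation on the vertices: keep the larger of own value and message
        (id, value, msg) => math.max(value, msg),
        // communication along the edges: forward a value if it improves the target
        triplet =>
          if (triplet.srcAttr > triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr))
          else Iterator.empty,
        // merge concurrent messages arriving at the same vertex
        (a, b) => math.max(a, b)
      )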



5. Graph computing - a middle-ground big data analytics platform


        GraphLab is 10 times faster than Spark on some tasks.

 

6. Problems with GraphLab

1. Large memory usage

        The required memory is more than 10x the size of the original data, so too many compute nodes are needed, which introduces unnecessary communication and parallelization overhead

2. Poor locality of memory access

        Computing performance is low: on small graphs, an 8-machine cluster performs worse than a single machine.

7. GridGraph, a performance-first big data system developed at Tsinghua University

1. Data model: readable and writable data

2. Data structure: based on two-dimensional (grid) partitioning, with good locality and compactness (see the sketch after this list)

3. Programming abstraction: based on collections of vertices and edges, supporting both single-machine and multi-machine execution

4. Execution platforms: single-machine in-memory -> out-of-core -> distributed system

5. Performance optimizations: NUMA awareness, scheduling, adaptive data formats, etc.

The approach: first optimize single-machine performance from the perspective of locality and scheduling, then extend to multiple machines.
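
As a rough illustration of the two-dimensional partitioning idea (a sketch under assumed names, not GridGraph's actual code): the vertices are split into P chunks, and each edge is routed to one of P x P blocks according to the chunks of its endpoints, so processing one block touches only two small vertex ranges, which is where the locality comes from.

    // Illustrative 2-D grid partitioning (names and layout are assumptions).
    final case class Edge(src: Long, dst: Long)

    // Route an edge to one of p x p blocks by the chunk of each endpoint.
    def gridBlock(e: Edge, numVertices: Long, p: Int): (Int, Int) = {
      val chunkSize = (numVertices + p - 1) / p   // vertices per chunk (ceiling)
      ((e.src / chunkSize).toInt, (e.dst / chunkSize).toInt)
    }

    // Example: 8 vertices in a 2 x 2 grid; edge (1, 6) lands in block (0, 1),
    // so processing that block reads vertices 0-3 and updates vertices 4-7 only.
    val block = gridBlock(Edge(1, 6), numVertices = 8, p = 2)   // (0, 1)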

6. Multi-machine extension and optimization

        - Sparse/dense dual computation engines (see the sketch below)

        - Chunk-based graph partitioning mechanism

        - Co-scheduling of communication and computation
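
A hedged sketch of the sparse/dense dual-engine idea, in the spirit of direction-switching graph engines (the threshold and names are assumptions, not from the source): when few vertices are active, push updates along outgoing edges; when most are active, switch to a dense pull along incoming edges.

    // Illustrative mode selection for a sparse/dense dual engine.
    def chooseMode(activeEdges: Long, totalEdges: Long): String =
      if (activeEdges < totalEdges / 20) "sparse: push along out-edges"
      else "dense: pull along in-edges"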


 

Shared by a professor from Tsinghua University.
