Big data review roadmap

01

Concurrency Toolkit review key points

  1. Blocking queues: ArrayBlockingQueue, LinkedBlockingQueue.
  2. ConcurrentHashMap and Hashtable comparison: both are thread-safe, but ConcurrentHashMap has much better concurrency and performance.
    The old implementation (JDK 7 and earlier) used a segmented-lock (bucket) mechanism with 16 segments by default.
    The new implementation (JDK 8 and later) uses CAS (Compare-And-Swap, a lock-free algorithm), and long bucket linked lists are converted into red-black trees.
  3. CountDownLatch, a countdown latch: lets one thread wait until other threads have counted the latch down to zero.
  4. Thread pools (ThreadPoolExecutor). Application scenarios of a small pool with a large queue versus a large pool with a small queue.
  5. ReentrantLock, a reentrant lock. The underlying implementation supports both fair and unfair locking.
  6. Atomic types: AtomicInteger, AtomicBoolean, AtomicLong (AtomicDouble is provided by Guava rather than the JDK).
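A small runnable sketch (not from the original notes) that exercises several of these classes together: a fixed thread pool backed by a large LinkedBlockingQueue ("small pool, large queue"), a ConcurrentHashMap with AtomicInteger counters, and a CountDownLatch that makes the main thread wait for all tasks:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrencyReviewDemo {
    public static void main(String[] args) throws InterruptedException {
        // "Small pool, large queue": few worker threads, many queued tasks.
        ExecutorService pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>(10_000));

        ConcurrentHashMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();
        CountDownLatch latch = new CountDownLatch(100);

        for (int i = 0; i < 100; i++) {
            final int n = i;
            pool.submit(() -> {
                // Lock-free counting: computeIfAbsent + AtomicInteger.
                counts.computeIfAbsent("key-" + (n % 5), k -> new AtomicInteger())
                      .incrementAndGet();
                latch.countDown();      // this task is done
            });
        }

        latch.await();                  // main thread waits until the count reaches zero
        System.out.println(counts);     // each of the 5 keys should show 20
        pool.shutdown();
    }
}
```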

Zookeeper review key points

  1. Understand the application scenarios of Zookeeper.
    • Cluster management: for example, when a server in the cluster goes down, the rest of the cluster can find out about it.

When a client (a server in the cluster) starts, it registers a temporary (ephemeral) node with Zookeeper. When that client goes down, its temporary node is deleted. Zookeeper watches for the node-deleted event and so learns about the server's status change.
In summary, Zookeeper manages the cluster through: temporary nodes + the watch (monitoring) mechanism.

  2. Naming service
    A naming service must guarantee that names are unique, so the uniqueness of Zookeeper paths can be used to meet this requirement.
    For example, server 1: /server/01
    server 2: /server/02
  3. Coordination and notification service
  4. Distributed lock service (a lock sketch follows this list)

Using sequential nodes, determine which client grabbed the lock first and grant it the lock.
To implement a fair lock: grant the lock in ascending order of the sequence numbers.
To implement an unfair lock: when the lock is released, all waiting clients compete for it again.

  5. Information publish/subscribe service
    Clients watch a node for data changes and are notified when the data is updated.
  • In summary, Zookeeper can provide many such coordination services; the common recipe is: the appropriate node type + the watch mechanism.
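Below is a minimal sketch of the fair-lock idea above, using ephemeral sequential nodes with the plain ZooKeeper Java API. The /lock parent path is assumed to already exist, and a production implementation would watch the predecessor node instead of polling:

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkFairLockSketch {
    private final ZooKeeper zk;
    private String myNode;          // e.g. /lock/seq-0000000003

    public ZkFairLockSketch(ZooKeeper zk) { this.zk = zk; }

    public void lock() throws Exception {
        // Ephemeral + sequential: the node disappears if this client dies,
        // and the sequence number records who asked first.
        myNode = zk.create("/lock/seq-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        while (true) {
            List<String> children = zk.getChildren("/lock", false);
            Collections.sort(children);
            if (("/lock/" + children.get(0)).equals(myNode)) {
                return;             // fair lock: lowest sequence number holds the lock
            }
            // Simplified: poll again instead of watching the node just in front of us.
            Thread.sleep(100);
        }
    }

    public void unlock() throws Exception {
        zk.delete(myNode, -1);      // releasing the lock lets the next waiter proceed
    }
}
```

The unfair variant would simply let every waiting client compete again whenever the lock is released, instead of following the sequence order.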

Zookeeper node types

  • create /park
    ordinary persistent node
  • create -e /park
    temporary node
  • create -s /park
    sequential node
  • create -e -s /park
    temporary sequence node
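For reference, the same four node types map to the CreateMode enum in the ZooKeeper Java API (a sketch; the connection string and paths are placeholders):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NodeTypesSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });

        // create /park          -> persistent node
        zk.create("/park1", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // create -e /park       -> temporary (ephemeral) node
        zk.create("/park2", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // create -s /park       -> persistent sequential node
        zk.create("/park3-", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        // create -e -s /park    -> temporary (ephemeral) sequential node
        zk.create("/park4-", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        zk.close();
    }
}
```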

Review Zookeeper's election mechanism

During the election, votes are compared by Zxid first: the node with the largest transaction id wins; if the Zxids are equal, the larger server id wins. A node becomes leader once it receives more than half of the votes.
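A toy illustration (not ZooKeeper's actual code) of the comparison and majority rules described above:

```java
import java.util.List;

public class LeaderElectionSketch {
    // A vote carries the candidate's latest zxid and its server id.
    record Vote(long zxid, long serverId) { }

    // Vote comparison: the larger zxid wins; if zxids are equal, the larger server id wins.
    static Vote better(Vote a, Vote b) {
        if (a.zxid() != b.zxid()) return a.zxid() > b.zxid() ? a : b;
        return a.serverId() > b.serverId() ? a : b;
    }

    // A candidate becomes leader only when more than half of the ensemble votes for it.
    static boolean isElected(long votesForCandidate, int ensembleSize) {
        return votesForCandidate > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Vote winner = List.of(new Vote(10, 1), new Vote(12, 2), new Vote(12, 3))
                .stream().reduce(LeaderElectionSketch::better).orElseThrow();
        System.out.println(winner);            // Vote[zxid=12, serverId=3]
        System.out.println(isElected(3, 5));   // true: 3 of 5 is a majority
    }
}
```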

  1. Distributed data consistency and related algorithms
  • 2PC: the two-phase commit protocol
  • 3PC: the three-phase commit protocol
  • Paxos: a majority-vote (quorum) consensus algorithm that appeared in a Google paper;
    Zookeeper is an open-source implementation based on the ideas of that paper.

02

Zookeeper focus

  • ZAB protocol, Zookeeper Atomic Broadcast, the atomic broadcast protocol.

    • It achieves distributed data consistency and supports crash recovery.
    • Broadcast (transaction) flow:
      1. After the Leader receives a transaction request, it assigns the transaction a globally increasing transaction id.
      2. The Leader writes the transaction into its local transaction log, then sends the transaction to every Follower.
      3. Each Follower executes the transaction after receiving it. (Executing the transaction means writing it to the local transaction log: if the write succeeds, the transaction executed successfully; if the write fails, the transaction failed.)
      4. The Leader collects the feedback; if more than half of the nodes succeeded, it issues a commit instruction. (Committing a transaction means applying the change to in-memory data; clients read data from Zookeeper's memory.)
    • Crash recovery flow:
      1. When a Follower restarts, it finds the largest transaction id it owns and sends it to the Leader.
      2. The Leader uses it to determine the synchronization point, creates a transaction queue, and puts the missing transactions into the queue for synchronization.
      3. The Follower replays those transactions to recover. Until recovery is complete, it does not serve reads, to avoid dirty reads.
  • What mechanism does Zookeeper use to prevent split-brain?
    Split-brain means the cluster contains more than one leader at the same time, which would make transaction synchronization inconsistent.
    Zookeeper prevents it with the epoch mechanism: every time a new leader is elected, the epoch is incremented by 1, and Followers only accept transactions from the leader with the largest epoch (a small sketch follows).
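A toy model (not ZooKeeper's real code) of the two rules above: a follower rejects proposals from a stale epoch, and the leader commits only after acknowledgements from more than half of the ensemble (a 5-node ensemble is assumed):

```java
public class ZabSketch {
    static final int ENSEMBLE_SIZE = 5;   // assumed 5-node ensemble

    // Follower side: ignore proposals from an old leader (split-brain protection).
    static boolean acceptProposal(long proposalEpoch, long currentEpoch) {
        return proposalEpoch >= currentEpoch;
    }

    // Leader side: commit once more than half of the servers have logged the transaction.
    static boolean canCommit(int ackCount) {
        return ackCount > ENSEMBLE_SIZE / 2;
    }

    public static void main(String[] args) {
        System.out.println(acceptProposal(3, 4));  // false: proposal from a stale leader (epoch 3)
        System.out.println(canCommit(3));          // true: 3 of 5 is a majority
    }
}
```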

Hadoop

The advantages of Hadoop: cheap and efficient.

Recommended books

  1. Zookeeper: "From Paxos to Zookeeper: Principles and Practice of Distributed Consistency", see Chapter 4
  2. HDFS: "In-depth Analysis of Hadoop HDFS", see Chapter 1
  3. YARN: "Hadoop Technology Insider: In-depth Analysis of YARN Architecture Design and Implementation Principles", see Chapter 2

Important components of MapReduce

  • Mapper component
  • Reducer component
  • Partitioner component (Hadoop uses HashPartitioner by default, which partitions by the hash of the Mapper output key, so identical Mapper output keys fall into the same partition). Each partition corresponds to one ReduceTask (see the sketch after this list).


  • Combiner component (merges on the Map side first, then sends the merged output to the ReduceTask)
    Functions:
    1. Reduce the aggregation load on the ReduceTask
    2. Reduce the amount of data transferred over the network and save bandwidth
  • InputFormat component. Controls the input key and input value handed to the Mapper.
  • OutputFormat component. Controls the format in which results are written to the output file.
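The default HashPartitioner simply takes the hash of the key modulo the number of ReduceTasks; a custom partitioner with the same behaviour would look like this sketch (the Text/IntWritable types are just an example):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same idea as the default HashPartitioner: identical keys always map to the
// same partition, and each partition is consumed by exactly one ReduceTask.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

It is registered on the job with job.setPartitionerClass(WordPartitioner.class), and job.setNumReduceTasks(...) controls how many partitions/ReduceTasks there are.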

Number of tasks for MapTask and ReduceTask

  1. The number of MapTasks = the number of input splits of the job (note: a split is not the same thing as a block).
    An input split (InputSplit) is a logical slice, described by an object containing the file path, start offset and length, so a split holds no actual file data.
    A block is the physical cut: when a file is uploaded to HDFS, it is physically cut into blocks that are stored on the DataNodes.
  2. The number of ReduceTasks has nothing to do with the splits. The default is one; it is set in code, e.g. job.setNumReduceTasks(3).
  3. Sorting in MapReduce
    MR sorts the Mapper output; the exact ordering is determined by the compare method of the Mapper output key type.
  4. MR job execution process
    • When a MapTask is assigned, the scheduler tries to satisfy data locality, so that data does not have to be moved over the network, saving bandwidth.
  5. MapReduce Shuffle process (emphasis)
  6. A MapReduce join may cause data skew; how to solve it.
    A reduce-side join sends both tables through the shuffle to the ReduceTasks, and this join method may cause data skew.
    Therefore, a map-side join can be used instead: load the small table into a cache in every MapTask and complete the join inside the MapTask, which avoids the data skew (see the sketch after this list).
  7. Yarn (emphasis)
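A minimal sketch of the map-side join idea referenced in point 6: the small table is loaded into a HashMap in setup(), and each record of the big table is joined in map() without shuffling the small table. The HDFS path and the field layout (comma-separated, id in the first column) are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Load the small table once per MapTask (path is a placeholder).
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/dim/small_table.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");        // assumed layout: id,name
                smallTable.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");    // assumed layout: id,orderAmount
        String name = smallTable.get(fields[0]);           // join happens in the mapper
        if (name != null) {
            context.write(new Text(fields[0] + "," + name + "," + fields[1]), NullWritable.get());
        }
    }
}
```

Because the join happens entirely in the mappers, such a job can even run with zero ReduceTasks (job.setNumReduceTasks(0)).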

Yarn's three schedulers

  • FIFO scheduler (first-come, first-served): all cluster resources go to the first job (job1); job2 runs only after job1 has finished.
    Application scenario: a high-priority job that you want to finish as quickly as possible.
  • Fair scheduler: if the cluster runs only job1, it gets all of the memory, e.g. 16 GB.
    If job2 is then submitted, each of the two jobs gets 8 GB,
    and so on: resources are divided evenly among the running jobs.
  • Capacity scheduler (the default): introduces the concept of a Container to encapsulate resources.


How MapReduce handles small files


  1. Enable the JVM reuse mechanism

  2. Combine multiple splits into one split or a small number of splits: job.setInputFormatClass(CombineTextInputFormat.class);
    the default combined split size is 128 MB (a driver sketch follows).
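A driver-side sketch of this small-file handling (job name and paths are placeholders; setMaxInputSplitSize is the setter inherited from FileInputFormat, so verify it against your Hadoop version):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-files-job");
        job.setJarByClass(SmallFilesJobDriver.class);

        // Pack many small files into a few logical splits instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Upper bound for one combined split: 128 MB (the default size mentioned above).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // Mapper/Reducer settings omitted; they are unchanged by the input format choice.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```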

Flume

Core components

  1. Source

    • Avro (important and commonly used)
    • Http
    • NetCat
    • SpoolDir (local directory data source)
    • Exec
  2. Channel

    • Memory
    • File
    • Jdbc
    • Spillable Memory channel (a memory channel that can overflow to disk)
  3. Sink

    • Hdfs
    • Logger
    • Avro
    • Kafka
    • File_roll Sink
    • HiveSink
  4. Interceptors (bound to a Source)

    • TimeStamp
    • Host
    • Static
    • UUID
    • Regex_replace
  5. Selector

    • replicating (replication) selector
    • multiplexing (routing) selector
  6. Sink processors

    • Failure recovery
    • Load balancing

The concept of data warehouse

  • OLTP: online transaction processing (databases)
  • OLAP: online analytical processing (data warehouses)
  • ETL (Extract-Transform-Load): data extraction, transformation, and loading
  • Hive table type:
    • Internal table, external table
    • Both internal and external tables can be partitioned tables.
    • Partition-related statements: create a partitioned table with PARTITIONED BY; add partition information with ADD PARTITION; delete a partition with DROP PARTITION; rename a partition with RENAME TO.
    • Bucket table

HIVE

  1. Commonly used string manipulation functions in Hive
  2. UDF
  3. Hive's various join operations
    left semi join
  4. How Hive solves the problem of data skew
  • A group by operation may cause data skew, for example the final aggregation in the word-count case. The idea is to split one MR job into two jobs: the first job partitions the data randomly, and the second job does the final grouping and aggregation. Enable this with set hive.groupby.skewindata=true.
  • A join operation may cause data skew. Ideas: for a small-table/big-table join, use a map-side join and write the small table on the left; for a big-table/big-table join, use bucket tables.
  • A global count may cause data skew. Solution: set hive.groupby.skewindata=true
  5. HIVE tuning
  6. Sqoop
  7. HIVE JDBC (a JDBC sketch follows this list)
  8. HIVE architecture
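A minimal HiveServer2 JDBC sketch (host, port, database, table and credentials are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver and URL; host/port/database are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver2-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT word, count(*) FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```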


Origin: blog.csdn.net/yasuofenglei/article/details/101026195