Big Data: Frequently Asked Questions and Answers

Linux + Shell

Commonly used advanced commands

top, iotop, df -h, grep, sed, awk, netstat, halt, ps

View process, view port number, view disk usage

View processes: top, ps; view port numbers: netstat; view disk usage: df -h

Common tools

sed, awk, cut, sort

What shell scripts have you written?

  • Cluster start and stop scripts (zk.sh, kf.sh, xcall.sh, myjps.sh, xsync.sh)
#!/bin/bash
case $1 in
"start")
    for i in hadoop102 hadoop103 hadoop104
    do
        # replace "absolute/path" with the service's absolute start-script path
        ssh $i "absolute/path start"
    done
;;
"stop")
    for i in hadoop102 hadoop103 hadoop104
    do
        ssh $i "absolute/path stop"
    done
;;
esac
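Each script takes the action as its first argument, e.g. zk.sh start or zk.sh stop.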
  • Data warehouse layer-to-layer import scripts: ods -> dwd -> dws, 5 steps in all
#!/bin/bash
# Define variables
hive=/opt/module/hive/bin/hive
APP=gmall

# Define the date: take the first argument if given, otherwise default to yesterday
if [ -n "$1" ]; then
    do_date=$1
else
    do_date=$(date -d '-1 day' +%F)
fi

sql="
    -- prefix every table with the database name, e.g. ${APP}.table_name;
    -- wherever a date appears, substitute $do_date
"

$hive -e "$sql"
  • Data warehouse and MySQL import/export
    - mainly Sqoop
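A minimal sketch of such a Sqoop import (MySQL -> HDFS); the host, database, table, and credentials are placeholder assumptions, and do_date is defined as in the script above:

#!/bin/bash
# Import one MySQL table into HDFS (all connection details are examples only)
sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 000000 \
--table order_info \
--target-dir /origin_data/gmall/db/order_info/$do_date \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by '\t'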

Hadoop

Getting started

  • Port numbers:
    2.x: HDFS NameNode web UI 50070, YARN web UI 8088, JobHistory 19888, NameNode RPC 9000
    3.x: HDFS NameNode web UI 9870, YARN web UI 8088, JobHistory 19888, NameNode RPC 8020
  • Required configuration files
    2.x: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, the three env.sh files, slaves (no blank lines or extra spaces)
    3.x: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, the three env.sh files, workers

HDFS

  • Read and write flow (a common written-test question); be able to draw the schematic diagram
  • Small files
    - Hazard: they exhaust the NameNode's memory; 128 GB of memory holds 128 * 1024 * 1024 * 1024 bytes / 150 bytes per block object ≈ 900 million blocks. On the compute side, one file maps to one split -> one MapTask.
    - Solutions: har archiving [key point] (see the sketch after this list) or a custom InputFormat -> reduce NameNode memory usage
    - CombineTextInputFormat -> fewer splits -> fewer MapTasks
    - JVM reuse: enable it
  • Block size (128 MB by default), number of replicas (3 by default)
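A minimal sketch of the har archiving mentioned above; the paths are placeholders:

# Pack the small files under /user/input into a single HAR archive
hadoop archive -archiveName input.har -p /user/input /user/output
# Read the archive back through the har:// scheme
hadoop fs -ls har:///user/output/input.har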

MapReduce [shuffle + optimization]

Shuffle is everything after the map method and before the reduce method.
On a machine with 128 GB of memory, give the NodeManager about 100 GB.
For 128 MB of data (one block), allocate about 1 GB of task memory; an illustrative configuration follows below.
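Expressed as configuration, the rule of thumb above might look like this; the property names are standard Hadoop ones, but the values are illustrative assumptions, not defaults:

# mapred-site.xml / yarn-site.xml (illustrative values only)
# mapreduce.map.memory.mb              = 1024     # ~1 GB per 128 MB split
# mapreduce.reduce.memory.mb           = 2048     # commonly ~2x the map memory
# yarn.nodemanager.resource.memory-mb  = 102400   # ~100 GB of a 128 GB machine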

Kafka

Kafka introduction

Kafka uses a topic-based produce/consume model: producers write data to a specified topic, and consumers read data from that topic.
1. To make scaling easier and improve throughput, a topic can be split into multiple partitions.
2. To match the partition design, the consumer-group concept was introduced; the consumers in a group consume partitions in parallel.
3. To improve availability, each partition gets several replicas, similar to NameNode HA.

basis

  • Cluster size:
    How many Kafka brokers to deploy: 2 * (peak production speed in MB/s * replicas / 100) + 1, e.g. 2 * (50 * 2 / 100) + 1 = 3 brokers
  • Replicas:
    2 or 3, usually 2. Purpose: more replicas improve reliability. Side effect: more disk I/O and lower network performance.
  • Stress testing:
    2 * (peak production speed * replicas / 100) + 1 = 3 brokers; peak production speed is at most 50 MB/s
  • Data volume:
    1 million daily active users, ~100 log entries per person per day: 1 million * 100 = 100 million entries/day; each entry is 0.5-2 KB, typically ~1 KB; entries per second into Kafka: 100 million / (24 * 3600 s) ≈ 1,150 entries/s, i.e. roughly 1 MB/s on average
  • When does traffic peak?
    Between 8 and 12 o'clock and on weekends; the peak is about 20x the average, i.e. ~20 MB/s, still under 50 MB/s
  • Retention time
    The default is 7 days; production typically keeps 3 days (OPPO keeps 6 hours)
  • Disk size
    100 GB/day * 2 replicas * 3 days / 0.7 (keep ~30% headroom) ≈ 857 GB, so plan for about 1 TB
  • Monitoring
    Kafka Eagle is more user-friendly than Kafka Monitor. If the interviewer's team wrote their own monitor, ask how they did it and praise it.
  • How many topics
    Enough to satisfy every consumer at the next level; one topic per data source (or per table):
    - Mini Program -> Mini Program topic
    - Official Account -> Official Account topic -> Kafka -> SparkStreaming/Flink analysis -> Kafka
    - PC website -> PC topic -> Hive DWD analysis
  • ISR:
    Decides who becomes leader when the leader goes down: every replica in the ISR queue has a chance. Old versions judged membership by both lag time and lagged message count; new versions use lag time only.
  • Number of partitions (purpose: increase parallelism)
    Start with a single partition, stress-test the producer throughput Tp and the consumer throughput Tc, set the target total throughput Tt; then number of partitions = Tt / min(Tp, Tc).
    Example: producer throughput = 20 MB/s, consumer throughput = 50 MB/s, expected throughput = 100 MB/s -> 100 / 20 = 5 partitions (see the topic-creation sketch after this list)
  • Partition allocation strategy
    Range (the default) and RoundRobin.
    Range: 10 partitions across 3 consumer threads are split as 1-4, 5-7, 8-10 (data hotspots are possible, since the first consumer gets more partitions).
    RoundRobin: all partitions are hashed and then dealt out to consumers in a polling fashion.
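A minimal sketch of creating a topic with the partition and replica counts sized above; the broker address and topic name are placeholders, and older Kafka versions take --zookeeper instead of --bootstrap-server:

# 5 partitions as computed from Tt / min(Tp, Tc); 2 replicas as discussed under "Replicas"
kafka-topics.sh --bootstrap-server hadoop102:9092 \
--create --topic pc_topic \
--partitions 5 \
--replication-factor 2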

What if Kafka goes down?

The Flume channel can buffer data for a period of time; the log servers also keep their logs for 30 days, so the data can be re-collected.

Data loss

  • ack
    0: send without waiting for a response; fastest transmission, worst reliability; almost never used in production
    1: send and wait for the leader's response; fast transmission, acceptable reliability
    -1: send and wait for leader + follower responses; slowest transmission, highest reliability
  • How to choose in production? (see the producer sketch after this list)
    0: almost never used in production
    1: for ordinary logs, where reliability requirements are not especially high; most companies choose 1
    -1: generally used for finance and other money-related data
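A minimal sketch of setting acks on the console producer; the broker and topic are placeholders, and newer Kafka versions use --bootstrap-server in place of --broker-list:

# acks=-1: wait for the leader and all in-sync followers to acknowledge
kafka-console-producer.sh --broker-list hadoop102:9092 --topic test_topic \
--producer-property acks=-1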

Data is duplicated

  • Idempotence + transactions + ack = -1
    guaranteed inside Kafka itself (transactions), or
    handled downstream of Kafka
  • Downstream: in Hive's dwd layer, SparkStreaming, or Redis,
    window the data and keep only the first record per key (group-by dedup); a consumer-side sketch follows this list
  • Idempotence: per partition, within a single session; efficient
  • Transactions: IDs are maintained globally across all of Kafka; inefficient
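On the consumer side, transactional (committed-only) reads can be sketched like this; the broker, topic, and group are placeholders:

# read_committed: skip records from aborted or still-open transactions
kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic test_topic \
--group gmall_group \
--consumer-property isolation.level=read_committed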

Data backlog

  • On the Kafka side: increase the number of partitions, e.g. from 2 to 10, and raise the downstream SparkStreaming/Flink parallelism (CPU) to match; the consumer count must grow along with the partitions
  • On the consumer side: increase how much each batch consumes to speed up consumption, e.g. batchsize 1000/s -> 2000/s or 3000/s (a lag-checking sketch follows this list)
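Before tuning, the backlog is usually confirmed by checking consumer lag; a minimal sketch, with broker address and group name as placeholders:

kafka-consumer-groups.sh --bootstrap-server hadoop102:9092 \
--describe --group gmall_group
# The LAG column shows how far each partition's consumer is behind the log-end offset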

Kafka optimization

  • The number of partitions should be an integer multiple of the number of consumers

Why is Kafka fast? Why does it read and write data so efficiently?

  • Kafka is itself a cluster, and topics are partitioned
  • Sequential read/write runs at ~600 MB/s, versus ~100 MB/s for random read/write.
    When a Kafka producer writes data to a log file, it appends to the end of the file, i.e. a sequential write. Official figures show the same disk can sustain up to 600 MB/s sequentially but only about 100 MB/s randomly; this comes down to disk mechanics: sequential writes save most of the head-seek time.
  • Zero copy
  • Zero copy

Kafka can consume data starting from a timestamp

KafkaUtil.fetchOffsetsWithTimestamp(topic, sTime, kafkaProp);
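A rough CLI equivalent of the lookup above (an older-Kafka sketch; the timestamp is in milliseconds, the broker and topic are placeholders, and newer versions use --bootstrap-server):

# Print, per partition, the offset corresponding to the given timestamp
kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list hadoop102:9092 --topic test_topic --time 1609459200000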

Do Kafka consumers pull data, or is data pushed to them?

They pull: each consumer pulls the data it wants, at its own pace

Is the data in Kafka ordered?

Ordered within a single partition; with multiple partitions, there is no ordering between partitions. If global order is required, you can also configure just one partition.

How do Kafka's followers and leader synchronize messages?

Kafka's replication is neither fully synchronous nor purely asynchronous; Kafka uses the ack mechanism. With fully synchronous replication (Ack = -1), all alive followers must replicate a message before it is considered committed, which badly hurts throughput. With asynchronous replication, followers copy from the leader asynchronously, and a message counts as committed as soon as the leader writes it to its log; in that case, if the followers all lag behind the leader and the leader suddenly dies, data is lost.
Kafka's ISR approach strikes a good balance between durability and throughput: followers copy data from the leader in batches, and the leader makes full use of sequential disk reads and sendfile (zero copy), which greatly improves replication performance; internally, writes to disk are batched, which keeps the message gap between followers and the leader small.

What should I do if a follower in Kafka's ISR falls behind?

After the leader receives data, all the followers start to synchronize it; but if one follower cannot sync with the leader because of some failure, the leader would have to wait for it before sending an ack. How is this solved?
The leader maintains a dynamic in-sync replica set (ISR), the set of followers that are in sync with the leader. Once the followers in the ISR finish syncing the data, the leader sends the ack to the producer. If a follower fails to sync with the leader for too long, it is kicked out of the ISR; the time threshold is set by the replica.lag.time.max.ms parameter (default 10 seconds). When the leader fails, a new leader is elected from within the ISR.

Ack settings

0: the producer does not wait for the broker's ack. This gives the lowest latency: the broker returns as soon as it receives the message, before it has been written to disk. If the broker fails, data may be lost.
1: the producer waits for the broker's ack, which is returned once the partition leader has successfully persisted the message. If the leader fails before the followers finish syncing, data is lost.
-1 (all): the producer waits for the broker's ack, which is returned only after the partition's leader and all its followers have successfully persisted the message. If the leader fails after follower sync completes but before the broker sends the ack, the producer's retry causes data duplication. [After a failed send, the producer retries.]

Source: blog.csdn.net/weixin_32265569/article/details/108460384