A College Student's Big Data Interview Write-Up (Offer Received, Detailed Answers Attached)

Disclaimer: This is an original post by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this statement when reposting.
Original link: https://blog.csdn.net/a934079371/article/details/102665832


A student in our study group previously wrote up his first-round Alibaba big data interview: "A College Student's Alibaba Big Data First Round Is Over (with Detailed Answers)".

He has now received another big data offer. He is a student at a vocational college in Gansu Province, in the rather unusual police-dog handling major, but he works extremely hard and takes things seriously; we can all learn from him!


He sent me back the following answers:


First round

2. What data structures can a database index be built on? (1) Binary search tree: build a BST over the indexed column to generate the index. (2) B-Tree: build a B-Tree to generate the index. (3) B+Tree: build a B+Tree to generate the index. (4) Hash: MySQL's InnoDB and MyISAM engines do not explicitly support hash indexes. (5) Bitmap: the bitmap structure is used by only a few databases, such as Oracle; MySQL does not support it.








3. He then asked how to locate and optimize a slow SQL query.
(1) Locate the slow SQL via the slow query log.
(2) Analyze the SQL with tools such as EXPLAIN.
(3) Rewrite the SQL, or get it to use an index as far as possible, to improve query efficiency.
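For example, a minimal JDBC sketch of step (2), assuming the MySQL driver is on the classpath; the connection string, credentials, and the orders table are placeholders. The "type" and "key" columns of the EXPLAIN output show whether an index is used:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class ExplainDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "EXPLAIN SELECT * FROM orders WHERE user_id = 42")) {
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                // print every column of the execution plan row
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    System.out.print(meta.getColumnName(i) + "=" + rs.getString(i) + "  ");
                }
                System.out.println();
            }
        }
    }
}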

   
4. Is it better to build more indexes?
(1) Tables with small amounts of data do not need indexes; building extra indexes only adds overhead. (2) Indexes must be maintained as the data changes, so more indexes mean more maintenance cost. (3) More indexes also mean more storage space.


5. Talk about the garbage collection algorithms and garbage collectors you know.

Garbage collection algorithms (the methodology):
    1. Copying: the space is divided into two halves, only one of which is in use at a time. At collection time, the live objects in the half being used are copied into the unused half, and the used half is then cleared. This algorithm produces no fragmentation but leaves space utilization low.

    2. Mark-sweep: this garbage collection algorithm has two phases, a mark phase and a sweep phase. The mark phase marks all objects that need to be reclaimed; when marking ends, the marked objects are reclaimed. This algorithm produces heavy fragmentation and is inefficient.

    3. Mark-compact: all objects to be reclaimed are marked, all surviving objects are moved to one end, and everything outside them is cleared. This algorithm solves the fragmentation problem of mark-sweep.

    4. Generational collection: the heap is divided into blocks by object lifetime, and each region uses the garbage collection algorithm suited to its characteristics, which improves collection efficiency. For example, the Java virtual machine divides the heap this way into a young generation and an old generation, then applies different algorithms to the different generations: the young generation uses the copying algorithm, while the old generation uses mark-sweep or mark-compact.

    5. Region-based collection: the space is divided into many independent sections of varying size, each used and collected independently. The advantage is that several sections can be collected in one pass.

    6. Reference counting: each object gets a reference counter; whenever another variable references the object the counter is incremented, and conversely it is decremented. Only when the counter reaches 0 is the object reclaimed. The algorithm is simple but has drawbacks: frequent operations on an object increase system overhead, and it cannot handle circular references (see the sketch below).
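A minimal sketch of the circular-reference problem: a pure reference-counting collector could never reclaim the pair below, but the JVM's tracing collectors can, once no GC root reaches either object:

public class CycleDemo {
    Object partner;

    public static void main(String[] args) {
        CycleDemo a = new CycleDemo();
        CycleDemo b = new CycleDemo();
        a.partner = b;   // a -> b
        b.partner = a;   // b -> a: a cycle, each reference count stays >= 1
        a = null;
        b = null;        // no GC root reaches the cycle any more
        System.gc();     // only a hint, but a tracing GC can reclaim both objects
    }
}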

Garbage collectors (the concrete implementations):
1. Young generation: Serial, ParNew, Parallel Scavenge

2. Old generation: Serial Old, Parallel Old, CMS

========================================

The young generation collectors in detail:

a. Serial: a single-threaded garbage collector. Single-threaded means it uses only one CPU or one GC thread to complete the collection, and while it is collecting, all other threads must pause until garbage collection finishes.

b. ParNew: the multi-threaded version of Serial; its basic operation is the same as Serial's. This collector is generally used together with CMS.

c. Parallel Scavenge: the goal of this collector is to achieve a controllable throughput, where

Throughput = user code run time / (user code run time + garbage collection time)

d. GC adaptive tuning: the JVM monitors the performance of the running system and dynamically adjusts the GC parameters to provide the optimal pause time or throughput.

========================================

Old generation garbage collectors:
1. Serial Old: the old-generation version of Serial, also a single-threaded collector, used mainly by the virtual machine in Client mode. Note on the generational algorithms: the young generation uses the copying algorithm and pauses all user threads; the old generation uses mark-compact and also pauses all user threads.
2. Parallel Old: the old-generation version of Parallel Scavenge, using multiple threads and the mark-compact algorithm.
3. CMS: a collector whose goal is the shortest possible pause time. Its collection is based on the mark-sweep algorithm.

========================================

A garbage collector for both the old and the young generation: G1. Its features include:
1. Concurrency and parallelism: it uses multiple threads to shorten pause times, and works concurrently so that Java threads can keep running during collection.
2. Generational collection: G1 divides the entire Java heap into many equally sized independent regions (Region). The reason G1 can plan its collections without sweeping the whole Java heap is that it tracks the value of the garbage accumulated in each Region (how much would be reclaimed and how long the reclamation would take) and maintains a priority list in the background; within the pause time each collection is allowed, it collects the Regions with the highest reclaim value first. Hence the name Garbage-First. In G1, references between objects in different Regions, and between the young and old generations, are handled with Remembered Sets so the virtual machine can avoid a full heap scan: each Region has a corresponding Remembered Set, and when a reference is written, the JVM records the relevant information in the Remembered Set of the Region the referenced object belongs to. At collection time, adding the Remembered Sets to the GC Roots enumeration guarantees that nothing is missed even without scanning the whole heap.
3. Space compaction: as a whole G1 works like the mark-compact algorithm, so it does not fragment the heap.
4. Predictable pauses: G1 builds a pause-prediction model, letting the user specify explicitly that within a time slice of M milliseconds, the time consumed by garbage collection must not exceed N milliseconds, approaching a real-time collector.

The G1 collection cycle is divided into the following steps:
1. Initial mark: mark only the objects that GC Roots can reach directly and adjust the Regions in which new objects will be created; this phase pauses user threads but takes very little time.
2. Concurrent mark: starting from GC Roots, perform reachability analysis over the heap objects to find the live ones, concurrently with the running program.
3. Final mark: correct the portion of the marks that changed because the program kept running during concurrent marking, processing the recorded changes.
4. Filtered evacuation: first sort the Regions by reclaim value and cost, then, based on the user's expected GC pause time, evacuate and reclaim a subset of the Regions.
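A small runnable check of which collectors a JVM is actually using, via the standard java.lang.management API (the reported names depend on the JVM and the -XX flags, e.g. "G1 Young Generation" / "G1 Old Generation" under -XX:+UseG1GC):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcInfo {
    public static void main(String[] args) {
        // each bean corresponds to one active collector
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + "  collections=" + gc.getCollectionCount()
                    + "  timeMs=" + gc.getCollectionTime());
        }
    }
}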

6. How is heap memory allocated? The Java heap is divided into two areas, the young generation and the old generation. The young generation contains the Eden area and the Survivor areas, and the Survivor space is further divided into a From area and a To area.


Eden area: new objects are allocated in Eden first. When Eden runs out of space, the virtual machine uses the copying algorithm to launch a Minor GC (Young GC) and clear out the garbage objects. Afterwards most of the objects in Eden will have been reclaimed, and the live objects that don't need reclaiming advance into the Survivor From area (if the From area is out of memory, they go directly into the Old area).

Survivor area: the Survivor area is the equivalent of a buffer zone between Eden and the Old area. If there were no Survivor area, the Old area would fill up quickly and trigger Major GCs (and since a Major GC is generally accompanied by a Minor GC, this can also be viewed as triggering a Full GC). Survivor's raison d'etre is to reduce the number of objects sent to the old generation, and thereby reduce the incidence of Major GC. Survivor is divided into two areas, a From area and a To area. On each Minor GC, the objects that survive in Eden and the From area are moved into the To area (if the To area is out of memory, they go directly into the Old area).

Why is Survivor divided into From and To areas? To solve the memory fragmentation problem. After a Minor GC executes, Eden's surviving objects are moved into a Survivor area, while some of the objects already in the Survivor area may also need to be cleared. If the JVM used the mark-sweep algorithm to clear the garbage objects here, the biggest problem of mark-sweep, memory fragmentation, would hit hard: because so many Eden objects are "born in the morning and dead by evening", memory would inevitably fragment badly. With two Survivor areas, each Minor GC copies the live objects from Eden and the From area into the To area; on the next Minor GC, the live objects in Eden and the To area are copied into the From area, and so on back and forth. One Survivor area is therefore always idle, which solves the memory fragmentation problem.

Old area: the Old area takes up 2/3 of the heap memory space. Objects that survive long enough in the young generation are copied here. Major GC cleans up the Old area, and every Major GC triggers a "Stop-The-World"; the larger the memory, the longer the execution takes. Because old-generation objects have a high survival rate, the copying algorithm would be very inefficient there, so the old-generation garbage collectors use the mark-compact algorithm.

Memory allocation strategy. The main points are:

1. Objects are allocated in Eden first; if Eden does not have enough space for the allocation, the virtual machine performs a Minor GC.
2. Large objects (objects requiring large amounts of contiguous memory) go directly into the old generation. The purpose is to avoid large memory copies between Eden and the two Survivor areas (the young generation is collected with the copying algorithm).
3. Long-lived objects are promoted into the old generation. The virtual machine gives every object an age counter (Age Count): an object that survives one Minor GC enters a Survivor area, and each further Minor GC it survives adds 1 to its age, until it reaches the threshold (15 by default) and enters the old generation.
4. Object age is also determined dynamically: if the total size of all objects of the same age in the Survivor space is greater than half of the Survivor space, objects of that age or older go directly into the old generation.
5. Space allocation guarantee: before each Minor GC, the JVM calculates the average size of the objects promoted to the old generation; if this value is greater than the old generation's remaining space, a Full GC is performed. If it is smaller, the HandlePromotionFailure setting is checked: if true, only a Minor GC is performed; if false, a Full GC is performed.
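A tiny sketch of this allocation behavior, assuming default GC settings (the exact promotion thresholds depend on the collector and flags such as -XX:PretenureSizeThreshold):

public class HeapDemo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.printf("max=%dMB total=%dMB free=%dMB%n",
                rt.maxMemory() >> 20, rt.totalMemory() >> 20, rt.freeMemory() >> 20);

        // short-lived allocations land in Eden; when Eden fills up, a Minor GC
        // runs and most of these arrays die young, so free memory recovers
        for (int i = 0; i < 10_000; i++) {
            byte[] shortLived = new byte[64 * 1024];
            shortLived[0] = 1; // touch it so the allocation isn't optimized away
        }
        // a sufficiently large array may be allocated directly in the old generation
        byte[] big = new byte[32 * 1024 * 1024];
        big[0] = 1;
        System.out.printf("after allocating: free=%dMB%n", rt.freeMemory() >> 20);
    }
}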

7. Have you read the Flink source code? (Because my resume said I had read part of the source.) He then asked me directly about the execution process of a Flink job.

Reference: https://www.cnblogs.com/ooffff/p/9486451.html

8. He then asked: in Java, what are strong references, soft references, weak references, and phantom references used for?
Strong reference: the most common kind of reference, e.g. Object obj = new Object(); the JVM would rather throw an OOM than reclaim an object held by a strong reference. Setting the reference to null weakens it so the object can be reclaimed, or you wait for it to fall out of its life cycle.
Soft reference (SoftReference): for objects in a "useful but not necessary" state; when memory is insufficient they are reclaimed, and when memory is sufficient they are kept. Use: implementing caches.
Weak reference (WeakReference): weaker than a soft reference; a weakly referenced object will be reclaimed by GC, though not necessarily immediately, since GC threads have low priority. Suitable for occasionally used objects that should not affect garbage collection.
Phantom reference (PhantomReference): it does not determine the object's life cycle at all; the object may be garbage collected at any time. It is used to track the garbage collector's reclamation of the object and acts as a sentinel; it cannot be used alone and must be used together with a reference queue (ReferenceQueue).
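A minimal runnable sketch of the four reference types using the java.lang.ref API (System.gc() is only a hint, so the soft/weak results can vary by JVM):

import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

public class ReferenceDemo {
    public static void main(String[] args) {
        Object strong = new Object();                        // never collected while reachable

        SoftReference<byte[]> soft =
                new SoftReference<>(new byte[1024]);         // reclaimed only under memory pressure (caches)
        WeakReference<Object> weak =
                new WeakReference<>(new Object());           // reclaimed at the next GC once only weakly reachable

        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        PhantomReference<Object> phantom =
                new PhantomReference<>(new Object(), queue); // get() always returns null; the queue
                                                             // tells us when the referent is reclaimed
        System.gc();
        System.out.println("soft:    " + soft.get());        // usually still alive
        System.out.println("weak:    " + weak.get());        // usually null after GC
        System.out.println("phantom: " + phantom.get());     // always null
        System.out.println("enqueued: " + (queue.poll() != null));
    }
}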

9. What are the differences between HTTP and HTTPS, and what is the HTTPS data-transfer process?
Differences: 1. HTTPS requires applying to a CA for a certificate; HTTP does not. 2. HTTPS transmits ciphertext, HTTP transmits plaintext. 3. They use different connections: HTTPS uses port 443 by default, HTTP port 80. 4. HTTPS = HTTP + SSL (encryption, authentication, integrity protection), so it is more secure than HTTP.
HTTPS data transmission: (1) the browser sends the encryption algorithms it supports to the server; (2) the server selects one of the algorithms the browser supports and sends it back together with its certificate; (3) the browser verifies the certificate's legitimacy, then uses the certificate's public key to encrypt the session information and sends it to the server; (4) the server decrypts the information with its private key, verifies the hash, and sends an encrypted response back to the browser; (5) the browser decrypts the response and verifies the message, after which the two sides exchange encrypted data.
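A quick way to see the result of the TLS handshake from Java, using the standard HttpsURLConnection API (the URL is just an example):

import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

public class HttpsDemo {
    public static void main(String[] args) throws Exception {
        HttpsURLConnection conn =
                (HttpsURLConnection) new URL("https://www.example.com/").openConnection();
        conn.connect(); // the TLS handshake happens here: certificate check + key exchange
        // after the handshake, application data is encrypted with the negotiated cipher
        System.out.println("cipher suite: " + conn.getCipherSuite());
        System.out.println("status: " + conn.getResponseCode());
        conn.disconnect();
    }
}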

10. What is the difference between a GET request and a POST request?
   At the HTTP message level: a GET request puts its request information in the URL; a POST request puts it in the message body. At the semantic level: GET conforms to idempotence and safety; POST does not. At other levels: GET can be cached and stored; POST cannot. (A GET request can be cached by the browser, saved as a bookmark, and so on; POST cannot and is not idempotent.)

11. For algorithms, he asked only one question: print a binary tree level by level (see the sketch below).
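A standard BFS sketch for level-order printing; the queue size at the start of each round is exactly the number of nodes on the current level:

import java.util.ArrayDeque;
import java.util.Queue;

public class LevelOrderPrint {
    static class Node {
        int val;
        Node left, right;
        Node(int val) { this.val = val; }
    }

    static void printByLevel(Node root) {
        if (root == null) return;
        Queue<Node> queue = new ArrayDeque<>();
        queue.offer(root);
        while (!queue.isEmpty()) {
            int levelSize = queue.size(); // nodes on the current level
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < levelSize; i++) {
                Node node = queue.poll();
                line.append(node.val).append(' ');
                if (node.left != null) queue.offer(node.left);
                if (node.right != null) queue.offer(node.right);
            }
            System.out.println(line.toString().trim());
        }
    }

    public static void main(String[] args) {
        Node root = new Node(1);
        root.left = new Node(2);
        root.right = new Node(3);
        root.left.left = new Node(4);
        printByLevel(root); // prints: 1 / 2 3 / 4
    }
}
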
12. At the end he asked whether I had any questions for him. (I just asked how I had performed in the interview, and he said it was good.)


Second round

1. Self-introduction .......
2. Asked about TCP/IP's three-way handshake and four-way wave?
  Handshake process: 1. First handshake: to establish a connection, the client sends a SYN packet (seq = j) to the server, enters the SYN_SEND state, and waits for the server's acknowledgment. 2. Second handshake: the server receives the SYN packet and must acknowledge the client's SYN (ack = j + 1), while also sending its own SYN packet (seq = k), i.e. a SYN+ACK packet; the server then enters the SYN_RECV state. 3. Third handshake: the client receives the server's SYN+ACK packet and sends an acknowledgment packet ACK (ack = k + 1) to the server; once this packet is sent, the client and server enter the ESTABLISHED state, completing the three-way handshake.
  Hidden risks: a first-handshake timeout opens the door to SYN Flood attacks. Protective measures: 1. After the SYN queue fills up, send back a SYN Cookie via the tcp_syncookies parameter. 2. If it is a legitimate connection, the client will send the SYN Cookie back and the connection is established directly.
  Wave (teardown) process: 1. First wave: the client sends a FIN to close client-to-server data transfer, and the client enters the FIN_WAIT_1 state. 2. Second wave: the server receives the FIN and sends an ACK to the client, with the acknowledgment number equal to the received sequence number + 1 (like a SYN, a FIN occupies one sequence number); the server enters the CLOSE_WAIT state. 3. Third wave: the server sends a FIN to close server-to-client data transfer, and the server enters the LAST_ACK state. 4. Fourth wave: the client receives the FIN, enters the TIME_WAIT state, and then sends an ACK to the server, with the acknowledgment number equal to the received sequence number + 1; the server enters the CLOSED state, completing the four-way wave.
  Questions and risks:
    (1) Why the TIME_WAIT state? Reasons: 1. To make sure the other side has enough time to receive the final ACK packet. 2. To avoid mixing up old and new connections.
    (2) Why do servers sometimes show large numbers of CLOSE_WAIT states? Reason: the other side closed the socket connection, but we were busy reading or writing and never closed our side. Fixes: 1. Check the code, especially the code that releases resources. 2. Check the configuration, especially the thread configuration for handling requests.
    (3) Why does disconnecting need a four-way handshake? Because TCP works in full duplex, the sender and the receiver each need their own FIN packet and ACK packet.

5. What happens after you type a URL into the browser? (I could only vaguely recall part of the HTTPS data-transfer answer; the interviewer said to skip it.)

6. Walk through the execution logic of HashMap's put() method?

public V put(K key, V value) {
    // compute the hash of the key via hash(key)
    return putVal(hash(key), key, value, false, true);
}

static final int hash(Object key) {
    int h;
    // if the key is null, the hash is 0; otherwise call the key's hashCode()
    // and XOR the high 16 bits into the low 16 bits to spread the hash
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K, V>[] tab;
    Node<K, V> p;
    int n, i;
    // if the number of buckets is 0, initialize the table
    if ((tab = table) == null || (n = tab.length) == 0)
        // call resize() to initialize
        n = (tab = resize()).length;
    // (n - 1) & hash computes which bucket the element belongs in;
    // if that bucket is still empty, place the element in its first slot
    if ((p = tab[i = (n - 1) & hash]) == null)
        // create a new node and put it in the bucket
        tab[i] = newNode(hash, key, value, null);
    else {
        // the bucket already contains elements
        Node<K, V> e;
        K k;
        // if the first element's key equals the key being inserted,
        // save it in e so its value can be updated later
        if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        else if (p instanceof TreeNode)
            // if the first element is a tree node, insert via the tree node's putTreeVal
            e = ((TreeNode<K, V>) p).putTreeVal(this, tab, hash, key, value);
        else {
            // walk the bucket's linked list; binCount counts the list's elements
            for (int binCount = 0; ; ++binCount) {
                // if we reach the end of the list without finding the same key,
                // the key is not present: append a new node at the tail
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    // if the list is longer than 8 after insertion, consider treeifying;
                    // -1 because the first element was not counted in binCount
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                // if the key to insert is found in the list, exit the loop
                if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        // an element with the same key was found
        if (e != null) { // existing mapping for key
            // remember the old value
            V oldValue = e.value;
            // decide whether to replace the old value
            if (!onlyIfAbsent || oldValue == null)
                // replace the old value with the new one
                e.value = value;
            // post-access hook, used by LinkedHashMap
            afterNodeAccess(e);
            // return the old value
            return oldValue;
        }
    }
    // reaching here means no existing element was found
    // increment the modification count
    ++modCount;
    // increment the element count and check whether a resize is needed
    if (++size > threshold)
        // grow the table
        resize();
    // post-insertion hook, used by LinkedHashMap
    afterNodeInsertion(evict);
    // no existing element was found, return null
    return null;
}


(1) Compute the key's hash. (2) If the number of buckets (the array) is zero, initialize the table. (3) If the bucket the key maps to contains no element, insert directly. (4) If the first element in the bucket has the same key as the one being inserted, the element is found; hand it to step (9). (5) If the first element is a tree node, call the tree node's putTreeVal() to find or insert the element. (6) If none of the above three cases applies, walk the bucket's linked list to see whether the key is present. (7) If an element with the key is found, hand it to step (9). (8) If no matching element is found, insert a new node at the end of the list and check whether the list should be treeified. (9) If an element with the key was found, decide whether to replace the old value and return the old value directly. (10) If a new element was inserted, increment the count by one and check whether the table needs to grow.

7. The four properties of database transactions?
A: The four ACID properties of database transactions. Atomicity: a transaction either executes in full or not at all. Consistency: e.g. in a transfer, 50 + 50 = 100 must always hold. Isolation: concurrent transactions do not affect each other. Durability: once committed, the changes persist. At the database level: all MySQL transaction isolation levels avoid lost updates; dirty reads are avoided at the READ-COMMITTED isolation level and above; non-repeatable reads are avoided at the REPEATABLE-READ isolation level and above; phantom reads are avoided at the SERIALIZABLE isolation level. Under the RR level, snapshot reads can return historical versions of the data (snapshot read vs. current read).

8. Introduce your project ......
9. What daily concurrency/volume can the pipeline tolerate?
Answer: it can tolerate 30-40 million records per day, roughly 500-600 MB in size. That is about 1.45 million records per hour, a little over 20,000 records per minute. We write once every two minutes, so the amount of data written to ES per batch is around 50,000 records. Batch writes are used to reduce the write pressure on ES, because ES is most stable at about 10,000 writes per second.


10. The source end gets overloaded, and ES (the downstream of the data stream) has no good dynamic backpressure mechanism; backpressure arrives too late and the logs report timeout errors. How do you solve this?

Solution: apply static rate-limiting measures at the source. If the source is Kafka, it already has a good backpressure mechanism built in.
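A minimal sketch of static rate limiting at the source, assuming Guava is on the classpath; readRecord() and writeToEs() are hypothetical stand-ins for the real source read and ES sink:

import com.google.common.util.concurrent.RateLimiter;

public class ThrottledSource {
    // cap the source at 10,000 records/s, matching the ES stability figure above
    private final RateLimiter limiter = RateLimiter.create(10_000);

    public void run() {
        while (true) {
            limiter.acquire();            // blocks until a permit is available
            String record = readRecord(); // hypothetical source read
            writeToEs(record);            // hypothetical (batched) ES write
        }
    }

    private String readRecord() { return "record"; }
    private void writeToEs(String record) { /* batch and flush to ES */ }
}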

11. The execution process of streaming WordCount?
A: The execution process differs between stream processing and batch processing. Taking print() as an example: in streaming mode, execute() must be added at the end of the code, otherwise nothing runs, whereas in batch mode print() already wraps execute() internally and no explicit call is needed (see the sketch after the steps below).
Workflow: print() creates a PrintSinkFunction instance (which inherits from RichSinkFunction) and passes that object into the addSink method; finally the program's execute() method runs the stream.
In the execute() method, it returns executeInternal(); executeInternal is an abstract method, and the LocalStreamEnvironment implementation contains the following steps:

  1. Convert the stream program into a StreamGraph. The job configuration determines whether the type is BATCH or STREAMING; the difference is that a stream program is built with buildStreamProperties (it first checks whether the number of transformations is greater than 0, and throws an exception if not), while a batch program is built with buildBatchProperties; a StreamGraphGenerator then generates the StreamGraph.
  2. Convert the StreamGraph into a JobGraph. The StreamGraph is first added to the JobGraph's configuration in createGraph; then the registered cache files are also added to the job configuration, and finally the ExecutionConfig is set.
  3. Create a MiniCluster and start it. Build the MiniCluster configuration, create the MiniCluster object, then run its start method: open the Metric Registry, open all RPC services, open the resource manager, create the high-availability, blob, and heartbeat services, and finally create the standalone dispatcher and start it.
  4. Run the job. To submit the job we must be scheduled under the Flip-6 model, because slots are requested from the ResourceManager; the job is submitted to the dispatcher, which fetches the job's execution status by JobID. At run time a JobManagerRunner is created and started at creation time; execution is asynchronous, and the client then tries to obtain the job's execution result.
  5. Close the resources.
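A minimal streaming WordCount sketch against the classic DataStream API (Flink 1.x; exact package names can vary across versions), showing the explicit env.execute() call the answer refers to:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("to be or not to be")
           .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                   for (String word : line.split("\\s+")) {
                       out.collect(Tuple2.of(word, 1));
                   }
               }
           })
           .keyBy(t -> t.f0)
           .sum(1)
           .print();                        // in streaming, print() only registers a sink ...

        env.execute("streaming word count"); // ... nothing runs until execute() is called
    }
}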

12. Why doesn't Flink have a similar feedback mechanism?
TCP has a natural flow-control mechanism, and Flink implements backpressure on top of it (the sliding-window concept ....)


13. How does credit-based backpressure work?

Because TCP-based backpressure has certain drawbacks: 1. Backpressure caused by a single task blocks the whole Socket between the two TaskManagers, which means even checkpoint barriers cannot be sent. 2. The backpressure propagation path is too long, so it takes effect with a sizable delay. Credit-based backpressure therefore simulates TCP-style flow control at the application layer ...

14. How is state stored?

FsStateBackend: good performance, with working state stored in heap memory, so it faces OOM risk and does not support incremental checkpoints. RocksDBStateBackend: no OOM risk to worry about, and it supports incremental checkpoints (for how incremental checkpoints are supported, look at the LSM algorithm).
State backends fall into three categories: MemoryStateBackend: checkpoint data is returned directly to the master node; FsStateBackend: checkpoint data is written to a file and the file path is passed to the master node; RocksDBStateBackend: checkpoint data is likewise written to a file with the path passed to the master node. With a HeapKeyedStateBackend the working data ultimately lives in memory; otherwise it is stored in RocksDB.
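A small configuration sketch using the classic (pre-1.13) Flink API referred to above; the HDFS path is a placeholder, and RocksDBStateBackend can be swapped in for incremental checkpoints:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // keep working state on the heap; write checkpoint data to a file path
        // that is handed to the master node
        env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));
        env.enableCheckpointing(60_000); // checkpoint every 60 s
        // ... define the job here, then call env.execute(...)
    }
}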

15. What do you do in scenarios where a value exceeds ValueState's size limit?

Because of the limitations of the JNI bridge API, a single value supports at most 2^31 bytes; consider using MapState in place of ListState or ValueState.
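A sketch of the idea, assuming the RocksDB backend: since RocksDB stores each MapState entry as its own key/value pair, splitting one big value across entries keeps every stored value under the 2^31-byte limit (the chunking scheme here is hypothetical):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class BigStateFunction extends RichFlatMapFunction<String, String> {
    private transient MapState<Integer, byte[]> chunks;

    @Override
    public void open(Configuration parameters) throws Exception {
        // one big logical value is stored as many small entries
        chunks = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("chunks", Integer.class, byte[].class));
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        chunks.put(value.hashCode(), value.getBytes()); // hypothetical chunk key
        out.collect(value);
    }
}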

16. I see you've used Elasticsearch; how much do you know about it? (I only know some simple operations, not very familiar.)

17. Finally, in a casual chat, she asked why I switched to programming. (Given my major, answering truthfully according to my own wishes was relatively easy.)
--- End ---

This article was published to multiple blog platforms at once via OpenWrite!
