Big Data Development Engineer Interview (Part 1): Shopee Technical Interview

I. Project questions

1. What projects have you worked on?
2. What technologies did you use?
3. Which project did you lead? How many interfaces did you develop in total, how long did the project last, and how many database tables did it have?

II. Technical questions

1. Hand-write a non-recursive reversal of a singly linked list in a language you are familiar with
2. The main differences between Hadoop and Spark
3. When sorting a large file in Hadoop, how do you guarantee global order, rather than data that is only sorted within a single node?
4. What kinds of UDFs does Hive have?
5. Describe in detail the process behind an HDFS file put and get
6. What garbage collection algorithms does Java have?
7. Strong, weak, and soft references in Java, and the scenarios in which each is used

III. Answers to the technical questions

1. Non-recursive reversal of a singly linked list in Java

Approach:

When reversing the list, every node's next pointer must be updated, but the original next node has to be saved before the update, otherwise the rest of the list is lost and traversal cannot continue. So we keep two pointers, one to the previous node and one to the current node; after each node's next is updated, both pointers move one step forward, until the last node has been processed.

The code is as follows:

class Node {
    char value;
    Node next;
}

public Node reverse(Node current) {
    //initialization
    Node previousNode = null;
    Node nextNode = null;
    
    while (current != null) {
        //save the next node
        nextNode = current.next;
        //update the value of "next"
        current.next = previousNode;
        //shift the pointers
        previousNode = current;
        current = nextNode;            
    }
    return previousNode;
}
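For a quick sanity check, a small driver that builds a three-node list and reverses it (illustrative only, assuming reverse is declared static or called on an instance of its enclosing class):

public static void main(String[] args) {
    // build the list a -> b -> c
    Node a = new Node(); a.value = 'a';
    Node b = new Node(); b.value = 'b';
    Node c = new Node(); c.value = 'c';
    a.next = b;
    b.next = c;

    Node head = reverse(a); // the list is now c -> b -> a
    for (Node n = head; n != null; n = n.next) {
        System.out.print(n.value); // prints "cba"
    }
}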

 

2. The main differences between Hadoop and Spark (a basic question that will almost certainly be asked)

Remember the three most important points:

  • Spark avoids redundant HDFS reads: in Hadoop, the result of every shuffle must be written to disk, while Spark does not have to spill to disk after a shuffle and can cache the data in memory for reuse in later iterations (see the sketch after this list). When a job is complex and involves many shuffle operations, Hadoop's read/write IO time grows enormously; this is also the main reason Hive is slow.
  • Spark avoids redundant MapReduce stages: in Hadoop, a shuffle must be attached to a complete MapReduce job, which adds needless overhead. Spark is built on RDDs and provides a rich set of operators, and the data produced by a shuffle can be cached in memory, reducing shuffle work.
  • JVM optimizations: Hadoop starts a new JVM for every Task in a MapReduce job, so execution is process-based, and each JVM start can take seconds or even tens of seconds. Spark's execution is thread-based: a JVM is started only once per Executor, and Tasks reuse the threads within it. When there are many Tasks, this startup cost alone makes Hadoop far slower than Spark.
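To make the first point concrete, here is a minimal sketch using Spark's Java API (the path and app name are hypothetical). The filtered RDD is cached on the first action, so the second action reuses memory instead of re-reading HDFS; an equivalent two-job MapReduce pipeline would write and re-read the intermediate result on disk:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> errors = sc.textFile("hdfs:///tmp/input.txt")
                .filter(s -> s.contains("ERROR"))
                .cache();                        // keep the filtered data in memory

        long total = errors.count();             // first action: reads HDFS once
        long unique = errors.distinct().count(); // second action: served from the cache

        System.out.println(total + " / " + unique);
        sc.stop();
    }
}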

3. When sorting a large file in Hadoop, how do you guarantee global order?
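The usual answer: do not funnel everything through a single reducer. Sample the input to compute partition boundaries, route keys to reducers by range with a total-order partitioner, let each reducer sort its own range (MapReduce already sorts by key within each reducer), and then concatenate the reducer outputs part-r-00000, part-r-00001, ... to get a globally sorted file. A minimal sketch, assuming Hadoop's built-in TotalOrderPartitioner and InputSampler and a SequenceFile input keyed by the sort key (all paths are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSort {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "total-sort");
        job.setJarByClass(TotalSort.class);
        job.setInputFormatClass(SequenceFileInputFormat.class); // keys are the sort keys
        job.setMapOutputKeyClass(Text.class);
        job.setNumReduceTasks(10);                              // 10 ordered ranges
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Sample the input to pick 9 split points so that every key in
        // partition i sorts before every key in partition i + 1.
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/partitions.lst"));
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}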

 

4. What kinds of UDFs does Hive have?

Hive has three kinds of UDFs:

  • UDF (User-Defined Function): takes one row of input and produces one row of output; 
  • UDAF (User-Defined Aggregation Function): takes multiple rows of input and produces one row of output; 
  • UDTF (User-Defined Table-generating Function): takes one row of input and produces N rows of output.

Which UDFs have you written, and in what circumstances would you use one? (The interviewer can extend the question from here; a minimal example follows.)
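For illustration, a sketch of the one-in-one-out kind, built on the classic org.apache.hadoop.hive.ql.exec.UDF base class (newer Hive versions favor GenericUDF; the class and function names here are just examples):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A toy UDF: takes one string per row and returns its lowercase form.
public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;  // stay null-safe: Hive passes NULLs through
        }
        return new Text(s.toString().toLowerCase());
    }
}

It would be packaged into a jar, registered with ADD JAR plus CREATE TEMPORARY FUNCTION my_lower AS 'Lower';, and then called like any built-in function in a SELECT.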


5. HDFS read/write data flow: what exactly happens during a file put and get

 i) Taking hadoop fs -put as an example:

  • When the NameNode receives the PUT request, it tries to create a new INode for the destination path (src). If that node already exists, or its parent INodeDirectory does not exist, the operation aborts; otherwise an HdfsFileStatus object containing the INode information is returned.

  • Using the HdfsFileStatus, an OutputStream of class DFSOutputStream is constructed, and data written through DFSOutputStream is transmitted over the NIO interface.

  • Data written into DFSOutputStream is packed, at a fixed size (typically 64 KB), into DFSPacket units and pushed onto the DataStreamer's transmission queue.

  • DataStreamer is a separate thread on the Client responsible for data transmission. When it finds a DFSPacket in the queue, it first obtains from the NameNode, via namenode.addBlock, the DataNodes available for the transfer, and then sends the data to the designated DataNode.

  • On the DataNode, a dedicated DataXceiverServer receives the data; when data arrives, the corresponding writeBlock performs the write. If a downstream DataNode also needs to receive the data, it is forwarded through the pipeline to that downstream DataNode. This pipelined replication backs the data up while sparing the Client from transmitting it more than once.

The key steps in the whole procedure are NameNode::addBlock and DataNode::writeBlock; these two steps are worth analyzing in detail on their own.

 ii) The hadoop fs -get operation:

The GET process is simpler than PUT. The NameNode first looks up the INode for the source path and the locations of its corresponding Blocks, and a DFSInputStream object is constructed from the returned LocatedBlocks. In DFSInputStream's read method, the DataNode addresses are found from the LocatedBlocks, and the byte stream is fetched from the DataNode via readBlock.
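For reference, the same put/get flow as seen from the public client API; FileSystem.create and FileSystem.open return the DFSOutputStream/DFSInputStream wrappers described above (a minimal sketch, assuming a configured HDFS client and a writable /tmp path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGetDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/demo.txt");

        // "put": create() asks the NameNode for a new INode, then the
        // stream ships the bytes to DataNodes in DFSPacket units.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        // "get": open() fetches LocatedBlocks from the NameNode, then
        // the stream reads the bytes from a DataNode holding the block.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}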


6. What garbage collection algorithms does Java have?

Garbage collection algorithms in modern virtual machines:

  • Mark-Sweep
  • Copying (used for the young generation)
  • Mark-Compact (used for the old generation)

 Generational collection (the young generation uses the copying algorithm; the old generation uses mark-compact)

Mark-Sweep algorithm

"Mark - sweep" (Mark-Sweep) algorithm, as its name suggests, the algorithm is divided into "mark" and "clear" two stages: first mark all objects need to be recovered, marking the completion of a unified recovery after being out all marked objects. The reason why it is the most basic collection algorithm, because subsequent collection algorithms are based on this idea and make improvements to its shortcomings obtained.

Its two main drawbacks: first, efficiency, since neither the marking nor the sweeping process is efficient; second, space, since it leaves behind a large amount of discontiguous memory fragmentation. With too much fragmentation, the program may later be unable to find a contiguous block large enough when it needs to allocate a big object, forcing another garbage collection to be triggered early.
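To make the two phases concrete, a toy mark-sweep over an explicit object graph (purely illustrative; a real collector walks the JVM heap from GC roots, not user-level lists):

import java.util.ArrayList;
import java.util.List;

public class ToyMarkSweep {
    static class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();
        boolean marked;
        Obj(String name) { this.name = name; }
    }

    // Phase 1: mark everything reachable from a root.
    static void mark(Obj o) {
        if (o == null || o.marked) return;
        o.marked = true;
        for (Obj r : o.refs) mark(r);
    }

    // Phase 2: sweep (drop) every unmarked object, then reset the marks.
    static void sweep(List<Obj> heap) {
        heap.removeIf(o -> !o.marked);
        for (Obj o : heap) o.marked = false;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), c = new Obj("c");
        a.refs.add(b);                       // a -> b is reachable; c is garbage
        List<Obj> heap = new ArrayList<>(List.of(a, b, c));
        mark(a);                             // a is the only root
        sweep(heap);
        heap.forEach(o -> System.out.println(o.name)); // prints a, b
    }
}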

Copying algorithm

"Copy" (Copying) collection algorithm that by the available memory capacity is divided into two equal size, uses only one of them. When this piece of memory runs out, the copy will also survive object to another one above, then memory space has been used once and then clean out.

Because an entire half is reclaimed each time, there is no need to deal with complications such as memory fragmentation when allocating: memory is allocated simply by bumping the top pointer, which is simple and efficient. The costs of this algorithm are that usable memory shrinks to half of the original, and that repeatedly copying long-lived objects degrades efficiency.
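In the same toy style, the copying idea: two equal halves, bump-pointer allocation, and collection by copying survivors into the idle half (liveness is passed in by hand here; a real collector computes it from the roots):

import java.util.Arrays;
import java.util.Set;

public class ToyCopying {
    static final int HALF = 8;
    static Object[] from = new Object[HALF]; // active half
    static Object[] to = new Object[HALF];   // idle half
    static int top = 0;                      // bump pointer

    // Allocation is just a pointer bump into the active half.
    static int alloc(Object o) {
        if (top == HALF) throw new IllegalStateException("collect first");
        from[top] = o;
        return top++;
    }

    // Copy the live slots into the idle half, swap halves, wipe the old one.
    static void collect(Set<Integer> liveSlots) {
        int newTop = 0;
        for (int i = 0; i < top; i++) {
            if (liveSlots.contains(i)) to[newTop++] = from[i];
        }
        Object[] tmp = from; from = to; to = tmp;
        Arrays.fill(to, null); // the whole old half is reclaimed at once
        top = newTop;          // allocation resumes right after the survivors
    }
}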

Mark-Compact algorithm

The copying algorithm performs more copy operations when object survival rates are high, so its efficiency drops. More to the point, if you do not want to waste 50% of the space, you need extra space as an allocation guarantee to handle the extreme case where 100% of objects survive, so the old generation generally cannot use this algorithm directly.

Based on the characteristics of the old generation, the "Mark-Compact" algorithm was proposed. Its marking phase is the same as in mark-sweep, but instead of directly reclaiming the marked objects, it moves all surviving objects toward one end and then directly cleans up the memory beyond the boundary.

Generational collection algorithm

The basic assumption behind generational GC: most objects have very short lifetimes.

"Generational collection" (Generational Collection) algorithm, the Java heap into the new generation and the old time, so that you can collect the most appropriate algorithm according to the characteristics of each era. In the new generation, each time garbage collection when there are a large number of objects found dead, only a few survive, then copy the selection algorithm, only need to pay the cost of reproduction of a small amount of live objects to complete the collection. The old era because of the high survival rate of the object, there is no extra space is allocated to its guarantee, you must use the "mark - clean-up" or "mark - finishing" algorithm to recover.


7. Strong, soft, weak, and phantom references in Java: what they are and the scenarios in which each is used

  • Strong reference: in typical everyday code such as Object obj = new Object(), obj is a strong reference. The reference associated with an object created via the new keyword is a strong reference, and it is by far the most common kind. As long as an object has a strong reference, the garbage collector will never reclaim it: when memory runs short, the JVM would rather throw an OutOfMemoryError (OOM) and terminate the program abnormally than solve the shortage by arbitrarily reclaiming strongly referenced "live" objects. As long as a strong reference still points to an object, the object is "alive" and the collector will not touch it. An ordinary object with no other references becomes eligible for collection once it goes out of its lifetime scope or its (strong) reference is explicitly set to null, though the actual collection time still depends on the GC strategy.
  • Soft reference (SoftReference): a reference somewhat weaker than a strong one, which lets the object survive some collections; only when the JVM judges memory to be insufficient does it try to reclaim the objects soft references point to. The JVM guarantees that softly referenced objects are cleaned up before an OutOfMemoryError is thrown. Soft references are typically used to implement memory-sensitive caches: while free memory remains, the cache is kept, and when memory runs short it is cleaned up, so the cache can be used without exhausting memory. A soft reference can be combined with a ReferenceQueue: if the object it refers to is reclaimed by the garbage collector, the JVM adds the soft reference to its associated reference queue. We can later call the queue's poll() method to check whether an object we care about has been reclaimed; poll() returns null if the queue is empty, otherwise the Reference object at the head of the queue. [Use case]: memory-sensitive caches, as described above.
  • Weak reference: implemented by the WeakReference class, with a shorter lifetime than a soft reference. While the garbage collector thread scans the memory under its control, once it finds an object that is only weakly referenced, it reclaims that object's memory regardless of whether memory is currently sufficient. Because the garbage collector runs as a low-priority thread, weakly referenced objects are not necessarily reclaimed quickly. A weak reference can likewise be combined with a ReferenceQueue: if the object it refers to is garbage collected, the JVM adds the weak reference to its associated reference queue. [Use case]: weak references can also be used for memory-sensitive caches.
  • Phantom reference: you cannot access the object through it. A phantom reference merely provides a mechanism to do something after an object has been finalized; it exists only to learn whether the object has been GC'ed. If an object holds only phantom references, it is as good as having no references at all and may be reclaimed at any time. A phantom reference must be used together with a ReferenceQueue: when the garbage collector is about to reclaim an object and finds it still has a phantom reference, it adds that phantom reference to the associated reference queue before reclaiming the object's memory. [Use case]: tracking the activity of objects being reclaimed by the garbage collector; a notification is received before an object associated with a phantom reference is reclaimed. (A combined sketch of all four reference types follows this list.)
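A minimal combined sketch of the four reference strengths (System.gc() is only a hint, so the exact output can vary by JVM):

import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

public class ReferenceDemo {
    public static void main(String[] args) throws InterruptedException {
        Object strong = new Object();                // strong: never collected while reachable

        SoftReference<byte[]> soft =
                new SoftReference<>(new byte[1024]); // soft: reclaimed only under memory pressure

        WeakReference<Object> weak =
                new WeakReference<>(new Object());   // weak: reclaimed at the next GC

        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        PhantomReference<Object> phantom =
                new PhantomReference<>(new Object(), queue); // phantom: get() always returns null

        System.gc();
        Thread.sleep(100);                           // give the collector a moment

        System.out.println("strong        = " + strong);
        System.out.println("weak.get()    = " + weak.get());    // very likely null by now
        System.out.println("soft.get()    = " + soft.get());    // usually still alive
        System.out.println("phantom.get() = " + phantom.get()); // always null
        System.out.println("queue.poll()  = " + queue.poll());  // the phantom ref, once enqueued
    }
}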

Summarized in a table:

Reference type    | When it is collected   | Typical use                                | Lifetime
Strong reference  | Never                  | The normal state of an object              | Ends when the JVM stops
Soft reference    | When memory is low     | Object cache                               | Ends when memory runs low
Weak reference    | At garbage collection  | Object cache                               | Ends after a GC run
Phantom reference | At any time            | Tracking objects being reclaimed by the GC | Unknown

=================================================================================

This is an original article; if you repost it, please place this notice at the top (and keep the hyperlink).
Reposted from 程序媛说事儿; original link: https://www.cnblogs.com/abc8023/p/10910741.html

=================================================================================
