A Knowledge Map for Big Data Interviews

As the saying goes, the best technique is no technique.

I hope that those who read this article will learn to see past the technology itself sooner rather than later.

In the end, all technology is just minor craftsmanship.

Outline

The theme of this series is a guide to interviews for big data development roles. It aims to give you a basic learning path for big data, the technology stack you need to grow as a big data developer, and a sense of what interviewers focus on and which skills companies expect when you interview for a big data position.

This article does not expand any single knowledge point in detail; dedicated follow-up articles will do that. I hope readers can use this outline as a study plan, or as a checklist for finding and filling gaps while reviewing.

Language Basics

Java Basics

Viewed along the real-time dimension, the big data development technology stack consists mainly of two parts: offline (batch) computing and real-time computing. Most frameworks in the big data ecosystem are written in Java or expose Java-compatible APIs, so Java, as the first-class language on the JVM, is something we cannot get around. A solid grasp of Java is the foundation for reading framework source code and for tuning.

The Java fundamentals mainly cover the following areas:

  • Language basics

  • Locks

  • Multithreading

  • Commonly used concurrent containers (JUC)

Language Basics

  • Java object orientation

  • The three pillars of the Java language: encapsulation, inheritance, and polymorphism

  • Java data types

  • Automatic type conversion and explicit casting in Java

  • String immutability, the JVM string constant pool, and the underlying principle of String.intern()

  • Java keywords and their underlying principles: final, static, transient, instanceof, volatile, synchronized

  • Commonly used Java collection classes: the differences and underlying implementations of ArrayList / LinkedList / Vector, SynchronizedList / Vector, and HashMap / Hashtable / ConcurrentHashMap

  • How dynamic proxies are implemented (a minimal sketch follows this list)
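To make the last point concrete, here is a minimal JDK dynamic proxy sketch; the GreetingService interface and the timing logic are invented purely for illustration.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

interface GreetingService {                 // hypothetical interface, for illustration only
    String greet(String name);
}

public class DynamicProxyDemo {
    public static void main(String[] args) {
        GreetingService target = name -> "Hello, " + name;

        // The InvocationHandler intercepts every call made through the proxy.
        InvocationHandler handler = (proxy, method, methodArgs) -> {
            long start = System.nanoTime();
            Object result = method.invoke(target, methodArgs);   // delegate to the real object
            System.out.println(method.getName() + " took " + (System.nanoTime() - start) + " ns");
            return result;
        };

        GreetingService proxy = (GreetingService) Proxy.newProxyInstance(
                GreetingService.class.getClassLoader(),
                new Class<?>[]{GreetingService.class},
                handler);

        System.out.println(proxy.greet("big data"));   // the call goes through the handler
    }
}
```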

Locks

  • CAS; optimistic vs. pessimistic locking; database locking mechanisms; distributed locks; biased, lightweight, and heavyweight locks; monitors

  • Lock optimization: lock elimination, lock coarsening, spin locks, reentrant locks, blocking locks, deadlocks

  • Causes of deadlock

  • How to resolve deadlocks

  • Usage and principles of the CountDownLatch, CyclicBarrier, and Semaphore classes (a small CountDownLatch example follows this list)
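A minimal CountDownLatch sketch for the last point: the main thread waits until several worker threads have finished. The worker count and task body are arbitrary examples.

```java
import java.util.concurrent.CountDownLatch;

public class CountDownLatchDemo {
    public static void main(String[] args) throws InterruptedException {
        int workers = 3;                                  // arbitrary number of workers
        CountDownLatch latch = new CountDownLatch(workers);

        for (int i = 0; i < workers; i++) {
            final int id = i;
            new Thread(() -> {
                System.out.println("worker " + id + " done");
                latch.countDown();                        // signal that this worker has finished
            }).start();
        }

        latch.await();                                    // block until the count reaches zero
        System.out.println("all workers finished, main thread continues");
    }
}
```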

Multithreading

  • The difference between concurrency and parallelism

  • The difference between threads and processes

  • Thread states, thread priorities, thread scheduling, the various ways to create a thread, daemon threads

  • Designing your own thread pool; submit() vs. execute(); how thread pools work

  • Why creating thread pools via Executors is discouraged

  • Deadlocks and how to troubleshoot them; the relationship between thread safety and the memory model

  • ThreadLocal variables

  • The factory methods Executors provides for creating thread pools:

    • newFixedThreadPool(int nThreads)

    • newCachedThreadPool()

    • newSingleThreadExecutor()

    • newScheduledThreadPool(int corePoolSize)

  • Creating a thread pool with ThreadPoolExecutor; rejection policies (see the sketch after this list)

  • How to shut down a thread pool
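A small sketch for the ThreadPoolExecutor and rejection-policy points above. The pool sizes and queue capacity are arbitrary example values, not recommendations; constructing the pool explicitly like this, rather than through Executors, is what makes the queue bound and the rejection behavior visible.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolDemo {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2,                                   // core pool size (example value)
                4,                                   // maximum pool size (example value)
                60, TimeUnit.SECONDS,                // idle time before extra threads are reclaimed
                new ArrayBlockingQueue<>(100),       // bounded queue, unlike Executors.newFixedThreadPool
                new ThreadPoolExecutor.CallerRunsPolicy());  // rejection policy: caller runs the task itself

        for (int i = 0; i < 10; i++) {
            final int taskId = i;
            pool.execute(() ->
                    System.out.println("task " + taskId + " on " + Thread.currentThread().getName()));
        }

        pool.shutdown();                             // graceful shutdown: finish queued tasks, accept no new ones
    }
}
```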

Concurrent containers (JUC)

  • Implementations of the List interface in JUC: CopyOnWriteArrayList

  • Implementations of the Set interface in JUC: CopyOnWriteArraySet, ConcurrentSkipListSet

  • Implementations of the Map interface in JUC: ConcurrentHashMap, ConcurrentSkipListMap (a small ConcurrentHashMap example follows this list)

  • Implementations of the Queue interface in JUC: ConcurrentLinkedQueue, ConcurrentLinkedDeque, ArrayBlockingQueue, LinkedBlockingQueue, LinkedBlockingDeque
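As a small illustration of the Map implementations above, here is a sketch of thread-safe counting with ConcurrentHashMap; the word list is made-up example data.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentMapDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        List<String> words = List.of("spark", "flink", "spark", "kafka");   // example data

        // merge() performs an atomic read-modify-write, so no external locking is needed.
        words.parallelStream().forEach(w -> counts.merge(w, 1, Integer::sum));

        System.out.println(counts);   // e.g. {spark=2, flink=1, kafka=1}
    }
}
```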

Advanced Java

The advanced topics go beyond the Java fundamentals. They are the skills you need to get familiar with big data framework source code, and they are also where interviews for senior positions hit hardest.

JVM

JVM memory structure

The class file format; the runtime data areas: heap, stack, method area, direct memory, runtime constant pool

The difference between the heap and the stack

Must Java objects always be allocated on the heap?

Java Memory Model

The hardware memory model, cache coherence, the MESI protocol, visibility, atomicity, ordering, happens-before, memory barriers, synchronized, volatile, final, locks

Garbage Collection

GC algorithms: mark-sweep, reference counting, copying, mark-compact, generational collection, incremental collection; GC parameters; determining whether an object is alive; the garbage collectors (CMS, G1, ZGC, Epsilon)

JVM parameters and tuning

-Xmx, -Xmn, -Xms, -Xss, -XX:SurvivorRatio, -XX:PermSize, -XX:MaxPermSize, -XX:MaxTenuringThreshold

Java object model

oop-klass, object header

HotSpot

The just-in-time (JIT) compiler, compiler optimizations

JVM performance monitoring and troubleshooting tools

jps, jstack, jmap, jstat, jconsole, jinfo, jhat, javap, btrace, TProfiler, Arthas

Class loading mechanism

ClassLoader, the class loading process, parent delegation (and breaking the parent delegation model), modularization (JBoss Modules, OSGi, Jigsaw)

NIO

  • User space and kernel space

  • Linux network I/O models: blocking I/O, non-blocking I/O, I/O multiplexing, signal-driven I/O, asynchronous I/O

  • Zero-copy

  • Comparing BIO and NIO

  • Buffer

  • Channel

  • Reactor

  • Selector (a minimal echo-server sketch combining Buffer, Channel, and Selector follows this list)

  • AIO
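As referenced above, a minimal single-threaded echo-server sketch that combines Buffer, Channel, and Selector; the port number is an arbitrary example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioEchoServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));          // example port
        server.configureBlocking(false);                   // non-blocking mode is required for a Selector
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                             // block until at least one channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buffer = ByteBuffer.allocate(1024);
                    int read = client.read(buffer);
                    if (read == -1) { client.close(); continue; }
                    buffer.flip();                         // switch the buffer from writing to reading
                    client.write(buffer);                  // echo the bytes back
                }
            }
        }
    }
}
```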

RPC

  • RPC principles and the programming model

  • Common RPC frameworks: Thrift, Dubbo, Spring Cloud

  • How RPC differs from message queues in its application scenarios

  • Core RPC building blocks: service exposure, remote proxy objects, communication, serialization

Linux Basics

  • Common Linux commands

  • Remote Login

  • Upload Download

  • System Directory

  • File and directory operations

  • Permission system under Linux

  • Compression and packaging

  • Users and Groups

  • Writing shell scripts

  • Pipe operations

Distributed Systems Theory

  • Basic concepts in distributed systems: clusters, load balancing, etc.

  • Theoretical foundations of distributed systems: consistency, 2PC, and 3PC

  • Theoretical foundations of distributed systems: CAP

  • Theoretical foundations of distributed systems: time, clocks, and event ordering

  • Advanced theory of distributed systems: Paxos

  • Advanced theory of distributed systems: Raft and ZAB

  • Advanced theory of distributed systems: elections, quorums (majorities), and leases

  • Distributed lock solutions

  • Distributed transaction solutions

  • Distributed ID generator solutions (a Snowflake-style sketch follows this list)
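For the last point, one common solution is a Snowflake-style generator. Below is a simplified, non-production sketch: the epoch and bit layout are illustrative choices, and clock rollback is deliberately not handled.

```java
public class SimpleSnowflakeId {
    private static final long EPOCH = 1546300800000L;   // example custom epoch: 2019-01-01
    private final long workerId;                        // 0..1023 in this simplified layout
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SimpleSnowflakeId(long workerId) {
        this.workerId = workerId;
    }

    // Layout (illustrative): 41 bits timestamp | 10 bits worker id | 12 bits sequence
    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;           // up to 4096 ids per millisecond per worker
            if (sequence == 0) {                         // sequence exhausted: spin until the next millisecond
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}
```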

The Cornerstone of Big Data Network Communication: Netty

Netty is the most popular NIO framework. It is widely used in Internet services, big data and distributed computing, the gaming industry, the telecom industry, and beyond. Among the industry's leading open-source components, whenever network communication is involved, Netty is usually the best option.

For Netty, we should master:

  • Netty's layered architecture: the Reactor communication and scheduling layer, the ChannelPipeline responsibility chain, and the business logic layer

  • Netty's thread scheduling model

  • Serialization

  • Connection (link) liveness detection

  • Traffic shaping

  • Graceful shutdown strategy

  • Netty's support for SSL/TLS

  • Netty's source code is of high quality; the core parts recommended for reading are listed below, and a minimal server sketch follows the list:

  • Netty's Buffer

  • Netty's Reactor

  • Netty's Pipeline

  • An overview of Netty's Handlers

  • Netty's ChannelHandler

  • Netty's LoggingHandler

  • Netty's TimeoutHandler

  • Netty's CodecHandler

  • Netty's MessageToByteEncoder
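To tie the pieces above together (the EventLoopGroup as Reactor, the ChannelPipeline, and Handlers), here is a minimal Netty 4 echo-server sketch; the port number is an arbitrary example.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class NettyEchoServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);    // accepts connections (main Reactor loop)
        EventLoopGroup worker = new NioEventLoopGroup();   // handles I/O for accepted channels
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                    .group(boss, worker)
                    .channel(NioServerSocketChannel.class)
                    .childHandler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            // The pipeline is the responsibility chain every inbound/outbound event flows through.
                            ch.pipeline().addLast(new ChannelInboundHandlerAdapter() {
                                @Override
                                public void channelRead(ChannelHandlerContext ctx, Object msg) {
                                    ctx.writeAndFlush((ByteBuf) msg);   // echo the bytes back
                                }
                            });
                        }
                    });
            bootstrap.bind(9001).sync().channel().closeFuture().sync();   // example port
        } finally {
            boss.shutdownGracefully();
            worker.shutdownGracefully();
        }
    }
}
```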

Offline Computing

The Hadoop ecosystem is the cornerstone of our study of big data frameworks. In particular, the "troika" of MapReduce, HDFS, and YARN underpins the entire big data development path and is the foundation for the other frameworks we learn later. So what should we know about Hadoop itself?

MapReduce:

  • Master how MapReduce works

  • Be able to hand-write MapReduce code for a simple algorithm such as WordCount or TopN (a minimal WordCount sketch follows this list)

  • Master the roles of the Combiner and the Partitioner in MapReduce

  • Be familiar with the process of setting up a Hadoop cluster and able to resolve common errors

  • Be familiar with the Hadoop cluster expansion process and its common pitfalls

  • How to deal with data skew in MapReduce

  • How Shuffle works and ways to reduce shuffling
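As promised above, a compact WordCount sketch against the classic Hadoop MapReduce API (the driver and job wiring, such as input/output paths, are omitted). The same reducer class can usually double as the Combiner.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {   // split the line into words
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);                        // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                                      // add up the counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```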

HDFS:

  • Be very familiar with the HDFS architecture and its read and write paths

  • Be very familiar with HDFS configuration

  • Be familiar with the roles of the DataNode and the NameNode

  • How to set up and configure NameNode HA; the roles of the NameNode's FsImage and edit journal (EditLog)

  • Common HDFS file operation commands (a small Java API sketch follows this list)

  • HDFS security model
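The shell commands (hdfs dfs -put, -cat, -ls, and so on) have direct equivalents in the Java FileSystem API; below is a small sketch, with the NameNode address and paths as made-up examples.

```java
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Example NameNode address and paths; replace them with your cluster's values.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path local = new Path("/tmp/words.txt");
        Path remote = new Path("/data/words.txt");

        fs.copyFromLocalFile(local, remote);                 // equivalent of `hdfs dfs -put`
        try (FSDataInputStream in = fs.open(remote)) {       // equivalent of `hdfs dfs -cat`
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```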

YARN:

  • The background and architecture of YARN

  • The division of roles in YARN and what each role does

  • YARN configuration and the common resource scheduling policies

  • The resource scheduling process when YARN runs a task

The OLAP Engine: Hive

Hive is a data warehouse tool for processing structured data in Hadoop. It sits on top of Hadoop to summarize big data and makes querying and analysis easy. Hive is the most widely used OLAP framework, and HiveSQL is the SQL dialect we use most in development.

For Hive, the knowledge points you must master are as follows:

  • How HiveSQL works: we all know that HiveSQL is translated into MapReduce jobs for execution, but how exactly does a SQL statement get translated into MapReduce?

  • What are the differences between Hive and a typical relational database?

  • Which data formats does Hive support?

  • How does Hive store NULL under the hood?

  • What do the sorting keywords supported by HiveSQL mean (Sort By / Order By / Cluster By / Distribute By)?

  • Hive dynamic partitioning

  • What are the common differences between HQL and standard SQL?

  • The difference between internal (managed) and external tables in Hive

  • How to solve long-tail and data skew problems in Hive join queries

  • HiveSQL optimization (tuning system parameters, optimizing SQL statements)

The Column-Oriented Database: HBase

When the concept of a column-oriented database comes up, the first thing that comes to mind is HBase.

HBase is, in essence, a data model similar to Google's Bigtable, designed to provide fast random access to huge amounts of structured data. It relies on the fault tolerance provided by the Hadoop Distributed File System (HDFS).

It is part of the Hadoop ecosystem and provides real-time random read/write access to data stored in the Hadoop file system.

Data can be stored in HDFS either directly or through HBase, and consumers can read or randomly access that data through HBase. HBase sits on top of the Hadoop file system and provides read and write access.

HBase is a column-oriented database whose table rows are kept sorted. The table schema defines only column families, which are collections of key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Column values are stored contiguously on disk, and every cell value carries a timestamp. In short, in HBase: a table is a collection of rows, a row is a collection of column families, a column family is a collection of columns, and a column is a collection of key-value pairs.
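That model (a cell addressed by row key, column family, qualifier, and timestamp) maps directly onto the client API. Here is a minimal sketch with made-up table, family, and qualifier names, assuming the HBase 1.x/2.x client:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        // Table "user", column family "info", qualifier "name" are example names.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {

            Put put = new Put(Bytes.toBytes("row-001"));                   // row key
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),    // family, qualifier
                          Bytes.toBytes("Alice"));                         // cell value (timestamped on write)
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));                     // prints "Alice"
        }
    }
}
```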

For HBase, you need to know:

  • HBase architecture and principles

  • The HBase read and write paths

  • Does HBase have concurrency issues? How does HBase implement its MVCC?

  • Several important concepts in HBase: HMaster, RegionServer, the WAL mechanism, MemStore

  • How to design the RowKey and column families when designing an HBase table

  • HBase data hotspot problems and their solutions

  • Common practices for improving HBase read and write performance

  • The principles of RowFilter and BloomFilter in HBase

  • Common comparators in the HBase API

  • HBase pre-splitting (pre-partitioning)

  • Compaction in HBase

  • How an HBase cluster handles HRegionServer failures

Real-Time Computing

The Distributed Message Queue: Kafka

Kafka was originally developed by LinkedIn. It is a distributed messaging system that supports partitions and multiple replicas, and its biggest feature is the ability to process large amounts of data in real time to serve a wide range of scenarios: Hadoop-based batch processing systems, low-latency real-time systems, Spark streaming engines, Nginx logs, access logs, messaging services, and so on. It is written in Scala and was contributed by LinkedIn to the Apache Foundation in 2010, where it became a top-level open-source project.

Kafka, together with the Kafka-like messaging "wheels" that individual companies have built for themselves, is already the de facto standard for messaging middleware in the big data field. Kafka has now reached version 2.x and supports KafkaSQL-style features; it is no longer content to be simple messaging middleware and is evolving into a platform.
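A minimal Java producer sketch that also touches on the reliability and idempotence points listed below; the broker address and topic name are made-up examples.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");          // example broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                                // wait for the full ISR
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);                   // avoid duplicates on retry

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key go to the same partition, which preserves per-key ordering.
            producer.send(new ProducerRecord<>("events", "user-1", "page_view"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.println("written to partition " + metadata.partition()
                                    + " at offset " + metadata.offset());
                        }
                    });
        }
    }
}
```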

For Kafka, we need to know:

  • Kafka's features and usage scenarios

  • Core Kafka concepts: Leader, Broker, Producer, Consumer, Topic, Group, Offset, Partition, ISR

  • Kafka's overall architecture

  • Kafka's leader election strategy

  • What happens during the process of reading and writing Kafka messages

  • How Kafka synchronizes data (ISR)

  • How Kafka guarantees message ordering within a partition

  • The relationship between consumers and consumer groups

  • Best practices for consuming Kafka messages

  • How Kafka guarantees message delivery reliability and idempotence

  • How transactional messages are implemented in Kafka

  • How Kafka message offsets are managed

  • Kafka's file storage mechanism

  • How Kafka supports exactly-once semantics

  • You will also generally be asked to compare Kafka with other messaging middleware such as RocketMQ

Spark

Spark is a fast, general-purpose cluster computing platform designed specifically for large-scale data processing. It is an in-memory parallel computing framework developed by the AMP Lab at the University of California, Berkeley, used to build large-scale, low-latency data analysis applications. It extends the widely used MapReduce computation model and supports more computation patterns, including interactive queries and stream processing. A key feature of Spark is the ability to compute in memory; even for complex computations that still depend on disk, Spark remains more efficient than MapReduce.

The Spark ecosystem includes Spark Core, Spark Streaming, Spark SQL, Structured Streaming, and the machine learning libraries.

For Spark, we should learn and master:

(1)Spark Core:

  • Spark clusters and how to set them up; the cluster architecture (the roles in a Spark cluster)

  • The difference between Spark's cluster and client modes

  • Spark's Resilient Distributed Datasets (RDDs)

  • Spark's DAG (Directed Acyclic Graph)

  • Master the Spark RDD operator APIs (transformation and action operators); a minimal example follows this list

  • RDD dependencies: what wide and narrow dependencies are

  • RDD lineage

  • Spark's core computation mechanism

  • Spark task scheduling and resource scheduling

  • Spark checkpointing and fault tolerance

  • Spark's communication mechanism

  • The principle and process of Spark Shuffle
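As referenced above, a minimal RDD example in Java that chains transformations (flatMap, mapToPair, reduceByKey) with an action (collect); the input path is a made-up example.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///data/words.txt");   // example input path

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // transformation
                    .mapToPair(word -> new Tuple2<>(word, 1))                        // transformation
                    .reduceByKey(Integer::sum);                                      // wide dependency -> shuffle

            // The action triggers the actual job execution.
            counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}
```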

(2)Spark Streaming:

  • Principles (at the source-code level) and the runtime mechanism

  • Spark DStreams and their API operations

  • The two ways Spark Streaming can consume Kafka

  • How Spark handles Kafka message offsets when consuming

  • Approaches to handling data skew

  • Tuning Spark Streaming operators

  • Broadcast variables and parallelism

  • Shuffle Tuning

(3)Spark SQL:

  • How Spark SQL works and its execution mechanism

  • Catalyst's overall architecture

  • Spark SQL's DataFrame (a small example follows this list)

  • Spark SQL optimization strategies: in-memory columnar storage and in-memory cached tables, column storage compression, logical query optimization, join optimization
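A small DataFrame sketch in Java, as referenced above; the JSON path and column names are made-up examples. The same aggregation is expressed once through the DataFrame API and once through SQL, and in both cases Catalyst optimizes the logical plan before execution.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("spark-sql-demo").getOrCreate();

        // Example input: JSON records with "city" and "amount" fields.
        Dataset<Row> orders = spark.read().json("hdfs:///data/orders.json");

        orders.filter(col("amount").gt(100))        // Catalyst pushes this predicate into the plan
              .groupBy("city")
              .count()
              .show();

        orders.createOrReplaceTempView("orders");   // the same query expressed in SQL
        spark.sql("SELECT city, COUNT(*) FROM orders WHERE amount > 100 GROUP BY city").show();

        spark.stop();
    }
}
```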

(4)Structured Streaming

Spark has supported Structured Streaming since version 2.3.0. It is a scalable, fault-tolerant stream processing engine built on the Spark SQL engine that unifies batch and stream processing. With Structured Streaming, Spark can rival Flink in unifying streaming and batch workloads.

We need to know:

  • The Structured Streaming model

  • Structured Streaming output modes

  • Event time and late data

  • Window operations (a small sketch combining windows and watermarks follows this list)

  • Watermark

  • Fault Tolerance and Data Recovery
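A small Structured Streaming sketch in Java covering an event-time window, a watermark, and an output mode; the built-in rate source and the window/watermark durations are illustrative choices.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredStreamingDemo {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("structured-streaming-demo").getOrCreate();

        // The built-in "rate" source emits (timestamp, value) rows; it stands in for a real stream here.
        Dataset<Row> events = spark.readStream()
                .format("rate")
                .option("rowsPerSecond", 10)
                .load();

        Dataset<Row> counts = events
                .withWatermark("timestamp", "1 minute")                     // tolerate 1 minute of late data
                .groupBy(window(col("timestamp"), "30 seconds"))            // event-time tumbling window
                .count();

        StreamingQuery query = counts.writeStream()
                .outputMode("update")                                        // only changed windows are emitted
                .format("console")
                .start();

        query.awaitTermination();
    }
}
```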

Spark MLlib:

This part covers Spark's machine learning support. Students with spare capacity can learn about the classification, regression, clustering, collaborative filtering, and dimensionality reduction algorithms commonly used in Spark, along with the underlying optimization primitives and tools. You can try using Spark MLlib yourself to implement some simple algorithms.

Flink

Apache Flink (hereafter Flink) is a rising star in the big data processing field in recent years, and its many features that set it apart from other big data projects have attracted more and more attention. In particular, the open-sourcing of Blink in early 2019 raised attention on Flink to an unprecedented level.

So which core knowledge points of the Flink framework should we master?

  • Building Flink clusters

  • Flink's architecture and principles

  • Flink's programming model

  • Configuring HA for a Flink cluster

  • Flink's DataSet and DataStream APIs (a minimal DataStream example follows this list)

  • Serialization

  • Flink accumulators

  • State management and state recovery

  • Time and windows

  • Parallelism

  • Integrating Flink with the messaging middleware Kafka

  • The principles and usage of the Flink Table API and SQL
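As referenced above, a minimal DataStream word count in Java with a keyed tumbling window, assuming a Flink 1.x-style DataStream API; the socket source and window size are arbitrary examples.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> lines = env.socketTextStream("localhost", 9999);   // example source

        lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split("\\s+")) {
                            out.collect(Tuple2.of(word, 1));                  // emit (word, 1)
                        }
                    }
                })
                .keyBy(t -> t.f0)                                             // key by the word
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))    // 5-second tumbling window
                .sum(1)                                                       // sum the counts per key
                .print();

        env.execute("flink-word-count");
    }
}
```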

It is also worth highlighting Alibaba Blink's support for SQL here. As you can see on the Alibaba Cloud official site, SQL support is one of the things Blink is proudest of. The two most common SQL issues, dual-stream JOINs and state expiration, are the key issues we should pay attention to.

Big Data Algorithms

This algorithms section consists of two parts. The first is the algorithm questions commonly asked in interviews about processing massive data; the second is commonly used machine learning and data mining algorithms.

We focus on the first part; students with spare capacity can dip into the second, which can be a highlight during an interview.

Common big data algorithm problems:

  1. Finding the words that co-occur in two large files

  2. Finding the TopN in massive data (a min-heap sketch follows this list)

  3. Identifying the distinct values in huge amounts of data

  4. Bloom filter

  5. Bitmap

  6. Heap

  7. Trie

  8. Inverted index
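As mentioned above, the standard single-machine building block for TopN is a fixed-size min-heap; here is a sketch with made-up example data. On truly massive data, each partition keeps its own local heap and the partial results are merged.

```java
import java.util.List;
import java.util.PriorityQueue;

public class TopN {
    // Keep the N largest values seen so far using a min-heap of size N.
    public static PriorityQueue<Long> topN(Iterable<Long> values, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>(n);   // natural ordering = min-heap
        for (long v : values) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {                    // larger than the smallest of the current top N
                heap.poll();
                heap.offer(v);
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        List<Long> data = List.of(7L, 42L, 3L, 99L, 15L, 8L);   // example data
        System.out.println(topN(data, 3));                       // contains 15, 42, 99 (in heap order)
    }
}
```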

What do companies expect you to look like?

Let's look at a few typical BAT job postings for big data development engineers:

The three postings above are from Baidu, Tencent, and Alibaba; grouping their requirements together:

  1. One or two base languages

  2. A solid backend development foundation

  3. The offline computing direction (Hadoop / HBase / Hive, etc.)

  4. The real-time computing direction (Spark / Flink / Kafka, etc.)

  5. A broader range of knowledge is a plus (as is experience at peer companies)

If you are a committer on a top-level Apache project, congratulations: you will be the kind of person major companies compete to poach.

What should you pay attention to when writing a resume?

Having interviewed many candidates as an interviewer, I think a relatively good resume should have:

  1. A clean layout; skip Word and its heavily formatted templates, and consider generating a PDF from Markdown

  2. Do not pile up technical buzzwords, and do not list things you have not done or do not understand, or you will be grilled hard

  3. One or two outstanding project experiences; do not let your resume read like a trivial demo

  4. For every project on your resume, I suggest you be familiar with all the details; if you are not, you should at least know how it could be implemented

  5. Internship or work experience at a well-known company is a big plus

Technical depth and breadth?

In terms of technical direction, we prefer candidates with multiple skills and both breadth and depth, although that is admittedly a high bar. At a minimum, for the technology you use, you should not only be familiar with how to use it but also understand its principles.

If you are, or aspire to be, a core developer or technical leader on your team, then highlight your technical strengths and forward thinking: be familiar not only with the pros and cons of today's "wheels" (frameworks), but also show some foresight about, and ability to predict, future technology trends.

How should you send your resume?

The most recommended way is to go directly to the person in charge of hiring for the team, or to have a classmate or colleague refer you internally.



Source: blog.csdn.net/mnbvxiaoxin/article/details/104326867