Big Data Development Interview Knowledge Map


Welcome to "The Road to Big Data Mastery."

May every technical person who reads this article learn, sooner rather than later, to see past the technology itself.
All technology is ultimately fleeting.


Outline

The theme of this series is a big data development interview guide. It aims to provide everyone with a basic learning path for big data, the technology stack you need to grow as a big data developer, and a sense of what companies focus on and what skills interviewers expect when you interview for big data development jobs.
This article does not expand any single knowledge point in detail; follow-up feature articles will do that. I hope readers will use this outline as a study plan, or as a checklist for finding and filling gaps when reviewing.

Language Basics

Java Basics
Viewed along the real-time axis, the whole big data technology stack consists of two main parts: offline computing and real-time computing. Most big data frameworks and their ecosystems are implemented in Java or are compatible with Java API calls, so Java, as the first language on the JVM, is a hurdle we cannot get around; a solid grasp of the language is the foundation for reading framework source code and for tuning.
Java fundamentals mainly include the following areas:

  • Language basics
  • Locks
  • Multithreading
  • Commonly used concurrent containers (java.util.concurrent, the JUC package)

Language Basics

Java object orientation
The three pillars of the Java language: encapsulation, inheritance, and polymorphism
Java data types
Automatic type conversion and explicit casts in Java
The immutability of String, the JVM's string constant pool, and how String.intern() works underneath
Java keywords and their underlying principles: final, static, transient, instanceof, volatile, synchronized
Commonly used Java collection classes: the differences among, and the underlying principles of, ArrayList / LinkedList / Vector, synchronizedList / Vector, HashMap / Hashtable / ConcurrentHashMap
How dynamic proxies are implemented (see the sketch below)
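To make the last item concrete, here is a minimal, self-contained sketch of a JDK dynamic proxy; the Greeter interface and the logging InvocationHandler are invented for illustration:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

interface Greeter {
    String greet(String name);
}

public class DynamicProxyDemo {
    public static void main(String[] args) {
        Greeter target = name -> "Hello, " + name;
        // The InvocationHandler intercepts every call made on the proxy.
        InvocationHandler handler = (proxy, method, methodArgs) -> {
            System.out.println("before " + method.getName());
            Object result = method.invoke(target, methodArgs);
            System.out.println("after " + method.getName());
            return result;
        };
        Greeter proxy = (Greeter) Proxy.newProxyInstance(
                Greeter.class.getClassLoader(),
                new Class<?>[]{Greeter.class},
                handler);
        System.out.println(proxy.greet("Flink"));
    }
}
```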

lock

CAS; optimistic and pessimistic locking; database locking mechanisms; distributed locks; biased, lightweight, and heavyweight locks; monitors
Lock optimization: lock elimination, lock coarsening, spin locks, reentrant locks, blocking locks; deadlock
The causes of deadlock
Solutions to deadlock
The usage and principles of the CountDownLatch, CyclicBarrier, and Semaphore classes (see the sketch below)
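As a concrete anchor for the last item, a minimal CountDownLatch sketch; the worker count and printed messages are arbitrary:

```java
import java.util.concurrent.CountDownLatch;

public class CountDownLatchDemo {
    public static void main(String[] args) throws InterruptedException {
        int workers = 3;
        CountDownLatch latch = new CountDownLatch(workers);
        for (int i = 0; i < workers; i++) {
            final int id = i;
            new Thread(() -> {
                System.out.println("worker " + id + " done");
                latch.countDown(); // each worker signals completion once
            }).start();
        }
        latch.await(); // main thread blocks until the count reaches zero
        System.out.println("all workers finished");
    }
}
```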

Multithreading

The difference between concurrency and parallelism
The difference between threads and processes
Thread states, thread priority, thread scheduling, the various ways to create a thread, daemon threads
Designing your own thread pool; submit() versus execute(); how thread pools work
Why the Executors factory methods are discouraged for creating thread pools
Deadlock, how to troubleshoot it, and the relationship between thread safety and the memory model
ThreadLocal variables
The several Executors factory methods for creating thread pools:
newFixedThreadPool(int nThreads)
newCachedThreadPool()
newSingleThreadExecutor()
newScheduledThreadPool(int corePoolSize)
Creating thread pools with ThreadPoolExecutor; rejection policies (see the sketch below)
Ways to shut down a thread pool
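The following sketch ties several items above together: it uses the explicit ThreadPoolExecutor constructor, which, unlike the Executors factories, forces you to choose a bounded queue and a rejection policy (the unbounded queues and thread counts hidden behind the factories are exactly why they are discouraged). All sizes here are arbitrary example values:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolDemo {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2,                       // corePoolSize
                4,                       // maximumPoolSize
                60, TimeUnit.SECONDS,    // keep-alive for idle non-core threads
                new ArrayBlockingQueue<>(8),                // bounded work queue
                new ThreadPoolExecutor.CallerRunsPolicy()); // explicit rejection policy
        for (int i = 0; i < 16; i++) {
            final int id = i;
            pool.execute(() -> System.out.println("task " + id
                    + " on " + Thread.currentThread().getName()));
        }
        pool.shutdown(); // stop accepting new tasks, let queued tasks finish
    }
}
```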

Concurrent containers (the JUC)

The List implementation in the JUC package: CopyOnWriteArrayList
The Set implementations in the JUC package: CopyOnWriteArraySet, ConcurrentSkipListSet
The Map implementations in the JUC package: ConcurrentHashMap (see the sketch below), ConcurrentSkipListMap
The Queue implementations in the JUC package: ConcurrentLinkedQueue, ConcurrentLinkedDeque, ArrayBlockingQueue, LinkedBlockingQueue, LinkedBlockingDeque
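As a taste of what these containers buy you, a minimal sketch (the key name and iteration counts are arbitrary) of ConcurrentHashMap's atomic per-key update:

```java
import java.util.concurrent.ConcurrentHashMap;

public class JucMapDemo {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        // merge() is atomic per key, so concurrent threads can safely
        // increment the same counter without external locking.
        Runnable task = () -> {
            for (int i = 0; i < 1000; i++) counts.merge("hits", 1, Integer::sum);
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counts.get("hits")); // always 2000
    }
}
```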

Java Advanced articles

The advanced topics build on the Java basics. They are the skills we need in order to read big data framework source code, and they are also the hardest-hit area when interviewing for senior positions.

JVM

JVM memory structure
The class file format; the runtime data areas: heap, stack, method area, direct memory, runtime constant pool
The difference between the heap and the stack
Are Java objects always allocated on the heap?
Java memory model
The hardware memory model, cache coherence, the MESI protocol, visibility, atomicity, ordering, happens-before, memory barriers, synchronized, volatile, final, locks
Garbage collection
GC algorithms: mark-sweep, reference counting, copying, mark-compact, generational collection, incremental collection; GC parameters; determining whether objects are alive; the garbage collectors (CMS, G1, ZGC, Epsilon)
JVM parameters and tuning
-Xmx, -Xmn, -Xms, -Xss, -XX:SurvivorRatio, -XX:PermSize, -XX:MaxPermSize, -XX:MaxTenuringThreshold
The Java object model
oop-klass, object headers
HotSpot
The just-in-time (JIT) compiler; compiler optimizations
Virtual machine performance monitoring and troubleshooting tools
jps, jstack, jmap, jstat, jconsole, jinfo, jhat, javap, BTrace, TProfiler, Arthas
The class loading mechanism
ClassLoader, the class loading process, parent delegation and how to break it (a custom ClassLoader sketch follows), modularization (JBoss Modules, OSGi, Jigsaw)
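A minimal sketch of how parent delegation looks from a subclass's perspective: a custom loader defines a class only after its parents fail to find it. The disk-directory lookup is an illustrative assumption, not a pattern the JVM requires:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DiskClassLoader extends ClassLoader {
    private final Path root;

    public DiskClassLoader(Path root, ClassLoader parent) {
        super(parent); // parent delegation: loadClass() asks the parent chain first
        this.root = root;
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        // Only reached when the parent loaders cannot find the class.
        Path classFile = root.resolve(name.replace('.', '/') + ".class");
        try {
            byte[] bytes = Files.readAllBytes(classFile);
            return defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}
```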

NIO

User space and kernel space
The Linux network I/O models: blocking I/O (Blocking I/O), non-blocking I/O (Non-Blocking I/O), I/O multiplexing (I/O Multiplexing), signal-driven I/O (Signal-Driven I/O), and asynchronous I/O
Zero copy (zero-copy)
BIO compared with NIO
Buffer
Channel
Reactor
Selector (see the echo-server sketch below)
AIO
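To connect Buffer, Channel, and Selector, here is a minimal single-threaded, single-reactor echo server sketch; the port is arbitrary:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioEchoServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);                    // non-blocking accept
        server.register(selector, SelectionKey.OP_ACCEPT);
        while (true) {
            selector.select();                              // block until a channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    int n = client.read(buf);
                    if (n < 0) { client.close(); continue; }
                    buf.flip();                             // switch buffer from writing to reading
                    client.write(buf);                      // echo the bytes back
                }
            }
        }
    }
}
```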

RPC

The RPC programming model and how it works
Common RPC frameworks: Thrift, Dubbo, Spring Cloud
The differences between RPC and message queues, and their respective application scenarios
The core technical points of RPC: service exposure, remote proxy objects, communication, serialization

Linux Basics

Common Linux commands
Remote login
Uploading and downloading files
The system directory layout
File and directory operations
The Linux permission system
Compression and archiving
Users and groups
Writing shell scripts
Pipes

Distributed theory articles

Some basic distributed concepts: cluster (Cluster), load balancing (Load Balancer), and so on
Distributed systems theory, fundamentals: consistency, 2PC and 3PC
Distributed systems theory, fundamentals: CAP
Distributed systems theory, fundamentals: time, clocks, and event ordering
Distributed systems theory, advanced: Paxos
Distributed systems theory, advanced: Raft, Zab
Distributed systems theory, advanced: elections, majority (quorum), and lease
Solutions for distributed locks
Solutions for distributed transactions
Solutions for distributed ID generators

The cornerstone of big data network communication: Netty

Netty is currently the most popular NIO framework. It is widely used on the Internet, in big data and distributed computing, in the gaming industry, in telecommunications, and beyond; the industry's mainstream open-source network communication components are built on it. Whenever network communication is involved, Netty is the best option.
Regarding Netty, we need to grasp:
Netty's three-layer network architecture: the Reactor communication scheduling layer, the ChannelPipeline responsibility chain, and the business logic layer (see the sketch after this list)

Netty's thread scheduling model

Serialization

Link liveness detection

Traffic shaping

Graceful shutdown strategies

Netty's support for SSL/TLS
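A minimal echo server sketch showing the boss/worker Reactor groups and a ChannelPipeline handler; it assumes a Netty 4.x dependency on the classpath, and the port is arbitrary:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class NettyEchoServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);   // Reactor accepting connections
        EventLoopGroup workers = new NioEventLoopGroup(); // Reactors handling I/O
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(boss, workers)
             .channel(NioServerSocketChannel.class)
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     // The pipeline is Netty's chain of responsibility.
                     ch.pipeline().addLast(new ChannelInboundHandlerAdapter() {
                         @Override
                         public void channelRead(ChannelHandlerContext ctx, Object msg) {
                             ctx.writeAndFlush(msg); // echo the bytes back
                         }
                     });
                 }
             });
            ChannelFuture f = b.bind(9000).sync();
            f.channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```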

Netty's source code is of high quality; reading the following core parts is recommended:

Netty's Buffer

Netty's Reactor

Netty's Pipeline

An overview of Netty's Handlers

Netty's ChannelHandler

Netty's LoggingHandler

Netty's TimeoutHandler

Netty's CodecHandler

Netty's MessageToByteEncoder

Offline computing

The Hadoop ecosystem is the cornerstone of our study of big data frameworks; in particular, the troika of MapReduce, HDFS, and Yarn underpins the entire big data development direction and is also the basis for learning the other frameworks that follow. What should we know about Hadoop itself?

MapReduce:

Master how MapReduce works

Be able to hand-write MapReduce code for WordCount or a simple TopN algorithm (see the sketch after this list)

Master the roles of the Combiner and the Partitioner in MapReduce

Be familiar with the process of building a Hadoop cluster and be able to resolve common errors

Be familiar with the process of expanding a Hadoop cluster and its common pitfalls

How to resolve data skew in MapReduce

How Shuffle works and methods for reducing it
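For reference, here is the canonical WordCount, essentially the version from the Hadoop documentation; note how setting a Combiner pre-aggregates locally and reduces the amount of data shuffled:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE); // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum)); // total count per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation shrinks the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```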

HDFS:

Be very familiar with the HDFS architecture diagram and the read and write flows

Be very familiar with HDFS configuration

Be familiar with the roles of the DataNode and the NameNode

How HA is set up and configured, and the roles of the NameNode's Fsimage and EditJournal

Common HDFS file operation commands

HDFS safe mode

Yarn:

The background and architecture of Yarn

The division of roles in Yarn and what each role does

Yarn configuration and common resource scheduling policies

How Yarn schedules resources for a task

OLAP engine: Hive

Hive is a data warehouse tool for processing structured data in Hadoop. It sits on top of Hadoop and makes querying and analyzing big data easy. Hive is the most widely used OLAP framework, and Hive SQL is the SQL dialect we use most in development.
The Hive knowledge you must master includes the following:
How HiveSQL works: we all know that HiveSQL is translated into MapReduce tasks for execution, but how exactly does a SQL statement get translated into MapReduce?

How does Hive differ from common relational databases?

Which data formats does Hive support?

How does Hive store NULL at the storage layer?

What do the several sorting keywords supported by HiveSQL mean (Sort By / Order By / Cluster By / Distribute By)?

Hive dynamic partitioning

What are the common differences between HQL and SQL?

The difference between internal (managed) and external tables in Hive

How to solve the long-tail and data skew problems in Hive join queries

HiveSQL optimization (system parameter tuning, SQL statement optimization)

Column-oriented database: HBase

When columnar databases are mentioned, the first thing that comes to mind is HBase.
HBase is essentially a data model, similar to Google's Bigtable, designed to provide fast random access to huge amounts of structured data. It relies on the fault tolerance provided by the Hadoop Distributed File System (HDFS).
It is part of the Hadoop ecosystem and provides random, real-time read/write access to data stored in the Hadoop file system.
We can store data in HDFS either directly or through HBase, and consumers read and randomly access that data through HBase. HBase sits on top of the Hadoop file system and provides read and write access.
HBase is a column-oriented database whose tables are sorted by row. A table schema defines only column families, which are collections of key-value pairs. A table can have multiple column families, and each column family can contain any number of columns. Column values are stored contiguously on disk, and every cell value carries a timestamp. In short, in HBase: a table is a collection of rows, a row is a collection of column families, a column family is a collection of columns, and a column is a collection of key-value pairs.
About Hbase you need to know:
HBase architecture and principles

HBase read and write flows (a client read/write sketch follows this list)

Does HBase have concurrency problems? How does HBase implement its MVCC?

Several important HBase concepts: HMaster, RegionServer, the WAL mechanism, MemStore

How to design the RowKey and the column families when designing an HBase table

Solutions to HBase data hotspot problems

Common practices for improving HBase read and write performance

How RowFilter and BloomFilter work in HBase

Common comparators in the HBase API

HBase pre-splitting (pre-partitioning regions)

HBase Compaction

How an HBase cluster recovers when an HRegionServer goes down
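A minimal client sketch of a single write and read using the HBase Java API; the table name "user" and column family "info" are invented examples, and it assumes an HBase client dependency plus a reachable cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("row-0001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read it back by row key.
            Get get = new Get(Bytes.toBytes("row-0001"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```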

Real-time computing

Distributed message queue: Kafka

Kafka was originally developed at LinkedIn. It is a distributed messaging system that supports partitions (partition) and multiple replicas (replica), and its greatest strength is processing large volumes of data in real time to satisfy all kinds of scenarios: Hadoop-based batch processing systems, low-latency real-time systems, the Spark streaming engine, Nginx logs, access logs, message services, and so on. It is written in Scala; LinkedIn contributed it to the Apache Foundation in 2010, where it became a top-level open source project.
Kafka, or the similar messaging "wheels" that companies have built for themselves, is already the de facto standard for messaging middleware in the big data field. Kafka has now been updated to version 2.x and supports features such as Kafka SQL; no longer content to be simple messaging middleware, it is evolving in the direction of a platform.
Regarding Kafka, we need to know:
Kafka's characteristics and usage scenarios

Some Kafka concepts: Leader, Broker, Producer, Consumer, Topic, Group, Offset, Partition, ISR

Kafka's overall architecture

Kafka's election strategy

What happens during the process of reading and writing a Kafka message

How Kafka synchronizes data (ISR)

How Kafka implements per-partition message ordering

The relationship between consumers and consumer groups

Best practices for consuming Kafka messages

How Kafka guarantees reliable message delivery and idempotence (see the producer sketch after this list)

How Kafka implements transactional messages

How Kafka manages message offsets

Kafka's file storage mechanism

How Kafka supports Exactly-once semantics

You will usually also be asked to compare Kafka with other messaging middleware, typically RocketMQ
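A minimal producer sketch touching several of the points above (reliability via acks, idempotence, per-key ordering); the broker address and topic name are placeholder assumptions, and the kafka-clients library is assumed on the classpath:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");                 // wait for all ISR replicas: strongest durability
        props.put("enable.idempotence", "true");  // avoid duplicates on producer retries

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key land in the same partition, which is
            // how Kafka preserves per-key message ordering.
            producer.send(new ProducerRecord<>("events", "user-1", "login"));
        }
    }
}
```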

Spark

Spark is a fast, general-purpose cluster computing platform designed for large-scale data processing. It is a general-purpose in-memory parallel computing framework developed by UC Berkeley's AMP Lab for building large-scale, low-latency data analysis applications. It extends the widely used MapReduce computation model and efficiently supports more computation modes, including interactive queries and stream processing. A key feature of Spark is its ability to compute in memory; even for complex computations that must touch disk, Spark is still more efficient than MapReduce.
The Spark ecosystem includes Spark Core, Spark Streaming, Spark SQL, Structured Streaming, and the machine learning libraries.
To learn Spark we should master:
(1) Spark Core:
Building a Spark cluster and the cluster architecture (the roles in a Spark cluster)

The difference between Spark's Cluster and Client deploy modes

Spark's resilient distributed datasets (RDDs)

Spark's DAG (directed acyclic graph)

Mastering the Spark RDD operator APIs (Transformation and Action operators); see the sketch after this list

RDD dependencies: what wide and narrow dependencies are

The RDD lineage mechanism

Spark's core computation mechanism

Spark task scheduling and resource scheduling

Spark CheckPoint and fault tolerance

Spark's communication mechanism

How Spark Shuffle works, and its process
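A minimal Spark Core word count in the Java API, useful for tracing how narrow transformations, a shuffle, and an action map onto the DAG; it assumes a Spark dependency and uses a local master purely for illustration:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.parallelize(Arrays.asList("a b", "b c"));
            // flatMap/mapToPair are narrow transformations; reduceByKey
            // introduces a shuffle (a wide dependency) and a new stage.
            lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator())
                 .mapToPair(w -> new Tuple2<>(w, 1))
                 .reduceByKey(Integer::sum)
                 .collect()                       // Action: materializes the DAG
                 .forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}
```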

(2) Spark Streaming:
the principle of analysis (source code level) and operational mechanism

Spark Dstream its API operations

Spark Streaming are two ways of consumption Kafka

Spark consumption Kafka message Offset processing

Data processing scheme is inclined

Spark Streaming operators tune

Broadcast and parallelism variables

Shuffle Tuning

(3) Spark SQL:
How Spark SQL works and its runtime mechanism

The overall architecture of Catalyst

Spark SQL's DataFrame

Spark SQL optimization strategies: in-memory columnar storage and in-memory table caching, columnar storage compression, logical query optimization, Join optimization
(4) Structured Streaming
Spark has supported Structured Streaming since version 2.3.0. It is a scalable, fault-tolerant stream processing engine built on the Spark SQL engine that unifies batch processing and stream processing. Structured Streaming makes Spark a genuine rival to Flink in unifying stream and batch processing.
We need to know:
Structured Streaming model

Structured Streaming result output mode

Event time (Event-time) and late data (Late Data)

Window operation

Watermark

Fault Tolerance and Data Recovery

Spark MLlib:
This is Spark's machine learning component. Students with spare capacity can learn about the classification, regression, clustering, collaborative filtering, and dimensionality-reduction algorithms Spark commonly provides, along with the underlying optimization primitives and tools. You can also try implementing some simple algorithms with Spark MLlib yourself.

Flink

Apache Flink (hereinafter Flink) is a recent rising star in the big data processing field, and its many features that set it apart from other big data projects have attracted more and more attention. In particular, the open-sourcing of Blink in early 2019 raised attention on Flink to an unprecedented level.
Which core knowledge of the Flink framework should we master?
Building a Flink cluster

The principles of Flink's architecture

Flink's programming model

Flink cluster HA configuration

Flink's DataSet and DataStream APIs (see the sketch after this list)

Serialization

Flink accumulators

State management and recovery

Windows and time

Parallelism

Integrating Flink with the messaging middleware Kafka

Flink Table and SQL: principles and usage
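A minimal DataStream word count sketch, assuming a Flink 1.x dependency on the classpath; the socket source host and port are placeholders (feed it with, e.g., `nc -lk 9999`):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Read lines from a socket, split into words, key by the word,
        // and keep a running count in keyed state.
        env.socketTextStream("localhost", 9999)
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String w : line.split("\\s+")) out.collect(Tuple2.of(w, 1));
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT)) // lambdas erase generic type info
           .keyBy(t -> t.f0)
           .sum(1)
           .print();
        env.execute("streaming word count");
    }
}
```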

In addition, a special note here on Alibaba Blink's SQL support, which you can read about on the Alibaba Cloud website. What Blink is proudest of is its SQL support, and the two most common SQL issues deserve our attention: 1. the dual-stream JOIN problem; 2. the problem of State expiring.

Big Data Algorithm

This algorithms part consists of two components. The first is the algorithm problems for large-scale data processing commonly asked in interviews; the second is commonly used machine learning and data mining algorithms.
We focus on the first part; students with spare capacity can pick up some of the second, which can count as a highlight during interviews.
Common big data algorithm problems:
Finding co-occurring words in two large files

Finding the TopN of massive amounts of data (see the sketch after this list)

Finding the non-duplicated items in massive amounts of data

Bloom filter

Bit-map

Heap

Trie tree

Inverted index
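A sketch of the standard min-heap answer to the TopN question: keep only N elements in memory no matter how large the input is. The sample values are arbitrary:

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class TopN {
    // Returns the N largest values seen in a (potentially huge) stream while
    // holding only N elements in memory: a min-heap whose root is the
    // smallest of the current top N.
    public static PriorityQueue<Long> topN(Iterable<Long> stream, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>(n);
        for (long v : stream) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {
                heap.poll();   // evict the smallest of the current top N
                heap.offer(v);
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        PriorityQueue<Long> top = topN(Arrays.asList(5L, 1L, 9L, 3L, 7L), 3);
        System.out.println(top); // contains 5, 7, 9 (in heap order)
    }
}
```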

What do companies expect you to look like?

Let's look at a few typical big data development engineer job requirements from BAT:

[Screenshots of three job postings from Baidu, Tencent, and Alibaba]

The three postings above are from Baidu, Tencent, and Alibaba. Grouping their requirements together:
Foundations in 1 to 2 languages

A solid backend development foundation

The offline computing direction (Hadoop / HBase / Hive, etc.)

The real-time computing direction (Spark / Flink / Kafka, etc.)

Broader knowledge is a plus (+ experience at comparable companies)

If you are a Committer on a top-level Apache project, then congratulations: you will be someone major companies compete to poach.
What should we pay attention to when writing a resume?
Having interviewed many candidates as an interviewer, I think a relatively good resume should feature:
A clean, attractive layout; avoid Word and garbled templates; Markdown rendered to PDF is recommended

Do not pile up technical buzzwords; do not write down anything you have not used or do not understand, or you will be grilled hard

1 or 2 outstanding project experiences; do not let your resume read like a list of simple, obvious demos

For every project written on your resume, I suggest you be familiar with every detail; if you did not develop it yourself, you must at least know how it could be implemented

Internship or work experience at a well-known company is a big plus

Technical depth and breadth?
In terms of technical direction, we prefer candidates with multiple skills and both breadth and depth, though admittedly that is a high bar. At a minimum, for the technology you use, you should be familiar not only with how to use it but also with how it works.
If you have been a technical leader or core developer on a team, highlight your technical judgment and foresight: beyond being familiar with the pros and cons of the wheels in use today, you should be able to anticipate how the technology will evolve.
How should you send out your resume?
The most recommended way is to contact the person directly responsible for hiring on the target team, or to get an internal referral from a classmate or colleague.

