Follow my public account and reply [JAVAPDF] in the background to get 200 pages of interview questions!
The "Big Data Path to God" series is followed by 50,000 people — don't you want to take a look?
Welcome your interest in "The Big Data Path to God."
I hope the technical readers of this article will see past the technology itself and reach the other shore early.
All technology is ultimately just passing clouds.
Outline
The theme of this series is a big data development interview guide. It is designed to give everyone a basic study path for big data, to round out the big data development technology stack, and to clarify what the focus of big data study should be when we interview for development jobs and which skills companies hope to see.
This article does not expand any single knowledge point in detail; dedicated articles will follow. I hope readers will use this outline as a study plan, or as a review checklist for filling in gaps.
Language Basics
Java Basics
Looking at the whole big data technology stack from a latency perspective, it consists mainly of two parts: off-line computing and real-time computing. Most big data frameworks and their ecosystems are written in Java or are compatible with Java API calls, so Java, as the first language on the JVM, is a hurdle we cannot get around; the Java language is the foundation for reading source code and for tuning.
Java foundation mainly contains the following sections:
- Language Basics
- Locks
- Multithreading
- Commonly used concurrent containers (java.util.concurrent, the JUC)
Language Basics
Java object orientation
The three features of object orientation in Java: encapsulation, inheritance, and polymorphism
Data types in the Java language
Automatic type conversion and casting in Java
The immutability of String, the virtual machine constant pool, and the underlying principle of String.intern()
Java keywords and their underlying principles: final, static, transient, instanceof, volatile, synchronized
Commonly used Java collection classes and their implementations: the differences among ArrayList/LinkedList/Vector, SynchronizedList/Vector, HashMap/Hashtable/ConcurrentHashMap and their underlying principles
Implementing dynamic proxies
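As a quick refresher on the last item, here is a minimal JDK dynamic proxy sketch. The `Greeter` interface and the decoration it applies are invented purely for illustration; only `java.lang.reflect.Proxy` and `InvocationHandler` are the real JDK API:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class ProxyDemo {
    // Hypothetical interface, used only for this illustration
    interface Greeter {
        String greet(String name);
    }

    // Wrap any Greeter in a JDK dynamic proxy that decorates the return value
    public static Greeter withExclamation(Greeter target) {
        InvocationHandler handler = (proxy, method, methodArgs) -> {
            // Cross-cutting logic (logging, RPC, transactions) would go here
            Object result = method.invoke(target, methodArgs);
            return result + "!";
        };
        return (Greeter) Proxy.newProxyInstance(
                Greeter.class.getClassLoader(),
                new Class<?>[]{Greeter.class},
                handler);
    }

    public static void main(String[] args) {
        Greeter plain = name -> "hello " + name;
        System.out.println(withExclamation(plain).greet("flink")); // hello flink!
    }
}
```

This proxy-per-interface mechanism is exactly what many RPC frameworks build on, which is why it keeps coming up in interviews.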
Locks
CAS; optimistic and pessimistic locking; database lock mechanisms; distributed locks; biased, lightweight, and heavyweight locks; monitors
Lock optimization, lock elimination, lock coarsening, spin locks, reentrant locks, blocking locks, deadlocks
Causes of deadlock
Solutions to deadlock
The usage and principles of the three classes CountDownLatch, CyclicBarrier, and Semaphore
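A small, self-contained example of the `CountDownLatch` usage named above: a hypothetical parallel sum where the latch lets the caller wait for all worker threads, and a CAS-based `AtomicLong` replaces an explicit lock (all names here are made up for the sketch):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public class LatchDemo {
    // Split 1..n across `workers` threads; the latch makes the caller wait for all of them
    public static long parallelSum(int n, int workers) {
        CountDownLatch done = new CountDownLatch(workers);
        AtomicLong total = new AtomicLong();
        for (int w = 0; w < workers; w++) {
            final int offset = w;
            new Thread(() -> {
                long local = 0;
                for (int i = offset + 1; i <= n; i += workers) local += i;
                total.addAndGet(local); // CAS-based add, no explicit lock
                done.countDown();       // signal this worker is finished
            }).start();
        }
        try {
            done.await();               // blocks until the count reaches zero
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(100, 4)); // 5050
    }
}
```

Note the one-shot nature of `CountDownLatch`: once the count hits zero it cannot be reset, which is the classic contrast with `CyclicBarrier`.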
Multithreading
The difference between concurrency and parallelism
The difference between threads and processes
Thread states, thread priorities, thread scheduling, the various ways to create a thread, daemon threads
Designing your own thread pool; submit() vs. execute(); thread pool principles
Why creating a thread pool via Executors is discouraged
Deadlocks and how to troubleshoot them; the relationship between thread safety and the memory model
ThreadLocal variables
The several ways Executors creates thread pools:
newFixedThreadPool(int nThreads)
newCachedThreadPool()
newSingleThreadExecutor()
newScheduledThreadPool(int corePoolSize)
newSingleThreadScheduledExecutor()
Creating a thread pool with ThreadPoolExecutor; rejection policies
Ways to shut down a thread pool
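A sketch tying the last two items together: a bounded `ThreadPoolExecutor` with an explicit rejection policy and a graceful shutdown. The pool sizes and the choice of `CallerRunsPolicy` are illustrative, not a recommendation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolDemo {
    // Bounded pool: 2 core/max threads, a queue of 1, CallerRunsPolicy on overflow
    public static int runTasks(int taskCount) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.CallerRunsPolicy()); // rejected tasks run in the caller thread
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();  // graceful shutdown: no new tasks accepted, queued tasks finish
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(runTasks(20)); // 20: CallerRunsPolicy loses no tasks
    }
}
```

This explicit constructor is exactly what the "don't use Executors" guideline points at: you must state the queue bound and the rejection policy yourself instead of inheriting an unbounded default.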
Concurrent containers (the JUC)
List implementation in the JUC package: CopyOnWriteArrayList
Set implementations in the JUC package: CopyOnWriteArraySet, ConcurrentSkipListSet
Map implementations in the JUC package: ConcurrentHashMap, ConcurrentSkipListMap
Queue implementations in the JUC package: ConcurrentLinkedQueue, ConcurrentLinkedDeque, ArrayBlockingQueue, LinkedBlockingQueue, LinkedBlockingDeque
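A minimal example of one of these containers, `ConcurrentHashMap`, whose `merge()` gives an atomic per-key update without external locking. The word-count scenario is invented for the sketch:

```java
import java.util.concurrent.ConcurrentHashMap;

public class JucDemo {
    // Atomic per-key update: merge() is the lock-free way to build counters
    public static int countOccurrences(String[] words, String target) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum); // atomic read-modify-write per key
        }
        return counts.getOrDefault(target, 0);
    }

    public static void main(String[] args) {
        String[] words = {"spark", "flink", "spark", "kafka"};
        System.out.println(countOccurrences(words, "spark")); // 2
    }
}
```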
Advanced Java
The advanced part covers everything beyond the Java basics. It is the skill set needed to read big data framework source code, and it is also the hardest-hit area when interviewing for senior positions.
JVM
JVM memory structure
Class file format; the runtime data areas: heap, stack, method area, direct memory, runtime constant pool
The difference between heap and stack
Are Java objects always allocated on the heap?
Java memory model
The computer memory model, cache coherence, the MESI protocol, visibility, atomicity, ordering, happens-before, memory barriers, synchronized, volatile, final, locks
Garbage collection
GC algorithms: mark-sweep, reference counting, copying, mark-compact, generational collection, incremental collection; GC parameters; determining live objects; garbage collectors (CMS, G1, ZGC, Epsilon)
JVM parameters and tuning
-Xmx, -Xmn, -Xms, -Xss, -XX:SurvivorRatio, -XX:PermSize, -XX:MaxPermSize, -XX:MaxTenuringThreshold
The Java object model
oop-klass, object headers
HotSpot
Just-in-time (JIT) compilation, compiler optimizations
Virtual machine performance monitoring and troubleshooting tools
jps, jstack, jmap, jstat, jconsole, jinfo, jhat, javap, btrace, TProfiler, Arthas
Class loading mechanism
ClassLoader, the class loading process, parent delegation (and breaking it), modularization (JBoss Modules, OSGi, Jigsaw)
NIO
User space and kernel space
Linux network I/O models: blocking I/O (Blocking I/O), non-blocking I/O (Non-Blocking I/O), I/O multiplexing (I/O Multiplexing), signal-driven I/O (Signal-driven I/O), asynchronous I/O
Zero copy (zerocopy)
A comparison of BIO with NIO
Buffer
Channel
Reactor
Selector
AIO
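The Buffer item above hides the most common NIO interview trap: forgetting `flip()` when switching a `ByteBuffer` from write mode to read mode. A minimal round-trip sketch (the helper name is invented):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferDemo {
    // Write then flip(): flip sets limit=position and position=0,
    // so the bytes just written become the readable region
    public static String roundTrip(String msg) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.put(msg.getBytes(StandardCharsets.UTF_8)); // write mode
        buf.flip();                                    // switch to read mode
        byte[] out = new byte[buf.remaining()];
        buf.get(out);                                  // drain the readable bytes
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("netty")); // netty
    }
}
```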
RPC
The principles of the RPC programming model
Common RPC frameworks: Thrift, Dubbo, Spring Cloud
The difference in application scenarios between RPC and message queues
The core technical points of RPC: service exposure, remote proxy objects, communication, serialization
Linux Basics
Learn common Linux commands
remote login
upload and download
system directory
file and directory operations
The permission system under Linux
compression and packaging
users and groups
Writing Shell scripts
Pipe operations
Distributed theory
Some basic concepts in distributed systems: cluster (Cluster), load balancing (Load Balancer), and so on
The theoretical foundations of distributed systems: consistency, 2PC and 3PC
The theoretical foundations of distributed systems: CAP
The theoretical foundations of distributed systems: time, clocks, and event ordering
Advanced distributed systems theory: Paxos
Advanced distributed systems theory: Raft, Zab
Advanced distributed systems theory: elections, majorities, and leases
Solutions for distributed locks
Solutions for distributed transactions
Solutions for distributed ID generation
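For the last item, a simplified Snowflake-style sketch (41-bit timestamp | 10-bit worker | 12-bit sequence, the commonly cited layout). It deliberately ignores clock rollback and other production concerns; treat it as a toy model, not a reference implementation:

```java
public class SnowflakeSketch {
    // Simplified Snowflake layout: 41-bit timestamp | 10-bit worker | 12-bit sequence
    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeSketch(long workerId) {
        this.workerId = workerId & 0x3FF; // keep 10 bits
    }

    public synchronized long nextId() {
        long ts = System.currentTimeMillis();
        if (ts == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;   // 12-bit sequence within one millisecond
            if (sequence == 0) {                 // sequence exhausted: spin to the next ms
                while (ts <= lastTimestamp) ts = System.currentTimeMillis();
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = ts;
        return (ts << 22) | (workerId << 12) | sequence;
    }

    public static void main(String[] args) {
        SnowflakeSketch gen = new SnowflakeSketch(1);
        long a = gen.nextId(), b = gen.nextId();
        System.out.println(a < b); // IDs are strictly increasing on one worker
    }
}
```

The appeal of the scheme is that IDs are roughly time-ordered and need no coordination between workers, which is why it keeps appearing as the answer to this interview question.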
Netty, the cornerstone of network communication in big data frameworks
Netty is currently the most popular NIO framework. It is widely used in the Internet field, big data and distributed computing, the game industry, the communications industry, and beyond; among the industry's leading open source components for network communication, Netty is the best choice.
What we should master about Netty:
Netty's three-layer network architecture: the Reactor communication scheduling layer, the ChannelPipeline chain of responsibility, and the business logic layer
Netty's thread scheduling model
Serialization
Link availability detection
Traffic Shaping
Graceful shutdown strategy
Netty's support for SSL/TLS
Netty's source code is of high quality; these core parts are recommended reading:
Netty's Buffer
Netty's Reactor
Netty's Pipeline
Netty's Handler overview
Netty's ChannelHandler
Netty's LoggingHandler
Netty's TimeoutHandler
Netty's CodecHandler
Netty's MessageToByteEncoder
Off-line computing
The Hadoop ecosystem is the cornerstone of our study of big data frameworks. In particular, the troika of MapReduce, HDFS, and Yarn lays the foundation for the entire direction of big data development, and it is also the basis for learning the other frameworks that follow. What should we know about Hadoop itself?
MapReduce:
Master how MapReduce works
Be able to hand-write MapReduce code for a simple algorithm such as WordCount or TopN
Master the roles of the Combiner and Partitioner in MapReduce
Be familiar with the process of building a Hadoop cluster and able to resolve common errors
Be familiar with the Hadoop cluster expansion process and its common pitfalls
How to resolve data skew in MapReduce
The principle of Shuffle and methods for reducing Shuffle
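The WordCount logic mentioned above can be sketched in plain Java, mimicking the map → shuffle → reduce phases without any Hadoop dependency (the class and helper names are invented; a real job would use Mapper/Reducer classes and the Hadoop APIs):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // map phase: emit (word, 1); shuffle: group by key; reduce: sum the values
    public static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> shuffled = new HashMap<>();
        for (String line : lines) {                       // map + shuffle
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    shuffled.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
                }
            }
        }
        Map<String, Integer> result = new HashMap<>();    // reduce
        shuffled.forEach((word, ones) ->
                result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hadoop hdfs", "hadoop yarn")).get("hadoop")); // 2
    }
}
```

A Combiner is simply this same summing step run on the map side before the shuffle, which is why it cuts shuffle traffic for WordCount-like jobs.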
HDFS:
Be very familiar with the HDFS architecture diagram and the read and write processes
Be very familiar with HDFS configuration
Be familiar with the roles of the DataNode and the NameNode
NameNode HA: its configuration and purpose; the usage scenarios of the Fsimage and the EditJournal
Common HDFS file operation commands
HDFS safe mode
Yarn:
The background and architecture of Yarn
The division of roles in Yarn and what each role does
Yarn configuration and common resource scheduling policies
The process by which Yarn schedules resources for a task
OLAP engine Hive
Hive is a data warehouse tool for processing structured data in Hadoop. Built on top of Hadoop, it summarizes big data and makes querying and analysis easy. Hive is the most widely used OLAP framework, and Hive SQL is the SQL dialect we use most in development.
The knowledge about Hive you must master:
The principle of HiveSQL: we all know that HiveSQL is translated into MapReduce tasks for execution, but how exactly does a SQL statement get translated into MapReduce?
What is the difference between Hive and common relational databases?
Which data formats does Hive support?
How Hive stores NULL at the underlying level
What the several sorts supported by HiveSQL mean (Sort By / Order By / Cluster By / Distribute By)
Hive dynamic partitioning
Common differences between HQL and SQL
The difference between internal and external tables in Hive
How to solve long-tail and data skew problems in Hive join queries
HiveSQL optimization (system parameter tuning, SQL statement optimization)
Columnar database Hbase
When the concept of a columnar database comes up, the first reaction is HBase.
HBase is in essence a data model, similar to Google's Bigtable, designed to provide fast random access to vast amounts of structured data. It leverages the fault tolerance provided by the Hadoop file system (HDFS).
It is part of the Hadoop ecosystem and provides real-time random read/write access to data in the Hadoop file system.
We can store data either directly in HDFS or through HBase, and data consumers read or randomly access that data in HDFS through HBase. HBase sits on top of the Hadoop file system and provides read and write access.
HBase is a column-oriented database that sorts the rows in a table. The table schema defines only column families, which are collections of key-value pairs. A table has multiple column families, each column family can have any number of columns, and successive column values are stored contiguously on disk. Every cell value in the table has a timestamp. In short, in HBase: a table is a collection of rows, a row is a collection of column families, a column family is a collection of columns, and a column is a collection of key-value pairs.
About HBase you need to know:
HBase architecture and principles
HBase read and write processes
Does HBase have concurrency problems? How does HBase implement its MVCC?
Several important concepts in HBase: HMaster, RegionServer, the WAL mechanism, MemStore
How to design the RowKey and column families during HBase table design
Solutions to data hotspot problems in HBase
Common practices for improving HBase read and write performance
The principles of RowFilter and BloomFilter in HBase
Common comparators in the HBase API
HBase pre-partitioning
HBase Compaction
How to handle an HRegionServer going down in an HBase cluster
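One common remedy for the hotspot problem listed above is salting the RowKey. A plain-Java sketch with an arbitrary bucket count and no HBase client involved; `hashCode` here is only a stand-in for whatever hash function you would actually choose:

```java
public class SaltedKeySketch {
    // Prefix the key with hash(key) % buckets so monotonically increasing keys
    // (timestamps, sequential IDs) spread across regions instead of hammering one
    public static String salt(String rowKey, int buckets) {
        int bucket = (rowKey.hashCode() & 0x7fffffff) % buckets; // non-negative bucket
        return String.format("%02d_%s", bucket, rowKey);
    }

    public static void main(String[] args) {
        // The salt is deterministic, so point reads can recompute the prefix
        System.out.println(salt("user123|20190601", 16));
    }
}
```

The trade-off to mention in an interview: salting spreads writes across pre-split regions, but range scans must now fan out across all buckets.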
Real-time computing
Distributed message queue Kafka
Kafka was originally developed by LinkedIn. It is a distributed messaging system that supports partitions (partition) and multiple replicas (replica), and its biggest feature is the ability to process large amounts of data in real time to meet all kinds of scenarios: Hadoop-based batch processing systems, low-latency real-time systems, the Spark streaming engine, Nginx logs, access logs, messaging services, and so on. Written in Scala, it was contributed by LinkedIn to the Apache Foundation in 2010 and became a top-level open source project.
Kafka, or the similar messaging "wheels" each company has built for itself, is already the de facto standard messaging middleware in the big data field. Kafka has now been updated to the 2.x line and supports features such as KafkaSQL; no longer content to be simple messaging middleware, it is evolving in the direction of a platform.
About Kafka we need to know:
Kafka's characteristics and usage scenarios
Some Kafka concepts: Leader, Broker, Producer, Consumer, Topic, Group, Offset, Partition, ISR
Kafka's overall architecture
Kafka's election strategy
What happens during the process of reading and writing Kafka messages
How Kafka synchronizes data (ISR)
The principle by which Kafka implements partition message ordering
The relationship between consumers and consumer groups
Best practices for consuming Kafka messages
How Kafka guarantees reliable message delivery and idempotency
How transactional messages are implemented in Kafka
How Kafka manages message Offsets
Kafka's file storage mechanism
How Kafka supports Exactly-once semantics
A comparison of Kafka with other messaging middleware, usually RocketMQ, is also commonly required
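The partition-ordering point above rests on keyed partitioning: hash the key and take it modulo the partition count, so equal keys always land in the same partition, and ordering holds per key within that partition. A sketch without the Kafka client; Kafka's default partitioner actually uses murmur2, so the plain `hashCode` here is only a stand-in:

```java
public class PartitionSketch {
    // Messages with the same key always map to the same partition,
    // which is what gives Kafka its per-key ordering guarantee
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("order-42", 12);
        int p2 = partitionFor("order-42", 12);
        System.out.println(p1 == p2); // same key -> same partition
    }
}
```

This also explains a classic interview follow-up: increasing the partition count breaks the key-to-partition mapping, so existing per-key ordering no longer carries over.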
Spark
Spark is a computing engine designed for large-scale data processing, a fast and general cluster computing platform. It is a general-purpose in-memory parallel computing framework developed by the AMP Lab at UC Berkeley for building large-scale, low-latency data analysis applications. It extends the widely used MapReduce computing model and supports more computation modes, including interactive queries and stream processing. A key feature of Spark is the ability to compute in memory; even for complex computations that depend on disk, Spark is still more efficient than MapReduce.
The Spark ecosystem includes: Spark Core, Spark Streaming, Spark SQL, Structured Streaming, and the machine learning libraries.
What we should master when learning Spark:
(1) Spark Core:
Building Spark clusters and the cluster architecture (the roles in a Spark cluster)
The difference between Spark's Cluster and Client modes
Spark's resilient distributed datasets (RDD)
Spark DAG (Directed Acyclic Graph)
Master the Spark RDD programming operator APIs (Transformation and Action operators)
RDD dependencies: what wide and narrow dependencies are
The RDD lineage mechanism
Spark's core computing mechanism
Spark task scheduling and resource scheduling
Spark's CheckPoint and fault tolerance
Spark's communication mechanism
The principle and process of Spark Shuffle
(2) Spark Streaming:
The principles (at the source code level) and operating mechanism
Spark's DStream and its API operations
The two ways Spark Streaming consumes Kafka
How Spark handles Offsets when consuming Kafka messages
Solutions for data skew
Spark Streaming operator tuning
Broadcast variables and parallelism
Shuffle tuning
(3) Spark SQL:
The principles and operating mechanism of Spark SQL
The overall architecture of Catalyst
Spark SQL's DataFrame
Spark SQL optimization strategies: in-memory columnar storage and in-memory table caching, column storage compression, logical query optimization, Join optimization
(4) Structured Streaming
Spark has supported Structured Streaming since version 2.3.0. It is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine that unifies batch processing and stream processing, making Spark a rival to Flink in unifying streaming and batch.
We need to know:
Structured Streaming model
Structured Streaming result output mode
Event time (Event-time) and late data (Late Data)
Window operation
Watermark
Fault Tolerance and Data Recovery
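The watermark and late-data ideas above can be sketched in a few lines: the watermark trails the maximum event time seen by the allowed delay, and any event older than the watermark counts as late. This is a toy model for intuition, not Spark's implementation:

```java
public class WatermarkSketch {
    private final long maxDelayMs;
    private long maxEventTime = Long.MIN_VALUE;

    public WatermarkSketch(long maxDelayMs) {
        this.maxDelayMs = maxDelayMs;
    }

    // The watermark is (max event time seen - allowed delay);
    // an event whose timestamp falls below it is treated as late data
    public boolean isLate(long eventTimeMs) {
        boolean late = maxEventTime != Long.MIN_VALUE
                && eventTimeMs < maxEventTime - maxDelayMs;
        if (eventTimeMs > maxEventTime) maxEventTime = eventTimeMs; // advance the watermark
        return late;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(1000);
        System.out.println(wm.isLate(10_000)); // false: first event
        System.out.println(wm.isLate(12_000)); // false: advances the watermark
        System.out.println(wm.isLate(10_500)); // true: below 12_000 - 1_000
    }
}
```

The engine uses this boundary to decide when a window's state can be finalized and dropped, which is the link to the fault tolerance and state items above.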
Spark MLlib:
This part is Spark's machine learning support. Students with spare capacity can learn the classification, regression, clustering, collaborative filtering, and dimensionality-reduction algorithms commonly used in Spark, along with the underlying optimization primitives and tools, and can try implementing some simple algorithms with Spark MLlib themselves.
Flink
Apache Flink (hereinafter Flink) is a rising star in the big data processing field, and its many features that set it apart from other big data projects have attracted more and more attention. In particular, the open-sourcing of Blink in early 2019 raised attention to Flink to an unprecedented level.
What core knowledge should we master about the Flink framework?
Building Flink clusters
The principles of Flink's architecture
Flink's programming model
Flink cluster HA configuration
Flink's DataSet and DataStream APIs
Serialization
Flink accumulators
State management and state recovery
Windows and time
Parallelism
Combining Flink with the messaging middleware Kafka
The principles and usage of Flink Table and SQL
In addition, a special word here about Alibaba's Blink and its SQL support, which you can read about on the Alibaba Cloud website. What Blink is proudest of is its SQL support, and the two most common SQL issues, 1. the double-stream JOIN problem and 2. State expiry, are the key points that deserve our attention.
Big data algorithms
This algorithms part consists of two sections. The first: common algorithm interview questions for large-scale data processing; the second: commonly used machine learning and data mining algorithms.
We focus on the first part; students with spare capacity can study some of the second, which can count as a highlight during an interview.
Common big data algorithm problems:
Finding co-occurring words in two large files
TopN over massive data
Finding the non-repeated data in massive data
Bloom filter
Bit-map
Heap
Trie
Inverted index
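Several of these are worth practicing by hand. For example, TopN over a large stream with a size-n min-heap, sketched here over a small `int[]` for clarity:

```java
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;

public class TopNSketch {
    // Keep a size-n min-heap: its root is the smallest of the current top n,
    // so each new element only competes with that root, O(total * log n) overall
    public static List<Integer> topN(int[] data, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap by default
        for (int x : data) {
            if (heap.size() < n) {
                heap.offer(x);
            } else if (x > heap.peek()) {
                heap.poll();   // evict the smallest of the current top n
                heap.offer(x);
            }
        }
        return heap.stream().sorted().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topN(new int[]{5, 1, 9, 3, 7, 8}, 3)); // [7, 8, 9]
    }
}
```

The point interviewers look for: memory stays O(n) regardless of how much data streams past, which is what makes the heap approach fit the "massive data" setting.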
What do companies expect you to look like?
Let's look at a few typical requirements for big data development engineers recruited by BAT:
The three positions above come from Baidu, Tencent, and Alibaba; grouping their requirements together:
A solid foundation in 1 to 2 languages
A solid backend development foundation
The off-line computing direction (Hadoop/Hbase/Hive, etc.)
The real-time computing direction (Spark/Flink/Kafka, etc.)
A broader range of knowledge is preferred (plus experience at comparable companies)
If you are a Committer on a top-level Apache project, then congratulations: you are about to become someone major companies compete to poach.
What should we pay attention to when writing your resume?
Having interviewed many people as an interviewer, I think a relatively good resume should have:
A clean layout: skip Word and its formatted templates; I recommend generating a PDF from Markdown
Don't pile up technical terms, and don't write things you don't actually understand, or you will be grilled hard
1 to 2 outstanding project experiences: don't let your resume read like a simple, obvious Demo
For every project on your resume, I suggest you be familiar with every detail; even if you didn't develop it yourself, you should know how it was implemented
Internship or work experience at a well-known company is a big plus
Technical depth and breadth?
In terms of technical direction, we prefer students with multiple skills and both breadth and depth; of course, this bar is already high. But the least you should do is this: for the technology you use, be familiar not only with how to use it but also with its principles.
If you are, or have been, the technical lead for a core component in your group, highlight your technical strengths and forward thinking: don't just be familiar with the pros and cons of the wheels in use now, but show some foresight about future technical developments.
How to send your resume?
The most recommended way is to contact the person directly in charge of the recruiting group, or to get an internal referral from a classmate or colleague.