First, Preface
What is the hottest area in today's IT industry? ABC, without question. "ABC" stands for AI + Big Data + Cloud, that is, artificial intelligence, big data, and cloud computing (cloud platforms). Each of these fields now has industry leaders driving it forward; today we will talk about the big data direction.
Second, Roles in Big Data
Role one: big data engineering
The big data engineering role must define and implement data collection, computation, and storage. When designing and deploying such systems, a big data engineer's primary concern is data availability — the engineering system must provide data services to downstream business or analytics systems in real time.
Role two: big data analytics
The big data analytics role concerns how to use the data — that is, how to turn the data received from the engineering system into timely analysis output for the business or organization, in a way that genuinely improves the business or the level of service. For a big data analyst, the primary problem is discovering and exploiting the value in the data; concretely, this can include trend analysis, modeling, and predictive analysis.
These two roles are interdependent but operate independently. What does that mean? Without big data engineering, big data analytics is impossible; but without big data analytics, I cannot see a reason for big data engineering to exist. This is a bit like love and marriage: the purpose of love is marriage, and love without marriage as its goal is just fooling around.
To summarize briefly: the big data engineering role handles data collection, computation (or processing), and storage; the big data analytics role performs advanced computation on that data.
Third, the Big Data Engineer
For role one, big data engineering, the corresponding job title is big data engineer. As a big data engineer, you should have at least the following skills:
Linux basics:
Most big data systems are open source software, and that software runs on open source Linux systems, so you must master basic Linux operations such as user management, permissions, and shell programming.
A JVM-based language:
JVM-based languages hold a very large share of the current big data ecosystem; calling it a near monopoly would not be an exaggeration. Here I recommend learning Java or Scala; languages like Clojure are harder to get started with, so I do not really recommend them. Also, a hot big data framework tends to popularize the language it is built on, as Docker has done for Go and Kafka has done for Scala.
So we recommend that you become proficient in at least one JVM-family language. It is worth emphasizing that you must understand that language's multi-threading model and memory model: the processing model of many big data frameworks is, at the language level, quite similar to multi-threaded processing — the frameworks simply extend it to the distributed, multi-machine level.
Recommendation: learn Java or Scala.
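To make the point above concrete, here is a toy Java sketch (no framework involved, all names are made up for illustration): a word count partitioned across threads with `ExecutorService`, using the same partition-then-merge pattern that big data frameworks extend from one JVM to a whole cluster.

```java
import java.util.*;
import java.util.concurrent.*;

public class ThreadedWordCount {
    public static Map<String, Integer> count(List<String> lines) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Map<String, Integer>>> parts = new ArrayList<>();
            for (String line : lines) {                   // "map" phase: one task per chunk
                parts.add(pool.submit(() -> {
                    Map<String, Integer> local = new HashMap<>();
                    for (String w : line.split("\\s+")) local.merge(w, 1, Integer::sum);
                    return local;
                }));
            }
            Map<String, Integer> total = new HashMap<>(); // "reduce" phase: merge partials
            for (Future<Map<String, Integer>> f : parts)
                f.get().forEach((w, c) -> total.merge(w, c, Integer::sum));
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("a b a", "b c")));
    }
}
```

A framework such as MapReduce or Spark does essentially this, except the "tasks" run on different machines and the merge step becomes a network shuffle.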
Computation/processing frameworks:
Strictly speaking, these divide into offline batch processing and stream processing. Streaming is the future trend, and I suggest you make a point of learning it; offline batch processing is aging fast — it cannot express the idea of processing an unbounded data set in batches, so its scope keeps shrinking. Google, for example, has officially retired its in-house offline processing, represented by MapReduce.
So if you want to learn big data engineering, mastering a real-time streaming framework is essential. The current mainstream frameworks include Apache Samza, Apache Storm, Apache Spark Streaming, and Apache Flink, which has surged over the past year. Apache Kafka has also launched its own streaming framework: Kafka Streams.
Recommendation: learn one of Flink, Spark Streaming, or Kafka Streams.
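As a minimal sketch of the batch-versus-streaming distinction (this is plain Java, not any streaming framework's real API): a streaming job keeps running state and updates its result per event, instead of recomputing over a finite data set.

```java
import java.util.*;
import java.util.function.Consumer;

public class RunningCount {
    private final Map<String, Long> state = new HashMap<>();

    // Called once per incoming event; emits the updated count downstream.
    public void onEvent(String key, Consumer<Map.Entry<String, Long>> downstream) {
        long c = state.merge(key, 1L, Long::sum);
        downstream.accept(Map.entry(key, c));
    }

    public long current(String key) { return state.getOrDefault(key, 0L); }

    public static void main(String[] args) {
        RunningCount job = new RunningCount();
        // An unbounded event source would keep calling onEvent forever;
        // results are emitted incrementally, never "at the end".
        for (String e : new String[]{"click", "view", "click"})
            job.onEvent(e, out -> System.out.println(out.getKey() + " -> " + out.getValue()));
    }
}
```

Real frameworks add what this toy lacks: partitioned, fault-tolerant state, event-time windows, and exactly-once delivery.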
Distributed storage frameworks:
Although MapReduce is somewhat dated, the other cornerstone of Hadoop, HDFS, is still going strong as the most popular distributed storage in the open source community, and it is absolutely worth your time to learn.
Resource Scheduling Framework:
Docker but a full fire last year or two. Companies are in force Docker container-based solutions, the most famous open source container is K8S scheduling framework, but there are equally famous for Hadoop YARN and Apache Mesos. The latter two not only can schedule the container cluster, you can also schedule non-container clusters, very worthy of our study.
Distributed coordination frameworks:
All the major distributed big data frameworks need a set of common facilities, such as service discovery, leader election, distributed locks, and KV storage. These needs spawned the distributed coordination frameworks. The oldest and most famous is undoubtedly Apache ZooKeeper; newer entrants include Consul, etcd, and others. To learn big data engineering, understanding distributed coordination frameworks is, to some extent, part of understanding the field in depth.
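As a loose, in-process sketch (this is not ZooKeeper's or etcd's actual API; the class and key names are invented): coordination services expose an atomic claim/compare-and-set on keys, and leader election reduces to "whoever first claims the leader key wins".

```java
import java.util.concurrent.ConcurrentHashMap;

public class ToyCoordinator {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    // Atomically claim a key; returns true only for the first successful claimant.
    public boolean tryAcquire(String key, String ownerId) {
        return store.putIfAbsent(key, ownerId) == null;
    }

    public String ownerOf(String key) { return store.get(key); }

    public static void main(String[] args) {
        ToyCoordinator zk = new ToyCoordinator();
        System.out.println(zk.tryAcquire("/leader", "node-1")); // node-1 is elected
        System.out.println(zk.tryAcquire("/leader", "node-2")); // too late, key is held
    }
}
```

What real coordination services add is the hard part: the atomic operation is agreed on across machines via a consensus protocol (ZAB in ZooKeeper, Raft in etcd), and held keys are released automatically when the owner's session dies.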
Column-store databases:
I once spent a long time learning Oracle, but I have to admit that relational databases are slowly fading from view — too many alternatives can now replace an RDBMS. For big data storage, people developed columnar storage to address the drawbacks of row orientation, such as being poorly suited to ad-hoc analytical queries; a typical column-store database in the open source community is HBase.
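A toy sketch of the row-store versus column-store layouts (no real database API; the field names are invented): in a column store, one analytic column is a contiguous array, so an aggregate like SUM touches only the data it needs.

```java
import java.util.*;

public class LayoutDemo {
    // Row layout: each record carries every field.
    public static class Row {
        public final String user; public final String city; public final long amount;
        public Row(String user, String city, long amount) {
            this.user = user; this.city = city; this.amount = amount;
        }
    }

    public static long sumRowStore(List<Row> rows) {
        long s = 0;
        for (Row r : rows) s += r.amount;   // must read whole rows to get one field
        return s;
    }

    // Column layout: the "amount" column alone, stored contiguously.
    public static long sumColumnStore(long[] amountColumn) {
        long s = 0;
        for (long a : amountColumn) s += a; // scans only the needed column
        return s;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(new Row("u1", "bj", 10), new Row("u2", "sh", 32));
        System.out.println(sumRowStore(rows));                  // 42
        System.out.println(sumColumnStore(new long[]{10, 32})); // 42
    }
}
```

Both layouts give the same answer; the difference is I/O: for wide tables, scanning one column instead of every row is what makes ad-hoc analytical queries cheap in columnar systems.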
Message queues:
In big data processing, the message queue, as the system's main "load-leveling" component, is indispensable. There are many current solutions, including ActiveMQ and Kafka; in China, Alibaba has also open-sourced RocketMQ. The undisputed leader among them is Apache Kafka. Many of Kafka's design ideas fit the concepts of streaming data processing especially well — not surprising, given that Kafka's original author, Jay Kreps, is now one of the top figures in real-time streaming.
Recommendation: learn Kafka. Not only does it help you land a job (nearly every big data job posting asks for Kafka :-)), it also deepens your understanding, by analogy, of the log-based paradigm for data replication and processing.
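The log-based paradigm mentioned above can be sketched in a few lines (this is a toy, not Kafka's real client API): a topic partition is just an ordered, append-only log, and each consumer tracks its own offset, so the same data can be replayed independently by many readers.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyLog {
    private final List<String> entries = new ArrayList<>();

    // Producer side: append a record, get back its offset in the log.
    public long append(String record) {
        entries.add(record);
        return entries.size() - 1;
    }

    // Consumer side: replay everything from a given offset onward.
    public List<String> readFrom(long offset) {
        return entries.subList((int) offset, entries.size());
    }

    public static void main(String[] args) {
        ToyLog topic = new ToyLog();
        topic.append("order-1");
        topic.append("order-2");
        System.out.println(topic.readFrom(0)); // a new consumer replays everything
        System.out.println(topic.readFrom(1)); // another resumes from offset 1
    }
}
```

Because the broker only appends and consumers only advance offsets, the same log serves as queue, replication stream, and replayable history — which is why the design fits streaming so naturally.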
Fourth, the Big Data Analyst or Data Scientist
For role two, big data analytics, the corresponding job title is big data analyst or data scientist. As a data scientist, you must master the following skills:
Mathematical skills:
Calculus must be solidly mastered. You do not have to master multivariable calculus, but single-variable calculus must be familiar and usable. Beyond that, linear algebra must be mastered, especially matrix arithmetic, vector spaces, and the concept of rank. Many computations in machine learning frameworks rely on matrix multiplication, transposition, or inversion. Although many frameworks provide these operations directly, you should at least understand the principles behind them — for example, how to efficiently determine whether a matrix has an inverse, and how to compute it.
Review the Tongji University edition of "Advanced Mathematics", or take the University of Pennsylvania calculus course on Coursera.
For linear algebra, I recommend Strang's "Introduction to Linear Algebra" — the classic textbook, second to none!
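As a worked instance of the invertibility question above: a square matrix is invertible exactly when its determinant is non-zero, and for the 2x2 case both the test and the inverse have closed forms.

```java
public class Matrix2x2 {
    public static double det(double[][] m) {
        return m[0][0] * m[1][1] - m[0][1] * m[1][0];
    }

    // Returns the inverse, or null when det == 0 (the matrix is singular).
    public static double[][] inverse(double[][] m) {
        double d = det(m);
        if (d == 0.0) return null;
        return new double[][]{
            { m[1][1] / d, -m[0][1] / d},
            {-m[1][0] / d,  m[0][0] / d}
        };
    }

    public static void main(String[] args) {
        // det = 4*6 - 7*2 = 10, so the inverse is (1/10) [[6, -7], [-2, 4]].
        double[][] inv = inverse(new double[][]{{4, 7}, {2, 6}});
        System.out.println(inv[0][0] + " " + inv[0][1]); // 0.6 -0.7
        System.out.println(inv[1][0] + " " + inv[1][1]); // -0.2 0.4
    }
}
```

For larger matrices, libraries decide invertibility the same way in spirit but use LU or QR decomposition rather than an explicit determinant formula.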
Probability and statistics:
You need a basic command of probability and statistical methods: how is a Bayesian probability computed? What is a probability distribution? Full proficiency is not required, but be sure you understand the background and the terminology.
Find a copy of "Probability Theory" and work through it again.
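To show what "computing a Bayesian probability" means, here is a small worked example with made-up illustrative numbers: a test with 99% sensitivity and 95% specificity for a condition with 1% prevalence, using Bayes' rule P(D|+) = P(+|D)P(D) / [P(+|D)P(D) + P(+|~D)P(~D)].

```java
public class BayesDemo {
    public static double posterior(double prior, double sensitivity, double specificity) {
        double pPosGivenD = sensitivity;
        double pPosGivenNotD = 1.0 - specificity;      // the false positive rate
        double pPos = pPosGivenD * prior + pPosGivenNotD * (1.0 - prior);
        return pPosGivenD * prior / pPos;              // P(condition | positive test)
    }

    public static void main(String[] args) {
        // Despite the accurate test, low prevalence keeps the posterior near 17%.
        System.out.println(posterior(0.01, 0.99, 0.95));
    }
}
```

The counterintuitive result (a positive result is still probably a false positive) is exactly the kind of background intuition this section asks for.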
Interactive data analysis frameworks:
This does not refer to SQL or database queries, but to interactive analysis frameworks like Apache Hive or Apache Kylin. The open source community has many such frameworks, which let you apply traditional data analysis methods to big data for analysis or data mining.
I have hands-on experience with Hive and Kylin. Hive — especially Hive 1 — is based on MapReduce, so its performance is not particularly good; Kylin, on the other hand, uses the data cube concept combined with the star schema and can deliver analysis at very low latency. Moreover, Kylin was the first Apache incubator project whose core R&D team is mainly Chinese, so it is drawing increasingly wide attention.
Learn Hive first; when you have time, look into Kylin and the data mining ideas behind it.
Machine learning frameworks:
Machine learning is currently red hot, and everyone pairs machine learning with AI. But I have always thought today's machine learning resembles cloud computing a few years ago: popular, but with few projects actually landed in production; it may take several years to mature.
Still, there is no harm in building up machine learning knowledge now. As for frameworks, there are many familiar names, including TensorFlow, Caffe, Keras, CNTK, Torch7, and so on, with TensorFlow in the lead.
I currently suggest picking one of these frameworks to learn, but in my understanding they mostly package up machine learning algorithms for users to call, so studying them may not teach you much about the underlying algorithms. I therefore recommend studying the principles of the machine learning algorithms themselves, for example:
The best-known introductory machine learning course: Dr. Andrew Ng's Machine Learning.
Fifth, Essential Big Data Skills in Detail
Since Java itself leans toward application development, this detailed skills list is biased toward the big data engineer direction. It has five parts, namely:
Offline computing Hadoop
Stream computing Storm
In-memory computing Spark
Machine learning algorithms
Linux operation basics
Linux common commands - file operations
Linux common commands - user management and permissions
Linux common commands - system management
Linux common commands - passwordless login configuration and network management
Installing common software on Linux
Configuring a local Linux yum repository and installing software with yum
Linux firewall configuration
Linux advanced text processing commands: cut, sed, awk
Linux scheduled tasks: crontab
Shell programming
Shell programming - basic syntax
Shell programming - flow control
Shell programming - functions
Shell programming - comprehensive case: an automated deployment script
In-memory database Redis
Introduction to Redis and NoSQL
Redis client connections
Redis string data structure: operations and application - object caching
Redis list data structure: operations and application - task scheduling queue
Redis hash and set data structures: operations and application case - shopping cart
Redis sortedset data structure: operations and application - leaderboard
Distributed coordination service ZooKeeper
ZooKeeper introduction and application scenarios
ZooKeeper cluster installation and deployment
ZooKeeper data nodes and the command line
ZooKeeper Java client basic operations and event listeners
ZooKeeper core mechanisms and data nodes
ZooKeeper application case - distributed shared resource lock
ZooKeeper application case - dynamic awareness of servers going online and offline
ZooKeeper data consistency principles and the leader election mechanism
Advanced Java features
Java multi-threading basics
The Java synchronized keyword in detail
Thread pools in java.util.concurrent and their use in open source software
Message queues in java.util.concurrent and their use in open source software
Java JMS technology
Java dynamic proxies and reflection
Lightweight RPC framework development
RPC principles
NIO principles
Common Netty APIs
Lightweight RPC framework requirements and principle analysis
Lightweight RPC framework development
Offline computing Hadoop
Hadoop quick start
Hadoop background
Distributed systems overview
Introduction to the offline data analysis process
Cluster setup
Preliminary cluster usage
HDFS advanced
HDFS concepts and features
HDFS shell (command-line client) operations
HDFS working mechanism
NameNode working mechanism
Java API operations
Case 1: developing a shell collection script
MapReduce in detail
Hadoop's custom RPC framework
MapReduce programming specifications and examples
MapReduce running modes and debugging methods
Internal mechanisms of the MapReduce running modes
The MapReduce framework workflow
Custom object serialization methods
MapReduce programming cases
MapReduce advanced
MapReduce sorting
Custom partitioner
The MapReduce combiner
The MapReduce working mechanism in detail
MapReduce in practice
MapTask parallelism mechanism - file splits
Setting MapTask parallelism
Inverted index
Mutual friends
Federation introduction and use
The Hadoop HA mechanism
HA cluster installation and deployment
Cluster operations testing: dynamic Datanode online/offline
Cluster operations testing: Namenode state switch management
Cluster operations testing: block balancing
HDFS API changes under HA
About Hive
Hive architecture
Hive installation and deployment
Initial use of Hive
Hive advanced topics and introduction to Flume
HQL DDL basic syntax
HQL DML basic syntax
Hive joins
Hive parameter configuration
Hive custom functions and Transform
Examples of analyzing HQL execution in Hive
Hive best practices and precautions
Hive optimization strategies
Hive real-world cases
Flume introduction
Flume installation and deployment
Case: collecting a directory to HDFS
Case: collecting files to HDFS
Stream computing Storm
Storm from beginner to master
What is Storm
Storm architecture analysis
Storm programming model, Tuple source code, and parallelism analysis
Storm WordCount case and common API analysis
Storm cluster deployment in practice
Storm + Kafka + Redis business metrics computation
Downloading and compiling the Storm source code
Storm cluster startup and source code analysis
Storm job submission and source code analysis
Storm data sending flow analysis
Storm communication mechanism analysis
Storm message fault tolerance and source code analysis
Storm multi-stream project analysis
Writing your own streaming task execution framework
Storm upstream and downstream integration and architecture
What is a message queue
Kafka core components
Kafka cluster deployment and common commands in practice
Kafka configuration files walkthrough
Kafka Java API
Kafka file storage mechanism analysis
Redis basics and single-node deployment
Redis data structures and typical cases
Flume quick start
Flume + Kafka + Storm + Redis integration
In-memory computing Spark
Scala programming
Introduction to Scala programming
Installing Scala-related software
Scala basic syntax
Scala methods and functions
Characteristics of Scala functional programming
Scala arrays and collections
Scala programming exercise (single-machine WordCount)
Scala object-oriented programming
Scala pattern matching
Introduction to actor programming
Option and partial functions
Hands-on: concurrent WordCount with actors
Currying
Implicit conversions
Akka and RPC
The Akka concurrent programming framework
Hands-on: RPC programming
Spark quick start
Introduction to Spark
Setting up the Spark environment
Introduction to RDDs
RDD transformations and actions
Hands-on: comprehensive RDD exercises
Advanced RDD operators
Custom Partitioner
Hands-on: counting website visits
Broadcast variables
Hands-on: computing home location from IP addresses
Custom sorting
Data import and export with JDBC RDD
WordCount execution flow in detail
RDDs in detail
RDD dependencies
The RDD caching mechanism
The RDD checkpoint mechanism
Spark task execution process analysis
RDD stage division
Spark SQL applications
Spark SQL
Using Spark with Hive
DataFrame
Hands-on: Spark SQL and DataFrame cases
Spark Streaming in practice
Introduction to Spark Streaming
Spark Streaming programming
Hands-on: StatefulWordCount
Flume with Spark Streaming
Kafka with Spark Streaming
Window functions
Introduction to the ELK stack
ElasticSearch installation and usage
Spark core source code analysis
Compiling the Spark source code
Remote debugging of Spark
Spark task submission flow source code analysis
Spark communication flow source code analysis
SparkContext creation process source code analysis
DriverActor and ClientActor communication source code analysis
Worker launching Executor source code analysis
Executor registration with DriverActor source code analysis
Executor registration with Driver source code analysis
DAGScheduler and TaskScheduler source code analysis
Shuffle process source code analysis
Task execution process source code analysis
Machine learning algorithms
Python and the NumPy library
Introduction to machine learning
Machine learning and Python
Python language - quick start
Python language - data types in detail
Python language - flow control statements
Python language - using functions
Python language - modules and packages
Python language - object-oriented programming
Python machine learning library - NumPy
Essential math for machine learning - probability theory
Implementing common algorithms
kNN classification - algorithm principles
kNN classification - code implementation
kNN classification - handwritten digit recognition case
Linear regression classification - algorithm principles
Linear regression classification - implementation and demo
Naive Bayes classification - algorithm principles
Naive Bayes classification - implementation
Naive Bayes classification - spam filtering case
k-means clustering - algorithm principles
k-means clustering - implementation
k-means clustering - geographic location clustering application
Decision tree classification - algorithm principles
Decision tree classification - implementation