Big Data Development: A Basic Learning Outline

First, Foreword

What is the hottest thing in the IT industry right now? ABC, without question. The so-called ABC is AI + Big Data + Cloud, that is, artificial intelligence, big data, and cloud computing (cloud platforms). Each of these areas already has industry leaders pushing it forward; today we will talk about one of them, the Big Data direction.

 

Second, Roles in Big Data

 

Role One: Big Data Engineering

Big data engineering addresses the definition, collection, computation, and storage of data. When designing and deploying such a system, a big data engineer's primary concern is data availability, namely that the engineered system must provide timely (even real-time) data services to downstream business or analysis systems.

Role Two: Big Data Analytics

The big data analytics role focuses on how to use the data: after receiving data from the big data engineering system, how to produce analysis output that genuinely helps the company improve its business or raise its service level. For a big data analyst, the primary problem is therefore discovering and exploiting the value of the data, which concretely may include trend analysis, modeling, and predictive analysis.

These two roles are interdependent yet operate independently. What does that mean? Without big data engineering, big data analysis has nothing to work with; and without big data analysis, I cannot think of a reason for big data engineering to exist. It is a bit like love and marriage: love with marriage as its aim makes sense, while love with no intention of marriage is just fooling around.

To summarize briefly: the big data engineering role handles data collection, computation (or processing), and storage; the big data analysis role performs higher-level computation on that data.


Third, the Big Data Engineer

For role one, big data engineering, the corresponding job title is Big Data Engineer. As a big data engineer, you should have at least the following skills:

Linux fundamentals:

Most big data systems are open-source software, and that software runs on open-source Linux, so you must master basic Linux operations such as user management, permissions, and shell programming.

A JVM-based language:

JVM-based languages occupy a very large share of the current big data ecosystem; calling it a near monopoly would not be an exaggeration. Here I recommend learning Java or Scala; languages like Clojure are harder to get started with, so I do not really recommend them. In addition, this is an era in which a hot big data framework carries its implementation language along with it, for example Go with Docker and Scala with Kafka.

So we recommend that you become proficient in at least one JVM-family language. It is worth stressing that you must understand that language's multi-threading model and memory model: the processing model of many big data frameworks is essentially the same as the language-level multi-threaded processing model, only extended by the framework from one machine to a distributed cluster.

Recommendation: learn Java or Scala.
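
To make the point about threading models concrete, here is a minimal single-machine sketch (with made-up input lines) that splits a word count across a fixed thread pool and then merges the partial results; big data frameworks apply essentially the same split-then-merge pattern, only distributed across machines.

    import java.util.concurrent.Executors
    import scala.concurrent.duration._
    import scala.concurrent.{Await, ExecutionContext, Future}

    object LocalWordCount {
      def main(args: Array[String]): Unit = {
        // Hypothetical input; in a real job these lines would come from HDFS or Kafka.
        val lines = Seq("to be or not to be", "big data is big")

        val pool = Executors.newFixedThreadPool(4)
        implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

        // "Map" side: each line is counted on its own thread.
        val partials = lines.map { line =>
          Future(line.split(" ").groupBy(identity).map { case (w, ws) => w -> ws.length })
        }

        // "Reduce" side: merge the per-thread partial counts.
        val merged = Await.result(Future.sequence(partials), 10.seconds)
          .flatten
          .groupBy { case (word, _) => word }
          .map { case (word, counts) => word -> counts.map(_._2).sum }

        println(merged)
        pool.shutdown()
      }
    }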

Computation and processing frameworks:

Strictly speaking, these divide into offline batch processing and stream processing. Stream processing is the future trend, and I suggest you make a point of learning it; offline batch processing is aging quickly because its batch mindset cannot handle unbounded data sets, so its scope keeps shrinking. Google, in fact, has officially retired its in-house offline processing represented by MapReduce.

So if you want to work in big data engineering, it is necessary to master a real-time stream processing framework. The current mainstream frameworks include Apache Samza, Apache Storm, Apache Spark Streaming, and Apache Flink, which has been all the rage over the past year. Apache Kafka has also launched its own streaming library: Kafka Streams.

Recommendation: learn one of Flink, Spark Streaming, or Kafka Streams.
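
As a minimal sketch of what stream processing code looks like, here is a Spark Streaming word count over a local socket source (the host and port are hypothetical, e.g. fed by nc -lk 9999); Flink and Kafka Streams express the same idea with their own APIs.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
        // Micro-batches of 5 seconds.
        val ssc = new StreamingContext(conf, Seconds(5))

        // Hypothetical text source: a socket on localhost:9999.
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print() // print each batch's word counts

        ssc.start()
        ssc.awaitTermination()
      }
    }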

Distributed Storage Framework:

Although MapReduce is somewhat dated, the other cornerstone of Hadoop, HDFS, remains strong and is still the most popular distributed storage system in the open-source community; it is absolutely worth your time to learn.
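
A minimal sketch of talking to HDFS from a JVM language via the Hadoop FileSystem API; the NameNode address and paths below are hypothetical.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsDemo {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        conf.set("fs.defaultFS", "hdfs://namenode:9000") // hypothetical NameNode address

        val fs = FileSystem.get(conf)
        // Upload a local file into HDFS, then list the target directory.
        fs.copyFromLocalFile(new Path("/tmp/access.log"), new Path("/data/access.log"))
        fs.listStatus(new Path("/data")).foreach(status => println(status.getPath))
        fs.close()
      }
    }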

Resource Scheduling Framework:

Docker has been on fire for the last year or two, and companies are pushing hard on Docker-based container solutions. The most famous open-source container scheduling framework is Kubernetes (K8S), but Hadoop YARN and Apache Mesos are equally well known. The latter two can schedule not only container clusters but also non-container clusters, and are very much worth studying.

Distributed coordination framework:

All mainstream distributed big data frameworks need to implement some common functionality, such as service discovery, leader election, distributed locks, and KV storage. These needs gave rise to distributed coordination frameworks. The oldest and most famous is undoubtedly Apache ZooKeeper; newer entrants include Consul, etcd, and others. To learn big data engineering, you cannot avoid distributed coordination frameworks, and to some extent you should understand them in depth.
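
As a minimal sketch of one of these features, a distributed lock, the snippet below uses Apache Curator (a ZooKeeper client library) on top of ZooKeeper; the connection string and lock path are hypothetical.

    import org.apache.curator.framework.CuratorFrameworkFactory
    import org.apache.curator.framework.recipes.locks.InterProcessMutex
    import org.apache.curator.retry.ExponentialBackoffRetry

    object ZkLockDemo {
      def main(args: Array[String]): Unit = {
        // Hypothetical ZooKeeper ensemble address.
        val client = CuratorFrameworkFactory.newClient("zk1:2181,zk2:2181,zk3:2181",
          new ExponentialBackoffRetry(1000, 3))
        client.start()

        // All processes that use the same znode path compete for the same lock.
        val lock = new InterProcessMutex(client, "/locks/daily-report")
        lock.acquire()
        try {
          println("lock held: only one process runs this section at a time")
        } finally {
          lock.release()
        }
        client.close()
      }
    }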

Column-store database:

Many of us once spent a long time learning Oracle, but we have to admit that relational databases are slowly fading from view: there are too many alternatives to the RDBMS now. For big data storage, columnar storage was developed to address the weakness of row-oriented storage for ad-hoc analytical queries; the classic open-source column-store database is HBase.
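
A minimal sketch of writing and reading a single cell through the HBase client API; the ZooKeeper quorum, table, column family, and row key below are hypothetical, and the table is assumed to already exist.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object HBaseDemo {
      def main(args: Array[String]): Unit = {
        val conf = HBaseConfiguration.create()
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3") // hypothetical ZooKeeper quorum

        val connection = ConnectionFactory.createConnection(conf)
        val table = connection.getTable(TableName.valueOf("user_events")) // hypothetical table

        // Write one cell: row key "user_1001", column family "info", qualifier "city".
        val put = new Put(Bytes.toBytes("user_1001"))
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Beijing"))
        table.put(put)

        // Read it back.
        val result = table.get(new Get(Bytes.toBytes("user_1001")))
        println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))))

        table.close(); connection.close()
      }
    }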

Message queues:

In big data processing, the message queue is essential as the main tool for "load leveling" (absorbing traffic peaks). There are many solutions in this area today, including ActiveMQ, Kafka, and so on; in China, Alibaba has also open-sourced RocketMQ. The undisputed leader among them is Apache Kafka. Many of Kafka's design ideas fit especially well with the idea of distributing and processing streaming data. No wonder: Kafka's original author, Jay Kreps, is today one of the leading figures in real-time stream processing.

Recommendation: learn Kafka. Not only does it help you land a job (almost all big data job postings ask for Kafka on the resume :-)), it also helps you understand, by analogy, the log-based paradigm for data replication and processing.
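
A minimal sketch of publishing messages with Kafka's standard producer client; the broker addresses and topic name are hypothetical.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object KafkaProducerDemo {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092,broker2:9092") // hypothetical brokers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        // Send a few messages to a hypothetical topic; the key determines the partition.
        for (i <- 1 to 3) {
          producer.send(new ProducerRecord[String, String]("page-views", s"user-$i", s"event-$i"))
        }
        producer.flush()
        producer.close()
      }
    }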

 

Fourth, The Big Data Analyst or Data Scientist

For role two, big data analytics, the corresponding job title is big data analyst or data scientist. As a data scientist, you must master the following skills:

Mathematical skills:

Calculus must be mastered. You do not have to master multivariable calculus, but you must be familiar with single-variable calculus and able to use it. Furthermore, linear algebra must be at your fingertips, particularly matrix operations, vector spaces, and the concept of rank. Many computations in machine learning frameworks require matrix multiplication, transposition, or inversion. Although many frameworks provide these tools directly, you should at least understand the underlying principles, for example how to efficiently determine whether a matrix has an inverse and how to compute it.

Review the Tongji University edition of "Advanced Mathematics", or take the University of Pennsylvania calculus course on Coursera.

For linear algebra, Gilbert Strang's "Introduction to Linear Algebra" is recommended - the classic textbook, second to none!
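
As a small worked example of the point about inverses: a square matrix is invertible exactly when its determinant is nonzero, and in the 2x2 case the inverse has a closed form:

$$A=\begin{pmatrix}a & b\\ c & d\end{pmatrix},\qquad \det(A)=ad-bc,\qquad A^{-1}=\frac{1}{ad-bc}\begin{pmatrix}d & -b\\ -c & a\end{pmatrix}\quad\text{when } ad-bc\neq 0.$$

For example, with a = 2, b = 1, c = 1, d = 1 we get det(A) = 2*1 - 1*1 = 1, so the inverse exists. Larger matrices are checked and inverted numerically (for example via LU decomposition) rather than with a closed formula.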

Mathematical Statistics:

You need a basic command of probability and statistical methods: how is a Bayesian (conditional) probability computed? What is a probability distribution? Proficiency is not required, but be sure you understand the background and the terminology.

Find a copy of "Probability Theory" and work through it again.
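
For instance, a Bayesian (conditional) probability is computed with Bayes' theorem. With made-up numbers for a spam filter - say 20% of mail is spam, a given word appears in 50% of spam and in 5% of normal mail - the probability that a message containing the word is spam is

$$P(\text{spam}\mid\text{word})=\frac{P(\text{word}\mid\text{spam})\,P(\text{spam})}{P(\text{word})}=\frac{0.5\times 0.2}{0.5\times 0.2+0.05\times 0.8}=\frac{0.10}{0.14}\approx 0.71.$$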

Interactive Data Analysis Framework:

This does not refer to SQL or database queries, but to interactive analysis frameworks such as Apache Hive or Apache Kylin. The open-source community has many similar frameworks that let you apply traditional data analysis and data mining methods to big data.

I have hands-on experience with Hive and Kylin. Hive, especially Hive 1.x, is based on MapReduce, so its performance is not particularly good; Kylin, by contrast, uses the data cube concept combined with a star schema, and can bring analysis latency down very low. Kylin is also the first Apache incubator project whose core development team is mostly Chinese, so it has been attracting more and more attention.

Learn Hive first; when you have time, look into Kylin and the data mining ideas behind it.
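
As a minimal sketch, assuming a reachable HiveServer2 instance (the address, credentials, and table below are hypothetical), a Hive query can be issued from a JVM language through the standard Hive JDBC driver:

    import java.sql.DriverManager

    object HiveQuery {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        // Hypothetical HiveServer2 address and database; replace with your own.
        val conn = DriverManager.getConnection("jdbc:hive2://hive-server:10000/default", "hive", "")
        val stmt = conn.createStatement()
        // A typical aggregation over a (hypothetical) page_views table.
        val rs = stmt.executeQuery(
          "SELECT city, COUNT(*) AS pv FROM page_views GROUP BY city ORDER BY pv DESC LIMIT 10")
        while (rs.next()) {
          println(s"${rs.getString("city")}\t${rs.getLong("pv")}")
        }
        rs.close(); stmt.close(); conn.close()
      }
    }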

Machine learning framework:

Machine learning is genuinely hot right now, and everyone lumps machine learning and AI together, but I have always felt that machine learning today is like cloud computing a few years ago: popular, yet with few projects actually landed in production, and it may take a few more years to mature.

Still, starting to build up machine learning knowledge now can do no harm. As for machine learning frameworks, there are many familiar ones that will come in handy, including TensorFlow, Caffe, Keras, CNTK, Torch7, and others, with TensorFlow in the lead.

My current suggestion is to pick one of these frameworks and learn it, but in my understanding most of them simply package up various machine learning algorithms for users, so you may not actually learn much about the underlying algorithms. I therefore recommend studying the principles of the machine learning algorithms themselves, for example with:

The best introductory machine learning course: Andrew Ng's Machine Learning.

 

Fifth, Required Big Data Skills in Detail

Because my own background is on the Java application side, this detailed list of required big data skills is biased toward the big data engineer direction. It covers the foundations (Linux, shell programming, Redis, ZooKeeper, and advanced Java) and then four major parts, namely:

Offline computing: Hadoop

Stream computing: Storm

In-memory computing: Spark

Machine learning algorithms

 

Linux operation basics

Linux common commands - file operations

Linux common commands - user management and permissions

Linux common commands - system management

Linux common commands - passwordless (SSH) login configuration and network management

Installing commonly used software on Linux

Linux yum configuration (local yum repository) and software installation

Linux firewall configuration

Linux advanced text-processing commands: cut, sed, awk

Linux scheduled tasks: crontab

 

shell programming

shell programming - basic syntax

shell programming - flow control

shell programming - functions

shell programming - comprehensive case: an automated deployment script

 

In-memory database: Redis

Introduction to Redis and NoSQL

Redis client connections

Redis string data type: operations and application case - object cache

Redis list data type: operations and application case - a task scheduling queue (see the sketch after this list)

Redis hash and set data types: operations and application case - a shopping cart

Redis sorted set data type: operations and application case - a leaderboard
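
As a minimal sketch of the list-as-task-queue idea, assuming a local Redis instance and the Jedis client; the queue name and task payloads are hypothetical.

    import redis.clients.jedis.Jedis

    object RedisTaskQueue {
      def main(args: Array[String]): Unit = {
        val jedis = new Jedis("localhost", 6379) // assumed local Redis instance

        // Producer side: push tasks onto the left of the list.
        jedis.lpush("task:queue", "resize-image:1001", "resize-image:1002")

        // Consumer side: pop tasks from the right, so the list behaves as a FIFO queue.
        var task = jedis.rpop("task:queue")
        while (task != null) {
          println(s"processing $task")
          task = jedis.rpop("task:queue")
        }
        jedis.close()
      }
    }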

 

Distributed coordination service: ZooKeeper

ZooKeeper introduction and application scenarios

ZooKeeper cluster installation and deployment

ZooKeeper data nodes (znodes) and the command-line client

ZooKeeper Java client: basic operations and event listeners (watchers)

ZooKeeper core mechanisms and data node types

ZooKeeper application: a distributed lock on a shared resource

ZooKeeper application: dynamic awareness of servers going online and offline

ZooKeeper data consistency principles and the leader election mechanism

 

Advanced Java features

Java multi-threading basics

The Java synchronized keyword in detail

Thread pools in the Java concurrency package (java.util.concurrent) and their use in open-source software

Queues in the Java concurrency package and their use in open-source software

Java JMS technology

Java reflection and dynamic proxies

Lightweight RPC framework development

 

RPC principles

NIO principles

Netty common APIs

Lightweight RPC framework: requirements analysis and principle analysis

Lightweight RPC framework: development

 

Offline computing: Hadoop

Hadoop quick start

Hadoop background

Distributed systems overview

Introduction to the offline data analysis workflow

 

Cluster Setup

First steps with the cluster

HDFS advanced

Concepts and features of HDFS

HDFS shell (command-line client) operations

HDFS working mechanism

NameNode working mechanism

Java API operations

Case 1: developing a shell-based data collection script

 

MapReduce in detail

Hadoop's custom RPC framework

MapReduce programming conventions and examples

MapReduce run modes and debugging methods

Internal mechanism of how a MapReduce program runs

Workflow of the MapReduce execution framework

Custom serialization for user-defined objects

MapReduce programming cases

 

MapReduce advanced

Sorting in MapReduce

Custom partitioner

The MapReduce combiner

MapReduce working mechanism in detail

 

MapReduce in practice

Map task parallelism mechanism - file splits

Setting map task parallelism

Inverted index

Mutual friends

 

Federation and Hive: introduction and use

Hadoop's HA mechanism

HA cluster installation and deployment

Cluster operations test: dynamically taking DataNodes online and offline

Cluster operations test: NameNode state switching management

Cluster operations test: block balancing

HDFS API changes under HA

Introduction to Hive

Hive architecture

Hive installation and deployment

First steps with Hive

 

Hive advanced topics and an introduction to Flume

HQL DDL basic syntax

HQL DML basic syntax

Joins in Hive

Hive parameter configuration

Hive custom functions and Transform

Example analysis of HQL execution in Hive

Hive best practices and caveats

Hive optimization strategies

A real Hive case

Flume introduction

Flume installation and deployment

Case: collecting a directory into HDFS

Case: collecting files into HDFS

 

Stream computing: Storm

Storm from beginner to advanced

What is Storm

Storm architecture analysis

Storm programming model, Tuple source code, and parallelism analysis

Storm WordCount case and common API analysis

Storm cluster deployment in practice

Storm + Kafka + Redis business metric computation

Downloading and compiling the Storm source code

Storm cluster startup and source code analysis

Storm job submission and source code analysis

Storm data-sending flow analysis

Storm communication mechanism analysis

Storm message fault tolerance and source code analysis

Storm multi-stream project analysis

Writing your own streaming task execution framework

 

Storm upstream and downstream systems: integration and architecture

What is a message queue

Kafka core components

Kafka cluster deployment and common commands in practice

Kafka configuration files walkthrough

Kafka Java API

Kafka file storage mechanism analysis

Redis basics and single-node deployment

Redis data structures and typical cases

Flume quick start

Flume + Kafka + Storm + Redis integration

 

In-memory computing: Spark

 

Scala programming

Introduction to Scala programming

Installing Scala-related software

Scala basic syntax

Scala methods and functions

Characteristics of functional programming in Scala

Scala arrays and collections

Scala programming exercise (single-machine WordCount)

Object-oriented programming in Scala

Scala pattern matching

Introduction to actor programming

Option and partial functions

Hands-on: concurrent WordCount with actors

Currying

Implicit conversions

 

Akka and RPC

The Akka concurrent programming framework

Hands-on: RPC programming

Spark quick start

Introduction to Spark

Setting up the Spark environment

Introduction to RDDs

RDD transformations and actions

Hands-on: comprehensive RDD exercises

Advanced RDD operators

Custom Partitioner

Hands-on: counting website visits (see the sketch after this list)

Broadcast variables

Hands-on: resolving geographic locations from IP addresses

Custom sorting

Importing and exporting data with the JDBC RDD

WordCount execution flow in detail
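
As a minimal sketch of the visit-counting exercise above, here is an RDD job that counts visits per URL; the log path, log format, and local master are hypothetical assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    object SiteVisitCount {
      def main(args: Array[String]): Unit = {
        // local[*] is used here for illustration; on a cluster you would use spark-submit.
        val sc = new SparkContext(new SparkConf().setAppName("SiteVisitCount").setMaster("local[*]"))

        // Hypothetical access log where the second tab-separated field is the URL.
        val visits = sc.textFile("hdfs://namenode:9000/logs/access.log")
          .map(_.split("\t"))
          .filter(_.length >= 2)
          .map(fields => (fields(1), 1))
          .reduceByKey(_ + _)               // count visits per URL
          .sortBy(_._2, ascending = false)

        visits.take(10).foreach(println)    // top 10 most visited URLs
        sc.stop()
      }
    }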

 

RDDs in detail

RDD dependencies (lineage)

RDD caching mechanism

RDD checkpoint mechanism

Analysis of the Spark task execution process

RDD stage division

Spark SQL applications

Spark SQL

Spark with Hive

DataFrame

Hands-on: Spark SQL and DataFrame cases

 

Spark Streaming applications in practice

Introduction to Spark Streaming

Spark Streaming programming

Hands-on: stateful WordCount

Flume with Spark Streaming

Kafka with Spark Streaming

Window functions

Introduction to the ELK stack

Elasticsearch installation and usage

Storm architecture analysis

Storm programming model, Tuple source code, and parallelism analysis

Storm WordCount case and common API analysis

 

Spark core source code analysis

Compiling the Spark source code

Remote debugging of Spark

Source code analysis of the Spark task submission flow

Source code analysis of the Spark communication flow

Source code analysis of the SparkContext creation process

Source code analysis of DriverActor and ClientActor communication

Source code analysis of the Worker launching Executors

Source code analysis of Executor registration with the DriverActor

Source code analysis of Executor registration with the Driver

DAGScheduler and TaskScheduler source code analysis

Shuffle process source code analysis

Task execution process source code analysis

 

Machine learning algorithms

 

Python and the NumPy library

Introduction to machine learning

Machine learning and Python

The Python language - quick start

The Python language - data types in detail

The Python language - flow control statements

The Python language - using functions

The Python language - modules and packages

The Python language - object-oriented programming

Python machine learning library - NumPy

Essential math for machine learning - probability theory

 

Implementations of common algorithms

kNN classification - algorithm principles

kNN classification - code implementation

kNN classification - handwritten digit recognition case

Linear regression - algorithm principles

Linear regression - implementation and demo

Naive Bayes classification - algorithm principles

Naive Bayes classification - implementation

Naive Bayes classification - spam detection case

k-means clustering - algorithm principles

k-means clustering - implementation

k-means clustering - geographic location clustering application

Decision tree classification - algorithm principles

Decision tree classification - implementation


Source: blog.csdn.net/fdfsdrjku/article/details/92433366