2019 Latest Hadoop Big Data Development Learning Roadmap

Hadoop has grown into a very rich family of products that can meet the big data processing needs of many different scenarios. As the current mainstream big data processing technology, it underpins the big data services of many companies on the market, and for many common scenarios there are already very mature solutions.

For a developer, mastering Hadoop and the technologies within its ecosystem is the only way into the field of big data.

 

The following is a detailed roadmap for learning Hadoop technology development.
Hadoop itself is written in Java, so its support for Java is very good, but it can also be used from other languages.

The data mining part below is the roadmap's direction of emphasis; because of its high development efficiency, we use Python to carry out those tasks.

Because Hadoop runs on Linux systems, you also need Linux knowledge.


The first stage: Hadoop ecosystem technologies
Language Fundamentals

Java: master Java SE; for JVM memory management, as well as multithreading, thread pools, design patterns, and parallelization, understanding and some practice are enough, and deep mastery is not required.

Linux: system installation (both the command-line and graphical interfaces), basic commands, network configuration, the Vim editor, process management, shell scripts, and familiarity with virtual machines.

Python: fundamentals such as basic syntax, data structures, functions, conditionals, and loops.
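
As a quick illustration of those Python fundamentals, here is a minimal, self-contained sketch (all names are illustrative):

```python
# Minimal Python fundamentals: a dict, a function, a conditional, and loops.
word_counts = {}  # data structure: word -> count

def tokenize(line):
    """Split a line of text into lowercase words."""
    return line.lower().split()

lines = ["Hadoop is a framework", "Spark is a computing engine"]
for line in lines:                 # loop over the input
    for word in tokenize(line):    # function call
        if word in word_counts:    # conditional
            word_counts[word] += 1
        else:
            word_counts[word] = 1

print(word_counts)
```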

 

Preparing the Environment

Here we build a fully distributed cluster under Windows, with one master node and two slave nodes.

Prepare VMware virtual machines, a Linux system (CentOS 6.5), and the Hadoop installation package, and use them to set up a fully distributed Hadoop cluster environment.
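
A first configuration step on such a cluster is pointing every node at the master's NameNode in core-site.xml. A minimal sketch, assuming the master's hostname is `master` (adjust to your own /etc/hosts entries):

```xml
<!-- core-site.xml, the same on every node of the cluster -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```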

 

MapReduce

MapReduce is Hadoop's core programming model: a distributed computing framework for offline processing. It is mainly aimed at high-volume cluster jobs; because jobs execute as batches, latency is high.
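
Because this roadmap uses Python, the model is easiest to see through Hadoop Streaming, which lets any executable serve as mapper and reducer. A minimal word-count sketch (file names are illustrative):

```python
#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")
```

```python
#!/usr/bin/env python
# reducer.py: sum the counts per word. Hadoop Streaming sorts mapper
# output by key, so all lines for one word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))
```

These would be submitted with the Hadoop Streaming jar, e.g. `hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /input -output /output`; the exact jar path varies by Hadoop version.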

 

HDFS 1.0 / 2.0

The Hadoop Distributed File System (HDFS) is a highly fault-tolerant system suitable for deployment on low-cost machines. HDFS provides high-throughput data access, making it ideal for applications with large data sets.
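
From Python, one way to talk to HDFS is the third-party `hdfs` package, a client for the NameNode's WebHDFS interface. A minimal sketch, assuming a NameNode web UI at master:50070 and a user named hadoop:

```python
# pip install hdfs -- a WebHDFS client; URL, user, and paths are assumptions.
from hdfs import InsecureClient

client = InsecureClient("http://master:50070", user="hadoop")

client.makedirs("/tmp/demo")                        # like `hdfs dfs -mkdir -p`
with client.write("/tmp/demo/hello.txt", overwrite=True) as writer:
    writer.write(b"hello hdfs\n")                   # like `hdfs dfs -put`

print(client.list("/tmp/demo"))                     # like `hdfs dfs -ls`
with client.read("/tmp/demo/hello.txt") as reader:
    print(reader.read())                            # like `hdfs dfs -cat`
```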

 

YARN (Hadoop 2.0)

To understand it simply, YARN is a resource management platform responsible for allocating resources to tasks. It is a general-purpose resource scheduling platform: any framework that meets its requirements can use YARN for resource scheduling.
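
A quick way to see YARN in that role is to ask the ResourceManager what it is currently scheduling. A small sketch using the standard `yarn` CLI from the Hadoop distribution (assumes $HADOOP_HOME/bin is on the PATH):

```python
# List the applications the YARN ResourceManager is currently tracking.
import subprocess

result = subprocess.run(
    ["yarn", "application", "-list"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```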

 

Hive

Hive is a data warehouse whose data is all stored on HDFS. Using Hive mainly means writing HQL, which is very similar to the SQL of a MySQL database. In fact, Hive translates HQL into MapReduce programs, so at the bottom it is still MapReduce that executes.
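
A minimal sketch of writing HQL from Python through HiveServer2, using the third-party PyHive package; the host, port, and table are assumptions:

```python
# pip install pyhive -- connects to HiveServer2 (default port 10000).
from pyhive import hive

conn = hive.connect(host="master", port=10000, username="hadoop")
cursor = conn.cursor()

# HQL reads much like MySQL's SQL, but Hive compiles it to MapReduce jobs.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS logs (ip STRING, url STRING, ts BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
""")
cursor.execute("SELECT url, COUNT(*) AS hits FROM logs GROUP BY url")
for url, hits in cursor.fetchall():
    print(url, hits)
```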

 

Spark

Spark is a fast, general-purpose computing engine designed for large-scale data processing, built on memory-based iterative computation. It retains the advantages of MapReduce while greatly improving timeliness.
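
A minimal PySpark sketch of that in-memory style of computation, with inline data for brevity:

```python
# pip install pyspark -- word count on an RDD, computed in memory.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["hadoop spark", "spark streaming", "hadoop hive"])
counts = (rdd.flatMap(lambda line: line.split())   # line -> words
             .map(lambda word: (word, 1))          # word -> (word, 1)
             .reduceByKey(lambda a, b: a + b))     # sum counts per word
print(counts.collect())

spark.stop()
```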

 

Spark Streaming

Spark Streaming is a near-real-time processing framework in which data is processed batch by batch, as micro-batches.
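
A minimal sketch of that batch-by-batch model: read lines from a TCP socket and count words in 5-second micro-batches. The host and port are assumptions; `nc -lk 9999` can serve as a test source:

```python
# Word count over a socket stream, one micro-batch every 5 seconds.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")
ssc = StreamingContext(sc, 5)  # batch interval: 5 seconds

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's counts

ssc.start()
ssc.awaitTermination()
```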

 

Hive on Spark

Fast SQL retrieval based on Spark. With Spark as Hive's computation engine, Hive queries are submitted as Spark jobs and computed on the Spark cluster, which can improve the performance of Hive queries.
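
One common way to get Hive tables computed by Spark is a SparkSession with Hive support enabled, which reads the Hive metastore and runs the SQL as Spark jobs; a minimal sketch, reusing the assumed `logs` table from the Hive section. (Hive itself can also be switched to a Spark engine with `SET hive.execution.engine=spark;`.)

```python
# Query a Hive table, but let Spark plan and execute the SQL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-on-spark-demo")
         .enableHiveSupport()   # use Hive's metastore and HQL semantics
         .getOrCreate())

df = spark.sql("SELECT url, COUNT(*) AS hits FROM logs GROUP BY url")
df.show()

spark.stop()
```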

 

Storm

Storm is a real-time computing framework. The difference from MapReduce is that MapReduce processes accumulated masses of data offline, whereas Storm processes each piece of data as it arrives, one record at a time, which guarantees the timeliness of data processing.

 

ZooKeeper

ZooKeeper is the foundation of many big data frameworks; it is the manager of the cluster. It monitors the state of every node in the cluster and performs the next reasonable operation based on the feedback the nodes submit.

In the end, it provides users with an easy-to-use interface and a system that performs efficiently and functions stably.
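
A minimal sketch of that manager role from Python, using the third-party kazoo package; the host and paths are assumptions. The ephemeral node vanishes automatically if the process dies, which is how ZooKeeper exposes node liveness:

```python
# pip install kazoo -- register this process as a live worker node.
from kazoo.client import KazooClient

zk = KazooClient(hosts="master:2181")
zk.start()

zk.ensure_path("/workers")
# Ephemeral + sequential: a unique znode that disappears with the process.
zk.create("/workers/worker-", b"alive", ephemeral=True, sequence=True)
print(zk.get_children("/workers"))  # every currently live worker

zk.stop()
```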

 

HBase

HBase is a NoSQL database of the key-value type: a highly reliable, column-oriented, scalable, distributed database.

It is suitable for storing unstructured data, and its underlying data is stored on HDFS.
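
A minimal key-value sketch from Python using the third-party happybase package, which talks to HBase's Thrift server; it assumes the Thrift server is running and that a table `users` with column family `info` already exists:

```python
# pip install happybase -- put and get one row by key.
import happybase

connection = happybase.Connection("master")  # Thrift server, default port 9090
table = connection.table("users")

# Rows are keyed bytes; columns live inside a column family ("info").
table.put(b"user-001", {b"info:name": b"alice", b"info:age": b"30"})
row = table.row(b"user-001")
print(row[b"info:name"])

connection.close()
```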

 

Kafka

Kafka is message middleware; in real-time processing scenarios it often serves as an intermediate buffer layer.
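
A minimal sketch of that buffering role with the third-party kafka-python package; the broker address and topic are assumptions:

```python
# pip install kafka-python -- write to and read from one topic.
from kafka import KafkaProducer, KafkaConsumer

# Producer side: an upstream collector (e.g. Flume) or app publishes events.
producer = KafkaProducer(bootstrap_servers="master:9092")
producer.send("app-logs", b"GET /index.html 200")
producer.flush()

# Consumer side: a real-time job (e.g. Spark Streaming) drains the buffer.
consumer = KafkaConsumer("app-logs",
                         bootstrap_servers="master:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)  # stop after 5s idle
for message in consumer:
    print(message.value)
```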

 

Flume

Flume is a log collection tool; a common use is collecting the data in the log files an application produces. In general there are two workflows.

In one, the data Flume collects is stored in Kafka, which makes real-time processing with Storm or Spark Streaming convenient.

In the other, the data Flume collects is stored on HDFS, for later offline processing with Hadoop or Spark.
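
A hedged sketch of the first workflow as a Flume agent configuration, tailing an application log into Kafka; the agent/component names, file path, topic, and broker are all illustrative:

```properties
# flume-agent.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow an application log file as it grows.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory

# Sink: publish each log line to a Kafka topic for real-time consumers.
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = app-logs
a1.sinks.k1.kafka.bootstrap.servers = master:9092
a1.sinks.k1.channel = c1
```

For the second workflow, the sink would instead be of type `hdfs`, with an `hdfs.path` pointing at the target directory.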

 

The second stage: data mining algorithms
Chinese word segmentation

Offline and online use of open-source segmentation libraries

Natural Language Processing

Text relevancy algorithms

Recommendation algorithm

Content-based (CB) and collaborative filtering (CF) methods, normalization, and Mahout applications.

Classification algorithms

NB, SVM

Regression algorithm

LR, Decision Tree

Clustering Algorithm

Hierarchical clustering, k-means (see the sketch after this list)

Neural networks and deep learning

NN, TensorFlow
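
As one concrete instance of the algorithms above, here is a minimal k-means clustering sketch with scikit-learn; the data is synthetic:

```python
# pip install scikit-learn numpy -- cluster six 2-D points into two groups.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],    # group near (1, 1)
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])   # group near (8, 8)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # the two learned centroids
```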

 

The above is the detailed learning route for Hadoop development; for reasons of space, it only lists the frameworks and explains their roles.

After completing the first stage, you can already take on big data architecture related work, or be responsible for certain development and maintenance tasks in an enterprise.

After completing the second stage, you can take on data mining related work, which is currently the most valuable kind of role for entering the big data industry.

 


Origin: www.cnblogs.com/wuxiaoxia888/p/10955170.html