Hadoop has by now grown into a rich family of products covering a wide range of big-data processing scenarios. As the current mainstream big-data technology, it underpins the data services of many companies on the market, and mature solutions exist for many common use cases.
Mastering development with Hadoop and the frameworks in its ecosystem is the standard way into the big-data field.
The following is a detailed roadmap for learning Hadoop development.
Hadoop itself is written in Java, so its Java support is excellent, but other languages can be used as well.
The later part of this roadmap focuses on data mining; for those tasks we use Python because of its high development efficiency.
And because Hadoop runs on Linux, you will also need basic Linux knowledge.
The first stage: Hadoop ecosystem technologies
Language Infrastructure
Java: master Java SE, and have a working understanding of JVM memory management, multithreading, thread pools, design patterns, and parallelism; deep expertise is not required.
Linux: system installation (command-line and graphical), basic commands, network configuration, the Vim editor, process management, shell scripting, and virtual machines.
Python: fundamentals — basic syntax, data structures, functions, conditionals, and loops.
Preparing the Environment
Here we build a fully distributed cluster on a Windows machine, with one master and two slave nodes.
Using VMware virtual machines, a Linux system (CentOS 6.5), and the Hadoop installation package, prepare a fully distributed Hadoop cluster environment.
MapReduce
MapReduce is Hadoop's core programming model, an offline distributed computing framework. It is mainly used for large batch jobs on the cluster; because jobs execute in batch mode, latency is high.
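The programming model can be illustrated with a word count in plain Python — a sketch of the map, shuffle, and reduce phases only, not real Hadoop code:

```python
from collections import defaultdict

# Map phase: turn each input line into (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle phase: group intermediate pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts for each word.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop mapreduce hadoop", "mapreduce model"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'mapreduce': 2, 'model': 1}
```

In real Hadoop, the map and reduce functions run in parallel across the cluster and the shuffle moves data between machines; the structure of the computation is the same.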
HDFS 1.0 / 2.0
The Hadoop Distributed File System (HDFS) is a highly fault-tolerant system designed to be deployed on low-cost hardware. It provides high-throughput data access and is well suited to large data sets.
Yarn (Hadoop 2.0)
At this early stage, just understand that Yarn is a resource-management platform responsible for allocating resources to tasks. Yarn is a general-purpose resource-scheduling platform: any framework that meets its requirements can use Yarn for resource scheduling.
Hive
Hive is a data warehouse whose data is all stored on HDFS. Using Hive mainly means writing HQL, which is very similar to MySQL's SQL. Under the hood, Hive translates HQL into MapReduce programs, which do the actual execution.
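For example, a typical HQL statement looks almost identical to MySQL SQL (the `logs` table and its `dt` partition column here are hypothetical):

```sql
-- Count page views per day; Hive compiles this into MapReduce jobs.
SELECT dt, COUNT(*) AS pv
FROM logs
WHERE dt >= '2024-01-01'
GROUP BY dt;
```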
Spark
Spark is a fast, general-purpose computing engine designed for large-scale data processing, based on in-memory iterative computation. It retains the advantages of MapReduce while greatly improving timeliness.
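Spark's style of chained transformations over an in-memory dataset can be mimicked in plain Python — this is only an illustration of the idea, not the PySpark API:

```python
# Mimic Spark-style chained transformations on an in-memory dataset.
# Intermediate results stay in memory between steps, which is the key
# difference from MapReduce writing each stage's output to disk.
data = range(1, 11)

squared = [x * x for x in data]               # like rdd.map(lambda x: x * x)
evens = [x for x in squared if x % 2 == 0]    # like .filter(lambda x: x % 2 == 0)
total = sum(evens)                            # like .reduce(operator.add)
print(total)  # 220
```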
Spark Streaming
Spark Streaming is a near-real-time processing framework: data is processed batch by batch, in micro-batches.
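Micro-batching can be sketched in plain Python: incoming records are grouped into small batches, and each batch is processed as one unit (an illustration of the concept only, not the Spark Streaming API):

```python
def micro_batches(stream, batch_size):
    """Group an incoming stream of records into fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each micro-batch is then processed as a unit, e.g. summed here.
stream = iter(range(1, 8))
results = [sum(batch) for batch in micro_batches(stream, 3)]
print(results)  # [6, 15, 7]
```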
Hive on Spark
Fast SQL retrieval with Spark. With Spark as Hive's execution engine, Hive queries are submitted as Spark jobs and computed on the Spark cluster, which improves the performance of Hive queries.
Storm
Storm is a real-time computing framework. The difference from MR: MR processes massive data offline, while Storm processes records as they arrive, one at a time, which guarantees the timeliness of data processing.
Zookeeper
Zookeeper is the foundation of many big-data frameworks; it is a cluster manager. It monitors the state of each node in the cluster and carries out the next reasonable operation based on the feedback nodes submit.
Finally, it provides users with easy-to-use interfaces and a system with good performance, functionality, and stability.
Hbase
HBase is a NoSQL database: a key-value store that is highly reliable, column-oriented, scalable, and distributed.
It is suited to storing unstructured data, with HDFS as the underlying storage layer.
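The row-key / column-family / qualifier data model can be sketched with nested Python dicts — a mental model only, not the HBase client API, and the keys below are made up:

```python
# row key -> column family -> qualifier -> value
table = {
    "user#1001": {
        "info": {"name": "alice", "age": "30"},
        "stats": {"logins": "42"},
    },
}

# A read is a key lookup, which is why point queries by row key are fast.
value = table["user#1001"]["info"]["name"]
print(value)  # alice
```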
Kafka
Kafka is message-oriented middleware; in real-time processing scenarios it often serves as an intermediate buffer layer.
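The buffering role can be illustrated with a plain in-memory queue — a stand-in only; real Kafka is a distributed, persistent log accessed through client libraries:

```python
from collections import deque

# A toy in-memory "topic": producers append, consumers pop in order.
topic = deque()

def produce(message):
    topic.append(message)  # fast write by the data source

def consume():
    return topic.popleft() if topic else None  # read at the consumer's own pace

# The producer can run ahead of the consumer; the buffer absorbs the burst.
for i in range(3):
    produce(f"log-{i}")

received = [consume() for _ in range(3)]
print(received)  # ['log-0', 'log-1', 'log-2']
```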
Flume
Flume is a log-collection tool. The data it collects are usually the log files generated by applications, and there are generally two pipelines.
One stores the data Flume collects into Kafka, for convenient real-time processing by Storm or Spark Streaming.
The other stores the data Flume collects into HDFS, for later offline processing with Hadoop or Spark.
The second stage: data mining algorithms
Chinese word segmentation
Offline and online application of open-source segmentation libraries
Natural Language Processing
Text relevance algorithms
Recommendation algorithm
Content-based (CB) and collaborative filtering (CF), normalization, and Mahout in practice.
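A minimal user-based collaborative-filtering sketch in pure Python, on made-up toy ratings (in practice a library such as Mahout handles this at scale):

```python
from math import sqrt

# Toy user -> {item: rating} data, invented for illustration.
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 3, "c": 5, "d": 4},
    "carol": {"b": 1, "d": 5},
}

def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Score unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return max(scores, key=scores.get) if scores else None

print(recommend("alice"))  # 'd'
```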
Classification algorithms
NB, SVM
Regression algorithm
LR, decision trees
Clustering Algorithm
Hierarchical clustering, K-means
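A minimal K-means sketch on 1-D toy data, in pure Python with fixed initial centroids for reproducibility:

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate assignment and centroid-update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, [0.0, 5.0])
print(centroids)  # [1.5, 11.0]
```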
Neural networks and deep learning
NN, TensorFlow
The above is the detailed learning route for Hadoop development; for reasons of space it only lists the frameworks and explains their roles.
After completing the first stage, you can already take on big-data architecture work, or be responsible for some development and maintenance in an enterprise.
After completing the second stage, you can work in data mining, currently the most lucrative kind of position in the big-data industry.