[Repost] What do big data engineers need to learn?

Author: Chen Chen
Link: https://www.zhihu.com/question/25542750/answer/493835356
Source: Zhihu

Actually, what this question is really asking about are three directions: big data platform building / optimization / operations / monitoring; big data development / design / architecture; and data analysis / mining. Please don't ask me which one is easier, which has better prospects, or which pays more.
First, a quick rundown of the 4V characteristics of big data:
Volume: large data volumes, TB to PB scale;
Variety: many data types, structured and unstructured - text, logs, video, images, location data, and so on;
Value: high commercial value, but that value is buried in massive data and has to be dug out quickly through data analysis and machine learning;
Velocity: high timeliness requirements - processing massive data is no longer limited to offline computation.
Now, to handle these characteristics of big data, open-source big data frameworks keep getting more numerous and more powerful. To list some common ones:
File storage: Hadoop HDFS, Tachyon, KFS
Offline computation: Hadoop MapReduce, Spark
Streaming / real-time computation: Storm, Spark Streaming, S4, Heron
K-V / NoSQL databases: HBase, Redis, MongoDB
Resource management: YARN, Mesos
Log collection: Flume, Scribe, Logstash, Kibana
Messaging systems: Kafka, StormMQ, ZeroMQ, RabbitMQ
Query and analysis: Hive, Impala, Pig, Presto, Phoenix, SparkSQL, Drill, Flink, Kylin, Druid
Distributed coordination: ZooKeeper
Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager
Data mining / machine learning: Mahout, Spark MLlib
Data synchronization: Sqoop
Task scheduling: Oozie
......
Dizzying, right? There are more than 30 of them above. Never mind mastering them all - I doubt many people have even used them all.
As for me, my main experience is in the second direction (development / design / architecture), so here is my advice.
Chapter 1: Getting to Know Hadoop
1.1 Learn to use Baidu and Google
Whatever the problem, first try to search for it and solve it yourself.
Google is preferred; if you can't get past the firewall, just use Baidu.
1.2 The official documentation is the reference of choice
Especially when getting started, the official documentation is always the first-choice reference.
I believe most people doing this work are educated enough that passable English will do; if you really can't stand it, go back to 1.1.
1.3 Get Hadoop up and running
Hadoop can be regarded as the forefather of big data storage and computation; most open-source big data frameworks today either rely on Hadoop or integrate well with it.
About Hadoop, you at least need to figure out what the following are:
Hadoop 1.0, Hadoop 2.0
MapReduce, HDFS
NameNode, DataNode
JobTracker, TaskTracker
YARN, ResourceManager, NodeManager
Build Hadoop yourself: use 1.1 and 1.2, and getting it to run is enough.
It is recommended to install from the release package on the command line; do not install it with management tools.
Also: just knowing about Hadoop 1.0 is enough - use Hadoop 2.0 now.
1.4 Try using Hadoop
HDFS directory commands; commands for uploading and downloading files; submitting and running the example MapReduce program;
open the Hadoop web UI to view Job status and Job run logs.
Know where the Hadoop system logs are.
1.5 Understand the principles behind them
MapReduce: how divide and conquer works; HDFS: where the data actually lives, and what a replica is; what YARN really is and what it can do; what the NameNode actually does; what the ResourceManager actually does;
1.6 Write a MapReduce program yourself
Following the WordCount example, write your own WordCount program (copying is fine too), package it, and submit it to Hadoop to run; a sketch of the classic Java version follows below.
Don't know Java? Shell or Python will do - there's something called Hadoop Streaming.
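
To make 1.6 concrete, here is a minimal sketch based on the standard Hadoop WordCount example (essentially the one from the official documentation); input and output paths come from the command line, and the jar name in the usage note afterwards is just a placeholder:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every word in each input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Package it into a jar and submit it with something like: hadoop jar wordcount.jar WordCount /input /output, then look at the result files under the output directory on HDFS.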
If you've seriously completed the steps above, congratulations - you already have one foot in the door.
Chapter 2: A More Efficient WordCount
2.1 Learn some SQL
Do you know databases? Can you write SQL? If not, please go learn SQL.
2.2 The SQL version of WordCount
How many lines of code did the WordCount you wrote (or copied) in 1.6 have?
Let me show you mine:
SELECT word, COUNT(1) FROM wordcount GROUP BY word;
This is the charm of SQL: what takes dozens or even hundreds of lines of code to program, I can get done with this one line. Using SQL to process and analyze data on Hadoop is convenient, efficient, approachable, and increasingly the trend. Whether for offline or real-time computation, more and more big data processing frameworks are actively providing SQL interfaces.
2.3 Hive, SQL on Hadoop
What is Hive? The official explanation:
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
Why is Hive a data warehouse tool rather than a database tool? Some people may not know what a data warehouse is. A data warehouse is a logical concept; underneath it still uses a database. Data warehouses have two characteristics: they hold the most complete historical data (massive volume), and that data is relatively stable. "Relatively stable" means that, unlike a business system's database where data is updated frequently, data that has entered the warehouse is rarely updated or deleted and is mostly queried in bulk. Hive has both of these characteristics, so Hive is suited to being a data warehouse tool for massive data, not a database tool.
2.4 Install and configure Hive
Refer to 1.1 and 1.2 to install and configure Hive, and make sure you can enter the Hive command line normally.
2.5 Try using Hive
Refer to 1.1 and 1.2. Create the wordcount table in Hive and run the SQL statement from 2.2. Find the SQL task you just ran in the Hadoop web UI.
Check whether the SQL query result is consistent with the MapReduce result from 1.4.
2.6 How does Hive work?
You clearly wrote SQL - so why does the Hadoop web UI show MapReduce tasks?
2.7 Learn the basic Hive commands
Creating and dropping tables; loading data into a table; downloading data from a Hive table;
Refer to 1.2 to learn more Hive syntax and commands.
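
As a side note (my own illustration, not part of the original roadmap): besides the CLI, Hive can also be driven from code through HiveServer2's JDBC interface. A minimal sketch, assuming HiveServer2 is running on localhost:10000 and using a placeholder HDFS path for the input file:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; the URL and credentials below are assumptions for this sketch
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // The same statements you would type in the Hive CLI: create, load, query
      stmt.execute("CREATE TABLE IF NOT EXISTS wordcount (word STRING)");
      stmt.execute("LOAD DATA INPATH '/user/demo/words.txt' INTO TABLE wordcount");
      try (ResultSet rs = stmt.executeQuery("SELECT word, COUNT(1) FROM wordcount GROUP BY word")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}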
If you have seriously worked through the first two chapters of "Notes for Big Data Development Beginners", then you should already have the following skills and knowledge:
the difference between Hadoop 1.0 and 2.0;
the principles of MapReduce (including that classic interview question: given a 10GB file and 1GB of memory, use a Java program to find the 10 words with the highest number of occurrences);
the HDFS read and write flow; how to PUT data onto HDFS; how to download data from HDFS;
how to write a simple MapReduce program yourself, and where to look at the logs when it fails to run;
how to write simple SELECT, WHERE, GROUP BY and other SQL statements;
the general process by which Hive converts SQL into MapReduce;
common Hive statements: creating tables, dropping tables, loading data into a table, partitioning, downloading table data to the local machine;


From the study above, you have learned that HDFS is the distributed storage framework provided by Hadoop and can store massive amounts of data; that MapReduce is the distributed computing framework provided by Hadoop and can be used to run statistics and analysis over the massive data on HDFS; and that Hive is SQL on Hadoop - Hive provides a SQL interface so developers only need to write simple, approachable SQL statements, and Hive takes care of translating that SQL into MapReduce and submitting it to run. At this point, your "big data platform" looks like this:



So the question is: how do you get massive amounts of data onto HDFS?
Chapter 3: Getting Data from Elsewhere onto Hadoop
This can also be called data collection: collecting data from the various data sources onto Hadoop.
3.1 The HDFS PUT command
You should have used this already.
The put command is fairly common in real environments, usually combined with shell, Python, or other scripting languages.
Recommended: master it.
3.2 The HDFS API
HDFS provides an API for writing data; you write data into HDFS from your programming language of choice, and the put command itself uses this API under the hood.
In practice, directly programming against the API to write data into HDFS is rare; other frameworks usually wrap it in their own methods - for example Hive's INSERT statement, or Spark's saveAsTextFile.
It's recommended to understand the principle and write a demo.
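
A minimal sketch of writing a file through the HDFS Java API; the NameNode address and target path here are assumptions for illustration, not anything from the original post:

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the (assumed) NameNode and create a new file on HDFS
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
         FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}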
3.3 Sqoop
Sqoop is an open-source framework mainly used for exchanging data between Hadoop/Hive and traditional relational databases such as Oracle, MySQL, and SQL Server.
Just as Hive translates SQL into MapReduce, Sqoop translates the parameters you specify into MapReduce jobs and submits them to Hadoop to run, completing the data exchange between Hadoop and other databases.
Download and configure Sqoop yourself (Sqoop1 is recommended; Sqoop2 is more complicated).
Understand Sqoop's common configuration parameters and usage.
Use Sqoop to synchronize data from MySQL to HDFS; use Sqoop to synchronize data from MySQL to a Hive table;
PS: if you've already decided to use Sqoop as your data exchange tool, master it; otherwise, understanding it and being able to run a demo is enough.
3.4 Flume
Flume is a distributed framework for collecting and transporting massive volumes of logs. Because it is a "collection and transport framework", it is not suited to collecting and transporting data from relational databases.
Flume collects logs in real time from network protocols, messaging systems, and the file system, and transports them to HDFS.
So, if your business has data coming from these kinds of sources and needs it collected in real time, you should consider using Flume.
Download and configure Flume.
Use Flume to monitor a file that is continuously being appended to, and transport the data to HDFS;
PS: configuring and using Flume is fairly involved; if you don't have enough interest and patience, you can skip it for now.
3.5 Alibaba's open-source DataX
The reason I bring this up is that the tool we currently use to exchange data between Hadoop and relational databases was developed on top of an earlier DataX, and it is very easy to use.
You can refer to my blog post "Taobao DataX, a mass data exchange tool for heterogeneous data sources: download and use".
DataX now has a 3.0 version, which supports many data sources.
You can also do secondary development on top of it.
PS: if you're interested, study and use it, and compare it with Sqoop.
If you have earnestly completed the study and practice above, at this point your "big data platform" should look like this:



Chapter 4: Getting Data on Hadoop to Somewhere Else
The previous chapter covered how to collect data from your data sources onto Hadoop, after which you can analyze it with Hive and MapReduce. The next question is: how do you synchronize the analysis results from Hadoop to your other systems and applications?
In fact, the methods here are basically the same as in Chapter 3.
4.1 The HDFS GET command
GET files from HDFS to the local machine. You need to master this.
4.2 The HDFS API
Same as 3.2.
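
For symmetry with the write sketch in 3.2, a minimal sketch of reading a file back from HDFS through the Java API (the addresses and paths are again placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
         FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
      // Copy the file contents to stdout; 4096 is the buffer size
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}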
4.3 Sqoop
Same as 3.3.
Use Sqoop to synchronize files on HDFS to MySQL; use Sqoop to synchronize data in Hive tables to MySQL;
4.4 DataX
Same as 3.5.
If you have earnestly completed the study and practice above, at this point your "big data platform" should look like this:


If you have seriously worked through Chapters 3 and 4 of "Notes for Big Data Development Beginners (2)", then you should already have the following skills and knowledge:
how to collect existing data onto HDFS, both offline and in real time;
how to use Sqoop (and/or DataX) to exchange data between HDFS and other data sources;
that Flume can be used for real-time log collection.
From the study so far, you have already picked up quite a few of the skills a big data platform needs: building a Hadoop cluster, collecting data onto Hadoop, analyzing data with Hive and MapReduce, and synchronizing the results to other data sources.
The next problem: as Hive gets used more and more, you will find plenty of things about it that make you unhappy - above all, it is slow. In most cases, even though my data volume is clearly small, it still insists on applying for resources and starting MapReduce to execute.
Chapter 5: Faster, Please, My SQL
In fact, everyone has found that Hive, with MapReduce as its backend execution engine, is just a bit slow.
So there are more and more SQL-on-Hadoop frameworks. As far as I know, the most commonly used, in order of popularity, are SparkSQL, Impala, and Presto.
These three frameworks are semi-in-memory or fully in-memory, and provide a SQL interface for quickly querying and analyzing data on Hadoop. For a comparison of the three, please refer to 1.1.
We currently use SparkSQL. As for why SparkSQL, the reasons are roughly:
we also use Spark for other things, and don't want to introduce too many frameworks;
Impala demands too much memory, and we don't have that many resources to deploy it.
5.1 About Spark and SparkSQL
What is Spark, and what is SparkSQL? Spark's core concepts and terminology. What is the relationship between SparkSQL and Spark, and between SparkSQL and Hive? Why does SparkSQL run faster than Hive?
5.2 How to deploy and run SparkSQL
What deployment modes does Spark have? How do you run SparkSQL on YARN? Use SparkSQL to query tables in Hive.
PS: Spark is not a technology you can master in a short time, so it's suggested that once you understand Spark, you start from SparkSQL and take it step by step.
For more on Spark and SparkSQL, refer to 1.1 and 1.2.
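
As one hedged illustration of 5.2 (my own sketch, not the original author's code): a minimal Java program that queries the Hive wordcount table through SparkSQL, assuming Spark was built with Hive support and can reach your Hive metastore:

import org.apache.spark.sql.SparkSession;

public class SparkSQLDemo {
  public static void main(String[] args) {
    // enableHiveSupport() lets SparkSQL read tables registered in the Hive metastore
    SparkSession spark = SparkSession.builder()
        .appName("SparkSQLWordCount")
        .enableHiveSupport()
        .getOrCreate();

    // The same one-line WordCount from 2.2, now executed by Spark instead of MapReduce
    spark.sql("SELECT word, COUNT(1) AS cnt FROM wordcount GROUP BY word").show();

    spark.stop();
  }
}

You would typically package this and submit it with spark-submit, for example on YARN.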
If you have earnestly completed the study and practice above, at this point your "big data platform" should look like this:

Chapter 6: Polygamy
Don't be misled by the name. What I actually want to talk about is collecting data once and consuming it many times.
In real business scenarios, especially with monitoring logs, you often want to see certain metrics from the logs immediately (real-time computation is covered in a later chapter). Analyzing them off HDFS is too slow for that, and although Flume does the collection, Flume cannot roll files onto HDFS at very short intervals - doing so would produce an enormous number of small files.
To meet the need of collecting data once and consuming it many times, the thing to introduce here is Kafka.
6.1 About Kafka
What is Kafka?
Kafka's core concepts and terminology.
6.2 How to deploy and use Kafka
Deploy Kafka in standalone mode, and successfully run the producer and consumer examples that ship with it.
Write and run your own producer and consumer programs in Java (see the sketch after this list).
Integrate Flume and Kafka: use Flume to monitor logs and send the log data to Kafka in real time.
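
A minimal sketch of a Kafka producer and consumer using the Java client; the broker address, topic name, and group id are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaDemo {
  private static final String BROKERS = "localhost:9092"; // assumed broker address
  private static final String TOPIC = "demo_logs";        // assumed topic name

  public static void main(String[] args) {
    // Producer: send a few messages to the topic
    Properties p = new Properties();
    p.put("bootstrap.servers", BROKERS);
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
      for (int i = 0; i < 3; i++) {
        producer.send(new ProducerRecord<>(TOPIC, "key-" + i, "message-" + i));
      }
    }

    // Consumer: read the messages back
    Properties c = new Properties();
    c.put("bootstrap.servers", BROKERS);
    c.put("group.id", "demo-group");
    c.put("auto.offset.reset", "earliest");
    c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
      consumer.subscribe(Collections.singletonList(TOPIC));
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
      for (ConsumerRecord<String, String> r : records) {
        System.out.println(r.key() + " -> " + r.value());
      }
    }
  }
}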
If you have earnestly completed the study and practice above, at this point your "big data platform" should look like this:
At this point, the data collected with Flume no longer goes straight to HDFS; it goes to Kafka first. The data in Kafka can be consumed by multiple consumers at the same time, and one of those consumers synchronizes it to HDFS.
If you have seriously worked through Chapters 5 and 6 of "Notes for Big Data Development Beginners (3)", then you should already have the following skills and knowledge:
why Spark is faster than MapReduce;
how to use SparkSQL instead of Hive to run SQL faster;
how to use Kafka to build a collect-once, consume-many data pipeline;
how to write your own Kafka producer and consumer programs.
From the study so far, you have mastered most of the skills for data collection, data storage and computation, and data exchange on a big data platform. Each of these steps needs a task (program) to complete it, and there are dependencies between tasks - for example, the data computation tasks can only start once the data collection tasks have finished successfully. If a task fails, an alert needs to go out to developers and operations staff, along with complete logs to make troubleshooting easier.
Chapter 7: More and More Analysis Tasks
It is not just analysis tasks: data collection and data exchange are tasks too. Some of these tasks are triggered on a schedule, and others need to be triggered by dependencies on other tasks. When the platform has hundreds or thousands of tasks to maintain and run, crontab alone is no longer enough, and you need a scheduling and monitoring system to handle it. The scheduling and monitoring system is the backbone of the whole data platform; much like an AppMaster, it is responsible for dispatching and monitoring tasks.
7.1 Apache Oozie
1. What is Oozie? What features does it have? 2. What types of tasks (programs) can Oozie schedule? 3. What kinds of task triggering does Oozie support? 4. Install and configure Oozie.
7.2 Other open-source task scheduling systems
Azkaban
Light-Task-Scheduler
Zeus
etc. ......
In addition, I have written separately about the task scheduling and monitoring system I developed; see "Task scheduling and monitoring system for a big data platform".
If you have earnestly completed the study and practice above, at this point your "big data platform" should look like this:
Chapter 8: My Data Needs to Be Real-Time
When Kafka came up in Chapter 6, I mentioned business scenarios that need real-time metrics. "Real time" can basically be divided into absolute real time and near real time: absolute real time usually requires latency in milliseconds, while near real time generally requires latency in seconds or minutes. For scenarios that need absolute real time, Storm is used more; other near-real-time scenarios can use either Storm or Spark Streaming. And of course, if you're able to, you can also write your own programs to do it.
8.1 Storm
1. What is Storm? What are its possible application scenarios? 2. What core components does Storm consist of, and what does each one do? 3. Do a simple Storm installation and deployment. 4. Write your own demo program and use Storm to run a simple real-time computation over a stream of data.
8.2 Spark Streaming
1. What is Spark Streaming, and what is its relationship to Spark? 2. How do Spark Streaming and Storm compare - what are the pros and cons of each? 3. Use Kafka + Spark Streaming to build a demo program that computes in real time (a sketch follows below).
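
A minimal sketch of the Kafka + Spark Streaming demo, using the spark-streaming-kafka-0-10 integration; the broker address and topic name are placeholders, and this is my own illustration rather than the original author's code:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import scala.Tuple2;

public class KafkaStreamingDemo {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("KafkaStreamingWordCount");
    // Micro-batches every 5 seconds
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumed broker address
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "streaming-demo");
    kafkaParams.put("auto.offset.reset", "latest");

    JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(Arrays.asList("demo_logs"), kafkaParams));

    // Word count over each micro-batch of messages
    stream.map(ConsumerRecord::value)
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum)
        .print();

    jssc.start();
    jssc.awaitTermination();
  }
}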
If you have earnestly completed the study and practice above, at this point your "big data platform" should look like this:
At this point, the underlying architecture of your big data platform has taken shape, covering the major modules of data collection, data storage and computation (offline and real-time), data synchronization, and task scheduling and monitoring. Next it's time to think about how best to serve this data to the outside.
Chapter 9: My Data Needs to Be Available Externally
Providing data access to external (business) systems generally covers the following:
Offline: for example, providing the previous day's data to a specified data source (DB, file, FTP) every day. Offline data can be delivered with Sqoop, DataX, or other offline data exchange tools.
Real time: for example, an online site's recommendation system needs to fetch recommendation data for users from the platform in real time, which requires very low latency (under 50 milliseconds).
Depending on the latency requirements and the real-time data needs, possible options include HBase, Redis, MongoDB, ElasticSearch, and so on.
OLAP analysis: besides requiring the underlying data model to be fairly standardized, OLAP also places ever-increasing demands on query response time. Possible options include Impala, Presto, SparkSQL, and Kylin. If your data model is relatively large in scale, Kylin is the best choice.
Ad hoc queries: ad hoc query requirements are much more arbitrary, and it is usually hard to build a common data model for them, so possible options include Impala, Presto, and SparkSQL.
With so many fairly mature frameworks and options, you have to choose based on your business requirements and your data platform's technical architecture. There is only one principle: the simpler and more stable, the better.
If you have already figured out how to serve data well to external (business) systems, then your "big data platform" looks like this:
Chapter 10: That Fancy, High-End Machine Learning
On this topic I'm a layman and can only give a brief introduction. As a math graduate I'm rather ashamed of that - I really regret not studying math properly.
In our business, the problems that machine learning can be used to solve fall roughly into three categories:
Classification problems: including binary and multi-class classification. Binary classification solves prediction problems, like predicting whether an email is spam; multi-class classification solves things like text categorization;
Clustering problems: roughly grouping users by the keywords they search for.
Recommendation problems: making recommendations based on users' browsing history and click behavior.
In most industries, these are exactly the kinds of problems machine learning is used to solve.
A learning path for getting started:
the fundamentals of mathematics;
"Machine Learning in Action" (it helps to know Python);
Spark MLlib, which provides some pre-packaged algorithms as well as methods for feature processing and feature selection (see the sketch after this list).
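
For a taste of Spark MLlib, here is a minimal sketch (my own illustration) that trains a logistic regression model, e.g. for the spam / not-spam prediction mentioned above; the training file path is a placeholder and the data is assumed to be in libsvm format:

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MLlibDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("MLlibDemo").getOrCreate();

    // Hypothetical path to a training set in libsvm format (label plus sparse features)
    Dataset<Row> training = spark.read().format("libsvm").load("data/sample_libsvm_data.txt");

    // A basic binary classifier
    LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01);
    LogisticRegressionModel model = lr.fit(training);

    // Apply the model back to the training data and show a few predictions
    model.transform(training).select("label", "prediction").show(5);

    spark.stop();
  }
}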
Machine learning really is the fancy, high-end stuff, and it is my own learning goal.

So you can add the machine learning part to your "big data platform" as well.
