How should you learn big data when you have no programming background?

Disclaimer: this is an original post by the blogger, released under the CC 4.0 BY-SA license. If you repost it, please attach the original source link and this statement.
Original link: https://blog.csdn.net/wwdede/article/details/100165760

Many beginners, once the idea of moving into big data development takes hold, can't help having some doubts: how should I get started? Which technologies should I learn? What does the learning path look like?

The motivation is usually the same as that of students who set out to learn Java: the jobs are hot, the pay is comparatively high, and the prospects look impressive. That is basically why people yearn for big data, even though they understand very little about it.

If you want to learn it, you first need to learn programming, then pick up some mathematics and statistics, and finally bring them together in applications; from there you can grow in whatever data direction you want. Broadly speaking, that is the path, but knowing only this much doesn't help.

Now you need to ask yourself a few questions:

When it comes to computers and software, what actually interests you?

Are you a computer science major interested in operating systems, hardware, networking, and servers?

Are you a software major interested in software development, programming, and writing code?

Or do you come from mathematics or statistics, with a particular interest in data and numbers?

What is your own field?

If, say, you majored in finance, it is well worth learning, because combining big data with your own domain will make you stand out from competitors who have nothing but the technical skills; after all, "AI+" has already moved into the financial industry.

Having said all that, the point is simply that big data has three major career directions:

Platform building / optimization / operations / monitoring;

Big Data development / design / architecture;

Data analysis / mining.

Please don't ask me which one is easy; I can only say that nothing that makes money is simple.

 

Let's talk about the four typical characteristics of big data:

Large data volume;

Many data types (structured data, unstructured text, logs, video, images, geolocation, and so on);

High business value, but that value has to be mined quickly out of massive data through analysis and machine learning;

High timeliness requirements: processing massive data is no longer confined to offline batch computation.

To cope with these characteristics, open-source big data frameworks keep multiplying and getting stronger. Here are some common ones:

File storage: Hadoop HDFS

Offline computing: Hadoop MapReduce, Spark

Streaming and real-time computing: Storm, Spark Streaming

Resource management: YARN, Mesos

Log collection: Flume, Scribe, Logstash, Kibana

Message system: Kafka, StormMQ, ZeroMQ, RabbitMQ

Analysis: Hive, Impala, Pig, Presto, Phoenix, SparkSQL, Drill, Flink, Kylin, Druid

Distributed Coordination Services: Zookeeper

Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager

Data mining, machine learning: Mahout, Spark MLLib

Data synchronization: Sqoop

Task scheduling: Oozie

Feeling dizzy yet? Forget about mastering everything above; very few people end up using more than a handful of these. So next, let's look at a learning route for the big data development / design / architecture direction.

For whatever problems you run into during the study that follows, first try to solve them yourself by searching: Google first, Baidu second.

For beginners, the official documentation should always be your first choice of reference.

Chapter 1: Hadoop

Hadoop can be regarded as the forefather of big data storage and computing; most of today's open-source big data frameworks either depend on it or integrate very well with it.

About Hadoop, you need to at least figure out what the following are:

Hadoop 1.0, Hadoop 2.0

MapReduce, HDFS

NameNode, DataNode

JobTracker, TaskTracker

YARN, ResourceManager, NodeManager

Learn how to set up Hadoop yourself and get it running. Install from the release package on the command line; it is recommended not to use a management tool for the installation. These days, use Hadoop 2.0.

Learn the HDFS directory-operation commands; the commands for uploading and downloading files; how to submit and run the bundled MapReduce sample program; how to open the Hadoop web UI to see running jobs and view their logs. Know where the Hadoop system logs live.
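As a concrete starting point, here is a minimal sketch of those basic operations driven from Python. The paths, file names, and the examples-jar location are illustrative assumptions; the jar's exact name depends on your Hadoop version.

import subprocess

def sh(cmd):
    # Run one shell command and stop if it fails.
    subprocess.run(cmd, shell=True, check=True)

# Basic HDFS directory and file operations.
sh("hdfs dfs -mkdir -p /user/hadoop/input")                      # create a directory
sh("hdfs dfs -put ./localfile.txt /user/hadoop/input/")          # upload a local file
sh("hdfs dfs -ls /user/hadoop/input")                            # list the directory
sh("hdfs dfs -get /user/hadoop/input/localfile.txt ./copy.txt")  # download it again

# Submit the WordCount example that ships with Hadoop (jar name varies by version).
sh("hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar "
   "wordcount /user/hadoop/input /user/hadoop/output")

After the job finishes, hdfs dfs -cat /user/hadoop/output/part-r-00000 shows the word counts, and the YARN web UI shows the job's status and logs.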

After completing the above, you should then get to know how they work:

MapReduce: how it divides and conquers; HDFS: where the data actually lives, and what a replica is;

What exactly YARN is and what it can do; what exactly the NameNode does; what exactly the ResourceManager does;

If you can find a suitable learning site, watch the video lectures; if not, or if you prefer books, then study those diligently. Of course, the best way is first to search for what each of these things does, get a rough idea, and then go listen to the videos.

Then find an example and work through it yourself:

Write a WordCount program (copying one is fine too),

package it, and submit it to Hadoop to run. Don't know Java? Shell or Python will do; there is a thing called Hadoop Streaming. If you have seriously completed the steps above, congratulations: you already have one foot in the door.
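For instance, here is a minimal Hadoop Streaming WordCount in Python, a sketch under the usual streaming conventions (the mapper reads raw lines, the reducer receives lines sorted by key); the jar path and HDFS directories are assumptions that depend on your installation.

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))

#!/usr/bin/env python
# reducer.py: input arrives sorted by word, so sum consecutive counts.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Submit it with the streaming jar, for example: hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /user/hadoop/input -output /user/hadoop/wc_out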

Chapter 2: A more efficient WordCount

At this point you really should learn SQL; it will be a great help in your work.

How many lines of code was the WordCount you just wrote (or copied)? In SQL it is dead simple, for example:

SELECT word,COUNT(1) FROM wordcount GROUP BY word;

That is the charm of SQL: what takes dozens or even hundreds of lines of program code, SQL does in a single line. Using SQL to process and analyze data on Hadoop is convenient, efficient, easy to pick up, and increasingly the trend; whether for offline or real-time computation, more and more big data processing frameworks are actively providing SQL interfaces.

Next comes Hive, the SQL-on-Hadoop tool you must learn for big data.

So what is Hive?

The official explanation: "The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax."

Why is Hive a data warehouse tool rather than a database tool?

Some readers may not know what a data warehouse is. A data warehouse is a logical concept; underneath, it still uses databases. Data in a warehouse has two characteristics: it is the most complete historical data (massive in volume), and it is relatively stable. "Relatively stable" means that, unlike a business system's database where data is updated frequently, data that has entered the warehouse is rarely updated or deleted and is mostly just queried in large volumes. Hive has both characteristics, which is why Hive is suited to being a data warehouse tool for massive data rather than a database tool.

Once you understand its role, the next step is to install and configure Hive. When you can enter the Hive command line normally, the installation and configuration have succeeded.

Understand how Hive works.

Learn Hive's basic commands:

create and drop tables; load data into tables; export data from Hive tables;

The principles of MapReduce (or that classic interview question: given a 10 GB file and 1 GB of memory, how do you use a Java program to find the ten most frequent words and their counts?);

The HDFS read and write flows; how to PUT data onto HDFS; how to download data from HDFS;

Be able to write a simple MapReduce program, and when it fails to run, know where to look at the logs;

Write simple SELECT, WHERE, GROUP BY and other SQL statements;

Roughly how Hive converts SQL into MapReduce;

Hive's common statements: create a table, drop a table, load data into a table, partition a table, export table data to the local filesystem (see the sketch just below);
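To make those Hive statements concrete, here is a minimal sketch that drives the Hive CLI from Python via "hive -e". The table name, partition value, and file paths are made-up examples.

import subprocess

def run_hive(statement):
    # Execute one HiveQL statement through the Hive command-line client.
    subprocess.run(["hive", "-e", statement], check=True)

# Create a partitioned table (example schema).
run_hive("CREATE TABLE IF NOT EXISTS wordcount (word STRING, cnt INT) "
         "PARTITIONED BY (dt STRING) "
         "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")

# Load a local file into one partition (example path and date).
run_hive("LOAD DATA LOCAL INPATH '/tmp/wordcount.txt' "
         "INTO TABLE wordcount PARTITION (dt='2019-08-30')")

# Aggregate and export the result to a local directory.
run_hive("INSERT OVERWRITE LOCAL DIRECTORY '/tmp/wc_out' "
         "SELECT word, SUM(cnt) FROM wordcount GROUP BY word")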

From the study so far you have learned that HDFS is the distributed storage framework Hadoop provides, used to store massive amounts of data; MapReduce is the distributed computing framework Hadoop provides, used to compute over and analyze the massive data sitting on HDFS; and Hive is SQL on Hadoop: Hive provides a SQL interface so that developers only need to write simple, approachable SQL statements, and Hive takes care of translating the SQL into MapReduce and submitting it to run.

At this point, that is what your "big data platform" looks like. Which raises the question: how does massive data get onto HDFS in the first place?

Chapter 3: Data Collection

That is, collecting data from the various data sources onto Hadoop.

3.1 HDFS PUT command

You should have used this already. The put command is quite common in real environments, usually used together with shell, Python, or other scripting languages. It is recommended that you master it.

3.2 HDFS API

HDFS provides an API for writing data, so you can write data to HDFS from the programming language you use; the put command itself is built on this API.

In real environments you rarely write data to HDFS by programming against the API directly; other frameworks usually have it wrapped already, for example Hive's INSERT statement or Spark's saveAsTextFile. The suggestion is to understand the principle and write a demo.
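As a small demo, here is a sketch using the third-party Python package "hdfs" (a WebHDFS client); the package, the NameNode web address (port 50070 on Hadoop 2, 9870 on Hadoop 3), and the paths are all assumptions for illustration.

from hdfs import InsecureClient  # pip install hdfs

# WebHDFS endpoint of the NameNode (adjust host and port to your cluster).
client = InsecureClient("http://namenode:50070", user="hadoop")

# Write a small text file to HDFS.
client.write("/user/hadoop/demo.txt", data="hello hdfs\n",
             overwrite=True, encoding="utf-8")

# Read it back to confirm the round trip.
with client.read("/user/hadoop/demo.txt", encoding="utf-8") as reader:
    print(reader.read())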

3.3 Sqoop

Sqoop is the main open-source framework for exchanging data between Hadoop/Hive and traditional relational databases such as Oracle, MySQL, and SQL Server. Just as Hive translates SQL into MapReduce, Sqoop translates the parameters you specify into MapReduce jobs and submits them to Hadoop, which carries out the data exchange between Hadoop and the other database.

Download and configure Sqoop yourself (Sqoop 1 is recommended; Sqoop 2 is rather complex). Get to know Sqoop's common configuration parameters and usage.

Use Sqoop to sync data from MySQL to HDFS; use Sqoop to sync data from MySQL to a Hive table. If you later settle on Sqoop as your data exchange tool, it is worth mastering; otherwise it is enough to understand it and be able to run a demo.
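For example, the two syncs above might look like the following sketch, which simply shells out to the Sqoop 1 command line; the JDBC URL, credentials, table names, and target paths are hypothetical.

import subprocess

MYSQL_URL = "jdbc:mysql://dbhost:3306/shop"  # hypothetical database

# MySQL table -> HDFS directory.
subprocess.run([
    "sqoop", "import",
    "--connect", MYSQL_URL,
    "--username", "reader", "--password", "secret",
    "--table", "orders",
    "--target-dir", "/user/hadoop/orders",
    "-m", "1",                    # one map task is enough for a demo
], check=True)

# MySQL table -> Hive table (Sqoop creates and loads the Hive table).
subprocess.run([
    "sqoop", "import",
    "--connect", MYSQL_URL,
    "--username", "reader", "--password", "secret",
    "--table", "orders",
    "--hive-import", "--hive-table", "default.orders",
    "-m", "1",
], check=True)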

3.4 Flume

Flume is a distributed framework for collecting and transporting massive volumes of logs. Because it is a "collection and transport framework", it is not suited to collecting and moving data out of relational databases. Flume collects logs in real time from network protocols, messaging systems, and files, and transports them to HDFS.

So if your business has data coming from these kinds of sources and it needs to be collected in real time, you should consider using Flume.

Download and configure Flume. Use Flume to monitor a file that is continuously being appended to, and transfer the new data to HDFS. Flume's configuration and usage are relatively involved; if you don't have enough interest and patience, you can skip it for now.

3.5 Alibaba's open-source DataX

The reason I bring this up is that the tool we currently use for exchanging data between Hadoop and relational databases was developed on top of an earlier version of DataX, and it is very handy.

You can refer to my blog post "Heterogeneous data source mass data exchange tool: Taobao DataX download and usage". DataX now has a 3.0 release that supports many data sources, and you can also do secondary development on top of it. If you are interested, study and use it, and compare it with Sqoop.

Chapter 4: Getting data off Hadoop

Data has been analyzed with Hive and MapReduce. The next question is: once the analysis is done on Hadoop, how do you sync the results out to other systems and applications? In fact, the methods here are basically the same as in Chapter 3.

The HDFS GET command: GET files from HDFS onto the local filesystem. You need to master this.

HDFS API: same as 3.2.

Sqoop: same as 3.3. Use Sqoop to sync files on HDFS to MySQL; use Sqoop to sync data from Hive tables to MySQL.
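Mirroring the earlier import sketch, exporting back to MySQL with Sqoop might look like this; again the connection details, table, HDFS path, and field delimiter are assumptions for illustration.

import subprocess

# HDFS files (tab-separated) -> an existing MySQL table.
subprocess.run([
    "sqoop", "export",
    "--connect", "jdbc:mysql://dbhost:3306/shop",
    "--username", "writer", "--password", "secret",
    "--table", "daily_report",                   # must already exist in MySQL
    "--export-dir", "/user/hadoop/report/dt=2019-08-30",
    "--input-fields-terminated-by", "\\t",       # Sqoop interprets the escape
    "-m", "1",
], check=True)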

If you have carefully worked through the whole process above, you should by now have the following skills and knowledge:

You know how to collect existing data onto HDFS, both offline and in real time;

You know that Sqoop is a tool for exchanging data between HDFS and other data sources;

You know that Flume can be used for real-time log collection.

From the study so far, you have picked up most of the knowledge and skills a big data platform needs: building a Hadoop cluster, collecting data onto Hadoop, using Hive and MapReduce to analyze the data, and syncing the results out to other data sources.

Then the next problem arrives: the more you use Hive, the more you find to be unhappy about, above all how slow it is. In most cases, even when the data volume is clearly small, it still has to request resources and start up MapReduce to execute.

Chapter 5: Faster SQL

In fact, everyone has noticed that with MapReduce as its execution engine in the background, Hive really is a bit slow. SQL-on-Hadoop frameworks have therefore multiplied; as far as I know, the most commonly used, in order of popularity, are SparkSQL, Impala, and Presto. These three frameworks run partly or fully in memory and provide a SQL interface for fast query and analysis of the data on Hadoop.

We currently use SparkSQL. As for why SparkSQL rather than the others, the reasons are roughly: we use Spark for other things as well and don't want to introduce too many frameworks; and Impala demands too much memory, which we don't have the spare resources to deploy.

5.1 About Spark and SparkSQL

What is Spark, and what is SparkSQL?

Spark's core concepts and terminology.

What is the relationship between SparkSQL and Spark? Between SparkSQL and Hive?

Why does SparkSQL run faster than Hive?

5.2 How to deploy and run SparkSQL

What deployment modes does Spark have?

How do you run SparkSQL on YARN?

Use SparkSQL to query Hive tables. Spark is not a technology you can master in a short time, so the suggestion is to start with SparkSQL once you understand roughly what Spark is, and take it step by step.
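A minimal PySpark sketch of querying a Hive table, assuming Spark is built with Hive support, can see the Hive metastore, and that the wordcount table from Chapter 2 exists.

from pyspark.sql import SparkSession

# Hive support lets Spark read tables registered in the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-query-demo")
         .enableHiveSupport()
         .getOrCreate())

# The same aggregation as the Hive example, now executed by Spark.
df = spark.sql("SELECT word, COUNT(1) AS cnt FROM wordcount GROUP BY word")
df.show(10)

spark.stop()

Save it as a script and submit it with spark-submit --master yarn to run it on YARN.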

If you have earnestly completed the study and practice above on Spark and SparkSQL, then at this point your "big data platform" looks like this.

Chapter 6: Collect once, consume many times

Don't read too much into the title; what I really want to talk about is collecting data once and consuming it many times.

In real business scenarios, especially with monitoring logs, you often want to derive some metrics from the logs immediately (real-time computation is covered in a later chapter). Analyzing from HDFS is too slow for that: even though Flume does the collection, you cannot make Flume roll files onto HDFS at very short intervals, because that would produce an enormous number of small files.

To meet the need of collecting data once and consuming it many times, the thing to talk about here is Kafka.

About Kafka: what is Kafka? Kafka's core concepts and terminology.

How to deploy and use Kafka: deploy Kafka in standalone mode and successfully run the producer and consumer examples that ship with it; write and run your own producer and consumer programs in Java; integrate Flume with Kafka, using Flume to monitor logs and send the log data to Kafka in real time.
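The article suggests Java for the producer and consumer; as a lighter sketch of the same idea, here is a version using the third-party kafka-python package, assuming a standalone broker on localhost:9092 and a hypothetical topic name.

from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

TOPIC = "app_logs"            # hypothetical topic
BROKERS = "localhost:9092"    # standalone broker from the deployment step

# Producer: send a few log lines.
producer = KafkaProducer(bootstrap_servers=BROKERS)
for i in range(3):
    producer.send(TOPIC, value=("log line %d" % i).encode("utf-8"))
producer.flush()

# Consumer: read the topic from the beginning and print the messages.
consumer = KafkaConsumer(TOPIC,
                         bootstrap_servers=BROKERS,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)  # stop after 5 s of silence
for msg in consumer:
    print(msg.value.decode("utf-8"))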

If you have earnestly completed the study and practice above, your "big data platform" now looks like this.

At this point, the data collected by Flume no longer goes straight to HDFS; it goes to Kafka first. The data in Kafka can be consumed by several consumers at the same time, one of which is the consumer that syncs the data onto HDFS.

If you have carefully studied all of the above, you should now have the following skills and knowledge:

Why Spark is faster than MapReduce;

How to use SparkSQL instead of Hive to run SQL faster;

How to use Kafka to achieve a collect-once, consume-many data architecture;

How to write your own programs to act as Kafka producers and consumers.

From the study so far, you have mastered most of the skills on a big data platform: data collection, data storage and computation, data exchange, and so on. Each of these steps needs a task (a program) to carry it out, and there are dependencies between the tasks; for example, the computation tasks can only start once the data collection task has completed successfully. If a task fails, an alert has to go out to the developers and operations staff, together with complete logs to make troubleshooting easier.

Chapter 7: More and more analysis tasks

It is not just analysis tasks: data collection and data exchange are tasks too. Some of these tasks fire on a schedule, while others need to be triggered by the completion of other tasks. When the platform has hundreds or thousands of tasks that need to be maintained and run, crontab alone is nowhere near enough, and you need a scheduling and monitoring system to take this over. The scheduling and monitoring system is the backbone of the whole data platform, something like an AppMaster, responsible for dispatching and monitoring tasks.

7.1 Apache Oozie

What is Oozie? What are its features?

What types of tasks (programs) can Oozie schedule?

What kinds of task triggers does Oozie support?

Install and configure Oozie.

7.2 Other open-source task scheduling systems

Azkaban, light-task-scheduler, Zeus, and so on. Besides these, there is the task scheduling and monitoring system I developed myself some time ago; for details see "Task scheduling and monitoring system for a big data platform". If you have earnestly completed the study and practice above, your "big data platform" should now look like this:

Chapter 8: My data has to be real-time

Kafka was mentioned in Chapter 6 for business scenarios where some metrics are needed in real time. Real time can be roughly divided into absolute real time and near real time: absolute real time usually means latency in milliseconds, while near real time means latency on the order of seconds or minutes. For scenarios that demand absolute real time, Storm is the usual choice; for near-real-time scenarios you can use Storm or Spark Streaming. And of course, if you are able to, you can always write your own programs to do it.

8.1 Storm

What is Storm? What are its possible application scenarios?

What core components does Storm consist of, and what does each one do?

Simple installation and deployment of Storm.

Write your own demo program and use Storm to run a real-time computation over a stream of data.

8.2 Spark Streaming

What is Spark Streaming, and what is its relationship to Spark?

How do Spark Streaming and Storm compare? What are the advantages and disadvantages of each?

Use Kafka + Spark Streaming to build a demo program that computes in real time.
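Here is a minimal sketch of that demo. It uses Spark Structured Streaming, the newer streaming API, rather than the classic DStream-based Spark Streaming, and it assumes the app_logs topic from the Kafka example and that the job is submitted with the Spark-Kafka connector package (for example via spark-submit --packages) on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("kafka-streaming-wordcount").getOrCreate()

# Read the Kafka topic as an unbounded stream of lines.
lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "app_logs")
         .load()
         .select(col("value").cast("string").alias("line")))

# A streaming word count over the incoming log lines.
counts = (lines
          .select(explode(split(col("line"), " ")).alias("word"))
          .groupBy("word")
          .count())

# Print the continuously updated counts to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()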

At this point, the underlying architecture of your big data platform has taken shape: data collection, data storage and computation (offline and real-time), data synchronization, and task scheduling and monitoring are all in place as major modules. Next it is time to think about how best to serve this data to the outside.

Chapter 9: Serving data to the outside

Providing data access to the outside (that is, to the business) generally covers the following aspects.

Offline: for example, delivering the previous day's data every day to a specified data destination (a database, files, FTP, and so on); offline data can be delivered with Sqoop, DataX, or other offline data exchange tools.

Real time: for example, an online site's recommendation system needs to fetch a user's recommendation data from the platform in real time, which demands very low latency (under 50 milliseconds). Depending on the latency requirement and the data involved, possible options include HBase, Redis, MongoDB, ElasticSearch, and so on.

OLAP analysis: besides requiring a fairly standardized underlying data model, OLAP also keeps raising the bar on query response time. Possible options include Impala, Presto, SparkSQL, and Kylin; if the data model is relatively large, Kylin is the best choice.

Ad hoc queries: ad hoc queries are fairly unpredictable, and it is usually hard to build a general data model for them, so the possible options are Impala, Presto, and SparkSQL.

With so many fairly mature frameworks and options, you need to pick the right ones based on your business requirements and your data platform's technical architecture. There is only one principle: the simpler and more stable, the better.

Once you know how to serve data well to the outside (the business), your "big data platform" should look like this:

Chapter 10: The fancy stuff, machine learning

I can only touch on this briefly, since I have not studied it in depth myself. In business, the problems solved with machine learning roughly fall into three categories:

Classification problems: including binary and multi-class classification. Binary classification solves prediction problems, such as predicting whether an email is spam; multi-class classification solves things like text categorization;

Clustering problems: roughly grouping users by the keywords they search for;

Recommendation problems: making recommendations based on users' browsing history and click behavior.

In most industries, machine learning is used to solve exactly these kinds of problems.

A route for getting started: the mathematical foundations; practical machine learning, for which it is best to know Python; and Spark MLlib, which provides packaged algorithms plus feature processing and feature selection.
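As a taste of Spark MLlib, here is a minimal sketch of the spam example mentioned above: a tiny made-up training set, a small feature pipeline, and a logistic regression classifier. Everything in it (the data, column names, and parameters) is illustrative.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Tiny made-up training set: message text and a spam label (1.0 = spam).
train = spark.createDataFrame([
    ("win a free prize now", 1.0),
    ("meeting moved to monday", 0.0),
    ("free cash click here", 1.0),
    ("lunch with the team", 0.0),
], ["text", "label"])

# Feature processing (tokenize, hash to term frequencies) plus the model.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1024)
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)

# Score a new message.
test = spark.createDataFrame([("free prize meeting",)], ["text"])
model.transform(test).select("text", "prediction").show()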

Machine learning really is the fancy, high-end part, and it is on my own learning list too. So go ahead and add a machine learning module to your "big data platform" as well.

Ready to take on big data? Then start learning, improve your skills, and strengthen your core competitiveness. Give your own future a chance.

 
