Hadoop big data ecosystem and common components (Shandong Shumanjianghu)

After years of informatization, we have entered a remarkable era of "big data". Whether it is the WeChat messages, QQ chats, phone calls, and text messages of our social lives, or the group purchases, e-commerce orders, and mobile payments of our daily eating, drinking, and entertainment, everything is constantly generating massive amounts of information and data. Data has become inseparable from our work and our lives.

 

>>>>

What is big data

 

What is big data, and how big is "big"? Is 100 GB big? If it is used to store 1080p high-definition movies, that is only the capacity of a few films. But if the 100 GB is all text data, such as the data in the Kafka backend of Cloud Wisdom Perspective, where a single record pulled from its mobileTopic looks like this: [107, 5505323054626937, LAN, LAN, unknown, 0, 0, 09f26f4fd5c9d757b9a3095607f8e1a27fe421c9, 1468900733003], then you can imagine how many multiples of 100 GB this kind of data adds up to.

Data is "big" not only because of its sheer volume, but also because it comes from many channels: standard data generated by IT systems as well as a large amount of non-standard data such as multimedia. The data types are varied, and a lot of useless data is mixed in, which greatly affects the authenticity of the data. In addition, much of the data is most valuable only when it is processed in real time.

Generally, when the data volume is very large or the business is complex, conventional technologies cannot process that much data in a timely and efficient manner. This is where Hadoop comes in. Hadoop is a distributed system infrastructure developed by the Apache Foundation: with it, you can write and run distributed applications that take advantage of a cluster to process large-scale data, without worrying about the low-level details of distribution. A Hadoop cluster can be built on cheap machines, such as retired PC servers or rented cloud hosts.

Today, Li Lin from Cloud Wisdom will introduce some commonly used components in the Hadoop ecosystem.

A Gartner study estimated that by 2015, 65% of analytical applications and advanced analytics tools would be based on the Hadoop platform. As the mainstream big data processing technology, Hadoop has the following characteristics:

• Convenient: Hadoop runs on large clusters of general commodity machines or on cloud computing services.

• Robust: Hadoop is designed to run on commodity hardware; its architecture assumes that hardware will fail frequently, and it can handle most of these failures gracefully.

• Scalable: Hadoop can scale linearly to handle larger datasets by adding cluster nodes.

The areas where Hadoop is currently used the most are:

1) Search engines: Doug Cutting's original motivation for designing Hadoop was to quickly build indexes for large-scale web pages.

2) Big data storage: using Hadoop's distributed storage capability, for example for data backup and data warehousing.

3) Big data processing: using Hadoop's distributed processing capability, for example for data mining and data analysis.

 

>>>>

Hadoop Ecosystem and Basic Components

 

Hadoop 2.0 introduced HA (high availability) and YARN (resource scheduling), which is its biggest difference from 1.0. Hadoop is mainly composed of three parts: the MapReduce programming model, HDFS distributed file storage, and YARN resource scheduling.

The Hadoop ecosystem is organized with HDFS at the bottom layer as data storage; the other components are built on top of, or combined with, HDFS. HDFS has the advantages of high fault tolerance, suitability for batch processing and big data processing, and the ability to run on cheap machines. Its weaknesses are low-latency data access, storing large numbers of small files, concurrent writes, and random file modification.
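As a minimal sketch (not in the original article), the snippet below uses the standard HDFS Java FileSystem API to write a file once and read it back; the file path is hypothetical, and the cluster address comes from the local Hadoop configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hdfs-demo.txt"); // hypothetical path

        // Write once: HDFS is write-once/append-oriented, random modification is not supported
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read the file back as a stream
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```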

Hadoop MapReduce is a software framework for easily writing applications that run on large clusters of thousands of commodity machines and process massive, terabyte-scale datasets in parallel in a reliable, fault-tolerant manner. The key phrases in this definition — software framework, parallel processing, reliable and fault-tolerant, large clusters, massive datasets — are exactly the characteristics of MapReduce.

MapReduce classic code (wordCount)
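The original article shows this code as a screenshot; below is a sketch of the classic WordCount example following the standard Hadoop MapReduce API, with the input and output paths taken from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```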

The code above takes a pile of text data and counts how many times each word occurs in it. MapReduce is also a computing model: when the data volume is large, say 10 GB, it can split the data into 10 pieces, distribute them to 10 nodes for processing, and then aggregate the results. This is parallel computing, and it is much faster than computing everything on a single machine.

 

>>>>

HBase

 

Now that the main components of Hadoop have been introduced, let's look at HBase. HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system; it can be used to build large-scale structured storage clusters on cheap PC servers. HBase is an open-source implementation of Google Bigtable. Just as Google Bigtable uses GFS as its file storage system, HBase uses Hadoop HDFS as its file storage system; just as Google runs MapReduce to process the massive data in Bigtable, HBase uses Hadoop MapReduce to process the massive data in HBase; and just as Google Bigtable uses Chubby as its coordination service, HBase uses ZooKeeper as its counterpart.

Some people ask what the relationship between HBase and HDFS is. HBase uses HDFS as its storage, much like MySQL and a disk: MySQL is the application, and the disk is the underlying storage medium. Because of its own characteristics, HDFS is not suitable for random lookups and is not very friendly to update operations. For example, Baidu Netdisk is built on HDFS: it supports uploading and deleting files, but it does not allow users to directly modify the content of a file already on the Netdisk.

HBase tables have the following characteristics:

1) Large: a table can have hundreds of millions of rows and millions of columns.

2) Column-oriented: storage and permission control are organized by column (family), and columns (families) are retrieved independently.

3) Sparse: empty (NULL) columns take up no storage space, so tables can be designed to be very sparse.

The access methods provided by HBase include the command-line shell, the Java API (the most efficient and most commonly used), and a Thrift gateway that supports C++, PHP, Python, and other languages.
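As a minimal sketch of the Java API — the ZooKeeper address, table name, and column family below are assumptions, and the table is assumed to already exist — a random write and a random read by row key look like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        // ZooKeeper quorum address is an assumption; adjust to your cluster
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("order_history"))) { // hypothetical table

            // Random write: one row keyed by order id
            Put put = new Put(Bytes.toBytes("order-0001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("amount"), Bytes.toBytes("99.00"));
            table.put(put);

            // Random read by row key
            Result result = table.get(new Get(Bytes.toBytes("order-0001")));
            byte[] amount = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("amount"));
            System.out.println("amount = " + Bytes.toString(amount));
        }
    }
}
```

The hypothetical table would first be created, for example in the HBase shell with create 'order_history', 'info'.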

HBase usage scenarios:

• Need to perform random read or random write operations on data;

• High concurrent operations on big data, such as thousands of operations per second on petabyte-level data;

• Read and write access patterns are simple operations, such as looking up history records or historical orders, or querying the call and traffic records kept by the three major telecom operators.


Figure: application scenarios of HBase at Taobao

 

>>>>

Hive

 

We talked about the MapReduce computing model above, but only people who know Java can write that kind of code. If you don't know Java, can you still use Hadoop's computing model? For example, with massive data sitting in HDFS, what should a data analyst do to pull some data out? This is where Hive comes in: it provides SQL-style access for exactly these people.

Hive is an ETL (Extraction-Transformation-Loading) tool open-sourced by Facebook, originally built to solve the problem of statistics over massive structured log data. Hive is a data warehouse platform built on Hadoop; its design goal is to let people operate on data in Hadoop with traditional SQL, so that anyone familiar with SQL programming can also embrace Hadoop. (Note: it is a data warehouse, not a database.)

• Use HQL as the query interface

• Use HDFS as the underlying storage

• Use MapReduce as the execution layer

Therefore, Hive is a data warehouse tool based on Hadoop. It was born to simplify MapReduce programming and is very well suited to the statistical analysis of data warehouses. It parses SQL, converts it into MapReduce jobs, and forms a DAG (directed acyclic graph) of jobs for execution.
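As a minimal sketch — assuming a HiveServer2 instance at localhost:10000 and a hypothetical word_log table — submitting HQL from Java through the Hive JDBC driver looks like this:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 address, user, database, and table are assumptions
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // An HQL query; Hive compiles it into one or more MapReduce jobs
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS cnt FROM word_log GROUP BY word ORDER BY cnt DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```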

 

>>>>

Flume

 

Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume supports customizing all kinds of data senders in the log system to collect data; at the same time, it can do simple processing on the data and write it to various (customizable) data receivers.

Currently, there are two lines of Flume versions: the 0.9.x versions, collectively called Flume-og, and the 1.x versions, collectively called Flume-ng. Flume-ng has undergone major refactoring and is very different from Flume-og, so be careful to distinguish between them.

Flume is a data pipeline that supports many sources and sinks (targets), and it is very similar to the Suro component used by Cloud Wisdom Perspective. For example, to pull nginx logs, a simple configuration of this tool is enough; of course, a Flume agent has to be configured and started on each nginx server.

Let's take a look at the configuration file for writing Kafka data to HDFS (a sketch is shown below). The configuration is very simple, and it completely removes the work of writing a Kafka consumer and then calling the HDFS API to write the data.
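The original configuration is shown as a screenshot; the sketch below follows the Flume-ng properties format for a Kafka source and an HDFS sink, with the broker address, agent name, and HDFS path as assumptions (the topic name reuses the mobileTopic mentioned earlier):

```properties
# Hypothetical agent "a1": Kafka source -> memory channel -> HDFS sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Kafka source: consume records from the mobile topic
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = localhost:9092
a1.sources.r1.kafka.topics = mobileTopic
a1.sources.r1.channels = c1

# Memory channel buffers events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink: write events into a date-partitioned directory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/mobile/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

The agent can then be started with something like flume-ng agent --name a1 --conf-file kafka-to-hdfs.properties.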

 

>>>>

YARN

 

YARN is the resource management system in Hadoop 2.0. Its basic design idea is to split the JobTracker of MRv1 into two independent services: a global resource scheduler, the ResourceManager, and a per-application manager, the ApplicationMaster. The scheduler is a "pure scheduler": it no longer participates in any work related to specific application logic, but only allocates resources according to each application's requirements. Resource allocation is expressed with an abstraction called a "Container", which encapsulates memory and CPU. In addition, the scheduler is a pluggable component; users can design new schedulers to suit their own needs, and YARN itself provides the Fair Scheduler and the Capacity Scheduler.
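As a minimal sketch of the Container abstraction (memory plus virtual cores) using the YARN client API — it only requests resources for a hypothetical application and omits the ApplicationMaster launch context, so it is not a complete submission:

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnResourceDemo {
    public static void main(String[] args) throws Exception {
        // Talks to the ResourceManager configured in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application id
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("resource-demo"); // hypothetical name
        ctx.setQueue("default");

        // A Container is just memory + CPU: 1024 MB and 1 virtual core here
        ctx.setResource(Resource.newInstance(1024, 1));

        // A real application would also set an AM ContainerLaunchContext via
        // ctx.setAMContainerSpec(...) and then call yarnClient.submitApplication(ctx);
        // that part is omitted here.

        yarnClient.stop();
    }
}
```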

The applications manager is responsible for managing all applications in the entire system: accepting application submissions, negotiating resources with the scheduler to start each application's ApplicationMaster, monitoring the running status of the ApplicationMaster, and restarting it if it fails.

 

>>>>

Ambari

 

Ambari is a cluster installation and management tool. Cloud Wisdom used plain Apache Hadoop before: the operations team installed from source packages, changed the configuration files one by one, and distributed them to every node; if any step in the middle went wrong, the whole cluster would not start. For this reason, several vendors provide Hadoop installation and management platforms, mainly CDH and HDP. Many people in China use CDH, which belongs to Cloudera; if you install a cluster through its management interface and the number of nodes exceeds a certain limit, you have to pay.

Ambari is a top-level Apache open-source project, free to use, and widely adopted now. Ambari uses Ganglia to collect metrics and Nagios for system alerting, sending emails to administrators when something needs their attention (for example, when a node goes down or runs out of disk space).

 

>>>>

ZooKeeper

 

As the number of computing nodes grows, cluster members need to synchronize with each other and know where to access services and how they are configured; ZooKeeper was born for this. ZooKeeper, as its name implies, is a zookeeper: an administrator for managing the elephant (Hadoop), the bee (Hive), and the pig (Pig). ZooKeeper is used in projects such as Apache HBase, Apache Solr, and LinkedIn's Sensei. ZooKeeper is a distributed, open-source coordination service for distributed applications; based on the Fast Paxos algorithm, it implements services such as synchronization, configuration maintenance, and naming for distributed applications.
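As a minimal sketch — assuming a ZooKeeper ensemble at localhost:2181 and a hypothetical znode — configuration maintenance with the Java client looks like this:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (address is an assumption)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Configuration maintenance: publish a value under a hypothetical znode
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=128".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node in the cluster can read the same value back
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
```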

 

>>>>

Other components

 

The components above are the more commonly used, mainstream parts of Hadoop for computing and querying. For the rest of the ecosystem, a brief understanding is enough:

Pig is a programming language that simplifies common Hadoop tasks. Pig provides a higher level of abstraction for processing large datasets; compared with MapReduce, it offers richer data structures, generally multi-valued and nested ones.

Mahout is the Hadoop ecosystem's machine learning library. It supports relatively few algorithms, but it does include commonly used ones such as k-means clustering and classification. It is implemented on MapReduce, which is not very good at this kind of workload, so Mahout's authors have since switched to the Spark ML camp.

Sqoop is a database ETL tool for importing data from relational databases into Hadoop and related systems such as Hive and HBase. Sqoop's core design idea is to use MapReduce to speed up data transfer; in other words, its import and export functions are implemented as MapReduce jobs, so it transfers data in batches and is not suited to real-time import and export. For example, the earlier business data of Cloud Wisdom Monitor all lived in MySQL; as the data volume kept growing, the data could be imported into HBase directly with Sqoop.

Everything introduced in this article is used for offline computing; the previously published "Facing Big Data Challenges: How to Use Druid to Achieve Data Aggregation" covered Druid, a framework for real-time computing. The commonly used stream computing frameworks for big data are mainly Storm, Spark Streaming, and Flink; although Flink joined the Hadoop ecosystem in 2014, not many people use it in production yet, and everyone seems to be taking a wait-and-see attitude.

Consider the difference between stream computing (Druid, Spark Streaming) and batch processing (MapReduce, Hive) through personalized advertising on e-commerce sites. When you search Amazon for a laptop, it recommends many laptop links to you: your requests and preferences are received by Amazon's servers in real time, and stream-computing analysis then recommends things you might buy at that moment. If this were done with batch processing, the server would collect the data and take half an hour to figure out that you might want to buy a computer; by then it is obviously inappropriate to recommend a computer, because you may already be searching for an electric frying pan...


 

Finally, a word about big data workflows. Suppose there are two dependent MapReduce jobs: the first must finish before the second can run, which requires a scheduling tool. MapReduce does provide a scheduling API, but it takes a lot of code; my own version of just one such dependency ran to roughly 150 lines. This is where workflow tools come in, managing all of our jobs. The ones I currently know are Oozie and Azkaban; Oozie's configuration is more flexible, and it is the one I recommend.
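As a rough illustration (not the author's original code), the JobControl API that ships with Hadoop MapReduce can express such a dependency; job1 and job2 stand in for two fully configured jobs:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class DependentJobsDemo {
    // job1 and job2 are assumed to be fully configured MapReduce jobs
    public static void runChain(Job job1, Job job2) throws Exception {
        ControlledJob first = new ControlledJob(job1.getConfiguration());
        first.setJob(job1);

        ControlledJob second = new ControlledJob(job2.getConfiguration());
        second.setJob(job2);
        second.addDependingJob(first); // second runs only after first succeeds

        JobControl control = new JobControl("dependent-jobs");
        control.addJob(first);
        control.addJob(second);

        // JobControl runs in its own thread and polls job states
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}
```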
