Big Data Frameworks: The Hadoop Ecosystem

1.5 Developing Enterprise Applications with Hadoop

To meet the new challenges posed by big data, we need to rethink how data analysis applications are built. The traditional approach of storing data in a database and building an application around it no longer works for big data processing, mainly because:

Traditional applications are built on transactional database processing, which Hadoop does not support.

As the amount of data stored in Hadoop grows, real-time applications can access only a portion of the data in the cluster.

Hadoop's massive storage capacity lets you keep multiple versions of a data set, rather than overwriting the original data as traditional approaches do.

Figure 1-2 shows a typical Hadoop-based enterprise application. These applications include a data storage layer, a data processing layer, a real-time access layer, and a security layer. Implementing this architecture requires understanding not only the APIs of the Hadoop components involved, but also their capabilities and limitations, and the role each component plays in the overall architecture.

As Figure 1-2 shows, the data storage layer holds both source data and intermediate data. Source data comes mainly from external sources, including enterprise applications, external databases, logs, and other data feeds. Intermediate data is produced by Hadoop execution; it is used by Hadoop real-time applications and delivered to other applications and end users.

 

Figure 1-2: A Hadoop-based enterprise application

Different mechanisms can be used to move source data into Hadoop, including Sqoop, Flume, mounting HDFS directly as a network file system (NFS), and Hadoop real-time services and applications. In HDFS, new data does not overwrite existing data; instead, it becomes a new version of the data. This matters because HDFS is a "write-once" file system.
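As a rough illustration of the write-once idea, the sketch below models a versioned data set in plain Java: every write lands as a new version alongside the old data instead of replacing it. The class and path layout (`WriteOnceStore`, `/v-N`) are invented for this example; this is not HDFS's actual API.

```java
import java.util.*;

public class WriteOnceStore {
    // In-memory stand-in for a write-once file system: each logical
    // dataset keeps every version; writes never replace earlier data.
    private final Map<String, List<String>> versions = new HashMap<>();

    // Each append creates a new version directory-style path.
    public String append(String dataset, String payload) {
        List<String> v = versions.computeIfAbsent(dataset, k -> new ArrayList<>());
        v.add(payload);                          // old versions are kept
        return dataset + "/v-" + (v.size() - 1); // path of the new version
    }

    public int versionCount(String dataset) {
        return versions.getOrDefault(dataset, Collections.emptyList()).size();
    }

    public static void main(String[] args) {
        WriteOnceStore store = new WriteOnceStore();
        System.out.println(store.append("/data/logs", "day-1 load"));
        System.out.println(store.append("/data/logs", "day-2 load"));
    }
}
```

In a real deployment the same convention shows up as timestamped or versioned directories under a data set's root path, with downstream jobs selecting the version they need.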

In the data processing layer, Oozie-coordinated jobs preprocess the source data and convert it into intermediate data. Unlike the source data, intermediate data is not versioned: it is overwritten whenever it is regenerated, so there is never more than one version and the overall volume of intermediate data stays modest.

In the real-time access layer, Hadoop applications support both direct real-time data access and access based on data sets. These applications read Hadoop-based intermediate data and store source data in Hadoop. They can serve users directly, or be used by other enterprise applications that integrate with Hadoop.

Source data is used for storing and initially processing data, while intermediate data is used for delivery and integration. Because the source data and intermediate data use completely separate structures, developers can build applications of virtually any complexity without any transaction-processing requirements. Preprocessing into intermediate data also significantly reduces the volume of data that must be served, which makes real-time data access more flexible.

HADOOP EXTENSIBILITY

Many articles claim that Hadoop hides the underlying complexity from developers. In fact, those articles fail to appreciate how extensible Hadoop really is.

Hadoop's implementation is designed so that developers can easily and seamlessly plug new functionality into Hadoop's execution. Hadoop designates specific libraries to be responsible for different phases of MapReduce execution. By customizing these phases for their specific problems, developers can ensure that every job runs at the lowest cost and highest performance.

You can customize the following aspects of Hadoop execution:

How Hadoop parallelizes a problem, including how the input is split and where execution takes place

Support for new input data types and locations

Support for new output data types

Custom output data locations
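One concrete example of this kind of customization point is how keys are assigned to reduce tasks. The sketch below is plain Java with an invented class name (`SimplePartitioner`), not Hadoop's API, but the arithmetic mirrors the logic of Hadoop's default hash partitioning; a custom partitioner simply replaces this function with its own key-to-partition rule.

```java
public class SimplePartitioner {
    // Mirrors the logic of default hash partitioning: clear the sign
    // bit of the key's hash, then take the modulus over the number of
    // reduce tasks (partitions). All records with the same key land
    // in the same partition, so they reach the same reducer.
    public static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("user-42", 4));
    }
}
```

A custom rule here — for example, partitioning by a key prefix or by a range — is how developers control data skew and co-location across reducers.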

Part of this book builds on the work of others, and describes specific customization approaches and implementations.

This book covers all of the major layers of Hadoop enterprise applications shown in Figure 1-2.

Chapter 2 describes approaches to building the data layer, including HDFS and HBase (both architecture and APIs). It then compares the two and discusses how to combine HDFS and HBase. It also introduces Avro (Hadoop's new serialization framework) and its role in storing and accessing data. Finally, you will learn about HCatalog, and how to use it to advertise and access data.

Data processing is discussed extensively throughout this book. For application data processing, we recommend using MapReduce and Oozie.

WHY DOES THIS BOOK FOCUS ON MAPREDUCE CODE?

You might wonder why this book focuses on MapReduce code, rather than on the higher-level languages that make MapReduce programming simpler. You can find plenty of discussion of this question online and in the Hadoop community. The explanation usually given in those discussions is that MapReduce code is typically much larger (in lines of code) than Pig code providing the same functionality. Although that is an indisputable fact, there are other factors to consider:

Not everything can be expressed in a high-level language. Some tasks are better suited to traditional Java code.

If the code you write will be executed only once, the number of lines of code may matter most to you. But if you are writing enterprise applications, you should consider other criteria, including performance, reliability, and security. MapReduce code typically offers more ways to achieve these qualities.

Moreover, MapReduce's customization mechanisms give users ways to further improve the performance, reliability, and security of their applications.

In Chapter 3, you will learn the MapReduce architecture, its main components, and its programming model. The chapter covers MapReduce application design, design patterns, and MapReduce pitfalls. It also discusses how MapReduce execution is actually carried out. As mentioned, one of MapReduce's strongest features is that its execution can be customized; Chapter 4 introduces the details of the customization options, with examples. Chapter 5 continues the MapReduce discussion, demonstrating by example how to build reliable MapReduce applications.
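Before Chapter 3 covers the real architecture, the programming model itself can be made concrete with a minimal in-memory word count in plain Java. This is only a sketch of the map, shuffle, and reduce flow — the class and method names are invented, and there are no Hadoop dependencies — but the data flow is the same one a Hadoop job follows.

```java
import java.util.*;

public class MiniMapReduce {
    // Map phase: for each input line, emit a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+"))
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // Shuffle groups emitted values by key; reduce sums each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) sum += v;
            result.put(g.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not to be")));
    }
}
```

In a real Hadoop job the map and reduce functions look much like these, but the framework distributes them across the cluster and handles the shuffle, fault tolerance, and I/O.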

Although MapReduce itself is very powerful, practical solutions typically require combining multiple MapReduce applications, which can be quite complex. Using Hadoop's Workflow/Coordinator engine can greatly simplify the integration of MapReduce applications.

THE VALUE OF OOZIE

Oozie is probably the most underrated component of Hadoop. Few (if any) Hadoop books discuss this very important component. This book not only shows what Oozie can do, but also provides an end-to-end example demonstrating how to use Oozie functionality to solve real problems. Like the rest of Hadoop, Oozie is highly extensible, and developers can extend its functionality in various ways.

One of the most easily underestimated challenges in Hadoop is integrating Hadoop execution with the rest of the enterprise's business processes. Oozie can be used to coordinate MapReduce applications and to expose Hadoop processes through public Oozie APIs, which makes it much easier to integrate Hadoop processing with the rest of the enterprise's business processing.

Chapter 6 describes what Oozie is, along with its architecture, main components, programming languages, and execution model. To better explain the capabilities and role of each Oozie component, Chapter 7 builds an end-to-end Oozie application that solves a practical problem. Chapter 8 describes Oozie further through a number of advanced features, including custom Workflow activities, dynamic Workflow generation, and support for uber JAR files (JAR files containing all of their package dependencies).

Chapter 9 explains the real-time access layer. It begins with a real-world example of a Hadoop real-time application, then presents an overall architecture for implementing one. It introduces the three main approaches to building such implementations: HBase-based applications, real-time query, and stream processing. The chapter describes the overall HBase architecture and provides two examples of HBase-based real-time applications. It then describes a real-time query architecture and discusses two concrete implementations, Apache Drill and Cloudera's Impala, including a comparison of real-time query with MapReduce. Finally, you will learn about Hadoop-based complex event processing and two concrete implementations, Storm and HFlame.

Developing enterprise applications requires substantial planning, as well as an information security strategy. Chapter 10 focuses on the Hadoop security model.

With the growth of cloud computing, many companies are trying to run Hadoop in the cloud. Chapter 11 focuses on running Hadoop applications on Amazon's cloud using EMR, and introduces additional AWS services (such as S3) that can complement Hadoop's functionality. It also describes different approaches to running Hadoop in the cloud and discusses best practices.

Beyond Hadoop's own security concerns, Hadoop often needs to integrate with other enterprise components to import and export data. Chapter 12 focuses on how to safeguard enterprise applications that use Hadoop, and provides examples and security best practices for running Hadoop enterprise applications safely.

1.6 Summary

This chapter gave a high-level overview of the relationship between big data and Hadoop. It introduced big data and its value, as well as the challenges it poses for companies, including data storage and processing. You also learned about Hadoop and its history.

You learned about Hadoop's characteristics, and why Hadoop is so well suited to big data processing. This chapter also surveyed Hadoop's main components, with examples demonstrating how Hadoop simplifies data science and the creation of enterprise applications.

This chapter introduced the basics of Hadoop distributions, and why many companies tend to choose a particular vendor's distribution: they do not want to deal with Apache project compatibility issues, or they need the vendor's technical support.


Origin blog.csdn.net/chengxvsyu/article/details/92430903