Big data architecture solutions for the future

The big data problem

By now we all understand that data is growing rapidly. Used effectively, that data yields very valuable insights, yet many traditional technologies designed 40 years ago, such as RDBMSs, are not enough to deliver the business value that the "big data" hype promises. A common use case for big data technologies is the "single view of the customer": bringing together everything known about a customer in order to maximize revenue from and service to them, for example deciding which specific promotions to send, at what time, and through which channels.

Whatever the specifics of a given big data problem, the capabilities needed to turn this potential into reality include, at a high level, at least the following:

• Consolidate information silos, external sources, and streaming data;

• Control access to the data;

• Transform the data as needed;

• Integrate the data;

• Provide tools to analyze the data;

• Publish reports on the data;

• Feed the resulting insights back into operational processes;

• Do all of this while minimizing total cost of ownership and response times.

The data lake as the answer

Many companies are turning to what some call a data lake architecture: a platform flexible enough to consolidate information silos and data streams, persist the data in a single logical location, and mine insights from data originating both inside the enterprise and from third parties. Hadoop (with Spark alongside it) has become the trend for building data lakes, for many reasons: a low total cost of ownership on commodity hardware that can be scaled out, the ability to ingest large volumes of data with schema-on-read, open source licensing, and distributed processing layers that offer SQL as well as general-purpose languages. In addition, web-scale companies such as Yahoo and Google adopted this architecture early on with great success for problems related to indexing the web.
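To make "schema-on-read" concrete, here is a minimal PySpark sketch; the HDFS path and field names are hypothetical. Raw JSON is landed in the lake as-is, and a schema is only inferred when the data is read.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Files were written to the lake without declaring any schema up front.
raw_events = spark.read.json("hdfs:///datalake/raw/customer_events/")

# Structure is discovered at read time; new fields in newer files simply
# appear as additional columns.
raw_events.printSchema()
raw_events.select("customer_id", "event_type").show(5)
```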

Data persistence options in Hadoop

Against that background, evaluating a Hadoop-based data lake seems a reasonable first step. Once you look at Hadoop more closely, you will find that the ecosystem really does seem all-inclusive, covering every aspect of data processing. For persisting data in a Hadoop data lake there are two main options: HDFS and HBase. With HDFS you decide how to encode the data in append-only files, whether JSON, CSV, Avro or something else, because HDFS is just a file system and the encoding is up to you. HBase, by contrast, is a database: it is optimized for fast writes of data in its own specific encoding, and reads are fast only when querying by primary key.
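The contrast can be sketched as follows; this is a hedged illustration in which all paths, hosts, table and column names are hypothetical, and the HBase side uses the happybase library, which talks to HBase through a running Thrift gateway.

```python
import happybase
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persistence-options").getOrCreate()
events = spark.read.json("hdfs:///datalake/raw/customer_events/")

# Option 1: HDFS -- append-only files, in whatever encoding you choose.
events.write.mode("append").json("hdfs:///datalake/curated/events_json/")
events.write.mode("append").parquet("hdfs:///datalake/curated/events_parquet/")

# Option 2: HBase -- a database optimized for fast writes and fast reads,
# but only when looking records up by their row key.
conn = happybase.Connection("hbase-thrift-host")
table = conn.table("customer_events")
table.put(b"cust_00042#2019-06-18",
          {b"e:type": b"purchase", b"e:amount": b"99.90"})
row = table.row(b"cust_00042#2019-06-18")  # primary-key lookup is fast
```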

This is the appeal of the Hadoop data lake, and it needs to be measured against real-world requirements. So can we use Hadoop alone to satisfy the high-level needs listed above? Distributed processing layers from the Hadoop ecosystem, such as Spark and Hive, are still required, but they do not require HDFS or HBase, so the persistence layer can be chosen independently of the distributed processing layer. An earlier blog post described using Spark to read and write data in MongoDB, and a very similar post showed Hive reading data in MongoDB as just another table.
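In the spirit of those posts, here is a hedged sketch of Spark reading from and writing to MongoDB. It assumes the mongo-spark-connector package is on the classpath (for example, supplied via --packages at submit time); the URI, database, and collection names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-mongodb-demo")
         .config("spark.mongodb.input.uri",
                 "mongodb://mongo-host:27017/datalake.customers")
         .config("spark.mongodb.output.uri",
                 "mongodb://mongo-host:27017/datalake.customer_summaries")
         .getOrCreate())

# Read a MongoDB collection as a Spark DataFrame.
customers = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# Any distributed transformation can run here; the result is written back
# to MongoDB so it is available for indexed, low-latency queries.
summaries = customers.groupBy("zip_code").count()
summaries.write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append").save()
```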

Indexes still matter

Most technical staff familiar with RDBMSs recognize the great value of expressive queries backed by secondary indexes that speed those queries up (even though the fixed schemas, high cost of ownership, and limited scalability of RDBMSs make them hard to use as a data lake). If we persist data only in HDFS and HBase, we cannot get the ad hoc indexing we would expect of a database, and in particular we run into the following limitations:

Ad hoc slicing: without secondary indexes, how do we efficiently analyze a slice of the data identified by anything other than the primary key? For example, our best customers, those who have spent more than some amount X. If the data set is large, finding the best customers by scanning everything will make the job grind.

Low-latency reporting: without flexible indexing, how do we respond to customers in sub-second time and provide them with valuable reports on their data? Again, we can only report quickly by the customer's account number or other primary key, not by the customer's name, phone number, zip code, spend, and so on. (Worth noting: MongoDB has just released a BI Connector for SQL-based reporting tools.)

Operationalization: again, how do we feed valuable insights back into operational applications so that the company and its customers get the most out of the data? Imagine a customer service representative (CSR) having to tell a caller that they must supply their account number before any information can be looked up, because the data lake only supports primary-key queries, or else the lookup takes ten minutes. (A small query sketch covering these cases follows below.)
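As a small illustration of all three points, here is a hedged pymongo sketch against a hypothetical customers collection with fields such as total_spend and phone; secondary indexes let each query run without scanning the whole collection.

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://mongo-host:27017/")
customers = client["datalake"]["customers"]

# Secondary indexes on non-key fields (hypothetical field names).
customers.create_index([("total_spend", ASCENDING)])
customers.create_index([("phone", ASCENDING)])

# 1. Ad hoc slice: best customers, i.e. everyone who spent more than X.
best_customers = customers.find({"total_spend": {"$gt": 10000}})

# 2. Low-latency report and 3. CSR lookup: fetch a customer by phone number
#    instead of by the primary key, in milliseconds rather than a full scan.
caller = customers.find_one({"phone": "+1-555-0100"})
```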

Of course, some of these problems can be solved in other ways, but at the price of higher cost of ownership, more development and operations work, and higher latency. For example, a search engine or materialized views can stand in for queries that are not by primary key; but then you have to go back to the database and query the master table again to retrieve the complete records. Besides doubling the latency, this requires additional administration, development, and infrastructure for a separate search engine, plus maintenance of the materialized views, plus the consistency problems created by writing the same data in more than one place. Wouldn't it be better to stay within our design principles and simply use the ordinary, flexible indexing we are used to?

MongoDB as an important part of an effective data lake

We began by asking whether Hadoop alone could meet the needs of a data lake, and found at least three problems. Can we add a persistence layer to the architecture that addresses these issues while staying consistent with the design principles stated earlier: low total cost of ownership on commodity hardware, open source, schema-on-read, and the distributed processing layers of Hadoop?

This is the theme of this article: MongoDB is the database best placed to fill that remaining seat in a Hadoop-only data lake. Look at the other open source NoSQL databases and you will find almost no secondary indexes (where they exist they are separate structures that can fall out of sync with the data), and no aggregation with grouping. You can write some of the lake's data into such a database, but if the business needs to read it back flexibly through secondary indexes, it cannot. And if you want to put an open source RDBMS in the data lake then, as already noted, its fixed schema and expensive vertical scaling run counter to our data lake design principles.
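As an example of the aggregation-with-grouping capability referred to above, here is a hedged sketch using MongoDB's aggregation pipeline through pymongo; the orders collection and its fields are hypothetical.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-host:27017/")
orders = client["datalake"]["orders"]

# Group completed orders by customer and total their spend, entirely inside
# the database, then return the top ten customers.
pipeline = [
    {"$match": {"status": "complete"}},
    {"$group": {"_id": "$customer_id", "total_spend": {"$sum": "$amount"}}},
    {"$sort": {"total_spend": -1}},
    {"$limit": 10},
]
top_customers = list(orders.aggregate(pipeline))
```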

Therefore, the following data architecture is recommended for building the data lake.

In this architecture, MongoDB joins the data lake as the persistence layer for any data set that needs expressive queries, which corresponds to the three index-related needs cited above. I recommend a governance function that decides, based on consumer demand for the data, whether each data set is published to HDFS, to MongoDB, or to both. Distributed processing jobs such as Hive and Spark can run whether the data is stored in HDFS or in MongoDB. When the data is in MongoDB, however, filter criteria are pushed down into the database rather than scanning files as with HDFS, so analyses of ad hoc slices of the data run efficiently. Likewise, the data in MongoDB can be used for real-time, low-latency reporting, and it provides the operational data platform for all the applications built on the system.
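To illustrate the pushdown point, the sketch below runs a Spark analysis over a slice of data stored in MongoDB; the connector can push simple predicates down into the database so that only the matching documents are pulled into the job. The connector package, URI, and field names are assumptions.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ad-hoc-slice-analysis")
         .config("spark.mongodb.input.uri",
                 "mongodb://mongo-host:27017/datalake.customers")
         .getOrCreate())

customers = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# The filter is expressed in Spark, but it can be pushed down to MongoDB so
# that only the relevant slice is scanned and returned.
big_spenders = customers.filter("total_spend > 10000")
big_spenders.groupBy("zip_code").avg("total_spend").show()
```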

Today, some companies simply copy data into Hadoop for transformation and then copy it somewhere else to do the valuable work. Why not get the most value out of the data lake itself? With MongoDB, that value can be multiplied many times over.

Conclusion

Look at your long-term and short-term requirements and make sure they can be met with the best tools for the job; in an environment that pairs a core Hadoop distribution with MongoDB, a data lake becomes both valuable and feasible for you. Some companies that built data lakes spent a year just cleansing all their data and writing it to HDFS, hoping to extract value from it later, only to find, disappointingly, no value in the result: in effect they had built just another batch layer sitting between the data and its consumers.

By combining Hadoop and MongoDB, the data lake can be made a success: a flexible data platform that maintains a low total cost of ownership while delivering the fastest responses to all its users, whether data scientists, analysts, business users, or the customers themselves. With such a data lake, companies and their employees can extract unique insights, engage customers effectively, monetize their data, and beat the competition.
