"Offline and Real-time Big Data Development Practical Combat" (1) Building a big data development knowledge system map

Preface

By reading this book, you can build your own big data development knowledge system and map, master the various technologies of data development (including the related concepts, principles, architectures, and practical development and optimization skills), and use it as guidance and reference for data development in real projects. Personally, I think this book by Mr. Bangzhong is excellent and well worth reading (•̀ ω • ́ )✧


Next, I will summarize and study the book from the perspectives of offline data processing technology, real-time data processing technology, data development optimization, big data modeling, and the construction of a layered data system.

Big picture

This part looks at data from an overall perspective and, following the four major processes from data collection to consumption, introduces and sketches the relevant data technologies.

  • Data is crude oil; data is the means of production; everything is driven by data and technology. Mankind is moving from the IT era to the DT era, and the strategic value of data is increasingly recognized. More and more companies, institutions, and organizations, especially Internet companies, have built their own data platforms.

  • Whether self-developed on top of open source technology or purchased as a mature commercial solution, whether hosted in a private data center or on a public cloud, whether built by an in-house team or outsourced, data platforms have been built one after another. These platforms not only physically carry all data assets, but have also become the daily work platform and environment for data development engineers, data analysts, algorithm engineers, business analysts, and other data-related staff.

  • The data platform is therefore the key infrastructure for "seeing data" and "using data" within a company or organization. It is as indispensable as water, electricity, and coal, and it is what makes monetizing data possible.

1. Data flow

Whether it is fashionable big data or the traditional data warehouse that preceded it, whether it is the most widely used offline data or the increasingly important real-time data, the end-to-end flow consists of four major processes: data generation, data collection and transmission, data storage and processing, and data application. The overall data flow and its key links are shown in the figure.

Big picture of data flow

1.1 Data generation

Data generation is the source of the data platform. According to the type of source system, the sources of data can be divided into the following categories.

(1) Business system

A business system is an IT system that supports the core business of an enterprise, or that is used by internal staff to keep the enterprise running normally, such as a supermarket's POS sales system, an ERP system for order/inventory/supply-chain management, a CRM system for customer relationship management, financial systems, and various administrative systems. The data behind these systems is generally stored in their back-end databases.

(2) Web system

A web system also has a back-end database for storing various structured data, but in addition it produces various user behavior logs, such as how a user reached the website (via a search engine, by typing the URL directly, by jumping from another system, etc.) and what the user did on the site (which pages were visited, which buttons were clicked, how long they stayed). Through cookies and various front-end tracking ("embedding") technologies, these user behaviors can be recorded and saved to the corresponding log files.

(3) Mobile apps, external systems, manually organized data, etc.

1.2 Data collection and transmission

The data files, log files, and tracking logs generated by business systems, web systems, mobile apps, and so on are scattered across various systems and servers. Only with the help of data collection and transmission tools and systems can they be aggregated into a central place for correlation and analysis.

What needs extra attention here is timeliness. It is no exaggeration to say that data collection and transmission tools and systems are key infrastructure in the era of big data.

1.3 Data storage and processing

After collection and synchronization, the data is raw and messy. It must be cleaned, correlated, and standardized, then carefully organized and modeled, and it must pass data quality checks before it can be used for downstream data analysis or data services. This is the third key link of data platform construction: data storage and processing.

This is also the most exciting and fastest-blooming field in the data space: a wide variety of open source frameworks and innovations emerge endlessly and keep changing. According to the timeliness required by downstream data users, data storage and processing tools and technologies can be divided into offline processing, near-line processing, and real-time processing, and the processed data is correspondingly stored in the offline data warehouse, the near-line data storage area, and the real-time data storage area.

Offline processing generally handles data on a daily basis. After the data collected and synchronized in the early morning of each day is in place, the data processing tasks are invoked in turn according to the pre-designed ETL logic (extraction, transformation, loading; generally used to refer to data cleaning, correlation, standardization, and similar processing steps) and the topological dependencies between the ETL tasks, and the final data is written into the offline data warehouse. The data in the offline data warehouse is usually carefully organized according to some modeling methodology (most commonly dimensional modeling), which makes it intuitive and convenient for downstream users and also makes the data processing pipeline easy to extend and modify.
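To make the scheduling idea concrete, here is a minimal sketch (not from the book) of how ETL tasks can be invoked in turn according to their topological dependencies, using only the Python standard library; the task names and the dependency graph are hypothetical:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical daily ETL tasks and their dependencies:
# a task runs only after every task it depends on has finished.
dag = {
    "load_orders_ods": set(),
    "load_users_ods": set(),
    "build_orders_dwd": {"load_orders_ods"},
    "build_daily_report": {"build_orders_dwd", "load_users_ods"},
}

# static_order() yields the tasks in a valid execution order.
for task in TopologicalSorter(dag).static_order():
    print(f"running {task} for partition dt=2024-01-01")
```

A production scheduler does the same thing at scale, adding retries, alerting, and cross-day dependencies.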

With the publication of Google's three papers on distributed computing (GFS, MapReduce, and BigTable) and the rise of the open source Hadoop ecosystem built in their image (HDFS, Hadoop MapReduce, and HBase), the era of big data truly arrived.


Today, the use of data in the big data era is no longer limited to offline analysis: real-time data is becoming more and more important, and this can only be achieved with professional stream computing tools and frameworks. Currently the most popular and widely used are Spark Streaming and Flink.

While using these open source frameworks, major vendors at home and abroad have combined them with their own practice to improve and extend these stream computing frameworks at different levels, such as stability, scalability, and performance. In my view, the most revolutionary advance among these is the emergence of the SQL abstraction layer. With it, real-time developers no longer need to write Java or another programming language to implement real-time processing logic, which not only greatly accelerates real-time development but also greatly lowers its barrier to entry.
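As an illustration of what the SQL abstraction layer buys you, here is a minimal PyFlink sketch, assuming a local PyFlink installation; the table and field names are made up, and Flink's built-in datagen connector stands in for a real message source:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# A table environment in streaming mode.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Define a demo source with the built-in datagen connector.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id BIGINT,
        url     STRING
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '5')
""")

# The continuous query itself is plain SQL -- no Java code required.
t_env.execute_sql(
    "SELECT user_id, COUNT(url) AS click_cnt FROM clicks GROUP BY user_id"
).print()
```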

1.4 Data application

Careful data instrumentation, massive offline data synchronization, millisecond-level real-time data collection, tedious data processing, and meticulous data modeling all lay a solid foundation for using data, but the ultimate value of data is realized in the data application link.

The most widespread way to apply data is to "look at" it: the business daily, weekly, and monthly reports viewed regularly by decision-makers and managers; the operational metrics and reports viewed by front-line operations staff; the data analysis reports that analysts produce as references for business decisions and operations; and the ad hoc analysis done from time to time by analysts and business staff.

These data reports help corporate managers, product and project managers, and front-line operations staff locate problems, hidden risks, and directions in the company's products and projects, take corrective measures early, or increase investment once the right trend is confirmed. It is hardly an exaggeration to say that an enterprise's ability to "see" its data represents its level of data application capability and is one of its core competencies.

With the advent of the big data era and the wave of artificial intelligence, data is no longer limited to being "seen". Google's search box, Taobao's "thousand people, thousand faces" personalized recommendation system, and the news recommendation app Toutiao (Today's Headlines) all represent the success of combining data with algorithms and highlight the power of data + algorithms. With the help of data mining and machine learning algorithms, deep learning algorithms, and online data services, data has become part of online production systems.

2. Data technology

This part mainly introduces the offline and real-time data platform architectures and related technologies from the perspective of the data platform.

The main open source technologies and frameworks of the current big data ecosystem
Big data technologies today can be said to be in full bloom, yet they are all interrelated. However these technologies change and however novel the names become, each belongs to one of the specific processes and links mentioned above. There are many other open source frameworks besides these, which will not be enumerated one by one here.

But it is all these data technologies together that constitute the current big data ecosystem. The technologies overlap, borrow from, and inspire one another; in fact, many of them, even down to their basic principles, are quite similar, and they stand apart only for commercial, community, or even personal reasons. Perhaps it is precisely this that has fostered the thriving prosperity of the whole big data ecosystem, just as the poem says: "A single flower does not make spring; a garden full of blossoms does."

2.1 The main technology of data collection and transmission

Data collection and transmission tools and technologies fall into two categories: offline batch processing and real-time collection and transmission. As the name implies, offline batch processing collects and exports data in batches. The best-known and most commonly used offline batch tool is Sqoop, whose downstream users are mainly offline data processing platforms (such as Hive). The most commonly used tools for real-time collection and transmission are Flume and Kafka, whose downstream users are generally real-time stream processing platforms such as Storm, Spark Streaming, and Flink.

(1) Sqoop

As an open source offline data transfer tool, Sqoop is mainly used to move data between Hadoop (Hive) and traditional relational databases (MySQL, PostgreSQL, etc.).
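As a hedged illustration (the connection details, database, and table names are all hypothetical), a typical Sqoop import from MySQL into Hive can be driven from Python like this, assuming the sqoop binary is on the PATH:

```python
import subprocess

# Import the MySQL table `orders` into the Hive table `ods.orders`
# with 4 parallel map tasks; credentials are read from a file on HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/shop",
    "--username", "etl_user",
    "--password-file", "/user/etl/.password",  # avoids plain-text passwords
    "--table", "orders",
    "--hive-import",                # land the data directly in Hive
    "--hive-table", "ods.orders",
    "-m", "4",                      # degree of parallelism
], check=True)
```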

(2) Flume

Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission, originally developed at Cloudera and now a top-level Apache project. With Flume, data such as logs and events can be collected and stored centrally for downstream use (especially by stream processing frameworks such as Storm).

(3) Kafka

Generally speaking, the speed at which Flume collects data and the speed at which downstream systems process it are not synchronized, so real-time platform architectures use message middleware as a buffer. The most popular and widely used choice here is undoubtedly Kafka.

Kafka is a distributed messaging system originally developed at LinkedIn. It is widely used for its horizontal scalability and high throughput, and the mainstream open source distributed processing systems (such as Storm and Spark) all support integration with it.
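A minimal producer/consumer round trip with the kafka-python client might look like the following sketch, assuming a broker on localhost:9092 and a hypothetical topic name:

```python
from kafka import KafkaConsumer, KafkaProducer

# Producer side: Flume (or any collector) would play this role upstream.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-clicks", b'{"user_id": 42, "page": "/home"}')
producer.flush()  # make sure the message is actually sent

# Consumer side: a stream processor such as Storm/Spark/Flink sits here.
consumer = KafkaConsumer(
    "user-clicks",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # read from the beginning of the topic
)
for message in consumer:
    print(message.value)
    break  # demonstrate just one message
```

Because the broker buffers messages on disk, the producer and consumer can run at completely different speeds.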

2.2 Main technology of data processing

Data processing is the field where open source data technology truly flourishes. The main offline and quasi-real-time tools are MapReduce, Hive, and Spark; stream processing tools include Storm and, more recently popular, Flink and Beam.

(1) MapReduce

MapReduce is Google's core computing model. It abstracts the complex parallel computations that run on large clusters into two functions: map and reduce. The greatest thing about MapReduce is that it gives ordinary developers the ability to process big data: even developers without any distributed programming background can run their programs on a distributed system to process massive amounts of data.
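The classic word-count example shows the abstraction. The sketch below mimics the map, shuffle, and reduce phases in plain Python on a single machine; in a real cluster, the framework distributes each phase across many nodes:

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # map: emit a (word, 1) pair for every word in one input line
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # reduce: aggregate all counts emitted for the same key
    return word, sum(counts)

lines = ["data is crude oil", "data is the means of production"]

# shuffle: group intermediate pairs by key (the framework does this)
groups = defaultdict(list)
for word, count in chain.from_iterable(map_fn(line) for line in lines):
    groups[word].append(count)

print(dict(reduce_fn(w, c) for w, c in groups.items()))
# e.g. {'data': 2, 'is': 2, 'crude': 1, ...}
```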

(2) Hive

Hive was developed by Facebook and contributed to the Hadoop open source community. It is a SQL abstraction layer built on top of Hadoop.

Hive is still the mainstream offline data processing tool at Internet companies, including international giants such as Facebook and the domestic BAT companies.
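From a user's point of view, working with Hive is just issuing SQL. Here is a minimal sketch with the PyHive client, assuming a HiveServer2 endpoint on localhost:10000 and a hypothetical daily-partitioned log table:

```python
from pyhive import hive

# Connect to HiveServer2 (host/port/user are assumptions).
conn = hive.connect(host="localhost", port=10000, username="etl_user")
cursor = conn.cursor()

# An ordinary HiveQL query over one daily partition of a log table.
cursor.execute("""
    SELECT dt, COUNT(*) AS pv
    FROM dw.page_view_log
    WHERE dt = '2024-01-01'
    GROUP BY dt
""")
print(cursor.fetchall())
```

Under the hood, Hive compiles the SQL into distributed jobs on the cluster, which is exactly the convenience that made it mainstream.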

(3) Spark

Although MapReduce and Hive handle most batch processing of massive data and became the preferred technologies for enterprise big data processing, their query latency has long been criticized, and they are ill-suited to iterative computation and DAG (directed acyclic graph) computation. Because Spark is scalable, computes in memory, and can directly read and write data in any format on Hadoop, it better meets the needs of real-time data query and iterative analysis, and has become more and more popular.
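The same word count in Spark keeps intermediate data in memory between steps, which is what makes iterative work fast. A minimal PySpark sketch, assuming a local Spark installation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.sparkContext.parallelize(
    ["data is crude oil", "data is the means of production"])

counts = (lines.flatMap(lambda line: line.split())   # map phase
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)      # reduce phase
               .cache())  # keep in memory for iterative reuse

print(counts.collect())
spark.stop()
```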

(4) Flink

In data processing, batch tasks and real-time streaming tasks are generally considered two different kinds of tasks, and a data project is usually designed to handle only one of them. For example, Storm supports only stream processing, while MapReduce and Hive support only batch processing. So can both be accomplished with one technical framework? Is batch processing just a special case of stream processing?

Apache Flink is an open source computing platform for both distributed real-time stream processing and batch processing. Based on the same Flink runtime, it provides both stream processing and batch processing capabilities. Flink fully supports stream processing and treats batch processing as a special case of stream processing whose input stream is defined as bounded.
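The "batch is a bounded stream" idea can be shown without any framework at all. In this plain-Python sketch (hypothetical event shapes), the only difference between the streaming case and the batch case is the bound on the input:

```python
import itertools
import random
import time

def unbounded_clicks():
    # An unbounded stream: produces events until it is cancelled.
    while True:
        yield {"user": random.randint(1, 100), "ts": time.time()}

def bounded_clicks(n):
    # A "batch" is the very same stream, just bounded at n events.
    return itertools.islice(unbounded_clicks(), n)

# Identical aggregation logic works on both; only the bound differs.
print(sum(1 for _ in bounded_clicks(1000)))  # -> 1000
```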

2.3 The main technology of data storage

(1) HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system developed by Doug Cutting, inspired by Google's paper on the Google File System (GFS). It is fault-tolerant, provides high-throughput data access, and is well suited to applications over very large data sets: a high-fault-tolerance, high-throughput solution for massive data storage.
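For a feel of the API, here is a hedged sketch using the third-party `hdfs` package (a WebHDFS client); the NameNode address and the paths are assumptions:

```python
from hdfs import InsecureClient

# WebHDFS endpoint of the NameNode (host/port are assumptions).
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a file, then stream it back; HDFS is optimized for large,
# sequential, write-once/read-many access rather than small updates.
client.write("/tmp/demo.txt", data=b"hello hdfs", overwrite=True)
with client.read("/tmp/demo.txt") as reader:
    print(reader.read())
```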

(2) HBase

HBase is a distributed, column-oriented storage system built on top of HDFS. For scenarios such as real-time read/write and random access to ultra-large-scale data sets, HBase is currently the mainstream technology choice.

In fact, traditional database solutions, especially relational databases, can also push past single-node performance limits through replication and partitioning, but these measures are afterthoughts, and their installation and maintenance are very complex. HBase approaches scalability from another angle: it scales out linearly, from the bottom up, simply by adding nodes.

HBase is not a relational database and does not support SQL. Its tables generally have the following characteristics (a small client sketch follows the list):

  • Large: a table can have hundreds of millions of rows and millions of columns
  • Column-oriented: data is stored and access-controlled by column (family), and column (family)s can be retrieved independently
  • Sparse: empty (NULL) columns take up no storage space, so tables can be designed to be very sparse
  • Schema-free: each row has a sortable primary key and any number of columns; columns can be added dynamically as needed, and different rows in the same table can have completely different columns
  • Multi-versioned: each cell can hold multiple versions of its data; by default the version number is assigned automatically and is the timestamp at which the cell was inserted
  • Single data type: all data in HBase is stored as untyped strings (byte arrays)
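A small sketch with the happybase client makes several of these characteristics visible (row-key access, family:qualifier columns, dynamically added columns). It assumes an HBase Thrift server on localhost:9090 and a pre-created, hypothetical table `user_actions` with one column family `cf`:

```python
import happybase  # Thrift-based HBase client

conn = happybase.Connection("localhost", port=9090)
table = conn.table("user_actions")

# Columns are addressed as family:qualifier and can be added at any time;
# only non-empty cells occupy storage. Values are untyped byte strings.
table.put(b"user42#20240101", {b"cf:page": b"/home",
                               b"cf:stay_ms": b"1200"})

# Random access by row key: the core HBase access pattern.
print(table.row(b"user42#20240101"))
```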

2.4 Main technology of data application

Data can be applied in many ways, such as fixed reports, real-time analysis, data services, data analysis, data mining, and machine learning.

3. Summary

This chapter gave an overview of data as a whole, covering the four major processes from data generation to consumption: data generation, data collection and transmission, data storage and processing, and data application. Each process involves many technologies, open source frameworks, tools, and platforms.

For example, the main offline data processing technology is Hive, built on Hadoop MapReduce. Hive is a SQL-on-Hadoop technology, but there are many similar SQL-on-Hadoop technologies and frameworks, such as Cloudera's Impala, Apache Druid, Presto, and Shark. Beginners should focus on one technology and use it to help understand the related ones; otherwise, they will easily lose focus and feel at a loss.

Origin: blog.csdn.net/BeiisBei/article/details/108639025