Big Data and Intelligent Data Application Architecture Tutorial Series: Big Data Visualization and Report Analysis

Author: Zen and the Art of Computer Programming

1. Introduction

With the vigorous development of new big data applications such as the Internet, mobile Internet, and the Internet of Things, traditional approaches to data processing, analysis, and application can no longer meet the growing demand. The popularization of big data technology has also brought new challenges: extracting useful information from massive data, gaining insight into business opportunities, predicting market trends, and supporting decisions have become urgent needs. Visualization and report analysis systems based on big data technology are therefore essential in every industry.

This article analyzes visualization and report analysis systems from the perspective of "big data", explains the development trends and application scenarios of big data visualization and report analysis, and provides solutions. It is mainly intended for technical personnel in industry application fields and for related university departments. Through this article, readers can understand the overall framework, key technical indicators, and application processes of a big data visualization and report analysis system, and then master the corresponding technical skills.

2. Background introduction

What is big data?

Generally speaking, big data refers to the storage, processing, and analysis of massive, high-velocity, and diverse data collections. Its main characteristics include the following:

  1. Volume: refers to the sheer size of the data set, which exceeds the scale that traditional single-machine storage and processing can handle.
  2. Velocity: refers to data acquisition, storage, transmission, processing and other processes that need to be carried out very quickly.
  3. Variety: refers to data sources that may include different media forms such as text, images, videos, audio, network traffic, etc., and there are a wide variety of data types.
  4. Value: refers to the real business value or indicator significance contained in the data, which can be used to predict market trends or support decision-making.

It can be seen that big data refers to data collections that are massive, high-velocity, diverse, and valuable, and that also exhibit complex correlations, repeated interpretation, and rapid change and growth. For such a huge and ever-growing data set, conveniently managing, mining, analyzing, and presenting the data is therefore a significant challenge.

Why do we need big data visualization and report analysis?

With the popularity of big data, more and more companies are processing big data. To make data more intuitive and understandable, reduce maintenance costs, and improve work efficiency, many companies have begun to use visualization technology for data analysis, such as report visualization, business monitoring dashboards, and intelligent BI tools.

However, because the data volume is large, the dimensions are many, and the structure is complex, building and deploying a visualization system often takes considerable time. The selection and use of visualization tools are therefore receiving more and more attention. Currently, common visualization technologies can be divided into the following categories:

  1. Data visualization: including data statistical charts, data maps, data relationship diagrams, heat maps, etc.
  2. Report visualization: For example, use Tableau, Davinci Chart, Power BI and other tools to create reports.
  3. Intelligent BI tools: Use business intelligence suites and data mining algorithms to provide customized BI solutions.

These visualization tools can help users quickly understand complex data and implement data analysis in an interactive manner.

Visualization system architecture overview

The following is an introduction to the visualization system architecture, which consists of three main modules:

  1. Data acquisition module: Its main function is to obtain external data sources and store them as one or more tables in the data warehouse.
  2. Data cleaning module: Its main function is to clean data and delete dirty data, noise data, erroneous data, etc.
  3. Data display module: its main function is to provide data visualization. Within it, the data conversion sub-module converts the raw data into a format suitable for visualization; the data encoding sub-module encodes the data so that it conforms to computer visualization standards; and the data rendering sub-module presents the encoded results to the user. A minimal end-to-end sketch of the three modules follows this list.
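The sketch below strings the three modules together in Python using pandas and matplotlib. It is only an illustration of the module responsibilities, not a production pipeline; the file name sales.csv and its columns (region, amount) are hypothetical.

```python
# Minimal sketch of the three modules: acquisition -> cleaning -> display.
# The file "sales.csv" and its columns ("region", "amount") are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

# 1. Data acquisition: load an external data source into a table-like structure.
raw = pd.read_csv("sales.csv")

# 2. Data cleaning: drop rows with missing values and remove duplicates.
clean = raw.dropna(subset=["region", "amount"]).drop_duplicates()

# 3. Data display: convert, encode, and render the cleaned data as a chart.
summary = clean.groupby("region")["amount"].sum()  # conversion into a visual-ready form
ax = summary.plot(kind="bar")                       # encoding as a bar chart
ax.set_xlabel("Region")
ax.set_ylabel("Total amount")
plt.tight_layout()
plt.show()                                          # rendering to the user
```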

There are three commonly used visualization tool architectures:

  1. Front-end integrated architecture: this type of architecture packages the visualization tool as a browser plug-in that is installed directly into the browser. The advantage is that the tool is simple and convenient to use; the disadvantage is that it consumes considerable client (browser) resources.
  2. Server-side integrated architecture: this type of architecture runs the visualization tool as a server-side application that is called through an HTTP interface, with the results returned to the browser for display (see the sketch after this list). The advantage is that no additional plug-ins need to be installed, client resources are not occupied, and it can easily be integrated with other services. The disadvantage is that it is more cumbersome to use and it is harder to explore the data interactively.
  3. Client-integrated architecture: this type of architecture embeds the visualization tool into a client that runs as an independent application. The advantage is that no server needs to be installed or configured, data sources can be accessed directly, and local computing power improves performance. The disadvantage is that it depends on the local hardware environment and cross-platform compatibility must be considered.
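As a concrete illustration of the server-side integrated architecture (item 2 above), the following minimal Flask sketch exposes chart-ready data over an HTTP interface; the endpoint path and the sample data are hypothetical, and a front-end charting library would render the returned JSON in the browser.

```python
# Minimal sketch of the server-side integrated architecture: the visualization
# data is prepared on the server and served over HTTP; the browser only renders
# the returned JSON. The endpoint path and sample data are hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)

# In a real system this would come from the data warehouse; here it is static.
SALES_BY_REGION = {"north": 120, "south": 95, "east": 143, "west": 78}

@app.route("/api/chart/sales-by-region")
def sales_by_region():
    # Return chart-ready data; a front-end library (e.g. ECharts) draws it.
    return jsonify({
        "labels": list(SALES_BY_REGION.keys()),
        "values": list(SALES_BY_REGION.values()),
    })

if __name__ == "__main__":
    app.run(port=8080)
```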

Based on the above introduction to the visualization system architecture, the knowledge points of big data visualization and report analysis are elaborated below.

3. Explanation of basic concepts and terms

(1) ETL (Extract-Transform-Load)

ETL refers to data extraction, transformation, and loading: extracting data from various sources (such as databases, file systems, message queues, e-mail, etc.), cleaning, filtering, converting, and verifying it, and then loading it into the target system (such as a database or file system). It is a core component of the data warehouse. The purpose of ETL is to transform data from heterogeneous data sources into the internal structure of the data warehouse in a unified way, eliminating duplicate, abnormal, and missing data and thereby keeping the warehouse clean.

ETL is usually divided into the following stages:

  1. Connecting data sources: ETL extracts data from various sources (such as Oracle, MySQL, etc.) and calls corresponding methods according to the protocols and APIs of different data sources to read the data.
  2. Data cleaning: ETL cleans, filters, converts and verifies the original data obtained, and removes messy, repetitive and erroneous data.
  3. Data transformation: ETL transforms the cleaned data into the structure required for the data warehouse.
  4. Data loading: ETL loads the converted data into the target system of the data warehouse (such as Oracle, MySQL, etc.).

The execution frequency of ETL depends on the size of the data volume, data update frequency and data quality, usually once a day, once a week or once a month.
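The following sketch walks through the four ETL stages in plain Python, using the built-in sqlite3 module as both source and target. It is only an illustration of the flow; the database files, the table names (raw_orders, dw_orders), and the columns are hypothetical.

```python
# Minimal sketch of the four ETL stages: connect, clean, transform, load.
# The database files, table names, and columns are hypothetical.
import sqlite3

# 1. Connect to the data source and read the raw data.
src = sqlite3.connect("source.db")
rows = src.execute("SELECT id, amount, region FROM raw_orders").fetchall()

# 2. Clean: drop rows with missing or invalid values.
cleaned = [(i, a, r) for (i, a, r) in rows if a is not None and a >= 0 and r]

# 3. Transform: normalize to the structure required by the warehouse table.
transformed = [(i, round(a, 2), r.strip().upper()) for (i, a, r) in cleaned]

# 4. Load the converted data into the target (data warehouse) table.
dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS dw_orders (id INTEGER, amount REAL, region TEXT)")
dw.executemany("INSERT INTO dw_orders VALUES (?, ?, ?)", transformed)
dw.commit()
src.close()
dw.close()
```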

(2) Hive

Hive is an open-source distributed data warehouse infrastructure in the Hadoop ecosystem, designed for querying and analyzing massive data. It can be used like a database and can also process unstructured and semi-structured data. A notable feature is its support for columnar storage formats (such as ORC and Parquet), which reduce disk I/O and memory overhead and improve query performance.

The basic data model of Hive is a table, which consists of a series of columns and rows. Each column has a name and data type, and each row corresponds to a unique set of values. Hive also supports complex data types, such as arrays, Maps, Structs, Unions, etc.

Hive's SQL interface provides a rich function library; tables can be queried, aggregated, joined, and otherwise processed through SQL statements. Hive stores its data in HDFS (Hadoop Distributed File System) by default and also supports customized storage formats.

Hive can be combined with Spark to provide high-performance analysis queries. Spark SQL can directly query tables in Hive to achieve data sharing and processing.
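The following PySpark sketch shows Spark SQL querying a Hive table directly. It assumes a Spark build with Hive support and access to the Hive metastore; the database and table name dw.orders and its columns are hypothetical.

```python
# Minimal sketch of Spark SQL querying a Hive table. Assumes Spark was built
# with Hive support and can reach the Hive metastore; "dw.orders" is hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-query-example")
    .enableHiveSupport()   # use the Hive metastore and HiveQL features
    .getOrCreate()
)

# Query the Hive table with SQL and aggregate by region.
df = spark.sql("SELECT region, SUM(amount) AS total FROM dw.orders GROUP BY region")
df.show()

spark.stop()
```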

(3) Hive SQL

The SQL dialect supported by Hive (HiveQL) follows ANSI/ISO SQL syntax similar to MySQL's, while also extending it with some syntax of its own.

The SELECT statement is used to retrieve data from Hive tables. You can use WHERE, ORDER BY, LIMIT, GROUP BY and other clauses to filter, sort and aggregate the results. The supported arithmetic operators include +, -, *, /, %, etc. The supported conditional operators include =, !=, >, <, >=, <=, etc.

The INSERT INTO statement is used to insert data into the Hive table. If the table does not exist, create the table first.

The CREATE TABLE statement is used to create a Hive table and specify attributes such as table name, column name, data type, etc.

DROP TABLE statement is used to delete Hive tables.

The SHOW TABLES statement is used to view the list of existing Hive tables.

The DESCRIBE FORMATTED statement is used to view the metadata information of the table, such as column names, data types, comments, etc.
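As a sketch of the statements described above, the following Python snippet runs them through the PyHive client against HiveServer2. The host, port, and the table demo_orders are hypothetical, and the statements could equally be run from the Hive CLI or Beeline.

```python
# Minimal sketch of the HiveQL statements described above, executed through
# the PyHive client. Host, port, and the table "demo_orders" are hypothetical.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cur = conn.cursor()

# CREATE TABLE: define the table name, column names, and data types.
cur.execute("CREATE TABLE IF NOT EXISTS demo_orders (id INT, region STRING, amount DOUBLE)")

# INSERT INTO: insert rows into the table.
cur.execute("INSERT INTO demo_orders VALUES (1, 'north', 99.5), (2, 'south', 42.0)")

# SELECT with WHERE / GROUP BY / ORDER BY / LIMIT.
cur.execute(
    "SELECT region, SUM(amount) AS total FROM demo_orders "
    "WHERE amount > 0 GROUP BY region ORDER BY total DESC LIMIT 10"
)
print(cur.fetchall())

# SHOW TABLES and DESCRIBE FORMATTED: inspect tables and their metadata.
cur.execute("SHOW TABLES")
print(cur.fetchall())
cur.execute("DESCRIBE FORMATTED demo_orders")
print(cur.fetchall())

# DROP TABLE: delete the table.
cur.execute("DROP TABLE IF EXISTS demo_orders")
conn.close()
```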

(4) OLAP (Online Analytical Processing)

OLAP, or Online Analytical Processing, is a multi-dimensional data analysis method that organizes data into multi-dimensional cubes and performs multi-dimensional analysis and mining on them. Its main advantages are real-time responsiveness, flexibility, and strong visualization capabilities.

An OLAP system is an integrated system for summarizing, analyzing, making decisions about, and forecasting large-scale enterprise data, including technical components such as the data warehouse, data mining, multi-dimensional analysis, interactive query, and visualization. The data warehouse is used to store, organize, and manage the data; data mining discovers patterns and trends in the data warehouse; multi-dimensional analysis analyzes the data along different dimensions; interactive query helps identify potential business opportunities early; and visualization displays the analysis results.
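As a small illustration of multi-dimensional analysis in the OLAP spirit (a roll-up over two dimensions, not a full OLAP engine), the following pandas sketch builds a cube-like summary; all data is made up.

```python
# Illustration of an OLAP-style roll-up along two dimensions using pandas.
# This is not a full OLAP engine; the data is made up.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023, 2023],
    "region":  ["north", "south", "north", "south", "south"],
    "product": ["A", "A", "B", "A", "B"],
    "amount":  [100, 80, 120, 90, 60],
})

# Roll up the "amount" measure along the "year" and "region" dimensions.
cube = pd.pivot_table(
    sales, values="amount", index="year", columns="region",
    aggfunc="sum", margins=True,   # margins=True adds grand totals
)
print(cube)
```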

(5) OLTP (Online Transaction Processing)

OLTP, or Online Transaction Processing, is a high-performance transaction processing system mainly used to handle real-time query requests. An OLTP system consists of a database, a transaction engine, index structures, a caching mechanism, and so on. The database is where the data is stored, including tables, views, stored procedures, triggers, etc.; the transaction engine is responsible for committing and rolling back transactions; the index structures are used to locate data quickly; and the caching mechanism is used to speed up data queries.

The OLTP system processes requests as follows:

  1. Transaction processing: the database performs a series of operations in a fixed order to complete a certain function or save data. A transaction usually contains multiple statements; if any error occurs during execution, the entire transaction is rolled back (a minimal sketch follows this list).
  2. Query processing: querying for specified data in the database. When a large amount of data must be retrieved, the database should minimize the number of disk reads and use the caching mechanism to improve query performance.
  3. Update processing: modifying or adding data that already exists in the database. When data changes, the database automatically records the old version of the data.
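The following sketch shows transaction processing with commit and rollback using Python's built-in sqlite3 module; the accounts table and the transfer amounts are hypothetical.

```python
# Minimal sketch of transaction processing: multiple statements either all
# commit or all roll back. The "accounts" table and its values are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    # Transfer 30 from account 1 to account 2 as a single transaction.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    conn.commit()      # both updates become visible together
except sqlite3.Error:
    conn.rollback()    # on any error, undo the whole transaction

print(conn.execute("SELECT * FROM accounts").fetchall())
conn.close()
```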

(6) Pentaho Data Integration

Pentaho Data Integration (PDI) is an open source data integration software that can be used for a series of data integration tasks such as ETL, ELT, data warehouse development, data collection, data summary, data conversion, data loading, and data auditing. Its design goals are lightweight, easy-to-use, and highly scalable, and can meet various needs of enterprise-level data processing.

Data sources supported by PDI include relational databases, file-based logs, APIs, etc. It ships with a series of common built-in components, including job scheduling, FTP/SFTP file download, e-mail sending, database connections, XML parsing, LDAP login, shell script execution, and more, which make it possible to implement simple data collection, conversion, loading, and inspection.

The output targets of PDI include relational databases, files such as CSV and Excel, Hadoop HDFS, JDBC, ActiveMQ, etc. It provides a friendly visual interface that shows task execution intuitively and makes it easy to track task progress and data errors.
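PDI transformations are usually designed in its visual interface, but they can also be launched from scripts through the Pan command-line tool that ships with PDI. The sketch below calls Pan from Python; the installation path and the transformation file load_orders.ktr are hypothetical.

```python
# Minimal sketch of launching a PDI transformation via the Pan command-line
# tool. The installation path and the .ktr file are hypothetical.
import subprocess

result = subprocess.run(
    [
        "/opt/pentaho/data-integration/pan.sh",  # hypothetical install path
        "-file=/etl/load_orders.ktr",            # the transformation to run
        "-level=Basic",                          # logging level
    ],
    capture_output=True,
    text=True,
)

print(result.stdout)
if result.returncode != 0:
    # Pan returns a non-zero exit code when the transformation fails.
    print("Transformation failed:", result.stderr)
```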

(7) Tableau

Tableau is a business intelligence analysis tool that can display complex data quickly and intuitively, providing users with interactive, intuitive dashboards and data reports. Its main features are being data-driven, intuitive visualization, ease of use, linked analysis, and mobile device support.

Tableau's main functions include data connection, data import, data cleaning, data merging, data calculation, data pivot, view editing, data export, authorization control, etc. It supports mainstream relational databases and NoSQL data sources, including MySQL, PostgreSQL, MongoDB, Amazon Redshift, Google BigQuery, Salesforce, SAP HANA, etc.

Both Tableau Desktop and Server versions can be used for personal and private analysis, but the Server version also supports team collaboration and provides cloud deployment and cloud analysis services.
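Most Tableau work happens in its graphical interface, but Tableau Server can also be automated through its REST API. The sketch below uses the tableauserverclient Python library to sign in and list workbooks; the server URL and credentials are hypothetical.

```python
# Minimal sketch of programmatic access to Tableau Server through the
# tableauserverclient library (Python client for Tableau's REST API).
# The server URL and credentials are hypothetical.
import tableauserverclient as TSC

auth = TSC.TableauAuth("analyst", "secret-password", site_id="")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(auth):
    # List the workbooks visible to the signed-in user.
    workbooks, _ = server.workbooks.get()
    for wb in workbooks:
        print(wb.name)
```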

(8) RapidMiner

RapidMiner is a visual machine learning and data science platform that can be used to quickly build, train, and optimize machine learning models. Its main functions include data import, data cleaning, feature engineering, model training, evaluation and hyperparameter tuning, model inference and evaluation, model release and deployment, model monitoring and tracking, and exposing models as services.

RapidMiner also ships with built-in machine learning algorithms, such as SVM, Naive Bayes, and Decision Tree. Through its Python integration it can work with mainstream Python environments and tools, including Anaconda, IPython, Jupyter Notebook, and the PyCharm IDE.

In addition to model training, evaluation, and monitoring, RapidMiner can also deploy a model as a RESTful API for other applications to call, enabling model portability and scalability.
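RapidMiner itself is driven through a visual workflow rather than code. As a rough code analogue of the train-and-evaluate workflow described above (not RapidMiner's own API), here is a scikit-learn sketch on a built-in dataset.

```python
# A scikit-learn analogue of the train -> evaluate steps described above.
# This is not RapidMiner's API; it only mirrors the workflow in code.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a decision tree (one of the algorithm families mentioned above).
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Evaluate on the held-out test set.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```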
