Big Data Architect Must-Know Series: Data Visualization and Exploration

Author: Zen and the Art of Computer Programming

1 Introduction

  Data visualization is the practice of presenting data through images, charts, and statistical graphics so that business users can analyze data and discover its characteristics, relationships, and patterns more intuitively. It plays an important role in business understanding, decision support, and process optimization. Big data technology has brought massive data volumes, high-speed data collection, and distributed computing, all of which make data acquisition, processing, and storage more complex. Visualizing data effectively has therefore become an essential skill for every data scientist.

  In this series of tutorials, we first review the basic knowledge, core concepts, core algorithms, and implementation process of data visualization. We then use Python and open source libraries such as pandas and matplotlib to implement the main chart types: histogram, scatter plot, line chart, bar chart, heat map, radar chart, box plot, map, stacked chart, network chart, and so on. For readers who want more of a challenge, we also explore best practices and difficulties that arise in practice, such as the trade-off between data size and visual quality, how to create attractive legends, how to improve the quality of the data being visualized, and how to choose an appropriate graphical encoding. Finally, we provide some learning resources and reference books. We hope this helps everyone improve their data visualization skills and get twice the result with half the effort!

  This series is intended primarily for big data architects, development engineers, and data scientists who already have some relevant experience. Because data visualization is a broad subject with many technical details, no single tutorial can cover everything for everyone. The material should nonetheless be useful to a wider audience, including engineers, practitioners, and data science enthusiasts at all levels.

2 Basic concepts and terminology

2.1 Concepts and uses of data visualization

  Data visualization converts data into information by presenting it visually, so that users can understand and analyze it clearly and make decisions quickly. An intuitive graphical display helps data analysts, managers, and non-technical staff understand data and work more efficiently and accurately. After years of development, visualization has become an indispensable tool in analysis and decision-making. Data visualization can be divided into the following five categories:

  - Statistical charts: including bar charts, pie charts, line charts, histograms, bubble charts, tree charts, box and whisker plots, scatter plots, etc.;

  - Map mapping: through points or lines marking locations on the map, the spatial distribution of data and the correlation between related data are presented;

  - Network diagram: a relationship diagram drawn through the three dimensions of nodes, edges, and attributes, showing the complex network structure and the connections between each element;

  - Information charts: including bar charts, box charts, cuboid charts, strip charts, etc., which display data in the form of charts through colors, symbols, sizes, etc.;

  - Matrix chart: A matrix composed of different variables. Through different aggregation indicators, sorting methods, classification methods, cross analysis, etc., the association, correlation and impact between different variables are compared in different dimensions.
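
  As a small illustration of the first category, statistical charts, the following sketch draws a bar chart and a scatter plot with matplotlib. The region names and figures are hypothetical sample data invented for this example, and the off-screen "Agg" backend is an assumption made so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical sample data, invented for illustration only.
regions = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]
ad_spend = [10, 8, 13, 6]

fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# A bar chart compares one value across categories.
bars = ax_bar.bar(regions, sales)
ax_bar.set_title("Sales by region")
ax_bar.set_ylabel("Sales")

# A scatter plot shows the relationship between two numeric variables.
points = ax_scatter.scatter(ad_spend, sales)
ax_scatter.set_title("Ad spend vs. sales")
ax_scatter.set_xlabel("Ad spend")
ax_scatter.set_ylabel("Sales")

fig.tight_layout()
```

  To keep the result, call fig.savefig("statistical_charts.png"), or use plt.show() with an interactive backend.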

  The purpose of data visualization is to turn data into information that supports analysis, understanding, and quick decision-making, thereby improving work efficiency, accuracy, and satisfaction. To aid users' comprehension, visualization relies on graphical forms such as information charts, map mapping, network diagrams, matrix charts, and infographics. These charts are designed for specific analysis needs; they highlight key information and convey the meaning of the data well. With interactive charts, users can also adjust the view at any time to support analysis and decision-making.

  Data visualization has four main functions:

  • Provide intuitive graphical display: Data visualization presents data quickly and intuitively. A graphical presentation makes the differences and connections between data points visible at a glance, so analysts can more easily identify characteristics, relationships, and anomalies, quickly gain insight into problems, and promote business development.

  • Better understand data: Visualization makes data more intuitive and easier to understand. Displaying data in different chart forms gives the information a clear hierarchy instead of leaving it a mess. This helps users understand the data, discover information hidden in it, and better grasp its patterns and trends.

  • Enhance the transparency of decision-making: Data visualization supports decisions in business operations, product development, policy promotion, and similar areas. Through a visual data model, decision makers can see which information really matters, make sound judgments quickly, and ultimately improve the business.

  • Improve management efficiency: Data visualization helps improve work processes and reduce communication costs. It supports better data analysis plans based on the company's business needs and target customer groups, helping the company track market conditions and manage more efficiently. It can also reduce duplicated work, save costs, and improve work quality and brand image.

2.2 Basic concepts of data visualization

2.2.1 Data types

  Data visualization generally deals with two kinds of data: raw data and analytical data. Raw data comes from a specific source system, such as a database, log files, a real-time monitoring system, or an external data source. It must be preprocessed before it can be used for visual analysis tasks. Analytical data is the result of applying data mining, data analysis, machine learning, and similar techniques. It can be derived directly from the raw data or exist independently of it.

  Visualization depends on analyzing the data, and analysis starts with understanding the data type. There are four main data types: numerical, categorical, time series, and structural.

  - Numerical data: the data units are numbers, so statistics such as the sum, mean, and variance can be computed. It is also called continuous data. Examples include sales revenue, housing prices, temperature, population, traffic flow, and production volume.

  - Categorical data: the data units are categories, such as color, race, city, occupation, or whether a law was broken. This data can be grouped and compared, and is also known as discrete data. Typical uses include breaking down product sales, customer groups, consumption habits, and web page access by category.

  - Time series data: data arranged in chronological order, which can show how values change over time. Examples include stock prices, economic indexes, and trends in case counts. This type of data is sequential, so time-based analysis is needed to discover its patterns and regularities.

  - Structural data: data in which relationships exist across multiple dimensions, such as a person's name, address, and phone number. It can be represented on multi-dimensional coordinate axes and described with labels, which effectively presents the relationships between data items. Examples include personnel information, customer information, and equipment information.

  In addition to the above four data types, there are other data types, such as geographical data, text data, image data, video data, etc.
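
  To make the data types above concrete, here is a minimal sketch that inspects the columns of a pandas DataFrame and labels each as numerical, categorical, or time series. The sample records and the helper kind_of are hypothetical and exist only for this example; structural data is not covered, and the mapping is a simplification (boolean columns, for instance, count as numeric in pandas).

```python
import pandas as pd

# Hypothetical sample records, invented for illustration only.
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "city": pd.Categorical(["Beijing", "Shanghai", "Beijing"]),
    "temperature": [3.5, 8.1, 2.0],
})

def kind_of(series: pd.Series) -> str:
    """Hypothetical helper: map a column to a data type named in the text."""
    if pd.api.types.is_datetime64_any_dtype(series):
        return "time series"
    if pd.api.types.is_numeric_dtype(series):
        return "numerical"
    # Everything else (categories, plain strings) is treated as categorical.
    return "categorical"

kinds = {name: kind_of(df[name]) for name in df.columns}
print(kinds)
```

  Knowing the type of each column is a reasonable first step when choosing a chart, for example a line chart for a time series column or a bar chart for a categorical one.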

2.2.2 Data measurement method

  Another important concept in data visualization is measurement. The measurement method describes how data units are measured; measured values may be continuous or discrete. There are four measurement types: scale, proportional, ordinal, and calculation.

  - Scale type: the data units are fixed, as with temperature, length, and time. Scale data can be expressed simply and directly as numerical values and can be used directly as the scale of a coordinate axis.

  - Proportional type: the data units are not fixed, but the range of data values is relatively uniform. Proportional data is usually visualized with bar charts, line charts, pie charts, and similar charts. Because each group is shown only as a share of the whole, detailed values are hard to display.

  - Ordinal type: the data units are arranged in order of size, similar to a hierarchy. Ordinal data can be visualized with radar charts, funnel charts, and sunburst charts. Such charts quickly reveal commonalities in the data but lack detail.

  - Calculation type: the data is not measured or compared directly but derived from other measurements, for example average height, per-capita wealth, population density, or industry distribution. It has to be computed first and cannot be visualized directly from the raw records.
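
  For the ordinal type in particular, the order of the values has to be declared explicitly, otherwise tools fall back to alphabetical order. The sketch below shows one way to do this with an ordered pandas Categorical. The survey answers and the ordering of levels are hypothetical and invented for this example.

```python
import pandas as pd

# Hypothetical survey answers, invented for illustration only.
levels = ["low", "medium", "high"]  # assumed ordering, lowest first
ratings = pd.Series(
    pd.Categorical(["high", "low", "medium", "low"],
                   categories=levels, ordered=True)
)

# With ordered categories, sorting and comparison follow the declared
# order rather than alphabetical order.
sorted_ratings = ratings.sort_values().tolist()
at_least_medium = (ratings >= "medium").tolist()

# Counts per level, ready to feed into a bar or funnel chart.
counts = ratings.value_counts().to_dict()
print(sorted_ratings, at_least_medium, counts)
```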

2.2.3 Visual classification

  Current data visualization research is mainly divided into the following categories:

  - Information visualization: used to present the relationship between single or multiple variables, consisting of information charts, scatter plots, bubble charts, contour charts, grid charts, relationship charts, etc.

  - Time series visualization: used to present time series data, consisting of curve charts, step charts, area fill charts, heat maps, time series bar charts, wave charts, etc.

  - Spatial visualization: used to present spatial data, consisting of maps, three-dimensional diagrams, network diagrams, etc.

  - Structural visualization: used to present the relationship between complex data, consisting of tree diagrams, cluster diagrams, nested diagrams, same-color donut diagrams, etc.

  - Symbolic coding visualization: used to present features in the data, consisting of violin plots, rose diagrams, hierarchical donut diagrams, etc.

2.3 Technology implementation process

  The core of data visualization lies in the analysis of data, and implementing it draws on techniques from data mining and machine learning. The technical implementation process mainly includes the following stages:

  - Data preparation stage: First, collect and organize the required data, including raw data and analytical data. Raw data usually comes from a variety of data source systems, including databases, log files, monitoring systems, external data sources, etc. These data usually require data preprocessing before they can be used in visual analysis tasks.

  - Data analysis stage: Secondly, use data mining, data analysis, machine learning and other technologies to conduct data analysis. During the data analysis process, the original data can be cleaned, analyzed, organized, summarized, features extracted, models trained, and analysis data generated.

  - Data visualization stage: Finally, use visualization technology for data visualization to transform the analytical data into meaningful images for easy viewing, analysis, and understanding. Visualization techniques include information graphics, map mapping, network diagrams, matrix diagrams, symbol coding diagrams, etc.

  The final product of this process may be a complete visualization system, including a visual front-end interface, back-end services, and visual analysis algorithms. Together these modules implement the functions of data visualization, including data import, query, filtering, analysis, display, export, and sharing.
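
  The three stages described above (preparation, analysis, visualization) can be sketched end to end with pandas and matplotlib. This is a minimal illustration under stated assumptions, not a full visualization system: the inline CSV text stands in for a real data source, and the function names prepare, analyze, and visualize are hypothetical.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical raw data standing in for a database or log file.
RAW_CSV = """date,region,amount
2023-01-01,North,100
2023-01-01,South,80
2023-01-02,North,
2023-01-02,South,90
2023-01-03,North,120
"""

def prepare(raw_text: str) -> pd.DataFrame:
    """Data preparation: load raw records and drop rows with no amount."""
    df = pd.read_csv(io.StringIO(raw_text), parse_dates=["date"])
    return df.dropna(subset=["amount"])

def analyze(df: pd.DataFrame) -> pd.Series:
    """Data analysis: total amount per region."""
    return df.groupby("region")["amount"].sum()

def visualize(totals: pd.Series):
    """Data visualization: draw the per-region totals as a bar chart."""
    fig, ax = plt.subplots()
    ax.bar(totals.index, totals.values)
    ax.set_xlabel("Region")
    ax.set_ylabel("Total amount")
    ax.set_title("Total amount by region")
    return fig

totals = analyze(prepare(RAW_CSV))
fig = visualize(totals)
```

  Keeping each stage in its own function mirrors the separation in the text: the preparation and analysis steps can be checked on their own before anything is drawn.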

Reprinted from: blog.csdn.net/universsky2015/article/details/133385379