Introduction to Information Technology Chapter 5 Big Data Notes

Chapter 5 Big Data

5.1 Overview of big data

5.1.1   The background, basic concepts and main characteristics of big data

1. The background of big data
1) Progress in information technology

The modern information technology industry has a history of more than 70 years, and its development has gone through several waves. First came the mainframe wave of the 1960s and 1970s, when computers were enormous and had limited computing power. After the 1980s, as microelectronics and integration technology advanced, computer chips were progressively miniaturized and the microcomputer wave emerged; the PC became mainstream. At the end of the last century, the rise of the Internet drove the rapid development of network technology and set off a wave of networking, and more and more people gained access to and began to use the Internet. In recent years, with the spread of mobile phones and other smart devices, the number of people online worldwide has surged. Our lives are now surrounded by digital information, and this digital information is what we usually call "data". This latest wave can be called the big data wave, and it also shows that the continuing spread of intelligent devices is an important factor in the rapid growth of big data.

Facing the explosive growth of data, the performance of storage devices must improve accordingly. The American scientist Gordon Moore observed the pattern of transistor growth known as "Moore's Law". Guided by Moore's Law, the computer industry undergoes periodic upgrades, reflected in continuously improving computing power and performance. At the same time, the low-speed bandwidth of the past falls far short of what data transmission requires, and various high-speed, high-frequency links are constantly being put into use. The growth rate of optical fiber transmission bandwidth even exceeds the rate at which storage device performance improves, which is known as "Super Moore's Law".

The popularization of smart devices, the widespread application of the Internet of Things, the improvement of storage device performance, and the continuous growth of network bandwidth are all advancements in information technology. They provide the material basis for storage and circulation of big data.

2) The rise of cloud computing technology

Cloud computing is an emerging technology in the Internet industry, and its emergence has brought tremendous changes to the industry. The network cloud drives we use every day are a concrete manifestation of cloud computing. In layman's terms, cloud computing uses software, hardware and applications shared in the cloud to produce the results we want, with the computation carried out by a professional cloud service team. What we usually call the cloud is the "data center". Major domestic Internet companies, telecom operators, banks and even government ministries have now established their own data centers; cloud computing has been adopted across all walks of life and increasingly dominates the market.

Cloud storage is a new model of data storage. Cloud computing concentrates originally dispersed data in data centers, making it possible to process and analyze huge volumes of data. It can be said that cloud computing provides the space and channels needed for storing massive data and for access by dispersed users, and it is the technical basis for the birth of big data.

3) Trend of data resource utilization

Based on its source, big data can be divided into consumer big data and industrial big data. Consumer big data is the mass of data generated in people's daily lives. Although it is merely the trail people leave on the Internet, major Internet companies have long been accumulating and competing for it. Google relies on the world's largest web database to fully tap the value of its data assets, breaking Microsoft's monopoly; Facebook launched the Graph Search engine based on its interpersonal database; Alibaba and JD.com, the two largest e-commerce platforms in China, have also fought a data war, using data to assess their opponents' strategic moves, promotional strategies and so on. In terms of industrial big data, many traditional manufacturing companies have successfully used big data to achieve digital transformation. This shows that, with the rapid spread of "intelligent manufacturing" and the deep integration and innovation of industry and the Internet, industrial big data technology and applications will become key elements in improving the manufacturing industry's productivity, competitiveness and innovation capability in the future.

2. Definition of big data

① In its report "Big data: The next frontier for innovation, competition, and productivity", the McKinsey Global Institute defines big data as data sets whose size exceeds the ability of conventional database tools to acquire, store, manage and analyze. It also emphasizes that a data set does not have to exceed a particular number of terabytes to count as big data.

② Wikipedia defines big data as data sets so large that they cannot be captured, managed, processed and organized within a reasonable time by current mainstream software tools into information that supports business decision-making. The author believes this is not a precise definition, because the scope of "mainstream software tools" cannot be determined and "a reasonable time" is also only a rough description.

③ The International Data Corporation (IDC) defines big data in terms of four characteristics: massive data scale (Volume), fast data processing (Velocity), diverse data types (Variety) and low data value density (Value), the so-called "4V" characteristics. IBM argues that big data should also have a fifth characteristic, veracity.

④ Gartner, a global information technology research and consulting company, gives this definition: big data is massive, high-growth and diverse information assets that require new processing models in order to deliver stronger decision-making power, insight discovery and process optimization capabilities.

⑤ The National Institute of Standards and Technology (NIST) defines big data as: large amounts of data that are acquired quickly or in various forms, which are difficult to effectively analyze using traditional relational data analysis methods, or require large-scale horizontal expansion for efficient processing.

3. Key characteristics of big data

Big data has 4V characteristics: Volume, Variety, Velocity, and Value.

Volume (huge data volume): A large amount of interactive data is recorded and saved, and the data size ranges from TB to PB.

Variety (many data types): structured data, semi-structured data and unstructured data.

Velocity (fast flow): the status and value of the data itself continue to evolve with changes in time and space.

Value (huge value but low density): The value of data does not increase in the same proportion as the exponential growth of data volume.

5.1.2   The development history of big data

In the entire development process of big data, we divide it into four stages according to the process, namely the embryonic stage, breakthrough stage, mature stage and application stage of big data.

1. The embryonic stage of big data (1980-2008)

In the book "The Third Wave" written by the famous futurist Alvin Toffler in 1980, "big data" was called "the cadenza of the third wave"; the end of the last century was the germination of big data. It is in the stage of data mining technology. As data mining theory and database technology mature, some business intelligence tools and knowledge management technologies begin to be applied. In September 2008, the British magazine "Nature" launched a cover column called "Big Data".

2. Big data breakthrough stage (2009-2011)

From 2009 to 2010, "big data" became a buzzword in the Internet technology industry. In June 2011, McKinsey, a world-leading global management consulting firm, released a report on big data that formally defined the concept, and big data gradually attracted attention from all walks of life. At this stage a large amount of unstructured data appeared that traditional databases found difficult to process, so this period is also known as the unstructured data stage.

3. Big data mature stage (2012-2016)

With the publication of the book "Big Data Era" in 2012, the concept of "big data", riding the wave of the Internet, came to play a pivotal role in all walks of life. In 2013, big data technology began to penetrate business, technology, medical care, government, education, economy, transportation, logistics and other fields of society. 2013 is therefore also known as the first year of big data, and the era of big data quietly began.

4. Big data application stage (2017-2022)

Since 2017, big data has penetrated all aspects of people's lives. Driven by policies, regulations, technology, applications and other factors, the big data industry has entered a period of explosive development. At least 21 big data administration bodies have been established in 13 provinces across the country. At the same time, big data has become a popular major in colleges and universities, with 293 schools applying to offer undergraduate majors in data science and big data technology. In recent years the scale of data has grown exponentially. According to a report by the International Data Corporation (IDC), an international information technology consulting company, global data reserves reached 44 ZB in 2020 and will reach 2500 ZB by 2030. As a country with a large population and a large manufacturing sector, China has enormous data-generating capacity and extremely rich big data resources. It was expected that by 2020 China's total data volume would reach 8000 EB, accounting for 21% of the global total, making it one of the leading countries in data resources and a global data center. According to relevant statistics, as of the first half of 2019, 82 provincial, sub-provincial and prefecture-level governments in China had launched open data platforms, covering 41.93% of provincial administrative regions, 66.67% of sub-provincial cities and 18.55% of prefecture-level cities.

5.1.3   The relationship between big data, cloud computing and artificial intelligence technology

The big data industry is booming at a speed beyond our imagination. With the help of big data, cloud computing and artificial intelligence have also entered our field of vision. The three of them are inseparable and mutually influencing each other.

1. The concept of big data

Big data, or huge amounts of data, refers to massive, high-growth and diverse information assets that require new processing models in order to provide stronger decision-making power, insight and process optimization capabilities. In short, big data technology is the ability to obtain valuable information quickly from many types of data. Understanding this is critical, and it is what gives the technology the potential to reach so many businesses.

The era of big data has arrived, and it will set off huge waves of change in many fields. But we must calmly see that the core of big data is to mine the value contained in the data for customers, rather than the stacking of software and hardware. Therefore, research on big data application models and business models in different fields will be the key to the healthy development of the big data industry. We believe that with the overall planning and support of the state, through local governments formulating big data industry development strategies according to local conditions, and through the active participation of domestic and foreign leading IT companies and many innovative companies, the future development prospects of the big data industry are very broad.

2. The concept of cloud computing

Cloud computing is a model for adding, using and delivering Internet-based services. It provides available, convenient, on-demand network access to a configurable shared pool of computing resources (networks, servers, storage, application software and services); these resources can be provisioned quickly with minimal management effort or minimal interaction with the service provider. It typically involves providing dynamically scalable and often virtualized resources over the Internet. "Cloud" is a metaphor for the network and the Internet: in the past the cloud was often used in diagrams to represent telecommunications networks, and later it came to represent the abstraction of the Internet and its underlying infrastructure. Cloud computing can even give you access to 10 trillion calculations per second; with such computing power you can simulate nuclear explosions and predict climate change and market trends. Users access the data center through computers, laptops, mobile phones and other devices, and perform calculations according to their own needs.

3. The concept of artificial intelligence

Artificial intelligence is the study of using computers to simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning). It mainly covers the principles by which computers realize intelligence, the building of computers with intelligence similar to the human brain, and enabling computers to achieve higher-level applications. Artificial intelligence involves computer science, psychology, philosophy, linguistics and other disciplines; it touches almost all disciplines of natural and social science, and its scope goes far beyond computer science itself. The relationship between artificial intelligence and the science of thinking is that of practice to theory: artificial intelligence sits at the technical application level of the science of thinking and is an applied branch of it.

From the perspective of thinking, artificial intelligence is not limited to logical thinking; only by also considering image-based thinking and inspirational thinking can breakthrough development of artificial intelligence be promoted. Mathematics is often regarded as the foundational science of many disciplines, and it has also entered the fields of language and thinking; the discipline of artificial intelligence must likewise borrow mathematical tools. Mathematics plays a role not only in standard logic and fuzzy mathematics: when mathematics enters the discipline of artificial intelligence, the two promote each other and develop faster.

Consider now the relationship between big data, cloud computing and artificial intelligence. The Internet of Things is an application extension of the Internet; rather than a network, the Internet of Things is better understood as business and applications. Application innovation is therefore the core of the development of the Internet of Things, and innovation centered on user experience is its soul. Cloud computing is like the brain and nerve center of the Internet of Things: it is an Internet-based model for the addition, use and delivery of related services that typically involves providing dynamically scalable and often virtualized resources over the Internet. Big data is like the vast knowledge the human brain memorizes and stores from elementary school through university; this knowledge creates greater value only through digestion, absorption and reconstruction. Artificial intelligence, by analogy, is a person who absorbs the vast store of human knowledge (data) and, through continuous deep learning, evolves into an expert. Artificial intelligence is inseparable from big data, and it relies on cloud computing platforms to complete the evolution of deep learning.

A simple summary: massive data is generated and collected through the Internet of Things and stored on cloud platforms, then analyzed with big data techniques and even higher forms of artificial intelligence to provide better services for human production and life. This will surely be the direction of evolution of the fourth industrial revolution.

4. Cloud computing and big data

From a technical point of view, big data and cloud computing are as inseparable as the two sides of a coin. Big data cannot be processed by a single computer; a distributed computing architecture must be adopted. Its essence lies in mining massive data, but that depends on the distributed processing, distributed databases, cloud storage and virtualization technologies of cloud computing.

5. Artificial intelligence and big data

If we think of artificial intelligence as a baby with unlimited potential that is waiting to be fed, the massive and deep data of expertise in a certain field is the milk powder that feeds this genius. The quantity of milk powder determines whether the baby can grow up, while the quality of milk powder determines the baby's subsequent intellectual development level.

Compared with many earlier data analysis technologies, artificial intelligence technology is based on neural networks, which have since developed into multi-layer neural networks that enable deep machine learning. Compared with traditional algorithms, this approach makes no superfluous assumptions (for example, linear modeling requires assuming a linear relationship in the data); instead it uses the input data itself to build the corresponding model structure. This characteristic makes it more flexible and able to self-optimize according to different training data.
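To make the idea concrete, here is a minimal sketch (assuming scikit-learn is installed) of a small multi-layer neural network learning a non-linear, XOR-style relationship directly from data, with no linearity assumption; the layer sizes and toy data are illustrative choices, not a recommended setup.

```python
# Minimal sketch: a two-hidden-layer network learns a non-linear (XOR-like)
# pattern purely from the input data, with no linearity assumption.
# Assumes scikit-learn is available; data and layer sizes are illustrative.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 50   # repeated toy samples
y = [0, 1, 1, 0] * 50                        # XOR labels: not linearly separable

model = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=2000, random_state=0)
model.fit(X, y)                              # the model structure is shaped by the data
print(model.predict([[0, 1], [1, 1]]))       # expected: [1 0]
```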

But this obvious advantage brings a significant increase in the amount of computation. Before the breakthroughs in computer computing power, such algorithms had almost no practical value. About ten years ago, we tried to use a neural network to process a data set that was not even massive, and after waiting three days we still had no guarantee of a result. Today the situation is very different: high-speed parallel computing, massive data and better-optimized algorithms have jointly produced the breakthrough in artificial intelligence. Looking back thirty years from now, this breakthrough will be seen as a technology with an impact on mankind as profound as the Internet's; the power it unleashes will once again completely change our lives.

6. Artificial intelligence and cloud computing

Artificial intelligence is the product of combining program algorithms with big data. Cloud computing supplies the computation for the algorithms, while the Internet of Things forms part of the root system that collects big data. It can be thought of simply as: artificial intelligence = cloud computing + big data (part of which comes from the Internet of Things). As the Internet of Things spreads through daily life, it will become the largest and most accurate source of big data.

Now that we have entered the era of big data, cloud computing, and artificial intelligence, we must understand their essence, seize opportunities, keep up with trends, and innovate and develop in order to remain invincible in the high-tech development tide.

5.2 Key technologies of big data

5.2.1 Basic functions of Hadoop, MapReduce, NoSQL and other technologies

1、Hadoop
1) Overview of Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without understanding the underlying details of distribution, making full use of the power of clusters for high-speed computing and storage. Hadoop implements a distributed file system, one of whose components is HDFS. HDFS has high fault tolerance and is designed to be deployed on low-cost hardware; it provides high throughput for accessing application data, making it suitable for applications with very large data sets. HDFS relaxes some POSIX requirements and allows streaming access to data in the file system. The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive data, while MapReduce provides computation over massive data.

Hadoop is a distributed computing platform that allows users to easily build and use it. Users can easily develop and run applications that handle massive amounts of data on Hadoop. It mainly has the following advantages:

① High reliability: Hadoop’s ability to store and process data bit by bit is worthy of people’s trust.

② High scalability: Hadoop distributes data and completes computing tasks among available computer clusters. These clusters can be easily expanded to thousands of nodes.

③ Efficiency: Hadoop can dynamically move data between nodes and ensure the dynamic balance of each node, so the processing speed is very fast.

④ High fault tolerance: Hadoop can automatically save multiple copies of data and can automatically redistribute failed tasks.

⑤ Low cost: compared with all-in-one machines, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, so the software cost of a project is greatly reduced.

Hadoop comes with a framework written in the Java language, so it is ideal to run on Linux production platforms. Applications on Hadoop can also be written in other languages, such as C++.

2) Hadoop principle and operating mechanism

The core of Hadoop consists of three sub-projects: Hadoop Common, HDFS and MapReduce. Hadoop is made up of many elements. At the very bottom is the Hadoop Distributed File System (HDFS), which stores files on all storage nodes in the Hadoop cluster. Above HDFS sits the MapReduce engine, which consists of JobTrackers and TaskTrackers. Introducing the core distributed file system HDFS of the Hadoop distributed computing platform, the MapReduce processing flow, the data warehouse tool Hive and the distributed database HBase basically covers the technical core of the Hadoop distributed platform.

① Hadoop Common (the Hadoop common service module) is the lowest-level module of the Hadoop system; it provides the APIs needed to develop the various Hadoop sub-projects. In Hadoop 0.20 and earlier versions, Hadoop Common included HDFS, MapReduce and other common project content. From Hadoop 0.21 onwards, HDFS and MapReduce were separated into independent sub-projects, and the remaining content became Hadoop Common, such as the system configuration tool Configuration, remote procedure call (RPC), the serialization mechanism, and so on.

② HDFS (Hadoop Distributed File System) is an open-source distributed file system similar to Google's GFS and is the basis for data storage management in the Hadoop system. Built on the local Linux file systems of physically distributed data storage nodes, it provides a scalable, highly reliable and highly available large-scale distributed storage management system, presenting upper-layer applications with a logically unified, large-scale data storage file system.

③ MapReduce (parallel computing framework) is a computing model for processing large amounts of data. Map performs a specified operation on each independent element of the data set and generates intermediate results in the form of key-value pairs; Reduce then reduces all "values" that share the same "key" in the intermediate results to obtain the final result. This kind of functional division is very well suited to a distributed parallel environment composed of a large number of computers.
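As an illustration of this division of labor, here is a minimal single-process Python sketch of the classic word-count job; a real Hadoop job distributes the same three phases (map, shuffle, reduce) across the cluster, and the sample data here is made up.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) key-value pair for every word in every record."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group together all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single result."""
    for key, values in grouped:
        yield key, sum(values)

data = ["big data needs big storage", "big data needs fast processing"]
for word, count in reduce_phase(shuffle(map_phase(data))):
    print(word, count)
```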

2、MapReduce
1) Overview of MapReduce

MapReduce mainly consists of four parts: Client, JobTracker, TaskTracker and Task:

① JobTracker is responsible for resource monitoring and job scheduling. JobTracker monitors the health of all TaskTrackers and jobs; once a failure is detected, it transfers the corresponding tasks to other nodes. At the same time, JobTracker tracks task execution progress, resource usage and other information and passes it to the task scheduler, which selects appropriate tasks to use these resources when they become idle. In Hadoop, the task scheduler is a pluggable module, and users can design their own scheduler according to their needs;

② TaskTracker periodically reports the resource usage and task progress on its node to JobTracker through a heartbeat, and at the same time receives commands sent by JobTracker and performs the corresponding operations (such as starting new tasks or ending tasks). TaskTracker divides the resources on its node into equal units called "slots"; a slot represents computing resources (CPU, memory, etc.). A Task gets a chance to run only after it obtains a slot, and the role of the Hadoop scheduler is to allocate the free slots on each TaskTracker to Tasks. Slots come in two types, Map slots and Reduce slots, used by MapTasks and ReduceTasks respectively. TaskTracker limits the concurrency of Tasks through the number of slots (a configurable parameter).

③ Tasks are divided into MapTask and ReduceTask, both of which are started by TaskTracker. HDFS stores data in fixed-size blocks as its basic unit, whereas the processing unit of MapReduce is the split. A split is a logical concept that contains only metadata, such as the starting position of the data, its length and the nodes where it resides; how splits are divided is decided entirely by the user. Note that the number of splits determines the number of MapTasks, because each split is handed to exactly one MapTask for processing.

2) Features of MapReduce
① Scalability

The scalability of MapReduce includes data scalability and computing scale scalability. The data scalability of MapReduce is reflected in its continuous effectiveness as the data scale increases; the scalability of the computing scale is reflected in its near-linear growth as the number of nodes in the cluster increases.

② High fault tolerance

The high fault tolerance of MapReduce is reflected in the ability to dynamically add or remove computing nodes, truly realizing elastic computing. The framework uses a variety of effective mechanisms to improve fault tolerance, such as automatic node restart: when a node in the cluster fails, the failed node can be effectively detected and recovered.

③ Efficiency

MapReduce uses data/code co-location to reduce data communication within the cluster and thereby save network bandwidth. Each computing node processes locally stored data as far as possible; only when a node cannot process its local data is the proximity principle used to find another available computing node and transmit the data to it.

④ Dynamic and flexible resource allocation and scheduling

MapReduce divides submitted computing jobs into multiple computing tasks. The task scheduling function is mainly responsible for allocating and scheduling computing nodes (Map nodes and reduce nodes) for these divided computing tasks. It is also responsible for optimizing some computing performance, such as supporting Job scheduling priority and task preemption, etc.

⑤ Hide underlying details

The biggest advantage of MapReduce is that it encapsulates the underlying implementation details, reducing the many detailed issues that programmers need to consider when processing large-scale data, such as data distribution storage management, data distribution, data communication and synchronization, etc., effectively reducing the difficulty of parallel programming. Programmers can be liberated from underlying details and only need to care about the algorithm design of the application layer itself, thereby improving programmers' programming efficiency in a distributed programming environment.

3、NoSQL

NoSQL generally refers to non-relational databases. With the rise of Web 2.0 websites, traditional relational databases have struggled to cope with Web 2.0 sites, especially ultra-large-scale, highly concurrent SNS-type, purely dynamic Web 2.0 sites, and many insurmountable problems have arisen. Non-relational databases, by virtue of their own characteristics, have developed very rapidly. NoSQL databases were created to solve the challenges brought by large-scale data collections and multiple data types, especially big data application problems.

1) The basic meaning of NoSQL

The most common explanation of NoSQL is "non-relational"; "Not Only SQL" is also widely accepted. NoSQL is only a concept and refers generally to non-relational databases. Unlike relational databases, they do not guarantee the ACID properties of relational data. NoSQL is a revolutionary database movement whose advocates promote non-relational data storage; against the overwhelming dominance of relational databases, this concept is undoubtedly an injection of new thinking.

NoSQL has the following advantages. It is easy to scale: there are many kinds of NoSQL databases, but a common feature is that they remove the relational characteristics of relational databases; with no relationships between data items, they are very easy to scale out, which also brings scalability at the architectural level. Large data volumes and high performance: NoSQL databases have very high read and write performance, especially with large data volumes, thanks to their non-relational nature and the simple structure of the database.

2) Classification of NoSQL
① Key-Value storage database

This type of database mainly uses a hash table, with a specific key and a pointer to the corresponding data. The advantage of the key/value model for IT systems is that it is simple and easy to deploy. But if the database administrator (DBA) only wants to query or update part of a value, the key/value model becomes inefficient. Examples: Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB.
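A minimal access sketch with the redis-py client, assuming a Redis server is running locally on the default port; the key name is an illustrative assumption.

```python
import redis

# Key-value access: the key maps straight to a value via a hash table.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.set("user:1001:name", "Alice")        # write: key -> value
print(r.get("user:1001:name"))          # read back by key: "Alice"
```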

② Column storage database

This kind of database is usually used to handle massive data stored in distributed systems. Keys still exist, but they are characterized by pointing to multiple columns, which are arranged by column family. Examples: Cassandra, HBase, Riak.

③ Document database

The document database was inspired by the Lotus Notes office software and is similar to the first type, the key-value store. Its data model is versioned documents: semi-structured documents stored in a specific format, such as JSON. Document databases can be seen as an upgraded version of key-value databases that allows values to be nested. When processing complex data such as web pages, document databases offer higher query efficiency than traditional key-value databases. Examples: CouchDB, MongoDB. There is also a Chinese document database, SequoiaDB, which has been open-sourced.
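A minimal document-store sketch with pymongo, assuming a MongoDB instance is running locally; the database, collection and document contents are illustrative assumptions. Note the nested, schema-free document and the query on a nested field, which a flat key-value store cannot express directly.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
pages = client["demo"]["web_pages"]

pages.insert_one({
    "url": "https://example.com",
    "title": "Example",
    "meta": {"lang": "en", "tags": ["demo", "nosql"]},   # nested values allowed
})
print(pages.find_one({"meta.lang": "en"})["title"])      # query on a nested field
```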

④ Graph database

Graph databases differ from the rigid, row-and-column structure of SQL databases: they use a flexible graph model that can be scaled across multiple servers. NoSQL databases have no standard query language (SQL), so database queries require a data model. Many NoSQL databases provide RESTful data interfaces or query APIs. Examples: Neo4j, InfoGrid, Infinite Graph.

3) Characteristics of NoSQL

There is no clear scope or definition for NoSQL, but they all have the following common characteristics:

① Easy to expand

There are many types of NoSQL databases, but a common feature is that they remove the relational characteristics of relational databases. With no relationships between data items, they are very easy to scale out, which in turn brings scalability at the architectural level.

② Large data volume, high performance

NoSQL databases have very high read and write performance, especially when dealing with large amounts of data, thanks to their non-relational nature and simple structure. MySQL generally uses a query cache, whereas NoSQL caches are record-level, fine-grained caches, so NoSQL performance at this level is much higher.

③ Flexible data model

NoSQL does not need to create fields for the data to be stored in advance, and can store customized data formats at any time. In a relational database, adding and deleting fields is a very troublesome thing. If it is a table with a very large amount of data, adding fields is simply a nightmare. This is especially obvious in the era of Web 2.0 with large amounts of data.

④ High availability

NoSQL can easily implement a highly available architecture without affecting performance. For example, Cassandra and HBase can achieve high availability through data replication.

4) NoSQL system framework

The overall framework of NoSQL is divided into four layers. From bottom to top, these are the data persistence layer, the data distribution layer (data distribution model), the data logical model layer and the interface layer. The layers complement one another and work in coordination.

The data persistence layer defines the storage form of data, which mainly includes four forms: hard disk, memory, interface and customized pluggable storage. Memory-based data access is the fastest but may lose data; hard-disk-based storage can persist for a long time, but access is slower than with memory. Combining memory and hard disk combines the advantages of both, ensuring speed while guarding against data loss. Customized pluggability ensures high flexibility in data access.

The data distribution layer defines how data is distributed. Compared with relational databases, NoSQL has more optional mechanisms, which mainly come in three forms: First, CAP support, which can be used for horizontal expansion. The second is multi-data center support, which can ensure smooth operation across multiple data centers. The third is dynamic deployment support, which can dynamically add or delete nodes in a running cluster.

The data logical model layer describes the logical expression of data. Compared with relational databases, NoSQL is very flexible in logical expression, with four main forms. The first is the key-value model, which is simple in expression but highly scalable. The second is the column model, which supports more complex data than the key-value model but is less scalable. The third is the document model, which has great advantages in both supporting complex data and scalability. The fourth is the graph model, which has fewer usage scenarios and is usually customized for data with a graph structure.

The interface layer provides a convenient data calling interface for upper-layer applications, providing far more choices than relational databases. The interface layer provides five options: Rest, Thrift, Map/Reduce, Get/Put, and language-specific API, making the interaction between applications and databases more convenient.

The NoSQL layered architecture does not mean that each product has only one choice at each layer. On the contrary, this layered design provides great flexibility and compatibility, and each database can support multiple features at different levels.

4. Open source NoSQL database software

Open source NoSQL database software includes Membase and MongoDB.

5.2.2   Basic concepts of crawlers, cleaning and other technologies and knowledge of commonly used tools

1. Crawler
1) Overview of crawlers

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other less commonly used names include ant, automatic indexer, emulator and worm.

A web crawler is a program that automatically extracts web pages; it downloads pages from the World Wide Web for a search engine and is an important component of search engines. A traditional crawler starts from the URLs of one or several initial web pages and obtains the URLs on those pages; as it crawls, it continuously extracts new URLs from the current page and puts them into a queue until certain stopping conditions of the system are met. The workflow of a focused crawler is more complex: it filters out links unrelated to the topic according to a web page analysis algorithm, keeps the useful links and puts them into the URL queue waiting to be crawled. It then selects the next URL to crawl from the queue according to a search strategy and repeats the process until a stopping condition is reached. In addition, all web pages crawled are stored by the system and subjected to analysis, filtering and indexing for later query and retrieval; for a focused crawler, the analysis results obtained in this process may also provide feedback and guidance for subsequent crawling.

Compared with general web crawlers, focused crawlers also need to solve three main problems:

① Description or definition of the crawling target;

② Analysis and filtering of web pages or data;

③ Search strategy for URL.

2) Classification of crawlers

According to system structure and implementation technology, web crawlers can be roughly divided into the following types: general purpose web crawlers, focused web crawlers, incremental web crawlers and deep web crawlers. Actual web crawler systems are usually implemented by combining several of these crawler techniques.

The description and definition of crawling targets are the basis for determining how to formulate web page analysis algorithms and URL search strategies. The web page analysis algorithm and candidate URL sorting algorithm are the key to determining the service form provided by the search engine and the crawler web page crawling behavior. The algorithms of these two parts are closely related.

Existing focused crawler descriptions of crawling targets are divided into three types: based on target web page features, based on target data patterns, and based on domain concepts.

① Based on the characteristics of the target web page

The objects crawled, stored and indexed by crawlers based on the characteristics of the target web page are generally websites or web pages. According to the method of obtaining seed samples, they can be divided into:

Ⅰ Pre-given initial crawling seed sample;

Ⅱ Pre-given web page categories and seed samples corresponding to the categories, such as Yahoo! Classification structure, etc.;

Ⅲ Examples of crawling targets determined by user behavior are divided into:

a) Crawl samples marked by the user during browsing;

b) Obtain access patterns and related samples through user log mining.

Among them, the webpage characteristics can be the content characteristics of the webpage, or the link structure characteristics of the webpage, etc.

② Based on target data model

Crawler based on the target data schema targets the data on the web page. The captured data generally must conform to a certain schema, or can be converted or mapped to the target data schema.

③ Based on domain concepts

Another way to describe is to establish an ontology or dictionary of the target domain, which is used to analyze the importance of different features in a certain topic from a semantic perspective.

3) Crawler web search strategy

Web crawling strategies can be divided into three types: depth first, breadth first and best first. Depth priority will often lead to crawler trapping problems. Currently, breadth priority and best priority methods are commonly used.

① Breadth first search

The breadth-first search strategy means that during the crawling process, the next level of search is performed only after the current level of search is completed. The design and implementation of this algorithm is relatively simple. At present, in order to cover as many web pages as possible, the breadth-first search method is generally used. There are also many studies that apply the breadth-first search strategy to focused crawlers. The basic idea is that web pages within a certain link distance from the initial URL have a high probability of topic relevance. Another method is to combine breadth-first search with web filtering technology. First, use the breadth-first strategy to crawl web pages, and then filter out irrelevant web pages. The disadvantage of these methods is that as the number of crawled web pages increases, a large number of irrelevant web pages will be downloaded and filtered, and the efficiency of the algorithm will become low.
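A minimal breadth-first crawling sketch in Python using requests and BeautifulSoup; the seed URL, depth limit and page cap are illustrative assumptions, and a real crawler would also respect robots.txt, de-duplicate more carefully and throttle its requests.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def bfs_crawl(seed, max_depth=2, max_pages=50):
    visited, queue = set(), deque([(seed, 0)])   # FIFO queue => crawl level by level
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            queue.append((urljoin(url, a["href"]), depth + 1))
    return visited

print(bfs_crawl("https://example.com"))
```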

② Best first search

The best-first search strategy uses a web page analysis algorithm to predict the similarity of a candidate URL to the target page, or its relevance to the topic, and selects one or several of the best-evaluated URLs to crawl. It only visits pages that the analysis algorithm predicts to be "useful". One problem is that many relevant pages on the crawl path may be ignored, because best-first is a locally optimal search algorithm. It therefore needs to be improved in combination with the specific application so that it can escape local optima; this is discussed in detail in Section 4 together with web page analysis algorithms. Research shows that such closed-loop adjustment can reduce the number of irrelevant pages by 30% to 90%.

③ Depth first search

The depth-first search strategy starts from the starting page, selects one URL to enter, analyzes the URLs on that page, selects one and enters again, following links in this way until one route is fully processed before moving on to the next. The depth-first strategy is simple to design. However, the links provided by portal sites are often the most valuable and have high PageRank, and with every deeper level the value and PageRank of pages decrease accordingly. This implies that important pages are usually close to the seeds, while pages crawled too deep have low value. At the same time, the crawl depth directly affects the hit rate and efficiency, so choosing the depth is the key to this strategy. Compared with the other two strategies, this strategy is rarely used.

4) Crawler web page analysis algorithm

Web page analysis algorithms can be summarized into three types: based on network topology, based on web page content, and based on user access behavior.

① Topology analysis algorithm

Based on the links between web pages, an algorithm is used to evaluate the items (which can be web pages or websites, etc.) that have direct or indirect link relationships with known web pages or data. It is divided into three types: web page granularity, website granularity and web page block granularity.

② Web page content analysis algorithm

Content-based analysis algorithms evaluate web pages using features of the page content (text, data and other resources). Web page content has evolved from being mainly hypertext to being dominated by dynamic page (or hidden-web) data; the volume of the latter is about 400-500 times that of the directly visible page data (PIW, Publicly Indexable Web). Meanwhile, multimedia data, Web Services and other forms of network resources are becoming increasingly rich. Content-based analysis algorithms have therefore developed from the originally relatively simple text retrieval methods into comprehensive applications covering web page data extraction, machine learning, data mining, semantic understanding and other methods. According to the different forms of web page data, this section divides content-based analysis algorithms into three categories: the first targets unstructured or very simple pages dominated by text and hyperlinks; the second targets pages dynamically generated from structured data sources (such as an RDBMS), whose data cannot be accessed directly in bulk; the third targets data that sits between the first two, with good structure, display that follows a certain pattern or style, and direct accessibility.

Text-based web page analysis algorithms are divided into:

Ⅰ Pure text classification and clustering algorithm

It borrows technology from text retrieval to a large extent. Text analysis algorithms can quickly and effectively classify and cluster web pages, but because they ignore structural information between and within web pages, they are rarely used alone.

Ⅱ Hypertext classification and clustering algorithms

Web pages are classified according to the related types of the web pages linked to the web pages, and the type of the web pages is inferred based on the associated web pages.

5) Composition of web crawler
① Controller

The controller is the central controller of the web crawler. It is mainly responsible for allocating a thread according to the URL link passed by the system, and then starting the thread to call the crawler to crawl the web page.

② Parser

The parser is the main working component of the web crawler. Its main tasks include downloading web pages and processing their text, for example filtering content, extracting specific HTML tags and analyzing data.

③ Resource library

The resource library is mainly a container for storing the data records downloaded from web pages and provides the target source for generating indexes. Medium and large database products include Oracle, SQL Server, etc.

6) Summary of commonly used crawler tools
① Page downloader

Requests, Scrapy, Selenium + Chrome or PhantomJS (for crawling dynamic web pages), Splash (for crawling dynamic web pages).

② Page interpreter

BeautifulSoup (entry level), pyquery (similar to jQuery), lxml, parsel, Scrapy's Selector (strongly recommended; a higher-level wrapper based on parsel).

③ Data storage

txt text, csv file, sqlite3 (comes with python), MySQL, MongoDB.

④ Other tools

execjs (executes JavaScript), PyV8 (executes JavaScript), html5lib.
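The tools above fit together as download → parse → store. A minimal end-to-end sketch, assuming requests and BeautifulSoup are installed; the URL and the table layout are illustrative assumptions.

```python
import sqlite3

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com", timeout=5).text    # page downloader
title = BeautifulSoup(html, "html.parser").title.string       # page interpreter

conn = sqlite3.connect("crawl.db")                             # data storage
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")
conn.execute("INSERT INTO pages VALUES (?, ?)", ("https://example.com", title))
conn.commit()
conn.close()
print("stored:", title)
```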

2. Data cleaning
1) Basic concepts of data cleaning
① Data cleaning

Data cleaning is the process of re-examining and verifying data, with the purpose of removing duplicate information, correcting existing errors, and providing data consistency.

As the name suggests, data cleaning can be seen as "washing out the dirt". It is the last procedure for discovering and correcting identifiable errors in data files, including checking data consistency and handling invalid and missing values. Because the data in a data warehouse is a collection of data oriented to a certain subject, extracted from multiple business systems and containing historical data, it is inevitable that some of the data is wrong and some records conflict with one another. Such erroneous or conflicting data is obviously unwanted and is called "dirty data". We need to "wash out" the dirty data according to certain rules; this is data cleaning. The task of data cleaning is to filter out the data that does not meet requirements and hand the filtered results to the business department, which confirms whether the data should be discarded or corrected by the business unit and then re-extracted. Data that does not meet requirements falls mainly into three categories: incomplete data, erroneous data and duplicate data. Data cleaning differs from questionnaire review in that cleaning after data entry is generally done by computers rather than by hand.

② Consistency check

Consistency check is to check whether the data meets the requirements based on the reasonable value range and mutual relationship of each variable, and to find data that is beyond the normal range, logically unreasonable or contradictory.

③ Handling of invalid values ​​and missing values

Due to errors in surveying, coding and data entry, there may be invalid values and missing values in the data, which need to be handled appropriately. Commonly used methods include estimation, whole-case deletion, variable deletion and pairwise deletion.

2) Principle of data cleaning

Principle of data cleaning: Use relevant technologies such as mathematical statistics, data mining or predefined cleaning rules to transform dirty data into data that meets data quality requirements.

3) Main types of data cleaning
① Incomplete data

This type of data mainly lacks information that should be present, such as a missing supplier name, branch name or customer region, or main tables and detail tables in the business system that cannot be matched. Such data is filtered out, and the missing content is written to separate Excel files and submitted to the customer, who is required to complete it within a specified time. Once completed, the data is written into the data warehouse.

② Wrong data

This type of error arises because the business system is not robust enough and writes input directly into the back-end database without validation: for example, numeric data entered as full-width characters, string data followed by a carriage return, incorrect date formats, out-of-range dates, and so on. Such data must also be classified. Problems such as full-width characters and invisible characters before and after data can only be found with SQL statements; the customer is then asked to correct them in the business system before re-extraction. Errors such as incorrect date formats or out-of-range dates cause ETL runs to fail; they need to be picked out from the business system database with SQL and submitted to the business department for correction within a time limit, after which the data is extracted again.

③ Duplicate data

For this type of data, especially in dimension tables, export all fields of duplicate data records and let customers confirm and organize them.

Data cleaning is an iterative process that cannot be completed in a few days; problems can only be discovered and solved continuously. Whether to filter or to correct generally requires customer confirmation, and the filtered data is written to Excel files or to data tables. In the early stage of ETL development, e-mails containing the filtered data can be sent to the business units every day to urge them to correct the errors as soon as possible, and these also serve as a basis for verifying the data in the future. When cleaning data, take care not to filter out useful data: carefully verify every filtering rule and ask the user to confirm it.

4) Data cleaning methods

Generally speaking, data cleaning is the process of streamlining a database to remove duplicate records and convert the remainder into a standard acceptable format. The standard model for data cleaning is to input data into a data cleaning processor, "clean" the data through a series of steps, and then output the cleaned data in the desired format. Data cleaning deals with issues such as missing values, out-of-bounds values, inconsistent codes, and duplicate data from the aspects of data accuracy, completeness, consistency, uniqueness, timeliness, and validity.

Data cleaning is generally targeted at specific applications, so it is difficult to summarize unified methods and steps. However, corresponding data cleaning methods can be given based on different data.

① Methods to solve incomplete data (i.e. missing values)

In most cases, missing values must be filled in manually (i.e., cleaned by hand). Some missing values can be derived from the same data source or from other data sources, and missing values can also be replaced with the mean, maximum, minimum or a more complex probability estimate to achieve the purpose of cleaning.
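A minimal sketch of these options with pandas, assuming the library is available; the column names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "city": ["SH", "BJ", None, "SZ"]})
df["age"] = df["age"].fillna(df["age"].mean())   # estimation: replace with the average
df = df.dropna()                                 # whole-case deletion for the rest
print(df)
```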

② Detection and solution of error values

Use statistical analysis methods to identify possible erroneous values or outliers, such as deviation analysis or identifying values that do not follow a distribution or regression equation. A simple rule base (common-sense rules, business-specific rules, etc.) can also be used to check data values, or constraints between different attributes and external data can be used to detect and clean the data.
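A minimal sketch combining a statistical check (deviation/z-score analysis) with a common-sense rule, assuming pandas is available; the threshold and column name are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"amount": [102, 98, 105, 9999, 101, -3]})
z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
suspect = df[(z.abs() > 2) | (df["amount"] < 0)]   # statistical + rule-based checks
print(suspect)                                     # flags 9999 and -3 for review
```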

③ How to detect and eliminate duplicate records

Records with the same attribute values in the database are considered duplicate records. Whether records are equal is detected by judging whether their attribute values are equal, and equal records are merged into one (merge/purge). Merge/purge is the basic method of eliminating duplicates.
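A minimal merge/purge sketch with pandas, assuming the library is available; treating "name" and "phone" as the matching attributes is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Alice", "Alice", "Bob"],
    "phone": ["123",   "123",   "456"],
    "city":  ["SH",    "SH",    "BJ"],
})
deduped = df.drop_duplicates(subset=["name", "phone"], keep="first")  # merge equal records
print(deduped)
```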

④ Detection and resolution of inconsistencies (within and between data sources)

Data integrated from multiple data sources may have semantic conflicts. Integrity constraints can be defined to detect inconsistencies, and connections can be discovered by analyzing the data to make the data consistent.

5) Data cleaning tools

The data cleaning tools developed can be roughly divided into three categories:

① Data migration tools allow simple conversion rules to be specified, such as replacing the string "gender" with "sex". PrismWarehouse is a popular tool in this category.

② Data cleaning tools use domain-specific knowledge (such as postal addresses) to clean data. They usually use parsing and fuzzy matching techniques to clean data from multiple data sources, and some tools can indicate the "relative cleanliness" of a source. Integrity and Trillium are tools of this kind.

③ Data audit tools can discover patterns and connections by scanning data. Therefore, such tools can be seen as variations of data mining tools.

5.2.3   Basic concepts such as big data analysis, mining, and visualization technology

1. Big data analysis
1) Overview of big data analysis

Big data analysis refers to the analysis of huge amounts of data. Big data can be summarized as five V's: Volume, Velocity, Variety, Value, and Veracity.

Big data is one of the hottest terms in today's IT industry, and the subsequent exploitation of its commercial value, such as data warehousing, data security, data analysis and data mining, has gradually become the profit focus that industry professionals are eager to pursue. With the advent of the big data era, big data analysis has also emerged.

2) Steps of big data analysis
① Analytic Visualizations

Whether for data analysis experts or ordinary users, data visualization is the most basic requirement for data analysis tools. Visualization can display data intuitively, let the data speak for itself, and let the audience hear the results.

② Data Mining Algorithms

Visualization is for people to see, while data mining is for machines to see. Clustering, segmentation, outlier analysis, and other algorithms allow us to dig deep into the data and discover value. These algorithms must handle not only the volume but also the speed of big data.
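As a small illustration of one such algorithm, here is a k-means clustering sketch with scikit-learn; the toy 2-D points and the choice of two clusters are illustrative assumptions.

```python
from sklearn.cluster import KMeans

points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster assignment of each point
print(model.cluster_centers_)  # discovered group centers
```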

③ Predictive Analytic Capabilities

Data mining allows analysts to better understand data, while predictive analysis allows analysts to make some predictive judgments based on the results of visual analysis and data mining.

④ Semantic Engines

We know that the diversity of unstructured data brings new challenges to data analysis, and we need a series of tools to parse, extract, and analyze data. Semantic engines need to be designed to intelligently extract information from "documents".

⑤ Data Quality and Master Data Management

Data quality and data management are some of the best practices in management. Processing data through standardized processes and tools ensures a predefined, high-quality analysis result.

If big data is really the next important technological innovation, we'd better focus on the benefits that big data can bring us, not just the challenges.

⑥ Data storage and data warehouses

A data warehouse is a relational database established to facilitate multi-dimensional analysis and multi-angle display of data, storing data according to a specific schema. In the design of a business intelligence system, building the data warehouse is the key task and the foundation of the system. It integrates data from the business systems, provides data extraction, transformation and loading (ETL) for the business intelligence system, organizes the data by analysis theme, and supports querying and access, providing a data platform for online data analysis and data mining.

2. Data mining
1) Overview of data mining

Data mining refers to the process of searching large amounts of data for the information hidden in them by means of algorithms. Data mining is closely associated with computer science and achieves this goal through methods such as statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past rules of thumb) and pattern recognition.

The object of data mining can be any type of data source. It can be a relational database containing structured data, or it can be a data warehouse, text, multimedia data, spatial data, time-series data or Web data, that is, data sources containing semi-structured or even heterogeneous data. The methods for discovering knowledge can be numerical, non-numerical or inductive, and the knowledge finally discovered can be used for information management, query optimization, decision support and maintenance of the data itself.

2) Steps of data mining

Before carrying out data mining, it is necessary to decide what steps to take, what to do at each step, and what goals each step must achieve. Only with a good plan can data mining proceed in an orderly manner and succeed. Many software vendors and data mining consulting companies provide data mining process models to guide their users step by step through data mining work, for example SPSS's 5A and SAS's SEMMA.

The data mining process model steps mainly include defining problems, establishing data mining libraries, analyzing data, preparing data, building models, evaluating models and implementation. Let's take a closer look at the specific content of each step:

① Define the problem

The first and most important requirement before starting knowledge discovery is to understand the data and the business problem. You must have a clear definition of your goal, that is, decide what you want to do. For example, when you want to improve the utilization of your e-mail, what you want to do may be to "increase the user utilization rate" or to "increase the value of a single use". The models built to solve these two problems are almost completely different, so a decision must be made.

② Establish a data mining library

Building a data mining library includes the following steps: data collection, data description, selection, data quality assessment and data cleaning, merging and integration, building metadata, loading the data mining library, and maintaining the data mining library.

③ Analyze data

The purpose of analysis is to find the data fields that have the greatest impact on the predicted output and to decide whether derived fields need to be defined. If the data set contains hundreds or thousands of fields, browsing and analyzing the data will be very time-consuming and tiring; in that case you need tool software with a good interface and powerful functions to assist you in completing these tasks.

④ Prepare data

This is the last step of data preparation before building the model. This step can be divided into four parts: selecting variables, selecting records, creating new variables, and converting variables.
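The four parts of data preparation can be sketched roughly as follows, assuming pandas is available; the customer fields and the filter thresholds are hypothetical.

    # Minimal data-preparation sketch covering the four parts named above.
    import pandas as pd

    customers = pd.DataFrame({
        "age": [25, 41, 37, 58],
        "income": [30000, 52000, 61000, 45000],
        "gender": ["F", "M", "M", "F"],
        "churned": [0, 1, 0, 1],
    })

    # Select variables: keep only fields relevant to the model
    data = customers[["age", "income", "gender", "churned"]]

    # Select records: e.g. restrict to customers under 60
    data = data[data["age"] < 60].copy()

    # Create a new (derived) variable
    data["income_per_year_of_age"] = data["income"] / data["age"]

    # Convert a variable: encode a categorical field as numbers
    data["gender_code"] = data["gender"].map({"F": 0, "M": 1})
    print(data.head())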

⑤ Build a model

Building a model is an iterative process. Different models need to be carefully examined to determine which model is most useful for the business problem faced. First use a portion of the data to build a model, and then use the remaining data to test and validate the resulting model. Sometimes there is a third data set, called the validation set, because the test set may be affected by the characteristics of the model, and an independent data set is needed to verify the accuracy of the model. Training and testing data mining models requires splitting the data into at least two parts, one for model training and the other for model testing.
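A minimal sketch of splitting data into training, test and validation sets is shown below. It assumes scikit-learn is installed and uses randomly generated data; the 60/20/20 proportions are illustrative only.

    # Split data into training, test, and validation sets.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))          # 100 samples, 4 features
    y = rng.integers(0, 2, size=100)       # binary target

    # First split off the training set, then split the rest into test/validation
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    print(len(X_train), len(X_test), len(X_val))   # 60 20 20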

⑥ Evaluate the model

After the model is built, the results obtained must be evaluated and the value of the model explained. The accuracy obtained on the test set is meaningful only for the data used to build the model; in practical applications, it is necessary to further understand the types of errors and the costs they cause. Experience has shown that an effective model is not necessarily a correct model, the direct reason being the various assumptions implicit in building the model. It is therefore important to test the model directly in the real world: apply it in a small area first, obtain test data, and then roll it out on a larger scale once the results are satisfactory.

⑦ Implementation

Once a model has been built and validated, it can be used in two main ways: one is to provide analysts with a reference, and the other is to apply the model to different data sets.

3) Analysis methods of data mining

Data mining is divided into guided data mining and unguided data mining. Guided data mining uses the available data to build a model that describes a particular attribute, while unguided data mining looks for relationships among all attributes. Specifically, classification, valuation and prediction belong to guided data mining, while association rules and clustering belong to unguided data mining.

① Classification

Classification first selects from the data a training set that has already been classified, uses data mining techniques on that training set to build a classification model, and then uses the model to classify data that has not yet been classified.
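The classification workflow just described can be sketched as follows, assuming scikit-learn is installed; the features, labels and the choice of a k-nearest-neighbors model are illustrative, not prescribed by the text.

    # Build a classifier on a labeled training set, then classify new records.
    from sklearn.neighbors import KNeighborsClassifier

    # training set: [age, income], label 0 = "low risk", 1 = "high risk"
    X_train = [[25, 30000], [40, 80000], [35, 60000], [50, 20000], [23, 15000]]
    y_train = [1, 0, 0, 1, 1]

    model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(model.predict([[30, 70000], [45, 18000]]))   # classify unclassified records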

② Valuation

Valuation is similar to classification, but the final output result of valuation is a continuous value, and the amount of valuation is not predetermined. Valuation can be used as a preparation for classification.

③ Forecast

Prediction is carried out through classification or valuation: a model is obtained by training on classification or valuation, and if the model achieves high accuracy on the test samples, it can be used to predict unknown variables in new samples.

④ Relevance grouping or association rules

The aim is to discover which things always happen together.

⑤ Clustering

It is a method of automatically finding and establishing grouping rules. It divides similar samples into a cluster by judging the similarity between samples.
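A minimal clustering sketch is shown below, assuming scikit-learn is installed; the two-dimensional points and the choice of the k-means algorithm are illustrative only.

    # Group similar samples into clusters without predefined labels.
    from sklearn.cluster import KMeans

    points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],     # one natural group
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]     # another natural group

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)            # cluster id assigned to each sample
    print(kmeans.cluster_centers_)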

4) Classic algorithms for data mining

At present, data mining algorithms mainly include neural network method, decision tree method, genetic algorithm, rough set method, fuzzy set method, association rule method, etc.

① Neural network method

The neural network method simulates the structure and function of the biological nervous system. It is a nonlinear predictive model learned through training, which treats each connection as a processing unit and attempts to simulate the function of neurons in the human brain; it can perform data mining tasks such as classification, clustering and feature mining. Learning in a neural network is mainly reflected in the modification of weights. Its advantages are resistance to interference, nonlinear learning and associative memory, and it can produce accurate predictions in complex situations. Its disadvantages are, first, that it is not suited to processing high-dimensional variables, that the intermediate learning process cannot be observed, and that its "black box" outputs are difficult to interpret; and second, that it requires a long learning time. The neural network method is mainly used in the clustering techniques of data mining.

② Decision tree method

A decision tree constructs classification rules according to the different effects of attributes on the target variable and classifies data through a series of rules; it is expressed as a flow chart with a tree-like structure. The most typical algorithm is the ID3 algorithm proposed by J. R. Quinlan in 1986, from which the extremely popular C4.5 algorithm was later developed. The advantages of the decision tree method are that the decision process is visible, constructing the tree does not take long, the description is simple and easy to understand, and classification is fast; the disadvantage is that it is difficult to discover rules based on combinations of multiple variables. The decision tree method is good at processing non-numeric data and is particularly suitable for large-scale data processing. Decision trees display rules of the form "under these conditions, this value will be obtained", for example when judging the risk of a loan application.
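The loan-risk example can be sketched with a small decision tree, assuming scikit-learn is installed; the applicant features, labels and thresholds are hypothetical.

    # Small decision tree for a toy loan-risk judgment; the learned rules are printable.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # features: [annual_income (k), years_employed], label: 1 = risky, 0 = safe
    X = [[20, 1], [90, 10], [40, 2], [75, 8], [25, 0], [60, 6]]
    y = [1, 0, 1, 0, 1, 0]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # The rules the tree learned are visible as a flow of if/else conditions
    print(export_text(tree, feature_names=["annual_income", "years_employed"]))
    print(tree.predict([[30, 1]]))   # classify a new applicant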

③ Genetic algorithm

The genetic algorithm simulates the reproduction, mating and mutation that occur in natural selection and inheritance. It is a machine learning method based on evolutionary theory that uses operations such as genetic recombination, crossover, mutation and natural selection to generate rules. Its basic viewpoint is the principle of "survival of the fittest", and it has properties such as implicit parallelism and ease of combination with other models. Its main advantage is that it can handle many data types and process various data in parallel; its disadvantages are that it requires many parameters, is difficult to encode, and generally demands a large amount of computation. Genetic algorithms are often used to optimize neural networks and can solve problems that are difficult to solve with other techniques.
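A toy genetic algorithm in plain Python is sketched below on the classic OneMax problem (evolving a bit string with as many 1s as possible); the population size, mutation rate and other parameters are illustrative only.

    # Toy genetic algorithm: selection, single-point crossover, mutation.
    import random

    random.seed(0)
    LENGTH, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 40, 0.02

    def fitness(bits):
        return sum(bits)                       # more 1s = fitter

    def crossover(a, b):
        cut = random.randint(1, LENGTH - 1)    # single-point crossover
        return a[:cut] + b[cut:]

    def mutate(bits):
        return [1 - bit if random.random() < MUTATION_RATE else bit for bit in bits]

    population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        population.sort(key=fitness, reverse=True)   # natural selection
        parents = population[:POP_SIZE // 2]         # keep the fitter half
        population = [mutate(crossover(*random.sample(parents, 2)))
                      for _ in range(POP_SIZE)]

    print(max(fitness(ind) for ind in population))   # close to LENGTH after evolution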

④ Rough set method

The rough set method, also known as rough set theory, was proposed by the Polish mathematician Z. Pawlak in the early 1980s. It is a mathematical tool for dealing with vague, imprecise and incomplete problems, and can handle issues such as data reduction, discovery of data correlations and evaluation of the significance of data. Its advantages are that the algorithm is simple, no prior knowledge about the data is required during processing, and the inherent laws of the problem can be found automatically; its disadvantage is that it is difficult to process continuous attributes directly, which must first be discretized. The discretization of continuous attributes is therefore a difficulty that restricts the practical application of rough set theory. Rough set theory is mainly used in approximate reasoning, digital logic analysis and simplification, and building predictive models.

⑤ Fuzzy set method

The fuzzy set method uses fuzzy set theory to carry out fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition and fuzzy clustering analysis on problems. Fuzzy set theory uses membership degrees to describe the attributes of fuzzy things. The higher the complexity of the system, the stronger the ambiguity.
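The idea of membership degrees can be illustrated with a simple piecewise-linear membership function; the fuzzy set "hot" and the temperature breakpoints below are hypothetical.

    # Membership degree: each value belongs to the fuzzy set "hot" to a degree in [0, 1].
    def membership_hot(temp_c, low=25.0, high=35.0):
        """Piecewise-linear membership function for the fuzzy set 'hot'."""
        if temp_c <= low:
            return 0.0
        if temp_c >= high:
            return 1.0
        return (temp_c - low) / (high - low)   # partial membership in between

    for t in (20, 28, 32, 40):
        print(t, "->", round(membership_hot(t), 2))
    # 20 -> 0.0, 28 -> 0.3, 32 -> 0.7, 40 -> 1.0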

⑥ Association rules method

Association rules reflect the interdependence or correlation between things. The most famous algorithm is the Apriori algorithm proposed by R. Agrawal et al. Its idea is to first find all frequent itemsets whose support is at least equal to a predefined minimum support, and then generate strong association rules from those frequent itemsets. Minimum support and minimum confidence are the two thresholds given for discovering meaningful association rules; in this sense, the purpose of data mining is to mine from the source database the association rules that satisfy the minimum support and minimum confidence.
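The support and confidence thresholds can be illustrated with a brute-force sketch in plain Python (not a full Apriori implementation); the shopping-basket transactions and threshold values are made up.

    # Brute-force illustration of support and confidence for pairwise rules.
    from itertools import combinations

    transactions = [
        {"bread", "milk"},
        {"bread", "diapers", "beer"},
        {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers", "beer"},
        {"bread", "milk", "diapers"},
    ]
    MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    items = sorted({i for t in transactions for i in t})
    frequent_pairs = [set(p) for p in combinations(items, 2) if support(set(p)) >= MIN_SUPPORT]

    for pair in frequent_pairs:
        for antecedent in pair:
            consequent = (pair - {antecedent}).pop()
            confidence = support(pair) / support({antecedent})
            if confidence >= MIN_CONFIDENCE:
                print(f"{antecedent} -> {consequent} "
                      f"(support={support(pair):.2f}, confidence={confidence:.2f})")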

5) Problems with data mining

There are also privacy concerns associated with data mining, such as an employer being able to access medical records to screen out those with diabetes or serious underlying diseases in an effort to cut insurance payouts. However, this practice can lead to ethical and legal issues. For the mining of government and commercial data, issues such as national security or commercial secrets may be involved. This is also a big challenge for confidentiality.

There are many legitimate uses of data mining, for example looking for the relationship between a drug and its side effects in a database of patient records. Such a relationship may appear in only 1 case in 1,000 people, but pharmaceutical projects can use this method to reduce the number of patients with adverse drug reactions and potentially save lives; however, there is also the risk that such databases may be abused.

Data mining makes it possible to discover information in ways that would otherwise be impossible, but it must be regulated and used under appropriate safeguards. If data is collected from specific individuals, a number of confidentiality, legal and ethical issues arise.

3. Visualization technology

Visualization is a kind of mapping that maps information from the objective world into a visual pattern that is easy for humans to perceive. Data visualization maps data into visual patterns, explores and explains the information hidden behind the data, and seeks beauty while ensuring the transmission of information. Telling stories with data is both a science and an art!

Depending on the goals, data visualization can be divided into exploratory analysis and explanatory analysis:

① Exploratory analysis: Explore and understand data, and find information that is uncertain in advance but worth paying attention to or sharing.

② Explanatory analysis: Explain the identified issues to the audience, and communicate and display them in a targeted manner.

Data visualization tools include:

Integrated software tools: Excel, Google Sheet, Tableau, etc.

Programming language tools: R, Python third-party visualization libraries, D3.js, etc. (a small matplotlib sketch follows this list)

Specific data tools: Gephi, QGIS, Gapminder, etc.

Post-production tools: Adobe Illustrator, Inkscape, the infographics tool Piktochart, etc.
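As a small illustration of the programming-language route, here is a minimal explanatory chart built with matplotlib, one of the Python third-party visualization libraries mentioned above; the sales figures are made up.

    # Minimal explanatory-analysis chart with matplotlib (illustrative data).
    import matplotlib.pyplot as plt

    quarters = ["Q1", "Q2", "Q3", "Q4"]
    sales = [120, 135, 160, 190]

    plt.bar(quarters, sales, color="steelblue")
    plt.title("Quarterly sales (illustrative data)")
    plt.ylabel("Sales (10k CNY)")
    plt.tight_layout()
    plt.savefig("quarterly_sales.png")   # or plt.show() in an interactive session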

5.3 Main application areas of big data

5.3.1   Typical applications of big data

In the past few decades, big data has affected everyone. Big data is reshaping many major industries, including the retail industry, the financial industry and the medical industry, and it is also completely changing our lives.

1. Smart city

Today, more than half of the world's population lives in cities, and this number will grow to 75% by 2050. The government needs to use some technical means to manage the city well so that the resources in the city can be well allocated. As one of the technologies, big data can effectively help the government realize scientific allocation of resources, refine city operations, and build smart cities.

2. Financial industry

Big data has a wide range of applications in the financial industry. Many financial institutions have established big data platforms to collect and process financial transaction data. The application of big data in the financial industry focuses on five areas: precision marketing, risk management and control, decision support, efficiency improvement, and financial product design.

3. Medical industry

The medical industry has a large number of cases, pathology reports, medical plans, drug reports, etc. If this data is organized and analyzed, it will greatly help doctors and patients. In the future, with the help of big data platforms, we can collect basic characteristics, cases and treatment plans of diseases, establish disease-specific databases, and help doctors diagnose diseases.

4. Agriculture and animal husbandry

Agricultural products are not easy to preserve, so it is very important for farmers to plant and breed rationally. With the help of the consumption power and trend reports provided by big data, the government can give reasonable guidance for agricultural and animal-husbandry production, so that production follows demand and the unnecessary waste of resources and social wealth caused by overcapacity is avoided.

Big data technology can help the government achieve refined management of agriculture and make scientific decisions. Driven by data and combined with drone technology, farmers can collect information on crop growth, diseases and pests.

5. Retail industry

The retail industry can understand customers' purchasing preferences for related products through customer purchase records, and put related products together to increase product sales. The retail industry can also record customer purchasing habits and, for essential daily necessities, remind customers to purchase them through precise advertising before they are about to run out. Or make regular deliveries through the online mall, which not only helps customers solve problems but also improves customer experience. Using big data technology, the retail industry will increase sales by at least 30% and improve customer purchasing experience.

6. Big data technology industry

Since the arrival of the mobile Internet, unstructured and structured data have grown exponentially; human society now generates more data every two years than in all of previous human history. This big data provides huge business opportunities for the big data technology industry. It is estimated that the business opportunities generated worldwide by big data collection, storage, processing, cleansing and analysis will exceed 200 billion US dollars, including investments by governments and enterprises in big data computing and storage, data mining and processing, and so on. In the future, China's big data industry will grow exponentially and, within 5 years, form a market worth trillions.

7. Logistics industry

With the help of big data, the logistics industry can establish a national logistics network, understand the freight demand and capacity at each node, allocate resources rationally, reduce the rate at which trucks return empty, lower the overload rate, and cut down on repeated route transportation and the share of small-scale transportation. Through big data technology we can keep track of the cargo transportation needs on each route and, at the same time, establish logistics hubs based on geographical location and the industrial chain, so that cargo is matched to transport capacity and the transportation efficiency of the logistics industry improves.

Optimizing resource allocation in the logistics industry with the help of big data technology can increase the revenue of the logistics industry by at least 10%, and its market value will be around 500 billion.

8. Real estate industry

With the help of big data, the real estate industry can understand important information such as the number of permanent population, number of floating population, consumption capacity, consumption characteristics, age stage, and demographic characteristics within the area where the land is developed. This information will help real estate developers conduct scientific planning in commercial real estate development, merchant investment, housing types, and community scale. Using big data technology, the real estate industry will reduce planning risks before real estate development, rationally determine housing prices, rationally determine development scale, and rationally conduct business planning. Some real estate companies have already applied big data technology to user profiling, land planning, commercial real estate development and other fields, and have achieved good results.

9. Manufacturing

The manufacturing industry has faced overproduction pressure in the past. Many products, including home appliances, textile products, steel, cement, electrolytic aluminum, etc., were not produced in accordance with actual market needs, resulting in a huge waste of resources. Using e-commerce data, mobile Internet data, and retail data, we can understand future product market demands, rationally plan product production, and avoid overproduction. Big data technology can also understand customer needs based on social data and purchase data, helping manufacturers develop products, design and produce products that meet customer needs.

10. Internet advertising industry

Big data technology can record customer behavior on the Internet, analyze customer behavior, label it, and create user portraits. Precision marketing using mobile Internet big data technology will increase customer conversion rates by more than ten times. Programmatic buying in the advertising industry is gradually replacing broadcast advertising. Big data technology will help advertisers and advertising companies deliver ads directly to target users, which will reduce advertising investment and increase advertising conversion rates.

5.3.2   Future Development Trend of Big Data

In the process of big data development, the following trends have gradually emerged:

1. Resource utilization of data

Treating data as a resource means that big data has become an important strategic resource that enterprises pay attention to and a new focus of competition. Companies must therefore formulate big data marketing strategies in advance in order to seize market opportunities.

2. In-depth integration with cloud computing

Big data is inseparable from cloud processing. Cloud processing provides elastic and scalable infrastructure for big data and is one of the platforms for generating big data. Since 2013, big data technology has begun to be closely integrated with cloud computing technology, and it is expected that the relationship between the two will become even closer in the future. In addition, emerging computing forms such as the Internet of Things and mobile Internet will also help the big data revolution, allowing big data marketing to exert a greater influence.

3. Breakthroughs in scientific theories

With the rapid development of big data, just like computers and the Internet, big data is likely to be a new round of technological revolution. The subsequent rise of data mining, machine learning, artificial intelligence and other related technologies may change many algorithms and basic theories in the data world and achieve scientific and technological breakthroughs.

4. The establishment of data science and data alliances

In the future, data science will become a specialized discipline and be recognized by more and more people. Major universities will set up special data science majors, which will also create a number of new jobs related to it. At the same time, based on the basic platform of data, a cross-domain data sharing platform will also be established. Later, data sharing will be extended to the enterprise level and become a core part of the future industry.

5. Data breaches are rampant

The growth rate of data breaches may reach 100% in the next few years unless the data is secured at its source. It can be said that in the future, every Fortune 500 company will face data attacks, regardless of whether they have taken security precautions or not. And all businesses, regardless of size, need to reexamine today’s definition of security. Among Fortune 500 companies, more than 50% will have the position of chief information security officer. Enterprises need to secure their own and customer data from a new perspective. All data needs to be secure at the beginning of creation, not at the last stage of data storage. Merely strengthening the security measures of the latter has proven to be ineffective.

6. Data management becomes a core competitiveness

Data management has become a core competitiveness that directly affects financial performance. When the idea that "data assets are the core assets of the enterprise" takes root, enterprises will define data management more clearly, regard it as a core competence, and continuously develop, strategically plan and apply data assets, making this the core of enterprise data management. The efficiency of data asset management is significantly and positively correlated with growth in main business revenue and in sales revenue; in addition, for companies with Internet thinking, data asset competitiveness accounts for 36.8%, and how well data assets are managed directly affects a company's financial performance.

7. Data quality is the key to the success of BI (Business Intelligence)

Companies that adopt self-service business intelligence tools for big data processing will stand out. One of the challenges is that many data sources bring a lot of low-quality data. To be successful, companies need to understand the gap between raw data and data analysis to eliminate low-quality data and achieve better decisions through BI.

8. The data ecosystem becomes more complex

The world of big data is not a single, huge computer network but an ecosystem made up of a large number of active components and participants: terminal equipment providers, infrastructure providers, network service providers, network access service providers, data service enablers, data service providers, touchpoint services, data service retailers, and so on. The basic prototype of this data ecosystem has already formed; its further development will tend toward a finer division of roles within the system (market segmentation), adjustment of system mechanisms (business model innovation), and improvement of system structure (adjustment of the competitive environment), so the complexity of the data ecosystem will gradually increase.
