【AI Bottom Logic】——Chapter 4: Big Data Processing and Mining

Table of contents

Introduction

1. Overview of big data

2. Process and method of data processing

1. Data collection - "from scratch"

2. Data processing - "From what is available to what can be used"

3. Data Analysis

3. What has big data changed?

 Past highlights:


Introduction

The performance of AI relies on big data. At one time, image recognition accuracy could only reach 60% to 70%. This was partly limited by machine learning algorithms and computer hardware, but the biggest constraint was the lack of data. In 2009, Stanford University professor Fei-Fei Li and Princeton University professor Kai Li launched a project to collect 50 million high-definition images labeled with more than 80,000 words, and organized the ImageNet image recognition competition to promote the development of computer vision. Research groups then proposed deep learning models trained on this big data, which further improved the accuracy of image recognition.

Big data is not only used to describe customer behavior and business rules; it is also the basic raw material for training AI models. However, AI has strict requirements on data, and not all data is suitable: the data must be complete, large in volume, carry business meaning, and have feature labels. Some data must first be cleaned and processed, analyzed and mined.

1. Overview of big data

The concept of "big data" was proposed as early as the 20th century. McKinsey & Company defined it as "a collection of data whose scale is so large that it greatly exceeds the capabilities of traditional database software tools in terms of acquisition, storage, management, and analysis." Today, big data has different meanings in different contexts: it can refer to complex and large data collections, to a family of massive data processing technologies, or to a data-driven business model.

The "big" of big data is relative, and there is no exact definition. Big data does not only refer to the size of data capacity, but also depends on the difficulty of processing these data according to specific needs. Big data not only refers to a large amount of data, but also depends on the characteristics of rich data types, fast processing speed, and low value density. "Big" also brings some problems - there are few truly valuable data in big data, and this phenomenon is called value depression. The larger the volume of data, the more difficult it is to mine valid data, the more errors there may be in the data, and the greater the technical difficulty faced.

2. Process and method of data processing

There are two basic ways of using data: ① Result-oriented: directly analyze and process the data, find associations within it, and mine valuable information. ② Process-oriented: use the data to build AI models through machine learning. In practice, the two are often combined.

The following mainly introduces the first approach; the second will be discussed in the subsequent chapters on machine learning algorithms:

1. Data collection - "from scratch"

This step is the most difficult and the most important. Many people mistakenly think the key to AI is the algorithm, but that is not the case. Most AI algorithms are already relatively mature; a great deal of research focuses on improving and optimizing them, and their underlying logic is not essentially different from that of more than a decade ago. Data collection is different - it is the premise and the key. As the saying goes: "Data determines the upper limit of machine learning; the algorithm only approaches that upper limit as closely as possible!"

Data collection channels: ① Primary data: original data from direct investigation - the source of the data, the newest and most valuable. ② Secondary data: data collected by others, or data released after reorganizing and summarizing the original data, which may contain errors.

Data collection is crucial not only for scientific research but also for the development of AI. In many fields, researchers open-source their algorithms but seldom disclose their data. As Google's Peter Norvig put it when commenting on Google's products: "We don't have better algorithms, we just have more data."

2. Data processing - "From what is available to what can be used"

a. ETL

Data processing is divided into three steps: extraction, transformation, and loading, abbreviated ETL. The purpose is to integrate scattered, messy, non-uniform data and provide data support for analysis and decision-making.
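To make the three stages concrete, here is a minimal ETL sketch in Python with pandas, assuming a hypothetical CSV export named orders.csv as the source and a local SQLite file standing in for the data warehouse:

import pandas as pd
import sqlite3

def extract(path: str) -> pd.DataFrame:
    # Extract: read raw records from one of possibly many source systems
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: unify formats, then aggregate to the granularity analysts need
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    return df.groupby(["order_date", "product_id"], as_index=False)["amount"].sum()

def load(df: pd.DataFrame, db_path: str = "warehouse.db") -> None:
    # Load: write the cleaned result into the central store for later queries
    with sqlite3.connect(db_path) as conn:
        df.to_sql("daily_sales", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract("orders.csv")))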

Data extraction: the difficulty lies in the variety of data sources and the fact that data is stored in different places, possibly involving different database products and different data types and formats, so different extraction methods must be chosen.

Data transformation: data is aggregated, counted, and summarized according to specific needs. This is the most time-consuming link in data processing, accounting for 60% to 70% of the total workload: converting character variables into numeric variables, handling missing values, handling abnormal data, eliminating duplicate records, checking data consistency, and so on. The process is complicated because data quality, types, and storage formats all differ. In reality, most data suffers from inconsistent definitions, incompleteness, and messy formats - it is "dirty data" that needs to be cleaned, such as ovarian cancer appearing in the case records of male patients!
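The cleaning work described above can be illustrated with a short pandas sketch; the column names (age, gender, diagnosis) and the thresholds are illustrative assumptions, not a fixed recipe:

import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                                # eliminate duplicate records
    df["age"] = pd.to_numeric(df["age"], errors="coerce")    # character variable -> numeric
    df["age"] = df["age"].fillna(df["age"].median())         # fill missing values
    df = df[(df["age"] >= 0) & (df["age"] <= 120)]           # drop abnormal values
    # consistency check: "ovarian cancer" should never appear in a male patient's record
    inconsistent = (df["gender"] == "male") & (df["diagnosis"] == "ovarian cancer")
    return df[~inconsistent]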

Data loading: once transformation is completed, the data is loaded and finally written to a data warehouse, where it is stored centrally so that different types of data can be correlated and analyzed, and batch queries and calculations can be run on them.

Different scenarios have different requirements for data processing; there are offline and real-time approaches. Offline processing: low real-time requirements, large processing volume (total data volume), requiring more storage resources. Real-time processing: high real-time requirements, fast processing speed (data volume per unit time), requiring more computing resources.

The data processing process is the basic work that makes data useful. There are many ETL tools on the market, and they handle a single data processing task well. However, an enterprise usually has hundreds or thousands of such tasks, and making sure all of them run correctly is still a huge challenge!

b. One-hot encoding and feature engineering

For example, consider three people A, B, and C. A: 32 years old, male, programmer; B: 28 years old, female, teacher; C: 38 years old, male, doctor. To describe them in numbers a computer can work with: age is already a number and stays as it is; gender is encoded as 0 for female and 1 for male; occupation type is represented as a vector. Suppose there are 30,000 occupations in the world, with programmer numbered 1, teacher 2, and doctor 3; as 30,000-dimensional one-hot vectors they become [1,0,0,0,...,0], [0,1,0,0,...,0], and [0,0,1,0,...,0]. The three people A, B, and C can then each be represented by a 30,002-dimensional vector: [32,1,1,0,0,0,...,0], [28,0,0,1,0,0,...,0], [38,1,0,0,1,0,...,0] - somewhat like the earlier example of mice testing poison.
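A small sketch of this encoding is shown below; to keep the vectors printable, the 30,000-occupation vocabulary is shrunk to just the three occupations that actually appear:

import numpy as np

occupations = {"programmer": 0, "teacher": 1, "doctor": 2}   # index into the one-hot part
people = [(32, "male", "programmer"), (28, "female", "teacher"), (38, "male", "doctor")]

def encode(age, gender, job):
    one_hot = np.zeros(len(occupations))
    one_hot[occupations[job]] = 1.0
    gender_code = 1.0 if gender == "male" else 0.0           # female -> 0, male -> 1
    return np.concatenate(([age, gender_code], one_hot))

for person in people:
    print(encode(*person))   # person A becomes [32. 1. 1. 0. 0.]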

However, real data comes in many types, and machine learning has to deal with massive data in massive dimensions, which requires a lot of storage and computing resources. This "curse of dimensionality" is a factor we must consider when selecting algorithms and models - in short, some features need to be converted and encoded, some need further dimensionality reduction, and some may be unnecessary (they can be eliminated or merged).

Data preprocessing is required before applying machine learning algorithms, and an important step is feature engineering. Feature engineering characterizes physical objects: it is the process of turning raw data into model training data, removing duplicates, filling gaps, and correcting outliers, and it requires finding representative data dimensions and the key features that describe the problem. For example, when describing a car, "shape" is representative while "color" is not.

Feature selection is a complex combinatorial optimization problem. Too many features bring the curse of dimensionality, while too few make the model perform poorly. The purpose of feature engineering is to obtain good data; if this step is done well, even a simple algorithm can achieve good results.
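As a rough illustration of such pruning, the sketch below drops near-constant columns and one column of each highly correlated pair; the thresholds are arbitrary illustrative choices, and real feature selection is usually guided by the model and the business problem:

import pandas as pd

def prune_features(df: pd.DataFrame, var_min=1e-3, corr_max=0.95) -> pd.DataFrame:
    df = df.loc[:, df.var() > var_min]        # near-constant features carry little signal
    corr = df.corr().abs()
    cols = corr.columns
    drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > corr_max:    # keep only one of each redundant pair
                drop.add(cols[j])
    return df.drop(columns=list(drop))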

3. Data Analysis

Terms such as data analysis, data science, data mining, and knowledge discovery are sometimes used interchangeably without clear definitions. The purpose of data analysis is to support decision-making. Common scenarios include: ① the question is known but the answer is not, e.g. what are this month's sales, and which product sells best; ② both the question and the answer are unknown, e.g. supermarket staff do not know whether there is a better way to arrange goods on the shelves and can only look for patterns in customers' shopping data - in this case there is no guarantee an answer will be found, and it may not even be clear what data is needed. ① is explaining with data; ② is exploring the data!

The following is a brief introduction to some common algorithms for data analysis:

a: Association analysis algorithm

Many apps recommend products as a "best combination" so that consumers see items they are interested in. There is an efficient algorithm for this kind of problem - the Apriori algorithm (a priori algorithm). It is a classic association rule mining algorithm used to find sets of items that often appear together - frequent itemsets.

The Apriori algorithm uses two concepts: support and confidence. Support is the proportion of a product (or set of products) appearing in the whole data set; for example, if product A appears in 30 of 100 purchase records, its support is 30%. Confidence is the probability of purchasing another product given that a certain product was purchased; if 15 of the 30 people who bought product A also bought product B, then 15/30 = 50% is the confidence of the rule A → B.
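In code, the arithmetic of this example is simply:

# 100 purchase records, 30 contain product A, 15 contain both A and B
n_records, n_a, n_a_and_b = 100, 30, 15
support_a = n_a / n_records            # 0.30 -> support of A
confidence_a_b = n_a_and_b / n_a       # 0.50 -> confidence of the rule A -> B
print(support_a, confidence_a_b)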

Both support and confidence are important metrics. A store operator can first use support to filter out products with little purchase volume and save effort; confidence then describes the association rule between two products - it equals a conditional probability, and the higher it is, the stronger the association, which helps find strongly associated product combinations.

The Apriori algorithm relies on an a priori principle when computing association rules: if a set is frequent (appears often), then all of its subsets are also frequent. The principle is intuitive, but read in reverse it says something more useful: if a set is not frequent, then none of its supersets are frequent either. That is, if {A} is not frequent, then every set containing A, such as {A, B}, is also infrequent. This conclusion greatly simplifies the calculation:

For example, suppose we have a batch of customers' shopping lists. The Apriori algorithm proceeds as follows:
Step 1: Set thresholds for support and confidence.
Step 2: Compute the support of each item and remove items below the support threshold.
Step 3: Combine the remaining items (or itemsets) in pairs, compute the support of each combination, and remove combinations below the support threshold.
Step 4: Repeat the previous step until all infrequent itemsets have been removed; the remaining frequent itemsets are the combinations of goods that often appear together.
Step 5: Build all association rules from the frequent itemsets and compute their confidence.
Step 6: Remove all rules below the confidence threshold to obtain the strong association rules; the corresponding sets are the highly associated product combinations we are looking for.
Step 7: Analyze the practical business meaning of the resulting product sets.
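A compact sketch of this level-wise procedure is given below, assuming each transaction is a list of item names; it is the plain "count, prune by support, combine, prune by confidence" loop rather than an optimized Apriori implementation:

from itertools import combinations

def apriori(transactions, min_support=0.3, min_confidence=0.6):
    n = len(transactions)
    baskets = [set(t) for t in transactions]

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Steps 1-2: frequent single items
    items = {i for b in baskets for i in b}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

    # Steps 3-4: grow itemsets level by level, pruning by support each time
    while frequent[-1]:
        prev = frequent[-1]
        size = len(next(iter(prev))) + 1
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        frequent.append({c for c in candidates if support(c) >= min_support})

    # Steps 5-6: build rules from frequent itemsets and prune by confidence
    rules = []
    for level in frequent[1:]:
        for itemset in level:
            for k in range(1, len(itemset)):
                for lhs in map(frozenset, combinations(itemset, k)):
                    conf = support(itemset) / support(lhs)
                    if conf >= min_confidence:
                        rules.append((set(lhs), set(itemset - lhs), round(conf, 2)))
    return rules

# Example: each inner list is one customer's shopping basket
print(apriori([["milk", "bread"], ["milk", "bread", "beer"],
               ["milk", "eggs"], ["bread", "eggs"]]))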

As the steps above show, the essence of the Apriori algorithm is counting: it repeatedly checks which combinations frequently appear together and picks them out. Using the two thresholds of support and confidence, it screens the original data set layer by layer, eliminating unqualified combinations each round until the best combinations remain.

b: User portrait and product recommendation

In addition to association analysis, another common application of data analysis is building user portraits. A user portrait is the overall business picture of a user that an enterprise abstracts from data; it depicts consumers' social attributes, consumption habits, and consumption behaviors, and provides a basis for product design and advertising push. For example, Douyin uses data such as likes and favorites to profile users and push content they are interested in.

c: Advertising psychology and A/B testing

When you take the coupons merchants give you and place various orders or combine orders to get certain discounts, you end up spending more money and buying many unnecessary items. Behind this, businesses use big data analysis, advertising psychology, behavioral economics and other means to steer users toward certain decisions and behaviors.

Psychological anchoring: when people estimate an unknown price, the initial value (the anchor) serves as a benchmark and starting point in their minds. For example, in a recommended flight list not every flight is the cheapest, and one may be noticeably more expensive than the others - its role is not to be selected but to make the other fares look like bargains. Likewise, a watch worth a million placed at the entrance of a famous watch store: you do not buy it, but it anchors in your mind, and your expected spending becomes higher than it was before (within the range below a million).

The online store's algorithm keeps running trial and error to find the best recommendation. "Continuous trial and error" is common in Internet product development. For example, when a product faces multiple options, A/B testing can be used to choose: let some users use solution A and other users use solution B. In practice, companies rarely test just two versions - for the title of an advertisement alone, its font, weight, size, color, background, tone, sentence pattern, layout, and so on admit countless variations.
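As a minimal sketch of how one variant might be compared against another, the following evaluates a hypothetical A/B test on click-through rate with a two-proportion z-test (the conversion counts are made-up numbers):

from math import sqrt
from statistics import NormalDist

def ab_test(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))   # standard error under H0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))             # two-sided p-value
    return p_a, p_b, z, p_value

# 5,000 users saw ad title A and 5,000 saw title B
print(ab_test(conv_a=400, n_a=5000, conv_b=460, n_b=5000))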

Expansion: humans are visual animals and are most sensitive to image information - the visual response area accounts for 40% of the cerebral cortex. Data visualization design should balance the amount of information against readability, so that it is trustworthy (truthful), expressive (clear), and elegant (simple and beautiful).


3. What has big data changed?

It has changed people's living habits; experience, time, and memory are all being redefined in the era of big data!

Big data is changing the way humans discover and solve problems. In the past, massive data could only be handled by sampling; in the big data era, the full data set can be analyzed directly, yielding patterns and conclusions that traditional methods cannot obtain.

The way people think about problems is shifting from expert experience to data-driven. AlphaGo needed hundreds of millions of game records, smart cars need huge amounts of real-world road condition data collected while driving, and face recognition likewise needs huge numbers of face images!

"Knowing where the data is is more valuable than knowing the data itself!"

For example, knowing how to compute pi is obviously more useful than memorizing its digits! Replacing memorization with understanding is another change big data brings us!

Conclusion: Massive, rich and high-quality data is the foundation of AI, which helps AI to continuously learn by itself and improve performance! It can be said that big data endows AI with "intelligence", and the process of enabling machines to realize "intelligent" learning must rely on powerful machine learning algorithms ! Stay tuned for the next chapters...


 Past highlights:

【AI Bottom Logic】—Chapter 3 (Part 2): Information Exchange & Information Encryption & Decryption & Noise in Information

【AI Bottom Logic】—Chapter 3 (Part 1): Data, Information and Knowledge & Shannon Information Theory & Information Entropy

[Machine Learning]—Continued: Convolutional Neural Network (CNN) and Parameter Training

[AI Bottom Logic]——Chapter 1&2: Statistics and Probability Theory & Data "Trap"
