Data Mining: Concepts and Techniques, Third Edition: Answers to the Chapter 1 Exercises

Chapter 1 answers

These answers are reproduced from the original post by [Ma_Jack](https://blog.csdn.net/u013272948/article/details/71024949).

1.1 What is data mining? In your answer, address the following:
(a) Is it just another hype?
(b) Is it a simple transformation or application of technology developed from databases, statistics, machine learning, and pattern recognition?
(c) We have presented a view that data mining is the result of the evolution of database technology. Do you think that data mining is also the result of the evolution of machine learning research? Can you present such a view based on the historical progress of this discipline? Address the same for the fields of statistics and pattern recognition.
(d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.

  • Data mining is not just hype. It is an application-driven field that draws on techniques from many areas, including statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, visualization, algorithms, and high-performance computing. It is the process of discovering interesting patterns and knowledge from large amounts of data. Data sources include databases, data warehouses, the Web, other information repositories, and data streamed into the system dynamically. When viewed as a knowledge discovery process, the basic steps are: (1) data cleaning: remove noise and inconsistent data; (2) data integration: combine multiple data sources; (3) data selection: retrieve the data relevant to the analysis task from the database; (4) data transformation: transform and consolidate the data into forms suitable for mining, for example through summary or aggregation operations; (5) data mining: apply intelligent methods to extract data patterns; (6) pattern evaluation: identify the truly interesting patterns that represent knowledge, based on interestingness measures; (7) knowledge presentation: use visualization and knowledge representation techniques to present the mined knowledge to users.
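The seven steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two stores, their records, the function names, and the 0.5 support threshold are all hypothetical, and counting co-occurring item pairs stands in for a real mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"id": 1, "items": ["milk", "bread"]}, {"id": 2, "items": None}]
store_b = [
    {"id": 3, "items": ["milk", "bread", "eggs"]},
    {"id": 4, "items": ["Milk ", "eggs"]},
]

def clean(records):
    """(1) Data cleaning: drop records whose item list is missing."""
    return [r for r in records if r["items"]]

def integrate(*sources):
    """(2) Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """(3) Data selection: keep only the field relevant to the task."""
    return [r["items"] for r in records]

def transform(baskets):
    """(4) Data transformation: normalize item names into sorted tuples."""
    return [tuple(sorted({item.strip().lower() for item in b})) for b in baskets]

def mine(baskets):
    """(5) Data mining: count how often each pair of items co-occurs."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts

def evaluate(counts, n, min_support=0.5):
    """(6) Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

def present(patterns):
    """(7) Knowledge presentation: render the patterns as readable lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), len(baskets))
for line in present(patterns):
    print(line)
```

Each function maps to one numbered step, so a step can be replaced (for example, a different mining method in step 5) without touching the others.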

1.2 How is a data warehouse different from a database? How are they similar?

  • A data warehouse is a repository of information collected from multiple heterogeneous data sources, stored under a unified schema at a single site, and organized to support management decision making. Data warehouse technology includes data cleaning, data integration, and online analytical processing (OLAP). A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access that data. Similarities: both rely on database software and a data model to organize and manage data.

1.4 Present an example where data mining is crucial to the success of a business. What data mining functionalities does this business need? Can such patterns be generated instead by data query processing or simple statistical analysis?

  • First, recall the kinds of patterns that can be mined: characterization and discrimination, frequent patterns, classification and regression, clustering, and outlier analysis. Take an airline that wants to improve the customer experience by making boarding as efficient as possible and reducing boarding time. This calls for regression analysis: for example, fitting a regression model to the boarding data of recent months to determine how passenger traffic is distributed over boarding times, so that future passenger flow can be predicted and boarding arrangements improved in advance. In this case, simple queries and summary statistics cannot meet the airline's needs.
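As a rough illustration of the regression idea in the airline example, the sketch below fits a straight line by ordinary least squares. The observations are made up, and predicting boarding time from passenger count is an assumption chosen for simplicity; a real analysis would use the airline's own records and probably more variables.

```python
# Hypothetical observations: passengers on a flight and boarding time in minutes.
passengers = [80, 100, 120, 140, 160]
boarding_minutes = [17.5, 22.5, 25.5, 30.5, 34.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(passengers, boarding_minutes)

# Predict the boarding time for a flight that has not departed yet.
predicted = slope * 180 + intercept
print(f"predicted boarding time for 180 passengers: {predicted:.1f} min")
```

A plain query can report past boarding times, but it cannot produce the estimate for an unseen passenger count; that extrapolation is what the fitted model adds.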

1.5 Explain the differences and similarities between discrimination and classification, between characterization and clustering, and between classification and regression.

  • Discrimination and classification: data discrimination compares the general features of the target class data objects against the general features of objects from one or more contrasting classes. Classification finds a model that describes and distinguishes data classes or concepts, and the model can be used to predict the class labels of unlabeled samples.
  • Characterization and clustering: data characterization summarizes the general characteristics or features of a target class of data, so it applies when the class of interest is already known. Cluster analysis works on the data objects alone, without class labels, and groups them according to the principle of maximizing intraclass similarity and minimizing interclass similarity.
  • Classification and regression: classification is described in the first point. Regression models continuous-valued functions. It is mainly used to predict missing or hard-to-obtain numeric values rather than discrete class labels, and it also covers identifying distribution trends from the available data.
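The contrast between classification and regression can be shown with one small sketch. It uses k-nearest neighbors as a stand-in technique (not a method prescribed by the chapter), and the samples are hypothetical: the same neighbors yield a discrete label in one case and a continuous value in the other.

```python
from collections import Counter

# Hypothetical labeled samples: (feature value, class label, numeric target).
samples = [
    (1.0, "low", 10.0), (2.0, "low", 12.0), (3.0, "low", 14.0),
    (8.0, "high", 40.0), (9.0, "high", 44.0), (10.0, "high", 48.0),
]

def nearest(x, k=3):
    """Return the k samples whose feature value is closest to x."""
    return sorted(samples, key=lambda s: abs(s[0] - x))[:k]

def classify(x, k=3):
    """Classification: majority vote over the neighbors' class labels."""
    labels = [label for _, label, _ in nearest(x, k)]
    return Counter(labels).most_common(1)[0][0]

def regress(x, k=3):
    """Regression: average of the neighbors' numeric targets."""
    targets = [target for _, _, target in nearest(x, k)]
    return sum(targets) / len(targets)

print(classify(2.5))  # a discrete class label
print(regress(2.5))   # a continuous value
```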

1.6 Based on your observations, describe another possible kind of knowledge that needs to be discovered by data mining methods but is not listed in this chapter. Does it require a mining methodology quite different from those outlined in this chapter?

  • For example, text classification often requires extracting the high-frequency features of each category. A category contains many documents, and each document contains many candidate feature words, so data mining is needed to find the feature words that best represent the category. This involves feature reduction, for which methods such as the chi-square statistic can be used. This kind of method is not listed in this chapter.
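A minimal sketch of chi-square feature selection for text follows. The documents, categories, and function names are hypothetical. For a term and a category, it builds the 2x2 table of document counts (a: in category with term, b: outside category with term, c: in category without term, d: outside category without term) and computes N(ad - bc)^2 / ((a+c)(b+d)(a+b)(c+d)).

```python
# Hypothetical labeled documents: (category, set of words in the document).
DOCS = [
    ("sports", {"match", "goal", "team", "news"}),
    ("sports", {"team", "coach", "goal"}),
    ("sports", {"match", "team", "news"}),
    ("finance", {"stock", "market", "news"}),
    ("finance", {"market", "bank", "news"}),
    ("finance", {"stock", "bank"}),
]

def contingency(term, category):
    """Return the 2x2 document counts (a, b, c, d) for a term and a category."""
    a = sum(1 for cat, words in DOCS if cat == category and term in words)
    b = sum(1 for cat, words in DOCS if cat != category and term in words)
    c = sum(1 for cat, words in DOCS if cat == category and term not in words)
    d = sum(1 for cat, words in DOCS if cat != category and term not in words)
    return a, b, c, d

def chi_square(term, category):
    """Chi-square statistic of the 2x2 table; 0.0 if a margin is empty."""
    a, b, c, d = contingency(term, category)
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def top_terms(category, k=3):
    """Pick the k highest-scoring terms that are positively tied to the category.

    The statistic is large for dependence in either direction, so terms that
    are typical of the other categories are filtered out with a*d > b*c.
    """
    vocabulary = set().union(*(words for _, words in DOCS))
    positive = []
    for term in vocabulary:
        a, b, c, d = contingency(term, category)
        if a * d > b * c:
            positive.append(term)
    return sorted(positive, key=lambda t: (-chi_square(t, category), t))[:k]

print(top_terms("sports"))
print(top_terms("finance"))
```

In this toy data the word "news" appears equally often in both categories, so its score is zero and it is never selected, which is the kind of feature reduction the answer describes.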

1.7 Outliers are often discarded as noise. However, one person's garbage could be another's treasure. For example, exceptions in credit card transactions can help us detect the fraudulent use of credit cards. Using fraud detection as an example, propose two methods that can be used to detect outliers, and discuss which one is more reliable.

  • Based on what Chapter 1 covers, outliers can be detected with clustering or with classification. Start with clustering: it groups data objects that are similar to one another, while outliers tend to lie far from the resulting clusters and to be scattered. After clustering, the objects that lie relatively far from every cluster are easy to pick out as outliers. I think outliers can also be detected with classification. The data objects must first be divided into well-defined categories; we then classify the data with a suitable algorithm, for example by ranking objects on a simple similarity score. When an object's similarity falls below a certain threshold, we treat it as a candidate outlier and analyze it separately. Classification-based detection requires knowing in advance which categories the data fall into, and with a huge amount of data the features may not separate the data cleanly into categories, so data preprocessing can be cumbersome. Clustering is simpler by comparison, and visualization techniques applied after clustering can show the outliers clearly, so researchers and users can spot them easily. Of the two methods, clustering is therefore the more reliable for detecting outliers.
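The clustering approach can be sketched as follows. This is a simplified illustration, not a specific algorithm from the book: points are grouped by a distance threshold (single-linkage style), and any group smaller than a minimum size is treated as a set of outliers. The transactions, the threshold `eps`, and `min_size` are hypothetical, and mixing amount and hour in one distance is a simplification.

```python
import math

# Hypothetical card transactions as (amount, hour of day); values are made up.
transactions = [
    (20, 9), (22, 10), (25, 11), (21, 12),
    (200, 20), (205, 21), (198, 19),
    (950, 3),
]

def group_by_distance(points, eps):
    """Group points so that each point is within eps of another in its group."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        cluster, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited if math.dist(points[i], points[j]) <= eps]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return clusters

def find_outliers(points, eps=10.0, min_size=2):
    """Flag points that end up in groups smaller than min_size."""
    clusters = group_by_distance(points, eps)
    return [points[i] for c in clusters if len(c) < min_size for i in c]

clusters = group_by_distance(transactions, eps=10.0)
outliers = find_outliers(transactions)
print(clusters)
print(outliers)
```

No class labels are needed here, which matches the argument above that clustering requires less preparation than classification; the flagged transaction would still need a separate review before being called fraud.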

1.8 Describe three challenges to data mining regarding data mining methodology and user interaction issues.

  • The challenges in data mining concern deeper and harder problems, as seen in three areas: traffic congestion, environmental degradation, and rising energy consumption. Take traffic congestion first. Every car carries sensors, and every vehicle can be located by GPS or the BeiDou navigation system. Solving congestion therefore involves fusing information from multiple data sources and combining it with suitable data mining algorithms. When congestion is addressed in real time, showing the changing congestion to drivers is a data visualization problem, and how to present live traffic conditions and suggested solutions to drivers is itself a major challenge.
  • Environmental degradation and rising energy consumption: these two problems are equally prominent in real life. Taken one at a time, the available information is fairly rich, such as meteorological measurements, environmental indicators, and fuel sales figures. How to integrate these data effectively and arrive at effective solutions, or build a good mathematical model, is a major challenge facing many researchers.

1.9 What are the major challenges of mining a huge amount of data (e.g., billions of tuples) in comparison with mining a small amount of data (e.g., a data set of a few hundred tuples)?

  • How to protect people's privacy when mining massive data;
  • How to ensure the security of the data, since massive data is typically stored in the cloud;
  • How to mine interesting patterns from massive data quickly;
  • How to present the interesting and valuable patterns visually once they have been mined from massive data.

1.10 Outline the major research challenges of data mining in one specific application domain, such as stream/sensor data analysis, spatiotemporal data analysis, or bioinformatics.

  • These fields share one characteristic: the data may come from multiple sources. When it does, how to integrate the sources is the first major challenge. Second, data preprocessing becomes more difficult, because data from different sources may influence one another. Finally, mining such complex objects is a major challenge facing data mining researchers.

Origin blog.csdn.net/qq_39621784/article/details/104055139