Big data mining knowledge points for programmers starting from zero, all in one place

Here are some key points of big data mining knowledge; let's go through them together today.

1. Data, information, and knowledge are different manifestations of generalized data.

2. The main types of knowledge patterns are: generalized knowledge, association knowledge, class knowledge, predictive knowledge, and deviation knowledge.

3. The main branches of Web mining research are: Web structure mining, Web usage mining, and Web content mining.

4. In general, KDD is a multi-step process, which can be broadly divided into the problem definition, data extraction, data preprocessing, data mining, and pattern evaluation phases.

5. The process models of knowledge discovery in databases include: the stepwise process model, the spiral process model, the user-centered process model, the online KDD model, and the KDD process model supporting multiple data sources and multiple knowledge patterns.

6. Roughly speaking, the development of knowledge discovery software and tools has gone through three main stages: standalone knowledge discovery software, horizontal knowledge discovery tool sets, and vertical knowledge discovery solutions. The latter two reflect the two main directions in which knowledge discovery software is currently developing.

7. Building a decision tree classification model usually involves two steps: decision tree generation and decision tree pruning.

8. In terms of the main techniques used, classification methods can be grouped into four types:

Distance-based classification

Decision tree classification

Bayesian classification

Rule induction methods

9. Association rule mining can be divided into two sub-problems:

Finding frequent itemsets: given a user-specified minimum support (Minsupport), find all frequent itemsets or all maximal frequent itemsets.

Generating association rules: given a user-specified minimum confidence (Minconfidence), find the association rules within the frequent itemsets, as in the sketch below.
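To make the two sub-problems concrete, here is a minimal brute-force sketch in Python; the transactions and thresholds are made up for illustration, and a real miner would use Apriori or FP-growth rather than full enumeration:

```python
from itertools import combinations

# Toy transaction database; min_support / min_confidence play the role of
# the user-given Minsupport and Minconfidence thresholds from the text.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
min_support = 0.4
min_confidence = 0.6

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Sub-problem 1: find all frequent itemsets (brute force over all sizes).
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo))
        if s >= min_support:
            frequent[frozenset(combo)] = s

# Sub-problem 2: generate rules X -> Y from each frequent itemset;
# confidence(X -> Y) = support(X u Y) / support(X).
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = s / frequent[lhs]  # every subset of a frequent set is frequent
            if conf >= min_confidence:
                print(set(lhs), "->", set(itemset - lhs),
                      f"support={s:.2f} confidence={conf:.2f}")
```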

10. Data mining was proposed and developed on the basis of the full development of related disciplines and major technologies:

The development of databases and other information technologies

The deep application of statistics

The research and application of artificial intelligence technology

11. The effectiveness of association rule mining results should be measured by considering several angles together:

Accuracy: the mined rules must reflect the actual situation of the data.

Utility: the mined rules must be concise and usable.

Novelty: the mined association rules should provide new, valuable information to the user.

12. Common types of constraints include:

Monotonicity constraints;

Anti-monotonicity constraints;

Convertible constraints;

Succinctness constraints.

13. According to the levels of granularity involved in the rules, multilevel association rules can be divided into:

Same-level association rules: if the items in a rule all correspond to the same level of granularity, it is a same-level association rule.

Cross-level association rules: if the problem is considered at different levels of granularity, cross-level association rules may be obtained.

14. Based on the main ideas of the algorithms, clustering methods can be summarized into the following categories.

Partitioning methods: construct a partition of the data based on certain criteria.

Clustering methods in this category include k-means, k-modes, k-prototypes, k-medoids, PAM, CLARA, and CLARANS (a k-means sketch appears under item 24 below).

Hierarchical methods: perform a hierarchical decomposition of the given set of data objects.

Density-based methods: evaluate clusters based on the density connectivity of data objects.

Grid-based methods: divide the data space into a grid structure with a finite number of cells and perform clustering based on the grid structure.

Model-based methods: assume a model for each cluster, then look for the data that best satisfies that model.

15. Measures of the distance between classes include (a small sketch of all four follows this list):

Single-linkage (shortest distance) method: the distance between the two closest elements of two classes is defined as the distance between the classes.

Complete-linkage (farthest distance) method: the distance between the two farthest elements of two classes is defined as the distance between the classes.

Centroid method: the distance between the centers of the two classes is defined as the distance between the classes.

Group average method: compute the distance between every pair of elements drawn one from each class, and combine them into the inter-class distance, for example as an average or a sum of squares.
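A self-contained sketch of these four measures on two made-up point clusters:

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):    # shortest distance method
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2):  # farthest distance method
    return max(dist(a, b) for a in c1 for b in c2)

def centroid_distance(c1, c2): # centroid method
    m1 = [sum(xs) / len(c1) for xs in zip(*c1)]
    m2 = [sum(xs) / len(c2) for xs in zip(*c2)]
    return dist(m1, m2)

def group_average(c1, c2):     # group average method
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

c1 = [(0, 0), (0, 1)]
c2 = [(3, 0), (4, 1)]
for f in (single_linkage, complete_linkage, centroid_distance, group_average):
    print(f.__name__, round(f(c1, c2), 3))
```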

16. Concretely, hierarchical clustering methods can be divided into:

Agglomerative hierarchical clustering: a bottom-up strategy that first treats each object as its own cluster, then merges these clusters into larger and larger clusters until a termination condition is satisfied.

Divisive hierarchical clustering: a top-down strategy that first places all objects in one cluster, then gradually subdivides it into smaller and smaller clusters until a termination condition is reached.

The representative agglomerative algorithm is AGNES (sketched below); the representative divisive algorithm is DIANA.
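A minimal AGNES-style sketch of the bottom-up strategy, using single linkage on made-up points (real implementations cache the distance matrix instead of recomputing it):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    return min(dist(a, b) for a in c1 for b in c2)

points = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)]
clusters = [[p] for p in points]  # bottom-up: every object starts as a cluster
k = 2                             # termination condition: stop at k clusters

while len(clusters) > k:
    # find and merge the pair of clusters with the smallest distance
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
    )
    clusters[i].extend(clusters.pop(j))

print(clusters)
```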

17. The approaches and targets of text mining (TD) are varied; the basic levels are:

Keyword search: the simplest level, similar to traditional search techniques.

Association mining: focuses on mining associations between pieces of page information (including keywords).

Information classification and clustering: use data mining classification and clustering techniques to categorize pages and organize large numbers of pages at a higher level of abstraction.

Natural language processing: use natural language processing techniques to uncover semantics and process Web content more accurately.

18. Techniques commonly used in Web access information mining:

Path analysis

Path analysis is most commonly used to determine the most frequently visited paths in a Web site; such knowledge is very important for an e-commerce site or for information security assessment (a small counting sketch follows this list).

Association rules

Association rule discovery methods can find general relevant knowledge from the set of Web access transactions.

Sequential patterns

In a time-stamped, ordered transaction set, discovering sequential patterns means finding intra-transaction patterns of the form "some items are followed by another item".

Classification

Discovering classification rules yields a description of the common properties that identify a particular group; this description can then be applied to new items.

Clustering

Customers with similar characteristics can be grouped together from Web usage data. Clustering customer information or data items in Web transaction logs can facilitate the development and execution of future marketing strategies.
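A toy path-analysis sketch: count how often each contiguous page path occurs across sessions (the sessions here are hypothetical; a real system would first reconstruct them from the server log):

```python
from collections import Counter

# Hypothetical per-session page sequences reconstructed from a Web log.
sessions = [
    ["/home", "/products", "/cart", "/checkout"],
    ["/home", "/products", "/cart"],
    ["/home", "/about"],
    ["/home", "/products", "/checkout"],
]

# Count every contiguous path of length 2 or 3; the most common entries
# are the most frequently traversed paths through the site.
paths = Counter()
for s in sessions:
    for length in (2, 3):
        for i in range(len(s) - length + 1):
            paths[tuple(s[i:i + length])] += 1

for path, n in paths.most_common(3):
    print(" -> ".join(path), f"({n} sessions)")
```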

19. By focus and function, data mining languages can be divided into three types:

Data mining query languages: aim to accomplish data mining tasks within a query language such as SQL.

Data mining modeling languages: languages for describing and defining data mining models; with a standard modeling language designed, data mining systems can follow that standard when defining and describing models.

General-purpose data mining languages: combine the features of the above two languages, providing both model definition and, as a query language, interactive communication with the data mining system. Standardizing a general-purpose data mining language is an attractive research topic for resolving the problems of the data mining industry.

20. There are four rule induction strategies: the subtraction, addition, add-then-subtract, and subtract-then-add strategies.

Subtraction strategy: start from a concrete example and generalize it, i.e. drop conditions (attribute values) or drop conjuncts (for convenience, we do not consider generalization by adding disjuncts), so that the generalized example or rule still covers no counterexample.

Addition strategy: the initial hypothesis is a rule whose condition part is empty (an always-true rule); as long as it covers counterexamples, keep adding conditions or conjuncts to the rule until it no longer covers any counterexample (a toy sketch of this strategy follows).

Add-then-subtract strategy: because attributes can be correlated, adding one condition may render a previously added condition useless, so previously added conditions may need to be subtracted again.

Subtract-then-add strategy: the same rationale as add-then-subtract; it likewise deals with correlations between attributes.
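A toy illustration of the addition strategy: start with an empty condition part and greedily add the attribute-value conjunct that best separates positives from negatives, until no counterexample is covered (the data and the greedy scoring are made up for illustration):

```python
# Toy training set: (attribute dict, label). We induce one rule for the
# positive class by adding conjuncts until no counterexample (negative
# example) is covered. A real learner would also need tie-breaking and
# stopping safeguards.
examples = [
    ({"outlook": "sunny", "windy": "no"},  True),
    ({"outlook": "sunny", "windy": "yes"}, True),
    ({"outlook": "rain",  "windy": "no"},  False),
    ({"outlook": "rain",  "windy": "yes"}, False),
]

def covers(rule, attrs):
    return all(attrs.get(a) == v for a, v in rule)

rule = []  # empty condition part: the rule initially covers everything
while any(covers(rule, attrs) and not label for attrs, label in examples):
    candidates = {(a, v) for attrs, _ in examples for a, v in attrs.items()}
    # pick the conjunct whose coverage best matches the class labels
    best = max(
        candidates - set(rule),
        key=lambda c: sum(
            covers(rule + [c], attrs) == label for attrs, label in examples
        ),
    )
    rule.append(best)

print("IF", " AND ".join(f"{a}={v}" for a, v in rule), "THEN positive")
```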

21. Data mining has both a broad and a narrow definition.

In the broad sense, data mining extracts from large data sets (possibly incomplete, noisy, uncertain, and stored in various forms) the knowledge that is implicit in them, not known in advance, and useful for decision-making.

In the narrow sense, data mining can be defined as the process of extracting knowledge from data in some particular form.

22. The meaning of Web mining: applying data mining methods to all kinds of Web data, including page content, inter-page structure, user access information, and e-commerce information, to help people extract knowledge from the Internet, thereby providing decision support to visitors, to site operators, and to Internet-based business activities including e-commerce.

23. Definition of the k-nearest neighbor classification algorithm (K Nearest Neighbors, KNN): compute the distance from the tuple to be classified to every training tuple, take the K training tuples nearest to the tuple to be classified, and assign the tuple to whichever class holds the majority among those K training tuples.
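A minimal KNN sketch along the lines of this definition (the training tuples are made up; requires Python 3.8+ for math.dist):

```python
import math
from collections import Counter

# Labeled training tuples: (feature vector, class label).
training = [
    ((1.0, 1.1), "A"), ((1.2, 0.9), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"),
]

def knn_classify(x, k=3):
    """Assign x the majority class among its k nearest training tuples."""
    neighbors = sorted(training, key=lambda tv: math.dist(x, tv[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 1.0)))  # -> A
print(knn_classify((4.0, 4.0)))  # -> B
```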

24. Performance analysis of the k-means algorithm:

Main advantages:

It is a classical algorithm for solving clustering problems, simple and fast.

For processing large data sets, the algorithm is relatively efficient and scalable.

When the resulting clusters are dense, its performance is good.

Main drawbacks:

It can only be used where the mean of a cluster is defined, which may be unsuitable for some applications.

k (the number of clusters to generate) must be given in advance, and the algorithm is sensitive to initial values: different initial values may lead to different results (see the sketch following this list).

It is not suitable for discovering clusters of non-convex shape or clusters of very different sizes. Moreover, it is sensitive to noise and outlier data.
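A minimal k-means sketch on made-up points, showing the two alternating steps and why k and the initial centers must be supplied up front:

```python
import math
import random

def kmeans(points, k, seed):
    random.seed(seed)
    centers = random.sample(points, k)  # k and initial centers given up front
    for _ in range(20):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
# Different seeds give different initial centers, which can change the result.
for seed in (0, 1):
    print(kmeans(points, k=2, seed=seed))
```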

25. Performance analysis of the ID3 algorithm:

ID3's hypothesis space contains all decision trees; it is a complete space of finite discrete-valued functions over the existing attributes. ID3's search therefore avoids the main risk of methods with incomplete hypothesis spaces: that the hypothesis space might not contain the target function.

At each step of its search, ID3 uses all current training examples, which greatly reduces its sensitivity to errors in individual training samples. It can therefore easily be extended to handle noisy training data by modifying its termination criterion.

ID3 does not backtrack during its search, so it is susceptible to the usual risk of backtrack-free hill climbing: converging to a locally optimal rather than globally optimal solution.
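For reference, a small sketch of the quantity ID3 greedily maximizes at each step, information gain, computed over all current training examples (toy data, with hypothetical attribute names):

```python
import math
from collections import Counter

# Toy training set: (attribute dict, class label). ID3 picks the attribute
# with the highest information gain, computed over ALL current examples.
data = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rain",  "windy": "no"},  "play"),
    ({"outlook": "rain",  "windy": "yes"}, "stay"),
]

def entropy(rows):
    counts = Counter(label for _, label in rows)
    total = len(rows)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def info_gain(rows, attr):
    """Entropy reduction obtained by splitting rows on attr."""
    values = {attrs[attr] for attrs, _ in rows}
    remainder = 0.0
    for v in values:
        subset = [r for r in rows if r[0][attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for attr in ("outlook", "windy"):
    print(attr, round(info_gain(data, attr), 3))  # "windy" wins here
```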

26. The Apriori algorithm has two fatal performance bottlenecks:

It scans the transaction database repeatedly, which requires a heavy I/O load.

In each cycle k, every element of the candidate set Ck must be verified by one scan of the database before it is admitted to Lk. If a frequent itemset contains 10 items, the transaction database must be scanned at least 10 times.

It may generate huge candidate sets.

The candidate set Ck is generated from Lk-1 and can grow exponentially; for example, 10^4 frequent 1-itemsets can generate nearly 10^7 candidate 2-itemsets. Candidate sets of this size are a challenge for both main memory and running time (a candidate-generation sketch follows).
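A sketch of how Ck is generated from Lk-1 (the classical join-and-prune step; itemset names are made up):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join L_{k-1} with itself, then prune candidates that contain an
    infrequent (k-1)-subset. With 10^4 frequent 1-itemsets the join alone
    yields on the order of 10^7 candidate 2-itemsets."""
    candidates = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    # prune step: every (k-1)-subset of a surviving candidate must be frequent
    return {
        c for c in candidates
        if all(frozenset(s) in L_prev for s in combinations(c, k - 1))
    }

L1 = {frozenset({i}) for i in ["A", "B", "C", "D"]}
C2 = apriori_gen(L1, 2)
print(len(L1), "frequent 1-itemsets ->", len(C2), "candidate 2-itemsets")
```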

27. The main improvements for raising the adaptability and efficiency of the Apriori algorithm are:

Partition-based methods: the basic principle is that "a k-itemset whose support is below the minimum support in every partition cannot be globally frequent" (a partition sketch follows this list).

Hash-based methods: the basic principle is that "a k-itemset whose hash bucket count is below the minimum support cannot be globally frequent."

Sampling-based methods: the basic principle is to "evaluate a sampled subset using sampling techniques, and in turn estimate the globally frequent k-itemsets from it."

Others: for example, dynamically deleting useless transactions: "a transaction that contains no itemset of Lk cannot affect the results of future scans and can therefore be deleted."
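A small sketch of the partition principle, restricted to 1-itemsets for brevity: mine each partition locally, take the union of the locally frequent itemsets as the global candidates, then verify them with one full scan (the transactions and threshold are made up):

```python
# Made-up transactions split into two partitions, each small enough to be
# mined in memory; for brevity only 1-itemsets are considered.
transactions = [
    {"a", "b"}, {"a", "c"}, {"a", "b"},  # partition 1
    {"b", "c"}, {"a", "b"}, {"c"},       # partition 2
]
partitions = [transactions[:3], transactions[3:]]
min_support = 0.5  # relative threshold, applied locally and globally

def frequent_items(db, threshold):
    items = set().union(*db)
    return {i for i in items if sum(i in t for t in db) / len(db) >= threshold}

# Union of locally frequent items = the global candidate set: an itemset
# that is frequent in no partition cannot be globally frequent.
candidates = set()
for part in partitions:
    candidates |= frequent_items(part, min_support)

# One full scan then verifies which candidates are globally frequent.
globally_frequent = {
    i for i in candidates
    if sum(i in t for t in transactions) / len(transactions) >= min_support
}
print("candidates:", candidates, "globally frequent:", globally_frequent)
```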

28. Web-oriented data mining is much more complicated than mining databases and data warehouses:

Heterogeneous data-source environment: information on the Web is heterogeneous: each site's information and organization are different; there is a great deal of unstructured text and complex multimedia; sites differ in usage, security, and privacy requirements; and so on.

The data is complex: some of it is unstructured (such as Web pages), with document-type information usually expressed in long sentences or phrases; some may be semi-structured (such as email or HTML pages); and some, of course, is well structured (such as spreadsheets). Uncovering the general descriptive features implicit in these composite objects becomes an inescapable responsibility of data mining.

Dynamic application environment:

Information on the Web changes frequently; news and stock information, for example, are updated in real time.

This dynamism is also reflected in the highly dynamic linking and random access of pages.

Users on the Web are difficult to predict.

Data in the Web environment is noisy.

29. Describe the I-MIN process model for knowledge discovery project management.

The I-MIN model divides the KDD process into steps IM1, IM2, ..., IM6; each step concentrates on a few issues, and the project's execution is controlled according to defined quality standards.

Task and purpose of IM1: this is the planning stage of a KDD project. It determines the enterprise's mining goals, selects the knowledge discovery patterns, and compiles the metadata derived from the knowledge discovery patterns; its purpose is to embed the enterprise's mining goals into the corresponding knowledge patterns.

Task and purpose of IM2: this is the preprocessing stage of KDD. It can be divided into IM2a, IM2b, IM2c, and so on, corresponding to the data cleaning, data selection, and data transformation stages. Its aim is to generate high-quality target data.

Task and purpose of IM3: this is the mining preparation stage of KDD. Data mining engineers carry out mining experiments, repeatedly testing and verifying the effectiveness of models. Its aim is to obtain knowledge concentrate (Knowledge Concentrate) through experimentation and training, and to provide models that end users can use.

Task and purpose of IM4: this is the data mining stage of KDD, in which users mine the corresponding knowledge from the data with the algorithms they specify.

Task and purpose of IM5: this is the knowledge representation stage of KDD, in which knowledge is put into the normalized form that the requirements specify.

Task and purpose of IM6: this is the knowledge interpretation and use stage of KDD; its purpose is to output the knowledge intuitively according to user requirements, or to integrate it into the enterprise's knowledge base.

30. What are the two steps of data classification?

Building a model that describes a predetermined set of data classes or concepts

Data tuples are also referred to as samples, instances, or objects.

The data tuples analyzed to build the model form the training data set.

The individual tuples in the training data set are called training samples; since the class label of each training sample is provided, this is also called supervised learning.

By analyzing the training data set, a model of the classification is constructed, which can be provided in the form of classification rules, a decision tree, or mathematical formulas.

Using the model for classification

First assess the predictive accuracy of the model (the classifier).

If the model's accuracy is considered acceptable, it can be used to classify data tuples or objects whose class label is unknown (a short end-to-end sketch follows).
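A compact end-to-end sketch of the two steps, assuming scikit-learn is available and using its bundled iris data as a stand-in for a labeled training data set:

```python
# Assumes scikit-learn is installed; iris stands in for a labeled data set.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 1: build the model from the training data set (supervised learning).
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: assess predictive accuracy on held-out data; if it is acceptable,
# use the model to classify tuples whose class label is unknown.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy = {accuracy:.2f}")
if accuracy >= 0.9:
    print("predicted class:", model.predict(X_test[:1])[0])
```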

31. Characteristics of Web access information mining:

Web access data is large in volume, widely distributed, rich in content, and diverse in form

A medium-sized website can record several megabytes of user access information every day.

The data is distributed widely across the whole world.

Access information takes varied forms.

Access information has rich connotations.

Web access data contains information usable for decision-making

Each user's access features can be used to characterize the user and the visits to the site.

Accesses by the same type of user represent the common personality of that class of users.

Access data over a period of time represents the behavior of user groups and what user groups have in common.

Web access information data is a bridge of communication between the website designer and the website visitors.

Web access information data is good target data for carrying out data mining research.

Characteristics of the objects mined in Web access information mining

The elements of an access transaction are Web pages, and there is rich structural information among the transaction's elements.

The elements of an access transaction represent the order of each visitor's accesses, and there is rich sequence information among the transaction's elements.

The content of each page can be abstracted into different concepts, and the access order and volume partly determine the concepts.

Users' dwell times on pages differ, and a long dwell time represents the user's interest in the accessed page.

32. Mining the text within Web pages:

The goals of page mining are summarization and classification.

Page summarization: apply traditional text summarization methods to each page to obtain the corresponding summary information.

Page classification: a classifier takes a set of Web pages as input (the training set) and performs supervised learning on the pages' text content; the learned classifier can then be used to classify each newly input page.

A commonly used method for text learning is the TFIDF vector representation, which is a bag-of-words representation of documents: all the words are extracted from the document, disregarding the order of words and the structure of the text. This method constructs a two-dimensional table as follows:

The column set (feature set) has one column for every distinguishable word in the dictionary, so the whole column set may run to hundreds of thousands of columns.

Each row stores the word information of one page, mapping all the words of the page onto the column set (feature set). For a given column (word), if the word does not appear in the page, its value is 0; if it appears k times, its value is k. Words in the page that are not in the column set can be discarded. In this way, a page can be characterized by its word frequencies.

For Chinese pages, word segmentation is needed first, before the two steps above.

The two-dimensional table constructed this way represents the word statistics of the Web page set; finally, Naive Bayes, k-nearest neighbor, and other classification mining methods can be applied.

Before mining, a subset of features is generally selected first to reduce the dimensionality (a minimal bag-of-words sketch follows).
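A minimal sketch of building the two-dimensional word-frequency table described above (two tiny made-up "pages"; a real pipeline would add TFIDF weighting and feature selection on top):

```python
# Two tiny made-up "pages"; real input would be tokenized Web page text
# (for Chinese pages, after word segmentation).
pages = [
    "data mining extracts knowledge from data",
    "web mining applies data mining to web data",
]

# Column set (feature set): every distinct word in the page collection.
vocabulary = sorted({w for page in pages for w in page.split()})

# Each row stores, per column word, how many times it occurs in the page
# (0 if absent) -- the word-frequency table described above.
table = [[page.split().count(w) for w in vocabulary] for page in pages]

print(vocabulary)
for row in table:
    print(row)
```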

 
