Data Mining Question Set - True or False Questions

1. The measured value of an attribute and the meaning of that attribute value are completely equivalent. F
Analysis: The measured value of an attribute and the meaning of the attribute value are not always equivalent. The attribute value is usually a discrete value carrying a specific meaning (for example, a category label), while the measured value is only the number used to record it.

2. "Sunny" and "Cloudy" in the weather attribute values ​​​​can be represented by different numbers, and they have no sequential relationship T

3. Ordinal attribute values have an inherent order, so greater-than and less-than comparisons can be performed on them. T
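
A minimal Python sketch of the distinction in items 2 and 3; the attributes and numeric codes below are illustrative assumptions, not part of the original questions.

```python
# Items 2 and 3: numeric codes for a nominal attribute (weather) carry no order,
# while codes for an ordinal attribute (size) do.
weather_codes = {"Sunny": 0, "Cloudy": 1, "Rainy": 2}   # nominal: labels only
size_codes = {"small": 0, "medium": 1, "large": 2}      # ordinal: order is meaningful

# For nominal codes, only equality/inequality tests are meaningful:
print(weather_codes["Sunny"] != weather_codes["Cloudy"])  # True, and that is all we can say

# For ordinal codes, < and > comparisons are meaningful:
print(size_codes["small"] < size_codes["large"])          # True: "small" ranks below "large"
```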

4. Binary attribute values are usually represented by 0 or 1, and their magnitudes can be compared. F
Analysis: A binary attribute is a discrete attribute type that indicates whether an item belongs to a certain category. It is usually encoded as 0 or 1, where 0 means the item does not belong to the category and 1 means it does; it is a special categorical type with only two values. Because the two codes are merely category labels, magnitude comparison is meaningless: whether one value is "greater than" the other has no interpretation, and there is no size relationship between 0 and 1 here. Therefore, in data analysis and processing, binary attributes should be treated as categorical attributes, not numerical attributes.
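
A small illustration of item 4, assuming a made-up "is_student" binary attribute: the 0/1 codes identify categories, so only equality tests are meaningful, not size comparisons.

```python
# Item 4: a binary attribute encoded as 0/1 is categorical.
# "is_student" is a hypothetical attribute used only for illustration.
records = [{"name": "A", "is_student": 1}, {"name": "B", "is_student": 0}]

# Meaningful: test category membership.
students = [r for r in records if r["is_student"] == 1]

# Not meaningful: "1 > 0" holds numerically, but it says nothing about the
# categories themselves -- being a student is not "greater than" not being one.
print(len(students))  # 1
```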

5. A Celsius temperature of 24.4 is twice as warm as a Celsius temperature of 12.2. F
Analysis: A Celsius temperature of 24.4 is not twice as warm as one of 12.2, because Celsius is an interval scale rather than a ratio scale. Its zero point, 0°C, is the freezing point of water at standard atmospheric pressure, not absolute zero, so ratios of Celsius values have no physical meaning and cannot be obtained by simple division. Only differences between Celsius values are meaningful: the difference between 24.4 and 12.2 is 12.2 degrees, but that does not make 24.4 "twice" 12.2. Ratios are only meaningful on a ratio scale with a true zero, such as Kelvin.
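
A short worked check of item 5: converting the two temperatures to the Kelvin ratio scale shows how far the true ratio is from 2.

```python
# Item 5: Celsius is an interval scale, so ratios of Celsius values are meaningless.
# Converting to Kelvin (a ratio scale with a true zero) shows the actual ratio.
t1_c, t2_c = 24.4, 12.2
t1_k, t2_k = t1_c + 273.15, t2_c + 273.15

print(t1_c / t2_c)   # 2.0   -- the misleading Celsius "ratio"
print(t1_k / t2_k)   # ~1.04 -- the physically meaningful ratio
```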

6. Data normalization mainly includes two aspects, data homogeneity processing and dimensionless processing, which scale attribute values proportionally into a specific interval such as [-1, 1] or [0, 1]. T
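
A minimal sketch of item 6 using min-max scaling to map values into [0, 1]; the sample data are made up.

```python
# Item 6: min-max normalization rescales attribute values into [0, 1].
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:                     # avoid division by zero for a constant attribute
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30, 50]))  # [0.0, 0.25, 0.5, 1.0]
```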

7. When distance is used to measure the similarity between objects, the greater the distance, the greater the similarity. F
Analysis: Using distance to measure the similarity between objects is a common approach, but a greater distance does not mean greater similarity. Distance measures the difference between two objects, so the smaller the distance, the higher the similarity; conversely, the larger the distance, the greater the difference and the lower the similarity. For example, if Euclidean distance is used to compare two people whose height, weight, age and other attribute values differ greatly, the distance between them will be large, indicating that they are very different and not similar. Conversely, if their attribute values differ only slightly, the distance will be small, indicating that the difference is small and the degree of similarity is high. Therefore, the smaller the distance, the higher the similarity between objects.
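
A small sketch for item 7: Euclidean distance between attribute vectors, where a smaller distance means higher similarity. The two example people and their values are invented.

```python
import math

# Item 7: smaller Euclidean distance => more similar objects.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

person1 = [170, 65, 30]   # height (cm), weight (kg), age -- illustrative values
person2 = [172, 66, 31]
person3 = [150, 95, 60]

print(euclidean(person1, person2))  # small distance: very similar
print(euclidean(person1, person3))  # large distance: very different
```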

8. Data reduction techniques can be used to obtain a reduced representation of the data set; although much smaller, it still roughly preserves the integrity of the original data. T
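
One simple illustration of item 8 is numerosity reduction by random sampling; the sampling fraction and data below are illustrative assumptions.

```python
import random

# Item 8: a reduced representation (here, a 10% random sample) is much smaller
# but still roughly preserves the overall characteristics of the data.
random.seed(0)
data = list(range(10_000))                      # stand-in for a large data set
sample = random.sample(data, k=len(data) // 10) # reduced representation

print(len(sample))                  # 1000
print(sum(sample) / len(sample))    # close to the full-data mean of 4999.5
```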

9. Information entropy provides a way to quantify uncertainty and is used to measure the uncertainty of a random variable; entropy is the expected value of the information content. T
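
A minimal sketch of item 9: entropy as the expected information content of a discrete label distribution, computed on made-up class labels.

```python
import math
from collections import Counter

# Item 9: entropy H = -sum(p * log2(p)) is the expected information content.
def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["yes", "yes", "no", "no"]))        # 1.0 bit: maximal uncertainty for two classes
print(entropy(["yes", "yes", "yes", "no"]))       # ~0.81 bits: less uncertainty than a 50/50 split
```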

10. The C4.5 algorithm selects the attribute with the highest information gain as the test attribute. F
Analysis: The statement describes ID3 rather than C4.5. Both algorithms start from information gain, which measures the contribution of a feature to the classification task and is computed as:
$Gain(D, A) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$
where D is the data set, A is the feature, V is the number of values that A can take, D^v is the subset of samples for which A equals v, Ent(D) is the entropy of D, and Ent(D^v) is the entropy of D^v.
ID3 computes the information gain of each feature and selects the feature with the highest information gain as the test attribute. Because information gain is biased toward features with many distinct values, C4.5 instead normalizes the gain by the feature's split information and selects the feature with the highest information gain ratio, so the statement as written is false.
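
A minimal sketch of item 10 on an invented toy data set: computing the information gain of one feature and the gain ratio that C4.5 actually uses as its criterion.

```python
import math
from collections import Counter

# Toy data for item 10 (made up): each row is (value of feature A, class label).
rows = [("sunny", "no"), ("sunny", "no"), ("cloudy", "yes"),
        ("rainy", "yes"), ("rainy", "no"), ("cloudy", "yes")]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows):
    labels = [y for _, y in rows]
    gain = entropy(labels)
    for value in set(v for v, _ in rows):
        subset = [y for v, y in rows if v == value]
        gain -= len(subset) / len(rows) * entropy(subset)   # subtract weighted child entropy
    return gain

def split_info(rows):
    # Entropy of the feature's own value distribution (assumed non-zero here).
    return entropy([v for v, _ in rows])

gain = info_gain(rows)
ratio = gain / split_info(rows)      # gain ratio: the criterion C4.5 uses
print(gain, ratio)
```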

11. Information gain is defined per feature: for a given feature, compare how much information the system has with and without it; the difference between the two is the information gain that the feature brings to the system. T

12. Post-pruning of a decision tree is achieved by pruning branches on the fully grown tree; a tree node is pruned by deleting its branches. T
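
As one concrete illustration of item 12, scikit-learn's cost-complexity pruning collapses branches of a tree after it has been built; this is a sketch assuming scikit-learn is available, and cost-complexity pruning is only one of several post-pruning methods.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Item 12: grow the tree, then prune branches back (cost-complexity post-pruning).
X, y = load_iris(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)                    # fully grown
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)  # branches pruned away

print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)  # pruned tree has fewer nodes
```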

13. The entropy H(X, A) of attribute A is the cost that must be paid to obtain the sample's information about attribute A. T

Origin blog.csdn.net/qq_52331221/article/details/129824720