Machine Learning Final Review Questions


1. Which of the following is not a step in the knowledge discovery process? (D)

A. Data cleaning

B. Data Mining

C. Visual expression of knowledge

D. Data testing

2. Collaborative filtering analyzes user interests, finds users in the user group who are similar to (share interests with) a specified user, and combines those users' ratings of an item to form the system's preference ( D ) for that item on behalf of the specified user; items liked by these similar users are then recommended to the specified user.

A. similar

B. same

C. Recommended

D. Forecast

3. Which of the following is not a common attribute type? (C)

A. Nominal attributes

B. Numerical attributes

C. High-dimensional attributes

D. Ordinal attributes

4. Which of the following measures describes the dispersion of data? (C)

A. mean

B. Median

C. Standard deviation

D. Mode

5. Which of the following measures does not describe the central tendency of data? (D)

A. mean

B. Median

C. Mode

D. quartiles

6. Which step of data mining is the task of cleaning, integrating, transforming and reducing data? (C)

A. Frequent pattern mining

B. Classification and prediction

C. Data preprocessing

D. Noise detection

7. Clustering analysis is an important technology of data mining. Which of the following algorithms is not a clustering algorithm? (C)

A、K-Means

B、DBSCAN

C、SVM

D、EM

8. Among the following statements about Anaconda components, the wrong one is (B).

A. Anaconda Prompt is the command line that comes with Anaconda

B. Jupyter Notebook is a client-based interactive computing environment that can edit documents that are easy for people to read, and is used to demonstrate the process of data analysis

C. Spyder is a cross-platform integrated development environment for scientific computing in Python

D. Anaconda Navigator is a graphical user interface for managing toolkits and environments, and many subsequent management commands can also be implemented manually in Navigator

Jupyter Notebook is a web-based interactive computing environment that can edit human-readable documents to demonstrate the process of data analysis

9. Among the components of Anaconda, the one that can edit the document and display the data analysis process is (D).

A、 Anaconda Navigator

B、Anaconda Prompt

C、Spyder

D、Jupyter Notebook

Jupyter Notebook can reproduce the entire analysis process and integrate explanatory text, code, charts, formulas and conclusions into one document

10. Which language is Matplotlib mainly written in? (A)

A、Python

B、java

C、C++

D、C

11. Among the following options, the one used to bridge the data warehouse and ensure data quality is (B).

A. Data collection

B. Data processing

C. Data analysis

D. Data presentation

12. Among the following options, (A) is a web-based interactive computing environment, which can edit documents that are easy for people to read, and is used to demonstrate the process of data analysis.

A、Jupyter Notebook

B、Anaconda Navigator

C、Anaconda Prompt

D、Spyder

13. Among the following options, the one that does not belong to the ndarray object attribute is (D).

A、shape

B、dtype

C、ndim

D、map

14. Please read the following program:
import numpy as np
np.arange(1, 10, 3)
Run the program, and the final execution result is (B).

A、array([1, 4, 7, 10])

B、array([1, 4, 7])

C、array([2, 5, 8])

D、array([3, 6, 9])

15. Which of the following descriptions about ndarray objects is correct (B).

A. ndarray objects can store different types of elements

B. The types of storage elements in ndarray objects must be the same

C. ndarray objects do not support broadcast operations

D. ndarray objects do not have vector computing capabilities

By the design of ndarray, all elements in the object must have the same type

16. Regarding the properties of ndarray objects, the following description is wrong (C).

A. The ndim attribute indicates the number of array axes

B. The shape attribute indicates the size of the array in each dimension

C. The size attribute indicates the total number of array elements, which is equal to the sum of the tuple elements of the shape attribute

D. The dtype attribute represents the object of the element type in the array

The size attribute indicates the total number of array elements, which is equal to the product of the tuple elements of the shape attribute
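The relationship between these attributes can be checked with a minimal sketch like the one below. The array contents are an arbitrary choice made for illustration.

```python
import numpy as np

# A 3 x 4 array used only for illustration
arr = np.arange(12).reshape(3, 4)

ndim = arr.ndim      # number of axes
shape = arr.shape    # length of the array along each axis
size = arr.size      # total number of elements
print(ndim, shape, size, arr.dtype)

# size is the product of the shape entries, not their sum
product_of_shape = int(np.prod(shape))
sum_of_shape = sum(shape)
print(size == product_of_shape, size == sum_of_shape)
```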

17. About creating ndarray objects. The following description is wrong (A).

A. Use the list() function to create an ndarray object

B. Create an array whose element values are all 1 through the ones() function

C. ndarray objects can be created using the array() function

D. Create an array whose element values are all 0 through the zeros() function

The list() function cannot create an ndarray object, but you can pass a list as a parameter into the array() function to create an ndarray object

18. The following description about ndarray object index is wrong (D).

A. The elements in the ndarray object can be accessed and modified by indexing and slicing

B. Fancy indexing uses an integer array or list as the index, taking each element of that array or list as a subscript to obtain a value

C. Boolean index is to use a Boolean array as an array index, and the returned data is the value corresponding to True in the Boolean array

D. The multidimensional array index and slice of the ndarray object are used in exactly the same way as the list

For example, if you want to obtain a certain number in a two-dimensional array, you need to use the form "arr[x, y]" to obtain
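The indexing styles named in this question can be compared side by side in a short sketch. The array values and the variable names are chosen here for illustration only.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
nested = arr2d.tolist()  # the same data as a plain nested list

single = arr2d[1, 1]             # row 1, column 1 in one subscript
column = arr2d[:, 0]             # every row, column 0
fancy = arr2d[[0, 2], [2, 0]]    # fancy indexing: elements (0, 2) and (2, 0)
masked = arr2d[arr2d > 6]        # Boolean indexing: values where the mask is True

# A nested list needs chained subscripts and has no column slice
list_single = nested[1][1]
print(single, column, fancy, masked, list_single)
```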

19. Among the following array statistics methods, the one used to calculate the maximum value in the array is (A).

A、max

B、maximum

C、min

D、maximal

20. Please read the following sample program:
import numpy as np
arr1 = np.array([[0], [1], [2]])
arr2 = np.array([1, 2])
result = arr1 + arr2
print(result.shape)
Run the above program; the final output is (A).

A、(3, 2)

B、(2, 3)

C、(3, 0)

D、(2, 0)
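The shape in this question follows from NumPy's broadcasting rule. The sketch below repeats the question's arrays and adds one incompatible pair (an assumption added for contrast) to show when broadcasting fails.

```python
import numpy as np

arr1 = np.array([[0], [1], [2]])   # shape (3, 1)
arr2 = np.array([1, 2])            # shape (2,)

# (3, 1) and (2,) broadcast to (3, 2): each length-1 axis is stretched
result = arr1 + arr2
print(result.shape)
print(result)

# Shapes (3,) and (4,) are incompatible: neither length is 1
try:
    np.arange(3) + np.arange(4)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)
```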

21. Which of the following descriptions about array operations is wrong (D).

A. In NumPy, any computation between arrays of equal size is applied element-wise

B. The broadcast mechanism means to expand the array so that the value of the shape attribute of the array is the same

C. Scalar operations produce a new array with the same number of rows and columns as the original, in which each element of the original array has had the scalar added, subtracted, multiplied, or divided

D. Arrays do not support operations between arithmetic operators and scalars

Arrays support operations between arithmetic operators and scalars

22. Please read the following sample program:
arr2d = np.array([[11, 20, 5],[21, 15, 26],[17, 8, 19]])
arr2d[0:2, 0:2 ]
Run the above program, the final execution result is (A).

A、array([[11, 20],[21, 15]])

B、array([11, 20])

C、array([21, 15])

D、array([11, 21])

23. Please read the following program:
arr = np.arange(12).reshape(3, 4)
arr.shape
Run the above program; the final execution result is (C).

A、3

B、4

C、(3, 4)

D、(1, 2)

24. Please read the following program:
arr2d = np.empty((4, 4))              
for i in range(4):
    arr2d[i] = np.arange(i, i + 4)   
arr2d[[0,4] ,[3,1]]
Execute the above program, and the final output result is (B).

A、array([3., 4.])

B. The program throws an IndexError exception

C、array([3., 5.])

D、array([4., 4.])

25. It is known that there is a two-dimensional array as follows:
arr2d = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])
If you want to get element 5, you can use (A).

A、arr2d[1, 1]

B、arr2d[1]

C、arr2d[2]

D、arr2d[1, 0]

26. Please see the following code:
import numpy as np   
arr = np.array([[6, 2, 7], [3, 6, 2], [4, 3, 2]])
arr.sort()
arr
The correct result of executing the sort() method on the NumPy array in the code above is (A).

A、[[2 6 7] [2 3 6] [2 3 4]]

B、[[2 6 7] [6 3 2]]

C、[[7 6 2] [6 3 2]]

D、[[7 6 2] [2 3 6]]

27. When creating an ndarray object, you can use the (A) parameter to specify the element type.

A、dtype

B、dtypes

C、type

D、types

28. To create an array with 3 rows and 4 columns, the following option is correct (B).

A、np.arange(12).reshape(4, 3)

B、np.arange(12).reshape(3, 4)

C、np.arange(7).reshape(4, 3)

D、np.arange(7).reshape(3, 4)

29. Among the NumPy general functions, the function used to calculate the element-level maximum is (B).

A、max

B、maximum

C、min

D、maximal
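The difference between the max() method and the maximum() universal function can be seen in a small sketch. The two input arrays are made up for illustration.

```python
import numpy as np

a = np.array([1, 8, 3])
b = np.array([4, 2, 9])

largest = a.max()            # reduction: one value for the whole array
pairwise = np.maximum(a, b)  # universal function: element-wise maximum of two arrays
print(largest, pairwise)
```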

30. Among the following functions, the one used to calculate the absolute value of an integer is (C).

A、square()

B、sqrt()

C、abs()

D、floor()

31. Regarding the Series structure, the following description is correct (B).

A. Series is an object similar to a two-dimensional array

B. Series consists of two parts: a set of data and related indexes

C. Series can only save data of integer and string types

D. The index of Series starts from 1 by default

32. Please read the following program:
import pandas as pd
ser_obj = pd.Series(range(1, 6), index=[5, 3, 1, 3, 2])
print(ser_obj)
After executing the above program, the final output is (B).

A、a 3.0, d 2.0, c 1.0, b NaN

B、a 3.0, b NaN, c 1.0, d 2.0

C. The program raises an exception

D、c 1, d 2, a 3

33. Which of the following statements about the Pandas library is correct (C).

A. There are only two data structures in Pandas

B. Pandas does not support reading text data

C. Pandas is a new library based on NumPy

D. Series and DataFrame in Pandas can solve all problems in data analysis

In addition to the two common data structures introduced in the book, older versions of Pandas have another data structure, Panel (deprecated in pandas 0.20 and removed in pandas 0.25)

34. Regarding the sorting of data in Pandas, the following statement is correct (A).

A. It can be sorted by row index or by column index

B. The sort_index() method means sorting by value

C. The sort_values() method means sorting according to the index

D. By default, the sort_index() method sorts in descending order

35. When Pandas performs arithmetic operations, unaligned positions will be filled with (C).

A、Null

B、0

C、NaN

D、null_values
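Index alignment in Series arithmetic can be sketched as follows. The two Series and their labels are invented for illustration; add() with fill_value is shown as one way to substitute a default for labels missing on one side.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20], index=["b", "d"])

# Labels present in only one operand produce NaN in the result
total = s1 + s2
print(total)

# add() with fill_value treats a label missing from one side as 0
filled = s1.add(s2, fill_value=0)
print(filled)
```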

36. Among the following options, the one that cannot create a Series object is (D).

A、ser_obj = pd.Series([1, 2, 3, 4, 5])

B、ser_obj = pd.Series({2001: 17.8, 2002: 20.1, 2003: 16.5})

C、ser_obj = pd.Series((1,2,3,4))

D、ser_obj = pd.Series(1,2)

37. Among the statements about Pandas data reading and writing, the following description is wrong (A).

A. read_csv() can read all text data

B. read_sql() can read the data in the database

C. to_csv() can write structured data into csv files

D. to_excel() can write structured data into excel files

38. The following statement about DataFrame is correct (A).

A. The DataFrame structure is composed of indexes and data

B. The row index of the DataFrame is on the far right

C. You need to specify the index when creating a DataFrame object

D. The data type of each column of DataFrame must be the same

39. Please read the following program
import pandas as pd
ser_obj = pd.Series(range(1, 6), index=[5, 3, 0,4, 2])
print(ser_obj.sort_index())
After executing the above program, the final output (index value pairs, in order) is (B).

A、5 1, 3 2, 0 3, 4 4, 2 5

B、0 3, 2 5, 3 2, 4 4, 5 1

C、5 1, 4 4, 3 2, 2 5, 0 3

D、2 5, 4 4, 0 3, 3 2, 5 1
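Sorting by index and sorting by value can be compared on the same Series used in this question. The sort_values() call with ascending=False is an extra added here for contrast.

```python
import pandas as pd

ser_obj = pd.Series(range(1, 6), index=[5, 3, 0, 4, 2])

by_index = ser_obj.sort_index()                        # ascending index order by default
by_value_desc = ser_obj.sort_values(ascending=False)   # sort by data, largest first
print(by_index)
print(by_value_desc)
```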

40. Among the following options, the method used to delete missing values is (C).

A、isnull()

B、delete()

C、dropna()

D、fillna()

The isnull() method is used for detection: a True in the returned result marks a missing value. The fillna() method is used to fill in missing data.
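The three methods can be tried on a small table. The DataFrame below and the replacement values passed to fillna() are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [None, "x", "y"]})

mask = df.isnull()        # True wherever a value is missing (NaN or None)
dropped = df.dropna()     # drop every row that contains a missing value
filled = df.fillna({"a": 0.0, "b": "missing"})  # per-column replacement values
print(mask)
print(dropped)
print(filled)
```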

41. Regarding the statement about outliers, the wrong description in the following options is (A).

A. Outliers refer to individual values in the sample that deviate significantly from the rest of the observations

B. Outliers can be detected using the 3σ principle

C. You can use the boxplot in Pandas to detect outliers

D. Outliers can be replaced by other values

Abnormal data is not necessarily a data error, so it will be deleted or retained according to the actual situation.

42. Among the following statements about missing value detection, the correct one is (B).

A. isnull() and notnull() can handle missing values

B. The dropna() method can delete both observation records and features

C. The value used to replace missing values in the fillna() method can only be a DataFrame object

D. The interpolate module in the Pandas library contains a variety of interpolation methods

43. Among the following options, the correct description of the fillna() method is (D).

A. The fillna() method can only fill data whose replacement value is NaN

B. Only the forward filling method is supported

C. The maximum number of padding that can be supported by default is 1

D. The fillna() method can fill data with replacement values of NaN and None

44. Among the following options, the wrong description about the drop_duplicates() method is (A).

A. Only supports deduplication of single feature data

B. Only valid for Series and DataFrame objects

C. When deduplicating data, the first data is retained by default

D. This method will not change the original data arrangement

45. In the statement about data reshaping, the following option description is wrong (C).

A. Data reshaping can convert DataFrame to Series

B. The stack() method can convert the column index to the row index

C. After using the stack() method on a DataFrame, the return must be a Series

D. The unstack() method can convert the row index to the column index

When a DataFrame has a hierarchical index, using the stack() method returns a DataFrame object.
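The return types of stack() and unstack() can be checked with a sketch like this one. The frames and the column labels g1, g2, x, y are hypothetical; the second frame has two-level columns to show the hierarchical case.

```python
import pandas as pd

# Single-level columns: stack() moves the columns into the row index
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
stacked = df.stack()           # Series indexed by (row label, column label)
restored = stacked.unstack()   # back to a DataFrame

# Two-level columns: stack() moves only the innermost column level
cols = pd.MultiIndex.from_product([["g1", "g2"], ["x", "y"]])
wide = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], index=["r1", "r2"], columns=cols)
stacked_wide = wide.stack()    # still a DataFrame, with columns g1 and g2
print(type(stacked).__name__, type(stacked_wide).__name__)
```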

46. Among the following options, the correct description of the dropna() method is (C).

A. The dropna() method will only delete data whose value is NaN

B. The dropna() method will not delete data whose value is None

C. The dropna() method will delete data whose values are None and NaN

D. The dropna() method will only detect missing data and null values

dropna() deletes both None and NaN by default, and the axis parameter specifies whether rows or columns are deleted.

47. Please read the following program:
from pandas import Series
import pandas as pd
from numpy import nan
series_obj = Series([2, 1, nan])
print(pd.isnull(series_obj))
After executing the above program, the final output is (A).

A、0 False, 1 False, 2 True

B、0 True, 1 True, 2 False

C、0 False, 1 False, 2 False

D、0 True, 1 True, 2 True

48. In the statement about dummy variables, the following option description is wrong (D).

A. Dummy variables are artificial variables

B. After the dummy variable is converted into an index matrix, its value is usually 0 or 1

C. The get_dummies() function in Pandas can process dummy variables for categories

D. The use of dummy variables has no practical significance
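How a categorical column becomes 0/1 indicator columns can be sketched with get_dummies(). The color column is a made-up example, and dtype=int is passed so the indicators print as 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})

# One indicator column per category; each row has a 1 in its own category
dummies = pd.get_dummies(df["color"], dtype=int)
print(dummies)
```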

49. Among the following statements about data preprocessing, the description is incorrect (D).

A. The purpose of data cleaning is to improve data quality

B. Outliers do not necessarily have to be deleted

C. The duplicate data can be deleted by the drop_duplicates() method

D. The concat() function can merge different DataFrames according to one or more keys

The concat() function can stack multiple objects along an axis.

50. In the statement about preprocessing, the description in the following option is incorrect (D).

A. The concat() function can stack multiple objects along an axis

B. The merge() function can merge different DataFrames according to one or more keys

C. You can use the rename() method to rename the index

D. The unstack() method can rotate the column index into a row index

The unstack() method can rotate the row index to the column index.
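The difference between stacking with concat() and key-based joining with merge() can be sketched as follows. The two frames and the column names key, v1, v2 are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "v1": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "v2": [20, 30, 40]})

# concat(): stack the objects along an axis, no key matching involved
stacked = pd.concat([left, right], axis=0, ignore_index=True)

# merge(): combine rows whose key values match
inner = pd.merge(left, right, on="key", how="inner")
left_join = pd.merge(left, right, on="key", how="left")
print(stacked.shape, inner.shape, left_join.shape)
```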

51. Among the following functions, the one used to stack Pandas objects along the axis is (A).

A、concat()

B、join()

C、merge()

D、combine_first()

52. The application of machine learning in the field of natural language processing does not include (C).

A. Question answering system

B. Information collection

C. Pathological analysis

D. Real-time translation

53. Which of the following steps will perform tasks such as transformation, variable correlation, and standardization on the original data (C).

A. Deployment

B. Business needs analysis

C. Data preprocessing

D. Results evaluation

54. Data preprocessing is very important for machine learning. The following statement is correct (A).

A. The effect of data preprocessing directly determines the quality of machine learning results

B. Data noise has no effect on the training of neural networks

C. You can directly delete the problematic data

D. Preprocessing does not need to spend a lot of time

55. The following statement about machine learning engineers is correct (C).

A. No need to understand certain relevant business knowledge

B. No need to be familiar with data extraction and preprocessing

C. Requires certain data analysis and actual project training

D. After training, you will be able to perform actual data analysis

56. Which of the following machine learning methods can be used by mobile operators to segment customers, design packages and marketing activities (C).

A. Bayesian classifier

B. Association method

C. Clustering algorithm

D. Multi-layer feed-forward network

57. Which of the following steps is not a preprocessing work required for machine learning (D).

A. Standardization of numerical attributes

B. Variable correlation analysis

C. Outlier Analysis

D. Discuss analysis requirements with users

58. The following understanding of machine learning is incorrect (A).

A. Query a large amount of operational data to discover new information

B. The process of analyzing interesting novel knowledge from a large amount of business data to assist decision-making

C. The results of machine learning may not necessarily assist decision-making

D. Some algorithms that need the help of statistics or machine learning

59. For mobile operators to predict customer churn, which of the following machine learning methods can be used is more appropriate (D).

A. Univariate linear regression analysis

B. Association method

C. Clustering method

D. Multi-layer feed-forward network

60. The relationship between the amount of film investment and film income can be expressed by a linear regression equation. The following statement is correct (C).

A. The more investment, the less income

B. The less investment, the more income

C. The more investment, the more income

D. The relationship between investment and income is uncertain

61. Feature engineering does not include (B).

A. Feature construction

B. Feature Merging

C. Feature selection

D. Feature extraction

62. Which of the following data mining methods can be used to analyze the relationship between marketing input and sales revenue (B).

A. Correlation analysis

B. Regression analysis

C. Clustering method

D. Recommended algorithm

63. Which of the following statements about regression analysis is correct (D).

A. Regression analysis is a statistical method to analyze the linear relationship between one variable and one (or several) other variables

B. Regression analysis does not require sample training

C. It is impossible to predict the category of non-data type attributes

D. The nonlinear regression equation is generally transformed into a linear regression equation to solve the parameters easily.

64. For nonlinear regression problems, which of the following statements is wrong (A).

A. You can find the regression equation of a single independent variable and dependent variable separately, and then simply find the weighted sum of these equations

B. The coefficients of the nonlinear regression equation need to be converted into a linear regression equation to facilitate the solution

C. Nonlinear regression models can also be tested using R²

D. Logistic regression is a typical generalized linear regression model

65. Regarding the coefficient of the regression model, which of the following statements is wrong (B).

A. The coefficients of the unary linear regression model can be obtained using the least squares method

B. The coefficients of the multiple regression model can be obtained using the gradient descent method

C. The coefficient size and positive or negative of the unary linear regression model indicate the relative influence of the independent variable on the dependent variable

D. The purpose of regression analysis is to calculate the coefficients of the regression equation so that the relationship between the input and output variables of the sample can be reasonably fitted

66. Which of the following descriptions of principal component analysis (PCA) is wrong? (D)

A. PCA is to sequentially find a set of mutually orthogonal coordinate axes from the original space

B. The direction with the largest variance in the original data is the first coordinate axis

C. Realize PCA algorithm based on eigenvalue decomposition covariance matrix

D. Singular value decomposition can only be applied to matrix decomposition of specified dimensions
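The eigenvalue-decomposition route to PCA mentioned in this question can be sketched in NumPy. The synthetic data, the random seed, and the choice of two components are assumptions made for illustration; the SVD lines at the end show that the same variances come out of a decomposition of the centered data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with very different variance along each axis
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)               # center each feature
cov = np.cov(Xc, rowvar=False)        # 3 x 3 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]         # reorder so the largest variance is first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

components = eigvecs[:, :2]           # two mutually orthogonal axes
projected = Xc @ components           # data in the reduced space

# Same variances from an SVD of the centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = S ** 2 / (len(X) - 1)
print(eigvals, svd_vals)
```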

67. Which of the following descriptions of singular value decomposition (SVD) is wrong? (A)

A. Singular value decomposition is to decompose a linear transformation into two linear transformations

B. The singular value often corresponds to the important information hidden in the matrix, and the importance is positively correlated with the size of the singular value

C. SVD is an improvement to PCA, and its calculation cost is lower. The same thing is that the goal of both is to reduce dimensionality.

D. Singular values can be applied not only to data compression but also to image denoising

68. Which of the following statements about linear discriminant analysis is wrong? (B)

A. By linearly transforming the original data, the samples of different classes are separated as much as possible

B. The linear transformation in linear discriminant analysis can increase the variance among samples of the same class

C. Linear transformation can increase the distance between samples of different categories

D. Improve the separability of different types of samples

69. Which of the following is a wrong understanding of visualization? (A)

A. Visualization is a method of simply displaying the original data in the form of a graph

B. Visualization can be used as a method of data preprocessing to find out the noise in it

C. Visualization itself is a method of data analysis, using charts to show the hidden laws in the data

D. Through the visualization of data, data analysts can promote the understanding of data and the discovery of laws

70. Which of the following statements about the principle of visualization is wrong (D).

A. Visualization is mainly to satisfy the sensitivity of human decision makers to visual information

B. The methodological basis of visual analysis is visual metaphor, which can abstract data to a certain extent

C. High-dimensional data visualization needs to transform the data and extract effective features to reduce the dimension

D. Pie charts can analyze the trend of data changes

71. Anaconda is completely free. (√)

72. Jupyter Notebook can save files in ipynb format. (√)

73. Anaconda does not support Python 2.x versions. (×)

Anaconda supports Python 2.6, 2.7, 3.4, 3.5 and other versions, and can switch freely between them

74. Seaborn is a data visualization tool based on Matplotlib in Python, which provides many high-level encapsulated functions. (√)

75. Python is a glue language that can easily operate libraries written in other languages. (√)

76. Use the pip command to view the packages installed by Anaconda. (√)

77. The advantage of Jupyter Notebook is that it can reproduce the entire analysis process and integrate explanatory text, codes, charts, formulas and conclusions into one document. (√)

78. As long as Anaconda is installed in the current system, you already have Jupyter Notebook by default, and there is no need to download and install it separately. (√)

79. If you want to uninstall the package in the specified environment, you can directly use the remove command to remove it. (√)

80. Jupyter Notebook can be opened using the command line. (√)

81. NumPy is an open source numerical computing extension tool for Python. (√)

82. Pandas is a NumPy-based data analysis package created to solve data analysis tasks. (√)

83. Jupyter Notebook can use Markdown syntax (√)

84. Using Anaconda for development can effectively solve the problem of package configuration and package conflicts. (√)

85. conda is an open source package management system and environment management system running on Windows, Mac OS, and Linux. (√)

86. Arithmetic operations cannot be performed on two arrays if they have different shapes. (×)

When, for each dimension, the two arrays either have equal length or one of them has length 1 (which includes a one-dimensional array whose length matches the last dimension of the other), the operation can be performed through the broadcast mechanism.

87. The data type of the ndarray object can be converted by the type() method. (×)

The data type of the ndarray object can be converted by the astype() method
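The difference between type() and astype() can be seen in a short sketch. The sample values are arbitrary.

```python
import numpy as np

arr = np.array([1.7, 2.2, 3.9])

container_type = type(arr)       # type() only reports the container class
as_int = arr.astype(np.int64)    # astype() returns a new array with converted elements
print(container_type.__name__, arr.dtype, as_int.dtype, as_int)
```

Converting floating-point values to an integer type discards the fractional part, and the original array is left unchanged.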

88. Arrays use slicing and indexing in exactly the same way as lists. (×)

89. The element values in the array created by the zeros() function are all 0. (√)

90. The types of storage elements in ndarray objects must be the same. (√)

91. If you want to create an array, you can only use the array() function. (×)

92. NumPy's array sorting defaults from small to large. (√)

93. Universal functions operate on every element in an array. (√)

94. Any arithmetic operation between arrays of equal size applies the operation element-wise. (√)

95. The sort() method can sort the data on any axis (√)

96.NumPy supports more data types than Python. (√)

97. Assuming that there is currently an ndarray array with 3 rows and 3 columns, if you want to get the element in row 3 and column 2, you can use ndarray[3,2]. (×)

Index counts from 0

98. NumPy's random module has more functions than Python's random module. (√)

99. NumPy arrays do not need to be looped over to perform batch arithmetic operations on each element. (√)

100. It is not necessary to specify the type of data when creating an array. (√)

101. The data in each column in DataFrame can be regarded as a Series object. (√)

102. The structure of DataFrame is composed of index and data. (√)

103. Series can save any data type. (√)

104. The read_html() function can read all the data in the web page. (×)

The read_html() method can only read the data in the table tag in the web page

105. Hierarchical indexes can exchange hierarchical order. (√)

106. Index objects in Pandas are modifiable. (×)

Index objects in pandas cannot be modified

107. Pandas can be sorted by index or by data. (√)

108. When operating a DataFrame object, you can obtain data by specifying the index name. (√)

109. Series objects can have a multi-level index structure. (√)

110. Series is an object similar to a one-dimensional array. (√)

111. Both Series and DataFrame support slice operations. (√)

112. Pandas only has two data structures, Series and DataFrame. (×)

Older versions of Pandas have three data structures: Series, DataFrame, and Panel (Panel was deprecated in pandas 0.20 and removed in pandas 0.25)

113. The fillna() method can be filled with Series objects when dealing with missing data, but not with DataFrame objects. (×)

114. The join() method can use left join and right join to connect data. (√)

115. After the DataFrame object with multi-layer index is reshaped by stack(), it returns a Series object. (×)

116. Values beyond the upper and lower bounds in the box plot are called outliers. (√)

117. When using the concat() function to merge data, you can connect in two ways: left join and right join. (×)

118. The drop_duplicated() method can delete duplicate values. (×)

119. Multiple keys can be specified when merging data through the merge() function. (√)

120. The dropna() method can delete all missing values in the data. (√)

121. When using the merge() function for data merging, there is no need to specify a merge key. (×)

122. The rename() method can rename the index name. (√)

123. Missing data is intentional. (×)

124. Machine learning is a very important technology in artificial intelligence, and deep learning is a method in machine learning. (√)

125. Both the bubble chart and the scatter plot can represent the relationship between three-dimensional data. (×)

126. Matplotlib is a 3D graphics library for drawing arrays in Python. (×)

Matplotlib is a 2D graphics library for plotting arrays in Python.

127. Which of the following are supervised learning algorithms? (ACD)

A. Decision tree

B. K-means

C. Bayesian network

D、SVM

128. The raw data used in machine learning may have which of the following problems? (ABCD)

A. Erroneous values

B. Duplicates

C. Outliers

D. Incompleteness

129. Which of the following analyses require machine learning? (AC)

A. Predict the network traffic used by mobile operator users in the future

B. Comparing the usage of roaming services by users of different mobile operators

C. Looking for potential customers for a certain type of package among mobile operator users

D. Count the number of short messages used by users of mobile operators in a certain period of time

130. The following descriptions about PCA and LDA are correct (ACD).

A. PCA and LDA can both reduce the dimensionality of high-dimensional data

B. PCA can retain class information

C. LDA can retain class information

D. PCA generally chooses the direction with large variance for projection

131. Which statement about classification with decision trees is correct? (B)

A. The decision tree cannot identify variables that have a significant impact on the decision attributes

B. Decision trees can be used to discover the characteristics of various samples

C. Decision trees can be used to identify similar samples

D. The more complex the decision tree structure is, the more effective it is

132. Which statement about decision trees is wrong? (C)

A. Can be transformed into a decision rule

B. Play the role of classification prediction for new samples

C. The greater the depth of the decision tree, the better

D. The algorithm of the decision tree is different from the principle of the neural network

133. Which of the following statements about k-means is correct? (B)

A. The importance of sample attributes can be determined

B. Clustering that can handle regularly distributed data

C. Grouping suitable for arbitrary datasets

D. The result of clustering has nothing to do with the initially selected hypothetical cluster center

134. The fundamental difference between supervised and unsupervised learning is (B)

A. Whether the learning process requires human intervention

B. Whether learning samples need manual labeling

C. Whether the learning results need human interpretation

D. Whether the learning parameters need to be manually set

135. Which statement about ensemble learning algorithms is correct? (D)

A. A Parallel Algorithmic Framework

B. A serial algorithm framework

C. A new class of data mining algorithms

D. A type of algorithm that integrates existing algorithms

136. Which of the following descriptions of the silhouette coefficient, a metric for cluster analysis, is inaccurate? (C)

A. The maximum value of the silhouette coefficient is 1

B. The larger the overall silhouette coefficient of a cluster, the better the clustering effect

C. It is impossible for the silhouette coefficient to appear negative

D. Clusters with tight clusters have a larger overall silhouette coefficient than clusters with sparse clusters
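The range of the silhouette coefficient can be explored with a hand-written sketch. The helper silhouette_values is a hypothetical function written for this example (it handles one-dimensional points only and is not a library API), and the points and labels are invented so that one point sits closer to the other cluster than to its own.

```python
import numpy as np

def silhouette_values(points, labels):
    """Per-sample silhouette s = (b - a) / max(a, b) for 1-D points."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.abs(points[:, None] - points[None, :])  # pairwise distances
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False                     # exclude the point itself
        if not same.any():
            continue                        # singleton cluster: score stays 0
        a = dist[i, same].mean()            # mean distance within its own cluster
        b = min(dist[i, labels == other].mean()   # nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# The point 10 is labeled 0 but lies next to cluster 1
scores = silhouette_values([0, 1, 10, 9, 11], [0, 0, 0, 1, 1])
print(scores)
```

For the point at 10, a = 9.5 and b = 1, so its coefficient is (1 - 9.5) / 9.5, which is negative; every coefficient stays within the interval from -1 to 1.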

137. Which of the following descriptions of hierarchical clustering methods is incorrect? (C)

A. According to the process of hierarchical clustering, it is divided into two categories: bottom-up and top-down.

B. If the clustering process is repeated all the time, all samples can finally be classified into one category

C. The bottom-up clustering method is a splitting clustering method

D. No matter which calculation method is used for the inter-class distance, the two clusters with the smallest distance are finally merged

138. The initial center points in the K-Means algorithm (D)

A. Can be set at will

B. Must be near the true center point of each cluster

C. Must be sufficiently dispersed

D. Directly affect the convergence result of the algorithm
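
The dependence on the initial centers can be seen in a small sketch. It is a plain 1-D version of the usual assign-then-update loop, written only for illustration; `kmeans_1d` and the sample points are hypothetical and not taken from any library.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Run the assign/update loop on 1-D points from the given initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [sum(c) / len(c) if c else centers[k]
                       for k, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    # Sum of squared distances from each point to its nearest center.
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse


points = [0, 1, 10, 11, 20, 21]          # three obvious groups
good_centers, good_sse = kmeans_1d(points, [0, 10, 20])
bad_centers, bad_sse = kmeans_1d(points, [0, 1, 10])
```

Starting from one center per group, the loop ends with centers 0.5, 10.5 and 20.5. Starting with two centers inside the first group, it stops at 0, 1 and 15.5, a much worse local optimum with a far larger squared error.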

Which of the following statements about neural networks is correct? (C)

A. Neural networks are insensitive to noise in the training data, so poor data quality does not matter

B. Unable to determine the importance of input attributes

C. Training a neural network is a time-consuming process

D. Can only be used for classification

In a neural network, the goal of fitting the model to the training samples is to determine the weight and bias of each neuron. Which approach is more effective? (C)

A. Random assignment based on human experience

B. Search all combinations of weights and biases until the best value is obtained

C. Give an initial value, and then iteratively update the weight until the cost function is minimized

D. None of the above is correct
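
Option C describes gradient descent. A minimal sketch, assuming a single linear neuron y = w * x + b and a mean-squared-error cost, might look like the following; the function name `fit_line`, the learning rate, and the data are illustrative assumptions.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0                      # initial values
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the cost (1/n) * sum(error^2) with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient to reduce the cost.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b = fit_line(xs, ys)
```

On this data the loop recovers a weight close to 2 and a bias close to 1 without searching all combinations of values.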

The "naive" in the Naive Bayes classifier refers to (D)

A. Can only handle low-dimensional attributes

B. Can only handle discrete attributes

C. The classification effect is general

D. The assumption of conditional independence between attributes

The main difference between a linear SVM and a general linear classifier is (B)

A. Whether spatial mapping is performed

B. Whether to ensure that the interval is maximized

C. Can handle linear inseparable problems

D. Training error is usually lower

Support vectors refer to (B)

A. Sample points obtained by sampling the original data

B. Data points that determine the extent to which the classification surface can be translated

C. Points on the classification surface

D. Data points that can be correctly classified

For image recognition problems (such as identifying cats in photos), (C) neural network models are more suitable for solving such problems.

A. Perceptron

B. Recurrent Neural Networks

C. Convolutional Neural Networks

D. Multilayer Perceptron

The correct statement about the application background of the recommendation algorithm system is (D)

A. Help users find unwanted information

B. Find items that the user likes

C. A method of selling

D. Analyze user interests and predict user needs

Forecasting future housing prices belongs to which kind of problem in data mining? ( D )

A. Classification

B. Clustering

C. Association rules

D. Regression Analysis

The core of OLAP technology is: (D)

A. Online

B. Quick Response to Users

C. Interoperability

D. Multidimensional analysis

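In pandas, a Series is queried by position as follows:

import pandas as pd

data = [1,2,3,4,5]

res = pd.Series(data,index=["a","b","c","d","e"])

print(res.iloc[3])  # position 3; plain res[3] on a string index is deprecated in recent pandas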

The output is: ( A ) 

A. 4

B. 3

C. c

D. d

What kind of problem in data mining does the collaborative filtering algorithm solve? (C)

A. Classification problems

B. Clustering problem

C. Recommendation problems

D. Natural Language Processing Problems

If a model trained on all the features of a data set reaches 100% accuracy on that data but only about 70% on a new data set, this indicates (C)

A. Underfitting

B. Normal conditions

C. Overfitting

D. Model selection error

Assuming that a linear model instance linear_model has been created with the third-party Python library sklearn, the attribute coef_ in linear_model.coef_ holds ( C )

A. sigmoid function

B. Activation function

C. Parameters of the model

D. None of the above

Which of the following descriptions of the k-means clustering algorithm is correct? (C)

A. Can automatically identify the number of classes, and then select the initial point as the center point for calculation

B. Can automatically identify the number of classes, instead of randomly selecting the initial point as the center point calculation

C. The number of classes cannot be automatically identified, and the initial point is selected as the center point for calculation

D. The number of classes cannot be automatically identified, and the initial point is not randomly selected as the center point for calculation

The recommendation system recommends products for customers, automatically completes the process of personalized product selection, and meets the individual needs of customers. The recommendation is based on (D) and predicts the customer's possible purchase behavior in the future.

A. Customer's friend

B. Personal Information of Customers

C. Customer's interests and hobbies

D. Customer's past purchase behavior and purchase records

Algorithms for discovering association rules usually go through the following three steps: connect data and prepare data; given the minimum support and (D), use algorithms provided by data mining tools to discover association rules; visually display, understand, and evaluate association rules.

A. Minimum interest

B. Maximum confidence

C. Maximum support

D. Minimum Confidence
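
How the two thresholds act on a rule can be sketched as follows. The transactions, the threshold values, and the helper names `count`, `support`, and `confidence` are all made up for illustration; real tools use more efficient algorithms such as Apriori.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Fraction of transactions with `antecedent` that also contain `consequent`."""
    return count(antecedent | consequent) / count(antecedent)


MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.7   # assumed thresholds

rule_support = support({"bread", "milk"})            # rule: bread -> milk
rule_confidence = confidence({"bread"}, {"milk"})
rule_is_kept = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
```

Here the rule bread -> milk has support 3/5 and confidence 3/4, so it passes both assumed thresholds.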

Which of the following is not a commonly used natural language processing technique: (D)

A. Lemmatization

B. Part-of-speech tagging

C. Syntax Analysis

D. Cross Validation

Which of the following pandas functions handles missing values? (A)

A. fillna()

B. fit()

C. predict()

D. iloc()

Which algorithm is shown in the figure below ( C )

 

A. K-nearest neighbor algorithm

B. Bayesian

C. Univariate linear regression

D. Polynomial regression

Common classification algorithms do not include (A)

A. Linear regression

B. Logistic regression

C. Bayesian algorithm

D. K-nearest neighbor algorithm

The tasks that linear regression can complete are ( B )

A. Predicting Discrete Values

B. Predict continuous values

C. Classification

D. Clustering

When analyzing the customer's consumption industry in order to recommend services of interest to them in a targeted manner, what kind of problem is it? ( C )

A. Classification

B. Clustering

C. Association rules

D. Principal Component Analysis

Which of the following statements is true about under-fitting? ( C )

A. The training error is large and the test error is small

B. The training error is small and the test error is large

C. The training error is large and the test error is large

D. The training error remains the same, but the test error is larger

The proximity of two clusters is defined as the shortest distance between any two points in different clusters. Which agglomerative hierarchical clustering technique is it? ( A )

A. MIN (single link)

B. MAX (complete link)

C. Group average

D. Ward's method
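
The first three proximity definitions differ only in how they summarize the pairwise distances between two clusters. A small sketch with two made-up clusters (the variable names are illustrative):

```python
from itertools import product
from math import dist

# Two hypothetical clusters of 2-D points.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

# Euclidean distance for every pair with one point from each cluster.
pairwise = [dist(p, q) for p, q in product(cluster_a, cluster_b)]

single_link = min(pairwise)                    # MIN: closest pair
complete_link = max(pairwise)                  # MAX: farthest pair
group_average = sum(pairwise) / len(pairwise)  # mean over all pairs
```

For these points the pairwise distances are 4, 6, 3 and 5, so MIN gives 3, MAX gives 6, and the group average is 4.5.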

In the following different scenarios, the analysis method used is incorrect ( B )

A. According to the business and service data of the business in the last year, use the clustering algorithm to determine the business level of the Tmall business under their respective main categories 

B. According to the transaction data of the merchant in recent years, use the clustering algorithm to fit the formula of the possible consumption amount of the user in the next month

C. Use the association rule algorithm to analyze whether the buyer who bought the car seat is suitable for recommending the car mat  

D. According to the product information recently purchased by the user, use the decision tree algorithm to identify whether the Taobao buyer may be male or female

Suppose X={1, 2, 3} is a frequent itemset, then (C) association rules can be generated from X.

A. 4

B. 5

C. 6

D. 7
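
The count follows from choosing any non-empty proper subset of X as the antecedent and the remaining items as the consequent, which gives 2^3 - 2 = 6 candidates. A short enumeration, with `candidate_rules` as a hypothetical helper name:

```python
from itertools import combinations


def candidate_rules(itemset):
    """List every rule antecedent -> consequent that splits `itemset` in two."""
    items = sorted(itemset)
    rules = []
    # Antecedent sizes 1 .. len-1, so neither side of the rule is empty.
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


rules = candidate_rules({1, 2, 3})
for antecedent, consequent in rules:
    print(antecedent, "->", consequent)
```

These are candidate rules only; which of them are finally reported still depends on the minimum confidence threshold.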

If cross-validation is set to K=5, how many times will it be trained? (C)

A. 1

B. 3

C. 5

D. 6
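
The answer follows from how k-fold cross-validation partitions the data: each fold serves once as the test set while the model is trained on the rest. The sketch below only generates the index splits; `k_fold_indices` is a hypothetical helper written for this example, not a library function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


splits = list(k_fold_indices(n_samples=10, k=5))
n_training_runs = len(splits)   # one model is trained per split
```

With 10 samples and K=5 there are five splits, each holding out two samples, so the model is trained five times.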

We want to train a decision tree on a large data set, in order to use less time, we can ( C )

A. Increase the depth of the tree

B. Increase the learning rate

C. Reduce the depth of the tree

D. Reduce the number of trees

Which of the following is not a main factor affecting the results of a clustering algorithm? (A)

A. Sample Quality of Known Classes

B. Classification criteria

C. Feature Selection

D. Pattern Similarity Measures

Common methods of image data analysis do not include (D)

A. Image transformation

B. Image Coding and Compression

C. Image Enhancement and Restoration

D. Image Data Acquisition

In general, the k-nearest neighbor (KNN) method works better in the case of (B)

A. There are many samples but the typicality is not good

B. Small sample but good typicality

C. The sample is distributed in clusters

D. The samples are distributed in a chain

The function realized by the following code is: (C)

>>> from sklearn.datasets import load_iris

>>> from sklearn.naive_bayes import GaussianNB

>>> iris = load_iris()

>>> gnb = GaussianNB()

>>> y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)

A. Create a Gaussian Naive Bayesian model and train the model

B. Create a Gaussian Naive Bayesian model and make predictions on the model

C. Create a Gaussian Naive Bayesian model and train and predict the model

D. Create a Gaussian Naive Bayesian model and train and evaluate the model

Which of the following nodes is not included in a decision tree? (C)

A. The root node

B. Internal nodes

C. External nodes

D. Leaf nodes

The technique of improving classification accuracy by aggregating the predictions of multiple classifiers is called (A)  

A. Ensemble

B. Aggregation

C. Combination

D. Voting

Which of the following statements are correct? (C)

1 If a machine learning model can get a high accuracy rate, it means that it is a good classifier.

2 If you increase the complexity of a model, the test error will always increase.

3 If you increase the complexity of a model, the training error will always decrease.

A. 1

B. 2

C. 3

D. 1 and 3

Which of the following scenarios belongs to machine learning? (D)

A. Have machines detect seismic activity

B. The computer runs the bionic program

C. Using a computer as a calculator

D. By recognizing photos of watermelons at different stages, the machine can identify ripe watermelons

Comparing machine learning programs with traditional computer programs, which of the following statements is incorrect: (C)

A. Both are computer programs

B. The output is different

C. The output is the same

D. Summary of experience Traditional procedures are more effective in dealing with problems

Which of the following statements can load the iris dataset of the scikit-learn module: (B)

A. iris = datasets.read_iris()

B. iris = datasets.load_iris()

C. iris = datasets.iris()

D. iris = datasets.load.iris()

The function implemented by the following code snippet is: (B)

>>> from sklearn.model_selection import train_test_split

>>> X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

A. Load data

B. Split data

C. Group data

D. Delete some data

The API used for the machine learning training process is: (A)

A. fit()

B. predict()

C. learn()

D. train()

  

The evaluation indicators belonging to the classification model are: (B)

A. MSE

B. AUC

C. IT IS

D. RMSE
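
AUC is a classification metric because it measures how well predicted scores rank positive samples above negative ones, whereas MSE and RMSE measure the size of numeric prediction errors. One way to sketch it is the pairwise definition below; the function name `auc` and the sample scores are assumptions for this example, and libraries normally compute the same quantity from the ROC curve.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc(y_true, scores)
```

Three of the four positive/negative pairs are ordered correctly here, so the AUC is 0.75.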


Origin blog.csdn.net/qq_53865517/article/details/128260121