Niuke.com special practice: Python analysis libraries (4)

1. There is DataFrame data df, as shown below. You need to delete the rows whose index labels are 'b' and 'd'. The incorrect option is (B).

index  X  Y    Z
a      0  5    10
b      1  NaN  11
c      2  7    12
d      3  8    NaN
e      4  9    14

A.df.drop(['b','d'])

B.df.drop(df.index[[2,4]])

C.df.drop(['b','d'],axis=0)

D.df.dropna()

Parse:

       The drop() function is used in data cleaning to delete rows or columns from a DataFrame; the dropna() function deletes rows or columns that contain missing (NaN) values.

       Option A deletes rows by index label, so A is correct;

       Option B deletes rows by positional index, but row positions in a DataFrame are counted from 0, so the correct form would be df.drop(df.index[[1,3]]). Therefore option B is wrong;

       Option C means the same as option A: the axis parameter of drop() defaults to 0 (rows), so C is correct;

       Option D deletes rows containing missing values, which here are exactly rows 'b' and 'd', so D is correct.
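The reasoning above can be checked directly. A minimal sketch (the DataFrame is reconstructed from the table in the question):

```python
import numpy as np
import pandas as pd

# Reconstruct the DataFrame from the question
df = pd.DataFrame(
    {"X": [0, 1, 2, 3, 4],
     "Y": [5, np.nan, 7, 8, 9],
     "Z": [10, 11, 12, np.nan, 14]},
    index=list("abcde"),
)

# A and C: drop by index label (axis=0 is the default)
print(df.drop(["b", "d"]).index.tolist())          # ['a', 'c', 'e']
print(df.drop(["b", "d"], axis=0).index.tolist())  # ['a', 'c', 'e']

# D: drop rows containing NaN -- here exactly rows 'b' and 'd'
print(df.dropna().index.tolist())                  # ['a', 'c', 'e']

# B as written: positions 2 and 4 are 'c' and 'e', not 'b' and 'd'
print(df.drop(df.index[[2, 4]]).index.tolist())    # ['a', 'b', 'd']
# Corrected B: positions 1 and 3 are 'b' and 'd'
print(df.drop(df.index[[1, 3]]).index.tolist())    # ['a', 'c', 'e']
```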


2. In correlation analysis, the two variables involved are (A).

A. The dependent variable is a random variable, and the independent variable is also a random variable

B. The dependent variable is a random variable, and the independent variable is a controlled variable

C. The dependent variable is a controlled variable, and the independent variable is a random variable

D. The dependent variable is a controlled variable, and the independent variable is also a controlled variable

Parse:

       When conducting correlation analysis, it is not necessary to determine in advance which of the two variables is the independent variable and which is the dependent variable. Both variables in correlation analysis are random variables.
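This symmetry is visible in practice: the correlation coefficient does not change when the two variables swap roles. A small sketch using np.corrcoef (the synthetic data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)          # one random variable
y = 2 * x + rng.normal(size=100)  # another random variable, correlated with x

# Correlation is symmetric: neither variable needs to be designated
# as "independent" or "dependent" in advance.
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]
print(r_xy, r_yx)  # the two values are identical
```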


3. There is numpy.ndarray data a=np.array([1,2,3,4]) and a weight for each value w=np.array([4,3,2,1]). Which of the following methods finds the weighted average? (C)

A.np.mean()

B.np.nanmean()

C.np.average()

D.np.std()

Parse:

       A. mean() computes the arithmetic mean;

       B. nanmean() computes the mean of an array while ignoring NaN values, so a mean unaffected by NaN can be obtained even when the array contains them;

       C. mean() and average() both compute the average. Without weights the two give the same output, but average() also accepts a weights argument: np.average(a, weights=w) computes the weighted average;

       D. std() computes the standard deviation of a matrix or array;

       So the correct answer is C.
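The difference can be worked through with the arrays from the question. A minimal sketch:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
w = np.array([4, 3, 2, 1])

# Unweighted mean: (1 + 2 + 3 + 4) / 4 = 2.5
print(np.mean(a))               # 2.5
print(np.average(a))            # 2.5 -- identical to mean() without weights

# Weighted average: (1*4 + 2*3 + 3*2 + 4*1) / (4+3+2+1) = 20 / 10 = 2.0
print(np.average(a, weights=w)) # 2.0
```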


4. When performing data preprocessing, the deduplication function drop_duplicates in the pandas module is used as follows. The incorrect statement is (B).

df.drop_duplicates(subset=['A','B','C'],keep= ,inplace= )

A. The parameter subset specifies the columns to check for duplicates

B.keep specifies which duplicate row to keep; there are two optional values, first and last

C.inplace indicates whether to operate on the original data or return a copy

D. After deduplication the row labels remain unchanged; if they need to be renumbered, df.reset_index() can be used to reset the index

Parse:

       subset: the column names to deduplicate on; item A is correct;

       keep: has three optional values: first, last, and False. The default is first, which keeps only the first occurrence of each duplicate and deletes the rest; last keeps only the last occurrence; False removes all duplicates. Item B is wrong;

       inplace: a Boolean parameter, defaulting to False, which returns a copy with duplicates removed; if True, duplicates are deleted directly in the original data. Item C is correct;

       The reset index function is reset_index(), so D is correct;

       So the correct answer is B.
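The three keep values can be compared side by side. A minimal sketch on hypothetical data (the DataFrame below is an assumption, chosen so that rows 0 and 2 are duplicates on columns A, B, C):

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are duplicates on columns A, B, C
df = pd.DataFrame({
    "A": [1, 1, 1, 2],
    "B": ["x", "x", "x", "y"],
    "C": [10, 20, 10, 30],
})

cols = ["A", "B", "C"]
# keep='first' (default): keep the first row of each duplicate group
print(df.drop_duplicates(subset=cols, keep="first").index.tolist())  # [0, 1, 3]
# keep='last': keep the last occurrence instead
print(df.drop_duplicates(subset=cols, keep="last").index.tolist())   # [1, 2, 3]
# keep=False: drop every row that has a duplicate
print(df.drop_duplicates(subset=cols, keep=False).index.tolist())    # [1, 3]

# Row labels are preserved after deduplication; reset them explicitly
deduped = df.drop_duplicates(subset=cols).reset_index(drop=True)
print(deduped.index.tolist())  # [0, 1, 2]
```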


5. The correct statement about the train_test_split function in sklearn is (D).

A. The ratio of data set division is fixed

B. Randomly selects the training set, validation set and test set from the sample in proportion

C. The train_test_split function randomly selects samples, so there is no guarantee that the data will be the same every time

D. Set stratify parameters to deal with data imbalance

Parse:

       Item A: the split ratio is not fixed; it can be specified via the test_size sample-proportion parameter;

       Item B: train_test_split can only split the data into two sets, a training set and a test set;

       Item C: although train_test_split randomly selects training and test samples in proportion, the split can be made reproducible by setting the random_state seed parameter;

       Item D: the stratify parameter handles class imbalance. Given the label array y, samples are allocated to train and test according to the class proportions in y, so that the class proportions in both sets match those of the original data set;

       Therefore, the correct answer is D.
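The points about test_size, random_state and stratify can be demonstrated together. A minimal sketch on deliberately imbalanced synthetic labels (the 80/20 split below is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # item A: the split ratio is configurable, not fixed
    random_state=42,  # item C: the seed makes the split reproducible
    stratify=y,       # item D: preserve class proportions in both sets
)

# Each set keeps the original 80/20 class ratio
print(np.bincount(y_train))  # [60 15]
print(np.bincount(y_test))   # [20  5]
```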


Origin blog.csdn.net/u013157570/article/details/129100312