Chapter 5 Pandas data loading and preprocessing

1: Multiple-choice questions

1: Which of the following plots can be used to find outliers in the data?

A. Density plot
B. Histogram
C. Box plot
D. Probability graph

Knowledge point analysis:
Density plot: a smoothed curve that estimates the distribution of a numerical variable.
Histogram: a graphical representation of the distribution of numerical data, built by counting the values that fall into each bin.
Box plot: a statistical chart that summarizes the spread of a data set through its quartiles; points lying beyond the whiskers are drawn individually, which makes outliers easy to spot.
Probability graph: a model that uses a graph to represent the probabilistic dependencies between variables.
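The rule a box plot uses to mark outliers can be sketched in a few lines. This is a minimal illustration, assuming the common convention of whiskers at 1.5 times the interquartile range; the helper name `iqr_outliers` and the sample data are made up for the example.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the box-plot whiskers."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1                        # interquartile range (height of the box)
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s[(s < lower) | (s > upper)]  # points a box plot would draw separately

data = [10, 12, 11, 13, 12, 11, 10, 95]  # hypothetical sample with one extreme value
outliers = iqr_outliers(data)
print(outliers.tolist())
```

Calling `pd.Series(data).plot.box()` would draw the same fences graphically, provided matplotlib is installed.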


2: Among the following statements about missing value detection, the correct ones are
A. isnull and notnull can handle missing values.
B. The dropna method can delete observation records (rows) and features (columns).
C. The fillna method is used to replace missing values; the replacement value can only be a DataFrame.
D. The interpolate module in the Pandas library contains a variety of interpolation methods.
Knowledge point analysis:
Three methods for detecting missing values: isnull(), notnull(), isna()
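As a small sketch of the detection methods named above, the following builds a hypothetical DataFrame (the column names and values are invented) and applies isnull(), notnull() and isna(). It also shows fillna taking a scalar and a dict, which are two of the value types it accepts besides a DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Paris", "Rome", None, "Oslo"],
})

missing_mask = df.isnull()               # True where a value is missing
present_mask = df.notnull()              # element-wise opposite of isnull()
missing_per_column = df.isnull().sum()   # number of missing values in each column

filled_scalar = df["age"].fillna(0)                      # scalar replacement
filled_dict = df.fillna({"age": 0, "city": "unknown"})   # per-column replacement

print(missing_per_column)
```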


3: In real-world data, missing values are common, and the usual ways of handling them are

A. Ignore
B. Delete
C. Average value filling
D. Maximum value filling
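Two of the options listed, deleting and filling with the average, might look like this in Pandas. The `scores` table is hypothetical and only serves to make the result easy to check by hand.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "math": [90.0, np.nan, 70.0],
    "art": [80.0, 60.0, np.nan],
})

dropped = scores.dropna()                    # delete rows that contain any missing value
mean_filled = scores.fillna(scores.mean())   # fill each column with its own mean

print(dropped)
print(mean_filled)
```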


2: True or false questions

1: When using the merge function to merge data tables in Pandas, the default is an inner join.   Correct
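A quick way to confirm the default is to compare a merge without `how` against an explicit inner join. The two small tables here are invented for the illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

default_merge = pd.merge(left, right, on="key")              # how is not given
inner_merge = pd.merge(left, right, on="key", how="inner")   # explicit inner join
outer_merge = pd.merge(left, right, on="key", how="outer")   # keeps unmatched keys too

print(default_merge)
```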


2: Descriptive statistics in Pandas generally include missing data.   Incorrect
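The following sketch, using a made-up Series, shows that describe() leaves NaN out of the count and the mean.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 3.0])
summary = s.describe()   # statistics are computed over the non-missing values only

print(summary["count"], summary["mean"])
```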


3: The statement dataframe.dropna(thresh=len(df)*0.9,axis=1) means that if more than 90% of the values in a column are missing, the column is deleted.   Incorrect
Knowledge point analysis:
Format: DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)
Purpose: remove missing values.
thresh: int, optional. The minimum number of non-NA values a row or column must have to be kept.
axis: 0 or 'index', 1 or 'columns', default 0. Determines whether rows or columns containing missing values are removed.
0 or 'index': remove rows containing missing values. 1 or 'columns': remove columns containing missing values.
The statement should read: if fewer than 90% of the values in a column are non-missing, the column is deleted.
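To make the meaning of thresh with axis=1 concrete, here is a sketch on a hypothetical ten-row DataFrame. The threshold is converted with int() because thresh is documented as an integer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "full": range(10),                                   # 10 non-NA values
    "one_missing": [np.nan] + list(range(9)),            # 9 non-NA values
    "two_missing": [np.nan, np.nan] + list(range(8)),    # 8 non-NA values
})

min_non_na = int(len(df) * 0.9)                  # a column needs at least 9 non-NA values
kept = df.dropna(thresh=min_non_na, axis=1)      # columns below the threshold are dropped

print(list(kept.columns))
```

Only the column with fewer than 9 non-missing values is removed, which matches the corrected reading of the statement.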


4: When using the merge method to merge data, there is no join key between the DataFrames being merged.   Incorrect
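The need for a join key can be seen by trying to merge two tables that share no column. This is a sketch with invented tables; it relies on pandas raising `pandas.errors.MergeError` when no common column exists and no key is named.

```python
import pandas as pd
from pandas.errors import MergeError

orders = pd.DataFrame({"order_id": [1, 2]})
customers = pd.DataFrame({"customer_id": [10, 20]})

try:
    pd.merge(orders, customers)   # no shared column and no on=/left_on=/right_on=
    merge_failed = False
except MergeError:
    merge_failed = True

# Once both tables carry the same key column, the merge has something to join on.
orders["customer_id"] = [10, 20]
joined = pd.merge(orders, customers, on="customer_id")

print(merge_failed, len(joined))
```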


5: Dummy variables, also known as indicator variables, are artificial variables used to reflect qualitative attributes.   Correct
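In Pandas, dummy variables are usually produced with get_dummies. A minimal sketch on a hypothetical `color` column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# One indicator column per category; exactly one of them is set in each row.
dummies = pd.get_dummies(df["color"], prefix="color")

print(dummies)
```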


6: isnull().sum() in Pandas can be used to count missing values.   Correct


7: When thresh=N is passed to dropna in Pandas, a row is retained only if it contains N NaN values.   Incorrect


8: The duplicates method of DataFrame can be used to delete duplicate data.   Incorrect


9: Network association is a common relationship in big data.   Correct
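On the question about duplicate data: the DataFrame methods that exist for this are duplicated(), which flags repeated rows, and drop_duplicates(), which removes them. A short sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "name": ["Ann", "Ann", "Bob"]})

is_repeat = df.duplicated()      # True for rows that repeat an earlier row
deduped = df.drop_duplicates()   # keeps the first occurrence of each row

print(is_repeat.tolist())
print(deduped)
```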

3: Fill in the blanks

1: When the parameter how of the dropna method in Pandas takes the value ___, a row is discarded as soon as it contains any missing value.


2: When the parameter how of the dropna method in Pandas takes the value ___, a row is discarded only when all of its values are missing.
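The two behaviours described in these questions correspond to the two values that `how` accepts in DataFrame.dropna. A sketch on a hypothetical three-row table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
})

drop_any = df.dropna(how="any")   # drops a row if at least one value is missing
drop_all = df.dropna(how="all")   # drops a row only if every value is missing

print(drop_any.index.tolist(), drop_all.index.tolist())
```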


3: Pandas reads ___ data through the read_json function.   JSON
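A minimal read_json sketch follows. The JSON text is invented and wrapped in StringIO so the example runs without a file on disk; a file path could be passed instead.

```python
from io import StringIO

import pandas as pd

json_text = '[{"name": "Ann", "age": 30}, {"name": "Bob", "age": 25}]'

# Each JSON object becomes one row; the keys become the column names.
df = pd.read_json(StringIO(json_text))

print(df)
```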


4: To read data from MySQL with Pandas, you must first install the ___ package and then read the data.   MySQLdb


5: To read data from SQL Server with Pandas, you must first install the ___ package and then read the data.   pymssql
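Reading from a database generally follows the same pattern whichever driver is installed: open a connection, then hand a query and the connection to pandas.read_sql. The sketch below uses Python's built-in sqlite3 as a stand-in, since it needs no server; this substitution is an assumption made for the example, and a MySQL or SQL Server connection would take its place in practice. The `users` table is hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database used as a stand-in for a MySQL / SQL Server connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.commit()

# read_sql runs the query and returns the result set as a DataFrame.
users = pd.read_sql("SELECT id, name FROM users ORDER BY id", conn)
conn.close()

print(users)
```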


4: Short answer questions

1: Briefly describe the use of the parameter thresh in the Pandas deletion method dropna.
  When thresh=N is passed to dropna, a row must contain at least N non-NaN values to be kept.
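The row-wise behaviour of thresh can be checked on a small made-up table in which the rows hold three, two and zero non-NaN values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
})

kept = df.dropna(thresh=2)   # keep rows with at least 2 non-NaN values

print(kept.index.tolist())
```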


2: Briefly describe the common methods and principles of detecting outliers using statistical methods in Python.
  Methods: a. scatter plot inspection; b. box plot analysis; c. the 3σ rule.
  Principle of the 3σ rule: the normal distribution curve is bell-shaped; the expected value μ determines its position and the standard deviation σ determines its spread. The normal distribution with μ = 0 and σ = 1 is the standard normal distribution. For data that follows a normal distribution, outliers can therefore be detected with the empirical rule: about 68.3% of values fall within one standard deviation σ of μ, about 95.4% within two, and about 99.7% within three. A value more than three standard deviations σ away from μ can thus be judged to be abnormal.
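The 3σ rule translates into a short filter. This is a sketch on synthetic data: the sample is drawn from a normal distribution with an arbitrary seed, and one extreme value is planted by hand so there is something to find.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                       # fixed seed for repeatability
values = pd.Series(rng.normal(loc=50, scale=5, size=1000))
values.iloc[0] = 120.0                               # planted abnormal value

mu, sigma = values.mean(), values.std()
# Flag every value that lies more than three standard deviations from the mean.
outliers = values[(values - mu).abs() > 3 * sigma]

print(outliers)
```

The rule assumes roughly normal data and a reasonably large sample; on very small samples a single extreme value inflates σ enough that nothing exceeds 3σ.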


3: Briefly describe the main reasons for data standardization in data analysis.
  Different features often have different units (dimensions), which leads to large differences in their numeric values. To eliminate the effect of these differences in units and value ranges between features, the data needs to be standardized.
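Two common ways to standardize, z-score scaling and min-max scaling, can be written directly with DataFrame arithmetic. The source does not name a specific method, so both are shown here as assumptions; the columns and values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "income": [20000.0, 40000.0, 60000.0, 80000.0],
})

# Z-score: each column ends up with mean 0 and standard deviation 1.
z_scores = (df - df.mean()) / df.std()

# Min-max: each column is rescaled to the range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

print(z_scores)
print(min_max)
```

After either transformation the two features are on comparable scales, even though the raw values differ by orders of magnitude.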


4: Briefly describe the use of the cut method for data discretization in Pandas.
  Equal-width discretization divides the value range of the data into intervals of the same width; the number of intervals is determined by the characteristics of the data or specified by the user. Pandas provides the cut function, which performs equal-width discretization of continuous data. The basic syntax of the cut function is:
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3)
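A short sketch of cut on a hypothetical `ages` Series: passing an integer asks for that many equal-width intervals, while passing a list gives the interval edges explicitly. The label names are made up.

```python
import pandas as pd

ages = pd.Series([1, 20, 35, 50, 64, 80])

# bins=4: the range of the data is split into 4 intervals of equal width.
equal_width = pd.cut(ages, bins=4, labels=["low", "mid_low", "mid_high", "high"])

# Explicit edges; right=True (the default) makes each interval closed on the right.
by_edges = pd.cut(ages, bins=[0, 25, 50, 75, 100])

print(equal_width.tolist())
print(by_edges.tolist())
```

With the default right=True, the value 50 falls into the interval (25, 50] rather than (50, 75].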


Origin blog.csdn.net/qq_52331221/article/details/128178231