Task: Exploratory Data Analysis (EDA)


Summary of "Datawhale Zero-Based Introduction to Data Analysis and Data Mining - Task2" (AI snail car)

Data Overview

1) Clarify the task: the goal is to predict used-car transaction prices, so this is mainly a regression task. The data contains 31 features in total, both continuous and discrete. Modeling these 31 features yields the transaction price and shows which features may be related to it.
2) Summary statistics: use describe() to get overall statistics and a feel for the general range of each column. The max and min values give a first indication of outliers.
3) Data types: use info() to view the type of every column at once.
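
The two calls above can be tried on a small table. This is a minimal sketch, assuming pandas is installed; the toy DataFrame and its column values are made up for illustration and are not the competition data.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for the used-car table.
df = pd.DataFrame({
    "price": [3500, 12000, 800, 99999, 5600],
    "power": [75, 150, 60, 0, 110],
    "notRepairedDamage": ["0.0", "-", "1.0", "0.0", "-"],
})

# describe(): count, mean, std, min, quartiles and max of the numeric columns.
summary = df.describe()
print(summary.loc[["min", "max"]])

# info(): column names, dtypes and non-null counts.
# It writes text instead of returning a value, so capture it in a buffer.
buf = io.StringIO()
df.info(buf=buf)
info_text = buf.getvalue()
print(info_text)
```

In this toy table the max of price (99999) and the min of power (0) are the kind of values that describe() makes easy to spot as possible outliers.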

Missing values / outlier handling

1) Missing values:
(1) Use isnull().sum() for a first look at the NaN counts per column.
(2) Visualize the missing values for a more intuitive view, for example with a bar chart or a missing-value matrix.
(3) Fill in or delete the missing values. If a feature is missing too much data it can be deleted directly; if the amount is acceptable, it can be filled in. If the model can handle NaN directly, the missing values can be left unprocessed.
2) Outlier handling:
(1) Use info() to check the data types, paying particular attention to non-numeric (object) features.
(2) Use value_counts() to get the value statistics of the non-numeric features.
(3) If placeholder or empty values turn up, they can likewise be deleted or filled in.
3) Other:
(1) Watch for severely skewed features. For example, if a discrete 0/1 feature has 99% of its samples in one class, pay special attention to the remaining 1%. If the skew is unrelated to the problem being studied (for instance, every target category shows the same 0/1 imbalance), the feature can generally be deleted.
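
The checks in this section can be sketched as follows. The toy data is hypothetical: "bodyType" is an illustrative column name, and treating "-" as a placeholder for a missing value is an assumption, not something confirmed by the data description.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with real NaNs, a "-" placeholder and a skewed column.
df = pd.DataFrame({
    "bodyType": [1.0, np.nan, 3.0, 2.0, np.nan, 1.0],
    "notRepairedDamage": ["0.0", "-", "1.0", "0.0", "-", "0.0"],
    "seller": [0, 0, 0, 0, 0, 1],
})

# Missing values per column, keeping only the columns that have any.
missing = df.isnull().sum()
missing = missing[missing > 0]
print(missing)

# value_counts() on a non-numeric column exposes placeholders such as "-".
print(df["notRepairedDamage"].value_counts())

# Assumption: "-" means "unknown", so turn it into a real missing value.
df["notRepairedDamage"] = df["notRepairedDamage"].replace("-", np.nan)

# Share of the most frequent value; a share close to 1.0 flags a
# heavily skewed feature that deserves a closer look.
top_share = df["seller"].value_counts(normalize=True).iloc[0]
print(round(top_share, 3))
```

Note that isnull().sum() does not count the "-" entries until they are replaced, which is why the value_counts() pass over object columns is a separate step.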

Data distribution profile

1) Distribution of the target (the value to predict):
(1) Look at the overall distribution of the target before fitting a model. When the target is not normally distributed, it needs to be transformed before modeling.
(2) Check skewness and kurtosis. Skewness describes the asymmetry of the distribution: greater than 0 means it is skewed to the right, less than 0 means it is skewed to the left. Kurtosis describes how sharp the peak is: an excess kurtosis greater than 0 means a sharper peak than the normal distribution, less than 0 a flatter one.
(3) View the frequency of the target values.
(4) Choose a transformation that brings the target closer to a normal distribution, for example a log transform.
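
A minimal sketch of steps (2) to (4), assuming pandas and NumPy. The prices are synthetic (drawn from a log-normal distribution), and the log transform is one common choice, not necessarily the one used in the original tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed "prices": synthetic log-normal samples.
rng = np.random.default_rng(0)
price = pd.Series(np.exp(rng.normal(loc=8.0, scale=1.0, size=5000)), name="price")

# skew() > 0 means a long right tail; kurt() is excess kurtosis,
# so a normal distribution would give a value near 0.
raw_skew, raw_kurt = price.skew(), price.kurt()

# Frequency of the target in 10 equal-width bins.
freq = pd.cut(price, bins=10).value_counts().sort_index()

# log1p compresses the right tail and pulls the target toward normality.
log_price = np.log1p(price)
log_skew, log_kurt = log_price.skew(), log_price.kurt()

print(f"raw: skew={raw_skew:.2f}, kurt={raw_kurt:.2f}")
print(f"log: skew={log_skew:.2f}, kurt={log_kurt:.2f}")
```

If the model is trained on log1p(price), predictions have to be mapped back with np.expm1 before they are compared with real prices.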

2) Distribution of the features:
(1) It can be explored with value_counts().

3) Relationship between the features and the target:
(1) Continuous features: correlation analysis, skewness and kurtosis, visualization of each feature's distribution, and visualization of the relationships among features and between each feature and the target.
(2) Discrete features: box plots, violin plots, and frequency plots of the categories.
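
The numbers behind these plots can be computed directly. This is a sketch on synthetic data, assuming pandas and NumPy; "power" and "gearbox" are illustrative column names and the price formula is invented so that both features have a visible effect.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one continuous feature, one discrete 0/1 feature.
rng = np.random.default_rng(1)
n = 400
power = rng.uniform(50, 300, size=n)
gearbox = rng.integers(0, 2, size=n)
price = 40 * power + 3000 * gearbox + rng.normal(0, 500, size=n)
df = pd.DataFrame({"power": power, "gearbox": gearbox, "price": price})

# Continuous feature: Pearson correlation with the target.
corr_with_price = df[["power", "price"]].corr()["price"].sort_values(ascending=False)
print(corr_with_price)

# Discrete feature: per-category quartiles of the target,
# i.e. the values a box plot would draw for each category.
by_gearbox = df.groupby("gearbox")["price"].describe()[["count", "25%", "50%", "75%"]]
print(by_gearbox)
```

A plotting library can then turn the same grouping into box or violin plots; the table form is enough to see whether the categories differ.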

Data Reporting

1) A data report can be generated with pandas_profiling (the package has since been renamed ydata-profiling).

Personal understanding and summary

1) Data analysis at this stage is mainly about getting a grasp of the data as a whole, including the sample size, the meaning of each feature, the feature types, and their distributions.
2) The data analysis I usually deal with is mainly about understanding, analyzing, and processing things like sample size and outliers, which is also called the data preprocessing stage. So when I first read this section I did not quite follow it, but the different directions and ideas have enriched my understanding.

Practice: Tianchi used-car price prediction example

1) Reading the data: pd.read_csv(path, sep=' '), where the columns are split according to the separator given in sep.
2) Following the reference, the "notRepairedDamage", "seller", and "offerType" columns were handled in turn (value replacement and related processing).
3) Modeling was done directly with XGBoost, using 5-fold cross-validation, and the XGBoost output was uploaded.
4) This practice run was only meant to complete a model end to end. I did not give much thought to data preprocessing, feature extraction, or modeling, and did not pursue accuracy.
5) In data analysis, domain knowledge of the business problem is also important, and it strongly influences feature engineering and other steps.
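
Steps 1) and 2) might look like the sketch below. The inline sample, the "SaleID" column, the replacement of "-" with NaN, and the rule for dropping "seller" and "offerType" are all assumptions made for illustration; the post does not spell out the exact operations.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical space-separated sample; a real run would pass a file path.
raw = io.StringIO(
    "SaleID notRepairedDamage seller offerType price\n"
    "0 0.0 0 0 1850\n"
    "1 - 0 0 3600\n"
    "2 1.0 0 0 6222\n"
)
df = pd.read_csv(raw, sep=" ")

# Assumption: "-" is a placeholder for "unknown". Replace it with NaN,
# then convert the column from strings to floats.
df["notRepairedDamage"] = pd.to_numeric(
    df["notRepairedDamage"].replace("-", np.nan)
)

# Assumption: columns that hold a single value carry no information,
# so drop "seller" and "offerType" when they are constant.
constant_cols = [c for c in ["seller", "offerType"] if df[c].nunique() <= 1]
df = df.drop(columns=constant_cols)

print(df)
```

On the full data set a column may be almost constant rather than strictly constant, in which case the threshold in the drop rule would need to be a share of the most frequent value rather than nunique().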

Origin blog.csdn.net/lybch1/article/details/105023409