Case Study - Titanic Data

These notes record the pandas APIs that I had not fully grasped before working through this case study.

 

1. Statistical description of the data

The usual call is df.describe()

Numeric variables and object (categorical) variables can be summarized separately

  • Numeric variables
# Assumes pandas, numpy and seaborn are imported as pd, np and sns, and that titanic_df holds the Titanic passenger data
# Use describe() to see the distribution of a subset of the variables
# Because Survived is a 0-1 variable, its mean is the fraction of passengers who survived, which makes this usage handy
titanic_df[["Survived", "Age", "SibSp", "Parch"]].describe()
  • Categorical variables
# Use include=[object] to see the categorical variables
# (the original notes used np.object, an alias of the builtin object that has been removed from recent NumPy releases)
# count: the number of non-missing values
# unique: the number of distinct values
# top: the most frequent value
# freq: the number of occurrences of the most frequent value
titanic_df.describe(include=[object])
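
To make the two kinds of summary concrete, here is a minimal sketch on a four-row toy table. The rows are made up for illustration and are not the real Titanic data; `toy_df`, `numeric_summary` and `categorical_summary` are hypothetical names.

```python
import pandas as pd

# Toy data, made up for illustration (not the real Titanic file).
# The text columns are given object dtype explicitly so that
# include=[object] selects them regardless of the pandas version.
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": pd.Series(["male", "female", "female", "male"], dtype=object),
    "Embarked": pd.Series(["S", "C", "S", "S"], dtype=object),
})

# Numeric summary: the mean of a 0-1 column is the share of ones.
numeric_summary = toy_df[["Survived", "Age"]].describe()
survival_rate = numeric_summary.loc["mean", "Survived"]

# Categorical summary: rows are count, unique, top and freq.
categorical_summary = toy_df.describe(include=[object])

print(numeric_summary)
print(categorical_summary)
```

Here two of the four toy passengers survived, so the mean of Survived reads directly as a 50% survival rate.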

 

2. Filling missing values (the Age column)

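# Calculate the median of all ages
age_median1 = titanic_df.Age.median()

# Use fillna to fill in the missing values; inplace=True means the change is made directly on the original data titanic_df
# (on recent pandas versions an in-place fill on a selected column may not propagate; titanic_df["Age"] = titanic_df["Age"].fillna(age_median1) is the safer form)
titanic_df.Age.fillna(age_median1, inplace=True)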
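
A small sketch of the same median fill on a made-up column, using assignment instead of `inplace=True`. The values and the names `toy_df` and `age_median` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries (made-up values).
toy_df = pd.DataFrame({"Age": [22.0, np.nan, 30.0, np.nan, 40.0]})

# median() skips NaN, so it is computed from 22, 30 and 40 only.
age_median = toy_df["Age"].median()

# Assign the filled column back instead of relying on inplace=True.
toy_df["Age"] = toy_df["Age"].fillna(age_median)

print(age_median)
print(toy_df["Age"].tolist())
```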

 

3. Feature analysis with grouping or pivot tables over several dimensions (passenger class and survival probability)

Two dimensions:

  • Survival probability for each passenger class
# Method 1: the classic group - aggregate - compute pattern (core course, lesson 6)
# Note: because Survived is a 0-1 variable, its mean is the fraction of survivors
titanic_df[['Pclass', 'Survived']].groupby('Pclass').mean() \
    .sort_values(by='Survived', ascending=False)
# Method 2: pivot_table can do the same job (new in this lecture)
# values: the column the aggregation is applied to; here we take its mean
# index: the grouping variable
# aggfunc: the aggregation function to apply
titanic_df.pivot_table(values='Survived', index='Pclass', aggfunc=np.mean)
  • Survival probability by sex

# Method 1: groupby
titanic_df[["Sex", "Survived"]].groupby('Sex').mean() \
    .sort_values(by='Survived', ascending=False)
# Method 2: pivot_table
titanic_df.pivot_table(values='Survived', index='Sex', aggfunc=np.mean)
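
As a quick check that the two methods agree, the sketch below runs both on a made-up eight-row table (`toy_df` is hypothetical, not the real data). It passes the string "mean" as `aggfunc`, which is equivalent to `np.mean` here and avoids the warning that newer pandas versions emit for NumPy callables.

```python
import pandas as pd

# Toy data, made up for illustration.
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Method 1: group, then aggregate.
by_groupby = toy_df.groupby("Pclass")["Survived"].mean()

# Method 2: the same aggregation expressed as a pivot table.
by_pivot = toy_df.pivot_table(
    values="Survived", index="Pclass", aggfunc="mean"
)["Survived"]

print(by_groupby)
print(by_pivot)
```

Both results are indexed by Pclass and hold the same per-class survival fractions.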

 

Three dimensions:

  • Survival probability considering both passenger class and sex

# Method 1: groupby
titanic_df[['Pclass', 'Sex', 'Survived']].groupby(['Pclass', 'Sex']).mean()
# Method 2: pivot_table
titanic_df.pivot_table(values='Survived', index=['Pclass', 'Sex'], aggfunc=np.mean)
# Method 3: pivot_table
# columns specifies another categorical variable, but its levels are laid out across the columns instead of the rows, which is why the parameter is called columns
titanic_df.pivot_table(values='Survived', index='Pclass', columns='Sex', aggfunc=np.mean)
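
The `columns` argument can be read as "group by both variables, then move one of them from the rows to the columns". The sketch below illustrates this on made-up rows by comparing `pivot_table(..., columns=...)` with `groupby(...).mean().unstack()`; all names are hypothetical.

```python
import pandas as pd

# Toy data, made up for illustration.
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Long form: one row per (Pclass, Sex) pair.
long_form = toy_df.groupby(["Pclass", "Sex"])["Survived"].mean()

# Wide form: Pclass on the rows, Sex across the columns.
wide_form = toy_df.pivot_table(
    values="Survived", index="Pclass", columns="Sex", aggfunc="mean"
)

# Unstacking the Sex level of the long form gives the same layout.
unstacked = long_form.unstack("Sex")

print(wide_form)
```

The wide layout is usually easier to read when one variable has only a few levels, as Sex does.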

Exercise:

Use groupby and pivot_table, respectively, to count the men and women in each passenger class.

# 1. groupby
titanic_df.groupby(['Pclass', 'Sex']).agg({"Sex": "size"})
titanic_df.groupby(['Pclass', 'Sex']).agg({"Sex": "count"})
titanic_df.groupby(['Pclass', 'Sex']).Sex.count()
# 2. pivot_table
titanic_df.columns  # list the available columns
# titanic_df.pivot_table(values='Survived', index=['Pclass', 'Sex'], aggfunc=np.mean)
titanic_df.pivot_table(values='Name', index=['Pclass', 'Sex'], aggfunc="count")  # aggfunc is applied to values; values can be any column other than the ones used in index
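
One caveat when picking the `values` column for counting: "count" ignores missing entries, while `size` counts rows. The sketch below shows the difference on made-up rows where Age has gaps; the table and names are assumptions for illustration.

```python
import pandas as pd

# Toy data, made up for illustration; Age has two missing entries.
toy_df = pd.DataFrame({
    "Name":   ["A", "B", "C", "D", "E"],
    "Pclass": [1, 1, 1, 3, 3],
    "Sex":    ["female", "male", "male", "male", "male"],
    "Age":    [29.0, None, 40.0, None, 21.0],
})

# size counts rows per group, whether or not values are missing.
sizes = toy_df.groupby(["Pclass", "Sex"]).size()

# Counting a complete column (Name) matches the group sizes.
name_counts = toy_df.pivot_table(
    values="Name", index=["Pclass", "Sex"], aggfunc="count"
)["Name"]

# Counting a column with gaps (Age) undercounts the passengers.
age_counts = toy_df.pivot_table(
    values="Age", index=["Pclass", "Sex"], aggfunc="count"
)["Age"]

print(sizes)
print(age_counts)
```

For a head count it is therefore safer to use a column without missing values, such as Name in the toy table above.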

 

4. Discretizing continuous variables

  • Discretizing continuous variables is a common technique in modeling
  • Discretization divides the range of a variable into several intervals (bins); observations that fall into the same interval are represented by the same label
  • Take Age: the minimum is 0.42 (an infant) and the maximum is 80. To produce five levels, we can use the cut or qcut function
  • cut divides the age range into five intervals of equal width, whereas qcut chooses the intervals so that each holds the same number of observations (five equal-sized groups). cut is demonstrated here.
# We use the cut function
# Each interval has the same fixed width, about 16 years
titanic_df['AgeBand'] = pd.cut(titanic_df['Age'], 5)
titanic_df.head()
  • Count the people who fall into each age interval
# Method 1: value_counts(); sort=False means the result is not sorted by count
titanic_df.AgeBand.value_counts(sort=False)
# Method 2: pivot_table
titanic_df.pivot_table(values='Survived',index='AgeBand', aggfunc='count')
titanic_df.pivot_table(values='Name',index='AgeBand', aggfunc='count')
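
To see how cut and qcut differ, the sketch below bins ten made-up ages that include one outlier. The values and names are assumptions for illustration, not taken from the Titanic data.

```python
import pandas as pd

# Toy ages, made up for illustration: nine children and one 80-year-old.
ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 80], name="Age")

# cut: five bins of equal width across the range from min to max.
equal_width = pd.cut(ages, 5)

# qcut: five bins chosen so each holds the same number of values.
equal_count = pd.qcut(ages, 5)

# Count how many values land in each bin.
width_counts = equal_width.value_counts(sort=False)
quantile_counts = equal_count.value_counts(sort=False)

print(width_counts)
print(quantile_counts)
```

With skewed data, equal-width bins can leave most observations in a single interval, while quantile bins stay balanced at the cost of uneven interval widths.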

 

Exercise: Considering sex, passenger class and port of embarkation together, compute the survival probability, and explore in a single figure how these three factors relate to it.

# Method 1. This is the best approach here, because it can show the relationship among four dimensions (variables) at once
# Point plot (the notes originally used sns.factorplot, which was renamed catplot in seaborn 0.9; its default kind was "point")
sns.catplot(x="Pclass", y="Survived", hue="Sex", col="Embarked", data=titanic_df, kind="point")
# Bar plot
sns.catplot(x="Pclass", y="Survived", hue="Sex", col="Embarked", data=titanic_df, kind="bar")
# Method 2. 'Embarked', 'Pclass', 'Sex', 'Survived'
# This approach can show the relationship among at most three variables
# 1. Here: the relationship among passenger class, sex and survival
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=titanic_df, ci=None)


# 2. Using FacetGrid to split the plot by port of embarkation
# (x is Sex, y is Survived, hue is Pclass; the male/female list is taken to be the x-axis order)
sns.FacetGrid(data=titanic_df, row='Embarked', aspect=1.5) \
    .map(sns.pointplot, 'Sex', 'Survived', 'Pclass', order=['male', 'female'], palette='deep', ci=None)
# Method 3.
sns.pairplot(titanic_df.loc[:, ['Embarked', 'Pclass', 'Sex', 'Survived']], hue="Embarked")
# sns.pairplot(titanic_df.loc[:, ['Embarked', 'Pclass', 'Sex', 'Survived']], hue_order=["Embarked", "Sex"])

sns.pairplot(titanic_df[['Pclass', 'Sex', 'PassengerId', 'Survived', 'Embarked', 'AgeBand']], hue='AgeBand')
sns.pairplot(titanic_df, hue='AgeBand')
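
The numbers behind such a figure can also be tabulated directly. As a minimal sketch on made-up rows (the table and the name `rates` are hypothetical), a pivot table with two index variables and one column variable gives the survival fraction for every combination of the three factors.

```python
import pandas as pd

# Toy data, made up for illustration (one passenger per combination).
toy_df = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "C", "C"],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "Sex":      ["female", "male", "female", "male",
                 "female", "male", "female", "male"],
    "Survived": [1, 0, 1, 0, 1, 1, 0, 0],
})

# Rows: (Embarked, Pclass); columns: Sex; cells: survival fraction.
rates = toy_df.pivot_table(
    values="Survived",
    index=["Embarked", "Pclass"],
    columns="Sex",
    aggfunc="mean",
)

print(rates)
```

Each cell corresponds to one point or bar in a plot that puts Pclass on the x axis, Sex as the hue and Embarked as the facet.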

 


Origin www.cnblogs.com/kongweisi/p/11231111.html