Assignment 2: Titanic dataset practice

Part 1: Read the file titanic.xlsx and clean the data, following the example steps in the course materials.

The Titanic dataset contains 11 fields:

Survived: 0 means the passenger died, 1 means the passenger survived
Pclass: the passenger's ticket class; there are three values (1, 2, 3)
Name: passenger name
Sex: passenger sex
Age: passenger age (has missing values)
SibSp: number of the passenger's siblings/spouses (integer)
Parch: number of the passenger's parents/children (integer)
Ticket: ticket number (string)
Fare: price of the passenger's ticket (floating-point number, ranging from 0 to 500)
Cabin: the passenger's cabin (has missing values)
Embarked: port of embarkation: S, C, Q (has missing values)

# Read the file and display the first five rows
import pandas as pd
titanic = pd.read_excel('Titanic-2.xlsx')
titanic.head()

 

 

# Drop a column that is not needed
titanic.drop('embark_town', axis=1, inplace=True)
titanic.head()

 

 

# Find duplicate rows
titanic.duplicated()

 

 

# Drop the duplicate rows
titanic = titanic.drop_duplicates()
titanic.head()
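
As a small illustration of the two duplicate-handling steps above, here is a sketch on a toy DataFrame. The values are made up and are not taken from the Titanic file; only `duplicated()` and `drop_duplicates()` come from the post.

```python
import pandas as pd

# Toy frame with made-up values; rows 1 and 2 are exact duplicates
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "pclass": [3, 1, 1, 2],
    "fare": [7.25, 71.28, 71.28, 13.0],
})

# duplicated() marks every row that repeats an earlier row
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped))
```

Summing the boolean mask gives a quick count of duplicates, which is easier to read than the full True/False column when the table is long.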

 

 

# Count the null values in the 'who' column
titanic['who'].isnull().value_counts()

 

 

# Fill the null values using the fillna method
titanic['who'] = titanic['who'].fillna('man')
titanic

 

 

# Check for null values in the 'age' column
titanic['age'].isnull().value_counts()

 

 

# Fill the missing ages with the column mean using fillna
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())
titanic.head()
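
The null-filling steps above can be tried on a toy DataFrame first. This is only a sketch with made-up values. Filling the text column with its most frequent value via `mode()` is an assumption of this sketch; the post itself hard-codes 'man'.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and one missing entry per column
toy = pd.DataFrame({
    "who": ["man", "woman", None, "man"],
    "age": [22.0, np.nan, 30.0, 38.0],
})

# Count the nulls in every column at once
missing_per_column = toy.isnull().sum()

# Text column: fill with the most frequent value (assumption, see above)
toy["who"] = toy["who"].fillna(toy["who"].mode()[0])

# Numeric column: fill with the column mean, as in the post
toy["age"] = toy["age"].fillna(toy["age"].mean())
print(toy)
```

`mean()` skips missing values by default, so the fill value is the average of the known ages only.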

 

 

# Use describe to view summary statistics
titanic.describe()

 

 

# Replace the outlier fare with the mean fare
titanic['fare'] = titanic['fare'].replace(512.3292, titanic['fare'].mean())
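
The step above hard-codes the outlier value read from the describe output. One possible alternative, sketched here on made-up fares, is to flag outliers with the common 1.5 x IQR rule and replace them with the mean of the remaining values. Both the rule and the choice of replacement value are assumptions of this sketch, not the method used in the post.

```python
import pandas as pd

# Made-up fares with one extreme value
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.3292])

# 1.5 x IQR rule: values far outside the middle 50% are flagged
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
is_outlier = (fare < q1 - 1.5 * iqr) | (fare > q3 + 1.5 * iqr)

# Replace flagged values with the mean of the non-outlier fares
replacement = fare[~is_outlier].mean()
cleaned = fare.mask(is_outlier, replacement)
print(cleaned)
```

Using the mean of the non-outlier values keeps the extreme fare from inflating its own replacement.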

Part 2: Complete the following statistical operations on the Titanic dataset

1. Count the passengers who died and who survived

titanic['survived'].value_counts()

2. Count the male and female passengers

titanic['sex'].value_counts()

3. Count the men and women who were rescued

titanic['sex'][titanic['survived']==1].value_counts()
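
The counts from steps 1 to 3 can also be seen in a single table. A minimal sketch with made-up rows, using `pd.crosstab` and `groupby`, which the post does not use:

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "survived": [0, 1, 1, 1, 0, 0],
})

# Rows: sex, columns: survived (0 or 1), cells: passenger counts
counts = pd.crosstab(toy["sex"], toy["survived"])

# The mean of a 0/1 column is the share of survivors in each group
survival_rate = toy.groupby("sex")["survived"].mean()
print(counts)
print(survival_rate)
```

The rate is often more informative than the raw count when the two groups differ a lot in size.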

4. Count the passengers in each cabin class

titanic['class'].value_counts()

5. Use the corr() function to determine whether two attributes are correlated, and analyze the relationship between survival and cabin class

titanic['survived'].corr(titanic['pclass'])
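
To help read the number that corr() returns, here is a sketch on made-up data. `Series.corr` computes the Pearson coefficient by default, which ranges from -1 to 1. A negative value here means that survival becomes less common as the pclass number goes up, that is, from first class toward third class.

```python
import pandas as pd

# Toy frame with made-up values: survivors are concentrated in class 1
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0],
})

# Pearson correlation between the two columns (the default method)
r = toy["survived"].corr(toy["pclass"])

# The survival rate per class shows the same trend in plain numbers
rate_by_class = toy.groupby("pclass")["survived"].mean()
print(r)
print(rate_by_class)
```

The per-class survival rate is a useful cross-check, since a single coefficient says nothing about how each class behaves on its own.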

 

 

6. Draw a box plot of passenger fares by cabin class. What conclusions can be drawn from the figure?

titanic.boxplot(column='fare', by='pclass')

 

Conclusion: The higher the cabin class, the higher the fare.
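
The conclusion read from the box plot can be checked numerically. A minimal sketch with made-up fares, comparing the median fare per class (pclass 1 is first class):

```python
import pandas as pd

# Toy frame with made-up fares for each class
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare": [71.28, 53.10, 13.00, 26.00, 7.25, 8.05],
})

# Median fare per class; the median is the line inside each box
median_fare = toy.groupby("pclass")["fare"].median()

# True when the median falls as the pclass number rises from 1 to 3
fare_drops_with_class = median_fare.is_monotonic_decreasing
print(median_fare)
print(fare_drops_with_class)
```

The median is used rather than the mean because a few very expensive tickets can pull the mean upward.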

 

Origin www.cnblogs.com/liyuchen44/p/11693969.html