First, read the file titanic.xlsx and clean the data, following the example steps in the course materials.
The titanic data set contains 11 fields:
Survived: 0 means the passenger died, 1 means the passenger survived
Pclass: ticket class held by the passenger, one of three values (1, 2, 3)
Name: passenger name
Sex: passenger sex
Age: passenger age (has missing values)
SibSp: number of siblings / spouses travelling with the passenger (integer)
Parch: number of parents / children travelling with the passenger (integer)
Ticket: ticket number (string)
Fare: price of the passenger's ticket (floating-point number, ranging from 0 to 500)
Cabin: cabin the passenger stayed in (has missing values)
Embarked: port of embarkation: S, C, Q (has missing values)
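The Age, Cabin, and Embarked fields are marked above as incomplete. A minimal sketch of how the gaps could be counted per column is shown here; the three-row `sample` frame is made up for illustration and is not taken from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample using a few of the fields listed above
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

# isnull() marks each missing cell; sum() counts them column by column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Running the same `isnull().sum()` call on the full data set gives a quick overview of which columns need cleaning before any statistics are computed.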
# Read the file and display the first five rows
import pandas as pd
titanic = pd.read_excel('Titanic-2.xlsx')
titanic.head()
# Delete the unneeded column
titanic.drop('embark_town', axis=1, inplace=True)
titanic.head()
# Find duplicate rows
titanic.duplicated()
# Delete the duplicate rows
titanic = titanic.drop_duplicates()
titanic.head()
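To make the duplicate handling easier to follow, here is a small sketch on a made-up frame (the `toy` data and its column values are assumptions for illustration only). It shows what `duplicated()` returns and what `drop_duplicates()` keeps.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one exactly
toy = pd.DataFrame({
    "name": ["A", "B", "A"],
    "fare": [7.25, 71.28, 7.25],
})

# True for every row that repeats an earlier row
dup_mask = toy.duplicated()
n_dups = int(dup_mask.sum())

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(dup_mask.tolist())
print(n_dups, len(deduped))
```

Summing the boolean mask is a convenient way to see how many rows will be removed before actually dropping them.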
# Count the null values in the who column
titanic['who'].isnull().value_counts()
# Fill the null values with fillna
titanic['who'] = titanic['who'].fillna('man')
titanic
# Count the null values in the age column
titanic['age'].isnull().value_counts()
# Fill the null values in the age column with the average age
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())
titanic.head()
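The two fill steps can be tried on a small made-up frame. In this sketch the categorical column is filled with its most frequent value instead of a hard-coded string; that choice is an assumption for illustration and may not be why 'man' was picked in the steps here. The numeric column is filled with its mean, as in the steps here.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column
toy = pd.DataFrame({
    "who": ["man", "woman", None, "man"],
    "age": [20.0, 30.0, np.nan, 40.0],
})

# mode() ignores missing values and returns the most frequent entries
most_common = toy["who"].mode()[0]
toy["who"] = toy["who"].fillna(most_common)

# mean() also skips missing values, so the gap does not affect the average
toy["age"] = toy["age"].fillna(toy["age"].mean())

print(toy)
```

Because `mean()` skips missing values by default, the average used for filling is computed from the known ages only.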
# Use describe to view summary statistics
titanic.describe()
# Replace the outlier in the fare column with the column mean
titanic['fare'] = titanic['fare'].replace(512.3292, titanic['fare'].mean())
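A short sketch of the outlier replacement on made-up fares follows. It restricts the replacement to one column and assigns the result back, since `replace` returns a new object rather than modifying the data in place. The threshold of 500 in the second variant is an assumption chosen for this toy data, not a value from the original steps.

```python
import pandas as pd

# Hypothetical fares containing one extreme value
toy = pd.DataFrame({"fare": [7.25, 8.05, 512.3292, 26.0]})
mean_fare = toy["fare"].mean()

# Exact-value replacement, limited to the fare column
replaced = toy["fare"].replace(512.3292, mean_fare)

# Threshold-based alternative that does not rely on an exact float match
masked = toy["fare"].mask(toy["fare"] > 500, mean_fare)

print(replaced.tolist())
print(masked.tolist())
```

Note that the mean here still includes the outlier itself. Computing the mean from the remaining rows only, or using the median, would be a reasonable alternative.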
Second, complete the following statistical operations on the titanic data set.
1. Count the passengers who died and who survived
titanic['survived'].value_counts()
2. Count the male and female passengers
titanic['sex'].value_counts()
3. Count the men and women who were rescued
titanic['sex'][titanic['survived']==1].value_counts()
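As a possible alternative view, the same count can be read from a cross-tabulation that also shows the passengers who were not rescued. The `toy` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical data: five passengers
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female"],
    "survived": [0, 1, 1, 1, 0],
})

# Filter first, then count: survivors per sex
rescued_by_sex = toy.loc[toy["survived"] == 1, "sex"].value_counts()

# Rows are sexes, columns are the survived values 0 and 1
table = pd.crosstab(toy["sex"], toy["survived"])

print(rescued_by_sex)
print(table)
```

The column labelled 1 in `table` holds the same numbers as `rescued_by_sex`, while the column labelled 0 gives the deaths per sex.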
4. Count the passengers in each cabin class
titanic['class'].value_counts()
5. Use the corr() function to determine whether two attributes are correlated, and analyze the relationship between survival and cabin class
titanic['survived'].corr(titanic['pclass'])
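By default, `Series.corr` computes the Pearson correlation coefficient. The sketch below, on made-up data, compares the built-in result with a manual calculation from the definition, so the sign of the result is easier to interpret.

```python
import numpy as np
import pandas as pd

# Hypothetical data: survivors concentrated in the lower pclass numbers
toy = pd.DataFrame({
    "survived": [1, 1, 0, 0, 1, 0],
    "pclass": [1, 1, 3, 3, 2, 3],
})

# Built-in Pearson correlation
r = toy["survived"].corr(toy["pclass"])

# Manual calculation: covariance of the centered values over the product of their norms
x = toy["survived"] - toy["survived"].mean()
y = toy["pclass"] - toy["pclass"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r, r_manual)
```

A negative coefficient means that survival becomes less frequent as the pclass number increases. Since pclass 1 is the highest class, a negative value indicates that passengers in higher classes survived more often.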
6. Draw a box plot of passenger fares grouped by cabin class using boxplot. What conclusion can be drawn from the figure?
titanic.boxplot(column='fare', by='pclass')
Conclusion: The higher the cabin class (pclass 1 being the highest), the higher the fare.
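The visual conclusion can also be checked numerically by comparing the median fare per class. This is only a sketch; the fares in `toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two passengers per class
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare": [71.28, 53.10, 13.00, 26.00, 7.25, 8.05],
})

# The median is the center line of each box in the box plot
median_fare = toy.groupby("pclass")["fare"].median()

print(median_fare)
```

If the medians decrease from pclass 1 to pclass 3 on the real data as well, that supports the reading of the box plot.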