Titanic data analysis

This is a classic case that many bloggers have already written about. Yes, it's the one: survival analysis for the Titanic. It is a Kaggle problem in which you analyze and model the information about the passengers on board to predict which of them survived.

Let's take this data set and do a simple, rough analysis.

Tool used: Excel
(yes, it's that simple and crude)

To get the data source, reply "Titanic" in the backend.

1. Define the objective

In 1912, the Titanic struck an iceberg and sank. 1502 of the 2224 passengers and crew on board were killed. Were the survivors simply lucky, or was there a pattern? This is what we care about, so we start by asking a question:

Which people were more likely to survive?

Next, understand the data. The fields fall into a few kinds: name, sex, cabin, embarked, and ticket are strings, while pclass and survived are numeric but are really labels. We analyze the data along four dimensions: cabin class, passenger, ticket, and geography.


2. Data processing

On inspection, the age, fare, embarked, and cabin fields all have missing values. Let's look at them one by one.

1. Age missing value processing

The age column is empty in 263 rows, a missing rate of 20%. All of them could be filled with the mean or mode of age. Looking further, third-class passengers account for the largest share of the rows with a missing age, and among those the largest group is men who did not survive, so the gaps could also be filled with the average age of third class.

To preserve the authenticity of the data, no filling is done here.
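Although the article leaves age unfilled, the class-mean idea could be sketched in Python roughly like this. The rows are made-up sample values, not the real data set, and `fill_age_by_class_mean` is a hypothetical helper name.

```python
from statistics import mean

# Made-up sample rows for illustration only (not the real Titanic data).
rows = [
    {"pclass": 3, "age": 20.0},
    {"pclass": 3, "age": 30.0},
    {"pclass": 3, "age": None},   # missing age
    {"pclass": 1, "age": 40.0},
]

def fill_age_by_class_mean(rows):
    """Return copies of rows with missing ages replaced by the class mean."""
    # Collect the known ages for each class.
    known = {}
    for r in rows:
        if r["age"] is not None:
            known.setdefault(r["pclass"], []).append(r["age"])
    class_mean = {c: mean(ages) for c, ages in known.items()}
    # A row whose class has no known age at all stays unfilled (None).
    return [
        {**r, "age": class_mean.get(r["pclass"])} if r["age"] is None else dict(r)
        for r in rows
    ]

filled = fill_age_by_class_mean(rows)
missing_rate = sum(r["age"] is None for r in rows) / len(rows)
print(missing_rate)        # share of rows with no age
print(filled[2]["age"])    # mean of the known third-class ages
```

The original rows are left untouched, so the unfilled version stays available for the analysis.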

2. Fare missing value processing

Filtering shows that fare is missing only one value. After locating that row, we find it can be filled with the mean fare of similar passengers.


So we filter for third-class males older than 60 who embarked at port S and use their average fare, 7, to fill in the missing value.
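The same filter-then-average step might look like this in Python. The passengers below are made-up sample values chosen only to illustrate the idea, and `is_similar` is a hypothetical helper name.

```python
from statistics import mean

# Made-up sample rows for illustration only (not the real Titanic data).
passengers = [
    {"pclass": 3, "sex": "male", "age": 61, "embarked": "S", "fare": 6.0},
    {"pclass": 3, "sex": "male", "age": 70, "embarked": "S", "fare": 8.0},
    {"pclass": 1, "sex": "male", "age": 64, "embarked": "S", "fare": 26.0},
    {"pclass": 3, "sex": "male", "age": 60.5, "embarked": "S", "fare": None},
]

def is_similar(p):
    """Third-class male, older than 60, embarked at S, with a known fare."""
    return (p["pclass"] == 3 and p["sex"] == "male" and p["age"] > 60
            and p["embarked"] == "S" and p["fare"] is not None)

# Average fare of the similar passengers, used as the fill value.
fill_value = mean(p["fare"] for p in passengers if is_similar(p))

for p in passengers:
    if p["fare"] is None:
        p["fare"] = fill_value
```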

3. Embarked missing value processing

The embarked (port of embarkation) field also has two missing values.

Further inspection shows that both passengers traveled alone with no family (sibsp and parch are both 0). Following the same idea used for the missing fare, we look for similar passengers to fill the gap. For the first passenger, we take the most common port among first-class women aged 35 to 40, which is S.

In the same way, for the second passenger we take the most common port among first-class women aged 60 to 65, which is also S.
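Filling with the most common value among similar passengers could be sketched as below. The rows are made-up sample values and `most_common_port` is a hypothetical helper name.

```python
from collections import Counter

# Made-up first-class female passengers for illustration only.
first_class_women = [
    {"age": 36, "embarked": "S"},
    {"age": 38, "embarked": "C"},
    {"age": 39, "embarked": "S"},
    {"age": 50, "embarked": "C"},   # outside the age band, ignored
]

def most_common_port(rows, min_age, max_age):
    """Most frequent port among rows whose age lies in [min_age, max_age]."""
    ports = Counter(
        r["embarked"] for r in rows
        if min_age <= r["age"] <= max_age and r["embarked"]
    )
    return ports.most_common(1)[0][0]

print(most_common_port(first_class_women, 35, 40))
```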

4. Cabin missing value processing

The cabin field is missing 77% of its values. That is too many to fill, so the choice is to keep the field as is or delete it. We keep it here.

3. Data analysis

1. Class Dimension

pclass
Analyze cabin class and survival by inserting a pivot table.


Among the survivors, first class accounted for 40%.

A 100% stacked column chart of survival and death in each class shows that first class has the highest survival rate, 61.92%, and third class the lowest, only 25.33%. So the old saying still holds: money isn't everything, but without money @#%&^…
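What the pivot table and the 100% stacked chart compute is a survival rate per group. A minimal Python sketch, using made-up (pclass, survived) pairs rather than the real data and a hypothetical helper `survival_rate_by`:

```python
from collections import defaultdict

# Made-up (pclass, survived) pairs for illustration only.
records = [(1, 1), (1, 1), (1, 0), (2, 1), (2, 0), (3, 0), (3, 0), (3, 1), (3, 0)]

def survival_rate_by(records):
    """Map each group key to survivors / total, like a pivot shown as % of row."""
    total = defaultdict(int)
    survived = defaultdict(int)
    for key, flag in records:
        total[key] += 1
        survived[key] += flag   # flag is 1 for survived, 0 otherwise
    return {key: survived[key] / total[key] for key in total}

rates = survival_rate_by(records)
for pclass in sorted(rates):
    print(pclass, f"{rates[pclass]:.2%}")
```

The same helper works for any label column (sex, age group, sibsp, parch, embarked) paired with the survived flag.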

cabin
A pivot table on cabin (cabin number) shows 295 unique values, which basically means that most cabins hold only one person.


But there are also cabins that correspond to more than 2 people. Adding cabin class to the pivot for comparison shows that third class has very few cabin values, which means most of the missing cabin values belong to third class. Does that mean third-class passengers had no cabin of their own? Shared bunks? This needs further verification.

In addition, third-class cabin numbers begin with E/F/G, while first-class ones are mostly A/B/C. My guess is that the cabin letters ascend alphabetically as the class gets lower.
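One way to check that guess is to count the first letter of each cabin value per class. A small sketch with made-up (pclass, cabin) pairs and a hypothetical helper `deck_letters_by_class`:

```python
from collections import Counter

# Made-up (pclass, cabin) pairs for illustration only; None means missing.
cabins = [(1, "C85"), (1, "B28"), (1, "A6"), (2, "E101"),
          (3, "G6"), (3, "F33"), (3, None)]

def deck_letters_by_class(pairs):
    """Count the first letter of each non-missing cabin value per class."""
    counts = {}
    for pclass, cabin in pairs:
        if cabin:  # skip missing cabins
            counts.setdefault(pclass, Counter())[cabin[0]] += 1
    return counts

decks = deck_letters_by_class(cabins)
for pclass in sorted(decks):
    print(pclass, sorted(decks[pclass].items()))
```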

2. Passenger dimension

name
The name field itself seems to hold no valuable information. One further thought is that each name carries a title, such as Mr for a man or Mrs for a married woman, but for now we simply delete the field.
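If you did want to keep the title, it could be pulled out with a regular expression. This sketch assumes names follow the "Surname, Title. Given names" pattern, and `extract_title` is a hypothetical helper name.

```python
import re

def extract_title(name):
    """Return the text between the comma and the first period, or None."""
    match = re.search(r",\s*([^.,]+)\.", name)
    return match.group(1).strip() if match else None

print(extract_title("Braund, Mr. Owen Harris"))
print(extract_title("Cumings, Mrs. John Bradley (Florence Briggs Thayer)"))
```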

sex
Analyze gender and survival.


Women accounted for 67.8% of survivors, much higher than the 32.2% for men.

Female survivors make up 72.75% of all women, far more than the 19.10% of all men who survived.

Gender & Cabin
While we are at it, look at the relationship between class and gender. Because there are more men overall, every class has more men than women. Even so, in every class women are the largest group among those rescued.


However, the survival rate of first-class women is 97%, much higher than in the other two classes, and the survival rate of third-class women is only 49%.

age
Analyze age and survival. Because age has missing values, only rows with a numeric age are analyzed.

First, compute simple descriptive statistics for age with the [Descriptive Statistics] function in [Data Analysis]. The maximum age is 80, the minimum is 0.17, the mean is 29.88, the median is 28, and the mode is 24.
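For readers without the Excel add-in, the same summary can be approximated with Python's standard `statistics` module. The ages below are made-up sample values, with None standing in for a missing age.

```python
from statistics import mean, median, mode

# Made-up sample ages for illustration only; None marks a missing value.
ages = [22.0, 24.0, 24.0, 28.0, None, 35.0, 0.17, None, 80.0]

# Only rows with a numeric age take part, as in the article.
known_ages = [a for a in ages if a is not None]

summary = {
    "count": len(known_ages),
    "min": min(known_ages),
    "max": max(known_ages),
    "mean": round(mean(known_ages), 2),
    "median": median(known_ages),
    "mode": mode(known_ages),
}
print(summary)
```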


Next, look at the age distribution with a histogram in 5-year groups. Passengers are mainly concentrated between 15 and 30, and the largest group is 20 to 25.
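The binning behind such a histogram could be sketched as follows. Note one assumption: this version puts a value into the bin that starts at or below it (for example, 25 goes into 25 to 30), which may differ from how Excel's histogram tool treats bin edges. The ages are made-up sample values and `bin_counts` is a hypothetical helper name.

```python
from collections import Counter

def bin_counts(values, width):
    """Count values per bin [k*width, (k+1)*width), skipping missing ones."""
    counts = Counter()
    for v in values:
        if v is not None:
            lower = int(v // width) * width   # lower edge of the bin
            counts[lower] += 1
    return dict(sorted(counts.items()))

# Made-up sample ages for illustration only.
ages = [0.17, 4, 18, 21, 22, 24, 25, 29, 31, 80, None]
histogram = bin_counts(ages, 5)
for lower, count in histogram.items():
    print(f"{lower}-{lower + 5}: {count}")
```

Calling the same helper with a width of 50 would group fares instead of ages.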

After understanding the approximate distribution of age, we have to look at the survival of specific groups of people. We divide age into:

  • Juvenile (0~15 years old)

  • Youth (15~40 years old)

  • Middle age (41~65 years old)

  • Elderly (over 66 years old)

First make a grouping table, then use VLOOKUP approximate matching to do the grouping.


Create a new auxiliary column for the age group next to age and enter the formula:

=VLOOKUP(E2,Sheet2!$B$18:$C$21,2,1)

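Sheet2!$B$18:$C$21 is the grouping table set up in advance. The last argument, 1, turns on approximate matching, so each age is matched to the largest lower bound that does not exceed it.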
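Approximate-match VLOOKUP is essentially a search over sorted lower bounds. A minimal Python sketch with `bisect` is below. The lower bounds 0, 15, 41, and 66 are an assumption read off the group list above, since the actual grouping table only appears in an image, and `age_group` is a hypothetical helper name.

```python
from bisect import bisect_right

# Assumed lower bounds for each group (the article's table is not shown in text).
LOWER_BOUNDS = [0, 15, 41, 66]
LABELS = ["Juvenile", "Youth", "Middle age", "Elderly"]

def age_group(age):
    """Mimic approximate-match VLOOKUP: last row whose lower bound is <= age."""
    index = bisect_right(LOWER_BOUNDS, age) - 1
    if index < 0:
        # VLOOKUP would return #N/A when the value is below the first bound.
        raise ValueError("age is below the first lower bound")
    return LABELS[index]

for age in (0.17, 15, 40.5, 41, 80):
    print(age, age_group(age))
```

As with VLOOKUP, the bounds must be sorted in ascending order for the lookup to be correct.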

A pivot table of age group and survival shows that youths and juveniles account for the largest share of survivors, while the elderly account for the smallest.

A 100% stacked column chart of deaths and survivals in each age group shows that juveniles have the highest survival rate.

sibsp
Analyze the sibsp field (number of siblings/spouses aboard). The pivot shows that the value 0, meaning no siblings or spouse aboard, covers the majority of passengers on the ship.

Again because this group is so large, passengers with 0 such relatives make up 61.8% of the survivors.

A 100% stacked column chart for each value gives a more meaningful result: passengers with 1 sibling/spouse have the highest survival rate.

parch
Analyze the parch field (number of parents/children aboard). Passengers with no parents/children make up 76% of everyone on board, and this group also has the largest number of survivors.

A 100% stacked column chart shows that passengers with 3 parents/children have the highest survival rate, 62.5%.

3. Ticket dimension

fare
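Analyze the fare field. The first thing of interest is whether fare and class are correlated; the natural expectation is that the higher the class, the higher the fare. The correlation coefficient between pclass and fare comes out at -0.56, a fairly strong correlation (it is negative because a smaller pclass number means a higher class).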
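The coefficient here is the Pearson correlation coefficient, the same quantity Excel's CORREL function returns. A minimal sketch of the calculation, with made-up pclass and fare values and a hypothetical helper named `pearson`:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up sample values for illustration only.
pclass = [1, 1, 2, 2, 3, 3]
fare = [80.0, 60.0, 20.0, 15.0, 8.0, 7.0]

r = pearson(pclass, fare)
print(round(r, 2))   # negative: a larger pclass number goes with a lower fare
```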

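Remember the VLOOKUP approximate-match grouping above? You can also group directly in the pivot table. After pivoting, use Group with a step of 50, then pivot fare against class. All fares above 100 are first class, and most second- and third-class fares fall in 0~50.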

Gender & Fare
The average fare for women is higher than for men.

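Gender & Class & Fare
The average first-class fare is far higher than in the other two classes, and in every class the average fare for women is higher than for men. The maximum fare, 512, belongs to a first-class woman. Another interesting observation is that everyone with a fare of 0 is male.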

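Having come this far, one more question comes up: what is fare actually related to? Gender? Port of embarkation? Class? Cabin? Interested readers can dig deeper on their own; we will not explore it further here.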

Next, look at the distribution of fare in groups of 50. Fares of 0~50 account for 82% of the passengers on board.


The 0~50 fare group also has the largest number of survivors, simply because its base is so large.

Looking at each fare group in a 100% stacked column chart, the 500-550 group has a survival rate of 100%, while the 0-50 group has a survival rate of only 32%.

ticket
The ticket field is the ticket information/code. It has little analytical value, so it is simply deleted here.

4. Geographic dimension

embarked
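Analyze the embarked (port of embarkation) field. The pivot shows that port S has the most passengers boarding, and the stacked column chart shows that port C has the highest survival rate.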


4. What is the survival rate related to?

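What is the survival rate related to? This is what we care about most, and the question really comes down to the correlation coefficients between the survived field and the other fields.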

The sex column is string data and has to be mapped to numbers. We add an auxiliary column named gender, with male as 1 and female as 0.

Add another column, f_num, which is the sum of sibsp and parch and represents the number of family members.
The embarked field is split into 3 auxiliary columns: port-S, port-C, and port-Q. Enter the formula:

=IF(N2="S",1,0)

If the embarked field is S, then port-S is 1 and port-C and port-Q are 0, and so on.
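These auxiliary columns are what is usually called one-hot (0/1 indicator) encoding. A rough Python equivalent, using made-up values and a hypothetical helper `one_hot`:

```python
def one_hot(values, categories, prefix):
    """Return one dict per value with a 0/1 indicator column per category."""
    return [
        {f"{prefix}-{c}": int(v == c) for c in categories}
        for v in values
    ]

# Made-up sample values for illustration only.
embarked = ["S", "C", "Q", "S"]
sex = ["male", "female", "female", "male"]

port_columns = one_hot(embarked, ["S", "C", "Q"], "port")
gender = [1 if s == "male" else 0 for s in sex]   # male = 1, female = 0

print(port_columns[0])
print(gender)
```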

In the same way, do the same for the class field pclass.

Use the [Correlation] tool in [Data Analysis] to get the correlation coefficient of each field.

Sorting the coefficients in descending order shows what the survival rate is related to.
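The ranking step could be sketched in Python as below. The columns are made-up sample values, not the article's results, and `pearson` is a hypothetical helper. Sorting in descending order follows the article; sorting by absolute value would be another reasonable choice, since a strongly negative coefficient is also a strong relationship.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up numeric columns for illustration only.
columns = {
    "survived": [1, 1, 0, 0, 1, 0],
    "gender":   [0, 0, 1, 1, 0, 1],   # male = 1, female = 0
    "pclass-1": [1, 0, 0, 0, 1, 0],   # indicator column for first class
    "f_num":    [1, 0, 0, 2, 1, 0],   # sibsp + parch
}

target = columns["survived"]
ranking = sorted(
    ((name, pearson(values, target))
     for name, values in columns.items() if name != "survived"),
    key=lambda item: item[1],
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: {r:+.2f}")
```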
So, back to our initial question:

Which people were more likely to survive?

Conclusions:

  • Although third class has the most passengers (54%), first class has the highest survival rate (62%)

  • Although there are more men (64%) than women, the survival rate of women (73%) is much higher than that of men (19%)

  • The survival rate of first-class women (97%) is much higher than that of third-class women (49%)

  • The number of young people aged 15-40 is the largest (53%), and the survival rate is highest among those aged 0-15 (56%)

  • Passengers with 0 siblings/spouses are the largest group (68%), and those with 1 have the highest survival rate (51%)

  • Passengers with 0 parents/children are the largest group (76%), and those with 3 have the highest survival rate (63%)

  • Passengers with fares in the range 0-50 are the largest group (82%), but the survival rate for fares in the range 500-550 is 100%

  • Port S has the largest number of people boarding (70%), but Port C has the highest survival rate (56%)



Origin blog.51cto.com/15064638/2598040