Descriptive statistics in R, part 1 | Calculating the survival rate by cabin class on the Titanic

Author: little bit helper

Source: Dingdian help you

Today we start learning how to do descriptive statistics in R. So that you can practice while you learn, you can download this data:

File name: titanic.csv

Link: https://pan.baidu.com/s/1Pj0EsaBZdnw6mHPpeVd9Aw  

Password: yuym

Import local files into R

To make data easier to manage and work with, we usually save it in .csv format, a relatively simple plain-text format that Excel can open and save. To import .csv data into R, you can use the function read.csv():

# Import the local file titanic.csv into R
# and store it in the object titanic
titanic <- read.csv("//Users//Desktop//titanic.csv", header = TRUE)

This local file stores basic information about passengers on the Titanic, the giant ocean liner that sank in the Atlantic Ocean in 1912.

The first argument above, "//Users//Desktop//titanic.csv", is the path where the file titanic.csv is stored locally. Adjust it to match where the file is saved on your computer;

The second argument, header = TRUE, tells R to use the first line of the original file as the column names.

If your .csv file has no column names and you want to set them after importing it into R, set the second argument to header = FALSE.
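To see what the header argument changes, here is a minimal sketch. It does not use titanic.csv; the two temporary files and their contents are made up purely for illustration.

```r
# Hypothetical stand-in data, not the real titanic.csv
rows <- c("pclass,survived", "1st,survived", "3rd,died")

file_with_names    <- tempfile(fileext = ".csv")
file_without_names <- tempfile(fileext = ".csv")
writeLines(rows, file_with_names)          # first line holds column names
writeLines(rows[-1], file_without_names)   # data lines only

# header = TRUE: the first line becomes the column names
with_names <- read.csv(file_with_names, header = TRUE)
names(with_names)

# header = FALSE: every line is data, and R assigns the names V1, V2, ...
without_names <- read.csv(file_without_names, header = FALSE)
names(without_names)

# Set the column names by hand after importing
names(without_names) <- c("pclass", "survived")
```

Both data frames end up with two rows and the same column names; only the way the names were obtained differs.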

Understand the data

As mentioned in the previous article, when we get a data set, we must first understand its basic information. This was covered before, so let's review it briefly.

class(titanic)    # What is the data structure of the object?
[1] "data.frame"
dim(titanic)      # How many rows and columns does the data have?
[1] 1309    6
names(titanic)    # View the column names of the data
[1] "pclass"   "survived" "sex"      "age"      "sibsp"    "parch"
head(titanic)     # View the first 6 rows
tail(titanic)     # View the last 6 rows

As you can see, there are 1309 records and 6 variables in the titanic data frame.

The six variables are, in order: cabin class, whether the passenger survived, sex, age, number of siblings or spouses travelling with the passenger, and number of parents or children travelling with the passenger.

Descriptive statistics

Next, let's do descriptive statistics on titanic data.

1. How many people are in each class of cabin?

There are two methods. One is the table() function, which counts the frequency of each category in the categorical variable pclass. The other is the summary() function, a general descriptive-statistics function that works for both categorical and numeric variables: for a categorical variable stored as a factor it counts the frequency of each category, and for a numeric variable it reports the mean, quartiles, and so on.

# Method one
table(titanic$pclass)
1st 2nd 3rd 
323 277 709 
# Method two (summary() counts categories only for a factor, so convert first)
summary(factor(titanic$pclass))
1st 2nd 3rd 
323 277 709 
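The difference between the two functions is easier to see on a small example. The sketch below uses a few made-up values (hypothetical, not taken from titanic.csv). Note that since R 4.0, read.csv() keeps text columns as character vectors by default, which is why the factor() conversion matters for summary().

```r
# Hypothetical miniature data, only to show the behaviour
pclass <- c("1st", "3rd", "2nd", "3rd", "3rd")
age    <- c(29, 2, 30, 25, 48)

# table() counts categories for character vectors and factors alike
counts_table <- table(pclass)

# summary() gives the same counts only when the variable is a factor
counts_summary <- summary(factor(pclass))

# On a plain character vector, summary() reports Length/Class/Mode instead
summary(pclass)

# On a numeric variable, summary() reports min, quartiles, median, mean, max
age_summary <- summary(age)
age_summary
```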

2. How many passengers died and how many survived?

table(titanic$survived)   
died  survived       
809      500 

3. How many people died and how many survived in each class of cabin?

In this example, passengers are counted by two conditions, cabin class and survival, which gives 3 × 2 = 6 possible combinations. We again use the table() function to count how many people fall into each combination, which produces a contingency table (cross-tabulation).

# Store the contingency table in tab1
tab1 <- table(titanic$survived, titanic$pclass)
# View the content of tab1
tab1
          
           1st 2nd 3rd
  died     123 158 528
  survived 200 119 181
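If you do not have the data file at hand, the same contingency table can be rebuilt from the six counts shown above. The sketch below does that and then adds row and column totals with addmargins(). The dimension labels survived and pclass are my own choice for readability. With the real data, the equivalent would be passing named arguments, as in table(survived = titanic$survived, pclass = titanic$pclass).

```r
# Rebuild the table from the counts reported in the article
# (matrix() fills column by column: died, survived for 1st, then 2nd, then 3rd)
tab1 <- as.table(matrix(
  c(123, 200, 158, 119, 528, 181),
  nrow = 2,
  dimnames = list(survived = c("died", "survived"),
                  pclass   = c("1st", "2nd", "3rd"))
))
tab1

# Add a "Sum" row and a "Sum" column holding the totals
with_totals <- addmargins(tab1)
with_totals
```

The bottom-right cell of with_totals is the total number of passengers, which should match the 1309 rows reported by dim(titanic).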

4. What is the percentage of survivors in each class?

The idea is simple, that is, the ratio of the number of survivors in each class to the total number of people in that class.

1) First, let's see how to get the number of survivors in each class. They are exactly the second row of tab1 above, so we only need to extract it. The method is the same as extracting rows and columns from a data frame, as described earlier:

# Extract the second row of tab1
tab1[2, ]
1st 2nd 3rd 
200 119 181 

2) What about the total number of people in each class? This has already been calculated above:

table(titanic$pclass) 
1st 2nd 3rd  
323 277 709

Another method is the apply() function, which applies a function to every row or every column of matrix-like data:

apply(tab1,2,sum) 
1st 2nd 3rd 
323 277 709

The function takes three arguments. The first argument, tab1, is the data to be processed. The second argument, 2, means to process each column of tab1; to process each row instead, pass 1. The third argument, sum, is the function to apply, here summation.

Therefore, the statement above sums each column of tab1, that is, it calculates the total number of people in each class.
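The effect of the second argument can be checked on the table itself. The sketch below rebuilds the table from the counts shown earlier (so it runs without the data file) and compares apply() with the base R shortcuts colSums() and rowSums(), which are not used in the article but compute the same sums.

```r
# Rebuild the contingency table from the counts reported in the article
tab1 <- as.table(matrix(
  c(123, 200, 158, 119, 528, 181),
  nrow = 2,
  dimnames = list(c("died", "survived"), c("1st", "2nd", "3rd"))
))

# Second argument 2: one sum per column (people in each class)
per_class <- apply(tab1, 2, sum)
per_class

# Second argument 1: one sum per row (total died, total survived)
per_outcome <- apply(tab1, 1, sum)
per_outcome

# Shortcuts for the special case of summing
colSums(tab1)
rowSums(tab1)
```

The row sums reproduce the counts from question 2 (809 died, 500 survived), which is a quick consistency check on the table.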

3) Find the ratio of the number of survivors in each class to the total number of people in that class:

# Method one
tab1[2, ]/table(titanic$pclass)
      1st       2nd       3rd 
0.6191950 0.4296029 0.2552891 
# Method two
tab1[2, ]/apply(tab1,2,sum)
      1st       2nd       3rd 
0.6191950 0.4296029 0.2552891 
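Both methods divide two vectors element by element, so they rely on the classes appearing in the same order in the numerator and the denominator. As a cross-check, base R's prop.table() can compute the column-wise proportions in one step. This is an alternative not used in the article, sketched here on the table rebuilt from the reported counts.

```r
# Rebuild the contingency table from the counts reported in the article
tab1 <- as.table(matrix(
  c(123, 200, 158, 119, 528, 181),
  nrow = 2,
  dimnames = list(survived = c("died", "survived"),
                  pclass   = c("1st", "2nd", "3rd"))
))

# margin = 2 divides every cell by its column total,
# so each column (class) sums to 1
rates <- prop.table(tab1, margin = 2)
rates

# The "survived" row holds the survival rate of each class
rates["survived", ]

# Same numbers as the manual division
manual <- tab1["survived", ] / colSums(tab1)
```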

4) You have probably noticed that this result is hard to read and not suitable for reporting in scientific research. We make the following changes:

# First multiply by 100
tab1[2, ]/apply(tab1,2,sum)*100
     1st      2nd      3rd 
61.91950 42.96029 25.52891 
# Keep 2 decimal places
round(tab1[2, ]/apply(tab1,2,sum)*100, 2)
  1st   2nd   3rd 
61.92 42.96 25.53 

The round() function rounds a number to a given number of decimal places.

In the code above, the first argument, tab1[2, ]/apply(tab1,2,sum)*100, is the value to be rounded;

The second argument, 2, means to keep 2 decimal places.
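A short sketch of how the second argument of round() behaves, using the first-class percentage from above as the input. The format() call at the end is an extra not covered in the article; it shows one way to keep trailing zeros, at the cost of getting text back instead of a number.

```r
x <- 61.9195

two_places <- round(x, 2)   # 61.92
one_place  <- round(x, 1)   # 61.9
no_places  <- round(x)      # 62 (the second argument defaults to 0)

# round() returns a number, so trailing zeros are not shown when it is printed
round(42.5, 2)

# format() returns text and can pad to a fixed number of decimals
padded <- format(round(42.5, 2), nsmall = 2)   # "42.50"
```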

5) But this result is still not quite right: a percentage is only properly expressed once the percent sign % is added. For this you need the paste() function, which concatenates its arguments into character strings. In this example, we want to join the numbers and the percent sign:

paste(round(tab1[2, ]/apply(tab1,2,sum)*100,2),"%",sep="")
[1] "61.92%" "42.96%" "25.53%"

The first argument, round(tab1[2, ]/apply(tab1,2,sum)*100,2), is the numerical part of the percentage calculated above, which is the first piece to be joined;

The second argument, "%", is the second piece to be joined;

The third argument, sep="", is the separator placed between the two pieces. Here we don't want any separator, so nothing is written between the quotation marks "".
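Two related base R tools are worth knowing here, though neither appears in the article: paste0() is paste() with sep = "" built in, and sprintf() formats and appends the percent sign in one step. The sketch below starts from the survivor and total counts reported above, so it runs without the data file.

```r
# Counts taken from the tables shown earlier in the article
survivors <- c("1st" = 200, "2nd" = 119, "3rd" = 181)
totals    <- c("1st" = 323, "2nd" = 277, "3rd" = 709)
pct <- survivors / totals * 100

# paste0() is shorthand for paste(..., sep = "")
with_paste0 <- paste0(round(pct, 2), "%")
with_paste0

# sprintf(): "%.2f" formats with 2 decimals, "%%" is a literal percent sign
with_sprintf <- sprintf("%.2f%%", pct)
with_sprintf

# The two differ when the rounded value has trailing zeros
paste0(round(50, 2), "%")   # "50%"
sprintf("%.2f%%", 50)       # "50.00%"

# Both drop the class names; setNames() puts them back
labelled <- setNames(with_sprintf, names(pct))
labelled
```

For a report where every percentage should show the same number of decimals, the sprintf() form is usually the more predictable choice.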

 

Origin blog.csdn.net/yoggieCDA/article/details/108846126