US city populations and population growth, 2000-2013

Development environment: RStudio

Data sample: US city populations and their 2000-2013 population growth

Format: txt


1. Questions raised after a first look at the data:

  • In 2013, which five cities had the largest populations?
  • In 2013, which five cities had the highest population growth rates?
  • In 2013, which five cities had the most negative population growth rates?
  • What was the population of each city in 2000?

2. Data pre-processing and import

setwd("/Users/mac/Desktop/Data Sample")
Add <- read.csv("gistfile1 copy.csv", header = TRUE, stringsAsFactors = FALSE)

1) Remove the original row index column
mydataframe <- Add[2:5]

2) Convert to a data frame
mydataframe <- as.data.frame(mydataframe)


3) Data type conversion

First, check the data type of each column:

sapply(mydataframe, mode)
sapply(mydataframe, class)


The output shows that the population growth rate column, including its missing values, is stored as character data, so it has to be converted to numeric.

In addition, the growth rate is written as a percentage, and it needs to be converted to a decimal to simplify later calculations. First strip the percent sign from the values in the column, then divide by 100, as follows:

# Strip the percent sign, convert to numeric, then rescale to a decimal fraction
mydataframe$X2000.2013.growth <- sub("%", "", mydataframe$X2000.2013.growth, fixed = TRUE)
mydataframe$X2000.2013.growth <- as.numeric(mydataframe$X2000.2013.growth)
mydataframe$X2000.2013.growth <- mydataframe$X2000.2013.growth / 100
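The same steps can be tried on a small stand-in before touching the real file. The sketch below is only an illustration: the toy data, its column names and its values are made up and are not taken from the city data set.

```r
# Hypothetical toy data standing in for the real file (values are invented)
toy <- read.csv(text = "city,growth\nA,4.8%\nB,-3.1%\nC,",
                stringsAsFactors = FALSE)

# A column containing "%" cannot be parsed as a number, so it is read as character
sapply(toy, class)

# Remove the literal percent sign, convert, and rescale to a decimal fraction
growth_chr <- sub("%", "", toy$growth, fixed = TRUE)
toy$growth <- as.numeric(growth_chr) / 100

# The empty field for city C becomes NA after conversion
toy
```

Calling as.numeric() directly on the column is vectorized, so wrapping it in sapply() is not needed.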

After conversion, the growth column is numeric and expressed as a decimal fraction.

4) Remove rows with missing data
mydataframe <- na.omit(mydataframe)
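Before dropping rows it can help to see how many values are actually missing. A minimal sketch on made-up data (the toy data frame below is hypothetical):

```r
# Hypothetical toy data with one missing growth value
toy <- data.frame(city = c("A", "B", "C"),
                  growth = c(0.048, NA, -0.031),
                  stringsAsFactors = FALSE)

# Count missing values per column before removing anything
missing_per_column <- colSums(is.na(toy))

# na.omit() drops every row that contains at least one NA
cleaned <- na.omit(toy)

missing_per_column
nrow(toy)      # rows before
nrow(cleaned)  # rows after
```

na.omit() removes a whole row even if only one column is missing, so the row count before and after is worth comparing.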

5) Preview the cleaned data, for example with head(mydataframe)

3. Answering the questions

1) In 2013, which five cities had the largest populations?

newdata3 <- mydataframe[order(-mydataframe$population),]
newdata3[1:5,]
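The descending sort relies on order() applied to the negated column. A small sketch on invented data shows that this matches order(..., decreasing = TRUE), and that head() can take the top rows; the toy data frame is an assumption made for illustration only.

```r
# Hypothetical toy data, not the real city file
toy <- data.frame(city = c("A", "B", "C", "D"),
                  population = c(500, 9000, 120, 3000),
                  stringsAsFactors = FALSE)

# Two equivalent ways to sort a numeric column in descending order
by_minus <- toy[order(-toy$population), ]
by_flag  <- toy[order(toy$population, decreasing = TRUE), ]

# head() returns the first n rows, here the two largest cities
top2 <- head(by_flag, 2)
top2
```

The minus-sign trick only works for numeric columns, while decreasing = TRUE also works for character columns.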


The five most populous cities in 2013:

New York, Los Angeles, Chicago, Houston, Philadelphia

2) In 2013, which five cities had the highest population growth rates?

newdata <- mydataframe[order(-mydataframe$X2000.2013.growth),]
newdata[1:5,]


The five cities with the highest population growth rates in 2013:

Maricopa, Buckeye, Frisco, Lincoln, Surprise

* Maricopa's growth rate is far higher than that of any other city. Is this value an outlier or real data?

According to Wikipedia:

“Maricopa was officially incorporated as a city on October 15, 2003, becoming the 88th incorporated city in Arizona.
Between 2000 and 2010, the city's population grew from 1,040 residents to 43,482,
an increase of 4080%.[8]”

The record confirms that the city's population grew explosively between 2000 and 2010, so the data is genuine.
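As a quick sanity check, the percentage in the quote can be recomputed from the two population figures it gives. This is only arithmetic on the quoted numbers; the variable names are made up for the example.

```r
# Figures taken from the Wikipedia quote above
pop_2000 <- 1040
pop_2010 <- 43482

# Percentage increase = (new - old) / old * 100
growth_pct <- (pop_2010 - pop_2000) / pop_2000 * 100
growth_pct  # about 4081, in line with the quoted increase of roughly 4080%
```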

3) In 2013, which five cities had the most negative population growth rates?

newdata2 <- mydataframe[order(mydataframe$X2000.2013.growth),]
newdata2[1:5,]


4) What was the population of each city in 2000?

city <- mydataframe$city
population2013 <- mydataframe$population
population2000 <- round(mydataframe$population / (1 + mydataframe$X2000.2013.growth))

mydataframe2 <- data.frame(city, population2013, population2000)
mydataframe2

The output is a data frame containing each city's name and its 2013 and 2000 populations.
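The back-calculation rests on the relation population2013 = population2000 * (1 + growth), which rearranges to population2000 = population2013 / (1 + growth). A worked sketch with invented round numbers (not from the data set):

```r
# Hypothetical city: 110,000 residents in 2013 after 10% growth since 2000
population_2013 <- 110000
growth <- 0.10   # decimal fraction, i.e. 10%

# Invert population_2013 = population_2000 * (1 + growth)
population_2000 <- round(population_2013 / (1 + growth))
population_2000  # 100000
```

Because the growth rates in the source are rounded percentages, the 2000 figures obtained this way are estimates rather than exact census counts.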


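Finally, this exercise highlights a few points to watch for when analyzing data in R:

  • When importing data, check how each column is stored; numbers are sometimes stored as character by default.
  • Percentages must be converted to decimals before they are used in calculations.
  • Extreme values need to be checked to tell genuine data from errors.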

Origin www.cnblogs.com/zfkepic/p/12208084.html