Big Data Analysis in R: Statistical Visualization and Time Series Analysis of New York City 311 Complaints

Original link: http://tecdat.cn/?p=9800

 


 

Introduction

 

This article does not claim that R is better or faster than Python for data analysis; I use both languages every day. It simply offers an opportunity to compare the two.

The data used in this article is updated daily; my version of the file is 4.63 GB.


The CSV file contains New York City's 311 complaints. It is the most popular data set on New York City's open data portal.

 

Data Workflow

 

install.packages("devtools")
library("devtools")
install_github("ropensci/plotly")
library(plotly)

We need to create an account to connect to the plotly API. Alternatively, you can just use the default ggplot2 graphics.

 
set_credentials_file("DemoAccount", "lr1c37zw81") ## Replace contents with your API Key

 

 

Analysis with dplyr in R

 

This assumes that sqlite3 is installed and therefore accessible from the terminal.

$ sqlite3 data.db          # Create your database
sqlite> .databases         # Show databases to make sure it works
sqlite> .mode csv
sqlite> .import <filename> <tablename>
# Where filename is the name of the csv & tablename is the name of the new database table
sqlite> .quit

Load data into memory.

 
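library(data.table)  # fread
library(readr)       # read_csv
library(knitr)       # kable
path <- '/users/ryankelly/NYC_data.csv'
# data.table, selecting a subset of columns
time_data.table <- system.time(fread(path, 
                   select = c('Agency', 'Created Date','Closed Date', 'Complaint Type', 'Descriptor', 'City'), 
                   showProgress = TRUE))
# The two comparison timings are not shown in the source; these calls are assumed
time_data.table_full <- system.time(fread(path, showProgress = TRUE))
time_readr <- system.time(read_csv(path))
kable(data.frame(rbind(time_data.table, time_data.table_full, time_readr)))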
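|                     | user.self| sys.self| elapsed| user.child| sys.child|
|:--------------------|---------:|--------:|-------:|----------:|---------:|
|time_data.table      |    63.588|    1.952|  65.633|          0|         0|
|time_data.table_full |   205.571|    3.124| 208.880|          0|         0|
|time_readr           |   277.720|    5.018| 283.029|          0|         0|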

I will use data.table to read the data; its fread function greatly improves reading speed.

About dplyr

 

By default, dplyr fetches only the first 10 rows of a database query.

library(dplyr)      ## Will be used for pandas replacement

# Connect to the database
db <- src_sqlite('/users/ryankelly/data.db')
db
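Newer dplyr releases deprecate src_sqlite() in favor of a DBI connection. The following is a minimal sketch of that route, assuming the DBI, RSQLite and dbplyr packages are installed; the in-memory database, the table name "complaints" and its rows are hypothetical stand-ins for data.db. It also shows that a query stays lazy until collect() is called.

```r
library(dplyr)
library(DBI)

# In-memory SQLite database standing in for data.db (hypothetical toy data)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "complaints", data.frame(
  Agency = c("NYPD", "DFTA", "NYPD"),
  City   = c("BROOKLYN", "ELMHURST", "JAMAICA"),
  stringsAsFactors = FALSE
))

# tbl() returns a lazy reference: no rows are pulled into R yet
data <- tbl(con, "complaints")

q <- data %>% filter(Agency == "NYPD") %>% select(City)
show_query(q)          # prints the SQL that dplyr generated
result <- collect(q)   # runs the query and returns a local tibble

dbDisconnect(con)
```

Printing q without collect() would show only a preview of the result, which is the behavior described above.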

 

The two data-processing options (besides base R) are:

  • data.table
  • dplyr

Preview data

 

# Wrapped in a function for display purposes
head_ <- function(x, n = 5) kable(head(x, n))

head_(data)
|Agency |CreatedDate            |ClosedDate |ComplaintType           |Descriptor       |City     |
|:------|:----------------------|:----------|:-----------------------|:----------------|:--------|
|NYPD   |04/11/2015 02:13:04 AM |           |Noise - Street/Sidewalk |Loud Music/Party |BROOKLYN |
|DFTA   |04/11/2015 02:12:05 AM |           |Senior Center Complaint |N/A              |ELMHURST |
|NYPD   |04/11/2015 02:11:46 AM |           |Noise - Commercial      |Loud Music/Party |JAMAICA  |
|NYPD   |04/11/2015 02:11:02 AM |           |Noise - Street/Sidewalk |Loud Talking     |BROOKLYN |
|NYPD   |04/11/2015 02:10:45 AM |           |Noise - Street/Sidewalk |Loud Music/Party |NEW YORK |

 

Select columns

 
|ComplaintType           |Descriptor       |Agency |
|:-----------------------|:----------------|:------|
|Noise - Street/Sidewalk |Loud Music/Party |NYPD   |
|Senior Center Complaint |N/A              |DFTA   |
|Noise - Commercial      |Loud Music/Party |NYPD   |
|Noise - Street/Sidewalk |Loud Talking     |NYPD   |
|Noise - Street/Sidewalk |Loud Music/Party |NYPD   |
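The code for this step is not shown in the text. A column selection like the one above can be sketched with dplyr's select(); the toy data frame below is a hypothetical stand-in for the full complaints table.

```r
library(dplyr)

# Toy stand-in for the complaints table (hypothetical rows)
data <- data.frame(
  Agency        = c("NYPD", "DFTA", "NYPD"),
  ComplaintType = c("Noise - Street/Sidewalk", "Senior Center Complaint", "Noise - Commercial"),
  Descriptor    = c("Loud Music/Party", "N/A", "Loud Music/Party"),
  City          = c("BROOKLYN", "ELMHURST", "JAMAICA"),
  stringsAsFactors = FALSE
)

# Equivalent of: SELECT ComplaintType, Descriptor, Agency FROM data
q <- data %>% select(ComplaintType, Descriptor, Agency)
head(q)
```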

 

 

 
|ComplaintType           |Descriptor                    |Agency |
|:-----------------------|:-----------------------------|:------|
|Noise - Street/Sidewalk |Loud Music/Party              |NYPD   |
|Senior Center Complaint |N/A                           |DFTA   |
|Noise - Commercial      |Loud Music/Party              |NYPD   |
|Noise - Street/Sidewalk |Loud Talking                  |NYPD   |
|Noise - Street/Sidewalk |Loud Music/Party              |NYPD   |
|Noise - Street/Sidewalk |Loud Talking                  |NYPD   |
|Noise - Commercial      |Loud Music/Party              |NYPD   |
|HPD Literature Request  |The ABCs of Housing - Spanish |HPD    |
|Noise - Street/Sidewalk |Loud Talking                  |NYPD   |
|Street Condition        |Plate Condition - Noisy       |DOT    |

 

Filter rows with WHERE

 
|ComplaintType           |Descriptor       |Agency |
|:-----------------------|:----------------|:------|
|Noise - Street/Sidewalk |Loud Music/Party |NYPD   |
|Noise - Commercial      |Loud Music/Party |NYPD   |
|Noise - Street/Sidewalk |Loud Talking     |NYPD   |
|Noise - Street/Sidewalk |Loud Music/Party |NYPD   |
|Noise - Street/Sidewalk |Loud Talking     |NYPD   |
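In dplyr, the counterpart of a SQL WHERE clause is filter(). The sketch below assumes the filter was on Agency == "NYPD", which is consistent with the output above but not confirmed by the text; the data frame is a hypothetical stand-in.

```r
library(dplyr)

# Toy stand-in for the complaints table (hypothetical rows)
data <- data.frame(
  ComplaintType = c("Noise - Street/Sidewalk", "Senior Center Complaint", "Noise - Commercial"),
  Descriptor    = c("Loud Music/Party", "N/A", "Loud Music/Party"),
  Agency        = c("NYPD", "DFTA", "NYPD"),
  stringsAsFactors = FALSE
)

# Equivalent of: SELECT ... FROM data WHERE Agency = 'NYPD'
q <- data %>%
  filter(Agency == "NYPD") %>%
  select(ComplaintType, Descriptor, Agency)
head(q)
```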

 

Filter on multiple columns with WHERE and IN

 
|ComplaintType           |Descriptor       |Agency |
|:-----------------------|:----------------|:------|
|Noise - Street/Sidewalk |Loud Music/Party |NYPD   |
|Noise - Commercial      |Loud Music/Party |NYPD   |
|Noise - Street/Sidewalk |Loud Talking     |NYPD   |
|Noise - Street/Sidewalk |Loud Music/Party |NYPD   |
|Noise - Street/Sidewalk |Loud Talking     |NYPD   |
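SQL's IN maps to R's %in% operator, and several conditions passed to filter() are combined with AND. The values used below ("NYPD", "DOT", "BROOKLYN") are illustrative assumptions, not the conditions used for the output above.

```r
library(dplyr)

# Toy stand-in for the complaints table (hypothetical rows)
data <- data.frame(
  ComplaintType = c("Noise - Commercial", "Street Condition", "Noise - Commercial", "HPD Literature Request"),
  Agency        = c("NYPD", "DOT", "NYPD", "HPD"),
  City          = c("BROOKLYN", "BROOKLYN", "JAMAICA", "BROOKLYN"),
  stringsAsFactors = FALSE
)

# Equivalent of: WHERE Agency IN ('NYPD', 'DOT') AND City = 'BROOKLYN'
q <- data %>%
  filter(Agency %in% c("NYPD", "DOT"), City == "BROOKLYN")
head(q)
```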

 

Find unique values in a column with DISTINCT

##       City
## 1 BROOKLYN
## 2 ELMHURST
## 3  JAMAICA
## 4 NEW YORK
## 5         
## 6  BAYSIDE

 

Count values with COUNT(*) and GROUP BY

 
# dt[, .(No.Complaints = .N), Agency]
# setkey(dt, No.Complaints) # setkey indexes the data

q <- data %>% select(Agency) %>% group_by(Agency) %>% summarise(No.Complaints = n())
head_(q)
|Agency | No.Complaints|
|:------|-------------:|
|3-1-1  |         22499|
|ACS    |             3|
|AJC    |             7|
|ART    |             3|
|CAU    |             8|

 

Sort the results with ORDER BY
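Neither code nor output survives for this step. A minimal sketch of sorting grouped counts in descending order uses arrange() with desc(); the agencies and counts below are hypothetical.

```r
library(dplyr)

# Toy stand-in for the Agency column (hypothetical rows)
data <- data.frame(
  Agency = c("NYPD", "DFTA", "NYPD", "HPD", "NYPD", "HPD"),
  stringsAsFactors = FALSE
)

# Equivalent of:
#   SELECT Agency, COUNT(*) AS "No.Complaints" FROM data
#   GROUP BY Agency ORDER BY "No.Complaints" DESC
q <- data %>%
  group_by(Agency) %>%
  summarise(No.Complaints = n()) %>%
  arrange(desc(No.Complaints))
q
```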

 

 

 

How many cities are in the database?

# dt[, unique(City)]

q <- data %>% select(City) %>% distinct() %>% summarise(Number.of.Cities = n())
head(q)
##   Number.of.Cities
## 1             1818

Let's plot the 10 cities with the most complaints.

 

|City          | No.Complaints|
|:-------------|-------------:|
|BROOKLYN      |       2671085|
|NEW YORK      |       1692514|
|BRONX         |       1624292|
|              |        766378|
|STATEN ISLAND |        437395|
|JAMAICA       |        147133|
|FLUSHING      |        117669|
|ASTORIA       |         90570|
|Jamaica       |         67083|
|RIDGEWOOD     |         66411|

 

 

  • Use UPPER to convert CITY to a consistent case.
|CITY          | No.Complaints|
|:-------------|-------------:|
|BROOKLYN      |       2671085|
|NEW YORK      |       1692514|
|BRONX         |       1624292|
|              |        766378|
|STATEN ISLAND |        437395|
|JAMAICA       |        147133|
|FLUSHING      |        117669|
|ASTORIA       |         90570|
|JAMAICA       |         67083|
|RIDGEWOOD     |         66411|
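The table above still lists JAMAICA twice, which suggests the case conversion was applied after the rows were grouped. The sketch below shows one way to merge such duplicates by normalizing the case before grouping; it uses R's toupper() on a hypothetical local data frame rather than SQL's UPPER on the database.

```r
library(dplyr)

# Toy stand-in with inconsistent capitalization (hypothetical rows)
data <- data.frame(
  City = c("JAMAICA", "Jamaica", "BROOKLYN", "BROOKLYN", "Brooklyn", "ASTORIA"),
  stringsAsFactors = FALSE
)

q <- data %>%
  mutate(CITY = toupper(City)) %>%      # normalize case first
  group_by(CITY) %>%                    # then group on the normalized value
  summarise(No.Complaints = n()) %>%
  arrange(desc(No.Complaints)) %>%
  head(10)                              # keep the 10 largest groups
q
```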

 

Complaint Type (by city)


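# Plot result; q_f holds complaint counts per ComplaintType and CITY (its query is not shown here)
library(ggplot2)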
plt <- ggplot(q_f, aes(ComplaintType, No.Complaints, fill = CITY)) + 
            geom_bar(stat = 'identity') + 
            theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

plt

 

Part 2: Time series

The dates in the provided data are not in the standard date format that SQLite expects.

To work around this, create a new table in the SQL database with the required column names, then re-insert the data using a statement that reformats the original dates.
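The reformatting statement itself is not shown. As an illustration of the conversion only, here is a minimal sketch in base R that turns a raw 12-hour timestamp into the YYYY-MM-DD hh:mm:ss form. It assumes the raw values are month/day/year (swap %m and %d if the data is day-first), and the sample strings are hypothetical.

```r
# Hypothetical raw timestamps in the CSV's 12-hour format
raw <- c("12/25/2014 11:59:56 PM", "09/17/2014 08:05:00 AM")

# Parse as month/day/year with an AM/PM marker.
# %p relies on the locale defining AM/PM strings (true for C and English locales).
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Format as the sortable, SQLite-friendly string
iso <- format(parsed, "%Y-%m-%d %H:%M:%S")
iso
```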

Filter rows by timestamp, using SQLite's string format: YYYY-MM-DD hh:mm:ss

# dt[CreatedDate < '2014-11-26 23:47:00' & CreatedDate > '2014-09-16 23:45:00', 
#      .(ComplaintType, CreatedDate, City)]

q <- data %>% filter(CreatedDate < "2014-11-26 23:47:00",   CreatedDate > "2014-09-16 23:45:00") %>%
    select(ComplaintType, CreatedDate, City)

head_(q)
|ComplaintType           |CreatedDate         |City     |
|:-----------------------|:-------------------|:--------|
|Noise - Street/Sidewalk |2014-11-12 11:59:56 |BRONX    |
|Taxi Complaint          |2014-11-12 11:59:40 |BROOKLYN |
|Noise - Commercial      |2014-11-12 11:58:53 |BROOKLYN |
|Noise - Commercial      |2014-11-12 11:58:26 |NEW YORK |
|Noise - Street/Sidewalk |2014-11-12 11:58:14 |NEW YORK |
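This filter works on plain strings because timestamps in YYYY-MM-DD hh:mm:ss form sort in the same order as the moments they represent. A small sketch on a hypothetical local data frame illustrates the idea:

```r
library(dplyr)

# Toy stand-in; CreatedDate is stored as text, as in the SQLite table
data <- data.frame(
  ComplaintType = c("Taxi Complaint", "Noise - Commercial", "Noise - Street/Sidewalk"),
  CreatedDate   = c("2014-11-12 11:59:40", "2014-08-01 09:00:00", "2014-12-01 10:30:00"),
  stringsAsFactors = FALSE
)

# String comparison is enough: zero-padded, most-significant-first fields
# make text order match chronological order
q <- data %>%
  filter(CreatedDate > "2014-09-16 23:45:00",
         CreatedDate < "2014-11-26 23:47:00")
q
```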

 

Use strftime to extract the hour from the timestamp

# dt[, hour := strftime('%H', CreatedDate), .(ComplaintType, CreatedDate, City)]

q <- data %>% mutate(hour = strftime('%H', CreatedDate)) %>% 
            select(ComplaintType, CreatedDate, City, hour)

head_(q)

 

|ComplaintType           |CreatedDate         |City     |hour |
|:-----------------------|:-------------------|:--------|:----|
|Noise - Street/Sidewalk |2015-11-04 02:13:04 |BROOKLYN |02   |
|Senior Center Complaint |2015-11-04 02:12:05 |ELMHURST |02   |
|Noise - Commercial      |2015-11-04 02:11:46 |JAMAICA  |02   |
|Noise - Street/Sidewalk |2015-11-04 02:11:02 |BROOKLYN |02   |
|Noise - Street/Sidewalk |2015-11-04 02:10:45 |NEW YORK |02   |
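Note that the strftime('%H', CreatedDate) call above follows SQLite's argument order (format first), because dplyr passes it through to the database. R's own strftime() and format() take the time value first. A sketch of the same extraction on a hypothetical local data frame:

```r
library(dplyr)

# Toy stand-in with text timestamps (hypothetical rows)
data <- data.frame(
  CreatedDate = c("2015-11-04 02:13:04", "2014-11-12 11:59:56", "2014-11-12 23:05:00"),
  stringsAsFactors = FALSE
)

# Parse the text, then format only the hour component as a two-digit string
q <- data %>%
  mutate(hour = format(as.POSIXct(CreatedDate, tz = "UTC"), "%H"))
q
```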

 

Aggregate the time series

First, create a new column holding the timestamp rounded down to the previous 15-minute interval.

# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(900, 'seconds')][, .(CreatedDate, interval)]

q <- data %>% 
     mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 900) * 900, 'unixepoch')")) %>%                     
     select(CreatedDate, interval)

head_(q, 10)
|CreatedDate         |interval            |
|:-------------------|:-------------------|
|2015-11-04 02:13:04 |2015-11-04 02:00:00 |
|2015-11-04 02:12:05 |2015-11-04 02:00:00 |
|2015-11-04 02:11:46 |2015-11-04 02:00:00 |
|2015-11-04 02:11:02 |2015-11-04 02:00:00 |
|2015-11-04 02:10:45 |2015-11-04 02:00:00 |
|2015-11-04 02:09:07 |2015-11-04 02:00:00 |
|2015-11-04 02:05:47 |2015-11-04 02:00:00 |
|2015-11-04 02:03:43 |2015-11-04 02:00:00 |
|2015-11-04 02:03:29 |2015-11-04 02:00:00 |
|2015-11-04 02:02:17 |2015-11-04 02:00:00 |
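The SQL expression above divides the Unix timestamp by 900 seconds, discards the remainder, and multiplies back. The same arithmetic can be sketched in R on a hypothetical local data frame, followed by a count per interval, which is the aggregation a time series plot would need:

```r
library(dplyr)

# Toy stand-in with text timestamps (hypothetical rows)
data <- data.frame(
  CreatedDate = c("2015-11-04 02:13:04", "2015-11-04 02:16:30",
                  "2015-11-04 02:29:59", "2015-11-04 02:45:00"),
  stringsAsFactors = FALSE
)

q <- data %>%
  mutate(
    # seconds since 1970-01-01 UTC
    secs = as.numeric(as.POSIXct(CreatedDate, tz = "UTC")),
    # floor to a multiple of 900 s (15 minutes) and format back to text
    interval = format(
      as.POSIXct(floor(secs / 900) * 900, origin = "1970-01-01", tz = "UTC"),
      "%Y-%m-%d %H:%M:%S"
    )
  ) %>%
  group_by(interval) %>%
  summarise(No.Complaints = n())
q
```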

 

Plot the results for 2003


Origin blog.csdn.net/qq_19600291/article/details/103699296