Original link: http://tecdat.cn/?p=9800
Introduction
This article does not claim that R is better or faster than Python for data analysis; I personally use both languages every day. It simply offers an opportunity to compare the two.
The data used in this article is updated daily; my version of the file is 4.63 GB.
The CSV file contains New York City 311 complaints. It is the most popular data set on New York City's open data portal.
Data Workflow
install.packages("devtools")
library("devtools")
install_github("ropensci/plotly")
library(plotly)
We need to create an account to connect to the plotly API. Alternatively, you can just use the default ggplot2 graphics.
set_credentials_file("DemoAccount", "lr1c37zw81") ## Replace contents with your API Key
Data analysis in R with dplyr
This assumes that sqlite3 is installed (and is therefore accessible from the terminal).
$ sqlite3 data.db # Create your database
sqlite> .databases # Show databases to make sure it works
sqlite> .mode csv
sqlite> .import <filename> <tablename>
# Where filename is the name of the CSV file and tablename is the name of the new database table
sqlite> .quit
Load data into memory.
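library(readr)
library(data.table) # provides fread()
library(knitr) # provides kable()
# data.table, selecting a subset of columns
time_data.table <- system.time(fread('/users/ryankelly/NYC_data.csv',
  select = c('Agency', 'Created Date', 'Closed Date', 'Complaint Type', 'Descriptor', 'City'),
  showProgress = TRUE))
# time_data.table_full and time_readr (full-file reads) are timed the same way; that code is not shown here
kable(data.frame(rbind(time_data.table, time_data.table_full, time_readr)))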
| | user.self | sys.self | elapsed | user.child | sys.child |
---|---|---|---|---|---|
time_data.table | 63.588 | 1.952 | 65.633 | 0 | 0 |
time_data.table_full | 205.571 | 3.124 | 208.880 | 0 | 0 |
time_readr | 277.720 | 5.018 | 283.029 | 0 | 0 |
I will use data.table to read the data. Its fread function greatly improves reading speed.
About dplyr
By default, dplyr retrieves only the first 10 rows of a database query.
library(dplyr) ## Will be used for pandas replacement
# Connect to the database
db <- src_sqlite('/users/ryankelly/data.db')
db
The two main data-processing options (besides base R) are:
- data.table
- dplyr
Preview data
# Wrapped in a function for display purposes
head_ <- function(x, n = 5) kable(head(x, n))
head_(data)
Agency | CreatedDate | ClosedDate | ComplaintType | Descriptor | City |
---|---|---|---|---|---|
NYPD | 04/11/2015 02:13:04 AM | | Noise - Street/Sidewalk | Loud Music/Party | BROOKLYN |
DFTA | 04/11/2015 02:12:05 AM | | Senior Center Complaint | N/A | ELMHURST |
NYPD | 04/11/2015 02:11:46 AM | | Noise - Commercial | Loud Music/Party | JAMAICA |
NYPD | 04/11/2015 02:11:02 AM | | Noise - Street/Sidewalk | Loud Talking | BROOKLYN |
NYPD | 04/11/2015 02:10:45 AM | | Noise - Street/Sidewalk | Loud Music/Party | NEW YORK |
Select columns
ComplaintType | Descriptor | Agency |
---|---|---|
Noise - Street/Sidewalk | Loud Music/Party | NYPD |
Senior Center Complaint | N/A | DFTA |
Noise - Commercial | Loud Music/Party | NYPD |
Noise - Street/Sidewalk | Loud Talking | NYPD |
Noise - Street/Sidewalk | Loud Music/Party | NYPD |
ComplaintType | Descriptor | Agency |
---|---|---|
Noise - Street/Sidewalk | Loud Music/Party | NYPD |
Senior Center Complaint | N/A | DFTA |
Noise - Commercial | Loud Music/Party | NYPD |
Noise - Street/Sidewalk | Loud Talking | NYPD |
Noise - Street/Sidewalk | Loud Music/Party | NYPD |
Noise - Street/Sidewalk | Loud Talking | NYPD |
Noise - Commercial | Loud Music/Party | NYPD |
HPD Literature Request | The ABCs of Housing - Spanish | HPD |
Noise - Street/Sidewalk | Loud Talking | NYPD |
Street Condition | Plate Condition - Noisy | DOT |
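The code that produced the two tables above is not included in this copy. A minimal sketch of a column selection with dplyr, run on a small in-memory data frame instead of the SQLite table, might look like the following. The `complaints` data frame and its rows are hypothetical stand-ins modeled on the preview output, not the real data.

```r
library(dplyr)

# Hypothetical in-memory stand-in for the 311 table (assumption: not the real data)
complaints <- data.frame(
  Agency        = c("NYPD", "DFTA", "NYPD", "HPD"),
  ComplaintType = c("Noise - Street/Sidewalk", "Senior Center Complaint",
                    "Noise - Commercial", "HPD Literature Request"),
  Descriptor    = c("Loud Music/Party", "N/A", "Loud Music/Party",
                    "The ABCs of Housing - Spanish"),
  City          = c("BROOKLYN", "ELMHURST", "JAMAICA", "NEW YORK"),
  stringsAsFactors = FALSE
)

# SQL: SELECT ComplaintType, Descriptor, Agency FROM complaints
selected <- complaints %>% select(ComplaintType, Descriptor, Agency)

# SQL: ... LIMIT 2 (head() controls how many rows are returned)
first_two <- head(selected, 2)
print(first_two)
```

The same verbs should apply to a `tbl` backed by the SQLite database, where dplyr translates them to SQL instead of running them in memory.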
Filter rows with WHERE
ComplaintType | Descriptor | Agency |
---|---|---|
Noise - Street/Sidewalk | Loud Music/Party | NYPD |
Noise - Commercial | Loud Music/Party | NYPD |
Noise - Street/Sidewalk | Loud Talking | NYPD |
Noise - Street/Sidewalk | Loud Music/Party | NYPD |
Noise - Street/Sidewalk | Loud Talking | NYPD |
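The table above contains only NYPD rows, so the missing query was presumably a filter on `Agency`. A hedged sketch with dplyr's `filter()`, using a hypothetical in-memory `complaints` data frame, could be:

```r
library(dplyr)

# Hypothetical stand-in data (assumption: not the real 311 table)
complaints <- data.frame(
  Agency        = c("NYPD", "DFTA", "NYPD"),
  ComplaintType = c("Noise - Street/Sidewalk", "Senior Center Complaint",
                    "Noise - Commercial"),
  Descriptor    = c("Loud Music/Party", "N/A", "Loud Music/Party"),
  stringsAsFactors = FALSE
)

# SQL: SELECT ComplaintType, Descriptor, Agency FROM complaints WHERE Agency = 'NYPD'
nypd <- complaints %>%
  select(ComplaintType, Descriptor, Agency) %>%
  filter(Agency == "NYPD")
print(nypd)
```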
Filter multiple values in a column with WHERE ... IN
ComplaintType | Descriptor | Agency |
---|---|---|
Noise - Street/Sidewalk | Loud Music/Party | NYPD |
Noise - Commercial | Loud Music/Party | NYPD |
Noise - Street/Sidewalk | Loud Talking | NYPD |
Noise - Street/Sidewalk | Loud Music/Party | NYPD |
Noise - Street/Sidewalk | Loud Talking | NYPD |
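In dplyr, the usual counterpart of SQL's `IN` is the `%in%` operator inside `filter()`. The sketch below assumes a hypothetical `complaints` data frame, and the pair of agencies was chosen only for illustration:

```r
library(dplyr)

# Hypothetical stand-in data (assumption: not the real 311 table)
complaints <- data.frame(
  Agency        = c("NYPD", "DFTA", "DOT", "HPD"),
  ComplaintType = c("Noise - Street/Sidewalk", "Senior Center Complaint",
                    "Street Condition", "HPD Literature Request"),
  stringsAsFactors = FALSE
)

# SQL: SELECT * FROM complaints WHERE Agency IN ('NYPD', 'DOT')
nypd_or_dot <- complaints %>% filter(Agency %in% c("NYPD", "DOT"))
print(nypd_or_dot)
```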
Find unique values in a column with DISTINCT
## City
## 1 BROOKLYN
## 2 ELMHURST
## 3 JAMAICA
## 4 NEW YORK
## 5
## 6 BAYSIDE
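A listing like the one above can be produced with `distinct()`. This is a sketch on hypothetical in-memory data; the empty string stands in for the blank city name that appears in the output.

```r
library(dplyr)

# Hypothetical stand-in data, including a blank city name (assumption)
complaints <- data.frame(
  City = c("BROOKLYN", "ELMHURST", "BROOKLYN", "", "JAMAICA"),
  stringsAsFactors = FALSE
)

# SQL: SELECT DISTINCT City FROM complaints
cities <- complaints %>% select(City) %>% distinct()
print(cities)
```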
Query value counts with COUNT(*) and GROUP BY
# dt[, .(No.Complaints = .N), Agency]
# setkey(dt, No.Complaints) # setkey indexes the data
q <- data %>% select(Agency) %>% group_by(Agency) %>% summarise(No.Complaints = n())
head_(q)
Agency | No.Complaints |
---|---|
3-1-1 | 22499 |
ACS | 3 |
AJC | 7 |
ART | 3 |
CAU | 8 |
Sort the results with ORDER BY
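No code or output for this step is included here. As a sketch, sorting grouped counts in descending order with dplyr's `arrange()` and `desc()` might look like this; the `complaints` data frame is hypothetical:

```r
library(dplyr)

# Hypothetical stand-in data (assumption: not the real 311 table)
complaints <- data.frame(
  Agency = c("NYPD", "DOT", "NYPD", "HPD", "NYPD", "DOT"),
  stringsAsFactors = FALSE
)

# SQL: SELECT Agency, COUNT(*) AS "No.Complaints" FROM complaints
#      GROUP BY Agency ORDER BY "No.Complaints" DESC
counts <- complaints %>%
  group_by(Agency) %>%
  summarise(No.Complaints = n()) %>%
  arrange(desc(No.Complaints))
print(counts)
```

Appending `head(10)` to such a pipeline would keep only the ten largest groups.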
How many cities are in the database?
# dt[, unique(City)]
q <- data %>% select(City) %>% distinct() %>% summarise(Number.of.Cities = n())
head(q)
## Number.of.Cities
## 1 1818
Let's plot the 10 cities with the most complaints
City | No.Complaints |
---|---|
BROOKLYN | 2671085 |
NEW YORK | 1692514 |
BRONX | 1624292 |
| | 766378 |
STATEN ISLAND | 437395 |
JAMAICA | 147133 |
FLUSHING | 117669 |
ASTORIA | 90570 |
Jamaica | 67083 |
RIDGEWOOD | 66411 |
- Use UPPER to normalize the case of the CITY values.
CITY | No.Complaints |
---|---|
BROOKLYN | 2671085 |
NEW YORK | 1692514 |
BRONX | 1624292 |
| | 766378 |
STATEN ISLAND | 437395 |
JAMAICA | 147133 |
FLUSHING | 117669 |
ASTORIA | 90570 |
JAMAICA | 67083 |
RIDGEWOOD | 66411 |
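In the table above JAMAICA still appears twice, which may mean the counts were computed before the case conversion. A sketch that upper-cases first and groups afterwards, so that "Jamaica" and "JAMAICA" are merged, could look like this. The data frame is hypothetical, and `toupper()` is the in-memory R counterpart of SQL's `UPPER`.

```r
library(dplyr)

# Hypothetical stand-in data with mixed-case city names (assumption)
complaints <- data.frame(
  City = c("JAMAICA", "Jamaica", "BROOKLYN", "Brooklyn", "BROOKLYN"),
  stringsAsFactors = FALSE
)

by_city <- complaints %>%
  mutate(CITY = toupper(City)) %>%    # normalize case before grouping
  group_by(CITY) %>%
  summarise(No.Complaints = n()) %>%
  arrange(desc(No.Complaints))
print(by_city)
```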
Complaint Type (by city)
# Plot result
plt <- ggplot(q_f, aes(ComplaintType, No.Complaints, fill = CITY)) +
geom_bar(stat = 'identity') +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
plt
Part 2: Time series operations
The dates in the provided data are not in SQLite's standard date format.
One fix is to create a new column in the SQL database and reinsert the data with a formatted date statement; another is to create a new table and insert the original columns with reformatted dates.
Filter rows by timestamp strings in SQLite format: YYYY-MM-DD hh:mm:ss
# dt[CreatedDate < '2014-11-26 23:47:00' & CreatedDate > '2014-09-16 23:45:00',
# .(ComplaintType, CreatedDate, City)]
q <- data %>% filter(CreatedDate < "2014-11-26 23:47:00", CreatedDate > "2014-09-16 23:45:00") %>%
select(ComplaintType, CreatedDate, City)
head_(q)
ComplaintType | CreatedDate | City |
---|---|---|
Noise - Street/Sidewalk | 2014-11-12 11:59:56 | BRONX |
Taxi Complaint | 2014-11-12 11:59:40 | BROOKLYN |
Noise - Commercial | 2014-11-12 11:58:53 | BROOKLYN |
Noise - Commercial | 2014-11-12 11:58:26 | NEW YORK |
Noise - Street/Sidewalk | 2014-11-12 11:58:14 | NEW YORK |
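The string filter above works because text in the form YYYY-MM-DD hh:mm:ss sorts in chronological order when compared as plain strings. A small base R sketch of that property, using made-up timestamps, is shown below.

```r
# Timestamps stored as text in YYYY-MM-DD hh:mm:ss form (made-up values)
created <- c("2014-11-12 11:59:56", "2014-09-01 08:00:00", "2015-01-02 10:30:00")

# Plain string comparison matches chronological order for this format
in_window <- created[created < "2014-11-26 23:47:00" &
                     created > "2014-09-16 23:45:00"]
print(in_window)

# The same idea breaks for MM/DD/YYYY text: the month is compared before the year
us_style <- c("11/12/2014", "01/02/2015")
earlier_sorts_first <- us_style[1] < us_style[2]
print(earlier_sorts_first)
```

This is why the dates need to be reformatted before range filters like the one above can be trusted.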
Use strftime to pull the hour from the timestamp
# dt[, hour := strftime('%H', CreatedDate), .(ComplaintType, CreatedDate, City)]
q <- data %>% mutate(hour = strftime('%H', CreatedDate)) %>%
select(ComplaintType, CreatedDate, City, hour)
head_(q)
ComplaintType | CreatedDate | City | hour |
---|---|---|---|
Noise - Street/Sidewalk | 2015-11-04 02:13:04 | BROOKLYN | 02 |
Senior Center Complaint | 2015-11-04 02:12:05 | ELMHURST | 02 |
Noise - Commercial | 2015-11-04 02:11:46 | JAMAICA | 02 |
Noise - Street/Sidewalk | 2015-11-04 02:11:02 | BROOKLYN | 02 |
Noise - Street/Sidewalk | 2015-11-04 02:10:45 | NEW YORK | 02 |
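In the query above, `strftime('%H', CreatedDate)` is passed through to SQLite, whose `strftime` takes the format string first. R's own `strftime()` takes the date first, so the same call would not work on an in-memory data frame. A comparable base R sketch, using made-up timestamps and assuming UTC, might be:

```r
# Made-up timestamps in the reformatted YYYY-MM-DD hh:mm:ss form
created <- c("2015-11-04 02:13:04", "2015-11-04 14:05:47")

# Parse as UTC so the result does not depend on the local time zone (assumption)
parsed <- as.POSIXct(created, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Extract the two-digit hour as text, like SQLite's strftime('%H', ...)
hour <- format(parsed, "%H")
print(hour)
```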
Aggregating the time series
First, create a new column with the timestamp rounded down to the previous 15-minute interval.
# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(900, 'seconds')][, .(CreatedDate, interval)]
q <- data %>%
mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 900) * 900, 'unixepoch')")) %>%
select(CreatedDate, interval)
head_(q, 10)
CreatedDate | interval |
---|---|
2015-11-04 02:13:04 | 2015-11-04 02:00:00 |
2015-11-04 02:12:05 | 2015-11-04 02:00:00 |
2015-11-04 02:11:46 | 2015-11-04 02:00:00 |
2015-11-04 02:11:02 | 2015-11-04 02:00:00 |
2015-11-04 02:10:45 | 2015-11-04 02:00:00 |
2015-11-04 02:09:07 | 2015-11-04 02:00:00 |
2015-11-04 02:05:47 | 2015-11-04 02:00:00 |
2015-11-04 02:03:43 | 2015-11-04 02:00:00 |
2015-11-04 02:03:29 | 2015-11-04 02:00:00 |
2015-11-04 02:02:17 | 2015-11-04 02:00:00 |
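The SQL expression above floors each timestamp by converting it to epoch seconds, integer-dividing by 900 (15 minutes), and multiplying back. The same arithmetic can be sketched in base R for in-memory data; the timestamps are made up and UTC is assumed.

```r
# Made-up timestamps in YYYY-MM-DD hh:mm:ss form
created <- c("2015-11-04 02:13:04", "2015-11-04 02:47:59")
parsed  <- as.POSIXct(created, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Seconds since 1970-01-01, floored to a multiple of 900 s (15 minutes)
secs         <- as.numeric(parsed)
floored_secs <- (secs %/% 900) * 900

# Convert back to a timestamp string
interval <- format(
  as.POSIXct(floored_secs, origin = "1970-01-01", tz = "UTC"),
  "%Y-%m-%d %H:%M:%S"
)
print(interval)
```

Grouping by such an `interval` column and counting rows would then give one value per 15-minute bucket.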
Plot the results for 2003