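Preface
This is part of a series of notes from working through "Python for Data Analysis". This post covers the introductory part of Chapter 2.
Content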
I. Analyzing the usa.gov data
(1) Loading the data
import json

if __name__ == "__main__":
    # load data: the file holds one JSON record per line
    path = "../../datasets/bitly_usagov/example.txt"
    with open(path) as data:
        records = [json.loads(line) for line in data]
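To make the file layout concrete, here is a minimal sketch that parses a few lines in the same one-JSON-object-per-line format. The sample records and the name sample_lines are invented for illustration and are not taken from the real dataset. The time_zones list built at the end is the kind of input the counting code later in this post works on.

```python
import io
import json

# Hypothetical sample in the same layout: one JSON object per line.
# These records are invented for illustration only.
sample_lines = io.StringIO(
    '{"tz": "America/New_York", "a": "agent-1"}\n'
    '{"tz": "", "a": "agent-2"}\n'
    '{"a": "agent-3"}\n'
    '{"tz": "America/New_York", "a": "agent-4"}\n'
)

# Same list comprehension as above, applied to the in-memory sample.
records = [json.loads(line) for line in sample_lines]

# Not every record has a 'tz' field, so filter before extracting it.
time_zones = [rec["tz"] for rec in records if "tz" in rec]
print(time_zones)
```

Note that a record can be missing the tz field entirely or carry an empty string for it; the two cases are different and need to be handled separately.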
(2) Counting how often each time zone occurs in the data
1. The verbose way
# define the count function
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
2. The simple way
from collections import defaultdict

# a simpler version of the count function
def simple_get_counts(sequence):
    # missing keys default to int(), i.e. 0
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts
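The simple version is simpler because it uses defaultdict from collections. A key that is not yet in the dictionary is created with the default value int(), i.e. 0, the first time it is accessed, so the explicit check for whether the key is already present is no longer needed.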
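The following small sketch shows that behaviour in isolation. The input values are invented for illustration.

```python
from collections import defaultdict

counts = defaultdict(int)
# The dictionary starts out empty: nothing is pre-filled with zeros.
print(len(counts))

# Hypothetical input values, invented for illustration.
for tz in ["Asia/Tokyo", "Europe/London", "Asia/Tokyo"]:
    # On first access a missing key is created with int() == 0,
    # then incremented by the += 1.
    counts[tz] += 1

print(dict(counts))
```

One side effect worth knowing: merely reading a missing key, for example counts["unseen"], also inserts it with the value 0.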
(3) Getting the 10 most frequent time zones and their counts
1. The verbose way
# get the n time zones with the largest counts
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    # list.sort() sorts in ascending order, so the largest counts end up last
    value_key_pairs.sort()
    return value_key_pairs[-n:]
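# time_zones is not defined anywhere in this post; it is assumed here to be
# the 'tz' value of every record that has one
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
# get the top counts using the count function
counts = simple_get_counts(time_zones)
# store the result under a different name so the top_counts function is not overwritten
top_10 = top_counts(counts)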
2. The simple way
from collections import Counter

# a simpler version of top_counts
def simple_top_counts(timezone, n=10):
    counter_counts = Counter(timezone)
    return counter_counts.most_common(n)

simple_top_counts(time_zones)
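The simple version is simpler because it uses the Counter class that ships with collections. Counter tallies how often each distinct value occurs, so there is no need to compute counts first.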
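The two approaches do not return results in the same shape, which the sketch below illustrates on invented values: most_common gives (value, count) pairs with the largest count first, while the sort-and-slice approach gives (count, value) pairs in ascending order.

```python
from collections import Counter

# Hypothetical time zone values, invented for illustration.
time_zones = ["Asia/Tokyo", "Europe/London", "Asia/Tokyo", "", "Asia/Tokyo", ""]

counter = Counter(time_zones)
# (value, count) pairs, largest count first
top_two = counter.most_common(2)

# Manual equivalent in the style of top_counts:
# (count, value) pairs sorted ascending, then the last two taken.
pairs = sorted((count, tz) for tz, count in counter.items())
manual_top_two = pairs[-2:]

print(top_two)
print(manual_top_two)
```

Both results contain the same two time zones; only the pair order and the sort direction differ.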
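(4) Plotting the counts of each time zone as a bar chart with pylab and DataFrame
from pandas import DataFrame
import pylab as pyl

# use a DataFrame to show the counts of each time zone
def show_timezone_data(records):
    frame = DataFrame(records)
    # records without a 'tz' field become NaN; label them "Missing"
    clean_tz = frame['tz'].fillna("Missing")
    # records with an empty 'tz' string are labelled "Unknown"
    clean_tz[clean_tz == ''] = 'Unknown'
    # value_counts() sorts in descending order, so the first 10 are the most frequent
    tz_counts = clean_tz.value_counts()
    # pass the plot type by keyword; recent pandas versions reject it as a positional argument
    tz_counts[:10].plot(kind="barh", rot=0)
    pyl.show()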
The resulting plot is shown below: