Python for Data Analysis: Introduction

Preface

This post is part of a series of study notes on the book Python for Data Analysis. It covers the introductory examples in Chapter 2.

Contents

I. Analyzing the usa.gov data

(1) Loading the data

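import json

if __name__ == "__main__":
    # load the data: each line of the file is parsed as one JSON record
    path = "../../datasets/bitly_usagov/example.txt"
    with open(path) as data:
        records = [json.loads(line) for line in data]
    # time zones used by the snippets below; some records have no 'tz' field
    time_zones = [rec['tz'] for rec in records if 'tz' in rec]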
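
The loading code parses every line on its own, which assumes the file holds one JSON object per line. A minimal sketch of that pattern on an in-memory stand-in for the file (the sample records are hypothetical, not taken from example.txt):

```python
import io
import json

# Hypothetical stand-in for example.txt: one JSON object per line.
sample = io.StringIO(
    '{"tz": "America/New_York", "c": "US"}\n'
    '{"tz": "", "c": "US"}\n'
    '{"c": "BR"}\n'
)

# Parse each line independently, as the loading code does.
sample_records = [json.loads(line) for line in sample]

# A record may have no "tz" key at all, so filter before extracting it.
sample_time_zones = [rec["tz"] for rec in sample_records if "tz" in rec]
print(sample_time_zones)  # ['America/New_York', '']
```

Note that an empty string is still a value: the second record is kept, while the third one, which has no "tz" key, is skipped.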
 

(2) Counting how often each time zone occurs

1. The verbose way

# define the count function
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

2. The simple way

 
 
from collections import defaultdict

# a simpler version of the count function
def simple_get_counts(sequence):
    # defaultdict(int) gives every missing key an initial value of 0
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

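This version is simpler because it uses defaultdict. With int as the default factory, any key that is not yet in the dictionary is created with the value 0 the first time it is accessed, so there is no need to check whether the key is already present.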
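
To see the default at work, here is a minimal sketch on a hand-made list (the sample values are hypothetical). It checks that the defaultdict tally matches one built with an explicit membership test.

```python
from collections import defaultdict

# Hypothetical sample of time zone strings.
sample = ["Asia/Tokyo", "", "Asia/Tokyo", "Europe/London", ""]

counts = defaultdict(int)
# Accessing a missing key calls int(), which returns 0, and stores it.
print(counts["Asia/Tokyo"])  # 0

counts.clear()
for tz in sample:
    counts[tz] += 1  # no "if tz in counts" check needed

# The same tally written with an explicit membership test.
explicit = {}
for tz in sample:
    if tz in explicit:
        explicit[tz] += 1
    else:
        explicit[tz] = 1

print(dict(counts) == explicit)  # True
print(counts["Asia/Tokyo"])      # 2
```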

(3) Getting the 10 most frequent time zones and their counts

1. The verbose way

# get the n time zones with the largest counts
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    # sort() orders ascending, so the largest counts end up last
    value_key_pairs.sort()
    return value_key_pairs[-n:]

# count the time zones, then take the top 10;
# the result gets its own name so the top_counts function is not overwritten
counts = simple_get_counts(time_zones)
top_ten = top_counts(counts)

2. The simple way

from collections import Counter

# a simpler version of top_counts
def simple_top_counts(timezone, n=10):
    counter_counts = Counter(timezone)
    return counter_counts.most_common(n)
simple_top_counts(time_zones)

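This version is simpler because it uses the Counter class from collections, which tallies how often each distinct value occurs, so there is no need to build counts first. Its most_common(n) method then returns the n most frequent values together with their counts, largest first.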
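
The two approaches carry the same information in different shapes. A minimal sketch on hypothetical data: most_common yields (name, count) pairs in descending order, while the sort-based version yields (count, name) pairs in ascending order.

```python
from collections import Counter

# Hypothetical sample of time zone strings, chosen without ties.
sample = (["America/New_York"] * 3
          + ["America/Chicago"] * 2
          + ["Europe/London"])

counter = Counter(sample)

# most_common(n): (name, count) pairs, largest count first.
top_two = counter.most_common(2)
print(top_two)  # [('America/New_York', 3), ('America/Chicago', 2)]

# Sort-based equivalent: (count, name) pairs, smallest count first.
pairs = sorted((count, tz) for tz, count in counter.items())
print(pairs[-2:])  # [(2, 'America/Chicago'), (3, 'America/New_York')]
```

Code that consumes the result therefore has to account for both the tuple order and the sort direction when switching from one version to the other.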

(4) Plotting the time zone counts as a bar chart with pylab and a DataFrame

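from pandas import DataFrame
import pylab as pyl

# use a DataFrame to show the counts of each time zone
def show_timezone_data(records):
    frame = DataFrame(records)
    # records with no 'tz' field become NaN; label them "Missing"
    clean_tz = frame['tz'].fillna("Missing")
    # empty strings are present but blank; label them "Unknown"
    clean_tz[clean_tz == ''] = 'Unknown'
    # value_counts() sorts by count, largest first
    tz_counts = clean_tz.value_counts()
    # horizontal bar chart of the 10 most frequent time zones
    tz_counts[:10].plot(kind="barh", rot=0)
    pyl.show()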

Result: a horizontal bar chart of the 10 most frequent time zones.

Reposted from www.cnblogs.com/whatyouknow123/p/9118174.html