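Python for Data Analysis: Introduction

Preface

This post is part of a series of notes on studying the book Python for Data Analysis. It covers the introductory examples in Chapter 2.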

Contents

I. Analyzing the usa.gov data

(1) Loading the data

import json

if __name__ == "__main__":
    # load data
    path = "../../datasets/bitly_usagov/example.txt"
    with open(path) as data:
        records = [json.loads(line) for line in data]
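
Sections (2) and (3) operate on a list of time zone strings taken from records. The following is a minimal sketch of how such a list could be built, assuming each record is a dict that may or may not contain a tz key. The sample lines are made up for illustration and stand in for the contents of example.txt.

```python
import json

# Hypothetical sample lines (one JSON object per line), not real data
sample_lines = [
    '{"tz": "America/New_York", "c": "US"}',
    '{"tz": "", "c": "US"}',
    '{"c": "BR"}',
    '{"tz": "America/Sao_Paulo", "c": "BR"}',
]

# Parse each line into a dict, as in the loading code above
records = [json.loads(line) for line in sample_lines]

# Keep only the records that actually have a "tz" field
time_zones = [rec["tz"] for rec in records if "tz" in rec]
print(time_zones)
```

Note that an empty string is still a valid value here: the record has a tz field, it is just blank.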

(2) Counting how often each time zone appears

1. The verbose way

# define the count function
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

2. The simple way

from collections import defaultdict


# simplified version of the count function
def simple_get_counts(sequence):
    # missing keys start at 0, because int() returns 0
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

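The simple version is simpler because it uses defaultdict from the collections module. With defaultdict(int), a key that is not yet in the dictionary is created with the value 0 the first time it is accessed, so the explicit check for whether the key already exists is no longer needed.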
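
To make the difference concrete, here is a small sketch comparing a defaultdict with a plain dict. The time zone names are arbitrary examples.

```python
from collections import defaultdict

counts = defaultdict(int)
# Accessing a missing key calls int(), which returns 0, and stores it
counts["Asia/Tokyo"] += 1
counts["Asia/Tokyo"] += 1
counts["Europe/London"] += 1
print(dict(counts))

# A plain dict raises KeyError for the same operation
plain = {}
raised = False
try:
    plain["Asia/Tokyo"] += 1
except KeyError:
    raised = True
print(raised)
```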

(3) Getting the 10 most frequent time zones and their counts

1. The verbose way

# get the n time zones with the largest counts
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    # sort() orders ascending, so the largest counts end up last
    value_key_pairs.sort()
    return value_key_pairs[-n:]

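# build the list of time zones, skipping records without a 'tz' field
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
# count them, then take the top 10
counts = simple_get_counts(time_zones)
top_tz = top_counts(counts)  # a separate name, so the function is not overwritten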

2. The simple way

from collections import Counter

# simplified version of top_counts
def simple_top_counts(timezone, n=10):
    counter_counts = Counter(timezone)
    return counter_counts.most_common(n)


simple_top_counts(time_zones)

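The simple version is simpler because it uses the Counter class that ships with the collections module. Counter tallies how often each distinct value occurs, so there is no need to compute counts first, and its most_common(n) method returns the n most frequent values directly.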
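
One detail worth noting is that the two approaches return results in different shapes. The sketch below uses a small made-up list to show this: most_common returns (value, count) pairs in descending order, while the sort-based approach returns (count, value) pairs in ascending order.

```python
from collections import Counter

# Made-up sample data for illustration
time_zones = [
    "America/New_York", "", "America/New_York",
    "Asia/Tokyo", "", "America/New_York",
]

counter = Counter(time_zones)
top2 = counter.most_common(2)
print(top2)  # [('America/New_York', 3), ('', 2)]

# Same data through the sort-based approach
pairs = sorted((count, tz) for tz, count in counter.items())
bottom_up = pairs[-2:]
print(bottom_up)  # [(2, ''), (3, 'America/New_York')]
```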

(4) Plotting the time zone counts as a bar chart with pylab and DataFrame

from pandas import DataFrame, Series
import pandas as pd
import numpy as np
import pylab as pyl

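# use a DataFrame to count the time zones and plot the top 10
def show_timezone_data(records):
    frame = DataFrame(records)
    # records without a 'tz' field become NaN; label them "Missing"
    clean_tz = frame['tz'].fillna("Missing")
    # records with an empty 'tz' string are labeled "Unknown"
    clean_tz[clean_tz == ''] = 'Unknown'
    tz_counts = clean_tz.value_counts()
    # recent pandas versions require kind as a keyword argument
    tz_counts[:10].plot(kind="barh", rot=0)
    pyl.show()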

The result is a horizontal bar chart of the 10 most frequent time zones.
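
The cleaning and counting steps in show_timezone_data can also be checked without plotting. The following is a minimal sketch on a few made-up records, assuming pandas is installed; it only illustrates how missing and empty tz values end up in separate categories.

```python
import pandas as pd

# Made-up records for illustration
records = [
    {"tz": "America/New_York"},
    {"tz": ""},                  # empty string
    {"c": "BR"},                 # no "tz" key, becomes NaN in the DataFrame
    {"tz": "America/New_York"},
]

frame = pd.DataFrame(records)
clean_tz = frame["tz"].fillna("Missing")   # NaN -> "Missing"
clean_tz[clean_tz == ""] = "Unknown"       # "" -> "Unknown"
tz_counts = clean_tz.value_counts()
print(tz_counts)
```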
