Python Data Analysis: Importing Data

1. Read data from a specified path

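Read JSON data, one JSON object per line. A note on Windows paths: inside a normal Python string literal the backslash is an escape character, so each separator has to be written as \\ (or the whole path written as a raw string, r'...'). A forward slash / also works as a separator on Windows. This is the same in Python 2 and Python 3; I am using Python 3 with \\ here.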
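As a quick check of the path notes above, the sketch below compares three spellings of a Windows path. The folders and the file name data.txt are hypothetical placeholders, not the actual path used in this post.

```python
from pathlib import PureWindowsPath

# Hypothetical path used only for illustration
escaped = 'E:\\DataAnalysis\\ch02\\data.txt'   # each backslash escaped
raw = r'E:\DataAnalysis\ch02\data.txt'         # raw string, no escaping needed
forward = 'E:/DataAnalysis/ch02/data.txt'      # forward slashes

# The escaped and raw literals are the same string
same_string = (escaped == raw)

# PureWindowsPath applies Windows path rules on any OS, so it can show
# that the forward-slash spelling names the same location
same_path = (PureWindowsPath(escaped) == PureWindowsPath(forward))

print(same_string, same_path)
```

PureWindowsPath is used here only so the comparison behaves the same on any operating system.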

import json

path='E:\\DataAnalysis\\pydata-book\\pydata-book-1st-edition\\ch02\\usagov_bitly_data2012-03-16-1331923249.txt'

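# Parse one JSON object per line; the with block closes the file afterwards
with open(path, encoding='utf-8') as f:
    records=[json.loads(line) for line in f]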

records[0]
Out[4]: 
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 'c': 'US',
 'nk': 1,
 'tz': 'America/New_York',
 'gr': 'MA',
 'g': 'A6qOVH',
 'h': 'wfLQtf',
 'l': 'orofrog',
 'al': 'en-US,en;q=0.8',
 'hh': '1.usa.gov',
 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991',
 't': 1331923247,
 'hc': 1331822918,
 'cy': 'Danvers',
 'll': [42.576698, -70.954903]}

records[0]['tz']
Out[5]: 'America/New_York'
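The file is read as one JSON object per line. Here is a minimal sketch of the same idea, using an in-memory io.StringIO in place of the real file and two made-up records. Blank lines are skipped so that json.loads does not fail on them. The helper name load_json_lines is hypothetical.

```python
import io
import json

def load_json_lines(fileobj):
    """Parse one JSON object per non-blank line of an open text file."""
    return [json.loads(line) for line in fileobj if line.strip()]

# Made-up sample standing in for the real data file
sample = io.StringIO(
    '{"tz": "America/New_York", "c": "US"}\n'
    '\n'
    '{"tz": "", "c": "GB"}\n'
)

records = load_json_lines(sample)
print(len(records), records[0]['tz'])
```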

2. Extract one field from each dict

time_zones=[rec('tz') for rec in records]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-0672c6c590cc> in <module>()
----> 1 time_zones=[rec('tz') for rec in records]

<ipython-input-6-0672c6c590cc> in <listcomp>(.0)
----> 1 time_zones=[rec('tz') for rec in records]

TypeError: 'dict' object is not callable

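A dict field must be accessed with [], but () was used here by mistake. Some records also have no tz field (they show up as Missing in the pandas counts in section 5), so the corrected version filters them out:

time_zones=[rec['tz'] for rec in records if 'tz' in rec]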
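Square-bracket access has its own pitfall: it raises KeyError when a record lacks the field. The sketch below, on three made-up records, shows two common ways to deal with that; which one fits depends on whether records without a time zone should be kept.

```python
# Made-up records: the second one has no 'tz' key
records = [
    {'tz': 'America/New_York'},
    {'c': 'US'},
    {'tz': ''},
]

# Option 1: keep only records that actually contain the field
time_zones = [rec['tz'] for rec in records if 'tz' in rec]

# Option 2: keep every record, substituting None when the field is absent
time_zones_all = [rec.get('tz') for rec in records]

# Plain rec['tz'] without a filter raises KeyError on the second record
try:
    [rec['tz'] for rec in records]
    raised = False
except KeyError:
    raised = True

print(time_zones, time_zones_all, raised)
```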

3. Count how many times each value occurs

Method 1:

def get_counts(sequence):
    # Plain dict: create the key the first time a value is seen
    counts={}
    for x in sequence:
        if x in counts:
            counts[x]+=1
        else:
            counts[x]=1
    return counts

Method 2:

from collections import defaultdict

def get_counts2(sequence):
    counts=defaultdict(int)
    for x in sequence:
        counts[x]+=1
    return counts
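Both methods should produce the same counts. A small sketch that checks this on a made-up list of time zones (the two functions are repeated so the block runs on its own):

```python
from collections import defaultdict

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

def get_counts2(sequence):
    counts = defaultdict(int)  # missing keys start at int() == 0
    for x in sequence:
        counts[x] += 1
    return counts

# Made-up sample data
time_zones = ['America/New_York', '', 'America/New_York', 'Asia/Tokyo', '']

c1 = get_counts(time_zones)
c2 = get_counts2(time_zones)

# A defaultdict compares equal to a plain dict with the same items
print(c1['America/New_York'], c1 == c2)
```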

4. Get the top 10 values and their counts

Method 1: write a function

def top_counts(count_dict,n=10):
    value_key_pairs=[(count,tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

counts=get_counts(time_zones)

top_counts(counts)

Method 2: use the standard library's collections module

from collections import Counter

counts=Counter(time_zones)

counts.most_common(10)
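The two methods return the same information in different shapes. The sketch below compares them on a made-up counts dict: top_counts gives (count, tz) pairs in ascending order, while Counter.most_common gives (tz, count) pairs in descending order.

```python
from collections import Counter

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

# Made-up counts for illustration
counts = {'America/New_York': 3, 'Asia/Tokyo': 1, '': 2}

by_function = top_counts(counts, n=2)        # ascending, (count, tz)
by_counter = Counter(counts).most_common(2)  # descending, (tz, count)

print(by_function)
print(by_counter)
```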

Counting time zones with pandas

from pandas import DataFrame,Series

import pandas as pd; import numpy as np

frame=DataFrame(records)

pandas has a built-in counting method, value_counts:

tz_counts=frame['tz'].value_counts()

tz_counts[:10]
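A small sketch of value_counts on made-up records, assuming pandas is installed. By default it sorts by count in descending order and leaves out missing values (NaN), which is why records without a tz field need separate handling.

```python
import pandas as pd

# Made-up records; the last one has no 'tz' field, so it becomes NaN
records = [
    {'tz': 'America/New_York'},
    {'tz': 'America/New_York'},
    {'tz': 'Asia/Tokyo'},
    {'c': 'US'},
]

frame = pd.DataFrame(records)

tz_counts = frame['tz'].value_counts()              # NaN excluded
tz_counts_all = frame['tz'].value_counts(dropna=False)  # NaN counted too

print(tz_counts)
```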

5. Handle missing and empty values

clean_tz=frame['tz'].fillna('Missing')

clean_tz[clean_tz=='']='Unknown'

tz_counts=clean_tz.value_counts()

tz_counts[:10]
Out[36]: 
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
Name: tz, dtype: int64
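The same cleanup can also be written as one chained expression. This is a sketch on made-up data, assuming pandas is installed: fillna labels records that have no tz field, and Series.replace relabels the empty strings, as an alternative to the boolean-mask assignment used above.

```python
import pandas as pd

# Made-up records: three with a time zone, two empty strings, one missing
frame = pd.DataFrame([
    {'tz': 'America/New_York'},
    {'tz': 'America/New_York'},
    {'tz': 'America/New_York'},
    {'tz': ''},
    {'tz': ''},
    {'c': 'US'},
])

# Missing field -> 'Missing', empty string -> 'Unknown'
clean_tz = frame['tz'].fillna('Missing').replace('', 'Unknown')

tz_counts = clean_tz.value_counts()
print(tz_counts)
```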

tz_counts[:10].plot(kind='barh',rot=0)
Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x29412298780>

frame['a'][1]
Out[39]: 'GoogleMaps/RochesterNY'

frame['a'][50]
Out[40]: 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'

frame['a'][51]
Out[41]: 'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'

results=Series([x.split()[0] for x in frame.a.dropna()])
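The last line keeps only the first whitespace-separated token of each user-agent string, which roughly identifies the browser. A sketch of that step followed by a count, on a few sample strings (the third is abbreviated and the None entry is made up to stand in for a missing value), assuming pandas is installed:

```python
import pandas as pd

# Sample user-agent strings; None stands in for a record without an 'a' field
agents = pd.Series([
    'GoogleMaps/RochesterNY',
    'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2',
    'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us) AppleWebKit/533.1',
    None,
])

# Drop missing values, then keep the first token of each string
results = pd.Series([x.split()[0] for x in agents.dropna()])

browser_counts = results.value_counts()
print(browser_counts)
```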

Reposted from blog.csdn.net/ElsaRememberAllBug/article/details/81118997