Python Data Analysis Study Notes (2): The statsmodels Module

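    This is the second post in my data analysis study notes, and the topic is the statsmodels module. Why this module? I suspect many people teaching themselves data analysis with numpy/pandas share my frustration: I simply did not know where to get data to practice on. For example, I tried downloading data from the website of China's National Bureau of Statistics and from a UK data site, and the experience was poor: the connection was slow, and even after a long search I could not find where to download the data. While working through some exercises recently, I came across a library that solves this problem nicely.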

    Back to the point: the library is statsmodels, the one in the title. Here is how the official site describes it:

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

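    In short, it is a library that bundles statistical models together with sample datasets. I have not used the models yet; so far I have only explored the dataset modules. So which datasets does the library offer for practice?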

    First, let's list every dataset the library ships with, along with a short description of each. Here is the code:

#coding:utf-8

import statsmodels.api as sm
from pandas import DataFrame

dataDict = {'name': [], 'describe_short': []}

for modstr in dir(sm.datasets):
    # Look the attribute up directly instead of building a string for eval()
    mod = getattr(sm.datasets, modstr)
    try:
        describe_short = mod.DESCRSHORT
    except AttributeError as e:
        # Non-dataset names (e.g. utils, __doc__) have no DESCRSHORT
        print("No DESCRSHORT attribute:\n", e)
        continue
    dataDict['name'].append(modstr)
    dataDict['describe_short'].append(describe_short)

dataDf = DataFrame({'describe_short': dataDict['describe_short']}, index=dataDict['name'])
print(dataDf)

    The output is shown below. As you can see, the library ships with quite a few datasets, such as sunspots, scotland, and china_smoking, all available for practice:

                                                       describe_short
anes96              This data is a subset of the American National...
cancer                            Breast Cancer and county population
ccard                            William Greene's credit scoring data
china_smoking       Co-occurrence of lung cancer and smoking in 8 ...
co2                 Atmospheric CO2 from Continuous Air Samples at...
committee           Number of bill assignments in the 104th House ...
copper                                  World Copper Market 1951-1975
cpunish                            Number of state executions in 1997
elnino              Averaged monthly sea surface temperature - Pac...
engel                                    Engel food expenditure data.
fair                                        Extramarital affair data.
fertility           Total fertility rate represents the number of ...
grunfeld            Grunfeld (1950) Investment Data for 11 U.S. Fi...
heart               Survival times after receiving a heart transplant
interest_inflation  (West) German interest and inflation rate 1972...
longley                                                              
macrodata                   US Macroeconomic Data for 1959Q1 - 2009Q3
modechoice          Data used to study travel mode choice between ...
nile                This dataset contains measurements on the annu...
randhie                 The RAND Co. Health Insurance Experiment Data
scotland            Taxation Powers' Yes Vote for Scottish Parliam...
spector             Experimental data on the effectiveness of the ...
stackloss                    Stack loss plant data of Brownlee (1965)
star98              Math scores for 303 student with 10 explanator...
statecrime                                      State crime data 2009
strikes             Contains data on the length of strikes in US m...
sunspots            Yearly (1700-2008) data on sunspots from the N...
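
    The listing loop above can also be written as a small reusable function. The following is only a minimal sketch: the helper name collect_short_descriptions is my own and is not part of statsmodels, and it silently skips names that have no DESCRSHORT string instead of printing a message. It is demonstrated on a made-up stand-in namespace so that it runs even where statsmodels is not installed.

```python
from types import SimpleNamespace


def collect_short_descriptions(package):
    """Map each public attribute name of `package` to its DESCRSHORT string.

    Hypothetical helper (not part of statsmodels): attributes that lack a
    string DESCRSHORT are skipped instead of raising.
    """
    descriptions = {}
    for name in dir(package):
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        short = getattr(getattr(package, name), "DESCRSHORT", None)
        if isinstance(short, str):
            descriptions[name] = short.strip()
    return descriptions


# Stand-in for sm.datasets, with made-up entries for illustration only.
fake_datasets = SimpleNamespace(
    demo_a=SimpleNamespace(DESCRSHORT="First demo dataset"),
    demo_b=SimpleNamespace(DESCRSHORT="  Second demo dataset  "),
    utils=SimpleNamespace(),  # no DESCRSHORT, so it is skipped
)
result = collect_short_descriptions(fake_datasets)
print(result)
```

    With statsmodels installed, the same function can be called as collect_short_descriptions(sm.datasets), and wrapping the returned dictionary in a pandas Series gives a printable column similar to the table above.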

    So how do you load one of these datasets? Taking scotland as an example, the following few lines display its data:

import statsmodels.api as sm
from pandas import DataFrame

scotland_data = sm.datasets.scotland.load_pandas()
# print(type(scotland_data))
# print(scotland_data)
df = scotland_data.data
print(type(df))  # a pandas DataFrame
print(df)
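
    Printing the whole DataFrame can be a lot to take in, so a compact summary of the loaded object is sometimes handier. The sketch below is an illustration only: dataset_overview is a hypothetical helper, and it assumes that the object returned by load_pandas() exposes endog_name and exog_name next to data. That is my understanding of the statsmodels Dataset object rather than something shown in this post, so the helper falls back to None when those attributes are missing. A stand-in object is used first so that the example runs without statsmodels.

```python
from types import SimpleNamespace


def dataset_overview(dataset):
    """Summarize a statsmodels-style dataset object in a plain dict.

    Hypothetical helper: only `dataset.data` (with .shape and .columns)
    is required; endog_name / exog_name are assumed and read defensively.
    """
    df = dataset.data
    n_rows, n_cols = df.shape
    return {
        "n_rows": n_rows,
        "n_cols": n_cols,
        "columns": list(df.columns),
        "endog_name": getattr(dataset, "endog_name", None),
        "exog_name": getattr(dataset, "exog_name", None),
    }


# Made-up stand-in that mimics only the attributes the helper reads.
fake = SimpleNamespace(
    data=SimpleNamespace(shape=(3, 2), columns=["y", "x1"]),
    endog_name="y",
    exog_name=["x1"],
)
overview = dataset_overview(fake)
print(overview)

# Optional: run against the real scotland dataset when statsmodels is available.
try:
    import statsmodels.api as sm
except ImportError:
    sm = None
if sm is not None:
    print(dataset_overview(sm.datasets.scotland.load_pandas()))
```

    The helper reads the optional attributes with getattr and a default, so it still returns a usable summary if a particular statsmodels version names them differently.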

    To find out what exactly the scotland dataset records, print the module's DESCRLONG attribute for the full description (DESCRSHORT gives only the one-line summary):

#coding:utf-8

import statsmodels.api as sm
from pandas import DataFrame

scotland_mod = sm.datasets.scotland
print(scotland_mod.DESCRLONG)

# This data is based on the example in Gill and describes the proportion of
# voters who voted Yes to grant the Scottish Parliament taxation powers.
# The data are divided into 32 council districts.  This example's explanatory
# variables include the amount of council tax collected in pounds sterling as
# of April 1997 per two adults before adjustments, the female percentage of
# total claims for unemployment benefits as of January, 1998, the standardized
# mortality rate (UK is 100), the percentage of labor force participation,
# regional GDP, the percentage of children aged 5 to 15, and an interaction term
# between female unemployment and the council tax.
#
# The original source files and variable information are included in
# /scotland/src/

    That wraps up this introduction to statsmodels as a source of practice data.

Reposted from blog.csdn.net/sinat_37255539/article/details/80381692