python for machine-learning

df.columns
Returns the column labels of a DataFrame; the index (the leftmost column in the printed table) is not included.
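pd.merge(data1, data2, how=..., on=...)
Works like a join between database tables: data1 and data2 are the two tables to merge, how is the join type, and on is the column to join on.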
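
As a minimal sketch of the join described above, the following merges two small tables. The table contents and the column name `key` are made up for illustration.

```python
import pandas as pd

# Two made-up tables sharing a "key" column (illustrative data only)
data1 = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
data2 = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Inner join: keep only the keys present in both tables
inner = pd.merge(data1, data2, how="inner", on="key")

# Left join: keep every row of data1; rows without a match get NaN
left = pd.merge(data1, data2, how="left", on="key")

print(inner)
print(left)
```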
np.round
Rounds values to a given number of decimal places.
str(yr)
Converts a number directly to a string.
df.boxplot('Income', by='Region', rot=90)
Draws a box plot of Income grouped by Region; rot is the label rotation angle.
X = scipy.stats.norm(loc=diff, scale=1)
Normal distribution: loc is the mean and scale is the standard deviation.
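
A small sketch of how such a frozen distribution object can be queried. The values of `loc`, `scale` and `a` are chosen only for illustration; `sf` is the survival function, the complement of `cdf`.

```python
from scipy import stats

# Frozen normal distribution with mean 0 and standard deviation 1
# (loc and scale are illustrative values)
X = stats.norm(loc=0, scale=1)

a = 1.96
upper_tail = X.sf(a)    # P(X > a)
lower_tail = X.cdf(a)   # P(X <= a)

print(upper_tail, lower_tail)
```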
plt.legend(["a={}".format(a) for a in a_values], loc=0)
Useful when the legend labels need to include the value of a variable.
plt.yscale('log')
Sets the y axis to a logarithmic scale.
merged = df.groupby('Region', as_index=False).mean()
groupby on its own only builds the groups; it has to be combined with an aggregation such as mean() to produce a result.
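
The point about groupby can be seen in a short sketch. The data below is made up; only the column name `Region` comes from the note above.

```python
import pandas as pd

# Made-up data: income by region (illustrative values)
df = pd.DataFrame({
    "Region": ["EU", "EU", "ASIA", "ASIA"],
    "Income": [10.0, 20.0, 30.0, 50.0],
})

grouped = df.groupby("Region", as_index=False)   # a GroupBy object, not a table
merged = grouped.mean()                          # one row per region

print(type(grouped).__name__)
print(merged)
```

With as_index=False the group key stays an ordinary column instead of becoming the index, which is convenient if the result is merged with another table afterwards.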
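population.columns = ['Country'] + list(population.columns)[1:]
Rebuilds the header row, renaming the first column. In practice the call to list raised an error; according to posts online, Jupyter sometimes needs to be refreshed for this to work.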
http://www.cnblogs.com/txw1958/archive/2011/12/21/2295698.html
N ways to fetch web resources in Python 3
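source.count(bytes('Soup', 'UTF-8'))
Counts how many times the byte string occurs in the downloaded page source.
X.sf(a)
Survival function of the distribution X, i.e. P(X > a) = 1 - X.cdf(a).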
Basic usage of subplots

    x2 = np.arange(35, 71, 1)
    fig, ax = plt.subplots(2, 1)
    ax[0].vlines(x2/100, 0, binom.pmf(x2, N, thep), colors='b', lw=5, alpha=0.5)
    ax[1].vlines(x[1:], 0, y, lw=5, colors=dark2_colors[0])
    ax[0].set_xlim(0.35, 0.75)
    ax[1].set_xlim(0.35, 0.75)
    plt.show()

plt.xticks(rotation=90)
Rotates the x-axis tick labels by 90 degrees, which helps when the labels are long.
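if l is not None and l[:4] == 'http'
This condition filters web links. Real data often contains empty (None) entries, so the check is very handy.
[l for l in link_list if l is not None and l.startswith('http')]
Python's for syntax is elegant and concise, but mastering it takes a lot of practice.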
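
To make the comprehension concrete, here is a small sketch that writes the same filter as an explicit loop and as a comprehension. The contents of `link_list` are made up.

```python
# Made-up list of href values as a scraper might return them
link_list = ["http://example.com", None, "/about", "https://example.org", "mailto:a@b.c"]

# Explicit loop version
kept_loop = []
for l in link_list:
    if l is not None and l.startswith("http"):
        kept_loop.append(l)

# Equivalent list comprehension
kept = [l for l in link_list if l is not None and l.startswith("http")]

print(kept)
```

The `is not None` test has to come first: calling startswith on None would raise an AttributeError, and `and` stops evaluating as soon as the first condition fails.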
Some websites block crawlers when you fetch pages; in that case the crawler has to disguise itself as a browser.

    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    source = urllib.request.urlopen(req).read()
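
A minimal sketch of the same idea wrapped in a helper. The function name `build_request` and the placeholder URL are assumptions for illustration; the actual download is left commented out because it needs network access.

```python
import urllib.request

def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = build_request("http://example.com/")   # placeholder URL

# urllib stores header names capitalized, so the key reads "User-agent"
print(req.get_header("User-agent"))

# Sending the request needs network access, so it is commented out here:
# source = urllib.request.urlopen(req).read()
```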
Switching between multiple Python versions in Jupyter can be solved with two commands

 pip2 install ipython
 ipython2 kernelspec install-self

Two ways to draw a plot

1.

    data_to_plot = ranking.overall
    plt.bar(data_to_plot.index, data_to_plot)
    plt.show()

2.

    ranking_categories_weighted.head().plot(kind='bar')

Using legend

    ax = ranking_categories_weighted.head().plot(kind='bar', legend=False)
    # Put a legend to the right of the current axis
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

    plt.show()

Writing math formulas in Jupyter
http://blog.csdn.net/winnerineast/article/details/52274556
http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Typesetting%20Equations.html
A web data processing workflow

    # web is assumed to be pattern.web (import requests; from pattern import web)
    URL = "http://www.pollster.com/08USPresGEMvO-2.html"
    html = requests.get(URL).text
    dom = web.Element(html)
    rows = dom.by_tag('tr')
    table = []
    for row in rows:
        table_row = []
        data = row.by_tag('td')
        for value in data:
            table_row.append(web.plaintext(value.content))
        table.append(table_row)

The re regular-expression module

   http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

df2 = new_df.iloc[keep]
A density plot is also called a KDE plot; passing kind='kde' to plot() produces one.

    diamonds['price'].plot(kind='kde', color='black')

diamonds.boxplot('price', by='color')
by gives the x axis and price the y axis.
Ways to generate random numbers

  1. np.random.randint(a, b, N)
  2. np.random.rand(n, m)
  3. np.random.randn(n, m)
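
A short sketch of what each of the three calls returns. The argument values and the seed are arbitrary.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

ints = np.random.randint(1, 7, 10)   # 10 integers from 1 to 6 (upper bound excluded)
unif = np.random.rand(2, 3)          # 2x3 array, uniform on [0, 1)
norm = np.random.randn(2, 3)         # 2x3 array, standard normal

print(ints.shape, unif.shape, norm.shape)
```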

Some array operations

1. z.reshape((8, 2))
2. z.flatten(): flattens an array, i.e. converts a higher-dimensional array into a vector
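
A minimal sketch of the two operations on a 16-element array (the size is chosen so that the 8 by 2 reshape works).

```python
import numpy as np

z = np.arange(16)        # vector 0..15
m = z.reshape((8, 2))    # 8 rows, 2 columns, same data
flat = m.flatten()       # back to a 1-D array, as a copy

flat[0] = 99             # flatten() returns a copy, so m is unchanged
print(m.shape, flat.shape, m[0, 0])
```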

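Complete Python 2 code for downloading a zip file and processing it

    import requests
    from StringIO import StringIO  # Python 2 only
    from zipfile import ZipFile
    from pandas import read_csv

    zip_folder = requests.get('http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip').content
    zip_files = StringIO()
    zip_files.write(zip_folder)
    csv_files = ZipFile(zip_files)
    teams = csv_files.open('Teams.csv')
    teams = read_csv(teams)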
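
For Python 3, a rough counterpart might look like the sketch below, where io.BytesIO takes the place of StringIO. To keep it runnable offline, the archive is built in memory with made-up contents instead of being downloaded; in real use `zip_bytes` would come from `requests.get(url).content`.

```python
import io
import zipfile

import pandas as pd

# Build a tiny zip archive in memory; it stands in for the downloaded bytes
# (the CSV contents are made up)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Teams.csv", "teamID,W\nOAK,88\nBOS,71\n")
zip_bytes = buffer.getvalue()

# Python 3 counterpart of the StringIO step: wrap the bytes in BytesIO
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as csv_files:
    with csv_files.open("Teams.csv") as handle:
        teams = pd.read_csv(handle)

print(teams)
```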

Basic construction of a DataFrame

    data=pd.DataFrame({'level':['a','b','c','b','a'],
               'num':[3,5,6,8,9]})

    grouped = df.groupby("playerID", as_index=False)
    #print grouped.head()
    rookie_idx = grouped["yearID"].aggregate({'min_index':f})['min_index'].values
    # get the first row that appears in each group
    rookie = df.loc[rookie_idx][["playerID", "AB", "H"]]

Markdown rendering in Jupyter
https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
tab = tab.dropna()
Drops the rows that contain missing values; a very practical function.
Fetching a CSV file from a URL

url_exprs = "https://raw.githubusercontent.com/cs109/2014_data/master/exprs_GSE5859.csv"
exprs = pd.read_csv(url_exprs, index_col=0)

sklearn includes many machine learning models; some examples follow.

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.linear_model import LinearRegression


selector = SelectKBest(f_regression, k=2).fit(x, y)
best_features = np.where(selector.get_support())[0]
print(best_features)

xt = x[:, best_features]
clf = LinearRegression().fit(xt, y)

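Wrong cross-validation (the features in xt were selected using all of the data before splitting, so information from the test folds leaks into the model):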
for train, test in KFold(len(y), 10):
    xtrain, xtest, ytrain, ytest = xt[train], xt[test], y[train], y[test]
    clf.fit(xtrain, ytrain)
    yp = clf.predict(xtest)

    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')


plt.xlabel("Predicted")
plt.ylabel("Observed")

Correct cross-validation (feature selection is redone inside each fold, using only the training data):
scores = []
for train, test in KFold(len(y), n_folds=5):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]

    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]
    xtest = xtest[:, b.get_support()]

    clf.fit(xtrain, ytrain)    
    scores.append(clf.score(xtest, ytest))

    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')

plt.xlabel("Predicted")
plt.ylabel("Observed")

print("CV Score is ", np.mean(scores))
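
The code above targets an old scikit-learn release; `sklearn.cross_validation` was later replaced by `sklearn.model_selection`. As a sketch under that assumption, the same comparison can be written with a pipeline, so that feature selection is refit inside every training fold automatically. The data here is pure random noise generated for illustration, not the data used in the notes.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: y is unrelated to x, so an honest score should be near or below 0
rng = np.random.RandomState(0)
x = rng.normal(size=(50, 1000))
y = rng.normal(size=50)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are chosen using every sample, then cross-validated
xt = SelectKBest(f_regression, k=2).fit_transform(x, y)
leaky_score = cross_val_score(LinearRegression(), xt, y, cv=cv).mean()

# Honest: the pipeline redoes feature selection inside each training fold
model = make_pipeline(SelectKBest(f_regression, k=2), LinearRegression())
honest_score = cross_val_score(model, x, y, cv=cv).mean()

print("leaky R^2: ", leaky_score)
print("honest R^2:", honest_score)
```

Because the target is noise, any clearly positive score from the leaky version is an artifact of selecting features on the full data set.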

Some very useful scipy submodules

 scipy.stats
 scipy.integrate
 scipy.signal
 scipy.optimize
 scipy.special
 scipy.linalg

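mtcars.loc['Maserati Bora']
Gets a single row of the data by its index label (.ix, used in older pandas, has since been removed; .loc does the same lookup).
The difference between any and all

(mtcars.mpg >= 20).any()   # a single value: True
(mtcars > 0).all()         # one value per column: True False True True ...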
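
A small sketch of the difference, using a made-up two-row table in place of mtcars.

```python
import pandas as pd

# Small made-up table standing in for mtcars
cars = pd.DataFrame({"mpg": [21.0, 15.0], "hp": [110, 0]},
                    index=["Mazda RX4", "Made-up Car"])

any_result = (cars.mpg >= 20).any()   # Series -> one boolean
all_result = (cars > 0).all()         # DataFrame -> one boolean per column

print(any_result)
print(all_result)
```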

Plotting several pairwise relationships at once

from pandas.plotting import scatter_matrix  # pandas.tools.plotting before pandas 0.20
scatter_matrix(mtcars[['mpg', 'hp', 'cyl']], figsize=(10, 6), alpha=1, diagonal='kde')

Commonly used bs4 attributes

soup.head.contents
soup.head.children
soup.head.title
soup.head.title.string
soup.head.descendants    # iterate with: for child in soup.head.descendants
soup.stripped_strings
soup.find_all('a')
soup.find_all('a')[1].get('href')

Using the JSON format

a = {'a': 1, 'b': 2}   # a is a dictionary
s = json.dumps(a)      # s is a string containing a in JSON encoding
a2 = json.loads(s)     # reading it back; in Python 2 the keys are now unicode
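
A runnable sketch of the round trip. The second dictionary is made up to show one caveat: JSON object keys are always strings.

```python
import json

a = {"a": 1, "b": 2}

s = json.dumps(a, sort_keys=True)   # dict -> JSON text
a2 = json.loads(s)                  # JSON text -> dict

# JSON object keys are always strings, so integer keys do not survive
b2 = json.loads(json.dumps({1: "one"}))

print(s)
print(a2 == a, list(b2.keys()))
```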

Create a pandas DataFrame from JSON

  data = pd.DataFrame(wc, columns = ['match_number', 'location', 'datetime', 'home_team', 'away_team', 'winner', 'home_team_events', 'away_team_events'])

Formatting times with pandas, although it is not clear to me which time formats can be parsed

data['gameDate']=pd.DatetimeIndex(data.datetime).date
data['gameTime']=pd.DatetimeIndex(data.datetime).time
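
On the open question of which formats work: as a sketch, ISO 8601 strings are parsed by pd.to_datetime without any hints, and other layouts can be handled by passing an explicit format string. The timestamps below are made up.

```python
import datetime

import pandas as pd

# Made-up timestamps in ISO 8601 form, which pandas parses without hints
data = pd.DataFrame({"datetime": ["2014-06-12T17:00:00", "2014-06-13T13:30:00"]})

parsed = pd.to_datetime(data["datetime"])
data["gameDate"] = parsed.dt.date
data["gameTime"] = parsed.dt.time

# For a non-ISO layout, the format can be given explicitly
other = pd.to_datetime("12/06/2014 17:00", format="%d/%m/%Y %H:%M")

print(data)
print(other)
```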

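Several distribution functions

data = stats.binom.rvs(n=10, p=0.3, size=10000)   # binomial random samples (n Bernoulli trials)
y = stats.poisson.pmf(n, lam)                     # Poisson distribution
y = stats.norm.pdf(x, 0, 1)                       # normal distribution
y = stats.beta.pdf(x, a, b)                       # beta distribution
lam = 0.5
x = np.arange(0, 15, 0.1)
y = lam * np.exp(-lam * x)                        # exponential distribution pdf
stats.expon.rvs(scale=2, size=1000)               # exponential random samples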
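
The exponential lines mix a hand-written density with scipy's sampler. A short sketch of how the two relate: scipy parameterizes the exponential distribution by scale = 1 / lam, so lam = 0.5 corresponds to scale = 2. The seed is arbitrary.

```python
import numpy as np
from scipy import stats

lam = 0.5
x = np.arange(0, 15, 0.1)

manual = lam * np.exp(-lam * x)              # pdf written out by hand
library = stats.expon.pdf(x, scale=1 / lam)  # scipy uses scale = 1 / lam

samples = stats.expon.rvs(scale=1 / lam, size=1000, random_state=0)

print(np.allclose(manual, library), samples.mean())
```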

Datasets bundled with sklearn

# load_boston was removed in scikit-learn 1.2; this needs an older release
from sklearn.datasets import load_boston
boston = load_boston()

The statsmodels module

statsmodels is a Python module specifically for estimating statistical models (with less focus on machine learning than sklearn). It can estimate many types of statistical models, but today we will focus on linear regression.
e.g.:
import statsmodels.api as sm
X = sm.add_constant(X)   # add an intercept column before fitting
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
results.params.values

A plot call with a fairly complete set of parameters

residData.plot(title='Residuals from least squares estimates across years', figsize=(15, 8), color=['blue' if x == 'OAK' else 'gray' for x in df.teamID])

An example of solving linear regression with the matrix transpose (the normal equation)

 np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)
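
The one-liner computes the least-squares coefficients beta = (X^T X)^(-1) X^T y. A minimal sketch on made-up data, cross-checked against np.linalg.lstsq; the true coefficients, noise level and sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Made-up data: y = 1 + 2*x1 - 3*x2 plus a little noise
n = 200
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])   # first column is the intercept
y = X.dot(np.array([1.0, 2.0, -3.0])) + 0.01 * rng.normal(size=n)

# Normal equation, as in the one-liner above
beta = np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)

# np.linalg.lstsq solves the same problem without forming an explicit inverse
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta)
```

Forming the inverse explicitly is fine for small, well-conditioned problems like this one; a least-squares solver is usually the more numerically robust choice.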
