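df.columns
Returns the column names of the table, not including the first (index) column.
pd.merge
Works like a database join: data1 and data2 are the two tables to merge, how sets the join type, and on sets the join key.
np.round
Rounds values to a given number of decimal places.
str(yr)
Converts a number directly to a string.
df.boxplot('Income', by='Region', rot=90)
Draws a box plot; rot is the label rotation angle.
X = scipy.stats.norm(loc=diff, scale=1)
Normal distribution: loc is the mean and scale is the standard deviation.
plt.legend(["a={}".format(a) for a in a_values], loc=0)
Handy whenever the legend labels of a plot contain variables.
plt.yscale('log')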
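merged = df.groupby('Region', as_index=False).mean()
groupby on its own has no visible effect; combine it with an aggregation such as mean().
population.columns = ['Country'] + list(population.columns)[1:]
Reorganizes the header column names. In practice the call to list raised an error; according to posts online, this sometimes happens because Jupyter needs to be refreshed.
http://www.cnblogs.com/txw1958/archive/2011/12/21/2295698.html
N ways to fetch web resources in Python 3
source.count(bytes('Soup', 'UTF-8'))
X.sf(a)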
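As a small illustration of the `X.sf(a)` call noted above: `sf` is the survival function, that is, P(X > a) = 1 - cdf(a). The values of `diff` and `a` below are made up for the example and are not taken from the original notes.

```python
from scipy import stats

diff = 0.5   # hypothetical mean, chosen only for this example
a = 1.0      # hypothetical threshold

X = stats.norm(loc=diff, scale=1)   # frozen normal: mean=diff, std=1

tail = X.sf(a)         # P(X > a), the survival function
check = 1 - X.cdf(a)   # the same quantity computed from the CDF

print(tail, check)
```

For values far out in the tail, `sf` is usually preferable to `1 - cdf` because it avoids subtracting two nearly equal numbers.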
Basic subplot usage
x2=np.arange(35,71,1)
fig, ax = plt.subplots(2,1)
ax[0].vlines(x2/100, 0, binom.pmf(x2, N, thep), colors='b', lw=5, alpha=0.5)
ax[1].vlines(x[1:], 0, y, lw=5, colors=dark2_colors[0])
ax[0].set_xlim(0.35,0.75)
ax[1].set_xlim(0.35,0.75)
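plt.show()
plt.xticks(rotation=90)
Rotates the x-axis tick labels by 90 degrees, which suits long labels.
if l is not None and l[:4] == 'http'
This code filters web links. In practice many fields are empty, so the check is powerful and convenient.
[l for l in link_list if l is not None and l.startswith('http')]
for
Python's for loop is elegant and concise, but mastering it takes plenty of practice.
Sometimes a website blocks crawlers when you fetch web resources; the crawler then has to be disguised: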
req = urllib.request.Request(url,headers={'User-Agent': 'Mozilla/5.0'})
source = urllib.request.urlopen(req).read()
Switching between multiple Jupyter Python versions is solved with two commands:
pip2 install ipython
ipython2 kernelspec install-self
- Two ways to draw a plot
1.
data_to_plot = ranking.overall
plt.bar(data_to_plot.index, data_to_plot)
plt.show()
2.
ranking_categories_weighted.head().plot(kind='bar')
- Using legend
ax = ranking_categories_weighted.head().plot(kind='bar', legend=False)
# Put a legend to the right of the current axis
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
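A self-contained sketch of the same legend placement follows. The small DataFrame is a hypothetical stand-in for `ranking_categories_weighted`, and the `Agg` backend is an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for ranking_categories_weighted
scores = pd.DataFrame(
    {"safety": [3, 5, 4], "cost": [2, 1, 4]},
    index=["A", "B", "C"],
)

ax = scores.plot(kind="bar", legend=False)

# loc names the point ON the legend box; bbox_to_anchor says where that
# point goes in axes coordinates. (1, 0.5) is the middle of the right edge.
legend = ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))

plt.tight_layout()  # leave room so the outside legend is not clipped
```

Reading `loc` and `bbox_to_anchor` together is the key: the legend's center-left corner is pinned to the right edge of the axes, so the box sits just outside the plot.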
Writing math formulas in Jupyter
http://blog.csdn.net/winnerineast/article/details/52274556
http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Typesetting%20Equations.html
A web data processing workflow
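import requests
from pattern import web  # assumption: web is the pattern library's web module

URL = "http://www.pollster.com/08USPresGEMvO-2.html"
html = requests.get(URL).text
dom = web.Element(html)
rows = dom.by_tag('tr')
table = []
for row in rows:
    table_row = []
    data = row.by_tag('td')
    for value in data:
        # keep only the plain text of each table cell
        table_row.append(web.plaintext(value.content))
    table.append(table_row)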
- The re regular-expression module
http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
df2 = new_df.iloc[keep]
A density plot is also called a KDE plot; passing kind='kde' to plot produces one.
diamonds['price'].plot(kind='kde', color='black')
diamonds.boxplot('price', by='color')
by is the x-axis and price is the y-axis.
Ways to generate random numbers
1. np.random.randint(a, b, N)
2. np.random.rand(n, m)
3. np.random.randn(n, m)
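The three calls differ in what they draw and in how the shape is passed. A minimal sketch, with the bounds, shapes and seed chosen arbitrarily for illustration:

```python
import numpy as np

np.random.seed(0)  # fixed seed so repeated runs give the same numbers

ints = np.random.randint(1, 7, 10)   # 10 integers in [1, 7), i.e. 1 to 6
unif = np.random.rand(2, 3)          # 2x3 array, uniform on [0, 1)
norm = np.random.randn(2, 3)         # 2x3 array, standard normal

print(ints.shape, unif.shape, norm.shape)
```

Note that the upper bound of `randint` is exclusive, and that `rand` and `randn` take the dimensions as separate arguments rather than as a tuple.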
- Some array operations
1. z.reshape((8, 2))
2. z.flatten(): to flatten an array (convert a higher-dimensional array into a vector), use flatten()
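A short sketch of both operations; the array `z` here is simply the integers 0 to 15, picked so that it fits the (8, 2) shape used above.

```python
import numpy as np

z = np.arange(16)        # vector 0..15
m = z.reshape((8, 2))    # 8 rows, 2 columns; the element count must match
flat = m.flatten()       # flatten always returns a copy as a 1-D vector

flat[0] = 99             # changing the copy leaves m untouched
print(m.shape, flat.shape, m[0, 0])
```

Because `flatten()` copies, it is safe to modify the result; `ravel()` is the alternative when a view is acceptable.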
- Complete Python 2 code for downloading a zip file and processing it
zip_folder = requests.get('http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip').content
zip_files = StringIO()
zip_files.write(zip_folder)
csv_files = ZipFile(zip_files)
teams=csv_files.open('Teams.csv')
teams=read_csv(teams)
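For Python 3 the same idea needs `io.BytesIO` instead of `StringIO`, because the downloaded content is bytes. The sketch below builds a tiny archive in memory as a stand-in for `requests.get(url).content`, so it runs without network access; the CSV rows are dummy values, not Lahman data.

```python
import io
import zipfile

import pandas as pd

# Build a small zip archive in memory; it stands in for the bytes that
# requests.get(url).content would return.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Teams.csv", "teamID,yearID,W\nAAA,2000,1\nBBB,2001,2\n")
zip_bytes = buffer.getvalue()

# Python 3: wrap the bytes in BytesIO, then open one member of the archive
archive = zipfile.ZipFile(io.BytesIO(zip_bytes))
with archive.open("Teams.csv") as f:
    teams = pd.read_csv(f)

print(teams)
```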
- Basic DataFrame construction
data=pd.DataFrame({'level':['a','b','c','b','a'],
'num':[3,5,6,8,9]})
grouped = df.groupby("playerID", as_index=False)
#print grouped.head()
rookie_idx = grouped["yearID"].aggregate({'min_index':f})['min_index'].values
# get the first record of each group
rookie = df.loc[rookie_idx][["playerID", "AB", "H"]]
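The helper `f` is not defined in the snippet above. Assuming it was meant to return the index of the smallest yearID within each group, one possible sketch uses `idxmin` directly; the records below are hypothetical and only reuse the column names.

```python
import pandas as pd

# Hypothetical batting records; the column names follow the snippet above
df = pd.DataFrame({
    "playerID": ["p1", "p1", "p2", "p2", "p2"],
    "yearID":   [2001, 2000, 1999, 1998, 2000],
    "AB":       [400, 350, 500, 120, 480],
    "H":        [110, 90, 150, 30, 140],
})

# idxmin gives, per player, the index label of the row with the earliest year
rookie_idx = df.groupby("playerID")["yearID"].idxmin()

# select those rows and keep only the columns of interest
rookie = df.loc[rookie_idx, ["playerID", "AB", "H"]]
print(rookie)
```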
Jupyter Markdown rendering
https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
tab = tab.dropna()
Drops the rows of the table that contain missing values; a fairly practical feature.
- Fetching an online CSV file
url_exprs = "https://raw.githubusercontent.com/cs109/2014_data/master/exprs_GSE5859.csv"
exprs = pd.read_csv(url_exprs, index_col=0)
- sklearn includes many machine-learning models; some examples follow
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score, KFold  # replaces the removed sklearn.cross_validation module
from sklearn.linear_model import LinearRegression
selector = SelectKBest(f_regression, k=2).fit(x, y)
best_features = np.where(selector.get_support())[0]
print(best_features)
xt = x[:, best_features]
clf = LinearRegression().fit(xt, y)
Incorrect cross-validation (the features were selected using all of the data, so every test fold has already influenced the model):
for train, test in KFold(n_splits=10).split(xt):
    xtrain, xtest, ytrain, ytest = xt[train], xt[test], y[train], y[test]
    clf.fit(xtrain, ytrain)
    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')
plt.xlabel("Predicted")
plt.ylabel("Observed")
Correct cross-validation (feature selection is repeated inside each training fold):
scores = []
for train, test in KFold(n_splits=5).split(x):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]
    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]
    xtest = xtest[:, b.get_support()]
    clf.fit(xtrain, ytrain)
    scores.append(clf.score(xtest, ytest))
    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')
plt.xlabel("Predicted")
plt.ylabel("Observed")
print("CV Score is ", np.mean(scores))
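The difference between the two loops above can also be checked on synthetic data. The sketch below assumes a current scikit-learn, where `KFold` and `cross_val_score` live in `sklearn.model_selection`, and it uses a pipeline so that feature selection is refit on each training fold. The random data are made up, with no real relationship between `x` and `y`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
x = rng.randn(60, 200)   # synthetic: many features, none related to y
y = rng.randn(60)

# Leaky: select features on ALL the data, then cross-validate the model
xt = SelectKBest(f_regression, k=2).fit_transform(x, y)
leaky = cross_val_score(LinearRegression(), xt, y, cv=KFold(n_splits=5)).mean()

# Honest: the pipeline repeats the selection inside every training fold
pipe = make_pipeline(SelectKBest(f_regression, k=2), LinearRegression())
honest = cross_val_score(pipe, x, y, cv=KFold(n_splits=5)).mean()

print("leaky R^2:", leaky, "honest R^2:", honest)
```

Since `y` is pure noise here, any apparent predictive power in the leaky score would come from the selection step having seen the test folds.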
- Some very useful scipy modules
scipy.stats
scipy.integrate
scipy.signal
scipy.optimize
scipy.special
scipy.linalg
mtcars.loc['Maserati Bora']
Gets one row of the data.
The difference between any and all, and between their results
(mtcars.mpg >= 20).any()  # True
(mtcars > 0).all()  # one value per column: True False True True ...
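A tiny runnable sketch of the same distinction; the frame below is a made-up stand-in for `mtcars` with invented values.

```python
import pandas as pd

# Tiny stand-in for mtcars with made-up values
cars = pd.DataFrame(
    {"mpg": [21.0, 15.0], "hp": [110, 335], "delta": [0.5, -1.0]},
    index=["car_a", "car_b"],
)

any_mpg = (cars.mpg >= 20).any()   # a Series reduces to a single bool
all_cols = (cars > 0).all()        # a DataFrame reduces to one bool per column

print(any_mpg)
print(all_cols)
```

Applied to a Series, both methods return one value; applied to a DataFrame, they return one value per column unless `axis` says otherwise.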
- Plotting several pairwise relationships for comparison
from pandas.plotting import scatter_matrix
scatter_matrix(mtcars[['mpg', 'hp', 'cyl']], figsize = (10, 6), alpha = 1, diagonal='kde')
- Commonly used bs4 attributes
soup.head.contents
soup.head.children
soup.head.title
soup.head.title.string
for child in soup.head.descendants:
.stripped_strings
soup.find_all('a')
soup.find_all('a')[1].get('href')
- Working with JSON
a = {'a': 1, 'b': 2}  # a dictionary
s = json.dumps(a)  # s is a string containing a in JSON encoding
a2 = json.loads(s)  # read it back; the keys are now unicode strings
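A slightly fuller round trip, as a sketch: it shows that `dumps` produces a plain string, that `None` maps to JSON `null`, and that non-string keys come back as strings. The dictionaries are invented for the example.

```python
import json

a = {"a": 1, "b": [2, 3], "c": None}
s = json.dumps(a)     # a str in JSON encoding; None becomes null
a2 = json.loads(s)    # back to Python objects

# JSON object keys are always strings, so an int key does not survive
b2 = json.loads(json.dumps({1: "x"}))

print(s)
print(a2 == a, b2)
```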
- Create a pandas DataFrame from JSON
data = pd.DataFrame(wc, columns = ['match_number', 'location', 'datetime', 'home_team', 'away_team', 'winner', 'home_team_events', 'away_team_events'])
- Formatting times with pandas, although it is not clear to me which time formats can be parsed
data['gameDate']=pd.DatetimeIndex(data.datetime).date
data['gameTime']=pd.DatetimeIndex(data.datetime).time
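On the question of which formats parse: ISO 8601 strings are handled by `pd.to_datetime` without extra arguments, and other layouts can be described with an explicit `format`. The sketch below uses made-up timestamps and the `.dt` accessor as an alternative to `DatetimeIndex`.

```python
import pandas as pd

# Made-up timestamps in ISO 8601 form
data = pd.DataFrame({"datetime": ["2014-06-12T17:00:00", "2014-06-13T13:00:00"]})

parsed = pd.to_datetime(data["datetime"])   # ISO 8601 strings parse directly
data["gameDate"] = parsed.dt.date
data["gameTime"] = parsed.dt.time

# For a non-standard layout, spell the format out explicitly
custom = pd.to_datetime("12/06/2014 17:00", format="%d/%m/%Y %H:%M")

print(data)
print(custom)
```

Giving `format` explicitly also removes the day/month ambiguity in strings such as 12/06/2014.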
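- Several distribution functions
data = stats.binom.rvs(n=10, p=0.3, size=10000)  # binomial (repeated Bernoulli trials) random samples
y = stats.poisson.pmf(n, lam)  # Poisson distribution
y = stats.norm.pdf(x, 0, 1)  # normal distribution
y = stats.beta.pdf(x, a, b)  # beta distribution
lam = 0.5
x = np.arange(0, 15, 0.1)
y = lam * np.exp(-lam * x)  # exponential density written out by hand
stats.expon.rvs(scale=2, size=1000)  # exponential distribution samples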
- Datasets bundled with sklearn
from sklearn.datasets import load_boston
boston = load_boston()
- The statsmodels module
statsmodels is a Python module specifically for estimating statistical models (less machine-learning oriented than sklearn). It can estimate many types of statistical models; the focus here is linear regression.
e.g.:
import statsmodels.api as sm
X = sm.add_constant(X)  # add the intercept column before fitting
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
results.params.values
- A plot call with a fairly complete set of parameters
residData.plot(title='Residuals from least squares estimates across years', figsize=(15, 8), color=['blue' if x == 'OAK' else 'gray' for x in df.teamID])
- Example of solving linear regression with the matrix transpose
np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)
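The expression above is the normal-equations solution, beta = (X^T X)^(-1) X^T y. As a sketch on synthetic data (the intercept 2, slope 3 and noise level are invented), it can be compared with two numerically safer alternatives that avoid forming the inverse:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 50
x = rng.rand(n)
y = 2.0 + 3.0 * x + 0.01 * rng.randn(n)   # synthetic: intercept 2, slope 3

X = np.column_stack([np.ones(n), x])      # design matrix with intercept column

# Formula from the text: explicit inverse of X^T X
beta_inv = np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)

# Same linear system solved without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Dedicated least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv, beta_solve, beta_lstsq)
```

All three agree on a well-conditioned problem like this one; `solve` or `lstsq` is generally the better habit when X^T X is close to singular.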