Data Processing and Visualization (Machine Learning Algorithms: Principles and Practice)

Data Import and Memory Management

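1. Reading data table files

Since most systems today have several gigabytes of memory, small data tables are simple to handle: they can be read straight into memory and converted to a structured form.

The following example uses Python to read a data table file, store it in a matrix, and print the number of rows and columns of that matrix.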
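A minimal sketch of this step, assuming a tab-delimited file that contains only numbers. The helper name `load_table` and the sample data are hypothetical, and only the standard library is used; if NumPy is available, `numpy.array(matrix)` would turn the nested list into an array.

```python
import os
import tempfile


def load_table(path, delimiter="\t"):
    """Read a delimited text file into a list of rows, each a list of floats."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            rows.append([float(x) for x in line.split(delimiter)])
    return rows


# Write a small hypothetical table so the sketch runs on its own.
sample = "1.0\t2.0\t3.0\n4.0\t5.0\t6.0\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)
    table_path = tmp.name

matrix = load_table(table_path)
n_rows, n_cols = len(matrix), len(matrix[0])
print(n_rows, n_cols)  # prints: 2 3
os.remove(table_path)
```

Reading inside a `with` block closes the file even if a row fails to parse, which matters when many table files are loaded in a loop.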

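2. Object persistence

Sometimes we want to save data as an object. Python's cPickle module (pickle in Python 3) supports reading and writing objects.

The following example persists the data, already converted to a matrix, as an object file, and then reads the serialized file back.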
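A minimal sketch of the round trip, assuming Python 3, where the `pickle` module takes the place of `cPickle`. The helper names `save_object` and `load_object`, the file name, and the sample matrix are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical matrix standing in for the data read from the table file.
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]


def save_object(obj, path):
    """Serialize obj to a binary file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Read a pickled object back from a binary file."""
    with open(path, "rb") as f:
        return pickle.load(f)


pkl_path = os.path.join(tempfile.mkdtemp(), "matrix.pkl")
save_object(matrix, pkl_path)
restored = load_object(pkl_path)
print(restored == matrix)  # prints: True
```

The file must be opened in binary mode (`"wb"` and `"rb"`), because pickle writes bytes rather than text.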

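3. Efficiently reading large text files

When a text file is large (several GB or tens of GB, more than the available memory), it can be read and processed one line at a time instead of being loaded in full.

The following example reads the first 10 lines of a file.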
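A minimal sketch of line-by-line reading, assuming a UTF-8 text file. The helper name `read_head` and the generated sample file are hypothetical. A Python file object yields one line per iteration, so only the current line is held in memory.

```python
import itertools
import os
import tempfile


def read_head(path, n=10):
    """Return the first n lines of a file without loading the rest."""
    with open(path, "r", encoding="utf-8") as f:
        # islice stops pulling lines from the file after n of them
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Generate a hypothetical sample file so the sketch runs on its own.
big_path = os.path.join(tempfile.mkdtemp(), "big.txt")
with open(big_path, "w", encoding="utf-8") as f:
    for i in range(1000):
        f.write("line %d\n" % i)

head = read_head(big_path)
print(len(head))  # prints: 10
```

To process the whole file instead of only its head, the same pattern applies with a plain `for line in f:` loop, which keeps memory use independent of file size.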

Visualizing Tables and Linear Structures

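Visualizing Trees and Classification Structures

Because Matplotlib does not provide a dedicated API for drawing trees, the treePlotter module is used here.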

# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
import treePlotter as tp
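# Draw the tree: a nested dict in which each key is a node label and each value maps branch labels to subtrees or leaf strings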
myTree = {'root': {0: 'leaf node', 1: {'level 2': {0: 'leaf node', 1: 'leaf node'}},2:{'level2': {0: 'leaf node', 1: 'leaf node'}}}}
tp.createPlot(myTree)
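Before a tree can be laid out, a plotter needs to know how wide and how deep it is. The sketch below computes both for the nested-dict format used by myTree above. The function names `count_leaves` and `tree_depth` are hypothetical and are not taken from treePlotter; they only illustrate how such a structure can be traversed.

```python
def count_leaves(tree):
    """Count the leaves of a tree shaped like {node: {branch: subtree_or_leaf}}."""
    root = next(iter(tree))  # the single key at this level is the node label
    total = 0
    for child in tree[root].values():
        if isinstance(child, dict):
            total += count_leaves(child)  # subtree: recurse
        else:
            total += 1  # anything that is not a dict is a leaf
    return total


def tree_depth(tree):
    """Count the levels of internal nodes on the longest root-to-leaf path."""
    root = next(iter(tree))
    deepest = 0
    for child in tree[root].values():
        if isinstance(child, dict):
            depth = 1 + tree_depth(child)
        else:
            depth = 1
        deepest = max(deepest, depth)
    return deepest


my_tree = {'root': {0: 'leaf node',
                    1: {'level 2': {0: 'leaf node', 1: 'leaf node'}},
                    2: {'level2': {0: 'leaf node', 1: 'leaf node'}}}}
print(count_leaves(my_tree), tree_depth(my_tree))  # prints: 5 2
```

The leaf count can be used to divide the horizontal axis into equal slots and the depth to divide the vertical axis, so that every node gets a distinct position.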

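Visualizing Graphs and Network Structures

Graph and network structures are important data structures in neural networks and Bayesian networks. The complete structure is generally stored as a dict combined with lists.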
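A minimal sketch of dict-plus-list storage, assuming a small directed graph. The node names and the helper `edges` are hypothetical: each key is a node, and its value is the list of nodes it has an arc to.

```python
# Hypothetical directed graph: key = node, value = list of successor nodes.
graph = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": [],
}


def edges(g):
    """Flatten the dict-of-lists into a list of (source, target) arcs."""
    return [(src, dst) for src, targets in g.items() for dst in targets]


edge_list = edges(graph)
print(len(graph), len(edge_list))  # prints: 4 5
```

Looking up the neighbours of a node is a single dict access (`graph["B"]`), which is why this layout suits sparse graphs where most node pairs are not connected.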

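In algorithms, the storage is often simplified to adjacency-matrix form: a NumPy matrix stores the node coordinates, and the weight of each arc is computed with the distance formula. For visualization, the x values and the y values can be collected into two lists and drawn in the figure.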

# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

# nodelist = ["city1","city2","city3","city4","city5","city6","city7","city8"]
# Despite its name, dist holds the (x, y) coordinates of the nodes, one row per node
dist = np.array([[0.1,0.1],[0.9,0.5],[0.9,0.1],[0.45,0.9],[0.9,0.8],[0.7,0.9],[0.1,0.45],[0.45,0.1]])
m, n = dist.shape  # m nodes, n = 2 coordinates per node

# Plot the nodes
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(dist[:, 0], dist[:, 1], c='blue', marker='o', s=100)
# Label each node with its coordinates
for point in dist.tolist():
    ax.annotate("(" + str(point[0]) + ", " + str(point[1]) + ")", xy=(point[0], point[1]))
# Collect the x values and y values into two flat lists, in node order
xlist = []
ylist = []
for px, py in zip(dist[:, 0].tolist(), dist[:, 1].tolist()):
    xlist.append(px)
    ylist.append(py)
# print(xlist)
# print(ylist)
# Draw a red path through the nodes in list order
ax.plot(xlist, ylist, 'r')
plt.show()
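The distance formula mentioned earlier can be sketched as follows. This assumes a complete graph in which every pair of nodes is joined by an arc whose weight is the Euclidean distance between the two points; the helper name `distance_matrix` is hypothetical, and plain lists are used in place of a NumPy matrix so the sketch needs only the standard library.

```python
import math

# The same node coordinates as in the plotting example above.
points = [[0.1, 0.1], [0.9, 0.5], [0.9, 0.1], [0.45, 0.9],
          [0.9, 0.8], [0.7, 0.9], [0.1, 0.45], [0.45, 0.1]]


def distance_matrix(pts):
    """Build a symmetric matrix of Euclidean distances between all point pairs."""
    n = len(pts)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            length = math.hypot(pts[i][0] - pts[j][0], pts[i][1] - pts[j][1])
            d[i][j] = d[j][i] = length  # undirected: same weight both ways
    return d


weights = distance_matrix(points)
print(round(weights[1][2], 3))  # distance between (0.9, 0.5) and (0.9, 0.1)
```

Entry `weights[i][j]` is the weight of the arc between node i and node j, and the diagonal stays at zero because a node has no distance to itself.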