1. Reading data in text format
import pandas as pd
df = pd.read_csv('ex1.csv')
print(df)
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
If the file has no header row (no column names), the default call misreads it, because the first data row is taken as the header:
df = pd.read_csv('ex2.csv')
print(df)
   1   2   3   4  hello
0  5   6   7   8  world
1  9  10  11  12    foo
df = pd.read_csv('ex2.csv', header=None)
print(df)
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
df = pd.read_csv('ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
print(df)
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
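The three ways of reading a header-less file can be compared side by side. The sketch below is a minimal illustration that assumes an in-memory string standing in for ex2.csv (the variable names raw, wrong, auto, and named are made up for this example), so it runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for ex2.csv: three data rows and no header line.
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# Default call: the first data row is consumed as the column names.
wrong = pd.read_csv(io.StringIO(raw))

# header=None: pandas assigns integer column labels 0..4.
auto = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: we supply the labels, and no row is treated as a header.
named = pd.read_csv(io.StringIO(raw), names=["a", "b", "c", "d", "message"])

print(len(wrong), len(auto), len(named))
```

The default call loses one row of data, which is why header=None or names is needed for files like this.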
You can make the message column (the fifth column) the index:
df = pd.read_csv('ex2.csv', names=['a', 'b', 'c', 'd', 'message'], index_col='message')
print(df)
print(df.index.name)
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
message
For a CSV with two key columns, pass a list to index_col to build a hierarchical index:
df = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])
print(df)
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16
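To show what the hierarchical index buys you, here is a small sketch. It assumes a shortened in-memory version of csv_mindex.csv (only four rows, values chosen to match the listing above), so the data and variable names are illustrative rather than the actual file.

```python
import io
import pandas as pd

# Hypothetical, shortened stand-in for csv_mindex.csv.
raw = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,9,10\n"
    "two,b,11,12\n"
)

df = pd.read_csv(io.StringIO(raw), index_col=["key1", "key2"])

# Both key columns are now levels of a MultiIndex.
print(list(df.index.names))

# Select every row under the outer key "one".
print(df.loc["one"])

# Select a single cell by the (outer, inner) key pair.
print(df.loc[("two", "b"), "value1"])
```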
For a whitespace-delimited text file, first look at the raw lines:
lines = list(open('ex3.txt'))
print(lines)
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']
Parse it with read_table() and a regular expression as the separator.
A regular expression tutorial:
https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000/00143193331387014ccd1040c814dee8b2164bb4f064cff000
df = pd.read_table('ex3.txt', sep=r'\s+')  # one or more whitespace characters
print(df)
print(df.columns)
print(df.index)
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
Index(['A', 'B', 'C'], dtype='object')
Index(['aaa', 'bbb', 'ccc', 'ddd'], dtype='object')
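The same parse can be tried without a file. This sketch assumes an in-memory two-row excerpt of ex3.txt and uses read_csv with sep=r"\s+", which behaves like read_table here. Because the header line has one field fewer than the data lines, pandas infers that the first column is the index.

```python
import io
import pandas as pd

# Hypothetical excerpt of ex3.txt: columns separated by a varying number of spaces.
raw = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
)

# r"\s+" matches one or more whitespace characters between fields.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

print(list(df.columns))
print(list(df.index))
```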
For a CSV whose first, third, and fourth lines are not needed, skip them with skiprows:
df = pd.read_csv('ex4.csv', skiprows=[0, 2, 3])  # skip lines 1, 3, and 4 (0-based indices 0, 2, 3)
print(df)
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
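The content of ex4.csv is not shown here, so the sketch below assumes a plausible layout in which the unwanted lines are comments starting with "#". Under that assumption, skiprows (by position) and the comment parameter (by marker) give the same result; the sample text and variable names are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for ex4.csv: lines 1, 3, and 4 are comments.
raw = (
    "# exported by a hypothetical tool\n"
    "a,b,c,d,message\n"
    "# second note\n"
    "# third note\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Skip by 0-based line position.
by_position = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3])

# Skip every line that starts with the comment character.
by_marker = pd.read_csv(io.StringIO(raw), comment="#")

print(by_position)
```

skiprows works for any unwanted line, while comment only helps when the lines share a marker character.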
For a CSV with missing values, empty fields and markers such as NA are read as NaN:
df = pd.read_csv('ex5.csv')
print(df)
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
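A short sketch of how missing values can be detected and customized. It assumes an in-memory file whose content is consistent with the printed ex5.csv result above (an NA marker in the first row and an empty field in the second). The na_values mapping that treats "foo" as missing is purely illustrative.

```python
import io
import pandas as pd

# Hypothetical stand-in for ex5.csv.
raw = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

df = pd.read_csv(io.StringIO(raw))

# Boolean mask of the missing cells.
print(df.isna())

# Treat extra sentinel strings as missing, per column (illustrative choice).
df2 = pd.read_csv(io.StringIO(raw), na_values={"message": ["foo"]})
print(df2["message"].isna().tolist())
```

Column c becomes float because an integer column cannot hold NaN by default.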
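2. Reading text files in chunks
For large files, you may want to read only part of the file or iterate over it piece by piece.
To read only the first few rows: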
df = pd.read_csv('ex6.csv', nrows=5)  # read the first 5 rows
print(df)
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
To read in chunks:
from pandas import DataFrame, Series
import pandas as pd
chunkers = pd.read_csv('ex6.csv', chunksize=1000)
tot = Series([], dtype='float64')
for piece in chunkers:  # chunkers is an iterable; each piece holds 1000 rows (the last one may hold fewer)
    tot = tot.add(piece['key'].value_counts(), fill_value=0)  # accumulate the count of each key value
tot = tot.sort_values(ascending=False)  # sort by count, largest first
print(tot[:10])
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
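The chunked counting pattern can be checked on synthetic data where the right answer is known in advance. This sketch assumes a made-up single-column file of 2400 rows held in memory; the key distribution and the names raw, sizes, and total are inventions for the example.

```python
import io
import pandas as pd

# Synthetic data: 2400 rows, keys repeating in the pattern A, B, B, C, C, C.
pattern = "ABBCCC"
raw = "key\n" + "\n".join(pattern[i % 6] for i in range(2400)) + "\n"

total = pd.Series(dtype="float64")  # running tally across chunks
sizes = []                          # number of rows seen in each chunk

for piece in pd.read_csv(io.StringIO(raw), chunksize=1000):
    sizes.append(len(piece))
    # fill_value=0 lets keys that are missing on one side count as zero.
    total = total.add(piece["key"].value_counts(), fill_value=0)

total = total.sort_values(ascending=False)
print(sizes)
print(total)
```

Only one chunk is held in memory at a time, so the tally works even when the whole file would not fit.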
3. Writing data out to text format
df = pd.read_csv('ex5.csv')
print(df)
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
df.to_csv('out.csv')
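Running this creates the file out.csv, with missing values written as empty fields. To write them as NULL instead:
df.to_csv('out.csv', na_rep='NULL')  # replace missing values with NULL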
You can also leave out the row and column labels:
df.to_csv('out.csv', index=False, header=False)
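To see what these options produce without opening out.csv, note that to_csv returns the CSV text as a string when no path is given. The sketch below uses a small made-up DataFrame (not the ex5.csv data) to show the effect of na_rep, index, and header.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one missing value in each of two columns.
df = pd.DataFrame(
    {
        "a": [1, 5],
        "c": [3.0, np.nan],
        "message": [np.nan, "world"],
    }
)

# No path argument: the CSV text is returned instead of written to disk.
with_header = df.to_csv(na_rep="NULL", index=False)
bare = df.to_csv(na_rep="NULL", index=False, header=False)

print(with_header)
print(bare)
```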
Saving a Series to a CSV file:
import numpy as np
dates = pd.date_range('1/1/2000', periods=7)
ts = Series(np.arange(7), index=dates)
print(ts)
ts.to_csv('tseries.csv')
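Reading the file back as a Series takes a little work with read_csv: there is no header row, and the first column must become the index. (Since pandas 0.24, Series.to_csv writes a header row by default, so pass header=False to get this layout.)
Older pandas versions offered a shortcut, Series.from_csv, which was deprecated in pandas 0.21 and removed in pandas 1.0: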
obj = Series.from_csv('tseries.csv', parse_dates=True)
print(obj)
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
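Series.from_csv is not available in pandas 1.0 and later, so on current versions the round trip has to go through read_csv. The following is one possible sketch, not the only way to do it: it writes to an in-memory buffer instead of tseries.csv and passes header=False explicitly so the file has no header row.

```python
import io
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=7)
ts = pd.Series(np.arange(7), index=dates)

# Write without a header row (in-memory buffer used in place of tseries.csv).
buf = io.StringIO()
ts.to_csv(buf, header=False)
buf.seek(0)

# No header, first column as a parsed date index, then collapse the
# single remaining column into a Series.
obj = pd.read_csv(buf, header=None, index_col=0, parse_dates=True).squeeze("columns")

print(obj)
```

The squeeze("columns") call is what turns the one-column DataFrame into a Series.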