pandas file reading and storage
Most data lives in files, so pandas provides extensive I/O support. The pandas API can read and write many file formats, such as CSV, SQL, XLS, JSON, and HDF5.
Note: HDF5 and CSV are the most commonly used formats.
1 CSV
1.1 read_csv
- pandas.read_csv(filepath_or_buffer, sep=',', usecols=None)
  - filepath_or_buffer: file path
  - sep: separator, "," by default
  - usecols: names of the columns to read, given as a list
- Example: read the stock data used earlier
# Read the file, keeping only the 'open' and 'close' columns
data = pd.read_csv("./data/stock_day.csv", usecols=['open', 'close'])
open close
2018-02-27 23.53 24.16
2018-02-26 22.80 23.53
2018-02-23 22.88 22.82
2018-02-22 22.25 22.28
2018-02-14 21.49 21.92
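As a self-contained sketch of the sep and usecols parameters, the following reads a small in-memory CSV instead of stock_day.csv. The semicolon-separated text, the "date" and "high" column names, and the values are made up for illustration; index_col is an extra read_csv parameter not covered above.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a file on disk.
csv_text = (
    "date;open;high;close\n"
    "2018-02-27;23.53;25.88;24.16\n"
    "2018-02-26;22.80;23.78;23.53\n"
)

data = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                             # the sample uses ";" instead of the default ","
    usecols=["date", "open", "close"],   # the "high" column is skipped
    index_col="date",                    # use the date column as the row index
)
print(data)
```

Only the columns listed in usecols are parsed, which can save memory when a file has many columns.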
1.2 to_csv
- DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, mode='w', encoding=None)
  - path_or_buf: file path
  - sep: separator, "," by default
  - columns: the columns to write
  - header: boolean or list of strings, default True; whether to write the column names
  - index: whether to write the row index
  - mode: 'w' to overwrite, 'a' to append
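The mode and header parameters are typically used together when appending. A minimal sketch, assuming a temporary file and two made-up frames, writes the header once and then appends rows without repeating it:

```python
import os
import tempfile

import pandas as pd

# Made-up frames for illustration.
first = pd.DataFrame({"open": [23.53, 22.80]})
second = pd.DataFrame({"open": [22.88]})

path = os.path.join(tempfile.mkdtemp(), "append_demo.csv")  # hypothetical file name

first.to_csv(path, index=False)                            # mode='w' creates or overwrites
second.to_csv(path, index=False, mode="a", header=False)   # append the rows only

combined = pd.read_csv(path)
print(combined)
```

Leaving header=True on the second call would write the column names again in the middle of the file.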
- Example: save the stock data that was read
  - Save the data in the 'open' column, then read the file back and view the result
# Save only the first 10 rows so the result is easy to inspect
data[:10].to_csv("./data/test.csv", columns=['open'])
# Read the file back and check the result
pd.read_csv("./data/test.csv")
Unnamed: 0 open
0 2018-02-27 23.53
1 2018-02-26 22.80
2 2018-02-23 22.88
3 2018-02-22 22.25
4 2018-02-14 21.49
5 2018-02-13 21.40
6 2018-02-12 20.70
7 2018-02-09 21.20
8 2018-02-08 21.79
9 2018-02-07 22.69
Notice that the index was written to the file as an ordinary column of data. To leave it out, pass index=False and save again (the default mode='w' overwrites the previous file).
# index=False: the index is not written as a column
data[:10].to_csv("./data/test.csv", columns=['open'], index=False)
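If the index should be kept rather than dropped, an alternative is to write it and restore it when reading with index_col. This is a sketch on made-up data using an in-memory buffer, not the tutorial's stock file; index_col is a read_csv parameter not listed above.

```python
import io

import pandas as pd

# Made-up frame with dates as the row index.
frame = pd.DataFrame({"open": [23.53, 22.80]}, index=["2018-02-27", "2018-02-26"])

buffer = io.StringIO()
frame.to_csv(buffer)        # index=True by default, written as the first column
buffer.seek(0)

plain = pd.read_csv(buffer)                   # index comes back as "Unnamed: 0"
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # first column becomes the index again

print(plain.columns.tolist())
print(restored)
```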
2 HDF5
2.1 read_hdf and to_hdf
Reading and storing HDF5 files requires a key; the value stored under that key is the DataFrame.
- pandas.read_hdf(path_or_buf, key=None, **kwargs)
  Read data from an h5 file
  - path_or_buf: file path
  - key: the key to read
  - return: the selected object
- DataFrame.to_hdf(path_or_buf, key, **kwargs)
2.2 Case
- Read file
day_close = pd.read_hdf("./data/day_close.h5")
If reading fails with an error about a missing dependency, install the tables module (PyTables), which pandas needs in order to read HDF5 files:
pip install tables
- Store files
day_close.to_hdf("./data/test.h5", key="day_close")
When reading the file back, specify the key that was used when storing (it can only be omitted when the file holds a single object):
new_close = pd.read_hdf("./data/test.h5", key="day_close")
Note: HDF5 is the preferred storage format
- HDF5 supports compression when storing. blosc is one of the compression libraries pandas supports and is generally among the fastest; compression is requested with the complevel and complib arguments of to_hdf.
- Compression improves disk utilization and saves space
- HDF5 is also cross-platform and can be easily migrated to Hadoop.
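A minimal sketch of storing with blosc compression and reading back by key. It assumes the optional tables package is installed and skips the round trip otherwise; the frame, the temporary path, and the compression level are made up for illustration.

```python
import importlib.util
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"close": [24.16, 23.53, 22.82]})  # made-up values
restored = None

# to_hdf and read_hdf need the optional PyTables package ("pip install tables").
if importlib.util.find_spec("tables") is not None:
    path = os.path.join(tempfile.mkdtemp(), "compressed.h5")  # hypothetical file name
    # complevel ranges from 0 (no compression) to 9 (highest).
    frame.to_hdf(path, key="day_close", complevel=9, complib="blosc")
    restored = pd.read_hdf(path, key="day_close")
    print(restored)
```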
3 JSON
JSON is a widely used data-exchange format, common in front-end/back-end communication and also often chosen for storage, so it is worth knowing how pandas reads and writes it.
3.1 read_json
- pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)
  - Converts JSON data into a pandas DataFrame (the default return type)
  - orient: string, indicates the expected layout of the JSON string
    - 'split': dict like {index -> [index], columns -> [columns], data -> [values]}
      - the index, the column names, and the data are stored as three separate parts
    - 'records': list like [{column -> value}, ... , {column -> value}]
      - one {column: value} object per row
    - 'index': dict like {index -> {column -> value}}
      - stored as index: {column: value}
    - 'columns': dict like {column -> {index -> value}}, the default for a DataFrame
      - stored as column: {index: value}
    - 'values': just the values array
      - the values are stored directly, without labels
  - lines: boolean, default False
    - read the file as one JSON object per line
  - typ: default 'frame'; the type of object to return, 'frame' for a DataFrame or 'series' for a Series
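To make the orient layouts concrete, here is a small sketch that parses the same made-up two-row table from three different layouts. The JSON strings are hand-written for illustration and wrapped in StringIO so that read_json treats them as file-like input.

```python
import io

import pandas as pd

# The same table (rows "x" and "y", columns "a" and "b") in three layouts.
split_json = '{"index": ["x", "y"], "columns": ["a", "b"], "data": [[1, 3], [2, 4]]}'
index_json = '{"x": {"a": 1, "b": 3}, "y": {"a": 2, "b": 4}}'
columns_json = '{"a": {"x": 1, "y": 2}, "b": {"x": 3, "y": 4}}'

from_split = pd.read_json(io.StringIO(split_json), orient="split")
from_index = pd.read_json(io.StringIO(index_json), orient="index")
from_columns = pd.read_json(io.StringIO(columns_json), orient="columns")

print(from_split)
```

The orient passed to read_json has to match how the file was written; a mismatched orient either fails or produces a transposed or otherwise wrong table.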
3.2 read_json case
- Data introduction
A news-headline sarcasm dataset is used here, in JSON format. Each record has three fields: is_sarcastic (1 if the headline is sarcastic, otherwise 0), headline (the headline of the news article), and article_link (a link to the original news article). The storage format is:
{"article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5", "headline": "former versace store clerk sues over secret 'black code' for minority shoppers", "is_sarcastic": 0}
{"article_link": "https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365", "headline": "the 'roseanne' revival catches up to our thorny political mood, for better and worse", "is_sarcastic": 0}
- Read
orient specifies the JSON layout of the file, and lines=True treats each line of the file as one sample.
json_read = pd.read_json("./data/Sarcasm_Headlines_Dataset.json", orient="records", lines=True)
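The same call can be tried without the dataset file. The sketch below uses two made-up records (placeholder URLs and headlines, not taken from the dataset) in the same line-delimited layout, held in an in-memory buffer.

```python
import io

import pandas as pd

# Two hypothetical records, one JSON object per line.
json_lines = (
    '{"article_link": "https://example.com/a", "headline": "first headline", "is_sarcastic": 0}\n'
    '{"article_link": "https://example.com/b", "headline": "second headline", "is_sarcastic": 1}\n'
)

sample = pd.read_json(io.StringIO(json_lines), orient="records", lines=True)
print(sample)
```

Each line becomes one row, and the keys of the objects become the columns.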
3.3 to_json
- DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
  - Stores a pandas object in JSON format
  - path_or_buf: file path
  - orient: the JSON layout to write, one of {'split', 'records', 'index', 'columns', 'values'}
  - lines: write one object per line (only valid with orient='records')
3.4 Case
- Store files
json_read.to_json("./data/test.json", orient='records')
result
[{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0},{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1},{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/advancing-the-worlds-women_b_6810038.html","headline":"advancing the world's women","is_sarcastic":0},....]
- Set the lines parameter to True
json_read.to_json("./data/test.json", orient='records', lines=True)
result
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0}
{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1}
{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0}...
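The two outputs hold the same records and differ only in framing. A short sketch on a made-up frame shows this; with path_or_buf left as None, to_json returns the JSON text as a string instead of writing a file.

```python
import json

import pandas as pd

# Made-up frame for illustration.
frame = pd.DataFrame(
    {"headline": ["first headline", "second headline"], "is_sarcastic": [0, 1]}
)

as_array = frame.to_json(orient="records")               # a single JSON array
as_lines = frame.to_json(orient="records", lines=True)   # one JSON object per line

array_records = json.loads(as_array)
line_records = [json.loads(line) for line in as_lines.splitlines() if line]

print(array_records == line_records)
```

The line-delimited form can be read or appended one record at a time, which is convenient for large files.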
4 Summary
- Reading and storing CSV, HDF5, and JSON files with pandas
  - Reading: pandas.read_**()
  - Storing: object.to_**()