pandas file reading and storage

Most of our data lives in files, so pandas provides extensive IO support. The pandas API can read and write many file formats, such as CSV, SQL, XLS, JSON, and HDF5.

Note: HDF5 and CSV are the most commonly used formats.

1 CSV

1.1 read_csv

pandas.read_csv(filepath_or_buffer, sep=',', usecols=None)
- filepath_or_buffer: file path
- sep: separator, "," by default
- usecols: names of the columns to read, given as a list

Example: read the stock data used earlier

import pandas as pd

# Read the file, keeping only the 'open' and 'close' columns
data = pd.read_csv("./data/stock_day.csv", usecols=['open', 'close'])

            open    close
2018-02-27    23.53    24.16
2018-02-26    22.80    23.53
2018-02-23    22.88    22.82
2018-02-22    22.25    22.28
2018-02-14    21.49    21.92
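
The sep and usecols parameters can be tried without the stock file by reading from an in-memory buffer. The sketch below is only an illustration: the semicolon-separated text, the date and high column names, and the high values are assumptions made up for the example, not the contents of stock_day.csv.

```python
from io import StringIO

import pandas as pd

# Hypothetical semicolon-separated text standing in for a stock file.
raw = StringIO(
    "date;open;high;close\n"
    "2018-02-27;23.53;25.88;24.16\n"
    "2018-02-26;22.80;23.78;23.53\n"
)

# sep sets the delimiter; usecols keeps only the listed columns.
# index_col="date" turns the date column into the row index.
subset = pd.read_csv(
    raw, sep=";", usecols=["date", "open", "close"], index_col="date"
)
print(subset)
```

The high column is dropped because it is not listed in usecols.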

1.2 to_csv

DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, mode='w', encoding=None)
- path_or_buf: file path
- sep: separator, "," by default
- columns: the columns to write
- header: boolean or list of strings, default True; whether to write the column names
- index: whether to write the row index
- mode: 'w' overwrites the file, 'a' appends to it

Example: save the stock data read above
- Save the 'open' column, then read the file back and view the result

# Save the first 10 rows so the result is easy to inspect
data[:10].to_csv("./data/test.csv", columns=['open'])

# Read the file back and view the result
pd.read_csv("./data/test.csv")

     Unnamed: 0    open
0    2018-02-27    23.53
1    2018-02-26    22.80
2    2018-02-23    22.88
3    2018-02-22    22.25
4    2018-02-14    21.49
5    2018-02-13    21.40
6    2018-02-12    20.70
7    2018-02-09    21.20
8    2018-02-08    21.79
9    2018-02-07    22.69

Notice that the index was written to the file as an ordinary column of data. To leave it out, pass index=False and save the file again; the default mode='w' overwrites the existing file.

# index=False: do not write the index values as a column of data
data[:10].to_csv("./data/test.csv", columns=['open'], index=False)
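
The index and mode parameters can be compared side by side. The following is a minimal sketch under stated assumptions: the small DataFrame and the temporary file path are made up for the example and only stand in for the stock data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame standing in for the stock data, indexed by date.
data = pd.DataFrame(
    {"open": [23.53, 22.80, 22.88], "close": [24.16, 23.53, 22.82]},
    index=["2018-02-27", "2018-02-26", "2018-02-23"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")

    # Option 1: keep the index in the file and restore it when reading.
    data.to_csv(path, columns=["open"])
    with_index = pd.read_csv(path, index_col=0)

    # Option 2: drop the index, then append more rows with mode="a".
    # header=False stops the column names being written a second time.
    data[:2].to_csv(path, columns=["open"], index=False)
    data[2:].to_csv(path, columns=["open"], index=False, mode="a", header=False)
    appended = pd.read_csv(path)

print(with_index)
print(appended)
```

Passing index_col=0 to read_csv is an alternative to index=False when the index carries information worth keeping, such as the dates here.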

2 HDF5

2.1 read_hdf and to_hdf

An HDF5 file stores each DataFrame under a key. The key is given when storing, and also when reading unless the file holds only one object.

pandas.read_hdf(path_or_buf, key=None, **kwargs)

Read data from an h5 file
- path_or_buf: file path
- key: the key to read
- return: the selected object

DataFrame.to_hdf(path_or_buf, key, **kwargs)

2.2 Case

Read file

day_close = pd.read_hdf("./data/day_close.h5")

If reading fails because the tables module is missing, install it; pandas needs it to read HDF5 files:

pip install tables

Store files

day_close.to_hdf("./data/test.h5", key="day_close")

When reading again, you need to specify the name of the key

new_close = pd.read_hdf("./data/test.h5", key="day_close")

Note: HDF5 file storage is preferred

- HDF5 supports compression when storing. pandas supports blosc, a fast compressor, through the complib argument of to_hdf; compression is applied only when complevel is set.
- Compression improves disk utilization and saves space.
- HDF5 is also cross-platform and can be easily migrated to Hadoop.
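
Keys and compression can be combined in one file. The sketch below is an illustration under stated assumptions: the two small DataFrames and the day_open key are hypothetical, and the code skips the HDF5 step when the tables module is not installed.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical prices; the real contents of day_close.h5 are not shown.
day_close = pd.DataFrame({"close": [24.16, 23.53, 22.82]})
day_open = pd.DataFrame({"open": [23.53, 22.80, 22.88]})

restored = None
if importlib.util.find_spec("tables") is None:
    print("The tables module is missing; run `pip install tables` first")
else:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.h5")

        # Two DataFrames in one file, each under its own key.
        # complevel turns compression on; complib selects blosc.
        day_close.to_hdf(path, key="day_close", complevel=9, complib="blosc")
        day_open.to_hdf(path, key="day_open", complevel=9, complib="blosc")

        # The file now holds two objects, so the key must be given.
        restored = pd.read_hdf(path, key="day_close")

    print(restored)
```

The second to_hdf call adds to the file rather than replacing it because to_hdf opens the file in append mode by default.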

3 JSON

JSON is a common data exchange format, often used between front ends and back ends, and it is sometimes chosen for storage as well. So we need to know how pandas reads and stores JSON.

3.1 read_json

pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)
- Converts JSON into a pandas DataFrame (the default)
- orient: string, the expected layout of the JSON string
  - 'split': dict like {index -> [index], columns -> [columns], data -> [values]}
    - the index, the column names, and the data are kept as three separate parts
  - 'records': list like [{column -> value}, ... , {column -> value}]
    - one {column: value} object per row
  - 'index': dict like {index -> {column -> value}}
    - laid out as index: {column: value}
  - 'columns': dict like {column -> {index -> value}}, the default format for a DataFrame
    - laid out as column: {index: value}
  - 'values': just the values array
    - the values only, with no labels
- lines: boolean, default False
  - read the file as one JSON object per line
- typ: default 'frame'; whether to return a Series ('series') or a DataFrame ('frame')
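
The five orient layouts are easier to compare on a tiny DataFrame. The following sketch is only an illustration: the frame and its labels are made up, and to_json is used here simply to produce a string in each layout for read_json to parse.

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical two-row frame with labelled rows and columns.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

layouts = {}   # parsed JSON for each orient
restored = {}  # DataFrame read back with the same orient
for orient in ["split", "records", "index", "columns", "values"]:
    text = df.to_json(orient=orient)
    layouts[orient] = json.loads(text)
    # read_json must be told the same orient that produced the string.
    restored[orient] = pd.read_json(StringIO(text), orient=orient)
    print(orient, "->", text)
```

The 'records' and 'values' layouts do not store the row labels, so the DataFrames read back from them get a default integer index.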

3.2 read_json case

Data introduction

A news headline sarcasm data set in JSON format is used here. is_sarcastic: 1 if the headline is sarcastic, otherwise 0; headline: the title of the news report; article_link: link to the original news article. The storage format is:

{"article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5", "headline": "former versace store clerk sues over secret 'black code' for minority shoppers", "is_sarcastic": 0}
{"article_link": "https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365", "headline": "the 'roseanne' revival catches up to our thorny political mood, for better and worse", "is_sarcastic": 0}

Read

orient specifies the layout of the stored JSON, and lines=True treats each line of the file as one sample

json_read = pd.read_json("./data/Sarcasm_Headlines_Dataset.json", orient="records", lines=True)
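
Line-delimited reading can be tried without the data set by using an in-memory buffer. In this sketch the two records, including their example.com links and headlines, are hypothetical; only the three field names come from the data set described above.

```python
from io import StringIO

import pandas as pd

# Two hypothetical records, one JSON object per line.
raw = StringIO(
    '{"article_link": "https://example.com/a", "headline": "first headline", "is_sarcastic": 0}\n'
    '{"article_link": "https://example.com/b", "headline": "second headline", "is_sarcastic": 1}\n'
)

# lines=True parses each line as its own record (one row per line).
json_read = pd.read_json(raw, orient="records", lines=True)
print(json_read[["headline", "is_sarcastic"]])
```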

3.3 to_json

DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
- Stores a pandas object in JSON format
- path_or_buf=None: file path
- orient: the layout of the stored JSON, one of {'split', 'records', 'index', 'columns', 'values'}
- lines: store one object per line

3.4 Case

Store files

json_read.to_json("./data/test.json", orient='records')

result

[{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0},{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1},{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/advancing-the-worlds-women_b_6810038.html","headline":"advancing the world's women","is_sarcastic":0},....]

Set the lines parameter to True

json_read.to_json("./data/test.json", orient='records', lines=True)

result

{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0}
{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1}
{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0}...
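
The difference between the two results can be checked on a small frame. This is a minimal sketch with made-up headlines; it also shows that pandas accepts lines=True only together with orient='records'.

```python
import json

import pandas as pd

# Hypothetical two-row frame with the same kind of columns as the data set.
df = pd.DataFrame(
    {"headline": ["first headline", "second headline"], "is_sarcastic": [0, 1]}
)

# Without lines: a single JSON array holding every record.
as_array = df.to_json(orient="records")

# With lines=True: one JSON object per line, no enclosing array.
as_lines = df.to_json(orient="records", lines=True)
rows = [json.loads(line) for line in as_lines.splitlines() if line]

# Any other orient combined with lines=True raises ValueError.
try:
    df.to_json(orient="columns", lines=True)
    rejected = False
except ValueError:
    rejected = True

print(as_array)
print(as_lines)
```

When path_or_buf is omitted, as here, to_json returns the JSON as a string instead of writing a file.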

4 Summary

Reading and storing CSV, HDF5, and JSON files with pandas
- pd.read_*() reads a file into a pandas object
- DataFrame.to_*() stores the object to a file