pandas file reading and storage


Most data lives in files, so pandas provides a rich set of IO functions. The pandas API supports many file formats, including CSV, SQL, XLS, JSON, and HDF5.

Note: HDF5 and CSV are the formats used most often.

1 CSV

1.1 read_csv

  • pandas.read_csv(filepath_or_buffer, sep=',', usecols=None)

    • filepath_or_buffer: path to the file
    • sep: field separator, "," by default
    • usecols: names of the columns to read, given as a list
  • Example: read the stock data used earlier

import pandas as pd

# Read the file, keeping only the 'open' and 'close' columns
data = pd.read_csv("./data/stock_day.csv", usecols=['open', 'close'])

            open    close
2018-02-27    23.53    24.16
2018-02-26    22.80    23.53
2018-02-23    22.88    22.82
2018-02-22    22.25    22.28
2018-02-14    21.49    21.92
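The sep and usecols parameters can be tried without the stock file. The following is a minimal sketch that assumes a semicolon-separated text held in memory; the column name "volume" and its values are invented for illustration and are not part of the original data set.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV text; "volume" and its values are made up
csv_text = (
    "date;open;close;volume\n"
    "2018-02-27;23.53;24.16;100\n"
    "2018-02-26;22.80;23.53;200\n"
)

# sep tells pandas the fields are separated by ";" instead of ","
# usecols keeps only the listed columns; index_col turns "date" into the index
sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    usecols=["date", "open", "close"],
    index_col="date",
)

print(sample)
```

The "volume" column is never loaded, which saves memory when a file has many columns that are not needed.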

1.2 to_csv

  • DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, mode='w', encoding=None)

    • path_or_buf: path to the output file
    • sep: field separator, "," by default
    • columns: the columns to write
    • header: boolean or list of strings, default True; whether to write the column names
    • index: whether to write the row index
    • mode: 'w' overwrites the file, 'a' appends to it
  • Example: save the stock data read above

    • Save the 'open' column, then read the file back and view the result
# Save only the first 10 rows so the result is easy to inspect
data[:10].to_csv("./data/test.csv", columns=['open'])
# Read the file back and view the result
pd.read_csv("./data/test.csv")

     Unnamed: 0    open
0    2018-02-27    23.53
1    2018-02-26    22.80
2    2018-02-23    22.88
3    2018-02-22    22.25
4    2018-02-14    21.49
5    2018-02-13    21.40
6    2018-02-12    20.70
7    2018-02-09    21.20
8    2018-02-08    21.79
9    2018-02-07    22.69

Notice that the index was written to the file as an ordinary column. To leave it out, pass index=False and save the file again.

# index=False: do not write the index as a column
data[:10].to_csv("./data/test.csv", columns=['open'], index=False)
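The effect of index and mode can be checked end to end with a small frame. This is only a sketch: it uses a temporary directory instead of ./data, and the two-row frame is a stand-in for the stock data.

```python
import os
import tempfile

import pandas as pd

# Two rows standing in for the stock data used in this article
data = pd.DataFrame(
    {"open": [23.53, 22.80], "close": [24.16, 23.53]},
    index=["2018-02-27", "2018-02-26"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")

    # Default index=True: the index is written as an unnamed first column
    data.to_csv(path, columns=["open"])
    with_index = pd.read_csv(path)
    # Alternative to dropping the index: read the first column back as the index
    restored = pd.read_csv(path, index_col=0)

    # index=False: only the selected column is written
    data.to_csv(path, columns=["open"], index=False)
    without_index = pd.read_csv(path)

    # mode='a' appends; header=False avoids writing the column names twice
    data.to_csv(path, columns=["open"], index=False, mode="a", header=False)
    appended = pd.read_csv(path)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
print(len(appended))
```

Reading with index_col=0 is useful when the index carries information, such as the dates here, and should survive the round trip.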

2 HDF5

2.1 read_hdf and to_hdf

Reading and writing HDF5 files requires a key; each key identifies one stored object, such as a DataFrame.

  • pandas.read_hdf(path_or_buf, key=None, **kwargs)

    Read data from an h5 file

    • path_or_buf: path to the file
    • key: the key to read
    • return: the selected object
  • DataFrame.to_hdf(path_or_buf, key, **kwargs)

2.2 Case

  • Read file
day_close = pd.read_hdf("./data/day_close.h5")

If reading fails with an error about a missing dependency, install the tables module (PyTables), which pandas needs in order to read HDF5 files:

pip install tables

  • Store files
day_close.to_hdf("./data/test.h5", key="day_close")

When reading the file back, specify the key it was stored under:

new_close = pd.read_hdf("./data/test.h5", key="day_close")
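To see why the key matters, the sketch below stores two objects in one HDF5 file and reads one of them back. It assumes PyTables may or may not be installed, so the calls are wrapped in a check for ImportError; the key name "day_open" and the small frame are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame; the real day_close.h5 is not needed for this sketch
prices = pd.DataFrame({"open": [23.53, 22.80], "close": [24.16, 23.53]})

restored = None
stored_keys = []
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.h5")
    try:
        # mode='w' creates the file; mode='a' adds a second object to it
        prices[["open"]].to_hdf(path, key="day_open", mode="w")
        prices[["close"]].to_hdf(path, key="day_close", mode="a")

        # With more than one object in the file, the key selects which to read
        restored = pd.read_hdf(path, key="day_close")

        # HDFStore lists the keys present in the file
        with pd.HDFStore(path, mode="r") as store:
            stored_keys = sorted(store.keys())
    except ImportError:
        print("PyTables is not installed; run `pip install tables` first")

print(stored_keys)
print(restored)
```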

Note: HDF5 is the preferred storage format

  • HDF5 supports compression when storing. pandas exposes this through the complevel and complib parameters of to_hdf; blosc is one of the supported libraries and is generally fast. Compression is off unless complevel is set.
  • Compression improves disk utilization and saves space
  • HDF5 is also cross-platform and can be easily migrated to Hadoop.
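As a rough illustration of the compression point, the sketch below writes the same frame with and without blosc compression and prints both file sizes. The repetitive data is invented so that compression has something to work with, and the actual sizes depend on the data and the installed PyTables build, so no particular ratio is claimed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical, highly repetitive data (not the article's data set)
day_close = pd.DataFrame({"close": [24.16, 23.53] * 50_000})

sizes = {}
restored = None
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.h5")
    packed_path = os.path.join(tmp, "packed.h5")
    try:
        # No complevel: data is stored uncompressed
        day_close.to_hdf(plain_path, key="day_close", mode="w")
        # complevel (0-9) turns compression on; complib chooses the library
        day_close.to_hdf(
            packed_path, key="day_close", mode="w", complevel=9, complib="blosc"
        )
        sizes = {
            "plain": os.path.getsize(plain_path),
            "packed": os.path.getsize(packed_path),
        }
        # Reading a compressed file needs no extra arguments
        restored = pd.read_hdf(packed_path, key="day_close")
    except ImportError:
        print("PyTables is not installed; run `pip install tables` first")

print(sizes)
```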

3 JSON

JSON is a common data exchange format, often used between front ends and back ends, and it is also a common choice for storage. So we need to know how pandas reads and writes JSON.

3.1 read_json

  • pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)

    • Converts JSON into a pandas object, a DataFrame by default
    • orient: string, the expected layout of the JSON string
      • 'split': dict like {index -> [index], columns -> [columns], data -> [values]}
        • the index, the column names, and the data are stored as three separate parts
      • 'records': list like [{column -> value}, ... , {column -> value}]
        • one {column: value} object per row
      • 'index': dict like {index -> {column -> value}}
        • stored as index: {column: value}
      • 'columns': dict like {column -> {index -> value}}, the default for a DataFrame
        • stored as column: {index: value}
      • 'values': just the values array
        • only the raw values, without index or column names
    • lines: boolean, default False
      • read the file as one JSON object per line
    • typ: default 'frame'; whether to return a Series ('series') or a DataFrame ('frame')
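The orient layouts above are easier to compare side by side. The sketch below writes the same two rows in four layouts and reads each one with the matching orient value; the JSON strings and the headlines "first" and "second" are made up for illustration.

```python
import io

import pandas as pd

# The same two rows, written by hand in four different layouts
layouts = {
    "records": '[{"headline": "first", "is_sarcastic": 0},'
               ' {"headline": "second", "is_sarcastic": 1}]',
    "split": '{"index": [0, 1], "columns": ["headline", "is_sarcastic"],'
             ' "data": [["first", 0], ["second", 1]]}',
    "index": '{"0": {"headline": "first", "is_sarcastic": 0},'
             ' "1": {"headline": "second", "is_sarcastic": 1}}',
    "columns": '{"headline": {"0": "first", "1": "second"},'
               ' "is_sarcastic": {"0": 0, "1": 1}}',
}

# orient must match the layout of the text being read
frames = {
    orient: pd.read_json(io.StringIO(text), orient=orient)
    for orient, text in layouts.items()
}

for orient, frame in frames.items():
    print(orient, frame["headline"].tolist(), frame["is_sarcastic"].tolist())
```

The text is wrapped in io.StringIO because recent pandas versions discourage passing a raw JSON string directly to read_json.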

3.2 read_json case

  • Data introduction

A news headline sarcasm data set in JSON format is used here. Each record has three fields. is_sarcastic: 1 if the headline is sarcastic, otherwise 0; headline: the headline of the news article; article_link: link to the original news article. The storage format is:

{"article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5", "headline": "former versace store clerk sues over secret 'black code' for minority shoppers", "is_sarcastic": 0}
{"article_link": "https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365", "headline": "the 'roseanne' revival catches up to our thorny political mood, for better and worse", "is_sarcastic": 0}
  • Read

orient specifies the layout of the stored JSON, and lines=True treats each line of the file as one record

json_read = pd.read_json("./data/Sarcasm_Headlines_Dataset.json", orient="records", lines=True)
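The role of lines=True can be checked on a two-line sample without downloading the data set. This is a sketch with invented headlines; it also shows that the same text is rejected when lines is left at its default, because the whole input is then parsed as a single JSON document.

```python
import io

import pandas as pd

# Two JSON objects, one per line (JSON Lines); the headlines are made up
lines_text = (
    '{"headline": "first", "is_sarcastic": 0}\n'
    '{"headline": "second", "is_sarcastic": 1}\n'
)

# lines=True parses each line as a separate record
sample = pd.read_json(io.StringIO(lines_text), orient="records", lines=True)

# Without lines=True the input is not one valid JSON document
rejected = False
try:
    pd.read_json(io.StringIO(lines_text), orient="records")
except ValueError:
    rejected = True

print(sample)
print(rejected)
```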

3.3 to_json

  • DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
    • Stores a pandas object in JSON format
    • path_or_buf: path to the output file
    • orient: layout of the stored JSON, one of {'split', 'records', 'index', 'columns', 'values'}
    • lines: write one object per line

3.4 Case

  • Store files
json_read.to_json("./data/test.json", orient='records')

result

[{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0},{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1},{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/advancing-the-worlds-women_b_6810038.html","headline":"advancing the world's women","is_sarcastic":0},....]
  • Set the lines parameter to True
json_read.to_json("./data/test.json", orient='records', lines=True)

result

{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0}
{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1}
{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0}...
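The difference between the two outputs can be reproduced on a small frame. In this sketch path_or_buf is left out, in which case to_json returns the JSON text as a string instead of writing a file; the two-row frame is invented for illustration.

```python
import json

import pandas as pd

# Small stand-in for json_read
frame = pd.DataFrame({"headline": ["first", "second"], "is_sarcastic": [0, 1]})

# Without path_or_buf, to_json returns a string
as_array = frame.to_json(orient="records")              # one JSON array
as_lines = frame.to_json(orient="records", lines=True)  # one object per line

# Both forms carry the same records
parsed_array = json.loads(as_array)
parsed_lines = [json.loads(line) for line in as_lines.splitlines() if line]

print(as_array)
print(as_lines)
```

The line-per-object form can be appended to and processed one record at a time, while the array form must be parsed as a whole.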

4 Summary

  • Reading and writing CSV, HDF5, and JSON files with pandas
    • pd.read_*() functions read a file into a pandas object
    • DataFrame.to_*() methods write an object to a file

Origin blog.csdn.net/weixin_44799217/article/details/113954597