Python offers many ways to process data files. The file types it can handle include text files (csv, txt, json, etc.), Excel files, database files, data returned by APIs, and others.
Here are some of the ways Python can read and write data files.
1. read, readline, readlines
-
read(): reads the entire file content at once. For large files, read(size) is recommended instead: it reads at most size characters (or bytes in binary mode) per call; the larger the size, the longer each call takes
-
readline(): reads one line at a time. Useful when there is not enough memory to hold the whole file; otherwise rarely used
-
readlines(): reads the entire file content at once and returns it as a list of lines, which is convenient to iterate over
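The three methods can be compared side by side. The following is a minimal sketch: the file name demo.txt and its three lines are made up for illustration, and the file is written to a temporary directory so the example is self-contained.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first\nsecond\nthird\n")

# read(): the whole file as a single string.
with open(path, encoding="utf-8") as f:
    whole = f.read()

# read(size): at most `size` characters per call, useful for chunked reads.
with open(path, encoding="utf-8") as f:
    chunk = f.read(5)

# readline(): one line per call, including the trailing newline.
with open(path, encoding="utf-8") as f:
    first_line = f.readline()

# readlines(): all lines at once, as a list of strings.
with open(path, encoding="utf-8") as f:
    all_lines = f.readlines()

# Iterating over the file object reads lazily, one line at a time.
with open(path, encoding="utf-8") as f:
    stripped = [line.rstrip("\n") for line in f]

print(stripped)
```

Iterating over the file object directly is usually the simplest choice when the file may be too large to load in one go.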
2. Built-in module csv
Python has a built-in csv module for reading and writing CSV files. CSV is a comma-delimited text format and one of the most common data storage formats in data science. The csv module can handle reading and writing data of various sizes, although large volumes of data call for optimization at the code level.
-
csv module read file
# Read a csv file
import csv
# newline='' is recommended by the csv module documentation
with open('test.csv', 'r', newline='') as myFile:
    lines = csv.reader(myFile)
    for line in lines:
        print(line)
-
csv module write file
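import csv
# 'w' truncates any existing file; newline='' prevents blank lines between rows on Windows
with open('test.csv', 'w', newline='') as myFile:
    myWriter = csv.writer(myFile)
    # writerow writes one row at a time
    myWriter.writerow([7, 8, 9])
    myWriter.writerow([8, 'h', 'f'])
    # writerows writes multiple rows at once
    myList = [[1, 2, 3], [4, 5, 6]]
    myWriter.writerows(myList)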
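For larger files, one common code-level optimization is to stream rows instead of collecting them all in a list. The sketch below assumes a made-up file scores.csv with name and score columns. It uses csv.DictWriter and csv.DictReader from the standard library and keeps only a running total in memory.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "scores.csv")  # hypothetical file

# newline="" lets the csv module control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "score"])
    writer.writeheader()
    # A generator avoids building the full list of rows in memory.
    writer.writerows({"name": f"user{i}", "score": i} for i in range(1000))

# Stream the file row by row; only one row is held in memory at a time.
total = 0
row_count = 0
with open(path, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        total += int(row["score"])  # csv returns every field as a string
        row_count += 1

print(row_count, total)
```

Whether this is enough depends on the workload; for heavy numeric processing, the numpy and pandas readers are often a better fit.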
3. numpy library
-
loadtxt method
loadtxt reads text files (txt, csv, etc.) as well as files compressed in .gz or .bz2 format, provided that every line of the file has the same number of values.
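import numpy as np
# The dtype parameter of loadtxt() defaults to float
# It is set to str here to make the output easier to read
# No delimiter is given, so each line is read as one string; pass delimiter=',' to split it into columns
np.loadtxt('test.csv', dtype=str)
# out:array(['1,2,3', '4,5,6', '7,8,9'], dtype='<U5')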
-
load method
load reads NumPy's own .npy and .npz files, as well as pickled files (which require allow_pickle=True).
import numpy as np
# First create an npy file
np.save('test.npy', np.array([[1, 2, 3], [4, 5, 6]]))
# Load the npy file with load
np.load('test.npy')
'''
out:array([[1, 2, 3],
       [4, 5, 6]])
'''
-
fromfile method
The fromfile method reads simple text data or binary data, typically binary data that was saved with the tofile method. tofile stores neither the element type nor the shape, so when reading, the user must specify the dtype and reshape the array as needed.
import numpy as np
x = np.arange(9).reshape(3, 3)
x.tofile('test.bin')
# np.int was removed in NumPy 1.24; reuse the dtype of the array that was written
np.fromfile('test.bin', dtype=x.dtype)
# out:array([0, 1, 2, 3, 4, 5, 6, 7, 8])
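Since fromfile returns a flat array, restoring the original layout takes an explicit reshape. The following is a small sketch of the full round trip. The temporary file path and the fixed np.int64 dtype are choices made for this illustration, so that the result does not depend on the platform's default integer size.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "test.bin")  # hypothetical file

original = np.arange(9, dtype=np.int64).reshape(3, 3)
original.tofile(path)  # writes raw bytes only: no dtype, no shape

# The reader has to supply both pieces of missing information.
flat = np.fromfile(path, dtype=np.int64)  # 1-D array of 9 values
restored = flat.reshape(3, 3)

print(np.array_equal(original, restored))
```

When the dtype and shape need to travel with the data, np.save and np.load are usually the more convenient pair.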
4. The pandas library
Pandas is one of the most commonly used libraries for data processing and analysis. It can read data in a wide range of formats, such as txt, csv, excel, json, clipboard, database, html, hdf, parquet, pickled files, sas, and stata, and generally returns the result as a DataFrame.
-
read_csv method
Read a csv file and return a DataFrame
import pandas as pd
pd.read_csv('test.csv')
-
read_excel method
Read Excel files, including the xlsx, xls, and xlsm formats
import pandas as pd
pd.read_excel('test.xlsx')
-
read_table method
Read any delimited text file; the sep parameter sets the separator (tab by default)
-
read_json method
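Read JSON data from a file path, URL, or file-like object
import pandas as pd
from io import StringIO
df = pd.DataFrame([['a', 'b'], ['c', 'd']], index=['row 1', 'row 2'], columns=['col 1', 'col 2'])
j = df.to_json(orient='split')
# Recent pandas versions deprecate passing a raw JSON string, so wrap it in StringIO
pd.read_json(StringIO(j), orient='split')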
-
read_html method
Read the tables in an HTML page; returns a list of DataFrames
-
read_clipboard method
Read clipboard content
-
read_pickle method
Read pickled persistent files
-
read_sql method
Read data from a database; after connecting to the database, pass in a SQL statement together with the connection
-
read_hdf method
Read HDF5 files, suitable for reading large files
-
read_parquet method
Read parquet file
-
read_sas method
Read sas file
-
read_stata method
Read stata file
-
read_gbq method
Read google bigquery data
Pandas learning website: https://pandas.pydata.org/
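Two of the readers listed above, read_table and read_sql, can be sketched as follows. The file people.txt, its pipe separator, and the people table are invented for this illustration, and an in-memory SQLite database stands in for a real database server.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# read_table: a hypothetical pipe-separated text file.
path = os.path.join(tempfile.mkdtemp(), "people.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("name|age\nalice|30\nbob|25\n")
people = pd.read_table(path, sep="|")

# read_sql: an in-memory SQLite database stands in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [("alice", 30), ("bob", 25)])
conn.commit()
adults = pd.read_sql("SELECT name, age FROM people WHERE age > 26", conn)
conn.close()

print(people)
print(adults)
```

Both calls return a DataFrame, so the same downstream code can be used regardless of where the data came from.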
5. Read and write excel files
There are many Python libraries for reading and writing Excel files. In addition to pandas, mentioned above, there are xlrd, xlwt, openpyxl, xlwings, and others.
Main modules:
-
xlrd library
Read data from Excel files. Versions before 2.0 support both xls and xlsx; from 2.0 onward only xls is supported
-
xlwt library
Write Excel files in xls format; the xlsx format is not supported
-
xlutils library
Builds on xlwt and xlrd to modify an existing file
-
openpyxl
Mainly read and edit excel in xlsx format
-
xlwings
Read, write, and modify xlsx, xls, and xlsm files
-
xlsxwriter
Used to generate Excel files, insert data, insert charts, and perform other sheet operations; does not support reading
-
Microsoft Excel API
Requires pywin32; communicates directly with the Excel process and can do anything that can be done in Excel, but it is slow
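As one illustration of the libraries above, here is a minimal write-then-read round trip with openpyxl. It assumes openpyxl is installed. The file name demo.xlsx, the sheet name, and the sample rows are made up for this sketch.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")  # hypothetical file

# Write: create a workbook, append rows to the active sheet, save.
wb = Workbook()
ws = wb.active
ws.title = "data"
ws.append(["name", "score"])
ws.append(["alice", 90])
ws.append(["bob", 85])
wb.save(path)

# Read: load the file back and collect each row as a tuple of cell values.
wb2 = load_workbook(path)
rows = list(wb2["data"].iter_rows(values_only=True))

print(rows)
```

The other libraries follow a similar open, edit, save pattern, though their APIs and supported formats differ.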
6. Operate the database
Python supports interaction with almost all databases. After connecting to a database, you can use SQL statements to insert, delete, update, and query data.
Main modules:
-
pymysql
Used to interact with the mysql database
-
sqlalchemy
A SQL toolkit and ORM that can interact with many databases, including mysql, through a common interface
-
cx_Oracle
Used to interact with oracle database
-
sqlite3
Built-in library for interaction with sqlite database
-
pymssql
Used to interact with the sql server database
-
pymongo
Used to interact with mongodb non-relational database
-
redis, pyredis
Used to interact with redis non-relational database
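The connect-then-run-SQL workflow can be sketched with the built-in sqlite3 module, which needs no server. The users table and its rows are invented for this example. Other drivers such as pymysql follow the same general DB-API pattern, though connection arguments and placeholder syntax vary by driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing on disk
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# Insert: "?" placeholders keep values out of the SQL string.
cur.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("alice", 30), ("bob", 25), ("carol", 41)],
)
# Update
cur.execute("UPDATE users SET age = ? WHERE name = ?", (26, "bob"))
# Delete
cur.execute("DELETE FROM users WHERE name = ?", ("carol",))
conn.commit()

# Query
cur.execute("SELECT name, age FROM users ORDER BY id")
result = cur.fetchall()
conn.close()

print(result)
```

Using placeholders rather than string formatting is the usual way to avoid SQL injection when values come from user input.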