Since I switched from R to Python, I am used to using R to process and transform data, but now I am a little uncomfortable with the transition to Python. But I still have to sigh pandas
about the power. Several packages in R on top of a library are used together.
Because the usual use is more complicated, and it is all used and checked, so pandas
the use of the library is also relatively fragmented. Here, a piece of data is recorded and analyzed, and more pandas
operations are used.
Official website: http://pandas.pydata.org/
Cheat Sheet
First paste the necessary cheatsheet for new handwriting code:
data read
The data reading part is very different from R, and I personally feel it is very convenient and easy to use. csv
The function used to read : read_csv
, its reading speed is much faster than the function that comes with R read.csv
, and somedata.table
in the package (although it may still be slightly faster), but pandas also has this function that can be read directly For files such as , this is much faster than R needs to call some packages to read.fread
fread
read_excel
xls
xlsx
The specific operations are as follows. First, we need pandas
to import the library first. Usually, the operations areas pd
import pandas as pd
Then we can csv
read the file!
1、read_csv
df = pd.read_csv('myfile.csv')
This is the easiest way to read, but when encountering many strange file formats, it is necessary to modify the parameters.
1)encoding
The encoding problem is the most likely problem in converting files between MacOS, Lunix, and Windows systems. Usually, the encoding we use is: utf-8
, this encoding is usually no problem in MacOS and Lunix systems, but you must pay attention to reading files in Windows! , usually need to use encoding = mbcs
to read the file normally, many Windows are ANSI encoding.
Regarding encoding issues, this URL has detailed instructions: https://docs.python.org/3/library/codecs.html#standard-encodings
In addition, here is a little trick about transcoding (of course, you can use Python, use encoding = mbcs
to read, use encoding = utf-8
to store). Here we open our file ( csv
, txt
etc.) with Notepad , then click Save As , there is an encoding below: , use UTF-8
Or ANSI
, in short, the encoding of your target.
2)sep
By default, our csv
files are separated by commas. If the original data is not in csv format and is separated by other characters (such as spaces), then I can use sep =
other separators to modify it.
3)header
Another commonly used parameter is header =
that the function defaults to use the first line of the file as the column name of each column. If we need to cancel this operation, that is, we do not make the first line of the file a column name, we need to use it header = None
. Another little trick is that each column of our file has multiple headers. We can set a list, such as: header = [0, 2]
, then it means that these lines in the file are used as column headers, and the lines in the middle will be ignored (in this example The 1st and 3rd rows of the data will appear as multi-level headers, the 2nd row of data will be discarded, and the real data will start from the 4th row).
4)names
Then if we need to define the column names ourselves, we need to use similar names = ['身高', '体重']
operations, and note that we need to useheader=None
5)index_col
Sometimes, we also want the content of the first column in the file not to be used as an index, but to reuse 0~(n-1) as a new index, which can be used index_col = False
. But you need to pay attention when using it, doing so will top off the last column of data! Makes the reading come in directly without a column of data. The solution is: we still use read_csv
normal reading, and then df
use df = df.reset_index()
0~(n-1) as our index, that is, the row name, for what is read in.
read_csv
Regarding the parameters in , the ones you use are introduced here, and there are more parameters, you can refer to the official documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
2、read_excel
df = pd.read_excel('myfile.xlsx', 'Sheet1')
read_excel
The parameters are read_csv
basically ones, that is to use the second parameter to specify the sheet to be read (usually the default is 'Sheet1'
, but it is recommended to look at the name of the first sheet before reading).