Pandas Library Essay - Cheat Sheet and Data Reading

Since I switched from R to Python, I am used to using R to process and transform data, but now I am a little uncomfortable with the transition to Python. But I still have to sigh pandasabout the power. Several packages in R on top of a library are used together.

Because the usual use is more complicated, and it is all used and checked, so pandasthe use of the library is also relatively fragmented. Here, a piece of data is recorded and analyzed, and more pandasoperations are used.

Official website: http://pandas.pydata.org/

Cheat Sheet

First paste the necessary cheatsheet for new handwriting code:
write picture description here

write picture description here

data read

The data reading part is very different from R, and I personally feel it is very convenient and easy to use. csvThe function used to read : read_csv, its reading speed is much faster than the function that comes with R read.csv, and somedata.table in the package (although it may still be slightly faster), but pandas also has this function that can be read directly For files such as , this is much faster than R needs to call some packages to read.freadfreadread_excelxlsxlsx

The specific operations are as follows. First, we need pandasto import the library first. Usually, the operations areas pd

import pandas as pd

Then we can csvread the file!

1、`read_csv`

df = pd.read_csv('myfile.csv')

This is the easiest way to read, but when encountering many strange file formats, it is necessary to modify the parameters.

1）`encoding`

The encoding problem is the most likely problem in converting files between MacOS, Lunix, and Windows systems. Usually, the encoding we use is: utf-8, this encoding is usually no problem in MacOS and Lunix systems, but you must pay attention to reading files in Windows! , usually need to use encoding = mbcsto read the file normally, many Windows are ANSI encoding.

Regarding encoding issues, this URL has detailed instructions: https://docs.python.org/3/library/codecs.html#standard-encodings

In addition, here is a little trick about transcoding (of course, you can use Python, use encoding = mbcsto read, use encoding = utf-8to store). Here we open our file ( csv, txtetc.) with Notepad , then click Save As , there is an encoding below: , use UTF-8Or ANSI, in short, the encoding of your target.

2）`sep`

By default, our csvfiles are separated by commas. If the original data is not in csv format and is separated by other characters (such as spaces), then I can use sep =other separators to modify it.

3）`header`

Another commonly used parameter is header =that the function defaults to use the first line of the file as the column name of each column. If we need to cancel this operation, that is, we do not make the first line of the file a column name, we need to use it header = None. Another little trick is that each column of our file has multiple headers. We can set a list, such as: header = [0, 2], then it means that these lines in the file are used as column headers, and the lines in the middle will be ignored (in this example The 1st and 3rd rows of the data will appear as multi-level headers, the 2nd row of data will be discarded, and the real data will start from the 4th row).

4）`names`

Then if we need to define the column names ourselves, we need to use similar names = ['身高', '体重']operations, and note that we need to useheader=None

5）`index_col`

Sometimes, we also want the content of the first column in the file not to be used as an index, but to reuse 0~(n-1) as a new index, which can be used index_col = False. But you need to pay attention when using it, doing so will top off the last column of data! Makes the reading come in directly without a column of data. The solution is: we still use read_csvnormal reading, and then dfuse df = df.reset_index()0~(n-1) as our index, that is, the row name, for what is read in.

read_csvRegarding the parameters in , the ones you use are introduced here, and there are more parameters, you can refer to the official documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

2、`read_excel`

df = pd.read_excel('myfile.xlsx', 'Sheet1')

read_excelThe parameters are read_csvbasically ones, that is to use the second parameter to specify the sheet to be read (usually the default is 'Sheet1', but it is recommended to look at the name of the first sheet before reading).