pandas (data frames, part 1)

Anyone doing data-related work regularly handles well-structured tables of data, which we call data frames here. In Python, a data frame is built with the DataFrame constructor of the pandas module, while R uses the data.frame function. This post compares how Python and R handle the following:

1. The construction of the data frame

In Python, a data frame can be built by hand from lists, tuples, or dictionaries. Some examples follow.

Creating a data frame from a list


A data frame created this way has no meaningful variable names: pandas labels the columns with the default integers 0, 1, 2, and so on. How can column names be supplied at creation time?
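A minimal sketch of this situation, using made-up rows (the values are illustrative and do not come from the original post):

```python
import pandas as pd

# Hypothetical sample data: each inner list is one row.
rows = [[1, "Tom", 25], [2, "Lily", 30], [3, "Sam", 28]]

# No column names are given, so pandas falls back to integer labels.
df = pd.DataFrame(rows)
print(df)
print(list(df.columns))
```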


Use the columns parameter of DataFrame to name each column of the data frame. To label the rows as well, use the index parameter.
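A short sketch of both parameters. The column names (id, name, age) and row labels (a, b, c) are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical sample data: each inner list is one row.
rows = [[1, "Tom", 25], [2, "Lily", 30], [3, "Sam", 28]]

# columns names the variables; index labels the rows.
df_named = pd.DataFrame(
    rows,
    columns=["id", "name", "age"],
    index=["a", "b", "c"],
)
print(df_named)
```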


Creating a data frame from a dictionary


In older Python versions (before 3.7), the column order in the output could differ from the order used when building the dictionary, because a dictionary is a key-value mapping rather than a sequence and did not guarantee any order. From Python 3.7 on, dictionaries keep insertion order, and current pandas follows it. Either way, a specific column order can be enforced through the columns parameter.
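A small sketch of the dictionary case, again with made-up data. Each key becomes a column, and columns can be used to pick the order explicitly:

```python
import pandas as pd

# Hypothetical sample data: each key is a column, each list its values.
data = {
    "id": [1, 2, 3],
    "name": ["Tom", "Lily", "Sam"],
    "age": [25, 30, 28],
}

# Column order is whatever the dictionary yields.
df_dict = pd.DataFrame(data)

# Passing columns selects the keys in the requested order.
df_ordered = pd.DataFrame(data, columns=["name", "age", "id"])
print(df_ordered)
```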



In R, building a data frame is comparatively simple: pass one or more vectors to the data.frame function.



2. Data reading

More often, we read external data and then analyze, visualize, or mine it. This section covers reading text files, spreadsheets, and MySQL databases.

Reading text files

The pandas module provides two functions, read_table and read_csv, for reading common text files. We use txt and csv files as examples and compare how Python and R read them.


Both read_table and read_csv read delimited text files. They differ only in the default value of the sep parameter: read_table splits fields on the tab character by default, while read_csv splits on commas.
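The difference in defaults can be sketched without any files on disk by reading from in-memory text. The sample contents below are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical file contents, one tab-separated and one comma-separated.
tab_text = "id\tname\n1\tTom\n2\tLily\n"
comma_text = "id,name\n1,Tom\n2,Lily\n"

# Each function handles its own default separator without extra arguments.
df_tab = pd.read_table(io.StringIO(tab_text))
df_comma = pd.read_csv(io.StringIO(comma_text))

# Either function can read the other format once sep is set explicitly.
df_tab_via_csv = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(df_tab)
```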


Because the original data file books.txt has no header row, set header=None and supply field names through the names parameter. The usecols parameter selects which columns of the original data are read. Next, we use the read_table function to read a csv file.
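A hedged sketch combining these parameters. The contents of books.txt are not shown in the original, so the text below is a made-up, comma-separated stand-in, and the field names are assumptions:

```python
import io
import pandas as pd

# Hypothetical stand-in for a headerless, comma-separated file.
raw = "1,Python Basics,39.0,2015\n2,R in Practice,45.5,2016\n"

books = pd.read_table(
    io.StringIO(raw),
    sep=",",                                  # override the tab default
    header=None,                              # the data has no header row
    names=["id", "title", "price", "year"],  # assumed field names
    usecols=["id", "title", "price"],        # keep only these columns
)
print(books)
```

With a real file, the io.StringIO(raw) argument would be replaced by the file path.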



R likewise has two commonly used functions, read.table and read.csv, for reading txt and csv files. For example, read.csv can read the co2.csv data set used above.



Reading spreadsheets

The read_excel function in the pandas module reads external xls and xlsx spreadsheets directly.
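A minimal sketch of a spreadsheet reader. The helper name load_sheet and the file name in the usage comment are hypothetical. Note that pandas relies on a separate engine package (for example openpyxl for xlsx files) being installed:

```python
import pandas as pd


def load_sheet(path, sheet_name=0):
    """Read one worksheet into a DataFrame.

    sheet_name=0 selects the first sheet; a sheet's name as a
    string also works.
    """
    return pd.read_excel(path, sheet_name=sheet_name)


# Usage, with a hypothetical file name:
# sales = load_sheet("sales.xlsx")
```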


In R, the base packages cannot read spreadsheet data. R users are strongly encouraged to use the readxl package to read Excel files. Note, however, that the file path must not contain Chinese characters, not even in the file name.



Reading MySQL database data

Reading a MySQL database from Python requires the pymysql module as well. Here we create a data set in a local MySQL instance and read it from both Python and R.

Create data in MySQL



Use Python to create a connection to MySQL and read the data.
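A pymysql connection needs a running MySQL server, so the sketch below swaps in sqlite3 from the standard library to show the same pattern: open a connection, then hand it to pd.read_sql together with a query. The table name, columns, and rows are made up for illustration:

```python
import sqlite3
import pandas as pd

# Stand-in for the MySQL connection. With pymysql this would be roughly
# pymysql.connect(host=..., user=..., password=..., database=...),
# with placeholder credentials filled in for the local server.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, created only for this example.
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Tom", 25), (2, "Lily", 30)],
)
conn.commit()

# read_sql runs the query and returns the result as a DataFrame.
users = pd.read_sql("SELECT * FROM users", conn)
conn.close()
print(users)
```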



Use R to create a connection to MySQL and read the data (this requires installing the RMySQL package).



3. Overview of the data

Once external data has been read into Python or R, we usually want a general picture of it: the minimum, maximum, and mean of each variable, the data types, and the amount of data. Here is how to get this information.


The shape and columns attributes return the number of rows and columns of the data set and its variable names.
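A quick sketch of both attributes on a made-up data frame (the data is illustrative, not the data set used in the original post):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Tom", "Lily", "Sam"],
    "age": [25, 30, 28],
    "city": ["Beijing", "Shanghai", "Beijing"],
})

# shape is a (rows, columns) tuple; columns holds the variable names.
n_rows, n_cols = df.shape
print(df.shape)
print(list(df.columns))
```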



The describe method produces descriptive statistics for numeric variables (include=['number']) and for discrete variables (include=['object']).
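A sketch of both calls on made-up data. The text column is given dtype object explicitly, as an assumption, so that include=['object'] matches it regardless of how a given pandas version infers string columns:

```python
import pandas as pd

# Hypothetical sample data; city is forced to the object dtype.
df = pd.DataFrame({
    "age": [25, 30, 28],
    "city": pd.Series(["Beijing", "Shanghai", "Beijing"], dtype="object"),
})

# Numeric columns: count, mean, std, min, quartiles, max.
num_summary = df.describe(include=["number"])

# Object columns: count, unique, top (most frequent value), freq.
cat_summary = df.describe(include=["object"])

print(num_summary)
print(cat_summary)
```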

The info method gives a brief description of the variable types in the data set.
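A small sketch on made-up data. info normally prints straight to the console; here its output is captured in a buffer through the buf parameter so it can also be inspected as text:

```python
import io
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Tom", "Lily", "Sam"],
    "age": [25, 30, 28],
})

# info writes its report to buf instead of stdout when buf is given.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

Calling df.info() with no arguments prints the same report directly.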


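In R, the same overview is obtained with the following functions:


the dim and names functions return the dimensions and the variable names;



the summary function produces descriptive statistics;



the str function describes the variable types of the data set.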


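That is all for today, and criticism is welcome. Next time we will cover part two of the pandas DataFrame: selecting variables and observations, renaming variables, converting data types, sorting, and removing duplicates from a data set.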



