Pandas data structure - DataFrame

A DataFrame is a tabular data structure made up of an ordered set of columns, and each column can hold a different value type (numeric, string, Boolean, and so on). A DataFrame has both a row index and a column index, and it can be regarded as a dictionary of Series that share a common index.

The DataFrame constructor is as follows:

pandas.DataFrame( data, index, columns, dtype, copy)

Parameter Description:

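data: The input data (ndarray, Series, map, list, dict, etc.).
index: Row labels. If none are given and the data carries none, the default is RangeIndex (0, 1, 2, …, n-1) for n rows.
columns: Column labels. If none are given and the data carries none, the default is RangeIndex (0, 1, 2, …, n-1) for n columns.
dtype: Data type to force on the data. If omitted, it is inferred.
copy: Whether to copy the input data. Older pandas versions document the default as False; in newer versions the default depends on the type of the input.

A Pandas DataFrame is a two-dimensional labeled structure, similar to a two-dimensional array.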
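
As a minimal sketch of how the constructor parameters fit together, the following builds a DataFrame from a small NumPy array. The array values and the labels "row1", "row2", "a", "b", "c" are made up for illustration and do not come from the examples in this article.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a 2x3 integer array.
values = np.array([[1, 2, 3], [4, 5, 6]])

df = pd.DataFrame(
    values,
    index=["row1", "row2"],   # row labels
    columns=["a", "b", "c"],  # column labels
    dtype=float,              # cast the data to float64
    copy=True,                # do not share memory with `values`
)
print(df)

# Without index/columns, both default to a RangeIndex starting at 0.
df_default = pd.DataFrame(values)
print(df_default.index)    # RangeIndex(start=0, stop=2, step=1)
print(df_default.columns)  # RangeIndex(start=0, stop=3, step=1)
```

Passing explicit labels is optional; when they are left out, rows and columns are simply numbered from 0.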

Create using list

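# Example 1
import pandas as pd

data = [['Google', 10], ['Facebook', 12], ['Wiki', 13]]
# Name the columns; pandas infers the data type of each column.
df = pd.DataFrame(data, columns=['Site', 'Age'])
# Convert only the numeric column to floating point.
df['Age'] = df['Age'].astype(float)
print(df)

In this example, data is a list containing three inner lists, each with two elements. pd.DataFrame() converts this list into a DataFrame object with the column names 'Site' and 'Age'. The 'Age' column is then converted to floating point with astype(float). Passing dtype=float to the constructor instead would also try to cast the 'Site' strings, which warns or fails depending on the pandas version. Finally, we print the DataFrame object.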

The following example is created from a dictionary of array-like values (lists or ndarrays). All of the arrays must have the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.

For more on ndarrays, see: NumPy Ndarray object

Create using ndarrays

# Example 2
import pandas as pd

data = {'Site': ['Google', 'Facebook', 'Wiki'], 'Age': [10, 12, 13]}
df = pd.DataFrame(data)
print(df)

In this example, data is a dictionary whose keys are the column names and whose values are the corresponding column data (plain lists here; ndarrays work the same way). pd.DataFrame() converts this dictionary into a DataFrame object. Since we didn't specify an index, the index defaults to range(n), where n is the length of the arrays. Finally, we print the DataFrame object. The output shows that a DataFrame is a table made up of rows and columns.
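
The length rules described above can be sketched with actual NumPy arrays. The index labels "x", "y", "z" and the flag name mismatch_rejected are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

# Each array must have the same length (3 here).
data = {
    "Site": np.array(["Google", "Facebook", "Wiki"]),
    "Age": np.array([10, 12, 13]),
}

# The index needs exactly one label per row.
df = pd.DataFrame(data, index=["x", "y", "z"])
print(df)

# Arrays of different lengths are rejected with a ValueError.
mismatch_rejected = False
try:
    pd.DataFrame({"Site": np.array(["Google", "Wiki"]),
                  "Age": np.array([10, 12, 13])})
except ValueError as err:
    mismatch_rejected = True
    print("ValueError:", err)
```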

You can also use a list of dictionaries (key/value pairs), where each dictionary becomes one row and its keys become the column names:

Create using dictionary

# Example 3
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Where a dictionary has no value for a column, the corresponding cell is NaN.

In this example, data is a list containing two dictionaries. pd.DataFrame() converts this list into a DataFrame object. Since we didn't specify an index, the index defaults to range(n), where n is the number of dictionaries. Each dictionary becomes one row, so the DataFrame has two rows and three columns (a, b, c). The first dictionary has no 'c' key, so that cell is NaN. Finally, we print the DataFrame object.
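
To see where the missing value ends up, the same data can be inspected with isna(). This is a small sketch; the fillna(0) step at the end is an optional extra, and 0 is an arbitrary placeholder chosen for illustration.

```python
import pandas as pd

data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(data)

print(df.shape)   # (2, 3): two rows, three columns
print(df.isna())  # True only where a value is missing
print(df.dtypes)  # column "c" is float64 because NaN is a float

# Optional: replace missing values with a placeholder.
filled = df.fillna(0)
print(filled)
```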

Pandas' loc attribute selects rows of a DataFrame by index label. When no index is set, the labels are the default integers starting from 0, so the label of the first row is 0, the label of the second row is 1, and so on.

# Example 4
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
# Load the data into a DataFrame object
df = pd.DataFrame(data)
# Return the first row
print(df.loc[0])
# Return the second row
print(df.loc[1])

Note: The returned result is actually a Pandas Series.

In the returned result, each row is wrapped in a Series, with the column names as the index of the Series. The Name part of the output shows the row label, and the dtype part shows the data type of the values in the Series.

Also note that if the DataFrame has a custom index, rows are selected by their index labels rather than by their order. For example, if the index of a DataFrame is ['a', 'b', 'c'], you can select the row labeled 'a' via df.loc['a'].
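
A short sketch of the ['a', 'b', 'c'] case, reusing the calories/duration data from Example 4. The variable names row and missing_label_raises are illustrative.

```python
import pandas as pd

# Same data as Example 4, but with string row labels.
df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["a", "b", "c"],
)

row = df.loc["a"]
print(row)
print(type(row).__name__)  # Series
print(row.name)            # a

# With string labels there is no label 0 any more, so loc[0] raises KeyError.
missing_label_raises = False
try:
    df.loc[0]
except KeyError:
    missing_label_raises = True
    print("label 0 is not in the index")
```

The point of the last step is that loc always looks up labels, so the integers 0, 1, 2 only work while they are the actual index labels.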

Return multiple rows

Pandas' loc attribute can also select multiple rows. Pass a list of row labels inside the brackets (the [[ ... ]] format, where ... is the row labels separated by commas). In the example below, df.loc[[0, 1]] selects the two rows with labels 0 and 1.

# Example 5
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
# Load the data into a DataFrame object
df = pd.DataFrame(data)
# Return the first and second rows
print(df.loc[[0, 1]])

Note: The returned result is actually a Pandas DataFrame.

In this example, the selected data keeps the original layout: the rows are the original rows and the columns are the original columns.
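
The difference between the two return types can be checked directly. The sketch below uses the same data as Example 5; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

single = df.loc[0]         # one label -> Series
multiple = df.loc[[0, 1]]  # list of labels -> DataFrame
one_row_df = df.loc[[0]]   # a one-element list still gives a DataFrame

print(type(single).__name__)    # Series
print(type(multiple).__name__)  # DataFrame
print(one_row_df.shape)         # (1, 2)

# The order of the labels in the list is the order of the rows in the result.
reordered = df.loc[[2, 0]]
print(list(reordered.index))    # [2, 0]
```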

We can also specify the index values ourselves:

# Example 6
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)

Pandas' loc attribute can then select rows by these index labels.

# Example 7
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
# Select a row by its index label
print(df.loc["day2"])

In this example, df.loc["day2"] selects the row with the index label "day2". A single label returns a Series whose index is the column names. Because the row was selected by a custom label rather than a default integer, the Name shown in the result is "day2".
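
The list form of loc works with custom labels too, and loc also accepts a row label together with a column label to read a single cell. A brief sketch with the same data; the names subset and value are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

subset = df.loc[["day1", "day3"]]   # list of labels -> DataFrame with two rows
value = df.loc["day2", "calories"]  # row label and column label -> single cell

print(subset)
print(value)  # 380
```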

Postscript

Today's topic was the Python Pandas DataFrame. Have you got the hang of it? A summary of today's content:

Pandas data structure - DataFrame
Create using list
Create using ndarrays
Create using dictionary
Return multiple rows

Python Study Notes Day 54 (Pandas DataFrame)
