[Proficient in Python in 100 days] Day 54: Python data analysis with Pandas: getting started, the core data structures Series, DataFrame, and Index, and data import and export

Table of contents

1. Introduction to Pandas

1.1 What is Pandas?

1.2 Why use Pandas?

1.3 Install and import Pandas library

2. The core data structure of Pandas

2.1 Series: One-dimensional label array

2.1.1 Create Series

2.1.2 Custom Index

2.2 DataFrame: two-dimensional data table

2.2.1 Creating DataFrames

2.2.2 Import DataFrame from CSV file

2.3 Index object: container for row and column labels

2.3.1 Creating an Index object

2.3.2 Row index and column index

2.3.3 Indexing and slicing using the Index object

2.3.4 Properties and methods of the Index object

3. Data import and export

3.1 Import data from CSV file

3.2 Import data from Excel file

3.3 Import data from SQL database

3.4 Saving data to files in different formats


1. Introduction to Pandas

1.1 What is Pandas?

Pandas is a Python library for data manipulation and data analysis. It provides high-performance, easy-to-use data structures and data analysis tools, especially suitable for processing structured data. The two main data structures in Pandas are Series and DataFrame.

  • Series : A Series is a one-dimensional labeled array, similar to a list or array in Python, but each element has a label (index). This makes Series very useful for working with time series data and other labeled data.

  • DataFrame : A DataFrame is a two-dimensional tabular data structure, similar to a database table or an Excel spreadsheet. It contains multiple columns, each of which can have a different data type, and has row and column labels.

The Pandas library also provides many data manipulation and analysis tools, including data filtering, sorting, grouping, aggregation, merging and other functions, enabling users to easily process and analyze large-scale data sets.
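As a rough illustration of the operations named above, the following sketch filters, sorts, groups, and merges two small tables. The tables, column names, and values are made up for this example.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
sales = pd.DataFrame({
    "city": ["Beijing", "Shanghai", "Beijing", "Shenzhen"],
    "amount": [120, 80, 200, 50],
})
regions = pd.DataFrame({
    "city": ["Beijing", "Shanghai", "Shenzhen"],
    "region": ["North", "East", "South"],
})

# Filtering: keep rows whose amount is at least 80
large = sales[sales["amount"] >= 80]

# Sorting: order rows by amount, largest first
ranked = sales.sort_values("amount", ascending=False)

# Grouping and aggregation: total amount per city
totals = sales.groupby("city")["amount"].sum()

# Merging: attach the region of each city to every sales row
merged = sales.merge(regions, on="city", how="left")

print(totals)
print(merged)
```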

1.2 Why use Pandas?

Using Pandas has several advantages:

  1. Data Structures : Pandas' data structures are flexible and work with a variety of data types and forms, including time series, tabular data, multidimensional data, and more.

  2. Data cleaning : Pandas provides powerful data cleaning and preprocessing functions, including handling missing values, duplicate values, outliers, etc.

  3. Data analysis : Pandas has a wealth of data analysis tools for statistical analysis, pivot tables, correlation analysis, and more, which help reveal the characteristics and trends of the data.

  4. Data visualization : Pandas can be combined with other data visualization libraries such as Matplotlib and Seaborn to easily create various data visualization charts.

  5. Data import and export : Pandas supports importing data from a variety of data sources, including CSV, Excel, SQL databases, etc., and can also export processed data to different formats.

  6. Extensive community support : Pandas has a large user community that provides extensive documentation, tutorials, and support, making it easier to learn and use Pandas.
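To make the data cleaning point more concrete, here is a minimal sketch that removes a duplicate row and fills a missing value. The sample table is invented, and filling with the column mean is just one possible strategy, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one duplicated row and one missing age
raw = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", "Charlie"],
    "age": [25, 30, 30, np.nan],
})

# Drop the repeated "Bob" row
deduped = raw.drop_duplicates()

# Count the missing values in the age column
n_missing = deduped["age"].isna().sum()

# Fill the missing age with the mean of the known ages
filled = deduped.fillna({"age": deduped["age"].mean()})

print(n_missing)
print(filled)
```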

1.3 Install and import Pandas library

To install the Pandas library, you can use pip, Python's package manager. Run the following command at the command line to install Pandas:

pip install pandas

Once installed, you can import the Pandas library in a Python script or in an interactive environment:

import pandas as pd

By convention, Pandas is imported under the alias pd, which keeps calls to its functions and data structures short. After the import, you can start using Pandas for data processing and analysis.

2. The core data structure of Pandas

2.1 Series: One-dimensional label array

A Series is a data structure similar to a one-dimensional array. Unlike a NumPy array, it has labels (an index) that can be used to identify and access the data. A Series consists of two parts: the data and the index.

2.1.1 Create Series

import pandas as pd

# Create a Series containing some integer data
data = pd.Series([1, 2, 3, 4, 5])

# Print the Series
print(data)

Output result:

0    1
1    2
2    3
3    4
4    5
dtype: int64

In the example above, the Series contains a set of integer data and is automatically assigned the default integer indices (0, 1, 2, 3, 4).
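The two parts of a Series, the index and the data, can be inspected separately. The short sketch below reuses the Series from the example above.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])

# The index part: the labels attached to each element
print(data.index)

# The data part, returned as a NumPy array
print(data.to_numpy())

# Access the first element by position
first = data.iloc[0]
print(first)
```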

2.1.2 Custom Index

import pandas as pd

# Create a Series with a custom index
data = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Print the Series
print(data)

Output result:

a    1
b    2
c    3
d    4
e    5
dtype: int64

In this example, we specify custom indexes for Series, and each index corresponds to a data value. 
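Once a Series has custom labels, values can be looked up by label. A small sketch using the same Series follows; note that, unlike ordinary Python slices, a label slice includes its end label.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# One value by its label
single = data['b']

# Several labels at once (returns a new Series)
subset = data[['a', 'e']]

# A label slice includes both 'b' and 'd'
label_slice = data['b':'d']

print(single)
print(subset)
print(label_slice)
```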

2.2 DataFrame: two-dimensional data table

DataFrame is the most commonly used data structure in Pandas, which is similar to tabular data in spreadsheets or SQL databases. A DataFrame consists of rows and columns, and each column can contain different data types.

2.2.1 Creating DataFrames

import pandas as pd

# Create a simple DataFrame with name and age columns
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, 35, 40]}

df = pd.DataFrame(data)

# Print the DataFrame
print(df)

Output result:

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40

In the above example, we created a DataFrame with a name column and an age column. The two columns have different data types: name holds strings and age holds integers.
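To check which data type each column actually received, you can inspect the dtypes attribute. This is a small sketch; the height_m column is a hypothetical addition used only to show a third type.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'height_m': [1.65, 1.80],   # hypothetical extra column
})

# One data type per column
print(df.dtypes)

# (number of rows, number of columns)
print(df.shape)

age_is_integer = pd.api.types.is_integer_dtype(df['age'])
height_is_float = pd.api.types.is_float_dtype(df['height_m'])
```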

2.2.2 Import DataFrame from CSV file

import pandas as pd

# Create a DataFrame by importing data from a CSV file
df = pd.read_csv('data.csv')

# Print the first few rows
print(df.head())

This example demonstrates how to import data from a CSV file and create a DataFrame: read_csv() parses the file's contents into a DataFrame, and head() returns its first five rows.

2.3 Index object: container for row and column labels

Index objects are the containers Pandas uses to hold row and column labels. Each DataFrame has a row index (row labels) and a column index (column labels), both of which are Index objects. Index objects are immutable, which means you cannot change their elements in place once they are created.
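The immutability can be observed directly. In the sketch below, assigning to a single element of the index raises a TypeError, while replacing the whole index or using rename() is allowed because a new Index object is created. The labels used here are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob']}, index=['A', 'B'])

error_message = None
try:
    df.index[0] = 'Z'   # in-place modification is not supported
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# Assigning a whole new set of labels is allowed
df.index = ['X', 'Y']

# rename() returns a new DataFrame with the relabeled index
renamed = df.rename(index={'X': 'first'})
print(renamed.index.tolist())
```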

The following is a detailed explanation and example of the Index object:

2.3.1 Creating an Index object

You can create an Index object explicitly with pd.Index(), or obtain one from a DataFrame, which builds it from the index argument. Here are some examples:

import pandas as pd

# Create an Index object with pd.Index()
index1 = pd.Index(['a', 'b', 'c', 'd'])

# Get the Index object that the DataFrame builds from its index argument
data = {'name': ['Alice', 'Bob', 'Charlie', 'David']}
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])
index2 = df.index

print(index1)
print(index2)

Output result:

Index(['a', 'b', 'c', 'd'], dtype='object')
Index(['A', 'B', 'C', 'D'], dtype='object')

2.3.2 Row index and column index

In a DataFrame, Index objects identify both rows and columns. The row index (df.index) runs down the left side of the DataFrame, and the column index (df.columns) runs across the top. Here is an example:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David']}
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])

# Row index
row_index = df.index
print("Row index:", row_index)

# Column index
column_index = df.columns
print("Column index:", column_index)

Output result:

Row index: Index(['A', 'B', 'C', 'D'], dtype='object')
Column index: Index(['name'], dtype='object')

2.3.3 Indexing and slicing using the Index object

You can use the Index object to select specific rows or columns in a DataFrame. Here are some examples:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David']}
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])

# Select a specific row
selected_row = df.loc['B']  # select by row label
print("Selected row:", selected_row, sep="\n")

# Select a specific column
selected_column = df['name']  # select by column label
print("Selected column:", selected_column, sep="\n")

# Slice with loc
sliced_df = df.loc['B':'C']  # slice by row labels (both ends included)
print("Sliced rows:", sliced_df, sep="\n")

Output result:

Selected row:
name    Bob
Name: B, dtype: object
Selected column:
A      Alice
B        Bob
C    Charlie
D      David
Name: name, dtype: object
Sliced rows:
      name
B      Bob
C  Charlie
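For comparison, the sketch below contrasts label-based selection with loc and position-based selection with iloc on the same kind of DataFrame. The key difference to keep in mind is that a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'David']},
                  index=['A', 'B', 'C', 'D'])

# Label-based slice: rows B and C (end label included)
by_label = df.loc['B':'C']

# Position-based slice: positions 1 and 2 (end position excluded)
by_position = df.iloc[1:3]

# A single cell, addressed by row label and column label
cell = df.loc['C', 'name']

print(by_label.equals(by_position))
print(cell)
```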

2.3.4 Properties and methods of the Index object

The Index object has some commonly used attributes and methods, such as the values attribute and the tolist() method. Here are some examples:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David']}
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])
row_index = df.index

# Get the values of the Index object
index_values = row_index.values
print("Index values:", index_values)

# Convert the Index object to a list
index_list = row_index.tolist()
print("Index as a list:", index_list)

# Check whether the index contains a specific value
contains_value = 'B' in row_index
print("Index contains 'B':", contains_value)

Output result:

Index values: ['A' 'B' 'C' 'D']
Index as a list: ['A', 'B', 'C', 'D']
Index contains 'B': True

Index objects are widely used in Pandas and can hold labels of different data types. They identify the rows and columns of a DataFrame and make them easier to manipulate, so understanding how to create and use Index objects gives you a better grasp of indexing and labeling data in Pandas.
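As a brief sketch of indexes holding different kinds of labels, the example below builds an integer index, a date index, and the default index of a DataFrame, and also shows a set-style operation between two indexes. All labels are arbitrary sample values.

```python
import pandas as pd

# An index of integer labels
int_index = pd.Index([10, 20, 30])

# An index of dates
date_index = pd.date_range('2023-01-01', periods=3, freq='D')

# The default index created when none is specified
default_index = pd.DataFrame({'x': [1, 2, 3]}).index

print(type(date_index).__name__)     # DatetimeIndex
print(type(default_index).__name__)  # RangeIndex

# Position of the label 20 within the integer index
position = int_index.get_loc(20)

# Labels shared by two indexes
common = pd.Index(['a', 'b', 'c']).intersection(pd.Index(['b', 'c', 'd']))
print(common.tolist())
```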

These are the basic concepts and examples of Pandas' core data structures. By using Series, DataFrame, and Index, you can process and analyze various data sets more flexibly.

3. Data import and export

Pandas provides a wealth of functions for importing data from different data sources and saving data to files in different formats.

3.1 Import data from CSV file

To import data from a CSV file, you can use the pd.read_csv() function. Suppose there is a CSV file named data.csv, which contains the following data:

name,age
Alice,25
Bob,30
Charlie,35
David,40

Import data example:

import pandas as pd

# Import data from the CSV file
df = pd.read_csv('data.csv')

# Print the DataFrame
print(df)
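If you want to try read_csv() without creating a file first, the following sketch reads the same kind of CSV content from an in-memory string via io.StringIO. It also shows a few commonly used optional parameters (usecols, nrows, dtype). The CSV text is sample data for this illustration.

```python
import io
import pandas as pd

# Sample CSV content held in a string instead of a file
csv_text = "name,age\nAlice,25\nBob,30\nCharlie,35\nDavid,40\n"

# Read everything
df = pd.read_csv(io.StringIO(csv_text))

# Read only the name column
names_only = pd.read_csv(io.StringIO(csv_text), usecols=['name'])

# Read only the first two rows and parse age as a float
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2, dtype={'age': 'float64'})

print(df)
print(names_only)
print(first_two)
```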

3.2 Import data from Excel file

To import data from an Excel file, you can use the pd.read_excel() function, which relies on an Excel engine package such as openpyxl to read .xlsx files. Suppose there is an Excel file named data.xlsx, containing the following data:

name     age
Alice    25
Bob      30
Charlie  35
David    40

Import data example:

import pandas as pd

# Import data from the Excel file
df = pd.read_excel('data.xlsx')

# Print the DataFrame
print(df)

3.3 Import data from SQL database

To import data from an SQL database, you can use the pd.read_sql() function. First, you need a database driver and a database connection. The sqlite3 module ships with Python's standard library, while drivers for other databases (e.g. pymysql) must be installed separately.

Import data example (using SQLite database):

import pandas as pd
import sqlite3

# Establish a connection to the SQLite database
conn = sqlite3.connect('mydatabase.db')

# Import data from the database
query = "SELECT * FROM mytable"
df = pd.read_sql(query, conn)

# Close the database connection
conn.close()

# Print the DataFrame
print(df)
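The example above needs an existing mydatabase.db file. A self-contained variant, sketched below, uses an in-memory SQLite database instead and passes a query parameter through the params argument rather than formatting it into the SQL string. The table contents and the age threshold are invented for this illustration.

```python
import sqlite3
import pandas as pd

# In-memory database, so no file is created on disk
conn = sqlite3.connect(':memory:')
try:
    # Build a small sample table
    conn.execute("CREATE TABLE mytable (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO mytable VALUES (?, ?)",
        [('Alice', 25), ('Bob', 30), ('Charlie', 35)],
    )
    conn.commit()

    # The ? placeholder is filled in from params
    query = "SELECT name, age FROM mytable WHERE age > ?"
    df = pd.read_sql(query, conn, params=(28,))
finally:
    # Close the connection even if the query fails
    conn.close()

print(df)
```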

3.4 Saving data to files in different formats

To save the data in a DataFrame, use the to_ method that matches the target format, such as to_csv(), to_excel(), or to_sql().

3.4.1 Example of saving data to a CSV file:

import pandas as pd

# Create a DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, 35, 40]}
df = pd.DataFrame(data)

# Save the data to a CSV file (index=False omits the row index column)
df.to_csv('output.csv', index=False)
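To see what the index=False argument changes, the sketch below calls to_csv() without a file path, in which case it returns the CSV text as a string, and compares the header lines with and without the index. The two-row table is sample data.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# With no path given, to_csv() returns the CSV content as a string
without_index = df.to_csv(index=False)
with_index = df.to_csv()   # default: the row index is written as a first column

print(without_index.splitlines()[0])   # name,age
print(with_index.splitlines()[0])      # ,name,age

# Reading the text back gives an equivalent DataFrame
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```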

3.4.2 Example of saving data to an Excel file:

import pandas as pd

# Create a DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, 35, 40]}
df = pd.DataFrame(data)

# Save the data to an Excel file
df.to_excel('output.xlsx', index=False)

3.4.3 Save data to SQL database example (using SQLite database):

import pandas as pd
import sqlite3

# Create a DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, 35, 40]}
df = pd.DataFrame(data)

# Establish a connection to the SQLite database
conn = sqlite3.connect('mydatabase.db')

# Save the data to a table in the database, replacing the table if it exists
df.to_sql('mytable', conn, if_exists='replace', index=False)

# Close the database connection
conn.close()

In the above example, we first created a DataFrame and then used the sqlite3 module to establish a connection to the SQLite database file mydatabase.db. Next, we used the to_sql() method to save the DataFrame's data into a table named mytable. The parameter if_exists='replace' indicates that if the table already exists, it will be replaced. You can select other options as needed, such as 'append', which adds the rows to the existing table.

Finally, we close the database connection once the data has been written.

You can modify the data, table name, and other related parameters as needed to meet your specific needs.
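The behavior of the if_exists options can be checked with a small experiment. The sketch below uses an in-memory SQLite database and sample data: it writes the table with 'replace', adds the same rows again with 'append', and then shows that the default option, 'fail', raises a ValueError when the table already exists.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

conn = sqlite3.connect(':memory:')
try:
    # Create the table, or overwrite it if it already exists
    df.to_sql('mytable', conn, if_exists='replace', index=False)

    # Add the same two rows again to the existing table
    df.to_sql('mytable', conn, if_exists='append', index=False)

    counts = pd.read_sql("SELECT COUNT(*) AS n FROM mytable", conn)
    row_count = int(counts['n'].iloc[0])

    # The default, if_exists='fail', refuses to touch an existing table
    try:
        df.to_sql('mytable', conn, index=False)
        failed = False
    except ValueError:
        failed = True
finally:
    conn.close()

print(row_count)
print(failed)
```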

 


Origin blog.csdn.net/qq_35831906/article/details/132700337