1. Import Pandas library
import pandas as pd
2. Read data
Pandas can easily read multiple data formats such as CSV, Excel, JSON, HTML, etc. Here is an example of reading a CSV file:
data = pd.read_csv('data.csv')
Other formats have analogous reader functions. For example, read_excel() reads Excel files (it needs an engine such as openpyxl installed to read .xlsx files):
data = pd.read_excel('data.xlsx')
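To try the reader without a file on disk, here is a minimal sketch that parses CSV text from memory. The sample rows and column names (name, age, city) are made up for illustration; usecols and dtype are optional read_csv parameters.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real 'data.csv' file.
csv_text = "name,age,city\nAlice,30,Paris\nBob,25,Berlin\n"

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

# usecols limits which columns are loaded; dtype fixes column types up front.
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["name", "age"],
    dtype={"age": "int64"},
)

print(data.shape)
print(list(subset.columns))
```

Loading only the columns you need can save a noticeable amount of memory on wide files.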
3. View data
You can use the head() function to view the first few rows of data (the default is 5 rows):
print(data.head())
You can also use tail() to view the last few rows, info() to summarize the column names, data types, and non-null counts, and describe() to compute summary statistics for the numeric columns:
print(data.tail())
data.info()  # info() prints its report itself and returns None
print(data.describe())
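The inspection calls can be tried on a small in-memory table. This is a sketch on made-up data (the product and price columns are assumptions); note that info() prints its report and returns None, so it is called without print().

```python
import pandas as pd

# Hypothetical sample data: seven products with prices.
data = pd.DataFrame({
    "product": ["a", "b", "c", "d", "e", "f", "g"],
    "price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

first = data.head(3)     # first 3 rows instead of the default 5
last = data.tail(2)      # last 2 rows
data.info()              # prints dtypes and non-null counts, returns None
stats = data.describe()  # count, mean, std, min, quartiles, max per numeric column

print(first)
print(last)
print(stats.loc["mean", "price"])
```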
4. Select data
There are many ways to select data. Here are some common ones:
- Select a column: data['column_name']
- Select multiple columns: data[['column1', 'column2']]
- Select a row by index label: data.loc[row_index] (use data.iloc[position] to select by integer position)
- Select a value: data.loc[row_index, 'column_name']
- Select by condition: data[data['column_name'] > value]
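The selection patterns above can be sketched on a small table. The data and the string index labels ("a", "b", "c") are made up for illustration; the string index is there to make the difference between label-based loc and position-based iloc visible.

```python
import pandas as pd

# Hypothetical sample data with string index labels.
data = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [30, 25, 41],
        "city": ["Paris", "Berlin", "Rome"],
    },
    index=["a", "b", "c"],
)

ages = data["age"]                  # one column, returned as a Series
name_city = data[["name", "city"]]  # several columns, returned as a DataFrame
row_b = data.loc["b"]               # row selected by index label
first_row = data.iloc[0]            # row selected by integer position
bob_age = data.loc["b", "age"]      # single value: row label plus column name
over_28 = data[data["age"] > 28]    # boolean mask keeps the matching rows

# Combine conditions with & or |, wrapping each condition in parentheses.
paris_over_28 = data[(data["age"] > 28) & (data["city"] == "Paris")]

print(over_28)
```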
5. Data cleaning
Before analysis, the data usually needs cleaning. Each method below returns a new object rather than modifying data in place, so assign the result back (for example, data = data.dropna()) to keep it:
- Drop rows that contain missing values: data.dropna()
- Replace missing values: data.fillna(value)
- Rename columns: data.rename(columns={'old_name': 'new_name'})
- Convert a column's data type: data['column_name'] = data['column_name'].astype(new_type)
- Remove duplicate rows: data.drop_duplicates()
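A short sketch of these cleaning steps chained together, on made-up data (the id and score columns are assumptions). Each call returns a new DataFrame, so the original raw table stays untouched.

```python
import pandas as pd

# Hypothetical raw data: ids stored as strings, missing scores, one repeated row.
raw = pd.DataFrame({
    "id": ["1", "2", "2", "3"],
    "score": [9.5, None, None, 7.0],
})

deduped = raw.drop_duplicates()           # the second ("2", NaN) row is dropped
dropped = deduped.dropna()                # alternative: discard rows with missing values
filled = deduped.fillna({"score": 0.0})   # or fill missing scores with a default
renamed = filled.rename(columns={"score": "final_score"})
typed = renamed.astype({"id": "int64"})   # convert the id strings to integers

print(typed)
print(typed.dtypes)
```

Whether to drop or fill missing values depends on the analysis; filling with 0.0 here is only a placeholder choice.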
6. Data analysis
Pandas provides many data analysis functions. Here are some common ones:
- Calculate the mean: data['column_name'].mean()
- Calculate the median: data['column_name'].median()
- Calculate the mode: data['column_name'].mode() (returns a Series, because a column can have several modes)
- Calculate the standard deviation: data['column_name'].std()
- Calculate the correlation between numeric columns: data.corr(numeric_only=True)
- Data grouping: data.groupby('column_name') (follow it with an aggregation such as .size() to get a result)
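The statistics above can be tried on a small example. The region, units, and price columns are made up for illustration. Two details are worth seeing in action: mode() returns a Series because values can tie, and groupby() only produces numbers once an aggregation is applied.

```python
import pandas as pd

# Hypothetical sample data.
data = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "units": [10, 20, 5, 5, 20],
    "price": [2.0, 4.0, 1.0, 1.5, 3.5],
})

mean_units = data["units"].mean()      # 12.0
median_units = data["units"].median()  # 10.0
modes = data["units"].mode()           # 5 and 20 both appear twice
std_units = data["units"].std()        # sample standard deviation (ddof=1)

# numeric_only=True skips the text column 'region'.
corr = data.corr(numeric_only=True)

# groupby returns a lazy GroupBy object; agg computes the per-group values.
per_region = data.groupby("region")["units"].agg(["sum", "mean"])

print(modes.tolist())
print(per_region)
```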
7. Data visualization
Pandas can plot data directly through its plot() method, which uses Matplotlib to draw the charts. First, install Matplotlib:
pip install matplotlib
Then use the following code to create a bar chart:
import matplotlib.pyplot as plt
data['column_name'].plot(kind='bar')
plt.show()
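Other chart types include line charts, pie charts, and histograms. Call plt.figure() before each plot so that every chart gets its own figure instead of being drawn on top of the previous one:
plt.figure()
data['column_name'].plot(kind='line')
plt.figure()
data['column_name'].plot(kind='pie')
plt.figure()
data['column_name'].plot(kind='hist')
plt.show()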
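As an alternative to separate figures, several chart types can be placed side by side by passing each plot its own axes. This is a sketch assuming Matplotlib is installed; the month and sales columns and the output file name sales_charts.png are hypothetical. It selects the non-interactive Agg backend and saves to a file so it also runs without a display; in an interactive session you could drop the matplotlib.use() line and call plt.show() instead.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: render to a file, no window needed

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly sales figures.
data = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "sales": [120, 90, 150, 110],
})

# One row of three axes; each plot call draws on the axes passed via ax=.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
data.plot(x="month", y="sales", kind="bar", ax=axes[0], legend=False, title="Bar")
data.plot(x="month", y="sales", kind="line", ax=axes[1], legend=False, title="Line")
data["sales"].plot(kind="hist", bins=4, ax=axes[2], title="Histogram")

fig.tight_layout()
fig.savefig("sales_charts.png")  # hypothetical output path
```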
8. Export data
Pandas can export data to various formats such as CSV, Excel, JSON, HTML, etc. Here is an example of exporting data as a CSV file:
data.to_csv('output.csv', index=False)
Other formats have analogous writer methods. For example, to_excel() writes an Excel file (writing .xlsx requires an engine such as openpyxl):
data.to_excel('output.xlsx', index=False)
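A quick way to check an export is to write to a string and read it back. In this sketch the sample data is made up; it relies on to_csv() and to_json() returning text when no path is given.

```python
import io

import pandas as pd

# Hypothetical sample data.
data = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = data.to_csv(index=False)

# orient="records" produces a JSON list with one object per row.
json_text = data.to_json(orient="records")

# Round trip: parsing the CSV text again should give back the same rows.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(json_text)
```

Passing index=False keeps the row index out of the output, which avoids an extra unnamed column when the file is read back.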
9. Practical cases
Let's say we have a sales data file (sales_data.csv) with product_name, quantity, and price columns that we want to analyze. First, we need to read the data:
import pandas as pd
data = pd.read_csv('sales_data.csv')
Then, we can clean and analyze the data. For example, we can calculate the sales amount of each row as quantity times price:
data['sales_amount'] = data['quantity'] * data['price']
Next, we can analyze which product has the highest sales:
top_product = data.groupby('product_name')['sales_amount'].sum().idxmax()
print(f'The product with the highest sales is: {top_product}')
Finally, we can export the results as a CSV file:
data.to_csv('sales_analysis.csv', index=False)
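The whole case can be run end to end without the real file. This sketch replaces sales_data.csv with a few made-up rows that use the same three columns; the values are hypothetical. It also keeps the per-product totals as their own table, which is often the more useful result to export.

```python
import pandas as pd

# Hypothetical rows standing in for sales_data.csv.
data = pd.DataFrame({
    "product_name": ["pen", "notebook", "pen", "stapler", "notebook"],
    "quantity": [10, 3, 5, 1, 4],
    "price": [1.5, 4.0, 1.5, 12.0, 4.0],
})

# Sales amount per row.
data["sales_amount"] = data["quantity"] * data["price"]

# Total sales per product, largest first.
sales_by_product = (
    data.groupby("product_name")["sales_amount"]
    .sum()
    .sort_values(ascending=False)
)

top_product = sales_by_product.idxmax()
print(f"The product with the highest sales is: {top_product}")

# Summary table as CSV text; pass a file path to to_csv to write it to disk.
summary_csv = sales_by_product.reset_index().to_csv(index=False)
print(summary_csv)
```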
Pandas is a powerful Python library for data analysis that covers data processing, cleaning, analysis, and visualization. Becoming fluent with it will make routine data analysis work considerably faster.