Data Analysis in Action: Using Pandas for Python Data Processing

1. Import the Pandas library

import pandas as pd

2. Read data

Pandas can easily read multiple data formats such as CSV, Excel, JSON, HTML, etc. Here is an example of reading a CSV file:

data = pd.read_csv('data.csv')

Other formats are read the same way. For example, to read an Excel file (this requires an Excel engine such as openpyxl to be installed):

data = pd.read_excel('data.xlsx')
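
As a minimal sketch of the reading step, the snippet below parses CSV text held in memory instead of a file on disk, so it runs without any external file. The CSV content and its column names (name, age, city) are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = "name,age,city\nAlice,30,Paris\nBob,25,Berlin\n"

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)   # (number of rows, number of columns)
print(data.dtypes)  # the type pandas inferred for each column
```

Passing a path such as 'data.csv' instead of the StringIO object works the same way.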

3. View data

You can use the head() function to view the first few rows of data (the default is 5 rows):

print(data.head())

You can also use tail() to view the last few rows, info() to see the column types and non-null counts, and describe() to view summary statistics of the numeric columns:

print(data.tail())
data.info()  # info() prints its report directly and returns None
print(data.describe())
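
The inspection methods can be tried on a small hand-built table. This is only a sketch; the product and price columns are invented for the example.

```python
import pandas as pd

# Hypothetical data used only to demonstrate the inspection methods
data = pd.DataFrame(
    {
        "product": ["A", "B", "C", "D", "E", "F"],
        "price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

first_two = data.head(2)   # pass n to override the default of 5 rows
last_two = data.tail(2)
data.info()                # prints column types and non-null counts
summary = data.describe()  # count, mean, std, min, quartiles, max

print(first_two)
print(last_two)
print(summary)
```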

4. Select data

There are many ways to select data. Here are some common ones:

  • Select a column: data['column_name']
  • Select multiple columns: data[['column1', 'column2']]
  • Select a row by index label: data.loc[row_index] (use data.iloc[n] to select by position)
  • Select a value: data.loc[row_index, 'column_name']
  • Select by condition: data[data['column_name'] > value]
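
The selection methods listed above can be sketched on a small example. The table and its columns (name, age, city) are assumptions made for illustration, and the default integer index is used as the row label.

```python
import pandas as pd

# Hypothetical data with the default integer index 0, 1, 2
data = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [30, 25, 35],
        "city": ["Paris", "Berlin", "Rome"],
    }
)

ages = data["age"]               # one column, returned as a Series
subset = data[["name", "city"]]  # several columns, returned as a DataFrame
row = data.loc[1]                # the row whose index label is 1
value = data.loc[1, "name"]      # a single cell
older = data[data["age"] > 28]   # rows where the condition is True

print(value)
print(older)
```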

5. Data cleaning

Before analysis, the data usually needs cleaning. The following are some commonly used cleaning methods. By default each one returns a new object rather than modifying data in place, so assign the result to a variable to keep it:

  • Remove rows with missing values: data.dropna()
  • Replace missing values: data.fillna(value)
  • Rename columns: data.rename(columns={'old_name': 'new_name'})
  • Convert a column's data type: data['column_name'].astype(new_type)
  • Remove duplicate rows: data.drop_duplicates()
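
The cleaning methods above can be chained into a small pipeline. The sketch below uses an invented table, and the order of the steps is one possible choice rather than a fixed recipe.

```python
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing score, one missing id
raw = pd.DataFrame(
    {
        "old_name": ["1", "2", "2", "3", None],
        "score": [10.0, 20.0, 20.0, None, 50.0],
    }
)

deduped = raw.drop_duplicates()          # drops the repeated ("2", 20.0) row
filled = deduped.fillna({"score": 0.0})  # fills missing scores only
cleaned = filled.dropna()                # drops the row with no id
cleaned = cleaned.rename(columns={"old_name": "new_name"})
cleaned["new_name"] = cleaned["new_name"].astype(int)  # text to integer

print(cleaned)
print(len(raw))  # the original frame still has all of its rows
```

Each call returns a new DataFrame, so raw keeps its original contents throughout.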

6. Data analysis

Pandas provides many data analysis functions. The following are some common ones:

  • Calculate the mean: data['column_name'].mean()
  • Calculate the median: data['column_name'].median()
  • Calculate the mode: data['column_name'].mode()
  • Calculate the standard deviation: data['column_name'].std()
  • Calculate the correlation between numeric columns: data.corr(numeric_only=True)
  • Group data: data.groupby('column_name'), followed by an aggregation such as .sum() or .mean()
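
A short sketch of these statistics on invented data follows. The columns group, x, and y are assumptions, with y set to twice x so that the correlation is easy to check by eye.

```python
import pandas as pd

# Hypothetical data: y is exactly 2 * x, so the two columns correlate perfectly
data = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b"],
        "x": [1.0, 3.0, 5.0, 7.0],
        "y": [2.0, 6.0, 10.0, 14.0],
    }
)

mean_x = data["x"].mean()
median_x = data["x"].median()
std_x = data["x"].std()             # sample standard deviation (ddof=1)
modes = data["group"].mode()        # a Series, since several values can tie
corr = data.corr(numeric_only=True) # skips the non-numeric 'group' column
group_means = data.groupby("group")["x"].mean()

print(mean_x, median_x, std_x)
print(modes.tolist())
print(corr)
print(group_means)
```

Note that groupby() on its own only describes the grouping. A result appears once an aggregation such as mean() is applied.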

7. Data visualization

Pandas makes it easy to turn data into charts. Its plotting methods are built on Matplotlib, so that library needs to be installed first:

pip install matplotlib

Then, use the following code to create the graph:

import matplotlib.pyplot as plt

data['column_name'].plot(kind='bar')
plt.show()

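Other chart types include line charts, pie charts, and histograms. Call plt.show() after each plot so that the charts are drawn as separate figures instead of on top of one another:

data['column_name'].plot(kind='line')
plt.show()
data['column_name'].plot(kind='pie')  # pie charts require non-negative values
plt.show()
data['column_name'].plot(kind='hist')
plt.show()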

8. Export data

Pandas can export data to various formats such as CSV, Excel, JSON, HTML, etc. Here is an example of exporting data as a CSV file:

data.to_csv('output.csv', index=False)

Other formats are exported the same way, for example to an Excel file:

data.to_excel('output.xlsx', index=False)
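
To see what index=False does, the sketch below writes a small invented table to an in-memory buffer rather than a file and reads it back. The product and price columns are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical data to export
data = pd.DataFrame({"product": ["A", "B"], "price": [1.5, 2.0]})

# Write to an in-memory buffer; a file path such as 'output.csv' works the same
buffer = io.StringIO()
data.to_csv(buffer, index=False)  # index=False leaves out the row labels
csv_text = buffer.getvalue()

# Read the text back to confirm that the round trip preserves the data
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

Without index=False, the row labels would be written as an extra unnamed first column.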

9. Practical case

Let's say we have a sales data file (sales_data.csv) that we want to analyze. First, we need to read the data:

import pandas as pd

data = pd.read_csv('sales_data.csv')

Then, we can clean and analyze the data. For example, we can calculate the sales amount for each row as quantity multiplied by price:

data['sales_amount'] = data['quantity'] * data['price']

Next, we can find the product with the highest total sales:

max_sales = data.groupby('product_name')['sales_amount'].sum().idxmax()
print(f'The product with the highest sales is: {max_sales}')

Finally, we can export the results as a CSV file:

data.to_csv('sales_analysis.csv', index=False)
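
The whole case can be run end to end without a file on disk. The sketch below substitutes a few invented rows for sales_data.csv; only the column names (product_name, quantity, price) come from the steps above, and the values are made up.

```python
import io

import pandas as pd

# Hypothetical rows standing in for sales_data.csv
csv_text = """product_name,quantity,price
Widget,2,10.0
Gadget,1,25.0
Widget,3,10.0
Gadget,2,25.0
Gizmo,5,4.0
"""

data = pd.read_csv(io.StringIO(csv_text))

# Sales amount of each row
data["sales_amount"] = data["quantity"] * data["price"]

# Total sales per product, then the product with the largest total
totals = data.groupby("product_name")["sales_amount"].sum()
max_sales = totals.idxmax()

print(totals)
print(f"The product with the highest sales is: {max_sales}")
```

Keeping the per-product totals in their own variable makes it easy to inspect every product, not just the top one.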

Pandas is a powerful data analysis library for Python that covers data processing, cleaning, analysis, and visualization. Mastering it will make everyday data analysis work considerably more efficient.

Origin blog.csdn.net/weixin_45841831/article/details/130442014