[Yu Gong Series] Miscellaneous items of Pandas data analysis in July 2023


Foreword

pandas is an open-source data processing library for Python. It provides high-performance, easy-to-use data structures and data analysis tools that make data processing and analysis easier and faster.

The main functions of pandas include:

  1. Data input and output: pandas reads and writes many data formats, including CSV, Excel, JSON, SQL, and HTML.

  2. Data cleaning and preprocessing: pandas provides functions for handling missing values, duplicate values, and outliers, so data can be cleaned and preprocessed quickly.

  3. Data transformation: pandas supports arithmetic, logical, and string operations, as well as aggregation, reshaping, and pivoting.

  4. Data analysis and visualization: pandas computes descriptive statistics and offers plotting methods (built on Matplotlib) for displaying results.

In general, pandas is a powerful data processing library that can help users efficiently and quickly perform data processing, analysis and visualization.

1. Miscellaneous

1. Reading database data with pandas

1.1 Read MySQL data

import MySQLdb
import pandas as pd

# host, port, username, password, and db_name are placeholders for your own connection settings
conn = MySQLdb.connect(host=host, port=port, user=username, passwd=password, db=db_name)
df = pd.read_sql('select * from table_name', con=conn)
conn.close()

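1.2 Read SQL Server data

import pymssql
import pandas as pd

# host, port, username, password, and database are placeholders for your own connection settings
conn = pymssql.connect(host=host, port=port, user=username, password=password, database=database)
df = pd.read_sql("select * from table_name", con=conn)
conn.close()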

1.3 Read SQLite data

import sqlite3
import pandas as pd

conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)
conn.close()
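
The SQLite case is the easiest to try without a server. The following is a minimal sketch, assuming an in-memory database and a hypothetical student table created only for this illustration; it also shows passing a query parameter through params rather than building the SQL string by hand.

```python
import sqlite3

import pandas as pd

# In-memory database so the example needs no external file or server.
conn = sqlite3.connect(":memory:")
try:
    # Hypothetical table and rows, created only for this illustration.
    conn.execute("CREATE TABLE student (id INTEGER, name TEXT, score REAL)")
    conn.executemany(
        "INSERT INTO student VALUES (?, ?, ?)",
        [(1, "Tom", 60.0), (2, "James", 89.0), (3, "Jenny", 79.0)],
    )
    conn.commit()

    # params fills the ? placeholder, so the value is never spliced into the SQL text.
    sqlite_df = pd.read_sql(
        "SELECT * FROM student WHERE score >= ?", conn, params=(75,)
    )
finally:
    # Close the connection even if the query raises.
    conn.close()

print(sqlite_df)
```

Wrapping the work in try/finally is one way to make sure the connection is released; the same pattern should carry over to the MySQL and SQL Server connections.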

3. Query DataFrames with SQL using pandasql

Install the third-party library pandasql:

pip install pandasql

Usage:

import pandas as pd
from pandasql import sqldf

df1 = pd.read_excel("student.xlsx")
df2 = pd.read_excel("sc.xlsx")
df3 = pd.read_excel("course.xlsx")
df4 = pd.read_excel("teacher.xlsx")

# Helper that resolves table names in a query against the global DataFrames
pysqldf = lambda q: sqldf(q, globals())

query1 = "select * from df1 limit 5"
query2 = "select * from df2 limit 5"
query3 = "select * from df3"
query4 = "select * from df4"

# Each call returns the query result as a new DataFrame
print(pysqldf(query1))
print(pysqldf(query2))
print(pysqldf(query3))
print(pysqldf(query4))
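
For comparison, simple queries like these have direct pandas equivalents. The sketch below assumes a small hypothetical students table (the contents of student.xlsx are not shown in this article) and mirrors a LIMIT and a WHERE clause without pandasql.

```python
import pandas as pd

# Hypothetical stand-in for the student table read from Excel.
students = pd.DataFrame(
    {
        "id": ["A001", "A002", "A003"],
        "name": ["Tom", "James", "Jenny"],
        "age": [7, 8, 7],
    }
)

# SQL: select * from students limit 2
first_two = students.head(2)

# SQL: select name from students where age = 7
age_seven = students.loc[students["age"] == 7, ["name"]]

print(first_two)
print(age_seven)
```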

4. Reading JSON files with pandas

4.1 Basic use

1. Read from a file

import pandas as pd

df = pd.read_json('sites.json')

print(df.to_string())

to_string() renders the entire DataFrame as a string, so print shows every row and column without truncation. JSON-style data held in Python objects can also be converted directly.
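
When the JSON arrives as text rather than as a file, one option is to wrap it in StringIO before handing it to read_json. This is a minimal sketch with hypothetical JSON text; the variable names are illustrative only.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON text; in practice it might come from an API response.
json_text = (
    '[{"id": "A001", "name": "Google", "likes": 124},'
    ' {"id": "A002", "name": "Runoob", "likes": 45}]'
)

# Wrapping the text in StringIO gives read_json a file-like object.
json_df = pd.read_json(StringIO(json_text))

# to_string() returns the whole frame as one str.
rendered = json_df.to_string()
print(rendered)
```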

2. Read JSON-style Python data

import pandas as pd

# A list of dicts: each dict becomes one row
data = [
    {
        "id": "A002",
        "name": "Google",
        "url": "www.google.com",
        "likes": 124
    },
    {
        "id": "A003",
        "name": "淘宝",
        "url": "www.taobao.com",
        "likes": 45
    }
]
df = pd.DataFrame(data)

print(df)


import pandas as pd


# JSON in dictionary format: outer keys become columns, inner keys become row labels
s = {
    "col1": {"row1": 1, "row2": 2, "row3": 3},
    "col2": {"row1": "x", "row2": "y", "row3": "z"}
}

# Convert the JSON to a DataFrame
df = pd.DataFrame(s)
print(df)

3. Read remote JSON

import pandas as pd

URL = 'https://static.runoob.com/download/sites.json'
df = pd.read_json(URL)
print(df)


4.2 Nested JSON data

import pandas as pd


# JSON in dictionary format
s = {
    "school_name": "ABC primary school",
    "class": "Year 1",
    "students": [
        {
            "id": "A001",
            "name": "Tom",
            "math": 60,
            "physics": 66,
            "chemistry": 61
        },
        {
            "id": "A002",
            "name": "James",
            "math": 89,
            "physics": 76,
            "chemistry": 51
        },
        {
            "id": "A003",
            "name": "Jenny",
            "math": 79,
            "physics": 90,
            "chemistry": 78
        }
    ]
}

# Convert the JSON to a DataFrame; each nested student record stays a dict in the "students" column
df = pd.DataFrame(s)
print(df)


import pandas as pd


# JSON in dictionary format
s = {
    "school_name": "ABC primary school",
    "class": "Year 1",
    "students": [
        {
            "id": "A001",
            "name": "Tom",
            "math": 60,
            "physics": 66,
            "chemistry": 61
        },
        {
            "id": "A002",
            "name": "James",
            "math": 89,
            "physics": 76,
            "chemistry": 51
        },
        {
            "id": "A003",
            "name": "Jenny",
            "math": 79,
            "physics": 90,
            "chemistry": 78
        }
    ]
}

# Expand the nested "students" list so that each student becomes one row
df = pd.json_normalize(s, record_path=['students'])
print(df)


import pandas as pd


# JSON in dictionary format
s = {
    "school_name": "local primary school",
    "class": "Year 1",
    "info": {
        "president": "John Kasich",
        "address": "ABC road, London, UK",
        "contacts": {
            "email": "[email protected]",
            "tel": "123456789"
        }
    },
    "students": [
        {
            "id": "A001",
            "name": "Tom",
            "math": 60,
            "physics": 66,
            "chemistry": 61
        },
        {
            "id": "A002",
            "name": "James",
            "math": 89,
            "physics": 76,
            "chemistry": 51
        },
        {
            "id": "A003",
            "name": "Jenny",
            "math": 79,
            "physics": 90,
            "chemistry": 78
        }
    ]
}

# Expand "students" into rows and attach selected top-level and nested fields via meta
df = pd.json_normalize(
    s,
    record_path=['students'],
    meta=[
        'class',
        ['info', 'president'],
        ['info', 'contacts', 'tel']
    ]
)
print(df)
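
To make the effect of meta easier to see, here is a smaller sketch with a hypothetical record. Each meta path becomes one column whose name joins the keys with a dot, and its value is repeated on every student row.

```python
import pandas as pd

# Hypothetical nested record, smaller than the one above.
school = {
    "class": "Year 1",
    "info": {"president": "John Kasich", "contacts": {"tel": "123456789"}},
    "students": [
        {"id": "A001", "name": "Tom"},
        {"id": "A002", "name": "James"},
    ],
}

flat = pd.json_normalize(
    school,
    record_path=["students"],
    meta=["class", ["info", "president"], ["info", "contacts", "tel"]],
)

# Record fields come first, followed by one column per meta path.
print(flat.columns.tolist())
print(flat)
```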


4.3 Read one field from nested data

Here we use the glom module to handle nested data. glom lets us reach attributes of nested objects with a dotted path such as grade.math.

pip3 install glom

import pandas as pd
from glom import glom

# JSON in dictionary format
s = {
    "school_name": "local primary school",
    "class": "Year 1",
    "students": [
        {
            "id": "A001",
            "name": "Tom",
            "grade": {
                "math": 60,
                "physics": 66,
                "chemistry": 61
            }
        },
        {
            "id": "A002",
            "name": "James",
            "grade": {
                "math": 89,
                "physics": 76,
                "chemistry": 51
            }
        },
        {
            "id": "A003",
            "name": "Jenny",
            "grade": {
                "math": 79,
                "physics": 90,
                "chemistry": 78
            }
        }
    ]
}

# Convert the JSON to a DataFrame
df = pd.DataFrame(s)
# For each student dict, follow the path grade -> math
data = df['students'].apply(lambda row: glom(row, 'grade.math'))
print(data)
print(df)
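
If installing glom is not an option, a similar result can be reached with pandas alone. This is a sketch of an alternative, not the approach used above: json_normalize flattens the nested grade dict of each record into dotted column names. The data is a shortened, hypothetical version of the record above.

```python
import pandas as pd

school = {
    "school_name": "local primary school",
    "students": [
        {"id": "A001", "name": "Tom", "grade": {"math": 60, "physics": 66}},
        {"id": "A002", "name": "James", "grade": {"math": 89, "physics": 76}},
    ],
}

# Nested dicts inside each student record are flattened into dotted column names.
grades = pd.json_normalize(school, record_path=["students"])
math_scores = grades["grade.math"]

print(grades.columns.tolist())
print(math_scores.tolist())
```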


5. Pandas data cleaning

A few of the more important points:

1. Specify which values are treated as missing

import pandas as pd

missing_values = ["n/a", "na", "--"]
df = pd.read_csv('property-data.csv', na_values = missing_values)
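
The effect of na_values can be checked on a small in-memory CSV. This sketch assumes hypothetical column names and rows standing in for property-data.csv, whose contents are not shown in this article.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text standing in for property-data.csv.
csv_text = "PID,NUM_BEDROOMS\n1001,3\n1002,--\n1003,na\n1004,2\n"

default_df = pd.read_csv(StringIO(csv_text))
custom_df = pd.read_csv(StringIO(csv_text), na_values=["n/a", "na", "--"])

# Without na_values, "--" and "na" are kept as ordinary strings.
print(default_df["NUM_BEDROOMS"].isna().sum())

# With na_values, both are parsed as NaN and the column becomes numeric.
print(custom_df["NUM_BEDROOMS"].isna().sum())
print(custom_df["NUM_BEDROOMS"].dtype)
```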

2. Date formatting

import pandas as pd

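# The third date uses a different format from the first two
data = {
    "Date": ['2020/12/01', '2020/12/02', '20201226'],
    "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# format='mixed' (pandas 2.0+) infers the format of each value separately;
# without it, pandas 2.x raises an error on the third date
df['Date'] = pd.to_datetime(df['Date'], format='mixed')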

print(df.to_string())
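
When the possible formats are known in advance, a stricter alternative is to parse with each explicit format and combine the results. The sketch below is one way to do that, using errors='coerce' so that non-matching values become NaT instead of raising.

```python
import pandas as pd

raw = pd.Series(["2020/12/01", "2020/12/02", "20201226"])

# Values that do not match the explicit format become NaT instead of raising.
strict = pd.to_datetime(raw, format="%Y/%m/%d", errors="coerce")

# Second pass: parse with the compact format, then use it to fill the gaps.
compact = pd.to_datetime(raw, format="%Y%m%d", errors="coerce")
parsed = strict.fillna(compact)

print(parsed)
```

Any value that matches neither format stays NaT, which makes bad input easy to find with isna().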


2. Case study

import pandas as pd

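# Read the JSON data
df = pd.read_json('data.json')

# Drop rows that contain missing values
df = df.dropna()

# Alternatively, fill missing values with specified values
# (this has no effect here because dropna() already removed those rows)
df = df.fillna({'age': 0, 'score': 0})

# Rename the columns to Chinese labels (name, age, gender, score)
df = df.rename(columns={'name': '姓名', 'age': '年龄', 'gender': '性别', 'score': '成绩'})

# Sort by score, highest first
df = df.sort_values(by='成绩', ascending=False)

# Group by gender and compute the mean age and mean score
grouped = df.groupby('性别').agg({'年龄': 'mean', '成绩': 'mean'})

# Keep rows with a score of at least 90, and only the name and score columns
df = df.loc[df['成绩'] >= 90, ['姓名', '成绩']]

# Basic statistics for the numeric columns
stats = df.describe()

# Mean of each numeric column (numeric_only skips the text column)
mean = df.mean(numeric_only=True)

# Median of each numeric column
median = df.median(numeric_only=True)

# Mode of each column
mode = df.mode()

# Number of non-missing values in each column
count = df.count()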
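
Because data.json is not included with this article, the following sketch replays the main steps on a small hypothetical table with English column names. It fills missing values instead of dropping rows, which shows how a fill value of 0 pulls the group means down.

```python
import pandas as pd

# Hypothetical records standing in for data.json.
records = pd.DataFrame(
    {
        "name": ["Tom", "James", "Jenny", "Anna"],
        "gender": ["M", "M", "F", "F"],
        "age": [20, None, 22, 21],
        "score": [95, 88, None, 91],
    }
)

# Fill missing ages and scores with 0 instead of dropping the rows.
filled = records.fillna({"age": 0, "score": 0})

# Mean age and mean score per gender; the filled zeros count toward the means.
by_gender = filled.groupby("gender").agg({"age": "mean", "score": "mean"})

# Names and scores of rows scoring at least 90, highest first.
top = filled.loc[filled["score"] >= 90, ["name", "score"]].sort_values(
    by="score", ascending=False
)

print(by_gender)
print(top)
```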


Origin blog.csdn.net/aa2528877987/article/details/131566011