There are n groups of samples (1, 2, 3, 4), each made up of m elements (x, y) (the value of m is not fixed). The distribution curve of each group is shown in the figure below. Write a program to approximate the area enclosed by each curve and the lines oc and cd.
train of thought
- The figure can be divided into trapezoids, one per pair of adjacent points. Each trapezoid has parallel sides y_n and y_{n+1} and width (x_{n+1} - x_n), so its area is S = (x_{n+1} - x_n) * (y_n + y_{n+1}) / 2. The area of the entire figure is the sum of the areas of all trapezoids (a quick numeric check follows this list).
[picture]
- Finding the area between the curve and the x-axis beneath it is essentially integration. The sampled points can be integrated directly by calling np.trapz(y, x) (note the argument order: y first, then x).
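A quick sanity check of both approaches on made-up points (the values below are illustrative only, not from the exercise): the manual trapezoid sum and np.trapz agree.

# Toy data: four sample points of one curve
import numpy as np

x = [0.0, 1.0, 2.0, 4.0]
y = [1.0, 3.0, 2.0, 2.0]

# Manual trapezoid sum: (x[i+1] - x[i]) * (y[i] + y[i+1]) / 2 per segment
manual = sum(
    (x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2 for i in range(len(x) - 1)
)

# np.trapz takes y first, then x (renamed np.trapezoid in NumPy 2.0)
print(manual, np.trapz(y, x))  # both print 8.5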
the code
"""Calculate the area between the coordinates and the X-axis
"""
import typing
from pandas import read_parquet
def calc_area(file_name: str) -> typing.Any:
"""⾯积计算.
Args:
file_name: parquet⽂件路径, eg: data.parquet
Returns:
计算后的结果
"""
res = []
# Load data from .parquet
initial_data = read_parquet(file_name)
# Get number of groups
group_numbers = initial_data["gid"].drop_duplicates().unique()
# Loop through the results for each group
for i in group_numbers:
data = initial_data[initial_data["gid"] == i]
data = data.reset_index(drop=True)
# Extract the list of x\y
x_coordinates = data["x"]
y_coordinates = data["y"]
# Calculate area between (x[i], y[i]) and (x[i+1], y[i+1])
rect_areas = [
(x_coordinates[i + 1] - x_coordinates[i])
* (y_coordinates[i + 1] + y_coordinates[i])
/ 2
for i in range(len(x_coordinates) - 1)
]
# Sum the total area
result = sum(rect_areas)
res.append(result)
# Also we can use np for convenience
# import numpy as np
# result_np = np.trapz(y_coordinates, x_coordinates)
return res
calc_area("./data.parquet")
or use pyspark
"""Calculate the area between the coordinates and the X-axis
"""
import typing
from pyspark.sql import Window
from pyspark.sql.functions import lead, lit
from pyspark.sql import SparkSession
def calc_area(file_name: str) -> typing.Any:
"""⾯积计算.
Args:
file_name: parquet⽂件路径, eg: data.parquet
Returns:
计算后的结果
"""
res = []
# Create a session with spark
spark = SparkSession.builder.appName("Area Calculation").getOrCreate()
# Load data from .parquet
initial_data = spark.read.parquet(file_name, header=True)
# Get number of groups
df_unique = initial_data.dropDuplicates(subset=["gid"]).select("gid")
group_numbers = df_unique.collect()
# Loop through the results for each group
for row in group_numbers:
# Select a set of data
data = initial_data.filter(initial_data["gid"] == row[0])
# Adds a column of delta_x to the data frame representing difference
# from the x value of an adjacent data point
window = Window.orderBy(data["x"])
data = data.withColumn("delta_x", lead("x").over(window) - data["x"])
# Calculated trapezoidal area
data = data.withColumn(
"trap",
(
data["delta_x"]
* (data["y"] + lit(0.5) * (lead("y").over(window) - data["y"]))
),
)
result = data.agg({
"trap": "sum"}).collect()[0][0]
res.append(result)
return res
calc_area("./data.parquet")
Improve computing efficiency
- More efficient algorithms can be used, such as the adaptive Simpson method or other faster integration methods (a SciPy sketch follows this list)
- The data can be processed in parallel: partition the pandas/Spark DataFrame and use distributed computing
- When using Spark, specify partitions for window operations to improve performance, as in the partitioned sketch above
- The following are general efficiency techniques that are not specific to this example:
  - Parallel computing: use multi-core CPUs or a distributed computing system to decompose a task into subtasks processed in parallel.
  - Data compression: compress large data to reduce storage space and bandwidth and to speed up reads and writes.
  - Data chunking: splitting large data into chunks reduces memory requirements and speeds up processing.
  - Cache optimization: tune the caching strategy to reduce disk access and improve computing efficiency.
  - Algorithm optimization: use high-efficiency algorithms, such as tree-based and matrix-based algorithms, to improve computational efficiency.
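For the Simpson suggestion in the first bullet: SciPy ships a composite Simpson integrator for already-sampled points (non-adaptive, but the same family of methods). A minimal sketch on toy values:

import numpy as np
from scipy.integrate import simpson

# 11 samples of sin(x) on [0, pi]; the exact integral is 2.0
x = np.linspace(0.0, np.pi, 11)
y = np.sin(x)
print(simpson(y, x=x))  # ~2.0000, closer than the trapezoid estimate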