How to Find Patterns in Data: 4 Python Methods


Discovering patterns in data is a key step in data analysis and data science. Common methods include:

  1. Statistical description: Use basic statistics (mean, median, standard deviation, percentiles, and so on) to summarize the data and understand its distribution and trends.

  2. Data visualization: Plot the data as charts such as histograms, scatter plots, and box plots to show its distribution and trends more clearly. Common tools include Matplotlib and Seaborn in Python, and ggplot2 in R.

  3. Grouping and aggregation: Group the data by a variable, then aggregate each group (for example, compute the mean, median, maximum, or minimum) to reveal relationships and trends between variables.

  4. Machine learning algorithms: Use algorithms such as linear regression, decision trees, and clustering to model the data and make predictions, which gives deeper insight into its patterns and trends.

Combining these methods gives a more complete picture of the patterns in the data and supports better analysis and decision-making.

The following sections walk through each method in Python.

1. Preparation

Before starting, make sure that Python and pip are installed on your computer. If not, see this article for installation instructions: Super detailed Python installation guide.

(Optional 1) If you use Python for data analysis, you can install Anaconda directly, which bundles Python and pip: Anaconda, a good helper for Python data analysis and mining.

(Optional 2) The VSCode editor is also recommended, as it has many advantages: The best partner for Python programming: VSCode detailed guide.

Open a terminal in one of the following ways, then enter the commands below to install the dependencies:
1. On Windows, open Cmd (Start > Run > CMD).
2. On macOS, open Terminal (press Command+Space and type Terminal).
3. If you use VSCode or PyCharm, use the Terminal panel at the bottom of the window.

pip install pandas
pip install numpy
pip install scipy
pip install seaborn
pip install matplotlib

# Machine learning section
pip install scikit-learn
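
After installing, it can help to confirm that every package is importable before moving on. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `check_packages` helper are illustrative names, not part of any of these libraries. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the packages installed above.
# scikit-learn is imported as "sklearn".
REQUIRED = ["pandas", "numpy", "scipy", "seaborn", "matplotlib", "sklearn"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'MISSING'}")
```

Any package reported as MISSING can be installed again with the matching pip command.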

2. Finding patterns with statistical description

Statistical description in Python can rely on libraries such as NumPy, pandas, and SciPy.

Here are some basic statistical description functions:

  1. Mean: Calculates the mean of a set of data.

import numpy as np

data = [1, 2, 3, 4, 5]
mean = np.mean(data)
print(mean)

The output is: 3.0

  2. Median: Calculates the median of a set of data.

import numpy as np

data = [1, 2, 3, 4, 5]
median = np.median(data)
print(median)

The output is: 3.0

  3. Mode: Calculates the mode (the most frequent value) of a set of data.

import scipy.stats as stats

data = [1, 2, 2, 3, 4, 4, 4, 5]
mode = stats.mode(data)
print(mode)

The output in older SciPy versions is: ModeResult(mode=array([4]), count=array([3])). Since SciPy 1.11, the result holds scalars instead of arrays (mode 4, count 3).

  4. Variance: Calculates the variance of a set of data. By default, np.var computes the population variance (dividing by n).

import numpy as np

data = [1, 2, 3, 4, 5]
variance = np.var(data)
print(variance)

The output is: 2.0

  5. Standard deviation: Calculates the standard deviation of a set of data.

import numpy as np

data = [1, 2, 3, 4, 5]
std_dev = np.std(data)
print(std_dev)

The output is: 1.4142135623730951

These are only some of the basic statistical description functions; NumPy, pandas, and SciPy provide many others. See each library's documentation for details.
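
The introduction also mentions percentiles, which the examples above do not cover. As a rough sketch on the same toy data, the block below computes quartiles with `np.percentile`, shows the sample (rather than population) variance and standard deviation via `ddof=1`, and uses pandas `Series.describe()` to get several statistics in one call. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = [1, 2, 3, 4, 5]

# Percentiles: the 25th, 50th (median), and 75th percentile.
q25, q50, q75 = np.percentile(data, [25, 50, 75])
print(q25, q50, q75)

# Sample variance and standard deviation use ddof=1 (divide by n - 1),
# unlike the population versions that np.var and np.std compute by default.
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)
print(sample_var, sample_std)

# pandas reports count, mean, std, min, quartiles, and max in one call.
# Series.describe() uses the sample standard deviation (ddof=1).
summary = pd.Series(data).describe()
print(summary)
```

Which variance to use depends on whether the data is the whole population or a sample drawn from it; for a sample, `ddof=1` is the usual choice.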

3. Finding patterns with data visualization

Python has many libraries for data visualization; the most commonly used are Matplotlib and Seaborn. Here are some basic visualization methods:

  1. Line plot: Shows the trend of a variable over time or against another variable.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.plot(x, y)
plt.show()

  2. Scatter plot: Shows the relationship between two variables.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.scatter(x, y)
plt.show()

  3. Histogram: Displays the distribution of numerical data.

import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 4, 4, 4, 5]

plt.hist(data, bins=5)
plt.show()

  4. Box plot: Displays the median, quartiles, and outliers of numerical data.

import matplotlib.pyplot as plt
import seaborn as sns

data = [1, 2, 2, 3, 4, 4, 4, 5]

sns.boxplot(data=data)
plt.show()

  5. Bar chart: Shows differences or comparisons between categories.

import matplotlib.pyplot as plt

categories = ['A', 'B', 'C', 'D']
values = [10, 20, 30, 40]

plt.bar(categories, values)
plt.show()

These are some basic data visualization methods. Both Matplotlib and Seaborn offer many more features for creating more complex charts.
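
As one example of a slightly more complex chart, several plots can share a single figure. The sketch below places a histogram and a box plot of the same data side by side using `plt.subplots`; the `plot_distribution` helper is a hypothetical name introduced only for illustration.

```python
import matplotlib.pyplot as plt


def plot_distribution(data):
    """Draw a histogram and a box plot of the same data side by side."""
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

    # Left panel: how often each range of values occurs.
    ax_hist.hist(data, bins=5)
    ax_hist.set_title("Histogram")
    ax_hist.set_xlabel("value")
    ax_hist.set_ylabel("count")

    # Right panel: median, quartiles, and any outliers.
    ax_box.boxplot(data)
    ax_box.set_title("Box plot")
    ax_box.set_ylabel("value")

    fig.tight_layout()
    return fig


fig = plot_distribution([1, 2, 2, 3, 4, 4, 4, 5])
```

Call `plt.show()` afterwards to display the figure, or `fig.savefig("distribution.png")` to write it to a file.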

4. Finding patterns with grouping and aggregation

In Python, the pandas library makes it easy to group and aggregate data to discover patterns. Here is a basic example:

Suppose we have a dataset with sales dates, sales amounts, and salesperson names, and we want to know the total sales for each salesperson. We can group by salesperson name and apply an aggregate function such as sum or mean to each group. Here is some sample code:

import pandas as pd

# Create the dataset
data = {'sales_date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08', '2022-01-09', '2022-01-10'],
        'sales_amount': [100, 200, 150, 300, 250, 400, 350, 450, 500, 600],
        'sales_person': ['John', 'Jane', 'John', 'Jane', 'John', 'Jane', 'John', 'Jane', 'John', 'Jane']}

df = pd.DataFrame(data)

# Group by salesperson name and sum the sales amount of each group
grouped = df.groupby('sales_person')['sales_amount'].sum()

print(grouped)

The output is:

sales_person
Jane    1950
John    1350
Name: sales_amount, dtype: int64

As you can see, we grouped by salesperson name and summed the sales amount for each group. This gives the total sales of each salesperson, which is one simple pattern in the data.
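
The text above mentions applying several aggregate functions, such as sum and average, to each group. A minimal sketch of doing this in one step with pandas `agg`, on the same illustrative sales figures (the date column is left out because it is not needed here):

```python
import pandas as pd

# Same illustrative sales data as in the example above.
df = pd.DataFrame({
    "sales_amount": [100, 200, 150, 300, 250, 400, 350, 450, 500, 600],
    "sales_person": ["John", "Jane"] * 5,
})

# Compute several aggregates per salesperson in a single call.
summary = df.groupby("sales_person")["sales_amount"].agg(["count", "sum", "mean", "max"])
print(summary)
```

The result is a DataFrame with one row per salesperson and one column per aggregate, which makes it easy to compare groups side by side.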

5. Finding patterns with machine learning algorithms

You can use the scikit-learn library to apply machine learning algorithms and discover patterns in data. Here is a basic example that uses a decision tree to classify data:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create the dataset
data = {'age': [22, 25, 47, 52, 21, 62, 41, 36, 28, 44],
        'income': [21000, 22000, 52000, 73000, 18000, 87000, 45000, 33000, 28000, 84000],
        'gender': ['M', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F', 'M'],
        'bought': ['N', 'N', 'Y', 'Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y']}

df = pd.DataFrame(data)

# Convert the text data to numeric data
df['gender'] = df['gender'].map({'M': 0, 'F': 1})
df['bought'] = df['bought'].map({'N': 0, 'Y': 1})

# Split the dataset into training and test sets
X = df[['age', 'income', 'gender']]
y = df['bought']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create the decision tree model
model = DecisionTreeClassifier()

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))

One possible output is:

Accuracy: 50.00%

Because train_test_split shuffles the data randomly and the test set holds only two samples here, the accuracy varies between runs and can only be 0%, 50%, or 100%. Pass random_state to train_test_split for a reproducible split.

As you can see, we used a decision tree to classify the data and computed the model's accuracy on the test set. A model like this can help reveal patterns in the data, such as which factors affect the purchase decision. Note that this is only a simple example with a very small dataset. In practice, choose the machine learning algorithm and feature engineering methods that suit the specific problem.
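
To see which factors the tree actually relies on, one option is to inspect the fitted model directly. The sketch below refits a decision tree on the same illustrative dataset, then prints scikit-learn's `feature_importances_` and the learned rules via `export_text`. Fitting on all ten rows and fixing `random_state=0` are assumptions made here so the tree is easy to inspect; they are not part of the original example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Same illustrative dataset as in the example above.
df = pd.DataFrame({
    "age": [22, 25, 47, 52, 21, 62, 41, 36, 28, 44],
    "income": [21000, 22000, 52000, 73000, 18000, 87000, 45000, 33000, 28000, 84000],
    "gender": ["M", "F", "F", "M", "M", "M", "F", "M", "F", "M"],
    "bought": ["N", "N", "Y", "Y", "N", "Y", "Y", "N", "Y", "Y"],
})
df["gender"] = df["gender"].map({"M": 0, "F": 1})
df["bought"] = df["bought"].map({"N": 0, "Y": 1})

X = df[["age", "income", "gender"]]
y = df["bought"]

# Fit on all rows: the goal here is to inspect the tree, not to score it.
# random_state fixes tie-breaking between equally good splits.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Share of the total impurity reduction attributed to each feature.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Human-readable if/else rules learned by the tree.
rules = export_text(model, feature_names=list(X.columns))
print(rules)
```

With only ten rows, the importances and rules describe this toy dataset rather than any general purchasing behavior, so treat them as a way to read the model, not as conclusions.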


Python Practical Collection (pythondict.com)

Origin blog.csdn.net/u010751000/article/details/129576855