Profit prediction need not be difficult: scikit-learn's linear regression gets you twice the result with half the effort

Use scikit-learn's linear regression to predict company profits.

1. Introduction

Generative AI is undoubtedly a game-changing technology, but for most business problems, traditional machine learning models such as regression and classification are still the first choice.

Consider how investors such as private equity or venture capital firms could use machine learning. Answering that question requires understanding what data investors care about and how they use it. Investment decisions rest not only on quantifiable data such as expenses, growth, and burn rate, but also on qualitative data such as founders' track records, customer feedback, and product experience.

This article covers the basics of linear regression. The complete code is available at the link below.

[Code]: https://github.com/RoyiHD/linear-regression

2. Project setup

This project uses Jupyter Notebook. Start by importing the libraries.

Import libraries

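# Plotting
import matplotlib.pyplot as plt
# Data management and processing (pd is needed for pd.read_csv and pd.DataFrame)
import pandas as pd
from pandas import DataFrame
# Heatmap plotting
import seaborn as sns
# Evaluation metric
from sklearn.metrics import r2_score
# Splitting the data into training and test sets
from sklearn.model_selection import train_test_split
# Linear model
from sklearn.linear_model import LinearRegression
# Type annotations
from typing import List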

3. Data

To keep the problem simple, this article uses regional data describing companies' expense categories and profits. A few sample rows are shown below. The goal is to train a linear regression model on the spending data and use it to predict profit.

Keep in mind that this data describes only a company's spending. Truly meaningful predictive power would require combining spending data with data on revenue growth, local taxes, amortization, and market conditions.

R&D Spend    Administration    Marketing Spend    Profit
165349.2     136897.8          471784.1           192261.83
162597.7     151377.59         443898.53          191792.06
153441.51    101145.55         407934.54          191050.39

Load the data

companies: DataFrame = pd.read_csv("companies.csv", header=0)
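
Before plotting anything, it can help to confirm that the file loaded as expected. The following is a minimal sketch, not the article's own code: because companies.csv is not included here, it reads the three sample rows shown above from an in-memory string (the SAMPLE_CSV name is an assumption made for illustration).

```python
import io

import pandas as pd

# The three sample rows shown in the article, held in memory so the
# sketch runs without companies.csv on disk (illustrative stand-in).
SAMPLE_CSV = """R&D Spend,Administration,Marketing Spend,Profit
165349.2,136897.8,471784.1,192261.83
162597.7,151377.59,443898.53,191792.06
153441.51,101145.55,407934.54,191050.39
"""

companies = pd.read_csv(io.StringIO(SAMPLE_CSV), header=0)

# Quick sanity checks: row/column counts, column types, missing values.
print(companies.shape)
print(companies.dtypes)
print(companies.isna().sum())
```

With the real file, replacing io.StringIO(SAMPLE_CSV) with "companies.csv" should give the same kind of summary for the full dataset.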

4. Data visualization

Understanding the data helps you decide which features to use, which features need to be normalized or transformed, how to handle outliers, and what to do with specific data points.

Target (Profit) Histogram

Histograms can be plotted directly from a DataFrame (pandas uses Matplotlib for plotting), so the Profit column can be selected and plotted in one line.

companies["Profit"].hist(color="g", bins=100)

As you can see, only a few outliers have profits above $200,000. This also suggests that the data represents companies of a particular size. Because there are so few outliers, they can be kept.

Feature (expenditure) histogram

Next, look at histograms of the features to see how they are distributed. The Y-axis shows the frequency and the X-axis shows the expenditure.

companies[[
  "R&D Spend", 
  "Administration", 
  "Marketing Spend"
]].hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)

The features are also reasonably well distributed, with only a few outliers. Intuitively, one would expect companies that spend more on R&D and marketing to be more profitable. As the scatterplot below shows, there is a clear correlation between R&D spending and profits.

profits: DataFrame = companies[["Profit"]]
research_and_development_spending: DataFrame = companies[["R&D Spend"]]

figure, ax = plt.subplots(figsize = (9, 9))
plt.xlabel("R&D Spending")
plt.ylabel("Profits")
ax.scatter(
  research_and_development_spending, 
  profits, 
  s=60, 
  alpha=0.7, 
  edgecolors="k",
  color='g',
  linewidths=0.5
)

The correlation between expenses and profits can be further explored through correlation heatmaps. As can be seen from the figure, R&D and marketing spending have a higher correlation with profits than administrative spending.

sns.heatmap(companies.corr())
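
The heatmap shows correlations as colors, and it can be useful to read the numbers as well. The sketch below ranks the features by their correlation with Profit. It rests on two assumptions that are not from the article: the data is limited to the three sample rows shown earlier (far too few for a meaningful correlation, so the printed values are illustrative only), and a hypothetical text column named "Region" is added to show that non-numeric columns should be excluded before calling corr().

```python
import pandas as pd

# Three sample rows from the article plus a hypothetical text column.
companies = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.51],
    "Administration": [136897.8, 151377.59, 101145.55],
    "Marketing Spend": [471784.1, 443898.53, 407934.54],
    "Region": ["A", "B", "A"],  # hypothetical, not in the article's sample
    "Profit": [192261.83, 191792.06, 191050.39],
})

# Keep only numeric columns so corr() is not handed text data.
numeric = companies.select_dtypes(include="number")
correlations = numeric.corr()

# Correlation of each feature with Profit, strongest first.
profit_corr = correlations["Profit"].drop("Profit").sort_values(ascending=False)
print(profit_corr)
```

The same numeric-only frame can be passed to sns.heatmap in place of companies.corr() if the real file contains text columns.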

5. Model training

First, the dataset needs to be split into a training set and a test set. Sklearn provides a helper function, train_test_split, for this task. Because the dataset in this article is simple and small, the features and targets can be separated as follows.

Dataset

features: DataFrame = companies[[
    "R&D Spend", 
    "Administration", 
    "Marketing Spend",
]]
targets: DataFrame = companies[["Profit"]]

train_features, test_features, train_targets, test_targets = train_test_split(
  features, 
  targets,
  test_size=0.2
)
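
Because train_test_split shuffles the rows randomly, each run of the code above produces a different split and therefore slightly different coefficients and scores. The sketch below shows one way to make the split repeatable by passing random_state. The 50 rows are synthetic values generated only for illustration, and the seed 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up for illustration, not the article's data).
rng = np.random.default_rng(0)
n_rows = 50
features = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 170_000, n_rows),
    "Administration": rng.uniform(50_000, 180_000, n_rows),
    "Marketing Spend": rng.uniform(0, 470_000, n_rows),
})
targets = pd.DataFrame({"Profit": rng.uniform(15_000, 190_000, n_rows)})

# The same random_state gives the same split on every call.
split_a = train_test_split(features, targets, test_size=0.2, random_state=42)
split_b = train_test_split(features, targets, test_size=0.2, random_state=42)
train_features, test_features, train_targets, test_targets = split_a

print(train_features.shape, test_features.shape)
print(test_features.index.equals(split_b[1].index))
```

With test_size=0.2, 20% of the rows are held out for testing and the rest are used for training.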

Most data scientists use a different naming convention, such as X_train and y_train, or similar variations.

Model training

Now you can create and train the model. Sklearn makes things very easy.

model: LinearRegression = LinearRegression()
model.fit(train_features, train_targets)

6. Model evaluation

Next, evaluate the model's performance and usability. Start with the computed coefficients. In machine learning, coefficients are the learned weights by which each feature is multiplied. Expect to see one learned coefficient per feature.

coefficients = model.coef_
print("Coefficients ", coefficients)

"""
We should see the following in our console

Coefficients  [[0.55664299 1.08398919 0.07529883]]
"""

As you can see above, there are 3 coefficients, one for each feature ("R&D Spend", "Administration", "Marketing Spend"). They can also be plotted on a bar chart to give a more intuitive sense of each coefficient.

plt.figure()
plt.barh(train_features.columns, coefficients[0])
plt.show()
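
To make the role of the coefficients concrete, the sketch below reproduces a fitted model's predictions by hand: each feature is multiplied by its coefficient, the products are summed, and the intercept is added. The data is synthetic and the relationship used to generate it is invented for illustration, so the coefficients will not match the ones printed above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (made up for illustration, not the article's data).
rng = np.random.default_rng(0)
n_rows = 50
features = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 170_000, n_rows),
    "Administration": rng.uniform(50_000, 180_000, n_rows),
    "Marketing Spend": rng.uniform(0, 470_000, n_rows),
})
noise = rng.normal(0, 5_000, n_rows)
targets = pd.DataFrame({
    "Profit": 50_000
    + 0.8 * features["R&D Spend"]
    + 0.03 * features["Marketing Spend"]
    + noise
})

model = LinearRegression().fit(features, targets)

# prediction = intercept + sum(coefficient_i * feature_i)
manual = features.to_numpy() @ model.coef_.T + model.intercept_

print(model.coef_.shape, model.intercept_.shape)
print(np.allclose(manual, model.predict(features)))
```

Because the targets are passed as a one-column DataFrame, coef_ is a 2D array with one row, which is why the plotting code above indexes coefficients[0].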

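Calculating the error

To gauge how well the model fits, this article uses Sklearn's R2 score (the coefficient of determination), computed on the test set. Strictly speaking, it is a goodness-of-fit measure rather than an error rate.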

test_predictions = model.predict(test_features)  # NumPy array of predicted profits
r2: float = r2_score(test_targets, test_predictions)
print(r2)
"""
We should see an output similar to this
0.9781424529214315
"""

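For reference, the R2 score compares the model's squared errors with those of a baseline that always predicts the mean: R2 = 1 - SS_res / SS_tot. The sketch below computes it by hand and checks the result against r2_score. The four value pairs are arbitrary numbers chosen for illustration, not profits from the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Arbitrary illustrative values (not taken from the article's data).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: how far the predictions are from the truth.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: how far the truth is from its own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means the predictions match the targets exactly, 0 means the model does no better than predicting the mean, and negative values mean it does worse.
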
The closer the score is to 1, the better the model's predictions match the actual profits. This can be checked in a very simple way.

Take the first row of the dataset. If you use the spending data below to predict profit, expect a number reasonably close to $192,261.

"R&D Spend" | "Administration" | "Marketing Spend" | "Profit"
165349.2    | 136897.8         | 471784.1          | 192261.83

Next create an inference request.

inference_request: DataFrame = pd.DataFrame([{
  "R&D Spend":165349.2, 
  "Administration":136897.8, 
  "Marketing Spend":471784.1 
}])

Run the model.

inference = model.predict(inference_request)  # array of shape (1, 1), not a plain float
print(inference)
"""
We should get a number that is around
199739.88721901
"""

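The relative error is abs(199739 - 192261) / 192261 ≈ 0.0389, or about 3.9%. That is quite accurate, although this row may also have been part of the training set, so it is a sanity check rather than a true out-of-sample test.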
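
The same arithmetic can be written out in a few lines. This sketch uses only the two numbers quoted above, the actual profit from the first row and the prediction reported by the article; the variable names are chosen here for illustration.

```python
# Values quoted in the article.
actual_profit = 192261.83
predicted_profit = 199739.88721901

# Absolute error in dollars and error relative to the actual profit.
absolute_error = abs(predicted_profit - actual_profit)
relative_error = absolute_error / actual_profit

print(f"absolute error: {absolute_error:.2f}")
print(f"relative error: {relative_error:.4f} ({relative_error:.1%})")
```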

7. Conclusion

There are many ways to process data, build models, and analyze results. No single solution fits every situation. When using machine learning to solve business problems, a key part of the process is to build several models aimed at the same problem and select the most promising one.
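
As one possible way to compare candidate models, the sketch below fits linear regression on two different feature sets and scores each with 5-fold cross-validation. This is an illustration rather than part of the original project: the data is synthetic, the candidate names are made up, and cross_val_score is swapped in here in place of the single train/test split used earlier.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (made up for illustration, not the article's data).
rng = np.random.default_rng(0)
n_rows = 50
features = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 170_000, n_rows),
    "Administration": rng.uniform(50_000, 180_000, n_rows),
    "Marketing Spend": rng.uniform(0, 470_000, n_rows),
})
profit = (
    50_000
    + 0.8 * features["R&D Spend"]
    + 0.03 * features["Marketing Spend"]
    + rng.normal(0, 5_000, n_rows)
)

# Candidate models: the same estimator trained on different feature sets.
candidates = {
    "all spending features": ["R&D Spend", "Administration", "Marketing Spend"],
    "R&D only": ["R&D Spend"],
}

# Mean R2 across 5 folds for each candidate.
scores = {
    name: cross_val_score(
        LinearRegression(), features[columns], profit, cv=5, scoring="r2"
    ).mean()
    for name, columns in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: mean R2 = {score:.3f}")
```

Averaging over several folds gives a steadier basis for choosing between models than a single random split, which matters on a dataset this small.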

Origin blog.csdn.net/weixin_39915649/article/details/135288805