Adding a Trend Line to a Scatter Plot

1. Introduction

A scatter plot is a useful way to explore the relationship between two variables, but it has a shortcoming: we have to guess the trend by eye. Adding a trend line to the scatter plot makes the point clearer and more convincing. It helps not only readers but also us, the data analysts, to sort out what is going on.

However, matplotlib has no argument or built-in method for this, so we have to build the trend line ourselves. The Python ecosystem for regression (which is what a trend line is) is more than crowded: numpy, statsmodels, scipy and sklearn each have their own way of doing the same thing. I wish they shared the same or a similar interface, but that is not the case. So in this article we will try to add a trend line as easily as we can. After that, we will take a glimpse at how other libraries do the same thing.

One quick note before we start: a trend line is actually a regression on the scatter data, so this article also serves as a standard walkthrough of how to do regression in Python.

 

 

2. The Data

We will use two classic datasets, "tips.csv" and "mpg.csv", as examples.

One quick way to get them is through seaborn's built-in datasets, which already include tips and mpg.

import seaborn as sns

tips = sns.load_dataset("tips")
mpg = sns.load_dataset("mpg")

Here is what they look like, first 5 lines:

  

We can draw a normal scatter plot to see what it looks like.

from matplotlib import pyplot as plt

fig, ax1 = plt.subplots(1, 1, figsize=(8, 6))
ax1.scatter(tips['total_bill'], tips['tip'])

fig, ax1 = plt.subplots(1, 1, figsize=(8, 6))
ax1.scatter(mpg['horsepower'], mpg['mpg'])

 

 

 

3. Seaborn

As mentioned before, matplotlib has no built-in method to add a trend line. But seaborn has one, named "regplot" (short for regression plot).

It solves the problem completely, without us even having to know how to do a regression.

fig, ax1 = plt.subplots(1, 1, figsize=(8, 6))

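# x and y are passed by keyword; recent seaborn releases no longer accept them positionally
scatter = sns.regplot(x='total_bill', y='tip', data=tips, ax=ax1)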

The mpg data shows a curved trend, so we pass the argument order=2 as below.

# order=2 fits a second-order polynomial instead of a straight line
fig, ax1 = plt.subplots(1, 1, figsize=(8, 6))

scatter = sns.regplot(x='horsepower', y='mpg', data=mpg, order=2, ax=ax1)

  

We can do some customization within seaborn. It is not very obvious, nor very consistent with matplotlib, but it is still good to know.

# line_kws and scatter_kws set the line color and the scatter color
fig, ax1 = plt.subplots(1, 1, figsize=(8, 6))

scatter = sns.regplot(x='horsepower', y='mpg', data=mpg, order=2, ax=ax1,
                      line_kws={'color':'darkorange'}, scatter_kws={'color':'pink'})

This could actually be the end of the story: we have a decent result and an easy enough way to add a trend line to a scatter plot.

If you can bear with me, or would like to come back later, we can go into more detail on the topic.

 

  

4. Numpy  

import numpy as np

The second easiest way to add a trend line is with the numpy library. It may be surprising how powerful numpy is, without even reaching for statsmodels, scipy or sklearn.

With the seaborn method, we have no idea what the function of the trend line is, or how well it fits the data.

Numpy does the same job, but it also gives us the function of the trend line. At least if we are asked, we can say what the trend line function is.

The function built by "np.poly1d()" can be printed directly, which is convenient for us users.

params = np.polyfit(tips['total_bill'], tips['tip'], 1) # degree 1 (a straight line)
function = np.poly1d(params)

print(function)
# output
# 0.105 x + 0.9203

Using the function we built above, we can draw the trend line with matplotlib.

fig, ax1 = plt.subplots(1, 1, figsize=(8, 6))

ax1.scatter(tips['total_bill'], tips['tip'])

x = np.linspace(0, 50)
ax1.plot(x, function(x), color='red', linewidth=2)
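
The x range above is hard-coded to 0..50. As a minimal sketch, the range could instead be derived from the data itself. The helper name trend_line_points and the synthetic sample data below are my own assumptions for illustration; they are not part of numpy or matplotlib.

```python
import numpy as np

def trend_line_points(x, y, degree=1, num=50):
    """Hypothetical helper: fit a polynomial and return points along it."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # keep only the pairs where both values are present
    mask = ~(np.isnan(x) | np.isnan(y))
    fitted = np.poly1d(np.polyfit(x[mask], y[mask], degree))
    # span exactly the observed x range instead of a hard-coded one
    xs = np.linspace(x[mask].min(), x[mask].max(), num)
    return xs, fitted(xs)

# synthetic data standing in for total_bill / tip (made up for illustration)
rng = np.random.default_rng(0)
bill = rng.uniform(3, 50, size=200)
tip = 0.1 * bill + 1.0 + rng.normal(0, 0.5, size=200)

xs, ys = trend_line_points(bill, tip, degree=1)
```

With an existing axes, ax1.plot(xs, ys, color='red', linewidth=2) would then draw the line over the scatter.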

  

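Similarly, we will do the same for the mpg dataset.

# horsepower has a few missing values; drop them first, or polyfit fails or returns NaN
mpg = mpg.dropna()

params = np.polyfit(mpg['horsepower'], mpg['mpg'], 2) # degree 2 (a quadratic)
function = np.poly1d(params)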

print(function)
# output
#          2
#0.001231 x - 0.4662 x + 56.9

  

fig, ax1 = plt.subplots(1, 1, figsize=(8, 6))

ax1.scatter(mpg['horsepower'], mpg['mpg'])

x = np.linspace(50, 225)
ax1.plot(x, function(x), color='r', linewidth=2)

I think this method is also easy enough. The two key functions here are very well designed, especially since the output of np.poly1d() can be printed and also used directly for drawing.
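
If a rough goodness-of-fit number is wanted without leaving numpy, the coefficient of determination (R^2) can be computed by hand from the residuals. The following is only a sketch on synthetic data that I made up for illustration; it is not output from the tips or mpg datasets.

```python
import numpy as np

# synthetic data (made up for illustration): a noisy straight line
rng = np.random.default_rng(1)
x = np.linspace(0, 50, 100)
y = 0.1 * x + 1.0 + rng.normal(0, 0.5, size=x.size)

fitted = np.poly1d(np.polyfit(x, y, 1))

residuals = y - fitted(x)
ss_res = np.sum(residuals ** 2)        # variation the line leaves unexplained
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.3f}")
```

A value close to 1 means the line explains most of the variation in y; a value close to 0 means it explains very little.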

 

 

5. Statsmodels

In the previous section we saw how to get the expression (function) of the trend line. But we still have no idea how good or bad the fit is.

Normally we don't need this information, but when we do, statsmodels is a good choice.

from statsmodels.formula.api import ols

tips = tips.dropna() # drop missing values ourselves, or the fit may raise an error

curve = ols('tip~total_bill', tips) # building expression y ~ x
curve = curve.fit()

print(curve.params)
# output
# Intercept     0.920270
# total_bill    0.105025
# dtype: float64

Using this information, we can build the function ourselves and draw the trend line.

# we could use np.poly1d() to build the function if we wanted,
# but here is the plain way:
def f(x):
    # look the coefficients up by label rather than by position
    return curve.params['total_bill']*x + curve.params['Intercept']


fig, ax1 = plt.subplots(1, 1, figsize=(8, 6))

ax1.scatter(tips['total_bill'], tips['tip'])

x = np.linspace(0, 50)
ax1.plot(x, f(x), color='red', linewidth=2)

  

No surprise, it is the same as before, but the good part is that we can get a summary of the fit, as below.

If we are doing regression and trying to figure out which parameters are good or bad enough, this summary table can be a lifesaver.

curve.summary()
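
Individual numbers from the summary table can also be read off the fitted result directly. The sketch below uses a synthetic DataFrame that I made up for illustration (the column names only mimic the tips data), so its values are not the ones in the table for the real dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# synthetic data (made up for illustration), shaped like total_bill / tip
rng = np.random.default_rng(2)
df = pd.DataFrame({'total_bill': rng.uniform(3, 50, size=200)})
df['tip'] = 0.1 * df['total_bill'] + 1.0 + rng.normal(0, 0.5, size=200)

result = ols('tip ~ total_bill', df).fit()

print(result.rsquared)               # share of the variance explained
print(result.pvalues['total_bill'])  # p-value of the slope
print(result.conf_int())             # confidence intervals (95% by default)
```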

  

Now we will do the same for the mpg dataset. When I tried this, I found that for a second-order regression I had to create the squared variable myself. Maybe I am wrong; if I find new information I will update this.

mpg = mpg.dropna()

# the mpg scatter has a second-order trend,
# with this statsmodels method we create the squared term ourselves
mpg['horsepower2'] = mpg['horsepower']**2

# run the regression
# building expression y ~ x**2 + x
curve = ols('mpg ~ horsepower2 + horsepower', mpg)
curve = curve.fit()

curve.params
# output
#Intercept      56.900100
#horsepower2     0.001231
#horsepower     -0.466190
#dtype: float64
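
As a possible alternative to the hand-made squared column, the formula syntax used by statsmodels accepts I(...) to wrap plain arithmetic inside a formula. The sketch below is run on synthetic data that I made up for illustration, not on the real mpg dataset, and cross-checks the result against np.polyfit().

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# synthetic data (made up for illustration): a noisy quadratic
rng = np.random.default_rng(3)
df = pd.DataFrame({'horsepower': rng.uniform(50, 225, size=300)})
df['mpg'] = (0.001 * df['horsepower'] ** 2 - 0.45 * df['horsepower']
             + 56 + rng.normal(0, 2, size=300))

# I(...) tells the formula parser to treat ** as plain arithmetic
result = ols('mpg ~ I(horsepower ** 2) + horsepower', df).fit()

# cross-check against numpy's quadratic fit on the same data
np_coefs = np.polyfit(df['horsepower'].to_numpy(), df['mpg'].to_numpy(), 2)

grid = pd.DataFrame({'horsepower': np.linspace(50, 225, 5)})
sm_pred = result.predict(grid).to_numpy()
np_pred = np.polyval(np_coefs, grid['horsepower'].to_numpy())
print(sm_pred)
print(np_pred)
```

Both fits solve the same least-squares problem, so the two printed rows should agree up to floating-point error.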

 

# our regression expression, with coefficients looked up by label
def f(x):
    return curve.params['horsepower2']*x**2 + curve.params['horsepower']*x + curve.params['Intercept']

fig, ax1 = plt.subplots(1, 1, figsize=(8, 6))

ax1.scatter(mpg['horsepower'], mpg['mpg'])

x = np.linspace(50, 225)
ax1.plot(x, f(x), color='r', linewidth=2)

  

 

curve.summary()

 

 

6. Summary

  • If we don't care about the details, just use seaborn's regplot() to add a trend line.
  • If we also want an expression for the trend, we can use numpy's polyfit() and poly1d(). The use of poly1d() goes beyond this method: it can also build an expression from parameters produced by other libraries.
  • If we want full information about the fit, or are doing formal regression, statsmodels has a good output table.
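
To illustrate the second point, here is a minimal sketch of building a poly1d from coefficients that came from elsewhere. The two numbers are copied from the statsmodels output for the tips data shown earlier; the bill value of 20 is just an example input.

```python
import numpy as np

# coefficients copied from the statsmodels output for the tips data,
# highest power first: slope (total_bill), then Intercept
trend = np.poly1d([0.105025, 0.920270])

print(trend.order)   # degree of the polynomial
print(trend(20.0))   # predicted tip for a bill of 20
```

Note that poly1d expects the highest power first, which is the reverse of the order in which statsmodels lists Intercept and total_bill.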

We will stop here, not only because the article is already too long, but also because I do not yet fully understand scipy and sklearn.

I hope today's topic is well explained and easy enough for everyone to use. If I am wrong about any part, please let me know.

 

 

 

 

 

 

 


Origin www.cnblogs.com/drvongoosewing/p/12496786.html