Tsmoothie: a Python library that smooths time series data and finds outliers

When processing data, we often run into noisy, scattered time series like this:
picture

Scattered data like this can get in the way of clustering and forecasting, so we need to smooth it, as shown in the following image:

picture

If we hide the scattered points and their range, the smoothed result looks like this:

picture

The series is much easier to read now. Smoothed data can also work noticeably better for clustering or forecasting, because smoothing strips out some of the noise and tidies up the distribution of the data.

Building such a smoothing tool ourselves would take a lot of time. There are many smoothing techniques, and you would have to study them one by one, pick the most suitable one, and then write the code. Smoothing techniques include, but are not limited to:

  • Exponential smoothing

  • Convolution smoothing with various window types (Constant, Hanning, Hamming, Bartlett, Blackman)

  • Spectral smoothing with the Fourier transform

  • Polynomial smoothing

  • Spline smoothing of various kinds (linear, cubic, natural cubic)

  • Gaussian smoothing

  • Binner smoothing
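
To give a sense of what writing even one of these by hand involves, here is a minimal sketch of convolution smoothing with a constant window in plain NumPy. The function name `moving_average_smooth` and the choice to pad the edges by repeating the first and last values are illustrative assumptions, not part of any library.

```python
import numpy as np


def moving_average_smooth(series, window_len=5):
    """Smooth a 1-D series with a constant (flat) window of odd length."""
    if window_len < 1 or window_len % 2 == 0:
        raise ValueError("window_len must be a positive odd integer")
    series = np.asarray(series, dtype=float)
    half = window_len // 2
    # Repeat the edge values so the output has the same length as the input.
    padded = np.pad(series, half, mode="edge")
    # A constant window gives every point in the window the same weight.
    kernel = np.ones(window_len) / window_len
    return np.convolve(padded, kernel, mode="valid")


# Toy data: a random walk observed through extra noise.
rng = np.random.default_rng(0)
noisy = np.cumsum(rng.normal(0, 1, 200)) + rng.normal(0, 3, 200)
smooth = moving_average_smooth(noisy, window_len=11)
print(noisy.shape, smooth.shape)
```

Even this simple version needs decisions about window length and edge handling, and every other technique in the list brings its own set of choices.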

Fortunately, some experienced developers have already implemented these time series smoothing techniques for us and open-sourced the code on GitHub: the Tsmoothie module.

1. Preparation

Before you start, make sure that Python and pip are installed on your computer.

Choose one of the following ways to open a command line, then run the install command below:
1. On Windows, open Cmd (Start > Run > CMD).
2. On macOS, open Terminal (press Command+Space and type Terminal).
3. If you use the VS Code editor or PyCharm, you can use the Terminal panel at the bottom of the window.

pip install tsmoothie

Note: Tsmoothie only supports Python 3.6 and above.
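
If you are unsure whether your environment is ready, a quick check along these lines may help. It is a minimal sketch that uses only the standard library; the variable names are illustrative.

```python
import importlib.util
import sys

# Tsmoothie needs Python 3.6 or above.
python_ok = sys.version_info >= (3, 6)

# find_spec returns None when the package cannot be found for import.
tsmoothie_installed = importlib.util.find_spec("tsmoothie") is not None

print("Python {}.{} supported: {}".format(
    sys.version_info.major, sys.version_info.minor, python_ok))
print("tsmoothie installed: {}".format(tsmoothie_installed))
```

If the second line prints False, run the pip command above in the same environment and try again.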

2. Basic usage of Tsmoothie

To try out Tsmoothie, we first generate some random data:

import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.utils_func import sim_randomwalk
from tsmoothie.smoother import LowessSmoother

# Generate 3 random series, each 200 time steps long
np.random.seed(123)
data = sim_randomwalk(n_series=3, timesteps=200,
                      process_noise=10, measure_noise=30)
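
For readers who want to see roughly what kind of data this is, the sketch below builds a similar test set in plain NumPy: a random walk (the cumulative sum of Gaussian steps) observed through additional Gaussian noise. This is an assumed stand-in for illustration only. The helper name `noisy_random_walk` is hypothetical, and it is not claimed to reproduce `sim_randomwalk` exactly.

```python
import numpy as np


def noisy_random_walk(n_series, timesteps, process_noise, measure_noise, seed=None):
    """Return an array of shape (n_series, timesteps) of noisy random walks."""
    rng = np.random.default_rng(seed)
    # Process noise: each step moves the underlying signal a little.
    steps = rng.normal(0.0, process_noise, size=(n_series, timesteps))
    walk = np.cumsum(steps, axis=1)
    # Measurement noise: scatter added on top of the underlying signal.
    noise = rng.normal(0.0, measure_noise, size=(n_series, timesteps))
    return walk + noise


demo = noisy_random_walk(n_series=3, timesteps=200,
                         process_noise=10, measure_noise=30, seed=123)
print(demo.shape)  # (3, 200)
```

The larger `measure_noise` is relative to `process_noise`, the more scattered the points look around the underlying trend.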

Then use Tsmoothie to perform smoothing:

# Smooth the data
smoother = LowessSmoother(smooth_fraction=0.1, iterations=1)
smoother.smooth(data)

The smoothed data is available as smoother.smooth_data:

print(smoother.smooth_data)
# [[ 5.21462928 3.07898076 0.93933646 -1.19847767 -3.32294934
# -5.40678762 -7.42425709 -9.36150892 -11.23591897 -13.05271523
# ....... ....... ....... ....... ....... ]]

Plot the results:

# Compute the lower and upper bounds of the interval
low, up = smoother.get_intervals('prediction_interval')

plt.figure(figsize=(18,5))

for i in range(3):

    plt.subplot(1,3,i+1)
    plt.plot(smoother.smooth_data[i], linewidth=3, color='blue')
    plt.plot(smoother.data[i], '.k')
    plt.title(f"timeseries {i+1}"); plt.xlabel('time')

    plt.fill_between(range(len(smoother.data[i])), low[i], up[i], alpha=0.3)

picture

3. Outlier detection based on Tsmoothie

In fact, the interval generated by the smoother can also be used for outlier detection:

picture

As you can see, the points that fall outside the blue band are the outliers. We can easily mark them in red or record them for later processing.

_low, _up = smoother.get_intervals('sigma_interval', n_sigma=2)
series['low'] = np.hstack([series['low'], _low[:,[-1]]])
series['up'] = np.hstack([series['up'], _up[:,[-1]]])
is_anomaly = np.logical_or(
    series['original'][:,-1] > series['up'][:,-1],
    series['original'][:,-1] < series['low'][:,-1]
).reshape(-1,1)

Call the upper bound of the blue band up and the lower bound low. Any point with data > up or data < low is flagged as an anomaly.
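
The rule itself can be sketched in plain NumPy. The band below is the smoothed curve plus or minus `n_sigma` standard deviations of the residuals. That is an assumption about how a sigma interval is commonly built, not a copy of Tsmoothie's internals, and the helper name `flag_outliers` is hypothetical.

```python
import numpy as np


def flag_outliers(data, smooth, n_sigma=2):
    """Flag points outside smooth +/- n_sigma * std(residuals), per series."""
    data = np.asarray(data, dtype=float)
    smooth = np.asarray(smooth, dtype=float)
    # One residual standard deviation per series (last axis is time).
    resid_std = np.std(data - smooth, axis=-1, keepdims=True)
    low = smooth - n_sigma * resid_std
    up = smooth + n_sigma * resid_std
    return np.logical_or(data > up, data < low), low, up


# Toy series: an almost flat signal with one obvious spike at index 5.
data = np.array([[0.0, 0.1, -0.1, 0.0, 0.1, 8.0, -0.1, 0.0, 0.1, 0.0]])
smooth = np.zeros_like(data)  # stand-in for a smoother's output

is_anomaly, low, up = flag_outliers(data, smooth, n_sigma=2)
print(np.flatnonzero(is_anomaly[0]))  # [5]
```

A larger `n_sigma` widens the band and flags fewer points; a smaller one makes the detector more sensitive.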

The GIF above can be produced with the following code, which smooths the data and detects anomalies over a rolling window as new data points arrive.

# Origin: https://github.com/cerlymarco/MEDIUM_NoteBook/blob/master/Anomaly_Detection_RealTime/Anomaly_Detection_RealTime.ipynb
import numpy as np
import matplotlib.pyplot as plt
from celluloid import Camera
from collections import defaultdict
from functools import partial
from tqdm import tqdm

from tsmoothie.utils_func import sim_randomwalk, sim_seasonal_data
from tsmoothie.smoother import ConvolutionSmoother


def plot_history(ax, i, is_anomaly, window_len, color='blue', **pltargs):
    
    posrange = np.arange(0,i)
    
    ax.fill_between(posrange[window_len:],
                    pltargs['low'][1:], pltargs['up'][1:],
                    color=color, alpha=0.2)
    if is_anomaly:
        ax.scatter(i-1, pltargs['original'][-1], c='red')
    else:
        ax.scatter(i-1, pltargs['original'][-1], c='black')
    ax.scatter(i-1, pltargs['smooth'][-1], c=color)
    
    ax.plot(posrange, pltargs['original'][1:], '.k')
    ax.plot(posrange[window_len:],
            pltargs['smooth'][1:], color=color, linewidth=3)
    
    if 'ano_id' in pltargs.keys():
        if pltargs['ano_id'].sum()>0:
            not_zeros = pltargs['ano_id'][pltargs['ano_id']!=0] -1
            ax.scatter(not_zeros, pltargs['original'][1:][not_zeros],
                       c='red', alpha=1.)

np.random.seed(42)

n_series, timesteps = 3, 200

data = sim_randomwalk(n_series=n_series, timesteps=timesteps,
                      process_noise=10, measure_noise=30)

window_len = 20

fig = plt.figure(figsize=(18,10))
camera = Camera(fig)

axes = [plt.subplot(n_series,1,ax+1) for ax in range(n_series)]
series = defaultdict(partial(np.ndarray, shape=(n_series,1), dtype='float32'))

for i in tqdm(range(timesteps+1), total=(timesteps+1)):
    
    if i>window_len:
    
        smoother = ConvolutionSmoother(window_len=window_len, window_type='ones')
        smoother.smooth(series['original'][:,-window_len:])

        series['smooth'] = np.hstack([series['smooth'], smoother.smooth_data[:,[-1]]])

        _low, _up = smoother.get_intervals('sigma_interval', n_sigma=2)
        series['low'] = np.hstack([series['low'], _low[:,[-1]]])
        series['up'] = np.hstack([series['up'], _up[:,[-1]]])

        is_anomaly = np.logical_or(
            series['original'][:,-1] > series['up'][:,-1],
            series['original'][:,-1] < series['low'][:,-1]
        ).reshape(-1,1)
        
        if is_anomaly.any():
            series['ano_id'] = np.hstack([series['ano_id'], is_anomaly*i]).astype(int)
            
        for s in range(n_series):
            pltargs = {k: v[s,:] for k,v in series.items()}
            plot_history(axes[s], i, is_anomaly[s], window_len,
                         **pltargs)

        camera.snap()
        
    if i>=timesteps:
        continue
    
    series['original'] = np.hstack([series['original'], data[:,[i]]])

    
print('CREATING GIF...') # it may take a few seconds
camera._photos = [camera._photos[-1]] + camera._photos
animation = camera.animate()
animation.save('animation1.gif', codec="gif", writer='imagemagick')
plt.close(fig)
print('DONE')
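
Without the plotting, the core of a rolling-window detector fits in a few lines. The sketch below is a simplified stand-in, not the notebook's code: it uses the mean and standard deviation of the most recent `window_len` points in place of `ConvolutionSmoother` and its sigma interval, and the function name `rolling_anomalies` is hypothetical.

```python
import numpy as np


def rolling_anomalies(series, window_len=20, n_sigma=2):
    """Return indices whose value leaves mean +/- n_sigma * std of its window."""
    series = np.asarray(series, dtype=float)
    flagged = []
    for i in range(window_len, len(series) + 1):
        # The window ends at the newest point, series[i - 1].
        window = series[i - window_len:i]
        center = window.mean()
        spread = window.std()
        newest = window[-1]
        if newest > center + n_sigma * spread or newest < center - n_sigma * spread:
            flagged.append(i - 1)
    return flagged


rng = np.random.default_rng(42)
series = rng.normal(0, 1, 200)
series[120] += 25  # inject one large spike

hits = rolling_anomalies(series)
print(120 in hits)  # True
```

Because each decision uses only the last `window_len` points, the same loop can run on data that arrives one point at a time.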

Note that not all outliers are bad. They can mean different things in different application scenarios.

In stock data, for example, an outlier may signal a trend reversal in a volatile market.

Or, in the analysis of household electricity consumption, an outlier may mark a consumption peak at a particular moment, from which we can work out what kind of appliances were switched on at that time.

Therefore, the role of outliers has to be analyzed in the context of each application in order to find their real value.

All in all, Tsmoothie can smooth our time series data with a variety of techniques, which can make model training more effective, and it can also find outliers in the data based on the smoothing results. It is a valuable helper for data analysis and research.

That is the end of this article. If you enjoyed today's practical Python tutorial, please keep following.

Origin blog.csdn.net/qq_34160248/article/details/124225252