When processing data, we often encounter noisy, scattered time series data:
Such scatter often gets in the way of clustering and forecasting, so we need to smooth the data, as shown in the following image:
If we remove the scattered points and the range, the smoothed result looks like this:
Doesn't this time series look much cleaner? In addition, clustering or forecasting on smoothed time series data can work noticeably better, because smoothing removes some of the noise and makes the underlying shape of the data clearer.
Developing such a smoothing tool ourselves would take a lot of time. There are many smoothing techniques, and you would need to study them one by one, pick the most suitable one, and then write the code. Smoothing techniques include, but are not limited to:
- Exponential smoothing
- Convolution smoothing with various window types (constant, Hanning, Hamming, Bartlett, Blackman)
- Spectral smoothing with the Fourier transform
- Polynomial smoothing
- Spline smoothing of various kinds (linear, cubic, natural cubic)
- Gaussian smoothing
- Binner smoothing
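To make two of the techniques listed above more concrete, here is a minimal NumPy sketch of exponential smoothing and of convolution smoothing with a constant window. The function names `exponential_smooth` and `moving_average` are our own illustrative names, and the code only shows the basic idea; it is not how any particular library implements these methods.

```python
import numpy as np

def exponential_smooth(x, alpha=0.3):
    """Simple exponential smoothing: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    x = np.asarray(x, dtype=float)
    s = np.empty_like(x)
    s[0] = x[0]  # start from the first observation
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

def moving_average(x, window_len=5):
    """Convolution smoothing with a constant (all-ones) window."""
    x = np.asarray(x, dtype=float)
    kernel = np.ones(window_len) / window_len
    # 'valid' keeps only positions where the window fully overlaps the data
    return np.convolve(x, kernel, mode="valid")

# Illustrative noisy series: a random walk plus measurement noise
rng = np.random.default_rng(0)
noisy = np.cumsum(rng.normal(0, 1, 200)) + rng.normal(0, 3, 200)

exp_smoothed = exponential_smooth(noisy, alpha=0.2)
ma_smoothed = moving_average(noisy, window_len=10)
print(exp_smoothed.shape, ma_smoothed.shape)
```

A smaller `alpha` or a larger `window_len` gives a smoother curve, at the cost of reacting more slowly to real changes in the data.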
Fortunately, these time series smoothing techniques have already been implemented for us and open-sourced on GitHub as the Tsmoothie module.
1. Preparation
Before you start, make sure that Python and pip are installed on your computer.
Open a terminal in one of the following ways, then enter the command below to install the dependency:
1. On Windows, open Cmd (Start, Run, type CMD).
2. On macOS, open Terminal (press Command+Space and type Terminal).
3. If you use the VSCode editor or PyCharm, you can use the Terminal panel at the bottom of the window.
pip install tsmoothie
Note: Tsmoothie only supports Python 3.6 and above.
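If you want to confirm the setup before moving on, a quick check along the following lines may help. This is only a sketch: `check_environment` is a hypothetical helper name of our own, and it uses nothing but the Python standard library to test the interpreter version and whether the package can be found.

```python
import sys
import importlib.util

def check_environment(min_version=(3, 6), package="tsmoothie"):
    """Return (python_ok, package_installed) without importing the package."""
    python_ok = sys.version_info >= min_version
    # find_spec returns None when the package cannot be located
    package_installed = importlib.util.find_spec(package) is not None
    return python_ok, package_installed

python_ok, installed = check_environment()
print(f"Python >= 3.6: {python_ok}; tsmoothie installed: {installed}")
```

If the second value is False, run the pip command again in the same environment that your editor or terminal uses.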
2. Basic use of Tsmoothie
To try out Tsmoothie, we first generate some random data:
import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.utils_func import sim_randomwalk
from tsmoothie.smoother import LowessSmoother
# Generate 3 random series, each 200 timesteps long
np.random.seed(123)
data = sim_randomwalk(n_series=3, timesteps=200,
                      process_noise=10, measure_noise=30)
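To get a feel for what such test data looks like, here is a NumPy-only sketch of a random walk observed with measurement noise. It is an assumption about the general idea behind the two noise parameters, not a copy of Tsmoothie's `sim_randomwalk`, and `random_walk_sketch` is a hypothetical name used only for illustration.

```python
import numpy as np

def random_walk_sketch(n_series, timesteps, process_noise, measure_noise, seed=None):
    """Simulate noisy random walks (illustrative stand-in, not Tsmoothie's code)."""
    rng = np.random.default_rng(seed)
    # Process noise: each step moves the hidden state by a random amount
    steps = rng.normal(0.0, process_noise, size=(n_series, timesteps))
    state = np.cumsum(steps, axis=1)
    # Measurement noise: what we observe is the state plus independent noise
    observed = state + rng.normal(0.0, measure_noise, size=(n_series, timesteps))
    return observed

sketch = random_walk_sketch(n_series=3, timesteps=200,
                            process_noise=10, measure_noise=30, seed=123)
print(sketch.shape)  # (3, 200): one row per series, one column per timestep
```

The larger `measure_noise` is relative to `process_noise`, the more scattered the points look around the underlying trend, which is exactly the situation where smoothing helps.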
Then use Tsmoothie to perform smoothing:
# Smooth the data
smoother = LowessSmoother(smooth_fraction=0.1, iterations=1)
smoother.smooth(data)
The smoothed data is then available as smoother.smooth_data:
print(smoother.smooth_data)
# [[ 5.21462928 3.07898076 0.93933646 -1.19847767 -3.32294934
# -5.40678762 -7.42425709 -9.36150892 -11.23591897 -13.05271523
# ....... ....... ....... ....... ....... ]]
Plot the results:
# Compute the lower and upper bounds of the prediction interval
low, up = smoother.get_intervals('prediction_interval')
plt.figure(figsize=(18,5))
for i in range(3):
    plt.subplot(1,3,i+1)
    plt.plot(smoother.smooth_data[i], linewidth=3, color='blue')
    plt.plot(smoother.data[i], '.k')
    plt.title(f"timeseries {i+1}"); plt.xlabel('time')
    plt.fill_between(range(len(smoother.data[i])), low[i], up[i], alpha=0.3)
plt.show()
3. Outlier detection based on Tsmoothie
Using the range generated by the smoother, we can also perform outlier detection:
The points that fall outside the blue range are the outliers. We can easily mark them in red or record them for later processing.
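# Excerpt from the full script shown later; `series` holds the running history of each signal
_low, _up = smoother.get_intervals('sigma_interval', n_sigma=2)
series['low'] = np.hstack([series['low'], _low[:,[-1]]])
series['up'] = np.hstack([series['up'], _up[:,[-1]]])
# The latest point is an anomaly if it lies above the upper bound or below the lower bound
is_anomaly = np.logical_or(
    series['original'][:,-1] > series['up'][:,-1],
    series['original'][:,-1] < series['low'][:,-1]
).reshape(-1,1)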
If we call the upper bound of the blue range up and the lower bound low, then any data point with data > up or data < low is treated as an abnormal point.
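The rule just described can be written as a small standalone function. This is a minimal sketch with made-up numbers; `flag_outliers` is a hypothetical helper name, and the bounds here are hand-written rather than produced by a smoother.

```python
import numpy as np

def flag_outliers(data, low, up):
    """Return a boolean mask that is True where data > up or data < low."""
    data, low, up = (np.asarray(a, dtype=float) for a in (data, low, up))
    return np.logical_or(data > up, data < low)

# Made-up values and bounds, purely for illustration
values = np.array([1.0, 2.0, 9.0, 2.5, -4.0])
low = np.array([0.0, 0.5, 1.0, 1.0, 0.0])
up = np.array([3.0, 4.0, 5.0, 5.0, 4.0])

mask = flag_outliers(values, low, up)
print(np.flatnonzero(mask))  # positions of the flagged points: [2 4]
```

In practice, low and up would come from the smoother's interval for the same timesteps, so that each point is compared against its own bounds.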
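The GIF above, which smooths the data and detects anomalies while scrolling through the data points, can be generated and saved with the following code. Besides Tsmoothie, the code imports the celluloid and tqdm packages, and saving with writer='imagemagick' requires ImageMagick to be installed.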
# Origin: https://github.com/cerlymarco/MEDIUM_NoteBook/blob/master/Anomaly_Detection_RealTime/Anomaly_Detection_RealTime.ipynb
import numpy as np
import matplotlib.pyplot as plt
from celluloid import Camera
from collections import defaultdict
from functools import partial
from tqdm import tqdm
from tsmoothie.utils_func import sim_randomwalk, sim_seasonal_data
from tsmoothie.smoother import *
def plot_history(ax, i, is_anomaly, window_len, color='blue', **pltargs):
    posrange = np.arange(0,i)
    # Shaded band between the lower and upper bounds
    ax.fill_between(posrange[window_len:],
                    pltargs['low'][1:], pltargs['up'][1:],
                    color=color, alpha=0.2)
    # Latest point: red if it is an anomaly, black otherwise
    if is_anomaly:
        ax.scatter(i-1, pltargs['original'][-1], c='red')
    else:
        ax.scatter(i-1, pltargs['original'][-1], c='black')
    ax.scatter(i-1, pltargs['smooth'][-1], c=color)
    ax.plot(posrange, pltargs['original'][1:], '.k')
    ax.plot(posrange[window_len:],
            pltargs['smooth'][1:], color=color, linewidth=3)
    # Re-draw previously detected anomalies in red
    if 'ano_id' in pltargs.keys():
        if pltargs['ano_id'].sum()>0:
            not_zeros = pltargs['ano_id'][pltargs['ano_id']!=0] -1
            ax.scatter(not_zeros, pltargs['original'][1:][not_zeros],
                       c='red', alpha=1.)
np.random.seed(42)
n_series, timesteps = 3, 200
data = sim_randomwalk(n_series=n_series, timesteps=timesteps,
                      process_noise=10, measure_noise=30)
window_len = 20
fig = plt.figure(figsize=(18,10))
camera = Camera(fig)
axes = [plt.subplot(n_series,1,ax+1) for ax in range(n_series)]
series = defaultdict(partial(np.ndarray, shape=(n_series,1), dtype='float32'))
for i in tqdm(range(timesteps+1), total=(timesteps+1)):
    if i>window_len:
        # Smooth only the most recent window of observations
        smoother = ConvolutionSmoother(window_len=window_len, window_type='ones')
        smoother.smooth(series['original'][:,-window_len:])
        series['smooth'] = np.hstack([series['smooth'], smoother.smooth_data[:,[-1]]])
        _low, _up = smoother.get_intervals('sigma_interval', n_sigma=2)
        series['low'] = np.hstack([series['low'], _low[:,[-1]]])
        series['up'] = np.hstack([series['up'], _up[:,[-1]]])
        # Flag the latest point if it falls outside the interval
        is_anomaly = np.logical_or(
            series['original'][:,-1] > series['up'][:,-1],
            series['original'][:,-1] < series['low'][:,-1]
        ).reshape(-1,1)
        if is_anomaly.any():
            series['ano_id'] = np.hstack([series['ano_id'], is_anomaly*i]).astype(int)
        for s in range(n_series):
            pltargs = {k:v[s,:] for k,v in series.items()}
            plot_history(axes[s], i, is_anomaly[s], window_len,
                         **pltargs)
        camera.snap()
    if i>=timesteps:
        continue
    # Append the next observation of every series
    series['original'] = np.hstack([series['original'], data[:,[i]]])
print('CREATING GIF...') # it may take a few seconds
camera._photos = [camera._photos[-1]] + camera._photos
animation = camera.animate()
animation.save('animation1.gif', codec="gif", writer='imagemagick')
plt.close(fig)
print('DONE')
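If you only need the detected positions and not the animation, the same rolling idea can be sketched without Tsmoothie or plotting. The code below is a simplified stand-in and an assumption on our part: it uses the window mean as the smoothed value and the window standard deviation as the band width, which is not necessarily identical to what ConvolutionSmoother and 'sigma_interval' compute. `rolling_sigma_anomalies` is a hypothetical name.

```python
import numpy as np

def rolling_sigma_anomalies(x, window_len=20, n_sigma=2):
    """Flag index t when x[t] lies outside mean +/- n_sigma * std of the window ending at t."""
    x = np.asarray(x, dtype=float)
    flagged = []
    for t in range(window_len - 1, len(x)):
        window = x[t - window_len + 1 : t + 1]  # last window_len points, including x[t]
        center = window.mean()                  # stand-in for the smoothed value
        spread = window.std()                   # stand-in for the interval width
        if abs(x[t] - center) > n_sigma * spread:
            flagged.append(t)
    return flagged

# Toy signal alternating between 1 and 0, with one injected spike at index 60
signal = np.zeros(100)
signal[::2] = 1.0
signal[60] = 25.0

print(rolling_sigma_anomalies(signal))  # [60]
```

Note that right after a spike the window's spread grows, so nearby points are less likely to be flagged until the spike leaves the window; the same effect applies to any rolling detector of this kind.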
Note that not all outliers are harmful; they can mean different things in different application scenarios.
In stocks, for example, an outlier may signal a trend reversal in a volatile market.
In the analysis of household electricity consumption, it may mark a consumption peak at a certain moment, from which we can infer what kinds of appliances were switched on at that time.
Therefore, the role of outliers needs to be analyzed according to the application scenario in order to find their true value.
All in all, Tsmoothie can not only smooth our time series data with a variety of techniques, which makes model training more effective, but can also find outliers in the data based on the smoothing results. It is a valuable helper for data analysis and research.
That is all for this article. If you enjoyed today's practical Python tutorial, please keep following us.