[Data Mining] Time Series Tutorial [1]

Chapter 1. Description

        The study of time series can be traced back to the late 19th and early 20th centuries, when many scholars began to study time-dependent economic and social phenomena in an attempt to discover their laws and trends. Early time series research is often traced to the work of economists such as Maurice Allais and James Clark. With the continuing development and application of time series analysis methods, time series research has gradually become an important research direction in statistics, economics, finance, engineering, and other fields.

Chapter 2. Structure of Temporal Data

Time series data are:

  • Observations or measurements indexed by time

  • Instead of \(X_i\), we denote the observations by \(X_t\)

Why does this make things different?

  • Time indexes have a special ordering.

  • Data measured over time are not exchangeable, which is what we usually assume for indexed data.

  • Time itself can also carry special meaning, acting as a proxy for other unobserved variables.

        To be clear, a key property that sets time series data apart from other commonly analyzed data types is that we cannot randomly permute the indices and still model the data with the same distribution. The data are ordered. Furthermore, strong assumptions of independence between observations often do not apply.

        An interesting, and potentially disturbing, feature of time series data is that the data in their raw form provide very little usable information. In a sense, raw data is the most useless form of data. Plotting or summarizing raw data often doesn't provide much insight into what is happening or why. However, because time indices have such special significance, we can use them to decompose time series data into variation occurring on different time scales. Formal methods of time-scale analysis are sometimes called Fourier analysis or spectral analysis, but there are informal methods that are also useful.
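
        As an informal illustration, one can split a series into a slow-moving component and a fast-moving remainder. Below is a minimal sketch in R, using the built-in monthly co2 series as a stand-in and an arbitrary 25-month moving average as the smoother:

# Split a monthly series into a slow-moving trend and a fast residual
x     <- as.numeric(co2)                              # built-in monthly CO2 series
trend <- stats::filter(x, rep(1/25, 25), sides = 2)   # 25-month centered moving average
fast  <- x - trend                                    # what remains varies on shorter timescales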

        Another way to think about time series data is that a time series actually represents a mixture of series that vary on different time scales. Part of the job of analyzing time series data is to:

  1. Pick apart the mixture of timescales and describe how they differ

  2. Determine the timescale of interest based on empirical properties or the scientific question at hand

2.1 Example: Air Pollution and Health

        For example, we might be interested in studying how long-term exposure to ambient air pollution affects life expectancy. Some studies have shown that living one's entire life in a more polluted city can reduce life expectancy by as much as six months compared to living in a cleaner city. When thinking about how to approach this problem and how to analyze the data, we are primarily interested in comparing long-term average pollution levels between cities, perhaps over decades. We are unlikely to care how high pollution levels are on a given day, or even in a given month.

        On the other hand, many studies have shown that short-term spikes in air pollution increase the number of deaths and hospitalizations from cardiovascular and respiratory diseases in a city. In this case, we might be interested in comparing day-to-day changes in air pollution with day-to-day changes in hospitalizations or mortality. Overall long-term average pollution levels are of little interest.

        Consider the following time series plot of particulate matter (PM10) data for Detroit, Michigan over the period 1987–1999.

        One might ask a seemingly simple question: Did Detroit's air pollution improve between 1987 and 1999? Pollution levels did decrease slightly overall during this time, but extreme values continued to occur.

Indeed, when we look at the results of a fitted simple linear regression model, we see that the coefficient for the slope is negative.
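
A minimal sketch of how such a fit might be produced, assuming a hypothetical data frame detroit with a Date-class date column and daily PM10 measurements in a pm10 column:

library(broom)

# Regress daily PM10 on the date to estimate a long-term linear trend;
# lm() treats the Date as a numeric day count, so the slope is per day
fit_trend <- lm(pm10 ~ date, data = detroit)
tidy(fit_trend)  # coefficient table in the form shown below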

# A tibble: 2 x 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept) 48.4      1.67         28.9  9.59e-170
2 date        -0.00157  0.000184     -8.54 1.77e- 17

However, when looking at the graph above, it's hard not to notice the extreme spikes that occur on a regular basis. Even a cursory reading of the graph shows a number of days on which PM10 levels exceeded 100 \(\mu\)g/m\(^3\). So Detroit's PM10 has declined over time, but high levels are still experienced on certain days. Has the situation improved?

The answer, of course, has to do with the timescales on which we consider the data. On long-term timescales, things appear to be decreasing, so the trend is downward. However, on short-term timescales, we still see large spikes. There isn't one answer; the answer depends on the time scale.

From a policy perspective, we can employ different strategies to influence air pollution on long-term and short-term timescales. To change pollution levels in the long term, we might try to shift local economies from fossil fuel-based energy sources to more renewable, less polluting sources. Such a plan could have a significant impact, but it would take a long time to implement. In response to short-term fluctuations in pollution, we might implement policies such as traffic bans or targeted source-based interventions to mitigate short-term peaks.

Now suppose we wanted to see whether there is any association between death rates in Detroit and air pollution. We can make a simple scatterplot to look for such an association.
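
A sketch of such a plot, reusing the hypothetical detroit data frame and additionally assuming a death column of daily death counts:

library(ggplot2)

# Daily deaths plotted against same-day PM10
ggplot(detroit, aes(x = pm10, y = death)) +
  geom_point() +
  labs(x = "Daily PM10", y = "Daily deaths")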

Now, this scatterplot is what we might make if we didn't have time series data. However, since we do have time series data, we should immediately start thinking about variation on different time scales. Are we concerned with the long-term association between pollution and mortality, or the short-term association?

The overall association shown in the graph above can be quantified with a simple linear regression model.
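
Continuing the sketch, this kind of output might be produced as follows:

library(broom)

# Daily mortality regressed on same-day PM10
fit_daily <- lm(death ~ pm10, data = detroit)
tidy(fit_daily)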

# A tibble: 2 x 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)  46.0      0.226      204.   0         
2 pm10          0.0275   0.00564      4.88 0.00000108

There appears to be a positive association, suggesting that increased levels of air pollution are associated with increased mortality. But can we dig deeper to gain more insight?

Let's calculate the annual average of PM10, sum the deaths within each year, and make a scatterplot of these annual summary statistics.
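
One way to build those annual summaries, continuing with the hypothetical detroit data frame:

library(dplyr)
library(lubridate)
library(ggplot2)

annual <- detroit %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  summarize(pm10  = mean(pm10, na.rm = TRUE),  # annual average PM10
            death = sum(death))                # annual death total

# Scatterplot of the annual summary statistics
ggplot(annual, aes(x = pm10, y = death)) +
  geom_point()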

As we can see from this plot, the association seems quite strong (with only 13 data points, of course). When we fit a linear model to these annual data, we get the following.
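
Continuing the sketch, the annual-level model might be fit as:

library(broom)

fit_annual <- lm(death ~ pm10, data = annual)  # 13 annual observations
tidy(fit_annual)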

# A tibble: 2 x 5
  term        estimate std.error statistic       p.value
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)   10388.     987.      10.5  0.00000000725
2 pm10            190.      29.4      6.47 0.00000579   

A one-unit increase in annual mean PM10 from one year to the next was associated with an increase of about 190 deaths.

We can now compute the deviation of daily PM10 from its annual average and see how this deviation correlates with daily mortality.
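
A sketch of that computation, continuing with the hypothetical objects above (the name pm10dev matches the term in the output below):

library(dplyr)
library(lubridate)
library(broom)

# Deviation of each day's PM10 from that year's average
detroit_dev <- detroit %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  mutate(pm10dev = pm10 - mean(pm10, na.rm = TRUE)) %>%
  ungroup()

fit_dev <- lm(death ~ pm10dev, data = detroit_dev)
tidy(fit_dev)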

# A tibble: 2 x 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  46.9      0.116      404.    0     
2 pm10dev       0.0142   0.00574      2.47  0.0136

The association here is much smaller, but of course we are now only looking at daily changes in PM10, not annual changes. We would not expect a one-unit change in PM10 from one day to the next to be associated with a large number of deaths.

One might wonder: which estimate is correct? Is it the association between daily average PM10 and mortality, or the association between annual average PM10 and mortality? The answer is that both are "correct", but each answers a different question. The daily-average association focuses on short-term changes and can be interpreted as representing the "acute" effects of pollution, while the annual-average association may reflect the "chronic" effects of air pollution levels.

Another question to consider when looking at associations on different time scales is: what confounding factors are present on that time scale? When looking at year-to-year changes in PM10, there may be many confounding factors that also vary from year to year along with PM10 and mortality. Those same smoothly varying confounders may not be of concern when looking at day-to-day changes in PM10; however, other confounders may need to be considered on the daily time scale.

2.2 Fixed and Random Variation

Most time series books tend to treat time series as consisting of random phenomena only, rather than a mixture of fixed and random phenomena. As a result, modeling often focuses on the stochastic aspects of time series models. However, many real-world time series are composed of what we might think of as both fixed and random variation.

  • Temperature data have non-"random" diurnal and seasonal components

  • Air pollution data could have day-of-week effects based on traffic or commuting patterns

While it is sometimes easy to dismiss everything as random, this is often a crutch for our lack of observation of the true underlying phenomenon. Also, when something is in fact fixed, treating it as random can lead to violations of the stationarity assumptions we usually make (see below).

Depending on the nature of the application, it may make sense to model the same phenomenon as either fixed or random. In other words, it depends.

  • In biomedical and public health applications, we typically deal with fully observed datasets and try to explain "what happened?"

  • We are describing the past, perhaps making inferences about the future

  • In financial or control systems applications, we might make predictions about future events based on the past. Things that seemed fixed in past data may change in the future, so we may want to allow the model to adapt to unknown future patterns.

Consider the following plot of average daily temperatures in Baltimore, Maryland, for 1990–1992. As one would expect of temperature data, there is a strong seasonal pattern, with peaks in summer and troughs in winter.

Now, is this so-called seasonal pattern fixed or random? History tells us that seasonal patterns are fairly predictable. We don't usually believe that summers can be freezing and winters can reach 90 degrees (F).

A more formal way of discussing this might be with a model. Let \(y_t\), \(t = 1, 2, \dots\), denote the daily temperature values in Baltimore, and consider the model

\[ y_t = \mu + \varepsilon_t, \]

where \(\varepsilon_t\) is the random deviation of the observed value \(y_t\) from the expected value \(\mu\). Without any computer aid, we might look at the graph above and estimate \(\mu\) to be about 50–55 degrees. But now, suppose your job is to predict the value of \(\varepsilon_t\) for any value of \(t\). Clearly, if \(t\) falls in the middle of the year, it is likely that \(\varepsilon_t > 0\), and if \(t\) falls near the beginning or end of the year, it is likely that \(\varepsilon_t < 0\). Therefore, knowing the value of \(t\) alone gives us important information about the deviation \(\varepsilon_t\). In other words, there is a fixed seasonal effect embedded in the sequence \(\varepsilon_t\), which we may have difficulty viewing as "random".

But now consider the following model.

\[ y_t = y_{t-1} + \varepsilon_t. \]

This model expresses the value \(y_t\) as a deviation from the value at time \(t-1\): today's value is equal to yesterday's value plus a small offset. Now, suppose your job is to predict the value of \(\varepsilon_t\). That is harder, right? If I know it was 70 degrees yesterday, am I sure it will be warmer than 70 degrees today? Or colder? If I know it was 20 degrees yesterday, am I sure it will be warmer or colder today? In this model, the deviation \(\varepsilon_t\) appears much more "random", or less predictable. There is no hard and fast rule that today's temperature will always be warmer (or colder) than yesterday's.
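
A small simulation may make the contrast concrete; this is purely illustrative, with all parameter values made up:

set.seed(1)
n <- 3 * 365           # three years of daily data
t <- seq_len(n)

# Model 1: fixed seasonal mean plus random deviations
mu_t   <- 52 + 20 * sin(2 * pi * t / 365)  # made-up seasonal mean (degrees F)
y_seas <- mu_t + rnorm(n, sd = 5)

# Model 2: random walk -- today's value is yesterday's value plus noise
y_rw <- 52 + cumsum(rnorm(n, sd = 5))

# In Model 1 the sign of the deviation from the overall mean is largely
# predictable from t alone; in Model 2 the next step is not predictable.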

Now consider the rather different time series below, which shows the weighted median transaction price for an exchange-traded fund (ETF) with ticker symbol SPY. The fund tracks the S&P 500 index of U.S. stocks. Note that the time scale on the x-axis is in microseconds.

Compared to the temperature time series, this plot looks far less regular, and there are no readily identifiable patterns. Furthermore, at the microsecond level, we may not be very familiar with whatever fixed patterns such stock prices might have. It is possible, though, that experienced traders know of patterns that reliably occur within a window of a few hundred thousand microseconds at a given time of day.

However, in finance there is a theory known as the efficient market hypothesis which says that such fixed patterns should not exist. If such a fixed pattern existed, it would represent an arbitrage opportunity, that is, an opportunity to make money without risk. For example, in the graph above, we could buy the stock at around 20,000 microseconds and sell it at around 50,000 microseconds for an easy profit. If this pattern occurred every day, we could tell our broker to execute this trade daily for a small profit. However, as word of the pattern leaked into the market, more and more people would start buying and selling at the same times as us. This would push the price up at the time of buying and down at the time of selling, and eventually the profit opportunity would disappear.

The efficient market hypothesis suggests that such fixed patterns are highly unlikely. Therefore, it may make more sense to model such data as random rather than fixed, which in turn suggests different modeling strategies and different types of models. We will not discuss these types of models in detail here.

2.3 Objectives of Time Series Analysis

What do people expect from time series analysis? What questions are being answered?

2.3.1 Forecasting

Given the past and present, what will the future look like (and with what uncertainty)?

  • Given the past 10 years of quarterly EPS, what will be Apple's EPS for the next quarter?

  • Given the average global temperature for the past 200 years, what will be the average global temperature for the next 100 years?

2.3.2 Filtering

How should I update my estimate of the true state of nature given past and present observations?

  • Given my current estimates of the position and velocity of the spacecraft, how should I update my estimates of position and velocity based on new gyroscope and radar measurements?

  • Given the history of monthly U.S. unemployment data and my estimate of current unemployment levels, how should I revise my estimate based on the most recent data released by the Bureau of Labor Statistics?

  • Given the history of endowment returns, current year returns, and the need to spend a target percentage of endowment value each year, how much should the university spend from the endowment in the next fiscal year?

2.3.3 Time Scale Analysis

Given a set of observed data, which timescales of variation dominate or explain most of the temporal variation in the data?

  • Are there strong seasonal cycles in temperature observations in Baltimore, MD?

  • Is the association between ambient air pollution and mortality driven primarily by large annual changes in pollution levels or by short-term peaks?

2.3.4 Regression Modeling

Given the time series of two phenomena, what is the relationship between them?

  • What is the relationship between daily air pollution levels and daily counts of cardiac hospitalizations?

  • What is the lag (in months) between a change in a country's unemployment rate and a change in its GDP?

  • What is the cumulative number of excess deaths that occurred in the two weeks following a major hurricane?

2.3.5 Smoothing

Given a complete (noisy) dataset, can I infer the true state of nature in the past?

  • Given a noisy measurement signal, can I reconstruct the real signal from the data?

  • Now that my spacecraft has orbited the Moon, what was its closest distance to the Moon?

  • (to be continued)
