There are many definitions of time series data, and they all express the same idea in different ways. A simple definition is that time series data consists of data points attached to sequential points in time.
The source of time series data is periodic measurement or observation. Time series data exists in many industries. To give a few examples:
- Stock price over a period of time
- Daily, weekly, and monthly sales
- Periodic measurements in a process
- Electricity or natural gas consumption rate over a period of time
In this article, I will list 20 points to help you fully understand how to process time series data with Pandas.
1. Different forms of time series data
Time series data can be in the form of a specific date, duration or fixed custom interval.
A timestamp can refer to a whole day or to a particular second of a given date, depending on the precision. For example, '2020-01-01 14:59:30' is a timestamp with second precision.
2. Time series data structure
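Pandas provides flexible and efficient data structures for handling time series data: Timestamp (collected in a DatetimeIndex) for points in time, Timedelta (TimedeltaIndex) for durations, and Period (PeriodIndex) for time spans.
In addition to these three structures, Pandas also supports the concept of a date offset, which is a relative time duration that respects calendar arithmetic.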
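As a rough sketch of these concepts, the snippet below builds one timestamp, one period, one time delta and one date offset. The specific dates and durations are arbitrary values chosen for illustration.

```python
import pandas as pd

# A single point in time
ts = pd.Timestamp('2020-01-01 14:59:30')

# A span of time (here, the whole month of January 2020)
period = pd.Period('2020-01', freq='M')

# An absolute duration
delta = pd.Timedelta(days=1, hours=12)

# A calendar-aware relative duration (date offset)
offset = pd.DateOffset(months=1)

print(ts + delta)          # 2020-01-03 02:59:30
print(period.start_time)   # 2020-01-01 00:00:00
print(ts + offset)         # 2020-02-01 14:59:30
```

Note the difference between the last two additions: the time delta always adds exactly 36 hours, while the date offset moves to the same day and time in the next calendar month.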
3. Create a timestamp
The most basic time series data structure is a timestamp, which can be created with the to_datetime or Timestamp function:
import pandas as pd
pd.to_datetime('2020-9-13')
Timestamp('2020-09-13 00:00:00')
pd.Timestamp('2020-9-13')
Timestamp('2020-09-13 00:00:00')
4. Access information saved by timestamp
We can get information about the day, month and year stored in the timestamp.
a = pd.Timestamp('2020-9-13')
a.day_name()
'Sunday'
a.month_name()
'September'
a.day
13
a.month
9
a.year
2020
5. Accessing hidden information
The timestamp object also stores information related to calendar arithmetic. For example, we can ask whether the year is a leap year. Here is some of the more specific information we can get:
b = pd.Timestamp('2020-9-30')
b.is_month_end
True
b.is_leap_year
True
b.is_quarter_start
False
b.weekofyear
40
6. European style dates
We can use the to_datetime function to handle European-style dates (that is, date first). The dayfirst parameter is set to True.
pd.to_datetime('10-9-2020', dayfirst=True)
Timestamp('2020-09-10 00:00:00')
pd.to_datetime('10-9-2020')
Timestamp('2020-10-09 00:00:00')
Note: If the first item is greater than 12, Pandas will know that it cannot be a month.
pd.to_datetime('13-9-2020')
Timestamp('2020-09-13 00:00:00')
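When the layout of the date string is known in advance, one way to avoid any ambiguity is to spell it out with the format parameter of to_datetime. A minimal sketch, assuming the string is always day-month-year:

```python
import pandas as pd

# Explicit format: day, then month, then four-digit year
d = pd.to_datetime('10-9-2020', format='%d-%m-%Y')
print(d)  # 2020-09-10 00:00:00
```

With an explicit format, the result does not depend on whether the first number happens to be larger than 12.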
7. Convert the data format to time series data
The to_datetime function can also assemble a time series from a DataFrame with appropriately named columns (such as year, month and day). Consider a DataFrame in that format:
pd.to_datetime(df)
0   2020-04-13
1   2020-05-16
2   2019-04-11
dtype: datetime64[ns]
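Here is a minimal sketch of an input that would produce the output above. The column names year, month and day are ones that to_datetime recognizes when assembling dates from a DataFrame. The values are reconstructed from the output, so they are an assumption about the original data.

```python
import pandas as pd

# Hypothetical input, reconstructed from the output shown above
df = pd.DataFrame({
    'year': [2020, 2020, 2019],
    'month': [4, 5, 4],
    'day': [13, 16, 11],
})

# Each row is combined into one datetime value
result = pd.to_datetime(df)
print(result)
```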
8. Time representations other than a single timestamp
In real life, we almost always work with sequences of dates rather than individual dates. Pandas makes sequential time series data easy to process.
We can pass a list of dates to the to_datetime function.
pd.to_datetime(['2020-09-13', '2020-08-12', '2020-08-04', '2020-09-05'])
DatetimeIndex(['2020-09-13', '2020-08-12', '2020-08-04', '2020-09-05'], dtype='datetime64[ns]', freq=None)
The returned object is a DatetimeIndex.
There are also more practical ways to create a sequence of dates.
9. Create time series with to_datetime and to_timedelta
You can create a DatetimeIndex by adding a TimedeltaIndex to a timestamp.
import numpy as np
pd.to_datetime('10-9-2020') + pd.to_timedelta(np.arange(5), 'D')
"D" stands for "day", but many other units are available.
10. date_range function
The date_range function provides a more flexible way to create a DatetimeIndex.
pd.date_range(start='2020-01-10', periods=10, freq='M')
The periods parameter specifies the number of items in the index. freq is the frequency, and "M" means month end, that is, the last day of each month. (In Pandas 2.2 and later this alias is spelled "ME", and "M" is deprecated for date_range.)
date_range is very flexible about the freq parameter.
pd.date_range(start='2020-01-10', periods=10, freq='6D')
We created an index with a frequency of 6 days.
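As a further sketch of this flexibility, date_range also accepts an end date instead of periods, and frequency aliases such as "B" for business days. The date range used here is an arbitrary choice for illustration.

```python
import pandas as pd

# Business days (Monday to Friday) between two dates, both endpoints included
bdays = pd.date_range(start='2020-01-10', end='2020-01-17', freq='B')
print(bdays.day_name().tolist())
# ['Friday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
```

The weekend of January 11 and 12 is skipped, so the index has 6 entries rather than 8.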
11. period_range function
The period_range function returns a PeriodIndex. Its syntax is similar to that of date_range.
pd.period_range('2018', periods=10, freq='M')
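Unlike a timestamp, each entry in a PeriodIndex stands for a whole span of time. A small sketch, using a shorter range than above, shows how to inspect where each period starts and ends:

```python
import pandas as pd

periods = pd.period_range('2018', periods=3, freq='M')

print(periods.start_time)      # first instant of each month
print(periods.end_time)        # last instant of each month
print(periods.to_timestamp())  # DatetimeIndex built from the period starts
```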
12. timedelta_range function
The timedelta_range function returns a TimedeltaIndex.
pd.timedelta_range(start='0', periods=24, freq='H')
13. Time zone
By default, Pandas time series objects do not have a time zone assigned.
dates = pd.date_range('2019-01-01','2019-01-10')
dates.tz is None
True
We can use the tz_localize method to assign a time zone to these objects.
dates_lcz = dates.tz_localize('Europe/Berlin')
dates_lcz.tz
<DstTzInfo 'Europe/Berlin' LMT+0:53:00 STD>
14. Create a time series with a specified time zone
We can also use the tz keyword parameter to create a time series object with a time zone.
pd.date_range('2020-01-01', periods = 5, freq = 'D', tz='US/Eastern')
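Once an index carries a time zone, it can be expressed in another zone with tz_convert. The sketch below contrasts the two methods; the choice of Europe/Berlin and UTC is only for illustration.

```python
import pandas as pd

dates = pd.date_range('2019-01-01', periods=3, freq='D')

# tz_localize attaches a zone; the wall-clock time is unchanged
berlin = dates.tz_localize('Europe/Berlin')

# tz_convert keeps the same instants but expresses them in another zone
utc = berlin.tz_convert('UTC')

print(berlin[0])  # 2019-01-01 00:00:00+01:00
print(utc[0])     # 2018-12-31 23:00:00+00:00
```

Both indexes describe the same moments in time, which is why comparing their elements returns True.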
15. Offset
Suppose we have a time series index and want to offset all dates by a specific time.
A = pd.date_range('2020-01-01', periods=10, freq='D')
A
Let's add a week of offset to this data.
A + pd.offsets.Week()
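Offsets are calendar-aware, which matters most around month boundaries. The following sketch uses dates picked to straddle the end of January 2020 and applies two other offsets, DateOffset(months=1) and MonthEnd:

```python
import pandas as pd

# 2020-01-30, 2020-01-31, 2020-02-01
A = pd.date_range('2020-01-30', periods=3, freq='D')

# Same day next month, clipped to the last valid day of that month
plus_month = A + pd.DateOffset(months=1)
print(plus_month)    # 2020-02-29, 2020-02-29, 2020-03-01

# Roll forward to the next month end
to_month_end = A + pd.offsets.MonthEnd()
print(to_month_end)  # 2020-01-31, 2020-02-29, 2020-02-29
```

A date that already falls on a month end, such as 2020-01-31, moves to the following month end.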
16. Shifting time series data
Time series analysis may require shifting data points for comparison. The shift function shifts data in time.
A.shift(10, freq='M')
17. Shift vs tshift
- shift: shifts the data
- tshift: shifts the time index
Let's create a dataframe with a time series index and plot it to see the difference between shift and tshift.
dates = pd.date_range('2020-03-01', periods=30, freq='D')
values = np.random.randint(10, size=30)
df = pd.DataFrame({'values': values}, index=dates)
df.head()
Let's draw the original time series and the shifted time series together.
import matplotlib.pyplot as plt
fig, axs = plt.subplots(nrows=3, figsize=(10,6), sharey=True)
plt.tight_layout(pad=4)
df.plot(ax=axs[0], legend=None)
df.shift(10).plot(ax=axs[1], legend=None)
# tshift was removed in Pandas 2.0; there, df.shift(10, freq='D') gives the same result
df.tshift(10).plot(ax=axs[2], legend=None)
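The difference can also be seen without a plot. The sketch below uses a tiny hand-made series (not the random data above) and relies on shift with a freq argument, which moves the index the same way tshift does:

```python
import pandas as pd

idx = pd.date_range('2020-03-01', periods=4, freq='D')
s = pd.Series([1, 2, 3, 4], index=idx)

# Values move down one row; the index is unchanged and a NaN appears at the start
moved_values = s.shift(1)

# The index moves forward one day; the values are unchanged
moved_index = s.shift(1, freq='D')

print(moved_values)
print(moved_index)
```

Shifting the values loses the last data point and introduces a missing value, while shifting the index keeps every value and only relabels it.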
18. Resampling with the resample function
Another common operation on time series data is resampling. Depending on the task, we may need to resample the data at a higher or lower frequency.
The resample function creates groups (bins) of a specified interval and lets you apply an aggregation to each group.
Let's create a Pandas Series with 30 values and a time series index.
A = pd.date_range('2020-01-01', periods=30, freq='D')
values = np.random.randint(10, size=30)
S = pd.Series(values, index=A)
The following will return the average over a 3-day period.
S.resample('3D').mean()
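To make the grouping concrete, here is a small sketch with hand-picked values instead of random ones, so that each bin can be checked by eye:

```python
import pandas as pd

A = pd.date_range('2020-01-01', periods=6, freq='D')
S = pd.Series([1, 2, 3, 10, 20, 30], index=A)

# Two bins: Jan 1 to Jan 3 and Jan 4 to Jan 6, labeled by their first day
means = S.resample('3D').mean()
sums = S.resample('3D').sum()

print(means)  # 2.0 and 20.0
print(sums)   # 6 and 60
```

Any aggregation available on a group (mean, sum, max and so on) can be applied to the bins in the same way.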
19. asfreq function
In some cases, we may be interested in the values at a particular frequency. The asfreq function returns the value at each timestamp of the specified frequency, without aggregating. For example, from the series created in the previous step, we may only need one value every 3 days (instead of a 3-day average).
S.asfreq('3D')
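A short side-by-side sketch, again with hand-picked values rather than the random series, shows how selecting values with asfreq differs from aggregating with resample:

```python
import pandas as pd

A = pd.date_range('2020-01-01', periods=6, freq='D')
S = pd.Series([0, 1, 2, 3, 4, 5], index=A)

# Picks the existing values at Jan 1 and Jan 4
picked = S.asfreq('3D')

# Averages all values inside each 3-day bin
averaged = S.resample('3D').mean()

print(picked)    # 0 and 3
print(averaged)  # 1.0 and 4.0
```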
20. Rolling
Rolling is a very useful operation for time series data. It means creating a window of a specified size that rolls over the data and performing a calculation on the data inside the window at each step. The figure below explains the concept of rolling.
Note that the calculation starts only once the whole window lies within the data. In other words, with a window size of 3, the first aggregation is computed at the third row.
Let's apply a 3-day rolling window to our data.
S.rolling(3).mean()[:10]
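The effect of the window start can be checked on a small series. The sketch below also uses the min_periods parameter of rolling, which allows a result as soon as that many observations are available. The values are hand-picked for illustration.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=5, freq='D')
S = pd.Series([1, 2, 3, 4, 5], index=idx)

# The first two rows are NaN because the window of 3 is not yet full
full_window = S.rolling(3).mean()
print(full_window.tolist())     # [nan, nan, 2.0, 3.0, 4.0]

# With min_periods=1, partial windows are averaged too
partial_window = S.rolling(3, min_periods=1).mean()
print(partial_window.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```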
Conclusion
We have covered a broad introduction to time series analysis with Pandas. Note that Pandas offers many more tools for time series analysis than the ones shown here.
Thank you for reading. If you have any feedback, please let me know.
Author: Soner Yildirim
deephub translation group: Meng Xiangjie