20 key points for processing time series data with Pandas

There are many definitions of time series data, all expressing the same idea in different ways. A simple definition is that time series data consists of data points attached to sequential time points.

Time series data comes from periodic measurements or observations and is found in many industries. To give a few examples:

  • Stock price over a period of time
  • Daily, weekly, and monthly sales
  • Periodic measurements in a process
  • Electricity or natural gas consumption rate over a period of time

In this article, I will list 20 points to help you fully understand how to process time series data with Pandas.

1. Different forms of time series data

Time series data can take the form of a specific date, a duration, or a fixed custom interval.

A timestamp can be a day or a second on a given date, depending on the precision. For example, '2020-01-01 14:59:30' is a timestamp with second-level precision.

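2. Time series data structures

Pandas provides flexible and efficient data structures to process various kinds of time series data: Timestamp and DatetimeIndex for points in time, Timedelta and TimedeltaIndex for durations, and Period and PeriodIndex for time spans.

In addition to these three structures, Pandas also supports the concept of a date offset, which is a relative time duration that respects calendar arithmetic.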
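To make these structures and the date offset concrete, here is a minimal sketch. The specific dates and variable names are chosen for illustration only and do not come from the original article.

```python
import pandas as pd

# A single point in time
ts = pd.Timestamp('2020-01-01 14:59:30')

# A duration: an absolute length of time
delta = pd.Timedelta(days=1, hours=12)

# A time span: here, the whole month of January 2020
period = pd.Period('2020-01', freq='M')

# A calendar-aware offset: "one month later", whatever the month length
offset = pd.DateOffset(months=1)

print(ts + delta)                           # 2020-01-03 02:59:30
print(period.start_time, period.end_time)   # first and last instant of the month
print(ts + offset)                          # 2020-02-01 14:59:30
```

The difference between the last two additions is the point: a Timedelta always adds a fixed amount of time, while a DateOffset follows the calendar.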

3. Create a timestamp

The most basic time series data structure is the timestamp, which can be created using the to_datetime or Timestamp function.

import pandas as pd
pd.to_datetime('2020-9-13')
Timestamp('2020-09-13 00:00:00')
pd.Timestamp('2020-9-13')
Timestamp('2020-09-13 00:00:00')

4. Access information stored in a timestamp

We can get information about the day, month and year stored in the timestamp.

a = pd.Timestamp('2020-9-13')
a.day_name()
'Sunday'
a.month_name()
'September'
a.day
13
a.month
9
a.year
2020
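The same attributes are available for a whole column of dates through the .dt accessor of a Series. The following is a small sketch with made-up dates, not taken from the original article.

```python
import pandas as pd

# A Series of datetimes (illustrative values)
s = pd.Series(pd.to_datetime(['2020-09-13', '2020-12-25']))

# .dt exposes the Timestamp attributes element-wise
day_names = s.dt.day_name()
months = s.dt.month

print(day_names.tolist())  # ['Sunday', 'Friday']
print(months.tolist())     # [9, 12]
```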

5. Accessing hidden information

Timestamp objects also hold information about date arithmetic. For example, we can ask whether the year is a leap year. Here is some of the more specific information we can get:

b = pd.Timestamp('2020-9-30')
b.is_month_end
True
b.is_leap_year
True
b.is_quarter_start
False
b.weekofyear
40

6. European-style dates

We can use the to_datetime function to handle European-style dates (that is, day first) by setting the dayfirst parameter to True.

pd.to_datetime('10-9-2020', dayfirst=True)
Timestamp('2020-09-10 00:00:00')
pd.to_datetime('10-9-2020')
Timestamp('2020-10-09 00:00:00')

Note: If the first item is greater than 12, Pandas knows that it cannot be a month.

pd.to_datetime('13-9-2020')
Timestamp('2020-09-13 00:00:00')
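When the layout of the date strings is known in advance, an explicit format string removes the ambiguity altogether. This is an alternative to dayfirst rather than something the original article covers; a minimal sketch:

```python
import pandas as pd

# Same string, two explicit interpretations
day_first = pd.to_datetime('10-9-2020', format='%d-%m-%Y')    # 10 September 2020
month_first = pd.to_datetime('10-9-2020', format='%m-%d-%Y')  # 9 October 2020

print(day_first)    # 2020-09-10 00:00:00
print(month_first)  # 2020-10-09 00:00:00
```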

7. Convert a dataframe to time series data

The to_datetime function can convert a dataframe with appropriately named columns into a time series. Given a dataframe df whose columns hold the date components:

pd.to_datetime(df)
0   2020-04-13
1   2020-05-16 
2   2019-04-11 
dtype: datetime64[ns]
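The dataframe itself is not shown here. The sketch below assumes a hypothetical dataframe with year, month and day columns (the minimum that to_datetime needs), with values chosen to match the output above.

```python
import pandas as pd

# Hypothetical input: one column per date component
df = pd.DataFrame({
    'year': [2020, 2020, 2019],
    'month': [4, 5, 4],
    'day': [13, 16, 11],
})

# to_datetime assembles one datetime per row from the named columns
converted = pd.to_datetime(df)
print(converted)
```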

8. Time representations other than a timestamp

In real life, we almost always work with sequential time series data rather than individual dates, and Pandas makes sequential time series data simple to process.

We can pass a list of dates to the to_datetime function.

pd.to_datetime(['2020-09-13', '2020-08-12', '2020-08-04', '2020-09-05'])
DatetimeIndex(['2020-09-13', '2020-08-12', '2020-08-04', '2020-09-05'], dtype='datetime64[ns]', freq=None)

The returned object is a DatetimeIndex.

There are more practical ways to create sequences of time data.

9. Create time series with to_datetime and to_timedelta

A DatetimeIndex can be created by adding a TimedeltaIndex to a timestamp.

import numpy as np
pd.to_datetime('10-9-2020') + pd.to_timedelta(np.arange(5), 'D')

"D" stands for "day", but there are many other options.

10. date_range function

It provides a more flexible way to create a DatetimeIndex.

pd.date_range(start='2020-01-10', periods=10, freq='M')

The periods parameter specifies the number of items in the index. freq is the frequency, and "M" represents the last day of the month. (Since pandas 2.2 the month-end alias is written "ME", and "M" is deprecated for this purpose.)

date_range is very flexible in terms of the freq parameter.

pd.date_range(start='2020-01-10', periods=10, freq='6D')

We have created an index with a frequency of 6 days.
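As a small sketch of what these calls return, the example below uses shorter ranges so the results are easy to check. Passing an offset object such as pd.offsets.MonthEnd() as freq is my choice here, since it avoids the alias change between pandas versions.

```python
import pandas as pd

# start and end given: daily frequency by default
daily = pd.date_range(start='2020-01-10', end='2020-01-14')

# start and periods given, one entry every 6 days
six_day = pd.date_range(start='2020-01-10', periods=3, freq='6D')

# month ends, using an offset object instead of a string alias
month_ends = pd.date_range(start='2020-01-10', periods=3,
                           freq=pd.offsets.MonthEnd())

print(len(daily))   # 5
print(six_day)      # 2020-01-10, 2020-01-16, 2020-01-22
print(month_ends)   # 2020-01-31, 2020-02-29, 2020-03-31
```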

11. period_range function

It returns a PeriodIndex. The syntax is similar to that of the date_range function.

pd.period_range('2018', periods=10, freq='M')
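Each entry of a PeriodIndex stands for a whole span of time rather than a single instant. The sketch below, using a shorter range than the call above, shows how to inspect a period's boundaries and convert the index to timestamps.

```python
import pandas as pd

periods = pd.period_range('2018', periods=3, freq='M')

# Each Period knows where it starts and ends
first = periods[0]
print(first)             # 2018-01
print(first.start_time)  # 2018-01-01 00:00:00
print(first.end_time)    # last instant of January 2018

# Convert to a DatetimeIndex (start of each period by default)
as_timestamps = periods.to_timestamp()
print(as_timestamps)
```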

12. timedelta_range function

It returns a TimedeltaIndex. (Since pandas 2.2 the hourly alias is written "h", and "H" is deprecated.)

pd.timedelta_range(start='0', periods=24, freq='H')
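A TimedeltaIndex becomes useful once it is added to a timestamp. In the sketch below the step is given as a pd.Timedelta object, which is my choice to sidestep the "H" versus "h" alias difference between pandas versions.

```python
import pandas as pd

# Four durations, 6 hours apart: 0h, 6h, 12h, 18h
steps = pd.timedelta_range(start='0 days', periods=4, freq=pd.Timedelta(hours=6))

# Adding them to a timestamp yields a DatetimeIndex
stamps = pd.Timestamp('2020-01-01') + steps

print(steps)
print(stamps)  # 00:00, 06:00, 12:00 and 18:00 on 2020-01-01
```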

13. Time zones

By default, Pandas time series objects do not have an assigned time zone.

dates = pd.date_range('2019-01-01','2019-01-10')
dates.tz is None
True

We can assign a time zone to these objects using the tz_localize method.

dates_lcz = dates.tz_localize('Europe/Berlin')
dates_lcz.tz
<DstTzInfo 'Europe/Berlin' LMT+0:53:00 STD>

14. Create a time series with a specified time zone

We can also create a time series object with a time zone by using the tz keyword argument.

pd.date_range('2020-01-01', periods = 5, freq = 'D', tz='US/Eastern')
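Once an index has a time zone, it can be converted to another one with tz_convert, which the original article does not cover. A minimal sketch, using 'America/New_York' as the zone name for US Eastern time:

```python
import pandas as pd

# Naive timestamps: no time zone attached
naive = pd.date_range('2020-01-01 12:00', periods=2, freq='D')

# tz_localize attaches a zone without changing the wall-clock time
berlin = naive.tz_localize('Europe/Berlin')

# tz_convert keeps the instant and changes the wall-clock time
new_york = berlin.tz_convert('America/New_York')

print(berlin[0])    # 2020-01-01 12:00:00+01:00
print(new_york[0])  # 2020-01-01 06:00:00-05:00
```

Note the asymmetry: tz_localize is for naive data, tz_convert is for data that already has a zone.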

15. Offset

Suppose we have a time series index and want to offset all dates by a specific amount of time.

A = pd.date_range('2020-01-01', periods=10, freq='D')
A

Let's add an offset of one week to this index.

A + pd.offsets.Week()
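Other offsets work the same way. The sketch below uses a short index around a month boundary (my choice of dates) to show how calendar-aware offsets behave.

```python
import pandas as pd

idx = pd.date_range('2020-01-30', periods=3, freq='D')  # Jan 30, Jan 31, Feb 1

# Fixed shift of seven days
plus_week = idx + pd.offsets.Week()

# Roll forward to the next month end (a date already on a month end moves to the following one)
month_end = idx + pd.offsets.MonthEnd()

# One calendar month later, clipped to the last valid day of the month
plus_month = idx + pd.DateOffset(months=1)

print(plus_week)   # 2020-02-06, 2020-02-07, 2020-02-08
print(month_end)   # 2020-01-31, 2020-02-29, 2020-02-29
print(plus_month)  # 2020-02-29, 2020-02-29, 2020-03-01
```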

16. Shifting time series data

Time series analysis may require shifting data points to make comparisons. The shift function shifts data in time.

A.shift(10, freq='M')

17. Shift vs tshift

  • shift: shifts the data
  • tshift: shifts the time index

Note that tshift was deprecated in pandas 1.1 and removed in pandas 2.0; shift with the freq parameter does the same job.

Let's create a dataframe with a time series index and plot it to see the difference between shift and tshift.

dates = pd.date_range('2020-03-01', periods=30, freq='D')
values = np.random.randint(10, size=30)
df = pd.DataFrame({'values': values}, index=dates)
df.head()

Let's plot the original time series and the shifted ones together.

import matplotlib.pyplot as plt
fig, axs = plt.subplots(nrows=3, figsize=(10,6), sharey=True)
plt.tight_layout(pad=4)
df.plot(ax=axs[0], legend=None)
df.shift(10).plot(ax=axs[1], legend=None)
# df.tshift(10) in the original; shift with freq is the equivalent in pandas 2.0+
df.shift(10, freq='D').plot(ax=axs[2], legend=None)
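The difference is easier to see on a few rows than on a plot. The following sketch uses a tiny dataframe with fixed values (not the random data above) and shift with freq in place of tshift.

```python
import pandas as pd

dates = pd.date_range('2020-03-01', periods=4, freq='D')
df = pd.DataFrame({'values': [1, 2, 3, 4]}, index=dates)

# Data shift: values move down one row, the index stays put, a NaN appears on top
shifted_data = df.shift(1)

# Index shift: every date moves forward one day, the values stay attached to their rows
shifted_index = df.shift(1, freq='D')

print(shifted_data)
print(shifted_index)
```

Shifting the data loses the last value and introduces a missing one; shifting the index loses nothing.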

18. Resampling with the resample function

Another common operation on time series data is resampling. Depending on the task, we may need to resample the data at a higher or lower frequency.

resample creates groups (or bins) of the specified interval and lets you aggregate within each group.

Let's create a Pandas Series with 30 values and a time series index.

A = pd.date_range('2020-01-01', periods=30, freq='D')
values = np.random.randint(10, size=30)
S = pd.Series(values, index=A)

The following returns the mean over each 3-day period.

S.resample('3D').mean()
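Because the series above is random, the result changes on every run. Here is a small deterministic sketch with fixed values (my own) that makes the binning visible.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=6, freq='D')
S = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Two 3-day bins: Jan 1-3 and Jan 4-6, each labelled by its first day
means = S.resample('3D').mean()
sums = S.resample('3D').sum()

print(means)  # 2020-01-01 -> 2.0, 2020-01-04 -> 5.0
print(sums)   # 2020-01-01 -> 6,   2020-01-04 -> 15
```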

19. asfreq function

In some cases, we may be interested in the values at a particular frequency. The asfreq function returns the value at each timestamp of the specified frequency, without aggregating. For example, in the series created in the previous step, we may only need the value on every third day (instead of a 3-day average).

S.asfreq('3D')
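A short sketch with fixed values (not the random series above) shows that asfreq picks existing values instead of aggregating them.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=6, freq='D')
S = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Keep only the observations that fall on the 3-day grid: Jan 1 and Jan 4
every_third = S.asfreq('3D')

print(every_third)  # 2020-01-01 -> 1, 2020-01-04 -> 4
```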

20. Rolling

Rolling is a very useful operation for time series data. Rolling means creating a rolling window of a specified size and performing calculations on the data in that window, which then rolls through the data.

[Figure: illustration of a rolling window]

It is worth noting that the calculation starts only once the whole window is inside the data. In other words, if the window size is 3, the first aggregation is performed at the third row.

Let's apply a 3-day rolling window to our series.

S.rolling(3).mean()[:10]
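With fixed values the behaviour at the start of the series is easy to verify. The sketch below also uses the min_periods parameter, which is not discussed in the original article, to allow results before the window is full.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=5, freq='D')
S = pd.Series([1, 2, 3, 4, 5], index=idx)

# Full window required: the first two rows have no result
rolling_mean = S.rolling(3).mean()

# Allow partial windows at the start
partial_mean = S.rolling(3, min_periods=1).mean()

print(rolling_mean.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
print(partial_mean.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```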

Conclusion

We have covered a comprehensive introduction to time series analysis with Pandas. It is worth noting that Pandas offers much more for time series analysis than what is covered here.

Thank you for reading. If you have any feedback, please let me know.

Author: Soner Yildirim

deephub translation group: Meng Xiangjie

Origin blog.csdn.net/m0_46510245/article/details/108672969