How to utilise the date_parser parameter of pandas.read_csv()

G Gr :

I am having an issue with the timestamp column in my CSV file:

ValueError: could not convert string to float: '2020-02-21 22:00:00'

It is raised at the line marked below:

import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from datetime import datetime
from statsmodels.tools.eval_measures import rmse
from sklearn.preprocessing import MinMaxScaler
from keras.preprocessing.sequence import TimeseriesGenerator
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
import warnings
warnings.filterwarnings("ignore")

# Import dataset
df = pd.read_csv('fx_intraday_1min_GBP_USD.csv')

# Hold out the last 3 rows for testing, then scale
train, test = df[:-3], df[-3:]
scaler = MinMaxScaler()
scaler.fit(train)  # <----------- This line raises the ValueError
train = scaler.transform(train)
test = scaler.transform(test)

n_input = 3
n_features = 4

generator = TimeseriesGenerator(train, train, length=n_input, batch_size=6)

model = Sequential()
model.add(LSTM(200, activation='relu', input_shape=(n_input, n_features)))
model.add(Dropout(0.15))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit_generator(generator, epochs=180)

How can I convert the timestamp column (preferably when reading the csv) to a float?


Link to the dataset: https://www.alphavantage.co/query?function=FX_INTRADAY&from_symbol=GBP&to_symbol=USD&interval=1min&apikey=OF7SE183CNQLT9DW&datatype=csv

Todd :

Performing Conversion On CSV Input Columns While Reading In The Data

So it turns out that you might not have wanted to use the date_parser parameter after all. The converters parameter is more along the lines of what we need. If we specify a conversion function for the 'timestamp' column like so:

>>> df = pd.read_csv('~/Downloads/fx_intraday_1min_GBP_USD.csv', 
...                  converters={'timestamp': 
...                                 lambda t: pd.Timestamp(t).timestamp()})
>>> df
       timestamp    open    high     low   close
0   1.582322e+09  1.2953  1.2964  1.2953  1.2964
1   1.582322e+09  1.2955  1.2957  1.2952  1.2957
2   1.582322e+09  1.2956  1.2958  1.2954  1.2957
3   1.582322e+09  1.2957  1.2958  1.2954  1.2957
4   1.582322e+09  1.2957  1.2958  1.2955  1.2956
..           ...     ...     ...     ...     ...
95  1.582317e+09  1.2966  1.2967  1.2964  1.2965
96  1.582317e+09  1.2967  1.2968  1.2965  1.2966
97  1.582317e+09  1.2965  1.2967  1.2964  1.2966
98  1.582317e+09  1.2964  1.2967  1.2962  1.2966
99  1.582316e+09  1.2963  1.2965  1.2961  1.2964

[100 rows x 5 columns]

As shown above, the timestamp column is successfully converted to float values, per your requirement. The converters parameter takes a dictionary with the column name as the key and the conversion callback as the value. You could also use the column's index as the key, but using the name is clearer.

This strategy can be applied to other columns by providing callback functions to do any sort of conversion compatible with pandas. It's not limited to just this datetime to float case.
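
For example, here's a minimal sketch (not from the original post) that also runs the 'open' column through a converter; the rounding step is purely illustrative:

# Each converter callback receives the raw cell value as a string
df = pd.read_csv('fx_intraday_1min_GBP_USD.csv',
                 converters={'timestamp': lambda t: pd.Timestamp(t).timestamp(),
                             'open': lambda v: round(float(v), 3)})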

[side note: You may want to confirm that the machine learning package you're using expects these float values to be POSIX timestamps.]
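
As a quick check (not part of the original code) that the converted values really are seconds since the Unix epoch; unlike the standard library's datetime, a naive pd.Timestamp is treated as UTC here:

>>> pd.Timestamp('2020-02-21 22:00:00').timestamp()
1582322400.0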

The date_parser parameter seems to be understood by read_csv() only as a way to control how the text is parsed into datetime objects. Trying to use it to create a column of floats generally produced strange results.

date_parser could be useful if the timestamp data spans more than one column or is in some unusual format. The callback can receive the text from one or more columns for processing. The parse_dates parameter usually needs to be supplied alongside date_parser to indicate which columns to apply the callback to; parse_dates is just a list of the column names or indices. An example of usage:

df = pd.read_csv('~/Downloads/fx_intraday_1min_GBP_USD.csv', 
                 date_parser=lambda t: pd.Timestamp(t), 
                 parse_dates=['timestamp'])
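
For a more concrete illustration of the "strange format" case, here's a hypothetical sketch (the file name and the '%d/%m/%Y %H-%M-%S' format string are made up):

# Hypothetical: a CSV whose timestamp column looks like '21/02/2020 22-00-00'
df = pd.read_csv('fx_with_odd_timestamps.csv',
                 parse_dates=['timestamp'],
                 date_parser=lambda col: pd.to_datetime(col, format='%d/%m/%Y %H-%M-%S'))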

pd.read_csv() with no date/time parameters produces a timestamp column of type object. Simply specifying which column holds the timestamp via parse_dates, with no additional parameters, fixes that:

>>> df = pd.read_csv('~/Downloads/fx_intraday_1min_GBP_USD.csv', 
                     parse_dates=['timestamp'])
>>> df.dtypes
timestamp    datetime64[ns]
open                float64
high                float64
low                 float64
close               float64
dtype: object

The date_parser parameter was unnecessary in this case. I'm thinking this last example was all that your script may have needed.

Pandas provides some of its own date/time classes and functions. Here's an example of pd.Timestamp and its conversion to a numpy-compatible datetime64:

>>> pd.Timestamp('2020-02-21 22:00:00')
Timestamp('2020-02-21 22:00:00')
>>> pd.Timestamp('2020-02-21 22:00:00').asm8
numpy.datetime64('2020-02-21T22:00:00.000000000')

Conversion of DataFrame Columns After Reading in CSV

As another user suggested, there's another way to convert the contents of a column using pd.to_datetime(). Here's an example:

>>> df = pd.read_csv('~/Downloads/fx_intraday_1min_GBP_USD.csv')
>>> df.dtypes
timestamp     object
open         float64
high         float64
low          float64
close        float64
dtype: object
>>> df['timestamp'] = pd.to_datetime(df['timestamp'])
>>> df.dtypes
timestamp    datetime64[ns]
open                float64
high                float64
low                 float64
close               float64
dtype: object
>>> 
>>> df['timestamp'] = df['timestamp'].apply(lambda t: t.timestamp())
>>> df
       timestamp    open    high     low   close
0   1.582322e+09  1.2953  1.2964  1.2953  1.2964
1   1.582322e+09  1.2955  1.2957  1.2952  1.2957
2   1.582322e+09  1.2956  1.2958  1.2954  1.2957
3   1.582322e+09  1.2957  1.2958  1.2954  1.2957
4   1.582322e+09  1.2957  1.2958  1.2955  1.2956
..           ...     ...     ...     ...     ...
95  1.582317e+09  1.2966  1.2967  1.2964  1.2965
96  1.582317e+09  1.2967  1.2968  1.2965  1.2966
97  1.582317e+09  1.2965  1.2967  1.2964  1.2966
98  1.582317e+09  1.2964  1.2967  1.2962  1.2966
99  1.582316e+09  1.2963  1.2965  1.2961  1.2964

[100 rows x 5 columns]

Or, to do it all in one shot without pd.to_datetime(), it can be done as follows. This last method uses the same lambda as the first example at the top of this answer:

>>> df = pd.read_csv('~/Downloads/fx_intraday_1min_GBP_USD.csv')
>>>
>>> df['timestamp'] = df['timestamp'] \
...                       .apply(lambda t: pd.Timestamp(t).timestamp())
>>>

This last method is very versatile thanks to apply(): the function passed in is run on each member of the column, and the resulting values can be assigned back to the same column, to a different column, or added to the data frame as a new column, as in the sketch below.
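
A minimal sketch of that last option (the 'timestamp_float' column name is just an illustrative choice, not from the original post):

>>> df = pd.read_csv('~/Downloads/fx_intraday_1min_GBP_USD.csv')
>>> # Keep the original strings and add the POSIX float values as a new column
>>> df['timestamp_float'] = df['timestamp'].apply(lambda t: pd.Timestamp(t).timestamp())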

So... There it is. Two examples of how to convert timestamp strings in a pandas column to float values. I've learned a few things trying to answer this question. Thank you for that, @GGr.
