Time Series Prediction with LSTMs using TensorFlow 2 and Keras in Python

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Time series forecasting requires sequence-aware modeling because time points influence each other over time, especially through seasonality and recurring cycles.

Briefing

Time series forecasting with LSTMs hinges on treating past observations as a sequence, not as independent data points; the practical payoff is a working pipeline that predicts future bike-share demand from historical hourly patterns. The core idea is that time series data is typically recorded at regular intervals (often hourly or daily) and carries structure over time: stationarity (or its absence) in the mean and variance, trends, and especially seasonality (repeating cycles). For bike-share demand, those cycles show up clearly: monthly totals rise in summer, and hourly demand spikes in the morning commute window and again in the evening.
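
Those patterns are easy to confirm visually. A minimal sketch, assuming the hourly data is already loaded into a Pandas DataFrame df with a DatetimeIndex and a cnt count column (the loading step appears below):

```python
import matplotlib.pyplot as plt

# Assumes `df` has a DatetimeIndex and a `cnt` column of hourly
# bike-share counts (the column name is an assumption).
monthly_totals = df["cnt"].resample("M").sum()          # seasonal: peaks in summer
hourly_means = df.groupby(df.index.hour)["cnt"].mean()  # commute-hour spikes

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
monthly_totals.plot(ax=ax1, title="Monthly totals")
hourly_means.plot(ax=ax2, title="Mean count by hour of day")
plt.show()
```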

To model those temporal dependencies, the workflow builds on recurrent neural networks, with Long Short-Term Memory (LSTM) networks singled out as a practical choice for sequence learning. LSTMs are presented as a response to training difficulties common in vanilla recurrent networks—particularly vanishing/exploding gradients—handled through gated memory that can retain relevant history while discarding noise. Training uses backpropagation through time (implemented via an unrolled recurrent structure), enabling the network to learn how earlier time steps influence the next value.

The demonstration uses the Bike Sharing dataset (sourced from Kaggle) and runs in Python with TensorFlow 2 and Keras inside a Google Colab notebook. After installing dependencies and enabling the GPU runtime, the dataset is loaded into a Pandas DataFrame with timestamps parsed and set as the index. Feature engineering then adds time-derived predictors: hour of day, day of week, day of month, and month. The target variable is the bike-share count for each one-hour interval, while additional inputs include weather-related numeric features (e.g., temperature, humidity, wind speed) and categorical/encoded signals such as weather condition codes, holiday flags, and season labels.
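
A minimal sketch of that loading and feature-engineering step; the file name and timestamp column are assumptions based on the Kaggle CSV layout:

```python
import pandas as pd

# File and column names are assumptions based on the Kaggle
# London bike-sharing CSV layout.
df = pd.read_csv(
    "london_merged.csv",
    parse_dates=["timestamp"],
    index_col="timestamp",
)

# Time-derived predictors engineered from the datetime index.
df["hour"] = df.index.hour
df["day_of_week"] = df.index.dayofweek
df["day_of_month"] = df.index.day
df["month"] = df.index.month
```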

Before training, the data is split chronologically: 90% for training and 10% for testing, preserving temporal order (no shuffling). Scaling is handled carefully with RobustScaler from scikit-learn—fitted on the training set only—to improve learning stability. Separate scaling is applied to feature columns (temperatures, humidity, wind speed) and to the target count, which later enables an inverse transform so predictions can be interpreted in real bike-share units.
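A sketch of the split and scaling, assuming the DataFrame from above; the numeric column names (t1, t2, hum, wind_speed) are assumptions:

```python
from sklearn.preprocessing import RobustScaler

# Chronological split: first 90% for training, last 10% for testing (no shuffle).
train_size = int(len(df) * 0.9)
train = df.iloc[:train_size].copy()
test = df.iloc[train_size:].copy()

# Fit scalers on the training set only to avoid leaking test statistics.
feature_cols = ["t1", "t2", "hum", "wind_speed"]  # numeric columns; names assumed
feature_scaler = RobustScaler().fit(train[feature_cols])
cnt_scaler = RobustScaler().fit(train[["cnt"]])   # separate scaler for the target

train[feature_cols] = feature_scaler.transform(train[feature_cols])
test[feature_cols] = feature_scaler.transform(test[feature_cols])
train["cnt"] = cnt_scaler.transform(train[["cnt"]]).ravel()
test["cnt"] = cnt_scaler.transform(test[["cnt"]]).ravel()
```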

The sequence preparation step converts the time series into supervised learning examples. A custom create_dataset function slices the data into rolling windows: for each sample, it uses the previous 24 hours of features to predict the next hour’s bike-share count. With sequences shaped as (samples, time_steps, features), a bidirectional LSTM model is built in Keras: a Bidirectional wrapper around an LSTM layer, followed by Dropout for regularization and a Dense output neuron for regression. The model is compiled with the Adam optimizer and mean squared error loss.
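A sketch of the windowing function and model under those assumptions; the layer sizes (128 LSTM units, 0.2 dropout) are illustrative choices, not confirmed by the transcript:

```python
import numpy as np
from tensorflow import keras

def create_dataset(X, y, time_steps=1):
    """Slice a feature DataFrame and target Series into rolling windows."""
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        Xs.append(X.iloc[i : i + time_steps].values)  # previous `time_steps` hours
        ys.append(y.iloc[i + time_steps])             # count at the next hour
    return np.array(Xs), np.array(ys)

TIME_STEPS = 24
X_train, y_train = create_dataset(train, train["cnt"], TIME_STEPS)
X_test, y_test = create_dataset(test, test["cnt"], TIME_STEPS)

# Input shape is (time_steps, features), matching (samples, time_steps, features).
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1], X_train.shape[2])),
    keras.layers.Bidirectional(keras.layers.LSTM(128)),  # units: an assumption
    keras.layers.Dropout(0.2),                           # rate: an assumption
    keras.layers.Dense(1),                               # single regression output
])
model.compile(optimizer="adam", loss="mse")
```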

Training runs for 30 epochs with a 10% validation split and no shuffling. Validation loss lands around 0.0231, and the learning curves suggest the model reaches a good fit within roughly 10–15 epochs. Predictions on the test set are inverse-scaled back to counts and plotted against true values. The model tracks typical demand levels closely, though it underestimates or misses some extreme peaks—an expected limitation for a relatively simple architecture and feature set. The result is a clear, end-to-end template for LSTM-based time series forecasting that can be extended with richer preprocessing (e.g., better encoding for categorical variables) or more advanced modeling choices.
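A sketch of that training and evaluation step; the batch size is an assumption, while the other settings follow the figures above:

```python
# No shuffling: validation_split takes the last 10% of the chronological windows.
history = model.fit(
    X_train, y_train,
    epochs=30,
    batch_size=32,        # batch size is an assumption
    validation_split=0.1,
    shuffle=False,
)

# Map scaled predictions back to real bike-share counts for plotting.
y_pred = model.predict(X_test)
y_pred_counts = cnt_scaler.inverse_transform(y_pred)
y_true_counts = cnt_scaler.inverse_transform(y_test.reshape(-1, 1))
```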

Cornell Notes

The pipeline treats time series forecasting as sequence learning: past observations must be fed to an LSTM as ordered windows, because time points are not independent. Using the Bike Sharing dataset, the workflow engineers time features (hour, day of week, day of month, month) and uses weather and calendar signals as inputs, while the target is the hourly bike-share count. Data is split chronologically (90% train, 10% test), scaled with RobustScaler fitted only on training data, and converted into supervised samples using rolling windows of 24 hours to predict the next hour. A bidirectional LSTM with dropout is trained in Keras using Adam and mean squared error, achieving a validation loss around 0.0231. Predictions are inverse-transformed back to real counts and compared to true values, with good performance on typical ranges and weaker handling of extremes.

Why does time series forecasting require sequence models instead of treating rows as independent samples?

Time series data points are recorded over time and exhibit dependencies—demand today is influenced by earlier demand patterns. The transcript highlights that the “independent data points” assumption common in other ML settings is “blatantly false” for time series. It also points to key properties like stationarity (roughly constant mean/variance), seasonality (repeating cycles), and trends. Those repeated cycles and temporal dependencies are exactly what LSTMs are designed to learn by processing ordered sequences.

What do stationarity and seasonality mean in practical forecasting terms?

Stationarity is described as having a roughly constant mean and variance over time. Seasonality is described as repeated cycles over a fixed interval—e.g., the pattern repeats every few steps. In the bike-share example, monthly totals show a seasonal component (higher demand in summer), and hourly plots show daily commute-like spikes. These patterns motivate using a model that can learn periodic structure from history.

How does the notebook prepare data for an LSTM to predict the next hour?

A create_dataset function slices the scaled time series into rolling windows. For each training example, it takes time_steps=24 previous hours of features (shape: samples × 24 × 13 features) and pairs them with the label y at the next time step (the next hour’s count). This converts raw hourly rows into the sequence format Keras LSTM layers expect.
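A quick sanity check on that contract, reusing the create_dataset sketch above (the 13-feature count comes from the transcript):

```python
X_train, y_train = create_dataset(train, train["cnt"], time_steps=24)
print(X_train.shape)  # (n_samples, 24, 13): 24 prior hours x 13 features
print(y_train.shape)  # (n_samples,): the next hour's scaled count per window
```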

Why is scaling done with RobustScaler, and why is the target scaled separately?

The transcript notes that scaling can make learning faster and more accurate, and it uses RobustScaler to reduce sensitivity to outliers. It also scales the target count separately so the inverse transform can later convert predictions back into real bike-share counts. That inverse transform is necessary for meaningful plots comparing predicted counts to true counts.

What does the bidirectional LSTM add compared with a standard LSTM in this setup?

The model uses a Bidirectional wrapper around an LSTM layer, meaning it processes the input sequence in both forward and backward directions. The transcript frames this as looking at both “history and future” within the provided window. Even though the task is next-step prediction, bidirectionality can help the model extract stronger signals from the 24-hour context window.

How well does the model perform, and why does it struggle with extremes?

When predictions are plotted against true test values, the model tracks typical demand levels well but misses some extreme peaks. The transcript attributes this to the difficulty of capturing outliers with a simple model and limited feature encoding (it mentions not doing one-hot encoding for categorical variables). This suggests that handling extremes may require richer features, more data, or more advanced architectures.

Review Questions

  1. In what exact way does the create_dataset function define the relationship between X and y (which time step is predicted)?
  2. Why must the train/test split preserve chronological order, and what goes wrong if data is shuffled?
  3. How does inverse scaling of the target count enable an apples-to-apples comparison between predictions and ground truth?

Key Points

  1. Time series forecasting requires sequence-aware modeling because time points influence each other over time, especially through seasonality and recurring cycles.

  2. Stationarity (stable mean/variance) and seasonality (repeating patterns) are practical properties that guide feature design and model choice.

  3. For LSTMs, convert hourly rows into rolling windows: use the previous 24 hours of features to predict the next hour’s bike-share count.

  4. Scale features and the target using RobustScaler fitted only on the training set to avoid leakage and to stabilize training.

  5. Use a chronological split (90% train, 10% test) and avoid shuffling during training to respect temporal dependence.

  6. A bidirectional LSTM with dropout and a single regression output neuron can produce strong baseline forecasts, though extremes may remain challenging.

  7. Inverse-transform predictions back to real count units so evaluation plots reflect actual bike-share demand rather than scaled values.

Highlights

  • Seasonality shows up in bike-share demand: monthly totals rise in summer, and daily demand spikes around morning and evening commute hours.
  • The dataset is transformed into supervised sequences where each sample uses 24 prior hours (time_steps=24) to predict the next hour’s count.
  • A bidirectional LSTM in Keras (with Dropout) trains quickly on GPU and reaches validation loss around 0.0231.
  • Predictions match typical demand levels closely but underperform on the most extreme peaks: an indicator that outliers and categorical handling may need improvement.

Topics

Mentioned

  • LSTM
  • GPU
  • RNN
  • GRU
  • MSE
  • Adam
  • Keras
  • TensorFlow
  • CSV