## Featurizing time series data into a standard tabular format for classical ML models and improving accuracy using AutoML

This article explores how to improve forecasts of daily energy consumption levels by transforming a time series dataset into a tabular format using open-source libraries. We apply a popular multiclass classification model and then leverage AutoML with Cleanlab Studio to significantly boost our out-of-sample accuracy.

The key takeaway is that we can use more general methods to model a time series dataset by converting it to a tabular structure, and can even improve our predictions of the time series in the process.

At a high level, we will:

- Establish a baseline accuracy by fitting a Prophet forecasting model on our time series data.
- Convert our time series data into a tabular format using open-source featurization libraries, and show that a standard multiclass classification (Gradient Boosting) approach can outperform our Prophet model with a **67% reduction in prediction error** (an increase of 38 percentage points in out-of-sample accuracy).
- Use an AutoML solution for multiclass classification, which **resulted in a 42% reduction in prediction error** (an increase of 8 percentage points in out-of-sample accuracy) compared with our Gradient Boosting model and **an 81% reduction in prediction error** (an increase of 46 percentage points in out-of-sample accuracy) compared with our Prophet forecasting model.

To run the code demonstrated in this article, here’s the full notebook.

You can download the dataset here.

The data represents PJM hourly energy consumption (in megawatts). PJM Interconnection LLC (PJM) is a regional transmission organization (RTO) in the United States. It is part of the Eastern Interconnection grid and operates an electric transmission system serving many states.

Let’s take a look at our dataset. The data includes one datetime column (`object` type) and the megawatt energy consumption column (`float64` type), which we are trying to forecast as a discrete variable (corresponding to the quartile of hourly energy consumption levels). Our goal is to train a time series forecasting model that can forecast tomorrow’s daily energy consumption level as one of four levels: `low`, `below average`, `above average`, or `high` (these levels were determined from quartiles of the overall daily consumption distribution). We first demonstrate how to apply time series forecasting methods like Prophet to this problem, but these are restricted to certain types of ML models suited to time series data. Next, we demonstrate how to reframe this problem as a standard multiclass classification problem to which we can apply any machine learning model, and show how we can obtain superior forecasts by using powerful supervised ML.

We first convert this data into average energy consumption at a daily level and rename the columns to the format that the Prophet forecasting model expects (`ds` and `y`). These real-valued daily energy consumption levels are converted into quartiles, which is the value we are trying to predict. Our training data is shown below along with the quartile each daily energy consumption level falls into. The quartiles are computed using training data only, to prevent data leakage.

We then show the test data below, which is the data we evaluate our forecasting results against.

As seen in the images above, we use a date cutoff of `2015-04-09` to end the range of our training data and start our test data at `2015-04-10`. We compute the quartile thresholds of our daily energy consumption using ONLY training data. This avoids data leakage, that is, using out-of-sample data that would only be available in the future.
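The preparation steps described above can be sketched roughly as follows. This is a minimal illustration on synthetic hourly data rather than the actual PJM file: the synthetic values, the `Datetime` index name, and the intermediate variable names are assumptions, while the `2015-04-09` cutoff and the 1 to 4 quartile labels follow the text.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for the PJM data (assumption: illustration only)
rng = np.random.default_rng(0)
hours = pd.date_range("2015-01-01", periods=24 * 181, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame(
    {"PJME_MW": 30000 + 5000 * rng.standard_normal(len(hours))},
    index=pd.Index(hours, name="Datetime"),
)

# Average the hourly readings per calendar day and use Prophet-style column names
daily = (
    hourly["PJME_MW"].resample("D").mean().rename("y").rename_axis("ds").reset_index()
)

# Split at the cutoff date: training data ends on the cutoff, test data starts the next day
cutoff = pd.Timestamp("2015-04-09")
train_df = daily[daily["ds"] <= cutoff].copy()
test_df = daily[daily["ds"] > cutoff].copy()

# Quartile thresholds (25th, 50th, 75th percentiles) computed from training data only
quartiles = train_df["y"].quantile([0.25, 0.5, 0.75]).to_numpy()

# Bin both splits with the same training-derived thresholds
bins = [-np.inf, *quartiles, np.inf]
train_df["quartile"] = pd.cut(train_df["y"], bins=bins, labels=[1, 2, 3, 4]).astype(int)
test_df["quartile"] = pd.cut(test_df["y"], bins=bins, labels=[1, 2, 3, 4]).astype(int)
```

Because the thresholds come from the training split alone, the test labels need not be evenly spread across the four levels, which is expected when consumption drifts over time.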

Next, we forecast the daily PJME energy consumption level (in MW) for the duration of our test data and represent the forecasted values as a discrete variable. This variable represents which quartile the daily energy consumption level falls into, encoded categorically as 1 (`low`), 2 (`below average`), 3 (`above average`), or 4 (`high`). For evaluation, we use the `accuracy_score` function from `scikit-learn` to assess the performance of our models. Because we formulate the problem this way, we can evaluate our model’s next-day forecasts (and compare future models) using classification accuracy.

```python
import numpy as np
import pandas as pd
from prophet import Prophet
from sklearn.metrics import accuracy_score

# Initialize model and train it on training data
model = Prophet()
model.fit(train_df)

# Create a dataframe for future predictions covering the test period
future = model.make_future_dataframe(periods=len(test_df), freq='D')
forecast = model.predict(future)

# Categorize forecasted daily values into quartiles based on the thresholds
forecast['quartile'] = pd.cut(forecast['yhat'], bins=[-np.inf] + list(quartiles) + [np.inf], labels=[1, 2, 3, 4])

# Extract the forecasted quartiles for the test period
forecasted_quartiles = forecast.iloc[-len(test_df):]['quartile'].astype(int)

# Categorize actual daily values in the test set into quartiles
test_df['quartile'] = pd.cut(test_df['y'], bins=[-np.inf] + list(quartiles) + [np.inf], labels=[1, 2, 3, 4])
actual_test_quartiles = test_df['quartile'].astype(int)

# Calculate the evaluation metric
accuracy = accuracy_score(actual_test_quartiles, forecasted_quartiles)

# Print the evaluation metric
print(f'Accuracy: {accuracy:.4f}')
# Output: Accuracy: 0.4249
```

The out-of-sample accuracy is quite poor at roughly 42%. By modeling our time series this way, we limit ourselves to time series forecasting models (a limited subset of possible ML models). In the next section, we consider how to model this data more flexibly by transforming the time series into a standard tabular dataset via appropriate featurization. Once the time series has been transformed into a standard tabular dataset, we can employ any supervised ML model to forecast this daily energy consumption data.

Now we convert the time series data into a tabular format and featurize the data using the open-source libraries `sktime`, `tsfresh`, and `tsfel`. By employing libraries like these, we can extract a wide array of features that capture underlying patterns and characteristics of the time series data. This includes statistical, temporal, and possibly spectral features, which provide a comprehensive snapshot of the data’s behavior over time. Breaking the time series down into individual features makes it easier to understand how different aspects of the data influence the target variable.

`TSFreshFeatureExtractor` is a feature extraction tool from the `sktime` library that leverages the capabilities of `tsfresh` to extract relevant features from time series data. `tsfresh` is designed to automatically calculate a large number of time series characteristics, which can be highly useful for understanding complex temporal dynamics. For our use case, we use the minimal and essential set of features from our `TSFreshFeatureExtractor` to featurize our data.

`tsfel`, or Time Series Feature Extraction Library, offers a comprehensive suite of tools for extracting features from time series data. We use a predefined config that allows a rich set of features (e.g., statistical, temporal, spectral) to be constructed from the energy consumption time series data, capturing a wide range of characteristics that might be relevant for our classification task.

```python
import pandas as pd
import tsfel
from sktime.transformations.panel.tsfresh import TSFreshFeatureExtractor

# Define tsfresh feature extractor
tsfresh_trafo = TSFreshFeatureExtractor(default_fc_parameters="minimal")

# Transform the training data using the feature extractor
X_train_transformed = tsfresh_trafo.fit_transform(X_train)

# Transform the test data using the same feature extractor
X_test_transformed = tsfresh_trafo.transform(X_test)

# Retrieve a pre-defined feature configuration to extract all available features
cfg = tsfel.get_features_by_domain()

# Function to compute tsfel features per day
def compute_features(group):
    # TSFEL expects a DataFrame with the signal values in columns;
    # each daily group is passed to the extractor as-is
    features = tsfel.time_series_features_extractor(cfg, group, fs=1, verbose=0)
    return features

# Group by the 'Date' level of the index and apply the feature computation
train_features_per_day = X_train.groupby(level='Date').apply(compute_features).reset_index(drop=True)
test_features_per_day = X_test.groupby(level='Date').apply(compute_features).reset_index(drop=True)

# Combine both featurizations into a single set of features for our train/test data
train_combined_df = pd.concat([X_train_transformed, train_features_per_day], axis=1)
test_combined_df = pd.concat([X_test_transformed, test_features_per_day], axis=1)
```
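To make the idea of featurization concrete without depending on `tsfresh` or `tsfel`, here is a hedged pandas-only sketch. It does not reproduce the feature sets those libraries compute; the six features, the helper name `summarize_day`, and the synthetic data are all assumptions chosen for illustration. It only shows how each day's 24 hourly readings collapse into one row of a tabular dataset.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings for 10 days (assumption: stands in for the real series)
rng = np.random.default_rng(1)
idx = pd.date_range("2015-01-01", periods=24 * 10, freq=pd.Timedelta(hours=1))
hourly = pd.Series(30000 + 4000 * rng.standard_normal(len(idx)), index=idx, name="PJME_MW")

def summarize_day(values: pd.Series) -> dict:
    """Collapse one day's hourly readings into a few hand-picked features."""
    return {
        "mean": float(values.mean()),
        "std": float(values.std()),
        "min": float(values.min()),
        "max": float(values.max()),
        "peak_hour": int(values.idxmax().hour),   # hour of day with the highest load
        "abs_energy": float((values ** 2).sum()),  # sum of squared readings
    }

# Build one record per calendar day, then assemble them into a table
records = [
    {"Date": day, **summarize_day(values)}
    for day, values in hourly.groupby(hourly.index.normalize())
]
daily_features = pd.DataFrame(records).set_index("Date")

print(daily_features.head())
```

The resulting `daily_features` table has one row per day and one column per feature, which is the shape any standard supervised learner expects.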

Next, we clean our dataset by removing features that are highly correlated (absolute correlation above 0.8) with our target variable, the average daily energy consumption level, as well as features with null correlations. Highly correlated features can lead to overfitting, where the model performs well on training data but poorly on unseen data. Null-correlated features, on the other hand, provide no value because they lack a definable relationship with the target.

By excluding these features, we aim to improve model generalizability and ensure that our predictions are based on a balanced and meaningful set of data inputs.

```python
# Filter out features that are highly correlated with our target variable
column_of_interest = "PJME_MW__mean"
train_corr_matrix = train_combined_df.corr()
train_corr_with_interest = train_corr_matrix[column_of_interest]
null_corrs = pd.Series(train_corr_with_interest.isnull())
false_features = null_corrs[null_corrs].index.tolist()

columns_to_exclude = list(set(train_corr_with_interest[abs(train_corr_with_interest) > 0.8].index.tolist() + false_features))
# Keep the target column itself (it is perfectly correlated with itself)
columns_to_exclude.remove(column_of_interest)

# Filtered DataFrame excluding columns with high correlation to the column of interest
X_train_transformed = train_combined_df.drop(columns=columns_to_exclude)
X_test_transformed = test_combined_df.drop(columns=columns_to_exclude)
```

If we look at the first several rows of the training data now, this is a snapshot of what it looks like. We now have 73 features that were added from the time series featurization libraries we used. The label we are going to predict based on these features is the next day’s energy consumption level.

It’s important to note that we applied the featurization process separately to the training and test data, a best practice that avoids data leakage (the held-out test data are our most recent observations).

We also compute the discrete quartile value (using the quartiles we originally defined) with the following code to obtain our train/test energy labels, which serve as our `y` labels.

```python
# Define a function to classify each value into a quartile
def classify_into_quartile(value):
    if value < quartiles[0]:
        return 1
    elif value < quartiles[1]:
        return 2
    elif value < quartiles[2]:
        return 3
    else:
        return 4

y_train = X_train_transformed["PJME_MW__mean"].rename("daily_energy_level")
X_train_transformed.drop("PJME_MW__mean", inplace=True, axis=1)

y_test = X_test_transformed["PJME_MW__mean"].rename("daily_energy_level")
X_test_transformed.drop("PJME_MW__mean", inplace=True, axis=1)

energy_levels_train = y_train.apply(classify_into_quartile)
energy_levels_test = y_test.apply(classify_into_quartile)
```
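The text describes the label as the next day's energy level, but the snippet above does not show an explicit alignment step. The following is only a hedged sketch of one common way to pair each day's features with the following day's level, by shifting the label column back one day. The tiny `features` table, the `daily_mean` and `next_day_level` column names, and the threshold values are hypothetical; how the original notebook performs this alignment is not shown in the article.

```python
import pandas as pd

# Hypothetical daily table: one row per day with that day's mean consumption
days = pd.date_range("2015-01-01", periods=6, freq="D")
features = pd.DataFrame(
    {"daily_mean": [28000.0, 31000.0, 35000.0, 30000.0, 26000.0, 33000.0]},
    index=days,
)

# Hypothetical quartile thresholds (in practice, computed from training data only)
quartiles = [27000.0, 30500.0, 34000.0]

def classify_into_quartile(value: float) -> int:
    """Same rule as in the article: strict '<' comparisons against the thresholds."""
    if value < quartiles[0]:
        return 1
    elif value < quartiles[1]:
        return 2
    elif value < quartiles[2]:
        return 3
    return 4

# Level of each day, then shifted so that row t holds the level of day t+1
same_day_level = features["daily_mean"].apply(classify_into_quartile)
features["next_day_level"] = same_day_level.shift(-1)

# The last day has no "tomorrow" in the data, so it is dropped
supervised = features.dropna(subset=["next_day_level"]).astype({"next_day_level": int})

print(supervised)
```

One detail worth knowing: `pd.cut` with its default `right=True` assigns a value exactly equal to a threshold to the lower bin, whereas `classify_into_quartile` assigns it to the upper one. The two only disagree for values that land exactly on a threshold.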

Using our featurized tabular dataset, we can apply any supervised ML model to predict future energy consumption levels. Here we’ll use a Gradient Boosting Classifier (GBC) model, the weapon of choice for many data scientists working with tabular data.

Our GBC model is instantiated from the `sklearn.ensemble` module and configured with specific hyperparameters to optimize its performance and avoid overfitting.

```python
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(
    n_estimators=150,
    learning_rate=0.1,
    max_depth=4,
    min_samples_leaf=20,
    max_features='sqrt',
    subsample=0.8,
    random_state=42
)

gbc.fit(X_train_transformed, energy_levels_train)
y_pred_gbc = gbc.predict(X_test_transformed)
gbc_accuracy = accuracy_score(energy_levels_test, y_pred_gbc)
print(f'Accuracy: {gbc_accuracy:.4f}')
# Output: Accuracy: 0.8075
```

The out-of-sample accuracy of 81% is considerably higher than our prior Prophet model results.

Now that we’ve seen how to featurize the time series problem and the benefits of applying powerful ML models like Gradient Boosting, a natural question emerges: which supervised ML model should we apply? Of course, we could experiment with many models, tune their hyperparameters, and ensemble them together. An easier solution is to let AutoML handle all of this for us.

Here we’ll use a simple AutoML solution provided in Cleanlab Studio, which involves zero configuration. We just provide our tabular dataset, and the platform automatically trains many types of supervised ML models (including Gradient Boosting, among others), tunes their hyperparameters, and determines which models are best to combine into a single predictor. Here’s all the code needed to train and deploy an AutoML supervised classifier:

```python
from cleanlab_studio import Studio

studio = Studio()

# Create a multiclass classification project on the tabular dataset
studio.create_project(
    dataset_id=energy_forecasting_dataset,
    project_name="ENERGY-LEVEL-FORECASTING",
    modality="tabular",
    task_type="multi-class",
    model_type="regular",
    label_column="daily_energy_level",
)

# Retrieve the trained model and run inference on the test data
model = studio.get_model(energy_forecasting_model)
y_pred_automl = model.predict(test_data, return_pred_proba=True)
```

Below we can see model evaluation estimates in the AutoML platform, showing all of the different types of ML models that were automatically fit and evaluated (including multiple Gradient Boosting models), as well as an ensemble predictor constructed by optimally combining their predictions.

After running inference on our test data to obtain the next-day energy consumption level predictions, we see the test accuracy is 89%, an improvement of 8 percentage points over our previous Gradient Boosting approach.

For our PJM daily energy consumption data, we found that transforming the data into a tabular format and featurizing it achieved a **67% reduction in prediction error** (an increase of 38 percentage points in out-of-sample accuracy) compared with the baseline accuracy established with our Prophet forecasting model.

We also tried a simple AutoML approach for multiclass classification, which **resulted in a 42% reduction in prediction error** (an increase of 8 percentage points in out-of-sample accuracy) compared with our Gradient Boosting model and **an 81% reduction in prediction error** (an increase of 46 percentage points in out-of-sample accuracy) compared with our Prophet forecasting model.
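The relative figures above follow from the reported accuracies if prediction error is taken as one minus accuracy. Here is a small sketch of that arithmetic, assuming the AutoML accuracy is exactly 0.89 (the article reports it only as a rounded 89%, so the results are approximate); the helper name `error_reduction` is ours, not from the original notebook.

```python
def error_reduction(acc_before: float, acc_after: float) -> float:
    """Relative reduction in error rate, where error = 1 - accuracy."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before

# Accuracies reported in the article; 0.89 is a rounded value
prophet_acc, gbc_acc, automl_acc = 0.4249, 0.8075, 0.89

print(f"Prophet -> GBC:    {error_reduction(prophet_acc, gbc_acc):.1%}")
print(f"GBC -> AutoML:     {error_reduction(gbc_acc, automl_acc):.1%}")
print(f"Prophet -> AutoML: {error_reduction(prophet_acc, automl_acc):.1%}")
```

With these inputs the three reductions come out at roughly 67%, 43%, and 81%. The small gap from the quoted 42% presumably reflects the unrounded AutoML accuracy, which the article does not give.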

By modeling a time series dataset with approaches like those illustrated above, rather than restricting ourselves to forecasting methods, we can apply more general supervised ML techniques and achieve better results for certain types of forecasting problems.

Unless otherwise noted, all images are by the author.