
Large Language Model Performance in Time Series Analysis

Image created by author using DALL-E 3

How do major LLMs stack up at detecting anomalies or movements in the data when given a large set of time series data within the context window?

Towards Data Science

While LLMs clearly excel in natural language processing tasks, their ability to analyze patterns in non-textual data, such as time series data, remains less explored. As more teams rush to deploy LLM-powered solutions without thoroughly testing their capabilities in basic pattern analysis, the task of evaluating the performance of these models in this context takes on elevated importance.

In this research, we set out to investigate the following question: given a large set of time series data within the context window, how well can LLMs detect anomalies or movements in the data? In other words, should you trust your money with a stock-picking OpenAI GPT-4 or Anthropic Claude 3 agent? To answer this question, we conducted a series of experiments comparing the performance of LLMs in detecting anomalous time series patterns.

All code needed to reproduce these results can be found in this GitHub repository.

Figure 1: A rough sketch of the time series data (image by author)

We tasked GPT-4 and Claude 3 with analyzing changes in data points across time. The data we used represented specific metrics for different world cities over time and was formatted as JSON before being input into the models. We introduced random noise, ranging from 20–30% of the data range, to simulate real-world scenarios. The LLMs were tasked with detecting movements above a specific percentage threshold and identifying the city and date where the anomaly was detected. The data was included in this prompt template:

  basic_template = ''' You are an AI assistant for a data scientist. You have been given a time series dataset to analyze.
The dataset contains a series of measurements taken at regular intervals over a period of time.
There is one timeseries for each city in the dataset. Your task is to identify anomalies in the data. The dataset is in the form of a JSON object, with the date as the key and the measurement as the value.

The dataset is as follows:

Please use the following directions to analyze the data:


Figure 2: The basic prompt template used in our tests
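To make the input format concrete, here is a minimal sketch of how a dataset of this shape could be generated: one time series per city, keyed by date, with random noise sized as a fraction of the data range. This is not the code from our repository. The function names, the baseline of 100, the data range of 50, and the reading of "20–30% of the data range" as a uniform noise band are all assumptions made for illustration.

```python
import json
import random
from datetime import date, timedelta


def make_series(start, days, base, data_range, noise_frac, rng):
    """Build {ISO date: value} around `base` with uniform noise.

    Assumption: the noise band has total width noise_frac * data_range,
    centered on the baseline value.
    """
    amplitude = noise_frac * data_range
    series = {}
    for offset in range(days):
        day = start + timedelta(days=offset)
        noise = rng.uniform(-amplitude / 2, amplitude / 2)
        series[day.isoformat()] = round(base + noise, 2)
    return series


def make_dataset(cities, start, days, seed=0):
    """Build {city: {ISO date: value}} with a 20-30% noise fraction per city."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    dataset = {}
    for city in cities:
        noise_frac = rng.uniform(0.20, 0.30)
        dataset[city] = make_series(
            start, days, base=100.0, data_range=50.0,
            noise_frac=noise_frac, rng=rng,
        )
    return dataset


# Hypothetical cities and dates, chosen only for the example.
dataset = make_dataset(["London", "Tokyo", "Lagos"], date(2024, 1, 1), days=30)

# JSON text with the date as the key and the measurement as the value,
# ready to be placed into a prompt.
payload = json.dumps(dataset, indent=2)
```

Seeding the generator keeps every run of a test comparable, which matters when the same dataset is sent to more than one model.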

Analyzing patterns throughout the context window, detecting anomalies across a large set of time series simultaneously, synthesizing the results, and grouping them by date is no walk in the park for an LLM; we really wanted to push the limits of these models in this test. Additionally, the models were required to perform mathematical calculations on the time series, a task that language models generally struggle with.
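For comparison, the same task is only a few lines of conventional code. The sketch below flags every point that deviates from its city's baseline by more than a percentage threshold and groups the hits by date, which is the shape of answer the models were asked to produce. Using the median as the baseline is an assumption; the function name and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def detect_anomalies(dataset, threshold_pct):
    """Return {date: [cities]} for points deviating more than threshold_pct.

    Assumption: deviation is measured against the median of each city's
    series, which stays stable when only a few points are anomalous.
    """
    by_date = defaultdict(list)
    for city, series in dataset.items():
        baseline = median(series.values())
        if baseline == 0:
            continue  # percentage deviation is undefined for a zero baseline
        for day, value in series.items():
            deviation_pct = abs(value - baseline) / abs(baseline) * 100
            if deviation_pct > threshold_pct:
                by_date[day].append(city)
    # ISO date strings sort chronologically, so a plain sort is enough.
    return {day: sorted(cities) for day, cities in sorted(by_date.items())}


# Hypothetical sample: one spike per city, on different dates.
sample = {
    "London": {"2024-01-01": 100, "2024-01-02": 102, "2024-01-03": 150,
               "2024-01-04": 98, "2024-01-05": 101},
    "Tokyo": {"2024-01-01": 200, "2024-01-02": 310, "2024-01-03": 205,
              "2024-01-04": 195, "2024-01-05": 198},
}
anomalies = detect_anomalies(sample, threshold_pct=40)
# anomalies == {"2024-01-02": ["Tokyo"], "2024-01-03": ["London"]}
```

A deterministic check like this can also serve as the ground truth against which a model's answers are scored.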

We also evaluated the models’ performance under different conditions, such as extending the duration of the anomaly, increasing the percentage of the anomaly, and varying the number of anomaly events within the dataset. We should note that in our initial tests, we encountered an issue where synchronizing the anomalies, so that they all occurred on the same date, allowed the LLMs to perform better by recognizing the pattern based on the date rather than the data movement. When evaluating LLMs, careful test setup is extremely important to prevent the models from picking up on unintended patterns that could skew results.
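One way to avoid the synchronization problem is to give every city its own anomaly start date. The sketch below is a hypothetical helper, not the code from our repository: it injects a spike of a chosen percentage and duration into each city's series, draws a distinct start date per city, and returns the ground-truth list of affected points.

```python
import random


def inject_staggered_spikes(dataset, pct, duration, seed=0):
    """Return (modified dataset, sorted list of (city, date) spike points).

    Assumptions: each inner dict is in chronological order, every city gets
    exactly one event, and a spike multiplies the value by (1 + pct / 100).
    Start dates are distinct across cities, although windows longer than
    one day may still overlap.
    """
    rng = random.Random(seed)
    n_days = min(len(series) for series in dataset.values())
    possible_starts = range(n_days - duration + 1)
    if len(possible_starts) < len(dataset):
        raise ValueError("not enough distinct start dates for every city")
    # Sampling without replacement guarantees no two cities share a start.
    starts = rng.sample(possible_starts, k=len(dataset))

    modified, truth = {}, []
    for (city, series), start in zip(dataset.items(), starts):
        days = list(series)
        values = dict(series)  # copy, so the input dataset is left untouched
        for day in days[start:start + duration]:
            values[day] = round(series[day] * (1 + pct / 100), 2)
            truth.append((city, day))
        modified[city] = values
    return modified, sorted(truth)


# Hypothetical flat series, so the spikes are easy to see.
flat = {
    city: {f"2024-01-{d:02d}": 100.0 for d in range(1, 11)}
    for city in ["London", "Tokyo", "Lagos"]
}
spiked, truth = inject_staggered_spikes(flat, pct=50, duration=2, seed=1)
```

The `pct` and `duration` parameters correspond to the conditions varied above; running several events per city would be a straightforward extension.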

Figure 3: Claude 3 significantly outperforms GPT-4 in time series analysis (image by author)

In testing, Claude 3 Opus significantly outperformed GPT-4 in detecting time series anomalies. Given the nature of the testing, it’s unlikely that this specific evaluation was included in the training set of Claude 3, which makes its strong performance even more impressive.

Results with 50% Spike

Our first set of results is based on data where each anomaly was a 50% spike in the data.

