
A toy example of bulk inference on commodity hardware using Python, via llama.cpp and PySpark.

This exercise is about using Llama 2, an LLM (Large Language Model) from Meta AI, to summarize many documents at once. The scalable summarization of unstructured, semi-structured, and structured text can exist as a feature by itself, and also be part of data pipelines that feed into downstream machine learning models.
Specifically, we would like to prove the simultaneous feasibility of:
- Running Llama 2 on CPUs (i.e., removing GPU capability constraints)
- Smooth integration of an LLM with Apache Spark (a key part of Big Data ecosystems)
- No usage of third-party endpoints (i.e., models must run locally due to air-gapped infrastructure or confidentiality requirements)
Much of the hard work has already been done for us!
The llama.cpp project enables running simplified LLMs on CPUs by reducing the resolution (“quantization”) of their numeric weights. These ready-to-use model files are readily available.
Next, the llama-cpp-python bindings provide easy access to using llama.cpp from inside Python.
Finally, Spark’s applyInPandas() (docs) enables splitting giant data sources into Pandas-sized chunks and processing them independently. Note that this approach can be an anti-pattern if vectorized Spark functions can accomplish the same result, but in our case, we’re mainly using Spark as a simple orchestrator to scale out our llama.cpp usage. There are likely more efficient ways to use llama.cpp in batch processing, but this one is attractive given the simplicity and automatic benefits of Spark’s fault tolerance and scalability.
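To make the pattern concrete, here’s a minimal toy sketch of applyInPandas() (no LLM involved): a function that receives each group as a Pandas DataFrame and returns a Pandas DataFrame matching a declared schema.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 5.0)], ["key", "value"])

# Spark hands each group to this function as a plain Pandas DataFrame,
# and expects a Pandas DataFrame back that matches the declared schema.
def mean_per_key(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"key": [pdf["key"].iloc[0]], "mean": [pdf["value"].mean()]})

df.groupBy("key").applyInPandas(mean_per_key, schema="key string, mean double").show()
```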
As a fun test, we’ll be using Llama 2 to summarize Leo Tolstoy’s War and Peace, a 1200+ page novel with over 360 chapters. We’ll treat each chapter as a document. Note that Llama 2 already “knows” about the novel; asking it about a key character generates this output (using llama-2-7b-chat.ggmlv3.q8_0.bin):
Prompt:
Briefly describe the character Anna Pavlovna from 'War and Peace'

Response:
Anna Pavlovna is a major character in Leo Tolstoy's novel "War and Peace".
She is a wealthy, intelligent, and charming socialite who is known for her beauty and wit.
Despite her beauty and charm, she is also portrayed as manipulative and cunning, often using her intelligence and wit to get what she wants.
Throughout the novel, Anna Pavlovna is shown to be a complex and multifaceted character, with both admirable and flawed qualities.
She is particularly close to the main character Pierre Bezukhov, and plays an important role in his personal and emotional development.
Steps:
- Install the 7B quantized chat model and llama-cpp-python.
- Download the novel, split by chapter, and create a Spark DataFrame.
- Partition by chapter and generate summaries.
Configuring a Spark cluster is outside our scope; I’ll assume you have Spark running locally, through a managed service (like Synapse or Elastic MapReduce), or a custom deployment like Kubernetes.
There are two artifacts that need to be installed on all worker nodes, whether those nodes are physical machines, VMs, or pods in a serverless pool:
- Llama 2 model in GGML format (located in /models)
- The llama-cpp-python module (installed via pip)
We’re using the 7B chat “Q8” version of Llama 2, found here. The download links might change, but a single-node, “bare metal” setup is similar to below:
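Something along these lines should work; the Hugging Face URL is an assumption, so substitute wherever the GGML files are currently hosted.

```bash
# Download the quantized GGML model to /models (URL is illustrative and may change)
sudo mkdir -p /models
sudo wget -O /models/llama-2-7b-chat.ggmlv3.q8_0.bin \
  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin

# Install the Python bindings on every node
pip install llama-cpp-python
```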
Make sure you can use the model via python3 and this example. To recap, every Spark context must be able to read the model from /models and access the llama-cpp-python module.
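A minimal smoke test might look like this (the prompt and token limit are arbitrary):

```python
from llama_cpp import Llama

# Load the quantized model from the shared location
llm = Llama(model_path="/models/llama-2-7b-chat.ggmlv3.q8_0.bin")

# Generate a short completion to confirm everything works
output = llm("Q: Name the planets in the solar system. A: ", max_tokens=64)
print(output["choices"][0]["text"])
```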
The Bash commands below download the novel and print word counts.
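A sketch, assuming the Project Gutenberg URL below is still valid:

```bash
# War and Peace is Project Gutenberg ebook #2600
wget https://www.gutenberg.org/files/2600/2600-0.txt -O war_and_peace.txt
wc -w war_and_peace.txt
```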
Next, we read the text file in Python, removing the Project Gutenberg header and footer. We’ll split on the regex CHAPTER .+ to create a list of chapter strings and create a Spark DataFrame from them (this code assumes a SparkSession named spark).
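A sketch of that step; the Gutenberg START/END marker strings are assumptions, so verify them against your copy of the file.

```python
import re

# Read the novel and strip the Project Gutenberg header/footer
# (marker strings are assumptions; check your copy of the file)
with open("war_and_peace.txt", encoding="utf-8") as f:
    text = f.read()
text = text.split("*** START OF THE PROJECT GUTENBERG EBOOK")[1]
text = text.split("*** END OF THE PROJECT GUTENBERG EBOOK")[0]

# Split on chapter headings; the first element is pre-chapter front matter
chapters = re.split(r"CHAPTER .+", text)[1:]

print(f"number of chapters = {len(chapters)}")
print(f"max words per chapter = {max(len(c.split()) for c in chapters)}")

# One row per chapter (assumes an existing SparkSession named `spark`)
df_chapters = spark.createDataFrame(
    [(c, i + 1) for i, c in enumerate(chapters)], ["text", "chapter"]
)
df_chapters.show(10, truncate=60)
```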
The code should produce the following output:
number of chapters = 365
max words per chapter = 3636

+------------------------------------------------------------+-------+
|                                                        text|chapter|
+------------------------------------------------------------+-------+
|\n\n“Well, Prince, so Genoa and Lucca are now just family...|      1|
|\n\nAnna Pávlovna’s drawing room was gradually filling. T...|      2|
|\n\nAnna Pávlovna’s reception was in full swing. The spin...|      3|
|\n\nJust then another visitor entered the drawing room: P...|      4|
|\n\n“And what do you think of this latest comedy, the cor...|      5|
|\n\nHaving thanked Anna Pávlovna for her charming soiree,...|      6|
|\n\nThe rustle of a lady’s dress was heard in the next ro...|      7|
|\n\nThe friends were silent. Neither cared to begin talki...|      8|
|\n\nIt was past one o’clock when Pierre left his friend. ...|      9|
|\n\nPrince Vasíli kept the promise he had given to Prince...|     10|
+------------------------------------------------------------+-------+
Great! Now we have a DataFrame with 365 rows, each containing the full chapter text and number. The final step is creating a new DataFrame with summaries of each chapter.
Below is the Python code for generating a single chapter summary (see the call to limit(1) to return a single row). Explanation below the snippet:
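(A reconstruction; the prompt template, context size, and sampling values are illustrative choices rather than the only ones that work.)

```python
import pandas as pd
from llama_cpp import Llama

def llama2_summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Called by Spark once per group; here each group is a single chapter."""
    # Load the model on every call — simple, but not very efficient (see note below)
    llm = Llama(
        model_path="/models/llama-2-7b-chat.ggmlv3.q8_0.bin",
        n_ctx=4096,  # the longest chapters may need truncation or a larger context
    )

    # Wrap the chapter text in a Llama 2 chat-style prompt (template is an assumption)
    chapter_text = df.iloc[0]["text"]
    prompt = (
        "<s>[INST] <<SYS>>\nYou are a helpful assistant that summarizes text.\n<</SYS>>\n"
        f"Summarize the following novel chapter in a single paragraph:\n\n{chapter_text} [/INST]"
    )

    output = llm(prompt, max_tokens=1024, temperature=0.1)
    return pd.DataFrame(
        {"summary": [output["choices"][0]["text"]], "chapter": [df.iloc[0]["chapter"]]}
    )

# limit(1) restricts this test run to a single chapter
summaries = (
    df_chapters.limit(1)
    .groupby("chapter")
    .applyInPandas(llama2_summarize, schema="summary string, chapter long")
)
summaries.show(vertical=True, truncate=False)
```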
The llama2_summarize() function is the code that’s applied per-group by Spark. Since we’re grouping by the chapter column, the function is called on each chapter row; the df argument is simply a Pandas DataFrame with a single row. Note that we’re reading the model for every call of llama2_summarize(); this is a shortcut we’re taking for simplicity, but it’s not very efficient.
Finally, using Spark we do the groupby() and call applyInPandas(), setting the schema to include the chapter summary and number.
The output (reformatted for readability) looks like this:
summary
The chapter is about a conversation between Prince Vasíli Kurágin and
Anna Pávlovna Schérer, a well-known socialite and favorite
of Empress Márya Fëdorovna.
They are discussing various political matters, including the likelihood
of war with France and Austria's role in the conflict.
Prince Vasíli is hoping to secure a post for his son through
the Dowager Empress, while Anna Pávlovna is enthusiastic
about Russia's potential to save Europe from Napoleon's tyranny.
The conversation also touches on personal matters,
such as Prince Vasíli's dissatisfaction with his younger son
and Anna Pávlovna's suggestion that he marry off
his profligate son Anatole to a wealthy heiress.

chapter
1
(Note the use of Napoleon despite the fact that it doesn’t occur in the chapter! Again, this is a fun exercise rather than a realistic example using truly unseen documents.)
The runtime for this single chapter test is about 2 minutes on a 64-core VM. There are many choices we glossed over that affect runtime, such as model size/quantization and model parameters. The key result is that by scaling out our Spark cluster appropriately, we can summarize all chapters in a handful of minutes. Processing hundreds of thousands (or even millions!) of documents every day is thus possible using large Spark clusters composed of low-cost virtual machines.
We haven’t even mentioned adjusting the standard LLM parameters like temperature and top_p, which control the “creativity” and randomness of results, or prompt engineering, which is practically a discipline of its own. We also chose the Llama 2 7B model without justification; there may be smaller and more performant models or model families better suited to our particular use case.
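For reference, those sampling knobs plug straight into the llama-cpp-python call; the values here are arbitrary:

```python
# Higher temperature/top_p => more varied, less deterministic output
output = llm(prompt, max_tokens=1024, temperature=0.7, top_p=0.9)
```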
Instead, we’ve shown how to easily distribute (quantized) LLM workloads using Spark with fairly minimal effort. Next steps might include:
- More efficient loading/caching of models
- Parameter optimization for various use cases
- Custom prompts