
Running Local LLMs and VLMs on the Raspberry Pi


Get models like Phi-2, Mistral, and LLaVA running locally on a Raspberry Pi with Ollama

Towards Data Science
Host LLMs and VLMs using Ollama on the Raspberry Pi — Source: Author

Ever considered running your own large language models (LLMs) or vision language models (VLMs) on your own device? You probably have, but the thought of setting things up from scratch, having to manage the environment, downloading the right model weights, and the lingering doubt of whether your device can even handle the model has probably given you pause.

Let’s go one step further than that. Imagine running your own LLM or VLM on a device no larger than a credit card — a Raspberry Pi. Impossible? Not at all. I mean, I’m writing this post after all, so it definitely is possible.

Possible, yes. But why would you even do it?

LLMs at the edge seem quite far-fetched at this point in time. But this particular niche use case should mature over time, and we will definitely see some cool edge solutions being deployed with an all-local generative AI solution running on-device at the edge.

It’s also about pushing the limits to see what’s possible. If it can be done at this extreme end of the compute scale, then it can be done at any level in between a Raspberry Pi and a big, powerful server GPU.

Traditionally, edge AI has been closely linked with computer vision. Exploring the deployment of LLMs and VLMs at the edge adds an exciting dimension to this field that’s just emerging.

Most importantly, I just wanted to do something fun with my recently acquired Raspberry Pi 5.

So, how do we achieve all this on a Raspberry Pi? Using Ollama!

What’s Ollama?

Ollama has emerged as one of the best solutions for running local LLMs on your own personal computer without having to deal with the hassle of setting things up from scratch. With just a few commands, everything can be set up without any issues. Everything is self-contained and works wonderfully in my experience across several devices and models. It even exposes a REST API for model inference, so you can leave it running on the Raspberry Pi and call it from your other applications and devices if you want to.
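To illustrate that REST API, here is a minimal Python sketch (standard library only) that sends a prompt to Ollama’s `/api/generate` endpoint. The localhost URL is Ollama’s default port, and the example assumes you’ve already pulled the `phi` model:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listening address


def build_generate_payload(model, prompt, stream=False):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()


def generate(model, prompt):
    """Send a prompt to a running Ollama instance and return the response text."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=build_generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # With stream=False, Ollama returns one JSON object with a "response" key
        return json.loads(resp.read())["response"]


# Example (requires Ollama running locally with the phi model pulled):
# print(generate("phi", "Why is the sky blue?"))
```

This is what makes the “call it from your other applications and devices” part practical: any device on the network can POST to the Pi’s port 11434.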

Ollama’s Website

There’s also Ollama Web UI, a beautiful piece of AI UI/UX that runs seamlessly with Ollama for those apprehensive about command-line interfaces. It’s basically a local ChatGPT interface, if you will.

Together, these two pieces of open-source software provide what I feel is the best locally hosted LLM experience right now.

Both Ollama and Ollama Web UI support VLMs like LLaVA too, which opens up even more doors for this edge generative AI use case.

Technical Requirements

All you need is the following:

  • Raspberry Pi 5 (or 4 for a less speedy setup) — Opt for the 8GB RAM variant to fit the 7B models.
  • SD Card — Minimally 16GB; the larger the size, the more models you can fit. Have it already loaded with an appropriate OS such as Raspbian Bookworm or Ubuntu
  • An internet connection

As I mentioned earlier, running Ollama on a Raspberry Pi is already near the extreme end of the hardware spectrum. Essentially, any device more powerful than a Raspberry Pi, provided it runs a Linux distribution and has a similar memory capacity, should theoretically be capable of running Ollama and the models discussed in this post.

1. Installing Ollama

To install Ollama on a Raspberry Pi, we’ll avoid using Docker to conserve resources.

In the terminal, run

curl https://ollama.ai/install.sh | sh

You should see something similar to the image below after running the command above.

Source: Author

As the output says, go to http://localhost:11434 to verify that Ollama is running. It’s normal to see the ‘WARNING: No NVIDIA GPU detected. Ollama will run in CPU-only mode.’ message since we’re using a Raspberry Pi. But if you’re following these instructions on something that’s supposed to have an NVIDIA GPU, something didn’t go right.
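If you’d rather script the check than open a browser, here is a small Python sketch; the URL assumes Ollama’s default port (11434):

```python
import urllib.request
import urllib.error


def ollama_is_up(base_url="http://localhost:11434", timeout=2):
    """Return True if a server answers with HTTP 200 at base_url.

    A running Ollama instance replies "Ollama is running" on its root path.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


print(ollama_is_up())  # True if the install succeeded and the server is up
```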

For any issues or updates, refer to the Ollama GitHub repository.

2. Running LLMs through the command line

Take a look at the official Ollama model library for a list of models that can be run using Ollama. On an 8GB Raspberry Pi, models larger than 7B won’t fit. Let’s use Phi-2, a 2.7B LLM from Microsoft, now under MIT license.
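A rough back-of-the-envelope check for why 7B is the ceiling: Ollama’s default downloads are roughly 4-bit quantized, so the weights alone take about half a gigabyte per billion parameters (a sketch under that assumption; real models add runtime overhead on top):

```python
def quantized_size_gb(n_params_billion, bits_per_weight=4):
    """Rough in-memory size of a quantized model's weights, in GB.

    Assumes ~4-bit quantization, which is typical for Ollama's default
    model downloads; actual memory use is higher due to runtime overhead.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


print(round(quantized_size_gb(7), 1))    # ~3.5 GB of weights for a 7B model
print(round(quantized_size_gb(2.7), 1))  # ~1.4 GB for Phi-2
```

A 7B model at ~3.5 GB of weights leaves headroom on an 8GB Pi; anything much larger starts crowding out the OS and the inference runtime.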

We’ll use the default Phi-2 model, but feel free to use any of the other tags found here. Take a look at the model page for Phi-2 to see how you can interact with it.

In the terminal, run

ollama run phi

Once you see something similar to the output below, you already have an LLM running on the Raspberry Pi! It’s that easy.

Source: Author
Here’s an interaction with Phi-2 2.7B. Obviously, you won’t get the same output, but you get the idea. | Source: Author

You can try other models like Mistral, Llama-2, etc.; just make sure there is enough space on the SD card for the model weights.

Naturally, the larger the model, the slower the output will be. On Phi-2 2.7B, I can get around 4 tokens per second. But with a Mistral 7B, the generation speed goes down to around 2 tokens per second. A token is roughly equivalent to a single word.
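Since a token is roughly a word, you can estimate how long a reply will take directly from these numbers. A trivial sketch:

```python
def eta_seconds(n_tokens, tokens_per_second):
    """Estimated wall-clock time to generate n_tokens at a given speed."""
    return n_tokens / tokens_per_second


# A ~100-word answer is roughly 100 tokens:
print(eta_seconds(100, 4))  # 25.0 -> about 25 s on Phi-2 at ~4 tok/s
print(eta_seconds(100, 2))  # 50.0 -> about 50 s on Mistral 7B at ~2 tok/s
```

With streaming output, those speeds feel far more usable than the totals suggest, since you start reading the first words immediately.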

Here’s an interaction with Mistral 7B | Source: Author

Now we have LLMs running on the Raspberry Pi, but we are not done yet. The terminal isn’t for everyone. Let’s get Ollama Web UI running as well!

3. Installing and Running Ollama Web UI

We can follow the instructions on the official Ollama Web UI GitHub repository to install it without Docker. It recommends Node.js >= 20.10, so we will follow that. It also recommends Python >= 3.11, but Raspbian OS already has that installed for us.

We have to install Node.js first. In the terminal, run

curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash - &&
sudo apt-get install -y nodejs

For future readers, change the 20.x to a more appropriate version if need be.

Then run the code block below.

git clone https://github.com/ollama-webui/ollama-webui.git
cd ollama-webui/

# Copying required .env file
cp -RPp example.env .env

# Building Frontend Using Node
npm i
npm run build

# Serving Frontend with the Backend
cd ./backend
pip install -r requirements.txt --break-system-packages
sh start.sh

It’s a slight modification of what’s provided on GitHub. Do note that for simplicity and brevity we are not following best practices like using virtual environments, and we’re using the --break-system-packages flag. If you encounter an error like uvicorn not being found, restart the terminal session.

If all goes correctly, you should be able to access Ollama Web UI on port 8080 through http://localhost:8080 on the Raspberry Pi itself, or through http://<raspberry-pi-ip>:8080 if you are accessing it from another device on the same network.

If you see this, yes, it worked | Source: Author

Once you’ve created an account and logged in, you should see something similar to the image below.

Source: Author

If you downloaded some model weights earlier, you should see them in the dropdown menu like below. If not, you can go to the settings to download a model.

Available models will appear here | Source: Author
If you want to download new models, go to Settings > Models to pull models | Source: Author

The entire interface is very clean and intuitive, so I won’t explain much about it. It’s truly a very well-done open-source project.

Here’s an interaction with Mistral 7B through Ollama Web UI | Source: Author

4. Running VLMs through Ollama Web UI

As I mentioned at the start of this article, we can also run VLMs. Let’s run LLaVA, a popular open-source VLM which also happens to be supported by Ollama. To do so, download the weights by pulling ‘llava’ through the interface.

Unfortunately, unlike LLMs, it takes quite a while for the setup to interpret the image on the Raspberry Pi. The example below took around 6 minutes to be processed. The bulk of the time is probably because the image side of things is not properly optimised yet, but this will definitely change in the future. The token generation speed is around 2 tokens/second.
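For the curious, VLMs work over the REST API too: Ollama’s `/api/generate` endpoint accepts images as base64 strings in an `images` field alongside the prompt. A minimal Python sketch of the request body (the file path in the comment is hypothetical):

```python
import base64
import json


def build_llava_payload(prompt, image_bytes):
    """Build the /api/generate JSON body for a VLM request.

    Ollama expects each image base64-encoded inside an `images` list.
    """
    return json.dumps({
        "model": "llava",
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode()],
        "stream": False,
    })


# Example with a local file:
# with open("photo.jpg", "rb") as f:
#     body = build_llava_payload("What is in this picture?", f.read())
```

The same body can then be POSTed to http://localhost:11434/api/generate like any text-only request.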

Query Image Source: Pexels

To wrap it all up

At this point we’re pretty much done with the goals of this article. To recap, we’ve managed to use Ollama and Ollama Web UI to run LLMs and VLMs like Phi-2, Mistral, and LLaVA on the Raspberry Pi.

I can definitely imagine quite a few use cases for locally hosted LLMs running on the Raspberry Pi (or other small edge devices), especially since 4 tokens/second does seem like an acceptable speed with streaming for some use cases if we’re going for models around the size of Phi-2.

The field of ‘small’ LLMs and VLMs, somewhat paradoxically named given their ‘large’ designation, is an active area of research with quite a few model releases recently. Hopefully this emerging trend continues, and more efficient and compact models continue to be released! Definitely something to keep an eye on in the coming months.

Disclaimer: I have no affiliation with Ollama or Ollama Web UI. All views and opinions are my own and do not represent any organisation.
