
Run a Locally Hosted AI

A concise guide to deploying Open WebUI with bundled Ollama support via Docker on Ubuntu 24.04, with both GPU and CPU-only configurations.


Quick Start Guide

If you are reading this guide, you probably already know the benefits of running a locally hosted AI, so I will not cover that topic here. This guide focuses on running Open WebUI with bundled Ollama support via Docker on an Ubuntu 24.04 server. Depending on your setup, some steps may vary. This guide generally follows the official one: https://docs.openwebui.com/getting-started/

I am using Portainer to manage my containers. You could use Docker Compose instead, but I recommend checking out Portainer if you have not used it before. Before starting, you should have Docker and Portainer installed and running.

Docker NVIDIA Toolkit

If you plan to use a GPU, you will need a few extra steps. First, ensure you have the latest NVIDIA drivers installed.
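A quick way to confirm the driver is working is to ask `nvidia-smi` for the GPU name and driver version. The `check_driver` helper below is just a hypothetical convenience wrapper, not part of the toolkit:

```shell
# check_driver prints GPU info when nvidia-smi is available,
# otherwise prints a hint that the driver still needs installing.
check_driver() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # CSV output: a header line, then one line per GPU
    nvidia-smi --query-gpu=name,driver_version --format=csv
  else
    echo "nvidia-smi not found: install the NVIDIA driver first"
  fi
}
check_driver
```

If this errors out or shows nothing, sort the driver out before continuing; the container toolkit cannot help without a working driver underneath.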

Install the Docker NVIDIA Toolkit by first adding the applicable sources:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
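The `sed` in that pipeline just injects a `signed-by` attribute into each `deb https://...` line so apt trusts the key you imported. Here is the same rewrite applied to a made-up repo line, so you can see what it does before piping it into `tee`:

```shell
# Demonstrates the rewrite applied to each repo line (example.com is a stand-in URL)
echo 'deb https://example.com/stable/deb/ /' | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
```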

Then update apt:

sudo apt update

Then install the package:

sudo apt install -y nvidia-container-toolkit

Then add the nvidia runtime for docker:

sudo nvidia-ctk runtime configure --runtime=docker

And finally, restart Docker:

sudo systemctl restart docker
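The configure step edits `/etc/docker/daemon.json`. After the restart, it should contain a runtime entry roughly like this (exact contents can vary by toolkit version):

```json
{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
```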

Open WebUI Container

Now open up Portainer and create a stack called open-webui, then paste the following into the web editor for GPU support:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:ollama
    container_name: open-webui
    restart: always
    ports:
      - "3000:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama:/root/.ollama
      - open-webui:/app/backend/data
    runtime: nvidia
 
volumes:
  ollama:
  open-webui:

Or for CPU only:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:ollama
    container_name: open-webui
    restart: always
    ports:
      - "3000:8080"
    volumes:
      - ollama:/root/.ollama
      - open-webui:/app/backend/data
 
volumes:
  ollama:
  open-webui:

The initial deployment can take a while because the image is quite large, so give it some time before getting worried. If no error pops up, it is probably fine. Once it is deployed, you can access it at http://localhost:3000, or at http://SERVER_IP:3000 if it is running on another server.
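If you would rather watch the deployment from a terminal than the Portainer UI, a couple of standard Docker commands help (assuming the container name `open-webui` from the stack above):

```shell
# Follow the container logs while the image downloads and the app starts
docker logs -f open-webui

# Once it is up, the web server should answer on the mapped port
curl -I http://localhost:3000
```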

Getting Started with Open WebUI

You will be prompted to create an account. This account is created locally, so no worries about your data being sent externally. After that, it is time to load up a model! Head over to https://ollama.com/library and click the link for llama3.2 (or whatever the latest model is when you read this). On the left, select whether you want the 1b or 3b model, then find the name of the model (underlined in red) on the right and copy it.

Ollama library page showing the llama3.2 model with 1B and 3B parameter tags, 128K context length, and the command 'ollama run llama3.2' to run the model
Ollama Library Page - Llama 3.2 Model

Now go back to Open WebUI, at the top right click your avatar, then click "Admin Panel". Then click the "Settings" tab.

Open WebUI Admin Panel showing the Settings page with General Settings section, including sidebar navigation with options for Models, Connections, Documents, Web Search, Interface, Audio, Images, and Pipelines
Open WebUI Admin Panel Settings

Select "Models". Then, in the input for entering a model tag, paste the llama model name you copied:

Open WebUI input field showing 'Pull a model from Ollama.com' with 'llama3.2:3b' entered as the model name to download
Open WebUI Pull Model Input Field

Then on the far right click the download button to pull the model in.

Open WebUI showing download progress bar at 52.1% while pulling the llama3.2:3b model, with file size indicator showing progress
Llama 3.2 Model Download Progress
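Since this image bundles Ollama, you can also manage models from the command line instead of the admin panel (assuming the container name `open-webui` from the stack above):

```shell
# Pull the model using the ollama CLI inside the container
docker exec -it open-webui ollama pull llama3.2:3b

# List the models that are now available locally
docker exec -it open-webui ollama list
```

Either way, the model ends up in the `ollama` volume, so it survives container restarts.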

Now click "New Chat" at the top left, and you should be able to select a model.

Open WebUI 'Select a model' dropdown menu showing llama3.2:3b (3.2B parameters) as an available model option
Open WebUI Model Selector Dropdown

Select the llama model and ask it a question. It should respond, though it could take some time depending on your system; if it is taking too long, try the 1b model.

Open WebUI chat interface showing a 'Hello' message sent to the llama3.2:3b model, which responds with 'How can I assist you today?' confirming the local AI is working
Open WebUI Chat Response from Llama 3.2

If you want to see how much the model is using your GPUs you can install nvtop:

sudo apt install nvtop

And then run

nvtop

Now ask it a more complex question to see the GPUs get some load.

nvtop terminal application showing GPU usage graph with a spike in utilization during AI model inference, displaying real-time GPU metrics
nvtop GPU Monitoring During AI Inference

I am running two 4090s, so with a 3b model the responses are almost instant. And that's it! Go ahead and try some more models; you can download a bunch and easily switch between them (or even run two at the same time). There is a lot more you can do with Open WebUI, so take a look at the docs. Some interesting things I plan to cover are how to integrate SearXNG and image generation.