Deploying Llama 3.1 on UpCloud (NVIDIA L4) with llama.cpp


Implementation Guide — European Infrastructure Edition

Why UpCloud?

UpCloud is Finnish-headquartered, ISO 27001 certified, and runs GDPR-native infrastructure. For European clients with data residency requirements, or companies aligning with the emerging Cloud and AI Development Act (CADA) and related EU digital sovereignty initiatives, it is a practical alternative to US hyperscalers. The NVIDIA L4 GPU on UpCloud is also a step up from the T4 in the AWS g4dn.xlarge: newer architecture (Ada Lovelace vs Turing), 24 GB VRAM vs 16 GB (about 15 GB usable), and measurably faster on quantized models, as the benchmarks in this guide show.

Hardware Comparison: AWS g4dn.xlarge vs UpCloud NVIDIA L4

| | AWS g4dn.xlarge | UpCloud NVIDIA L4 |
|---|---|---|
| GPU | NVIDIA T4 (Turing, 2018) | NVIDIA L4 (Ada Lovelace, 2023) |
| VRAM | 15 GB usable | 24 GB |
| vCPUs | 4 | 8 |
| RAM | 16 GB | 64 GB |
| Generation throughput * | 34.36 tok/s | 50.53 tok/s |
| Prompt throughput * | 1,093 tok/s | 3,038.68 tok/s |
| VRAM used (8B Q4_K_M) * | 5.2 GB / 15 GB | 4.58 GiB / 22 GiB |
| Data residency | US-based (configurable) | EU-native (Finland – Helsinki 2) |
| GDPR / Data Residency | Requires configuration | EU-native by default |
| Hourly price | ~$0.53/hr on-demand | ~€0.616/hr |
| Firewall | Included (Security Groups) | Add-on (~€2/month extra) |
| Setup time (build + drivers) | ~90 minutes | ~7 minutes |

* Both sets of numbers are real benchmarks — AWS T4 from the [published AWS guide](https://gizmojack.com/how-to-deploy-llama-3-1-on-aws-ec2-g4dn-xlarge-for-under-1-hour-a-complete-guide/), UpCloud L4 measured during the writing of this guide.

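For EU clients: The L4’s 24 GB of VRAM, together with the server’s 64 GB of RAM, lets you experiment with Llama 3.1 70B at Q4_K_M by offloading part of the model to the GPU and keeping the rest in system memory. The ~40 GB model does not fit entirely in 24 GB of VRAM, so expect far lower throughput than with the 8B model. The g4dn.xlarge, with 16 GB of RAM and a 16 GB GPU, is not a realistic host for it.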

Cost Reality Check

For a dev/learning exercise (a few hours), the cost is minimal:

| Item | Cost |
|---|---|
| GPU server (1xL4) | €0.616/hr |
| Firewall (optional add-on) | ~€0.003/hr (€2/month prorated) |
| Storage (100 GB, while running) | Included in server price |
| Storage (while server is stopped) | Small per-GB charge (check UpCloud pricing) |
| Total for a 3-hour session | ~€1.85 |

Key difference from AWS: UpCloud does not have spot instances. There’s no equivalent of AWS’s 60–70% spot discount. The tradeoff is predictable pricing with no interruptions – better for anything beyond a quick test. When the server is stopped (not deleted), you only pay for the storage, so leave it stopped between sessions and delete when fully done.
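The 3-hour figure in the table is plain arithmetic, sketched below as a small shell function. This is a minimal sketch, not an UpCloud tool: `session_cost` is a hypothetical helper, and the rates are simply the figures quoted in this guide (€0.616/hr for the server, ~€0.003/hr for the prorated firewall), so adjust them to current pricing.

```shell
#!/bin/sh
# Hypothetical helper: estimate the cost of a session in EUR.
# Rates are the figures quoted in this guide, not values read from UpCloud.
GPU_RATE_EUR=0.616        # 1 x NVIDIA L4 server, per hour
FIREWALL_RATE_EUR=0.003   # optional firewall add-on, ~EUR 2/month prorated per hour

# usage: session_cost HOURS [WITH_FIREWALL: 0|1]
session_cost() {
  awk -v h="$1" -v fw="${2:-0}" -v gpu="$GPU_RATE_EUR" -v fwr="$FIREWALL_RATE_EUR" \
    'BEGIN { printf "%.2f\n", h * (gpu + (fw ? fwr : 0)) }'
}

session_cost 3      # 3-hour session, no firewall: prints 1.85
session_cost 3 1    # same session with the firewall add-on: prints 1.86
```

Storage charges while the server is stopped are not modelled here, since the guide does not quote a rate for them.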

Prerequisites

  • UpCloud account with billing enabled (upcloud.com)
  • UpCloud CLI (`upctl`) installed – OR use the UpCloud Control Panel web UI
  • An SSH key added to your UpCloud account
  • A Hugging Face account (free), needed to download Llama 3.1:
    • Go to https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct and accept the license
    • Then: HF Settings → Access Tokens → New token (read scope) → copy it

Install the UpCloud CLI (upctl)

Bash
# macOS
brew install UpCloudLtd/tap/upctl

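# Linux (or WSL)
# Release asset names normally include the version number, so this unversioned URL may return 404.
# If it does, copy the Linux x86_64 tarball link from https://github.com/UpCloudLtd/upcloud-cli/releases
# (-f makes curl fail on an HTTP error instead of saving the error page)
curl -fLo upctl.tar.gz https://github.com/UpCloudLtd/upcloud-cli/releases/latest/download/upctl_linux_amd64.tar.gz
tar -xzf upctl.tar.gz
sudo mv upctl /usr/local/bin/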

# Authenticate
upctl account login
# Enter your UpCloud username and password when prompted

Step 1: Launch the GPU Server

UpCloud provides Ubuntu Server 24.04 LTS (with NVIDIA drivers & CUDA) as a ready-made template – the equivalent of AWS’s Deep Learning AMI. CUDA drivers come pre-installed, so you skip the manual driver setup entirely.

Via UpCloud Control Panel (recommended for first-timers)

  1. Log in at hub.upcloud.com
  2. Servers → GPU Servers → Deploy GPU Server
  3. Choose 1 x NVIDIA L4 / 8 cores / 64 GB (€0.616/hr)
  4. Region: select Finland – Helsinki 2 – currently the only zone where GPU servers are available
  5. OS: select “Ubuntu Server 24.04 LTS (with NVIDIA drivers & CUDA)”
  6. Storage: set to 100 GB MaxIOPS
  7. SSH keys: add your public key
  8. Firewall: optionally add the UpCloud firewall add-on (~€2/month) – see Step 2 for the alternative
  9. Click Deploy

Via UpCloud CLI

Bash
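# List available server plans and filter for GPU plans
upctl server plans | grep -i gpu

# List public OS templates and look for Ubuntu 24.04 with CUDA
upctl storage list --public --template | grep -i "24.04"

# Create the GPU server.
# The plan and template names are the ones used in this guide; confirm them against the two listings above.
# Flag names can differ between upctl releases, so check `upctl server create --help` first.
# --ssh-keys expects a public key file (or the key text), not a key name.
# To pick the MaxIOPS tier explicitly, use the storage option listed in that help output.
upctl server create \
  --hostname llama3-inference \
  --plan GPU-1xL4-8-64 \
  --zone fi-hel2 \
  --os "Ubuntu Server 24.04 LTS (with NVIDIA drivers & CUDA)" \
  --ssh-keys ~/.ssh/your-key.pub \
  --os-storage-size 100 \
  --title "Llama3 Inference Server"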

Available zone for GPU servers: “fi-hel2” – Helsinki 2, Finland (currently the only zone where GPU servers can be created)

Other UpCloud zones (Amsterdam, Warsaw, Stockholm) do not yet support GPU instances. This may change as UpCloud expands GPU availability — check the Control Panel for the latest. For most EU data residency and GDPR purposes, Finland is fully compliant.

Why 100 GB storage? Ubuntu 24.04 plus the pre-installed CUDA drivers take ~15–20 GB. The Llama 3.1 8B Q4_K_M model is ~4.7 GB. Build tools and headroom account for the rest. If you plan to try the 70B model (a ~40 GB download), bump to 150 GB.
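Once you are logged in to the server (Step 3), the storage budget can be checked before starting a large download. The sketch below relies only on POSIX `df`; `has_free_gb` is a hypothetical helper, and the 50 GiB threshold is an assumed margin on top of the ~40 GB 70B download.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the filesystem holding PATH_TO_CHECK
# has at least MIN_GB GiB available.
# usage: has_free_gb MIN_GB [PATH_TO_CHECK]
has_free_gb() {
  # df -Pk prints one header line, then one line per filesystem; column 4 is "Available" in KiB
  avail_kb=$(df -Pk "${2:-.}" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( $1 * 1024 * 1024 )) ]
}

# Example: check the home directory, where ~/models lives in this guide
if has_free_gb 50 "$HOME"; then
  echo "Enough space for the 70B model"
else
  echo "Less than 50 GiB free: resize the disk or stick to the 8B model"
fi
```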

Get the server’s public IP

Bash
upctl server show llama3-inference
# Look for "IP addresses" in the output

For Control Panel users: open Servers in the left menu, click the server in the list of running servers, and copy the IP address from the page that opens.

Step 2: Firewall / Port Access

UpCloud’s firewall is an optional paid add-on (~€2/month). You have two options:

Option A: Use the UpCloud Firewall add-on (recommended for production)

Enable it during server creation in the Control Panel, then add rules for ports 22 (SSH) and 8080 (inference).

Via CLI:

Bash
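# UpCloud firewall rules are defined per server; there is no standalone firewall object to create.
# Verify the subcommand and flag names with `upctl server firewall create --help` before running.

# Allow SSH
upctl server firewall create llama3-inference \
  --direction in --action accept --family IPv4 --protocol tcp \
  --destination-port-start 22 --destination-port-end 22

# Allow inference port
upctl server firewall create llama3-inference \
  --direction in --action accept --family IPv4 --protocol tcp \
  --destination-port-start 8080 --destination-port-end 8080

# Then enable the firewall on the server (Control Panel: server settings, Firewall) so the rules take effect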

Option B: Use Ubuntu’s built-in firewall (ufw) – free, no add-on needed

Bash
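# Run everything below on the server, after SSH-ing in (Step 3)

# Make sure the ufw service is enabled and running
systemctl enable ufw
systemctl start ufw

# Allow SSH before enabling the firewall, so you do not lock yourself out
ufw allow 22/tcp
ufw allow 8080/tcp
ufw enable
ufw status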

Please note!

For a dev exercise, you can go with ufw or even no firewall at all (my instance ran for only ~25 minutes during this exercise, and for a lifetime that short even ufw adds little). A longer-lived dev instance should have at least ufw configured. For anything production-facing, the extra ~€2/month is worth paying: ufw runs on the machine itself and uses some of its resources while working. For your peace of mind, choose the UpCloud Firewall when setting up a production-facing instance.

Step 3: Connect and Verify GPU

Bash
ssh -i ~/.ssh/your-key root@<PUBLIC_IP>

Because we used the Ubuntu 24.04 with NVIDIA drivers & CUDA template, the drivers are already installed. Verify immediately:

Bash
nvidia-smi
[Screenshot: nvidia-smi output showing NVIDIA L4, 24 GB VRAM, driver version, CUDA version]

Expected output: NVIDIA L4, 24 GB VRAM, driver version, CUDA version shown.
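For scripted setups, the same check can be automated. The sketch below assumes `nvidia-smi`'s `--query-gpu` CSV output; `gpu_mem_ok` is a hypothetical helper, and the 20000 MiB threshold is an arbitrary value chosen only to tell a 24 GB L4 apart from smaller cards.

```shell
#!/bin/sh
# Hypothetical helper: check one "name, memory" CSV line as printed by
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
# Returns 0 when the reported memory (MiB) is at least MIN_MIB.
# usage: gpu_mem_ok "CSV_LINE" MIN_MIB
gpu_mem_ok() {
  printf '%s\n' "$1" | awk -F', *' -v min="$2" '
    NF >= 2 && $2 + 0 >= min + 0 { ok = 1 }
    END { exit ok ? 0 : 1 }'
}

# On the server, feed it the live value (guarded so the snippet also runs on a machine without a GPU)
if command -v nvidia-smi >/dev/null 2>&1; then
  line=$(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits | head -n 1)
  if gpu_mem_ok "$line" 20000; then
    echo "GPU OK: $line"
  else
    echo "Unexpected GPU: $line" >&2
  fi
fi
```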

Bash
# Verify CUDA toolkit is available
nvcc --version

Compared to AWS: In the AWS guide, the Deep Learning AMI also came with drivers pre-installed. UpCloud’s Ubuntu 24.04 CUDA template gives you the same convenience, but on EU infrastructure and with a more current Ubuntu LTS base.

Step 4: Install Build Dependencies and Clone llama.cpp

Bash
# Update the system
apt update && apt upgrade -y

# Install build tools
apt install -y git cmake gcc g++ make python3-pip tmux

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Step 5: Build llama.cpp with CUDA Support

Bash
mkdir build && cd build

# Configure with CUDA enabled
cmake .. -DGGML_CUDA=ON

# Build (uses all 8 cores — takes ~3–5 min)
cmake --build . --config Release -j$(nproc)

What -DGGML_CUDA=ON does: Compiles GPU kernels so matrix operations run on the L4 instead of CPU. Without this flag, inference is 5-10x slower and ignores the GPU entirely. This is the key difference between a “happens to run” and a “properly deployed” LLM.

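Setup time comparison: On AWS g4dn.xlarge, building llama.cpp with CUDA took approximately 90 minutes. On the UpCloud NVIDIA L4 server, the same build completed in approximately 7 minutes. Compilation runs on the CPU, so the GPU architecture is not what speeds it up: the 8 cores vs 4 on the T4 instance account for part of the gap, and differences in CPU generation, storage speed, and llama.cpp version likely explain the rest. For a hands-on exercise or a client demo, that difference matters.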

Verify the CUDA build succeeded:

Bash
./bin/llama-cli --version
# Should mention CUDA in the build info

Step 6: Download Llama 3.1 8B (Quantized)

Bash
cd ~

# Ubuntu 24.04 locks down system Python — use pipx instead
apt install -y pipx
pipx ensurepath
source ~/.bashrc

# Install HuggingFace CLI
pipx install "huggingface_hub[cli]"

# Log in with your HF token
hf auth login
# Paste your token when prompted

# Create model directory
mkdir -p ~/models

# Download Q4_K_M quantized GGUF (~4.7 GB)
hf download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
  --local-dir ~/models

What Q4_K_M means: 4-bit quantization, K-quant method, medium size variant. Full fp16 would need ~16 GB VRAM. Q4_K_M fits in ~5 GB VRAM with minimal quality loss – on the L4’s 24 GB you have 19 GB of VRAM headroom to spare.
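The sizes quoted above follow from parameters times bits per weight. Here is a rough sketch, assuming ~8.03 billion parameters for Llama 3.1 8B, ~70.6 billion for the 70B model, and an average of ~4.9 bits per weight for Q4_K_M (back-calculated from the 4.58 GiB file measured in this guide, not an official figure). `model_gb` is a hypothetical helper, and it ignores the KV cache and runtime buffers.

```shell
#!/bin/sh
# Hypothetical helper: approximate size of the model weights in GB (10^9 bytes).
# usage: model_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
model_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

model_gb 8.03 16    # Llama 3.1 8B at fp16: prints 16.1
model_gb 8.03 4.9   # Llama 3.1 8B at Q4_K_M (assumed ~4.9 bits/weight): prints 4.9
model_gb 70.6 4.9   # Llama 3.1 70B at Q4_K_M: prints 43.2, more than the L4's 24 GB of VRAM
```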

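Bonus: Download Llama 3.1 70B (needs partial GPU offload)

Bash
# ~40 GB download. The model does not fit in the L4's 24 GB of VRAM, so run it with a
# lower --n-gpu-layers value and let the remaining layers run from the 64 GB of system RAM.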
hf download \
  bartowski/Meta-Llama-3.1-70B-Instruct-GGUF \
  --include "Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf" \
  --local-dir ~/models

Step 7: Start the Inference Server

Bash
cd ~/llama.cpp

./build/bin/llama-server \
  --model ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 999 \
  --ctx-size 8192 \
  --parallel 4

Flag Breakdown:

  • `--n-gpu-layers 999` : offload all layers to the GPU (999 exceeds the model’s layer count, so every layer is offloaded; it does not auto-fit, and loading fails if they do not fit in VRAM)
  • `--ctx-size 8192` : total context size in tokens (doubled vs the T4 guide, since the L4’s VRAM allows it)
  • `--parallel 4` : handle 4 simultaneous requests (doubled, thanks to more RAM/VRAM headroom)
  • `--host 0.0.0.0` : bind to all interfaces (needed for external access)
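To see how the context size and the number of parallel slots interact with VRAM, here is a back-of-the-envelope sketch. It assumes the published Llama 3.1 8B architecture (32 layers, 8 KV heads, head dimension 128), an fp16 KV cache, and that llama-server splits the total context evenly across parallel slots. That split is how llama-server has traditionally behaved, but check `llama-server --help` for your build. Both function names are hypothetical.

```shell
#!/bin/sh
# Assumed Llama 3.1 8B architecture constants (from the published model config)
N_LAYERS=32
N_KV_HEADS=8
HEAD_DIM=128
BYTES_PER_VALUE=2   # fp16 KV cache (assumption)

# KV cache size in MiB for a given total context size (keys + values => factor 2)
kv_cache_mib() {
  echo $(( 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * $1 / 1024 / 1024 ))
}

# Tokens available to each request when the context is split evenly across slots
ctx_per_slot() {
  echo $(( $1 / $2 ))
}

kv_cache_mib 8192     # prints 1024 (about 1 GiB on top of the model weights)
ctx_per_slot 8192 4   # prints 2048
```

Under these assumptions, the Step 7 settings cost roughly 1 GiB of VRAM for the cache, and each of the 4 concurrent requests gets about 2048 tokens of context rather than the full 8192.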

You should see a log line reporting that the server is listening on http://0.0.0.0:8080, like so:

[Screenshot: Server listening on port 8080]

Step 8: Test the Endpoint

Open a second terminal (or use ‘tmux’ to background the server).

Bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain what quantization means in the context of LLMs in 2 sentences."}
    ],
    "max_tokens": 150
  }'

From your local machine (use the public IP):

Bash
curl http://<PUBLIC_IP>:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is 17 * 43?"}],
    "max_tokens": 50
  }'
[Screenshot: Query the prompt from my local machine]

Please Note: llama-server exposes an OpenAI-compatible API. Most code that uses the OpenAI chat completions API can point to this endpoint instead: change the base URL (and any placeholder API key) and leave the rest of the code as it is. For European companies wanting to move off US API providers, this is the practical path.
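As a sketch of what “OpenAI-compatible” means in practice, the helpers below build a chat request and send it to whichever base URL you configure. `LLAMA_BASE_URL`, `chat_payload`, and `ask` are hypothetical names introduced for illustration; the `/v1/chat/completions` path and the JSON fields are the ones used in the curl examples above.

```shell
#!/bin/sh
# Hypothetical variable: base URL of the OpenAI-compatible endpoint
LLAMA_BASE_URL="${LLAMA_BASE_URL:-http://localhost:8080/v1}"

# Build a single-message chat request body.
# Only backslashes and double quotes are escaped, so keep prompts on one line.
# usage: chat_payload "PROMPT" [MAX_TOKENS]
chat_payload() {
  prompt=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model":"llama3","messages":[{"role":"user","content":"%s"}],"max_tokens":%d}\n' \
    "$prompt" "${2:-50}"
}

# Send the request; not called here because it needs a running llama-server.
# usage: ask "PROMPT" [MAX_TOKENS]
ask() {
  curl -sS "$LLAMA_BASE_URL/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "$2")"
}

# Print the request body that would be sent
chat_payload "What is 17 * 43?" 50
```

To target the server from your local machine, export `LLAMA_BASE_URL=http://<PUBLIC_IP>:8080/v1` before calling `ask`. Client libraries that accept a base URL setting can be redirected in the same way.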

Step 9: Benchmark

Bash
./build/bin/llama-bench \
  --model ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 999

Measured results on the UpCloud NVIDIA L4:

| Test | Result |
|---|---|
| Prompt processing (pp512) | 3,038.68 ± 66.76 tok/s |
| Text generation (tg128) | 50.53 ± 0.04 tok/s |
| Model size | 4.58 GiB |
| Backend | CUDA (all 999 layers on GPU) |
| VRAM available | 22,565 MiB (~22 GiB) |

Compared to the T4 on AWS (34.36 tok/s generation, 1,093 tok/s prompt): the L4 delivers ~1.5x faster generation and a remarkable ~2.8x faster prompt processing. The prompt processing leap is especially relevant for RAG pipelines and long-context use cases.
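The speedup figures are simple ratios of the two sets of measurements, as the sketch below shows. `speedup` is a hypothetical helper; the inputs are the throughput numbers reported in this guide.

```shell
#!/bin/sh
# Ratio of two throughput figures, rounded to two decimals.
# usage: speedup NEW_TOK_PER_S OLD_TOK_PER_S
speedup() {
  awk -v new="$1" -v old="$2" 'BEGIN { printf "%.2f\n", new / old }'
}

speedup 50.53 34.36     # text generation (tg128), L4 vs T4: prints 1.47
speedup 3038.68 1093    # prompt processing (pp512), L4 vs T4: prints 2.78
```

These round to the ~1.5x and ~2.8x quoted above.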

Cost Tracking

Bash
# UpCloud CLI — check server status and uptime
upctl server show llama3-inference

Or: Control Panel → Billing → Resource usage.

No spot instances on UpCloud: Unlike AWS (where spot can save 60–70%), UpCloud pricing is flat. The upside is no interruptions. For a short dev exercise, stop the server immediately after — at €0.616/hr, a 3-hour session costs about €1.85.

IMPORTANT: Stop the Server When Done

Bash
# Stop (server off, storage preserved — model and llama.cpp build intact)
upctl server stop llama3-inference

# Delete entirely when you no longer need it
upctl server delete llama3-inference --delete-storages

Storage backup tip: Before deleting, create a backup of the disk. Next time you spin up a server from this backup, CUDA is already installed, llama.cpp is already built, and the model is already downloaded – zero setup time.

Bash
# Find your storage UUID
upctl server show llama3-inference

# Create a backup
upctl storage backup create <STORAGE_UUID> \
  --title "llama3-inference-ready"

Troubleshooting

| Problem | Command | Fix |
|---|---|---|
| GPU not detected | `nvidia-smi` | Wrong OS template selected – must use the “with NVIDIA drivers & CUDA” variant. |
| `nvidia-smi` works but CUDA not found | `nvcc --version` | CUDA not in PATH. Add `export PATH=/usr/local/cuda/bin:$PATH` to `~/.bashrc` and run `source ~/.bashrc`. |
| CUDA build fails | Check `cmake ..` output | Verify `nvcc --version` works. Re-run with `CUDA_HOME` exported if needed. |
| `pip3 install` fails with “externally-managed-environment” | | Ubuntu 24.04 blocks system-wide pip. Use `pipx` instead – see Step 6. |
| `hf auth login` returns “Invalid user token” | | Your token may have expired. Go to huggingface.co/settings/tokens, delete the old token and generate a fresh one. Use a classic Read token, not fine-grained. |
| Out of VRAM | Server crashes on model load | Very unlikely with 8B on L4. Switch to Q3_K_M if it happens. |
| Slow inference | `nvidia-smi` during run | If GPU memory usage is 0, `--n-gpu-layers` wasn’t applied. Rebuild with `-DGGML_CUDA=ON`. |
| Port 8080 unreachable externally | `curl localhost:8080` works but public IP doesn’t | Either add the UpCloud Firewall add-on with port 8080 open, or run `ufw allow 8080/tcp` on the server. |
| SSH refused | | Server still booting. Wait 60s, then retry. Check that port 22 is open (ufw or UpCloud firewall). |
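The PATH fix from the troubleshooting table can be made safe to run repeatedly. This is a minimal sketch, assuming CUDA lives under `/usr/local/cuda` (the path used in the table); `add_cuda_to_path` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: prepend the CUDA bin directory to PATH only if it is missing,
# so sourcing ~/.bashrc repeatedly does not keep growing PATH.
# usage: add_cuda_to_path [CUDA_BIN_DIR]
add_cuda_to_path() {
  cuda_bin="${1:-/usr/local/cuda/bin}"
  case ":$PATH:" in
    *":$cuda_bin:"*) ;;                        # already present, nothing to do
    *) PATH="$cuda_bin:$PATH"; export PATH ;;
  esac
}

add_cuda_to_path
command -v nvcc >/dev/null 2>&1 || echo "nvcc still not found: check the CUDA install" >&2
```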

Why it matters

  1. Data residency – EU zones mean data never leaves the EU. Required for healthcare, finance, and public sector clients under GDPR and sector-specific regulations.
  2. CADA / AI Act alignment – The EU AI Act and the emerging Cloud and AI Development Act (CADA) push companies toward auditable, EU-controlled AI infrastructure. Running your own model on EU infrastructure is a concrete step toward compliance.
  3. No US vendor lock-in – UpCloud is not tied to a US hyperscaler’s terms of service or export controls. European companies increasingly care about this.
  4. Performance – The L4 (Ada Lovelace, 2023) beat the T4 (Turing, 2018) by ~1.5x on text generation and ~2.8x on prompt processing in the Step 9 benchmarks. EU infrastructure does not mean slower infrastructure.
  5. Honest pricing – No spot/preemptible complexity. €0.616/hr flat, plus ~€2/month if you want the managed firewall. What you see is what you pay.

Restore from Backup workflow

Since we took a backup, we can now set the whole thing up again in minutes by running:

Bash
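# List your storage backups
upctl storage list --backup

# Create a new server from the backup: zero setup time.
# How an existing storage or backup is attached or cloned depends on your upctl version;
# check `upctl server create --help` for the exact --storage syntax before running this.
upctl server create \
  --hostname llama3-inference-v2 \
  --plan GPU-1xL4-8-64 \
  --zone fi-hel2 \
  --storage <BACKUP_UUID> \
  --ssh-keys ~/.ssh/your-key.pub \
  --title "Llama3 from backup"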

Server boots → SSH in → run the `llama-server` command from Step 7. Everything else is already done.