Implementation Guide — European Infrastructure Edition
Why UpCloud?
UpCloud is Finnish-headquartered and ISO 27001 certified, and its infrastructure is built around GDPR. For European clients with data residency requirements, or companies aligning with the Cloud and AI Development Act (CADA) and related EU digital sovereignty initiatives, this is a practical alternative to US hyperscalers. The NVIDIA L4 GPU on UpCloud is also a step up from the AWS T4: newer architecture (Ada Lovelace vs Turing), 24 GB VRAM vs 16 GB (about 15 GB usable), and faster on quantized models (roughly 1.5x on generation and 2.8x on prompt processing in the benchmarks below).
Hardware Comparison: AWS g4dn.xlarge vs UpCloud NVIDIA L4
Spec | AWS g4dn.xlarge | UpCloud NVIDIA L4 |
|---|---|---|
GPU | NVIDIA T4 (Turing, 2018) | NVIDIA L4 (Ada Lovelace, 2023) |
VRAM | 15 GB usable | 24 GB |
vCPUs | 4 | 8 |
RAM | 16 GB | 64 GB |
Generation throughput * | 34.36 tok/s | 50.53 tok/s |
Prompt throughput * | 1,093 tok/s | 3,038.68 tok/s |
VRAM used (8B Q4_K_M) * | 5.2 GB / 15 GB | 4.58 GiB / 22 GiB |
Data residency | US-based (configurable) | EU-native (Finland – Helsinki 2) |
GDPR / Data Residency | Requires configuration | EU-native by default |
Hourly price | ~$0.53/hr on-demand | ~€0.616/hr |
Firewall | Included (Security Groups) | Add-on (~€2/month extra) |
Setup time (build + drivers) | ~90 minutes | ~7 minutes |
* Both sets of numbers are real benchmarks — AWS T4 from the [published AWS guide](https://gizmojack.com/how-to-deploy-llama-3-1-on-aws-ec2-g4dn-xlarge-for-under-1-hour-a-complete-guide/), UpCloud L4 measured during the writing of this guide.
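For EU clients: The L4’s 24 GB VRAM leaves room for larger models and longer contexts than the T4. Llama 3.1 70B at Q4_K_M (~40 GB) does not fit entirely in 24 GB of VRAM, but it can run with partial GPU offload, with the remaining layers held in the server’s 64 GB of system RAM, at much lower speed. The g4dn.xlarge (15 GB usable VRAM, 16 GB RAM) has no realistic way to hold a model of that size.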
Cost Reality Check
For a dev/learning exercise (a few hours), the cost is minimal:
Item | Cost |
|---|---|
GPU server (1xL4) | €0.616/hr |
Firewall (optional add-on) | ~€0.003/hr (€2/month prorated) |
Storage (100 GB, while running) | Included in server price |
Storage (while server is stopped) | Small per-GB charge — check UpCloud pricing |
Total for a 3-hour session | ~€1.85 |
Key difference from AWS: UpCloud does not have spot instances. There’s no equivalent of AWS’s 60–70% spot discount. The tradeoff is predictable pricing with no interruptions – better for anything beyond a quick test. When the server is stopped (not deleted), you only pay for the storage, so leave it stopped between sessions and delete when fully done.
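The session arithmetic above can be sketched as a small shell helper. This is a minimal sketch, not UpCloud tooling: `session_cost` is a hypothetical name, the default rate is the €0.616/hr quoted in this guide, and prorating the firewall over 730 hours per month is an assumption.

```shell
#!/usr/bin/env bash
# session_cost HOURS [HOURLY_RATE_EUR] [FIREWALL_EUR_PER_MONTH]
# Hypothetical helper: estimates the cost of one session in EUR.
# Assumption: the firewall add-on is prorated over 730 hours per month.
session_cost() {
  local hours="$1" rate="${2:-0.616}" fw_monthly="${3:-0}"
  LC_ALL=C awk -v h="$hours" -v r="$rate" -v fw="$fw_monthly" \
    'BEGIN { printf "%.2f\n", h * (r + fw / 730) }'
}

session_cost 3            # server only: prints 1.85
session_cost 3 0.616 2    # with the firewall add-on: prints 1.86
```

Storage charges while the server is stopped are not modeled here; check UpCloud pricing for the per-GB rate.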
Prerequisites
- UpCloud account with billing enabled (upcloud.com)
- UpCloud CLI (`upctl`) installed – OR use the UpCloud Control Panel web UI
- An SSH key added to your UpCloud account
- A HuggingFace account (free) — needed to download Llama 3:
- Go to https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct and accept the license
- Then: HF Settings → Access Tokens → New token (read scope) → copy it
Install the UpCloud CLI (upctl)
# macOS
brew install UpCloudLtd/tap/upctl
# Linux (or WSL)
curl -Lo upctl.tar.gz https://github.com/UpCloudLtd/upcloud-cli/releases/latest/download/upctl_linux_amd64.tar.gz
tar -xzf upctl.tar.gz
sudo mv upctl /usr/local/bin/
# Authenticate
upctl account login
# Enter your UpCloud username and password when prompted
Step 1: Launch the GPU Server
UpCloud provides Ubuntu Server 24.04 LTS (with NVIDIA drivers & CUDA) as a ready-made template – the equivalent of AWS’s Deep Learning AMI. CUDA drivers come pre-installed, so you skip the manual driver setup entirely.
Via UpCloud Control Panel (recommended for first-timers)
- Log in at hub.upcloud.com
- Servers → GPU Servers → Deploy GPU Server
- Choose 1 x NVIDIA L4 / 8 cores / 64 GB (€0.616/hr)
- Region: select Finland – Helsinki 2 – currently the only zone where GPU servers are available
- OS: select “Ubuntu Server 24.04 LTS (with NVIDIA drivers & CUDA)”
- Storage: set to 100 GB MaxIOPS
- SSH keys: add your public key
- Firewall: optionally add the UpCloud firewall add-on (~€2/month) – see Step 2 for the alternative
- Click Deploy
Via UpCloud CLI
# List available GPU server plans
upctl server plan list | grep -i gpu
# List available OS templates — look for Ubuntu 24.04 with CUDA
upctl server template list | grep -i ubuntu-24
# Create the GPU server (replace YOUR_SSH_KEY_NAME and zone as needed)
upctl server create \
--hostname llama3-inference \
--plan GPU-1xL4-8-64 \
--zone fi-hel2 \
--os "Ubuntu Server 24.04 LTS (with NVIDIA drivers & CUDA)" \
--ssh-keys YOUR_SSH_KEY_NAME \
--storage-size 100 \
--storage-tier maxiops \
--title "Llama3 Inference Server"
Available zone for GPU servers: `fi-hel2` (Helsinki 2, Finland), currently the only zone where GPU servers can be created.
Other UpCloud zones (Amsterdam, Warsaw, Stockholm) do not yet support GPU instances. This may change as UpCloud expands GPU availability — check the Control Panel for the latest. For most EU data residency and GDPR purposes, Finland is fully compliant.
Why 100 GB storage? Ubuntu 24.04 + pre-installed CUDA drivers take ~15–20 GB. The Llama 3.1 8B Q4_K_M model is ~4.7 GB. Build tools and headroom account for the rest. If you plan to try the 70B model (~40 GB), bump to 150 GB.
Get the server’s public IP
upctl server show llama3-inference
# Look for "IP addresses" in the output
For Control Panel users: open the running servers list under Servers in the left menu, click the server, and copy the IP address from the next page.
Step 2: Firewall / Port Access
UpCloud’s firewall is an optional paid add-on (~€2/month). You have two options:
Option A: Use the UpCloud Firewall add-on (recommended for production)
Enable it during server creation in the Control Panel, then add rules for ports 22 (SSH) and 8080 (inference).
Via CLI:
upctl firewall create --name llama-fw
# Allow SSH
upctl firewall rule create llama-fw \
--direction in --action accept --protocol tcp \
--destination-port-start 22 --destination-port-end 22
# Allow inference port
upctl firewall rule create llama-fw \
--direction in --action accept --protocol tcp \
--destination-port-start 8080 --destination-port-end 8080
# Attach to server
upctl server firewall attach <SERVER_UUID> --firewall llama-fw
Option B: Use Ubuntu’s built-in firewall (ufw) – free, no add-on needed
# After SSH-ing in: allow SSH and the inference port BEFORE enabling ufw,
# otherwise you can lock yourself out of the server
ufw allow 22/tcp
ufw allow 8080/tcp
ufw enable
# Make sure ufw starts on boot and is running
systemctl enable ufw
systemctl start ufw
ufw status
Please note!
For a dev exercise, you could go with ufw or even no firewall at all (my instance ran for about 25 minutes during this exercise, and for a lifetime that short even ufw adds little). A longer-lived dev instance should have at least ufw configured. For anything production-facing, the extra ~€2/month is worth paying: ufw runs on the machine itself and uses some of its resources while working. For your peace of mind, go with the UpCloud Firewall on production-facing instances.
Step 3: Connect and Verify GPU
ssh -i ~/.ssh/your-key root@<PUBLIC_IP>
Because we used the Ubuntu 24.04 with NVIDIA drivers & CUDA template, the drivers are already installed. Verify immediately:
nvidia-smi
![[Screenshot: nvidia-smi output showing NVIDIA L4, 24 GB VRAM, driver version, CUDA version]](https://gizmojack.com/wp-content/uploads/2026/06/nvidia-smi-screenshot-1024x525.png.webp)
Expected output: NVIDIA L4, 24 GB VRAM, driver version, CUDA version shown.
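The same check can be scripted so it fails loudly when the wrong template or plan was picked. A minimal sketch: `check_gpu` and the `NVIDIA_SMI` override variable are hypothetical names introduced here, while `--query-gpu` and `--format` are standard `nvidia-smi` options.

```shell
#!/usr/bin/env bash
# check_gpu [EXPECTED_NAME]
# Hypothetical helper: verifies that nvidia-smi reports the expected GPU.
# NVIDIA_SMI can be overridden (assumption, handy for testing); it defaults to nvidia-smi.
check_gpu() {
  local expected="${1:-L4}" info
  # Prints one CSV line per GPU: "<name>, <total memory> MiB"
  info="$("${NVIDIA_SMI:-nvidia-smi}" --query-gpu=name,memory.total --format=csv,noheader)" || {
    echo "nvidia-smi failed: driver missing or wrong OS template?" >&2
    return 1
  }
  if printf '%s\n' "$info" | grep -q "$expected"; then
    echo "OK: $info"
  else
    echo "Unexpected GPU: $info" >&2
    return 1
  fi
}
```

Run `check_gpu L4` right after the first SSH login; a non-zero exit status means the GPU or driver is not what this guide expects.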
# Verify CUDA toolkit is available
nvcc --version
Compared to AWS: In the AWS guide, the Deep Learning AMI also came with drivers pre-installed. UpCloud’s Ubuntu 24.04 CUDA template gives you the same convenience, but on EU infrastructure and with a more current Ubuntu LTS base.
Step 4: Install Build Dependencies and Clone llama.cpp
# Update the system
apt update && apt upgrade -y
# Install build tools
apt install -y git cmake gcc g++ make python3-pip tmux
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Step 5: Build llama.cpp with CUDA Support
mkdir build && cd build
# Configure with CUDA enabled
cmake .. -DGGML_CUDA=ON
# Build (uses all 8 cores — takes ~3–5 min)
cmake --build . --config Release -j$(nproc)
What `-DGGML_CUDA=ON` does: Compiles GPU kernels so matrix operations run on the L4 instead of the CPU. Without this flag, inference is 5-10x slower and ignores the GPU entirely. This is the key difference between a “happens to run” and a “properly deployed” LLM.
Setup time comparison: On AWS g4dn.xlarge, building llama.cpp with CUDA took approximately 90 minutes. On the UpCloud NVIDIA L4 server, the same build completed in approximately 7 minutes. Compilation runs on the CPU, so the 8 cores (vs 4 on the g4dn.xlarge) account for part of that gap; differences in llama.cpp version and build configuration between the two runs may explain the rest. For a hands-on exercise or a client demo, that difference matters.
Verify the CUDA build succeeded:
./bin/llama-cli --version
# Should mention CUDA in the build info
Step 6: Download Llama 3.1 8B (Quantized)
cd ~
# Ubuntu 24.04 locks down system Python — use pipx instead
apt install -y pipx
pipx ensurepath
source ~/.bashrc
# Install HuggingFace CLI
# Quote the package spec so the shell does not try to expand the brackets
pipx install "huggingface_hub[cli]"
# Log in with your HF token
hf auth login
# Paste your token when prompted
# Create model directory
mkdir -p ~/models
# Download Q4_K_M quantized GGUF (~4.7 GB)
hf download \
bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
--include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
--local-dir ~/models
What Q4_K_M means: 4-bit quantization, K-quant method, medium size variant. Full fp16 would need ~16 GB VRAM. Q4_K_M fits in ~5 GB VRAM with minimal quality loss – on the L4’s 24 GB you have 19 GB of VRAM headroom to spare.
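As a rough check on those numbers, weight memory is approximately the parameter count times the bits per weight, divided by 8. A minimal sketch: `estimate_gb` is a hypothetical helper, and 4.8 bits per weight for Q4_K_M is an assumed average (K-quants mix precisions, so the real figure varies by model).

```shell
#!/usr/bin/env bash
# estimate_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
# Hypothetical helper: approximate size of the model weights in GB.
# Ignores the KV cache and runtime buffers, which need extra memory.
estimate_gb() {
  LC_ALL=C awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 8 16     # 8B at fp16: prints 16.0
estimate_gb 8 4.8    # 8B at Q4_K_M (assumed 4.8 bits/weight): prints 4.8
estimate_gb 70 4.8   # 70B at Q4_K_M (same assumption): prints 42.0
```

The last figure is in line with the ~40 GB download size of the 70B file, which is larger than the L4's 24 GB of VRAM.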
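Bonus: Download Llama 3.1 70B (partial GPU offload only)
# ~40 GB download. The file is larger than the L4's 24 GB VRAM, so it cannot be fully offloaded:
# run it with a lower --n-gpu-layers value and let the remaining layers use the 64 GB of system RAM (much slower).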
hf download \
bartowski/Meta-Llama-3.1-70B-Instruct-GGUF \
--include "Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf" \
--local-dir ~/models
Step 7: Start the Inference Server
cd ~/llama.cpp
./build/bin/llama-server \
--model ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 999 \
--ctx-size 8192 \
--parallel 4
Flag Breakdown:
- `--n-gpu-layers 999` : offload all layers to the GPU (999 is simply larger than the model's layer count)
- `--ctx-size 8192` : context window in tokens (doubled vs the T4 guide; the L4's VRAM allows it)
- `--parallel 4` : handle 4 simultaneous requests (doubled; more RAM/VRAM headroom)
- `--host 0.0.0.0` : bind to all interfaces (needed for external access)
You should see: “llama server listening at http://0.0.0.0:8080” like so:
![[Screenshot: Server listening on port 8080]](https://gizmojack.com/wp-content/uploads/2026/06/server-listening.png.webp)
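To see why an 8192-token context is comfortable on the L4, the KV cache size can be estimated as 2 (K and V) × layers × KV heads × head dimension × bytes per value × context length. A minimal sketch, assuming Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache; `kv_cache_mib` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# kv_cache_mib N_CTX [LAYERS] [KV_HEADS] [HEAD_DIM] [BYTES_PER_VALUE]
# Hypothetical helper: approximate KV cache size in MiB.
# Defaults are assumptions matching Llama 3.1 8B with a 16-bit cache.
kv_cache_mib() {
  local n_ctx="$1" layers="${2:-32}" kv_heads="${3:-8}" head_dim="${4:-128}" bytes="${5:-2}"
  # The leading 2 accounts for one K and one V tensor per layer
  echo $(( 2 * layers * kv_heads * head_dim * bytes * n_ctx / 1024 / 1024 ))
}

kv_cache_mib 8192   # whole --ctx-size 8192 window: prints 1024
kv_cache_mib 2048   # a 2048-token share: prints 256
```

Under these assumptions the full cache is about 1 GiB on top of the ~4.6 GiB of weights. llama-server has typically divided `--ctx-size` across the `--parallel` slots, which would give each of the 4 slots 2048 tokens; check the behavior of the version you built.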
Step 8: Test the Endpoint
Open a second terminal (or use `tmux` to background the server).
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain what quantization means in the context of LLMs in 2 sentences."}
],
"max_tokens": 150
}'
From your local machine (use the public IP):
curl http://<PUBLIC_IP>:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"messages": [{"role": "user", "content": "What is 17 * 43?"}],
"max_tokens": 50
}'
Please Note: llama-server exposes an OpenAI-compatible API. Code that already talks to OpenAI can point to this endpoint instead: change the base URL (and the model name or API key placeholder, if your client requires them) and leave the rest as is. For European companies wanting to move off US API providers, this is the practical path.
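When scripting requests like the ones above, it helps to wait until the model has finished loading. A minimal sketch, assuming the `/health` endpoint that llama-server provides returns a success status once the model is ready and an error status while loading; `wait_for_server` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# wait_for_server BASE_URL [RETRIES] [DELAY_SECONDS]
# Hypothetical helper: polls BASE_URL/health until it answers successfully.
wait_for_server() {
  local base_url="$1" retries="${2:-30}" delay="${3:-2}" i
  for i in $(seq 1 "$retries"); do
    # -s: quiet, -f: treat HTTP error statuses as failure
    if curl -sf "$base_url/health" >/dev/null; then
      echo "server ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "server not ready after $retries attempts" >&2
  return 1
}
```

For example, `wait_for_server http://localhost:8080 && curl http://localhost:8080/v1/chat/completions ...` sends the first request only once the server reports ready.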
Step 9: Benchmark
./build/bin/llama-bench \
--model ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 999
Measured results on the UpCloud NVIDIA L4:
Test | Result |
|---|---|
Prompt processing (pp512) | 3,038.68 ± 66.76 tok/s |
Text generation (tg128) | 50.53 ± 0.04 tok/s |
Model size | 4.58 GiB |
Backend | CUDA (all 999 layers on GPU) |
VRAM available | 22,565 MiB (~22 GiB) |
Compared to the T4 on AWS (34.36 tok/s generation, 1,093 tok/s prompt): the L4 delivers ~1.5x faster generation and a remarkable ~2.8x faster prompt processing. The prompt processing leap is especially relevant for RAG pipelines and long-context use cases.
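The ratios quoted above follow directly from the two sets of benchmark figures. A small sketch for reproducing them, or for plugging in your own llama-bench numbers; `speedup` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# speedup NEW_TOK_PER_S OLD_TOK_PER_S
# Hypothetical helper: ratio of two throughput figures, two decimals.
speedup() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 50.53 34.36     # generation, L4 vs T4: prints 1.47
speedup 3038.68 1093    # prompt processing, L4 vs T4: prints 2.78
```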
Cost Tracking
# UpCloud CLI — check server status and uptime
upctl server show llama3-inference
Or: Control Panel → Billing → Resource usage.
No spot instances on UpCloud: Unlike AWS (where spot can save 60–70%), UpCloud pricing is flat. The upside is no interruptions. For a short dev exercise, stop the server immediately after — at €0.616/hr, a 3-hour session costs about €1.85.
IMPORTANT: Stop the Server When Done
# Stop (server off, storage preserved — model and llama.cpp build intact)
upctl server stop llama3-inference
# Delete entirely when you no longer need it
upctl server delete llama3-inference --delete-storages
Storage backup tip: Before deleting, create a backup of the disk. Next time you spin up a server from this backup, CUDA is already installed, llama.cpp is already built, and the model is already downloaded – zero setup time.
# Find your storage UUID
upctl server show llama3-inference
# Create a backup
upctl storage backup create <STORAGE_UUID> \
--title "llama3-inference-ready"
Troubleshooting
Problem | Command | Fix |
|---|---|---|
GPU not detected | `nvidia-smi` | Wrong OS template selected – must use the “with NVIDIA drivers & CUDA” variant. |
`nvidia-smi` works but CUDA not found | `nvcc --version` | CUDA not in PATH. Add `export PATH=/usr/local/cuda/bin:$PATH` to `~/.bashrc` and run `source ~/.bashrc`. |
CUDA build fails | Check `cmake ..` output | Verify `nvcc --version` works. Re-run with `CUDA_HOME` exported if needed. |
`pip3 install` fails with “externally-managed-environment” | – | Ubuntu 24.04 blocks system-wide pip. Use `pipx` instead – see Step 6. |
`hf auth login` returns “Invalid user token” | – | Your token may have expired. Go to huggingface.co/settings/tokens, delete the old token and generate a fresh one. Use a classic Read token, not fine-grained. |
Out of VRAM | Server crashes on model load | Very unlikely with 8B on L4. Switch to Q3_K_M if it happens. |
Slow inference | `nvidia-smi` during run | If GPU memory usage is 0, `--n-gpu-layers` wasn’t applied. Rebuild with `-DGGML_CUDA=ON`. |
Port 8080 unreachable externally | `curl localhost:8080` works but public IP doesn’t | Either add the UpCloud Firewall add-on with port 8080 open, or run `ufw allow 8080/tcp` on the server. |
SSH refused | – | Server still booting. Wait 60s, then retry. Check that port 22 is open (ufw or UpCloud firewall). |
Why it matters
- Data residency – EU zones mean data never leaves the EU. Required for healthcare, finance, and public sector clients under GDPR and sector-specific regulations.
- CADA / AI Act alignment – The EU AI Act and the emerging Cloud and AI Development Act (CADA) push companies toward auditable, EU-controlled AI infrastructure. Running your own model on EU infrastructure is a concrete step toward compliance.
- No US vendor lock-in – UpCloud is not tied to a US hyperscaler’s terms of service or export controls. European companies increasingly care about this.
- Performance – In this guide's benchmarks, the L4 (Ada Lovelace, 2023) beats the T4 (Turing, 2018) by ~1.5x on generation and ~2.8x on prompt processing. EU infrastructure does not mean slower infrastructure.
- Honest pricing – No spot/preemptible complexity. €0.616/hr flat, plus ~€2/month if you want the managed firewall. What you see is what you pay.
I set up private LLM endpoints on EU-hosted GPU servers – GDPR-compliant, cost-effective, and fully under your control.
Restore from Backup workflow
Since we took a backup, we can now set the whole thing up in minutes by running:
# List your storage backups
upctl storage list --type backup
# Create a new server from the backup — zero setup time
upctl server create \
--hostname llama3-inference-v2 \
--plan GPU-1xL4-8-64 \
--zone fi-hel2 \
--storage <BACKUP_UUID> \
--ssh-keys YOUR_SSH_KEY_NAME \
--title "Llama3 from backup"
Server boots → SSH in → run the `llama-server` command from Step 7. Everything else is already done.

