W4D1: Local LLM Warm-Up
Goal
By the end of this warm-up, you’ve run an LLM on your own laptop.
Key questions:
- What does “running” an LLM require (compute, memory, storage)?
- What are the benefits of running an LLM on our own hardware vs using a cloud API?
- What are the limitations of running an LLM on our own hardware vs using a cloud API?
Setup
In 2025, the easiest way to run an LLM locally is probably Ollama.
Install Ollama on your machine and get at least `gemma3:270m` running. If you have the disk space and memory, you could also try a larger model, like `gpt-oss:20b` or one of the `qwen` models.
Note: the installer also includes a desktop app, but the instructions here assume the command-line interface.
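A minimal sketch of the setup from a terminal (the Homebrew formula and the Linux install script are assumptions about your platform; the installers from ollama.com work just as well, and Windows users can run the same `ollama` commands after installing):

```
# macOS via Homebrew, or download the installer from https://ollama.com
brew install ollama

# Linux: official install script
curl -fsSL https://ollama.com/install.sh | sh

# Verify the CLI is available, then download the small model
ollama --version
ollama pull gemma3:270m
```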
Exploration
Start by running `ollama run gemma3:270m` in your terminal. After a short wait to download and load the model, you should see a prompt like `>>> Send a message`.
- Try a simple prompt, like “What is the capital of Michigan?” or “Write a haiku about computers.” Observe:
  - How long does it take to start generating output? (“time to first token”)
  - How long does it take to generate the full output?
  - Is it correct? Useful?
  - (For a way to get timing numbers from the CLI itself, see the sketch after this list.)
- Information about the model:
  - Run `/show info`. What is the model’s context length? Estimate how many pages of text that is.
  - How many parameters does it have?
  - Run `/show license`. What are you allowed, and not allowed, to do with this model?
- Resource usage:
  - How much memory is the Ollama process using? (Use Task Manager on Windows, Activity Monitor on macOS, or `htop` on Linux.)
  - How much disk space does the model take up? Run `ollama list`. (Optionally, to find the actual files, see FAQ: Where are models stored?)
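If you’d rather not time the responses by hand, here is a rough sketch using the CLI itself. It assumes a reasonably recent Ollama, where the `--verbose` flag prints load time, prompt-evaluation time, and tokens-per-second after each response:

```
# One-shot prompt; --verbose prints timing statistics after the response
ollama run gemma3:270m --verbose "Write a haiku about computers."

# Disk space used by each downloaded model
ollama list
```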
Doing Something Useful
Apple uses a local LLM to summarize notifications. Try to do something similar:
- Collect 3 notifications you have received recently (e.g., from your phone or computer).
- Write a prompt that asks the model to summarize these notifications. Save the prompt to a text file (e.g., `summarize_notifications.txt`). For example:
Summarize the following notifications in 1-2 sentences:
1. [First notification]
2. [Second notification]
3. [Third notification]
- Run the model with your prompt file as input. For example, if your prompt is in `summarize_notifications.txt`, run:
  `ollama run gemma3:270m < summarize_notifications.txt`
  or, in PowerShell:
  `Get-Content summarize_notifications.txt | ollama run gemma3:270m`
- Observe the output. Is it a good summary? How long does it take to generate?
On macOS or Linux, you can use `time` to measure how long the command takes. For example:
`time ollama run gemma3:270m < summarize_notifications.txt`
Try that command a few times. You’ll probably notice that the second run is much faster than the first. Two reasons:
- The model has to be loaded into memory first. The Ollama server process keeps it there after the first run, but unloads it after a few minutes of inactivity.
- The prompt is stored in a key-value (KV) cache, which speeds up subsequent runs with the same prompt.
You can see which model is loaded, and force a cold start, with the commands in the sketch after this list.
- Try modifying your prompt to see how it affects the output. For example, you could ask for a more detailed summary, or a summary in bullet points. Consider whether the result is a summary you’d actually find useful if it were summarizing notifications on your own phone.
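To see the load/unload behavior directly, here is a small sketch (it assumes a recent Ollama release; `ollama stop` is not available in older versions):

```
# While the model is warm, it appears here along with when it will be unloaded
ollama ps

# Force it out of memory, then re-run the timed command to reproduce a cold start
ollama stop gemma3:270m
time ollama run gemma3:270m < summarize_notifications.txt
```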
For comparison, try the same prompt with a cloud API, like Google Gemini. Also try a larger local model, if you have the resources (e.g., `gpt-oss:20b` or `qwen3:8b`).
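For the cloud comparison, a rough sketch against the Gemini REST API with curl (assumptions: `jq` is installed, your API key is in the `GEMINI_API_KEY` environment variable, and `gemini-2.0-flash` is still a valid model name; check the current Gemini docs for the exact endpoint and model):

```
# Build the JSON request body from the prompt file (jq handles quoting and
# newlines), send it to a hosted Gemini model, and time the whole round trip
time jq -n --rawfile prompt summarize_notifications.txt \
    '{contents: [{parts: [{text: $prompt}]}]}' |
  curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent" \
    -H "x-goog-api-key: $GEMINI_API_KEY" \
    -H "Content-Type: application/json" \
    -d @-
```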
Discussion
- What are the benefits of running an LLM locally vs using a cloud API? Consider factors like cost, privacy, control, latency, reliability, etc.
- What are the limitations of running an LLM locally vs using a cloud API? Consider factors like model size, performance, updates, etc.