W4D1: Local LLM Warm-Up
Goal
By the end of this warm-up, you’ve run an LLM on your own laptop.
Key questions:
- What does “running” an LLM require (compute, memory, storage)?
- What are the benefits of running an LLM on our own hardware vs using a cloud API?
- What are the limitations of running an LLM on our own hardware vs using a cloud API?
Setup
In 2025, the easiest way to run an LLM locally is probably Ollama.
Install Ollama on your machine and get at least `gemma3:270m` running. If you have the disk space and memory, you could also try a larger model, like `gpt-oss:20b` or one of the `qwen` models.
Note: the installer also includes a desktop app, but the instructions here assume the command-line interface.
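A minimal sketch of the setup from a terminal (the Homebrew formula and the Linux install script are assumptions about your platform; the installers from ollama.com work just as well, and Windows users can run the same `ollama` commands after installing):

```
# macOS via Homebrew, or download the installer from https://ollama.com
brew install ollama

# Linux: official install script
curl -fsSL https://ollama.com/install.sh | sh

# Verify the CLI is available, then download the small model
ollama --version
ollama pull gemma3:270m
```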
Exploration
Start by running `ollama run gemma3:270m` in your terminal. After a short wait to download and load the model, you should see a prompt like `>>> Send a message`.
- Try a simple prompt, like “What is the capital of Michigan?” or “Write a haiku about computers.” Observe:
  - How long does it take to start generating output? (“time to first token”)
  - How long does it take to generate the full output?
  - Is it correct? Useful?
  - (For a way to get timing numbers from the CLI itself, see the sketch after this list.)
- Information about the model:
  - Run `/show info`. What is the model’s context length? Estimate how many pages of text that is.
  - How many parameters does it have?
  - Run `/show license`. What are you allowed, and not allowed, to do with this model?
- Resource usage:
  - How much memory is the Ollama process using? (Use Task Manager on Windows, Activity Monitor on macOS, or `htop` on Linux.)
  - How much disk space does the model take up? Run `ollama list`. (Optionally, to find the actual files, see FAQ: Where are models stored?)
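If you’d rather not time the responses by hand, here is a rough sketch using the CLI itself. It assumes a reasonably recent Ollama, where the `--verbose` flag prints load time, prompt-evaluation time, and tokens-per-second after each response:

```
# One-shot prompt; --verbose prints timing statistics after the response
ollama run gemma3:270m --verbose "Write a haiku about computers."

# Disk space used by each downloaded model
ollama list
```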
Doing Something Useful
Apple uses a local LLM to summarize notifications. Try to do something similar:
- Collect 3 notifications you have received recently (e.g., from your phone or computer).
- Write a prompt that asks the model to summarize these notifications. Save the prompt to a text file (e.g., `summarize_notifications.txt`). For example:
Summarize the following notifications in 1-2 sentences:
1. [First notification]
2. [Second notification]
3. [Third notification]
- Run the model with your prompt file as input. For example, if your prompt is in `summarize_notifications.txt`, run:
  `ollama run gemma3:270m < summarize_notifications.txt`
  or, in PowerShell:
  `Get-Content summarize_notifications.txt | ollama run gemma3:270m`
- Observe the output. Is it a good summary? How long does it take to generate?
On macOS or Linux, you can use `time` to measure how long the command takes. For example:
`time ollama run gemma3:270m < summarize_notifications.txt`
Try that command a few times. You’ll probably notice that the second run is much faster than the first. Two reasons:
- The model has to be loaded into memory first. The Ollama server process keeps it there after the first run, but unloads it after a few minutes of inactivity.
- The prompt is stored in a key-value (KV) cache, which speeds up subsequent runs with the same prompt.
You can see which model is loaded, and force a cold start, with the commands in the sketch after this list.
- Try modifying your prompt to see how it affects the output. For example, you could ask for a more detailed summary, or a summary in bullet points. Consider whether the result is a summary you’d actually find useful if it were summarizing notifications on your own phone.
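To see the load/unload behavior directly, here is a small sketch (it assumes a recent Ollama release; `ollama stop` is not available in older versions):

```
# While the model is warm, it appears here along with when it will be unloaded
ollama ps

# Force it out of memory, then re-run the timed command to reproduce a cold start
ollama stop gemma3:270m
time ollama run gemma3:270m < summarize_notifications.txt
```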
For comparison, try the same prompt with a cloud API, like Google Gemini. Also try a larger local model, if you have the resources (e.g., `gpt-oss:20b` or `qwen3:8b`).
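For the cloud comparison, a rough sketch against the Gemini REST API with curl (assumptions: `jq` is installed, your API key is in the `GEMINI_API_KEY` environment variable, and `gemini-2.0-flash` is still a valid model name; check the current Gemini docs for the exact endpoint and model):

```
# Build the JSON request body from the prompt file (jq handles quoting and
# newlines), send it to a hosted Gemini model, and time the whole round trip
time jq -n --rawfile prompt summarize_notifications.txt \
    '{contents: [{parts: [{text: $prompt}]}]}' |
  curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent" \
    -H "x-goog-api-key: $GEMINI_API_KEY" \
    -H "Content-Type: application/json" \
    -d @-
```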
Discussion
- What are the benefits of running an LLM locally vs using a cloud API? Consider factors like cost, privacy, control, latency, reliability, etc.
- What are the limitations of running an LLM locally vs using a cloud API? Consider factors like model size, performance, updates, etc.