W4D1: Local LLM Warm-Up

Goal

You’ve run an LLM on your laptop.

Key questions:

  • What does “running” an LLM require (compute, memory, storage)?
  • What are the benefits of running an LLM on our own hardware vs using a cloud API?
  • What are the limitations of running an LLM on our own hardware vs using a cloud API?

Setup

In 2025, the easiest way to run an LLM locally is probably Ollama.

Install Ollama on your machine and get at least gemma3:270m running. If you have the disk space and memory, you could also try a larger model, like gpt-oss:20b or one of the qwen models.

Note: the installer also includes a graphical app, but the instructions here assume the command-line interface.
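
A minimal install-and-first-run sequence, assuming Linux and the official install script (on macOS or Windows, download the installer from ollama.com instead; check the site for current instructions):

curl -fsSL https://ollama.com/install.sh | sh
ollama --version
ollama run gemma3:270m

The last command downloads the model on first use and then drops you into an interactive chat.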

Exploration

Start by running ollama run gemma3:270m in your terminal. After a short wait while the model downloads and loads, you should see a prompt like >>> Send a message.

  1. Try a simple prompt, like “What is the capital of Michigan?” or “Write a haiku about computers.” Observe:
    • How long does it take to start generating output? (“time to first token”)
    • How long does it take to generate the full output?
    • Is it correct? Useful?
  2. Information about the model:
    • Run /show info.
      • What is the model’s context length? Estimate how many pages of text that is.
      • How many parameters does it have?
    • Run /show license. What are you allowed, and not allowed, to do with this model?
  3. Resource usage (see the commands sketched after this list):
    • How much memory is the Ollama process using? (Use Task Manager on Windows, Activity Monitor on macOS, or htop on Linux.)
    • How much disk space does the model take up? Run ollama list. (Optionally, to find the actual files, see FAQ: Where are models stored?)
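
A few CLI commands that can help with the resource and timing questions (a sketch from my own use of Ollama; flags and output format may vary by version):

ollama list                        # models on disk and their file sizes
ollama ps                          # models currently loaded in memory, and how much memory each uses
ollama run --verbose gemma3:270m   # prints timing statistics after each response

ollama ps and --verbose are suggestions beyond the steps above; if your version doesn’t support them, the OS tools mentioned above work fine.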

Doing Something Useful

Apple uses a local LLM to summarize notifications. Try to do something similar:

  1. Collect 3 notifications you have received recently (e.g., from your phone or computer).
  2. Write a prompt that asks the model to summarize these notifications. Save this prompt to a text file (e.g., summarize_notifications.txt). For example:
Summarize the following notifications in 1-2 sentences:
1. [First notification]
2. [Second notification]
3. [Third notification]
  3. Run the model with your prompt file as input. For example, if your prompt is in summarize_notifications.txt, run:
ollama run gemma3:270m < summarize_notifications.txt

or in PowerShell:

Get-Content summarize_notifications.txt | ollama run gemma3:270m
  4. Observe the output. Is it a good summary? How long does it take to generate?
Tip

On macOS or Linux, you can use time to measure how long the command takes. For example:

time ollama run gemma3:270m < summarize_notifications.txt
Note: Doing this repeatedly

Try that command a few times. You’ll probably notice that the second run is much faster than the first. Two reasons:

  1. The model needs to be loaded into memory. The Ollama server process keeps it in memory after the first run, but unloads it after a period of inactivity.
  2. The prompt gets stored in a “key-value cache” (KV cache) that speeds up subsequent runs with the same prompt.
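
To watch this happen (a sketch; ollama ps is available in recent Ollama versions):

ollama ps

Run it right after a generation and the model should be listed as loaded, with a note about when it will be unloaded; run it again after several idle minutes and the list should be empty.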
  5. Try modifying your prompt to see how it affects the output. For example, you could ask for a more detailed summary, or a summary in bullet points. Consider whether it’s actually a good summary, based on how you’d use it if it were summarizing notifications on your own phone.

For comparison, try the same prompt with a cloud API, like Google Gemini. Also try a larger local model, if you have the resources (e.g., gpt-oss:20b or qwen3:8b).
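
For the larger local model, the same redirect pattern works, assuming you’ve pulled the model and have enough memory:

ollama run gpt-oss:20b < summarize_notifications.txt

For Gemini, the simplest route is to paste the prompt into the web UI. If you’d rather stay in the terminal, a curl call against the Gemini REST API along these lines should work (the model name and request shape are from memory and may have changed; check the current Gemini API quickstart, set GEMINI_API_KEY to your own key, and replace the ... with your notifications):

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents": [{"parts": [{"text": "Summarize the following notifications in 1-2 sentences: ..."}]}]}'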

Discussion

  • What are the benefits of running an LLM locally vs using a cloud API? Consider factors like cost, privacy, control, latency, reliability, etc.
  • What are the limitations of running an LLM locally vs using a cloud API? Consider factors like model size, performance, updates, etc.