Class Meetings

Overall Structure:

Week 1

Wednesday: Intro

During Class

  • Introductions
    • Some interactive AI systems I’ve built: Thoughtful, live-outline, translation workflow, notebook feedback bot, …
  • (on paper) goals, background, project interest
    • What do we want to wrestle with together in this class?
  • Discerning the AI movement

Before Next Time

  1. Vibe code something testable for next time: see the instructions
  2. Get logged into Perusall (via Moodle) and start Reading 1.

Friday

During Class

  • Opening discussion
    • Suggestion-type experiment
    • “Good” vs “Good for”: capability vs health
  • Logistics
    • Why Perusall, how that will work
    • Grading policy discussion
    • Tentative topics list
    • How we should use AI together
  • Activity: vibe prototype testing. Iterating.
  • Project Time:
    • Intro to Project Logistics
    • Exploration and team-forming time

Before Next Time

  1. Finish Reading 1.
  2. Start working through the Anthropic AI Fluency Foundations course

  3. Reflect on your experiences using LLMs, e.g., for the vibe prototyping task, based on the “4 D” framework from that course.

Week 2

Monday: LLM APIs and Prompting

  • Logistics
    • How we should use AI: together.
    • Projects
  • Activity: prompting
    • Fundamentals: context, messages, responses
    • Intro to prompt engineering and context engineering
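A minimal sketch of the messages pattern, assuming the Anthropic Python SDK (any chat-style LLM API has a similar request/response shape); the model name below is a placeholder:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; substitute whatever model the class uses
    max_tokens=300,
    system="You are a concise, friendly course advisor.",  # context that shapes every reply
    messages=[
        {"role": "user", "content": "Suggest one elective for a student who likes graphics."}
    ],
)

print(response.content[0].text)  # the reply is a list of content blocks; here, just text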

Wednesday: Continuing LLM APIs

  • Debrief Anthropic course
  • Logistics: Reading 2
  • Finish prompting activity.
    • Go back and do Step 2B
    • Also do the new Step 4 (refresh if you don’t see it)
  • Start working on course advisor bot

Friday: Evaluation

Week 3

Monday: Review, Catch-Up, and Project Work

  • Check-in, open Q&A
  • Main concepts from last week
    • Monday
      • A prompt is a program.
      • The LLM “messages” API pattern
    • Wednesday
      • We can interpret LLM outputs as more than just things to say to the user: tool calls, “thoughts”, etc.
      • (Let’s do an example function call together; a sketch follows this list.)
    • Friday
      • Evaluation is hard, but we can do better than “it seems good”
  • Work on unfinished activities from last week
    • Evaluation activity (debrief questions)
    • Work on course advisor bot
  • Project work time
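A sketch of the function-call idea from Wednesday’s recap above, again assuming the Anthropic Python SDK; the tool name and model name are made up for illustration:

import anthropic

client = anthropic.Anthropic()

# A hypothetical tool the course advisor bot could call.
tools = [{
    "name": "get_course_sections",
    "description": "Look up the sections currently offered for a course code.",
    "input_schema": {
        "type": "object",
        "properties": {"course_code": {"type": "string"}},
        "required": ["course_code"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=300,
    tools=tools,
    messages=[{"role": "user", "content": "Is CS 262 offered next semester?"}],
)

# The model may answer with a tool_use block instead of plain text; our code would
# run the tool and send the result back in a follow-up message.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. get_course_sections {'course_code': 'CS 262'}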

Wednesday: Project Work

  • Discussion of debrief questions from evaluation activity
  • Project work time
  • Logistics
    • Project proposal due today
    • Reading 3 posted, including “Guidelines for Human-AI Interaction” - useful table for low-level analysis of systems

Notes from the evaluation discussion (the email feedback bot):

  1. What else do we want to know?
     • How does it handle emails of different quality?
     • What happens when someone uses it incorrectly?
     • What happens when people who don’t know what you’re trying to do use it?
     • How does it affect how we present ourselves to others?
  2. What would we want to get alerted for?
     • The AI insults the user or the user’s work.
     • The AI breaks the law… (how would we measure that?)
       • e.g., suggesting doing something illegal
       • (are we responsible for what the AI says?)
     • The AI suggests something not consistent with our policies.
     • The AI includes PII that’s not in the original message.
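A toy sketch of how that last alert could be automated: flag PII-like strings (email addresses, phone numbers) that appear in the bot’s reply but not in the original message. The patterns are illustrative, not a real PII detector.

import re

# Rough, illustrative patterns; a real PII check would be far more careful.
PII_PATTERNS = [
    r"[\w.+-]+@[\w-]+\.[\w.-]+",         # email addresses
    r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",  # US-style phone numbers
]

def new_pii(original_email: str, bot_reply: str) -> list[str]:
    """Return PII-like strings that the reply contains but the original email did not."""
    leaked = []
    for pattern in PII_PATTERNS:
        for match in re.findall(pattern, bot_reply):
            if match not in original_email:
                leaked.append(match)
    return leaked

# new_pii("Please look over my draft.", "You could cc advising@example.edu on this.")
# -> ["advising@example.edu"], which would trigger an alert for human review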

Friday: Testing Course Advisor Bots, Project Work

  • Projects
    • Be much more specific about use case. Make an amazing solution for a narrow problem, not a mediocre solution for a broad problem.
  • Debrief from handouts
    • If the model gives a rationale for its answer, is that transparent?
    • Agents: calling tools in a loop to achieve a goal
    • What would we do with a “reference” human output for a task like the email feedback bot?
    • Ratings are hard to calibrate (what does a “4” mean?), but rankings can be easier to get right (which of these two is better?). Possible example to work through.
    • Gathering user feedback: A/B tests, comparisons
  • Activity: Testing Course Advisor Bots
  • Project work time

What would we do with a reference “good” output?

  • give it to the bot as an example
    • in-context learning - made practical by self-attention (the Transformer architecture)
    • but the LLM can “overfit” to the example
    • make sure you give a range of examples
    • test the impact of these examples
  • fine-tune the model on it
    • start with a pre-trained model and update its weights to fit the new data better
    • but you need LOTS of examples
    • could be more efficient at run time (no need to include examples in every prompt)
  • use it to evaluate / iterate
    • score how much the output aligns with the reference
      • word / phrase overlap? but exact matches may be unlikely (toy sketch below)
      • summarize the tone / ideas / points / …, then compute overlap of that
      • give the LLM a comparison task: “How much do these align?”
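A toy sketch of the word-overlap idea: score a bot output by how much of a reference “good” output it covers. Deliberately crude, but even a rough score beats “it seems good”:

import re

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def overlap_score(output: str, reference: str) -> float:
    """Fraction of the reference's words that also appear in the output."""
    ref = words(reference)
    return len(words(output) & ref) / len(ref) if ref else 0.0

print(overlap_score(
    "You might enjoy CS 262 in the spring; it covers graphics.",
    "Consider CS 262 (graphics) in the spring.",
))  # about 0.86: most of the reference is covered even though nothing matches exactly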

Other stakeholders:

  • transfer students
  • Entrada or other prospective students: see courses that might be of interest
  • advisors: saves them time, but it might miss courses, e.g., those with shorter descriptions (a bias), or recommend the same class to many students
    • might replace the human interaction of advising days, which is about more than picking classes.
  • The system’s data was limited
    • only current sections
    • scheduling
    • requirements / programs (current and possible programs, overall)
    • seats available
    • student’s year, prior classes
    • faculty match

Other planned notes:

  • Stakeholders: Current students, prospective students, advisors, instructors, department chairs, IT staff, …
  • One-shot interaction: no conversation; focused on an explicit statement of interests; no constraints; no connection to broader programs; driven by interest, not long-term goals (who do you want to become?)
  • Assumptions: students know what they want and can articulate it succinctly (vs. discovery and exploration); the course is the unit of interest; no feedback (the only data is the course catalog); no constraints (e.g., time of day, prerequisites, …)
  • Are there courses that will get systematically under- or over-recommended by the bot?

Before Next Time

Install Ollama on your machine and get at least gemma3:270m running. If you have the disk space and memory, you could also try a larger model, such as gpt-oss:20b or one of the Qwen models.
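Once the model is downloaded (e.g., “ollama run gemma3:270m” in a terminal), a quick programmatic check is to hit Ollama’s local HTTP API (default port 11434); a minimal sketch using the requests library:

import requests

# Ollama serves a local HTTP API on port 11434 by default.
reply = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:270m",
        "prompt": "In one sentence, what is a large language model?",
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=120,
)
print(reply.json()["response"])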

Week 4

Monday

  • Logistics
    • Next project milestone
    • Readings
    • Weekly reflections
    • DS Social
  • Local LLMs: activity
  • Project work time

Wednesday

  • Discussion
    • Benefits and drawbacks of local models
    • KV Cache
    • “Summarize” = discern what’s important. Straightforward?
  • Course Progress Check
    • So far: prototyping in human contexts, how to program with and for LLMs, evaluation challenges
    • Next: discuss the questions on the handout
  • Project work time

Friday

  • Results of topics survey
    • Activities:
      • Analyzing more existing systems (12) (but few specifics given)
      • Upgrading course advisor bot (10)
      • Debate about AI (8): topics like privacy, use in education, and job impacts
    • Readings
      • business impacts (10)
      • engineering and technology details (7)
      • opportunities (7) vs harms (5)
      • faith perspectives (5)
  • An analysis of Perusall’s comment quality scoring AI
    • Design Norms
    • Toy Version of the algorithm
    • Toy prototype of an alternative kind of interaction
  • Project work time

How to score quality?

Without fancy AI:
- length
- uniqueness (# of words that aren't already in the highlighted text)
- ...other factors? All of these are game-able.
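A toy version of this no-AI scorer (length plus uniqueness), mostly to show how game-able it is:

def simple_score(comment: str, highlighted_text: str) -> float:
    """Toy heuristic: reward longer comments that add words not in the highlighted text."""
    comment_words = comment.lower().split()
    source_words = set(highlighted_text.lower().split())
    unique = [w for w in comment_words if w not in source_words]
    # Padding a comment with novel filler words inflates the score: exactly the
    # "game-able" problem noted above.
    return len(comment_words) + 2 * len(unique)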

With an LLM or other AI:
- Provide both the highlighted text and the comment
- Write a rubric for what constitutes a good comment;
  score based on how many rubric items match
- Give examples of good and mediocre comments in an LLM prompt
- Fine-tune a classification model


def score(comment, highlighted_text):
    # construct an LLM prompt, run it, and extract a class/score
    # OR
    # run a fine-tuned classification model
    ...  # placeholder body; one concrete option is sketched below
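Filling in the stub above with the first option: a sketch assuming the Anthropic Python SDK, with a made-up rubric, a placeholder model name, and a 0-3 scale chosen for illustration.

import anthropic

client = anthropic.Anthropic()

RUBRIC = """\
A good comment: (1) engages with the highlighted text, (2) adds an idea, question,
or connection not already in the text, and (3) invites a response from classmates."""

def score(comment: str, highlighted_text: str) -> int:
    """Ask an LLM how many rubric items (0-3) the comment satisfies."""
    prompt = (
        f"Rubric:\n{RUBRIC}\n\n"
        f"Highlighted text:\n{highlighted_text}\n\n"
        f"Comment:\n{comment}\n\n"
        "How many rubric items does the comment satisfy? Reply with a single digit 0-3."
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.content[0].text.strip())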


Norms and reflections
- Transparency: Instructor should be able to fine-tune
- Social: how does scoring affect how others perceive our comments?
- Justice and Trust: people interact with content differently; some thoughts are good but wouldn't be scored highly (e.g., not viewed as relevant)
- Caring and Justice: Perusall is not as accessible as it could be (no text-to-speech support)
- Trust: what if people trust AI judgment more than human judgment?

Overall: what's the *goal* of reading and commenting?

Caring: judge whether comments are promoting a thoughtful environment, not just a grade

Could Perusall provide questions for students to respond to?

Week 5

Monday

  • Projects: project milestones should be revised proposals rather than diffs (keep the same format, just highlight what’s changed)
  • Analyze some existing system: Recommender Systems
  • Micro-teaching on Model Context Protocol
  • Apply “Guidelines for Human-AI Interaction” to our projects

For more about the Playwright MCP, see “Using Playwright MCP with Claude Code” on Simon Willison’s TILs.

Wednesday

Friday

Week 6

Monday

  • Image, video, and audio generation:
    • Where does the training data come from?
    • How to evaluate these models?
  • Rest of the course
    • Wednesday: peer feedback on projects
    • Friday: Mini debate, project work time
    • Monday project presentations 1
    • Wednesday project presentations 2
    • Remaining at-home activities:
      • Course advisor bot
      • Write a prayer about AI?
  • Ungrading logistics
  • Project work time

Wednesday

  • A great solution to the wrong problem
  • Logistics
  • Activity: Peer feedback on projects
    • Pair up with one or two other project groups
    • Use the project rubric to organize your time together
    • Switch after about 15 minutes
  • How do we measure “intelligent”? Problem: benchmarks ≠ real tasks.

Possible additional activities:

  • Try actually getting something done using a voice assistant chatbot.
  • Read announcement posts for a few different AI products and analyze whether the metrics they use are meaningful in human contexts.

Possible additional topics:

  • Reliable, safe, and trustworthy AI systems
  • Mimicry vs exploration (RL)
  • Advanced LLM APIs
    • multimodal I/O
  • History of human-AI interaction (ELIZA)
  • Theories of impacts of automation (e.g., levels of automation in autonomous vehicles)
  • Problems of automation (e.g., over-reliance)
  • Grounding in domain-specific data (e.g., RAG)

Specific risks for AI: