Class Meetings

Overall Structure:

Week 1

Wednesday: Intro

During Class

  • Introductions
    • Some interactive AI systems I’ve built: Thoughtful, live-outline, translation workflow, notebook feedback bot, …
  • (on paper) goals, background, project interest
    • What do we want to wrestle with together in this class?
  • Discerning the AI movement

Before Next Time

  1. Vibe code something testable for next time: see the instructions
  2. Get logged into Perusall (via Moodle) and start Reading 1.

Friday

During Class

  • Opening discussion
    • Suggestion-type experiment
    • “Good” vs “Good for”: capability vs health
  • Logistics
    • Why Perusall, how that will work
    • Grading policy discussion
    • Tentative topics list
    • How we should use AI together
  • Activity: vibe prototype testing. Iterating.
  • Project Time:
    • Intro to Project Logistics
    • Exploration and team-forming time

Before Next Time

  1. Finish Reading 1.
  2. Start working through the Anthropic AI Fluency Foundations course

  3. Reflect on your experiences using LLMs, e.g., for the vibe prototyping task, based on the “4 D” framework from that course.

Week 2

Monday: LLM APIs and Prompting

  • Logistics
    • How we should use AI: together.
    • Projects
  • Activity: prompting
    • Fundamentals: context, messages, responses
    • Intro to prompt engineering and context engineering
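A minimal sketch of the messages pattern, assuming the Anthropic Python SDK (any chat-style LLM API has a similar request/response shape); the model name below is a placeholder:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; substitute whatever model the class uses
    max_tokens=300,
    system="You are a concise, friendly course advisor.",  # context that shapes every reply
    messages=[
        {"role": "user", "content": "Suggest one elective for a student who likes graphics."}
    ],
)

print(response.content[0].text)  # the reply is a list of content blocks; here, just text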

Wednesday: Continuing LLM APIs

  • Debrief Anthropic course
  • Logistics: Reading 2
  • Finish prompting activity.
    • Go back and do Step 2B
    • Also do the new Step 4 (refresh if you don’t see it)
  • Start working on course advisor bot

Friday: Evaluation

Week 3

Monday: Review, Catch-Up, and Project Work

  • Check-in, open Q&A
  • Main concepts from last week
    • Monday
      • A prompt is a program.
      • The LLM “messages” API pattern
    • Wednesday
      • We can interpret LLM outputs as more than just things to say to the user: tool calls, “thoughts”, etc.
      • (Let’s do an example function call together; a sketch follows this list.)
    • Friday
      • Evaluation is hard, but we can do better than “it seems good”
  • Work on unfinished activities from last week
    • Evaluation activity (debrief questions)
    • Work on course advisor bot
  • Project work time
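A sketch of the function-call idea from Wednesday’s recap above, again assuming the Anthropic Python SDK; the tool name and model name are made up for illustration:

import anthropic

client = anthropic.Anthropic()

# A hypothetical tool the course advisor bot could call.
tools = [{
    "name": "get_course_sections",
    "description": "Look up the sections currently offered for a course code.",
    "input_schema": {
        "type": "object",
        "properties": {"course_code": {"type": "string"}},
        "required": ["course_code"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=300,
    tools=tools,
    messages=[{"role": "user", "content": "Is CS 262 offered next semester?"}],
)

# The model may answer with a tool_use block instead of plain text; our code would
# run the tool and send the result back in a follow-up message.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. get_course_sections {'course_code': 'CS 262'}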

Wednesday: Project Work

  • Discussion of debrief questions from evaluation activity
  • Project work time
  • Logistics
    • Project proposal due today
    • Reading 3 posted, including “Guidelines for Human-AI Interaction” - useful table for low-level analysis of systems

Notes from the evaluation discussion (the email feedback bot):

  1. What else do we want to know?
     • How does it handle emails of different quality?
     • What happens when someone uses it incorrectly?
     • What happens when people who don’t know what you’re trying to do use it?
     • How does it affect how we present ourselves to others?
  2. What would we want to get alerted for?
     • The AI insults the user or the user’s work.
     • The AI breaks the law… (how would we measure that?)
       • e.g., suggesting doing something illegal
       • (are we responsible for what the AI says?)
     • The AI suggests something not consistent with our policies.
     • The AI includes PII that’s not in the original message.
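A toy sketch of how that last alert could be automated: flag PII-like strings (email addresses, phone numbers) that appear in the bot’s reply but not in the original message. The patterns are illustrative, not a real PII detector.

import re

# Rough, illustrative patterns; a real PII check would be far more careful.
PII_PATTERNS = [
    r"[\w.+-]+@[\w-]+\.[\w.-]+",         # email addresses
    r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",  # US-style phone numbers
]

def new_pii(original_email: str, bot_reply: str) -> list[str]:
    """Return PII-like strings that the reply contains but the original email did not."""
    leaked = []
    for pattern in PII_PATTERNS:
        for match in re.findall(pattern, bot_reply):
            if match not in original_email:
                leaked.append(match)
    return leaked

# new_pii("Please look over my draft.", "You could cc advising@example.edu on this.")
# -> ["advising@example.edu"], which would trigger an alert for human review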

Friday: Testing Course Advisor Bots, Project Work

  • Projects
    • Be much more specific about use case. Make an amazing solution for a narrow problem, not a mediocre solution for a broad problem.
  • Debrief from handouts
    • If the model gives a rationale for its answer, is that transparent?
    • Agents: calling tools in a loop to achieve a goal
    • What would we do with a “reference” human output for a task like the email feedback bot?
    • Ratings are hard to calibrate (what does a “4” mean?), but rankings can be easier to get right (which of these two is better?). Possible example to work through.
    • Gathering user feedback: A/B tests, comparisons
  • Activity: Testing Course Advisor Bots
  • Project work time

What would we do with a reference “good” output?

  • give it to the bot as an example
    • in-context learning - made practical by self-attention (the Transformer architecture)
    • but the LLM can “overfit” to the example
    • make sure you give a range of examples
    • test the impact of these examples
  • fine-tune the model on it
    • start with a pre-trained model and update its weights to fit the new data better
    • but you need LOTS of examples
    • could be more efficient at run time (no need to include examples in every prompt)
  • use it to evaluate / iterate
    • score how much the output aligns with the reference
      • word / phrase overlap? but exact matches may be unlikely (toy sketch below)
      • summarize the tone / ideas / points / …, then compute overlap of that
      • give the LLM a comparison task: “How much do these align?”
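A toy sketch of the word-overlap idea: score a bot output by how much of a reference “good” output it covers. Deliberately crude, but even a rough score beats “it seems good”:

import re

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def overlap_score(output: str, reference: str) -> float:
    """Fraction of the reference's words that also appear in the output."""
    ref = words(reference)
    return len(words(output) & ref) / len(ref) if ref else 0.0

print(overlap_score(
    "You might enjoy CS 262 in the spring; it covers graphics.",
    "Consider CS 262 (graphics) in the spring.",
))  # about 0.86: most of the reference is covered even though nothing matches exactly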

Other stakeholders:

  • transfer students
  • Entrada or other prospective students: see courses that might be of interest
  • advisors: saves them time, but it might miss courses, e.g., those with shorter descriptions (a bias), or recommend the same class to many students
    • might replace the human interaction of advising days, which is about more than picking classes.
  • The system’s data was limited
    • only current sections
    • scheduling
    • requirements / programs (current and possible programs, overall)
    • seats available
    • student’s year, prior classes
    • faculty match

Other planned notes:

  • Stakeholders: Current students, prospective students, advisors, instructors, department chairs, IT staff, …
  • One-shot interaction: no conversation; focused on an explicit statement of interests; no constraints; no connection to broader programs; driven by interest, not long-term goals (who do you want to become?)
  • Assumptions: students know what they want and can articulate it succinctly (vs. discovery and exploration); the course is the unit of interest; no feedback (the only data is the course catalog); no constraints (e.g., time of day, prerequisites, …)
  • Are there courses that will get systematically under- or over-recommended by the bot?

Before Next Time

Install Ollama on your machine and get at least gemma3:270m running. If you have the disk space and memory, you could also try a larger model, such as gpt-oss:20b or one of the Qwen models.
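Once the model is downloaded (e.g., “ollama run gemma3:270m” in a terminal), a quick programmatic check is to hit Ollama’s local HTTP API (default port 11434); a minimal sketch using the requests library:

import requests

# Ollama serves a local HTTP API on port 11434 by default.
reply = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:270m",
        "prompt": "In one sentence, what is a large language model?",
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=120,
)
print(reply.json()["response"])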

Week 4

Monday

  • Logistics
    • Next project milestone
    • Readings
    • Weekly reflections
    • DS Social
  • Local LLMs: activity
  • Project work time

Wednesday

  • Discussion
    • Benefits and drawbacks of local models
    • KV Cache
    • “Summarize” = discern what’s important. Straightforward?
  • Course Progress Check
    • So far: prototyping in human contexts, how to program with and for LLMs, evaluation challenges
    • Next: discuss the questions on the handout
  • Project work time

Friday

  • Results of topics survey
    • Activities:
      • Analyzing more existing systems (12) (but few specifics given)
      • Upgrading course advisor bot (10)
      • Debate about AI (8): topics like privacy, use in education, and job impacts
    • Readings
      • business impacts (10)
      • engineering and technology details (7)
      • opportunities (7) vs harms (5)
      • faith perspectives (5)
  • An analysis of Perusall’s comment quality scoring AI
    • Design Norms
    • Toy Version of the algorithm
    • Toy prototype of an alternative kind of interaction
  • Project work time

How to score quality?

Without fancy AI:
- length
- uniqueness (# of words that aren't already in the highlighted text)
- ...other factors? All of these are game-able.
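A toy version of this no-AI scorer (length plus uniqueness), mostly to show how game-able it is:

def simple_score(comment: str, highlighted_text: str) -> float:
    """Toy heuristic: reward longer comments that add words not in the highlighted text."""
    comment_words = comment.lower().split()
    source_words = set(highlighted_text.lower().split())
    unique = [w for w in comment_words if w not in source_words]
    # Padding a comment with novel filler words inflates the score: exactly the
    # "game-able" problem noted above.
    return len(comment_words) + 2 * len(unique)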

With an LLM or other AI:
- Provide both the highlighted text and the comment
- Write a rubric for what constitutes a good comment;
  score based on how many rubric items match
- Give examples of good and mediocre comments in an LLM prompt
- Fine-tune a classification model


def score(comment, highlighted_text):
    # construct an LLM prompt, run it, and extract a class/score
    # OR
    # run a fine-tuned classification model
    ...  # placeholder body; one concrete option is sketched below
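Filling in the stub above with the first option: a sketch assuming the Anthropic Python SDK, with a made-up rubric, a placeholder model name, and a 0-3 scale chosen for illustration.

import anthropic

client = anthropic.Anthropic()

RUBRIC = """\
A good comment: (1) engages with the highlighted text, (2) adds an idea, question,
or connection not already in the text, and (3) invites a response from classmates."""

def score(comment: str, highlighted_text: str) -> int:
    """Ask an LLM how many rubric items (0-3) the comment satisfies."""
    prompt = (
        f"Rubric:\n{RUBRIC}\n\n"
        f"Highlighted text:\n{highlighted_text}\n\n"
        f"Comment:\n{comment}\n\n"
        "How many rubric items does the comment satisfy? Reply with a single digit 0-3."
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.content[0].text.strip())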


Norms and reflections
- Transparency: Instructor should be able to fine-tune
- Social: how does scoring affect how others perceive our comments?
- Justice and Trust: people interact with content differently; some thoughts are good but wouldn't be scored highly (e.g., not viewed as relevant)
- Caring and Justice: Perusall is not as accessible as it could be (no text-to-speech support)
- Trust: what if people trust AI judgment more than human judgment?

Overall: what's the *goal* of reading and commenting?

Caring: judge whether comments are promoting a thoughtful environment, not just a grade

Could Perusall provide questions for students to respond to?

Week 5

Monday

  • Projects: project milestones should be revised proposals rather than diffs (keep the same format, just highlight what’s changed)
  • Analyze some existing system: Recommender Systems
  • Micro-teaching on Model Context Protocol
  • Apply “Guidelines for Human-AI Interaction” to our projects

For more about the Playwright MCP, see “Using Playwright MCP with Claude Code” on Simon Willison’s TILs.

Wednesday

Friday

Week 6

Monday

  • Image, video, and audio generation:
    • Where does the training data come from?
    • How to evaluate these models?
  • Rest of the course
    • Wednesday: peer feedback on projects
    • Friday: Mini debate, project work time
    • Monday project presentations 1
    • Wednesday project presentations 2
    • Remaining at-home activities:
      • Course advisor bot
      • Write a prayer about AI?
  • Ungrading logistics
  • Project work time

Wednesday

  • A great solution to the wrong problem
  • Logistics
  • Activity: Peer feedback on projects
    • Pair up with one or two other project groups
    • Use the project rubric to organize your time together
    • Switch after about 15 minutes
  • How do we measure “intelligent”? Problem: benchmarks ≠ real tasks.

Possible additional activities:

  • Try actually getting something done using a voice assistant chatbot.
  • Read announcement posts for a few different AI products and analyze whether the metrics they use are meaningful in human contexts.

Possible additional topics:

  • Reliable, safe, and trustworthy AI systems
  • Mimicry vs exploration (RL)
  • Advanced LLM APIs
    • multimodal I/O
  • History of human-AI interaction (ELIZA)
  • Theories of impacts of automation (e.g., levels of automation in autonomous vehicles)
  • Problems of automation (e.g., over-reliance)
  • Grounding in domain-specific data (e.g., RAG)

Specific risks for AI: