W2D3: Testing Email Feedback Bots
Goal
Test the reliability of both human and LLM evaluation of your message feedback bot.
Key Insight
When building is easy, evaluation becomes the bottleneck. How do we know if our AI systems are actually working?
Setup
- Each team needs their message feedback bot working
- Have your original email example and the bot’s feedback ready
- Have your guidelines/problems list from W2D1 prep accessible
Activity
Part A: Human Rating Variance
Each person on your team independently:
- Rate your bot’s feedback quality: 1 (unhelpful) to 5 (excellent)
- Write one sentence explaining your rating
Record: Individual ratings (e.g., “3, 4, 4”)
Part B: LLM Judge Prompt Engineering
Create two versions of an evaluation prompt using your guidelines/problems from W2D1 (a sketch for assembling both versions in code follows the templates):
Version 1: Rating-first
Rate this email feedback on a 1-5 scale, then explain your reasoning.
Guidelines the feedback should follow:
[Insert your team's guidelines from W2D1 here]
Common problems to avoid:
[Insert your team's problems list from W2D1 here]
Email: [original email]
Feedback: [bot's feedback]
Rating (1-5):
Explanation:
Version 2: Reasoning-first
Evaluate this email feedback, then give a 1-5 rating.
[Same guidelines and problems sections]
Email: [original email]
Feedback: [bot's feedback]
Analysis:
Rating (1-5):
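To cut down on copy-paste errors when you swap in your team's material, the two prompt variants can be assembled in code from the same shared pieces. This is a minimal Python sketch; `GUIDELINES`, `PROBLEMS`, `EMAIL`, and `FEEDBACK` are placeholders for your W2D1 material and your bot's actual output, not values from this handout.

```python
# Assemble both judge prompts from shared pieces so the only difference
# between them is whether the rating or the reasoning comes first.
# All four constants below are placeholders -- fill in your team's material.

GUIDELINES = "..."  # your team's guidelines from W2D1
PROBLEMS = "..."    # your team's problems list from W2D1
EMAIL = "..."       # the original email
FEEDBACK = "..."    # your bot's feedback on that email

SHARED = f"""Guidelines the feedback should follow:
{GUIDELINES}

Common problems to avoid:
{PROBLEMS}

Email: {EMAIL}
Feedback: {FEEDBACK}"""

RATING_FIRST = f"""Rate this email feedback on a 1-5 scale, then explain your reasoning.

{SHARED}

Rating (1-5):
Explanation:"""

REASONING_FIRST = f"""Evaluate this email feedback, then give a 1-5 rating.

{SHARED}

Analysis:
Rating (1-5):"""
```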
Part C: LLM Rating Variance
For each version of your prompt:
1. Run it 3 times in your LLM (use temperature > 0 to get variation). Read the explanations and tweak your prompt if necessary.
2. Extract just the numeric ratings (a scripted sketch of these steps follows the example record below).
Record your results like this:
Human ratings: [3, 4, 4] (range: 1)
LLM Version 1 (rating-first): [4, 4, 3] (range: 1)
LLM Version 2 (reasoning-first): [2, 4, 3] (range: 2)
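If you would rather script the three runs than click through a chat UI, the sketch below continues from the prompt-assembly snippet above. It assumes the OpenAI Python client (openai >= 1.0) with an API key in the environment; the model name `gpt-4o-mini`, the temperature setting, and the regex heuristic in `extract_rating` are all assumptions you may need to adapt to your setup and your judge's output format.

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_rating(text):
    """Pull a 1-5 rating out of a judge response.

    Prefer an explicit 'Rating (1-5): N' line; otherwise fall back to the
    last standalone digit 1-5 in the text. Returns None if nothing matches.
    """
    m = re.search(r"Rating\s*\(1-5\)\s*:\s*([1-5])", text)
    if m:
        return int(m.group(1))
    digits = re.findall(r"(?<![\d-])([1-5])(?![\d-])", text)
    return int(digits[-1]) if digits else None

def judge(prompt, runs=3):
    """Run the judge prompt several times and return the extracted ratings."""
    ratings = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: swap in whatever model your team uses
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,      # temperature > 0 so repeated runs can differ
        )
        ratings.append(extract_rating(resp.choices[0].message.content))
    return ratings

# RATING_FIRST and REASONING_FIRST come from the prompt-assembly sketch above.
for name, prompt in [("Version 1 (rating-first)", RATING_FIRST),
                     ("Version 2 (reasoning-first)", REASONING_FIRST)]:
    scores = judge(prompt)
    valid = [s for s in scores if s is not None]
    spread = max(valid) - min(valid) if valid else "n/a"
    print(f"LLM {name}: {scores} (range: {spread})")
```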
Debrief Questions
Suppose you were going to deploy this bot. Discuss with your team:
- How would you initially evaluate its performance (to determine whether it’s good enough to launch)? What else would you need to know beyond the results of the test you just did?
- How would you create a regression test to ensure that changes (like the LLM provider changing the model) don’t harm performance? What criterion would you set for alerting your team?
In both scenarios, consider the roles of human and LLM evaluators, and factor in the variability of judgments you just measured; one possible shape for the regression check is sketched below.
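As a starting point for discussion, here is one concrete form the regression check could take, sketched under heavy assumptions: `generate_feedback` and `judge_feedback` stand in for your bot and your LLM judge, `EVAL_EMAILS` is a small fixed set of representative emails your team would curate, and the 3.5 alert threshold and three runs per email are illustrative numbers, not recommendations.

```python
from statistics import mean

# Placeholder eval set: a handful of representative emails you re-test on
# every change (new model version, prompt tweak, provider switch, ...).
EVAL_EMAILS = [
    "example email 1 ...",
    "example email 2 ...",
]

ALERT_THRESHOLD = 3.5  # illustrative criterion: alert the team if the average drops below this
RUNS_PER_EMAIL = 3     # judge each output several times to smooth out judge variance

def regression_check(generate_feedback, judge_feedback):
    """Return True if the bot still meets the bar on the fixed eval set.

    generate_feedback(email) -> str        : your feedback bot (assumed helper)
    judge_feedback(email, feedback) -> int : your LLM judge, returning a 1-5 rating (assumed helper)
    """
    scores = []
    for email in EVAL_EMAILS:
        feedback = generate_feedback(email)
        scores += [judge_feedback(email, feedback) for _ in range(RUNS_PER_EMAIL)]
    average = mean(scores)
    print(f"average judge rating: {average:.2f} (alert threshold: {ALERT_THRESHOLD})")
    return average >= ALERT_THRESHOLD
```

Whether the criterion should be an average, a minimum, or a per-example check, and how many judge runs are enough given the variance you measured, are exactly the questions the debrief asks you to argue out.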
Extension Ideas (if time)
- Different example: Test on a different email example
- Prompt order effects: Compare the explanations from your rating-first and reasoning-first runs. Does the LLM appear to rationalize a rating it has already committed to?
- Guidelines impact: Run the same evaluation without your specific guidelines/problems. Does including them change the results?
- Cross-evaluation: Have another team evaluate your bot’s output
- Different model: Run the same evaluation with a different LLM