W2D3: Testing Email Feedback Bots
Goal
Test the reliability of both human and LLM evaluation of your message feedback bot.
Key Insight
When building is easy, evaluation becomes the bottleneck. How do we know if our AI systems are actually working?
Setup
- Each team needs their message feedback bot working
- Have your original email example and the bot’s feedback ready
- Have your guidelines/problems list from W2D1 prep accessible
Activity
Part A: Human Rating Variance
Each person on your team independently:
- Rate your bot’s feedback quality: 1 (unhelpful) to 5 (excellent)
- Write one sentence explaining your rating
Record: Individual ratings (e.g., “3, 4, 4”)
Part B: LLM Judge Prompt Engineering
Create two versions of an evaluation prompt using your guidelines/problems from W2D1 (a sketch for assembling both versions in code follows the templates):
Version 1: Rating-first
Rate this email feedback on a 1-5 scale, then explain your reasoning.
Guidelines the feedback should follow:
[Insert your team's guidelines from W2D1 here]
Common problems to avoid:
[Insert your team's problems list from W2D1 here]
Email: [original email]
Feedback: [bot's feedback]
Rating (1-5):
Explanation:
Version 2: Reasoning-first
Evaluate this email feedback, then give a 1-5 rating.
[Same guidelines and problems sections]
Email: [original email]
Feedback: [bot's feedback]
Analysis:
Rating (1-5):
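To cut down on copy-paste errors when you swap in your team's material, the two prompt variants can be assembled in code from the same shared pieces. This is a minimal Python sketch; `GUIDELINES`, `PROBLEMS`, `EMAIL`, and `FEEDBACK` are placeholders for your W2D1 material and your bot's actual output, not values from this handout.

```python
# Assemble both judge prompts from shared pieces so the only difference
# between them is whether the rating or the reasoning comes first.
# All four constants below are placeholders -- fill in your team's material.

GUIDELINES = "..."  # your team's guidelines from W2D1
PROBLEMS = "..."    # your team's problems list from W2D1
EMAIL = "..."       # the original email
FEEDBACK = "..."    # your bot's feedback on that email

SHARED = f"""Guidelines the feedback should follow:
{GUIDELINES}

Common problems to avoid:
{PROBLEMS}

Email: {EMAIL}
Feedback: {FEEDBACK}"""

RATING_FIRST = f"""Rate this email feedback on a 1-5 scale, then explain your reasoning.

{SHARED}

Rating (1-5):
Explanation:"""

REASONING_FIRST = f"""Evaluate this email feedback, then give a 1-5 rating.

{SHARED}

Analysis:
Rating (1-5):"""
```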
Part C: LLM Rating Variance
For each version of your prompt:
1. Run it 3 times in your LLM (use temperature > 0 to get variation). Read the explanations and tweak your prompt if necessary.
2. Extract just the numeric ratings (a scripted sketch of these steps follows the example record below).
Record your results like this:
Human ratings: [3, 4, 4] (range: 1)
LLM Version 1 (rating-first): [4, 4, 3] (range: 1)
LLM Version 2 (reasoning-first): [2, 4, 3] (range: 2)
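If you would rather script the three runs than click through a chat UI, the sketch below continues from the prompt-assembly snippet above. It assumes the OpenAI Python client (openai >= 1.0) with an API key in the environment; the model name `gpt-4o-mini`, the temperature setting, and the regex heuristic in `extract_rating` are all assumptions you may need to adapt to your setup and your judge's output format.

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_rating(text):
    """Pull a 1-5 rating out of a judge response.

    Prefer an explicit 'Rating (1-5): N' line; otherwise fall back to the
    last standalone digit 1-5 in the text. Returns None if nothing matches.
    """
    m = re.search(r"Rating\s*\(1-5\)\s*:\s*([1-5])", text)
    if m:
        return int(m.group(1))
    digits = re.findall(r"(?<![\d-])([1-5])(?![\d-])", text)
    return int(digits[-1]) if digits else None

def judge(prompt, runs=3):
    """Run the judge prompt several times and return the extracted ratings."""
    ratings = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: swap in whatever model your team uses
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,      # temperature > 0 so repeated runs can differ
        )
        ratings.append(extract_rating(resp.choices[0].message.content))
    return ratings

# RATING_FIRST and REASONING_FIRST come from the prompt-assembly sketch above.
for name, prompt in [("Version 1 (rating-first)", RATING_FIRST),
                     ("Version 2 (reasoning-first)", REASONING_FIRST)]:
    scores = judge(prompt)
    valid = [s for s in scores if s is not None]
    spread = max(valid) - min(valid) if valid else "n/a"
    print(f"LLM {name}: {scores} (range: {spread})")
```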
Debrief Questions
Suppose you were going to deploy this bot. Discuss with your team:
- How would you initially evaluate its performance (to determine whether it’s good enough to launch)? What else would you need to know beyond the results of the test you just did?
- How would you create a regression test to ensure that changes (like the LLM provider changing the model) don’t harm performance? What criterion would you set for alerting your team?
In both scenarios, consider the roles of human and LLM evaluators, and factor in the variability of judgments you just measured; one possible shape for the regression check is sketched below.
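As a starting point for discussion, here is one concrete form the regression check could take, sketched under heavy assumptions: `generate_feedback` and `judge_feedback` stand in for your bot and your LLM judge, `EVAL_EMAILS` is a small fixed set of representative emails your team would curate, and the 3.5 alert threshold and three runs per email are illustrative numbers, not recommendations.

```python
from statistics import mean

# Placeholder eval set: a handful of representative emails you re-test on
# every change (new model version, prompt tweak, provider switch, ...).
EVAL_EMAILS = [
    "example email 1 ...",
    "example email 2 ...",
]

ALERT_THRESHOLD = 3.5  # illustrative criterion: alert the team if the average drops below this
RUNS_PER_EMAIL = 3     # judge each output several times to smooth out judge variance

def regression_check(generate_feedback, judge_feedback):
    """Return True if the bot still meets the bar on the fixed eval set.

    generate_feedback(email) -> str        : your feedback bot (assumed helper)
    judge_feedback(email, feedback) -> int : your LLM judge, returning a 1-5 rating (assumed helper)
    """
    scores = []
    for email in EVAL_EMAILS:
        feedback = generate_feedback(email)
        scores += [judge_feedback(email, feedback) for _ in range(RUNS_PER_EMAIL)]
    average = mean(scores)
    print(f"average judge rating: {average:.2f} (alert threshold: {ALERT_THRESHOLD})")
    return average >= ALERT_THRESHOLD
```

Whether the criterion should be an average, a minimum, or a per-example check, and how many judge runs are enough given the variance you measured, are exactly the questions the debrief asks you to argue out.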
Extension Ideas (if time)
- Different example: Test on a different email example
- Prompt order effects: Compare the explanations from your rating-first and reasoning-first runs. Does the LLM appear to rationalize a rating it has already committed to?
- Guidelines impact: Run the same evaluation without your specific guidelines/problems. Does including them change the results?
- Cross-evaluation: Have another team evaluate your bot’s output
- Different model: Run the same evaluation with a different LLM