Course Advisor Bot 2.0: Conversational Agent with MCP

In your original course advisor bot, you built a one-shot system: the user states their interests, the bot generates search queries, searches the catalog, and returns recommendations. No conversation, no clarification, no constraints beyond interest matching.

In this assignment, you’ll redesign it as a conversational agent using the Model Context Protocol (MCP) to expose course search tools, and you’ll implement a testing framework to evaluate its performance.

This assignment integrates concepts from our prior activities.

Learning Objectives

By completing this assignment, you will:

  1. Architect a conversational AI system that maintains context across multiple turns
  2. Design and implement MCP tools with well-defined interfaces for an agent
  3. Evaluate AI system performance using automated testing frameworks
  4. Analyze security and trust implications of tool-based AI systems
  5. Apply stakeholder-centered design to identify and address systematic biases

Part 1: MCP Service Implementation (30%)

Requirements

Create an MCP service (mcp_service.py) that exposes at least 3 tools for searching the course sections data.

The sections JSON file contains section-level data (not just course-level), so your tools should work with sections.

If you want to use this bot for actual Spring 2026 course-selection advising, it will help to have more up-to-date data.

I pulled the Spring 2026 section data by watching the Network tab on the Course Offerings tool. It’s compressed with LZMA to save space. You can decompress it in Python with something like:

import lzma
import json
with lzma.open("Sections-26SP.json.lzma") as f:
    sections = json.load(f)

There appears to be a new outer-level “report” key, and there might be some other schema changes, so you may need to adjust your code accordingly. I suggest working in a Jupyter notebook in VS Code to explore the data interactively; Copilot can help with that.
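
If the new file does wrap everything in a “report” key, a minimal unwrapping sketch might look like the following (this is an assumption; check the actual structure before relying on it):

import lzma
import json

with lzma.open("Sections-26SP.json.lzma") as f:
    data = json.load(f)

# Assumption: the Spring 2026 export wraps the payload in an outer "report" key;
# fall back to the whole object if it doesn't.
sections = data["report"] if isinstance(data, dict) and "report" in data else data

# Peek at the structure before writing any tools against it
print(type(sections))
print(list(sections.keys())[:10] if isinstance(sections, dict) else sections[0])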

Suggested Tools

You should implement at least 3. Here are some options:

  1. find_courses(query: str): Text search over course titles and descriptions (from your original bot)
  2. find_sections(course_title: str): Get all sections for a given course title
  3. find_sections_by_department(department: str): Filter by department code (e.g., “CS”, “ENGL”)
  4. find_sections_by_level(level: str): Filter by course level (e.g., “100”, “200”, “300”, “400”)
  5. find_sections_by_time(time_of_day: str): Filter by time preferences (morning/afternoon/evening)
  6. get_section_details(section_id: str): Get full information about a specific section
  7. find_sections_filtered(query: str, department: str | None, level: str | None, ...): Combined search with multiple filters

You can design your own tools beyond this list if they serve your use case better.

Tool Design Considerations

  • Clear interfaces: Each tool should have well-documented parameters with type hints
  • Useful docstrings: Describe what the tool does and give examples (the LLM will see these!)
  • Appropriate granularity: Not too specific (requiring many calls) or too broad (returning too much data)
  • Error handling: What happens if no results are found? Empty list? Error message?
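
For example, here is a rough sketch of a department-filter tool that follows these guidelines. It reuses the sections list loaded in the starter code below and assumes each section has a "SectionName" field like "CS 300-A" (the same assumption the test examples in Part 3 make); adjust to the real schema.

@mcp.tool()
def find_sections_by_department(department: str) -> list[dict]:
    """
    Return all sections offered by a department.

    department: a department code such as "CS" or "ENGL" (case-insensitive).
    Returns an empty list if no sections match.

    Example: find_sections_by_department("CS")
    """
    department = department.strip().upper()
    return [
        section for section in sections
        if section.get("SectionName", "").upper().startswith(department + " ")
    ]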

Testing Your MCP Service

Test your service using the MCP inspector:

uv run mcp dev mcp_service.py

You should be able to:

  1. List your tools
  2. Call each tool with example inputs
  3. Verify the outputs match your expectations

Starter Code

from mcp.server.fastmcp import FastMCP
import requests

mcp = FastMCP("Course Advisor Tools")

# Load the sections data
sections_json_url = "https://cs.calvin.edu/courses/cs/375/25sp/notebooks/AY25FA25_Sections.json"
response = requests.get(sections_json_url)
response.raise_for_status()
sections = response.json()

@mcp.tool()
def find_sections(query: str) -> list[dict]:
    """
    Search for course sections by text query.

    Searches section titles and descriptions for the given query string.
    Returns a list of matching sections with basic information.

    Example: find_sections("artificial intelligence")
    """
    # TODO: Implement your search logic
    pass

# TODO: Add more tools here

if __name__ == "__main__":
    mcp.run(transport="stdio")

Part 2: Conversational Agent (30%)

Requirements

Create a conversational agent (course_advisor_agent.py) that:

  1. Connects to your MCP service to access the course search tools
  2. Maintains conversation context across multiple user inputs
  3. Uses a system prompt to define the advisor’s behavior and personality
  4. Handles multi-turn interactions like clarification questions and progressive constraint refinement

Conversational Design Considerations

Think about:

  • When to ask clarifying questions vs. when to just show results
  • How to handle vague requests (e.g., “I want something interesting”)
  • How to handle follow-up queries (e.g., “Tell me more about that CS course”)
  • How to handle constraints (e.g., “I can only take classes in the morning”)
  • When to suggest exploration vs. when to be directive

System Prompt Design

Your system prompt shapes the agent’s behavior. Consider:

  • What persona should the advisor have? (Formal? Friendly? Socratic?)
  • When should it call tools vs. rely on general knowledge?
  • How should it structure its responses?
  • What should it do when it can’t find good matches?
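
As a rough starting point only (a sketch, not a finished prompt), yours might begin something like:

system_prompt = """You are a friendly, concise course advisor for undergraduate students.

- Ask at most one clarifying question when the student's interests or constraints are
  unclear; otherwise search right away.
- Use the course-search tools for any factual claim about sections; never invent
  courses, meeting times, or instructors.
- Recommend at most 3-5 sections, each with its name, title, and meeting time, plus
  one sentence on why it fits the request.
- If nothing matches, say so plainly and suggest how to broaden the search.
"""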

Conversation Interface

You can implement the conversation interface however you prefer:

  • CLI loop (like the MCP activity example)
  • Streamlit app (see mcp-streamlit for integration examples)
  • Jupyter notebook with interactive widgets
  • Other framework of your choice

The interface doesn’t need to be polished—functionality matters more than aesthetics.

Context Management

You might not need to manually manage conversation history—the conversation object from the llm package (or equivalent in your chosen framework) typically handles this for you. However, you should think about the following (a rough trimming sketch appears after the list):

  • How far back should the context extend?
  • Are there performance implications of long conversations?
  • Should you summarize or compress old context?
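
If you do end up managing history yourself (for example, building prompts from a plain list of message dicts rather than relying on the conversation object), a minimal trimming sketch might look like this; the message format here is an assumption:

MAX_TURNS = 20  # tune based on the model's context window and observed latency

def trim_history(messages: list[dict], max_turns: int = MAX_TURNS) -> list[dict]:
    """Keep the system message plus only the most recent turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]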

Starter Code Pattern

Building on the W5D2 MCP client example, your agent might look like:

import asyncio
import llm
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Configure MCP server connection
server_params = StdioServerParameters(
    command="uv",
    args=["run", "mcp_service.py"],
)

# [Include MCPToolbox class from W5D2 activity]

# Get the model you want to use (you might try different ones)
model = llm.get_async_model('gemini-2.5-flash')

system_prompt = """You are a helpful course advisor...
[TODO: Complete your system prompt here]
"""

async def main():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            toolbox = MCPToolbox(session)
            await toolbox.prepare_async()

            # Start a conversation so context persists across turns
            conversation = model.conversation()

            # Conversation loop
            while True:
                user_input = input("> ").strip()
                if not user_input or user_input.lower() in ['exit', 'quit']:
                    break

                # TODO: Implement your conversation logic here

if __name__ == "__main__":
    asyncio.run(main())

Part 3: Testing & Evaluation (25%)

Testing conversational AI systems requires both technical validation (do the tools work?) and interaction quality assessment (are the conversations helpful?). We’ll address these differently.

Part 3A: Testing MCP Tools with pytest (Required)

Write pytest tests for your MCP tools to verify they return appropriate results.

Example tests:

import pytest
# Import your MCP tool functions

def test_find_sections_returns_cs_courses():
    """Test that search for 'computer science' returns CS courses."""
    results = find_sections("computer science")
    assert len(results) > 0
    assert any("CS" in section["SectionName"] for section in results)

def test_find_sections_handles_no_results():
    """Test behavior when no sections match the query."""
    results = find_sections("zzznonexistentcoursezzz")
    assert results == []  # or however you handle no results

def test_find_sections_by_level():
    """Test that level filtering works correctly."""
    results = find_sections_by_level("300")
    assert len(results) > 0
    for section in results:
        # Assumes SectionName looks like "CS 300-A": drop the section letter to get "CS 300"
        course_code = section["SectionName"].split("-")[0].strip()
        level = course_code.split()[1]
        assert level.startswith("3")

# TODO: Add tests for each of your tools, including edge cases

What to test:

  • Each tool returns expected results for typical inputs
  • Edge cases: empty results, malformed inputs, boundary conditions
  • Tool composition: if one tool uses another, does it work correctly? (One consistency check is sketched below.)
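
For the tool-composition point, one option is a consistency check across tools. This sketch assumes you implemented both find_sections_by_department and find_sections_filtered from the suggested list; adapt it to whatever tools you actually built:

def test_filtered_results_stay_within_department():
    """A combined filter should never return sections outside the chosen department."""
    dept_results = find_sections_by_department("CS")
    filtered_results = find_sections_filtered(query="intelligence", department="CS", level=None)

    dept_names = {section["SectionName"] for section in dept_results}
    for section in filtered_results:
        assert section["SectionName"] in dept_names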

Performance measurement: Add simple timing to understand where latency comes from:

import time

def test_find_sections_performance():
    """Measure how long a typical search takes."""
    start = time.time()
    results = find_sections("programming")
    duration = time.time() - start
    print(f"\nSearch took {duration:.3f} seconds, returned {len(results)} results")
    assert duration < 2.0  # Should be fast for local data

Run your tests: uv run pytest test_mcp_tools.py -v

Part 3B: Evaluating Agent Conversations (Required)

Testing whether conversations are “good” is harder than testing whether functions return correct results. Here’s a practical approach:

Step 1: Add Conversation Logging

Modify your agent to log all conversations to a file.

import json
from datetime import datetime

class ConversationLogger:
    def __init__(self, log_file="conversations.jsonl"):
        self.log_file = log_file
        self.current_conversation = []

    def log_turn(self, role, content, tools_called=None):
        """Log a single conversation turn."""
        turn = {
            "timestamp": datetime.now().isoformat(),
            "role": role,
            "content": content,
        }
        if tools_called:
            turn["tools_called"] = tools_called
        self.current_conversation.append(turn)

    def save_conversation(self, success=True, notes=""):
        """Save the current conversation and start fresh."""
        conversation_record = {
            "id": datetime.now().isoformat(),
            "turns": self.current_conversation,
            "success": success,
            "notes": notes
        }
        with open(self.log_file, "a") as f:
            f.write(json.dumps(conversation_record) + "\n")
        self.current_conversation = []

# In your agent code:
logger = ConversationLogger()

# Track tools called in this turn
current_tools = []

def before_call(tool, tool_call):
    """Hook that runs before each tool call."""
    print(f"Calling {tool.name} with {tool_call.arguments}")
    current_tools.append({
        "name": tool.name,
        "arguments": tool_call.arguments
    })

def after_call(tool, tool_call, tool_result):
    """Hook that runs after each tool call."""
    print(f"{tool.name} returned {len(tool_result.output)} results")

# In your conversation loop (this belongs inside your async main(), since the chain response is consumed with async for):
while True:
    user_input = input("> ").strip()
    if not user_input or user_input.lower() in ['exit', 'quit']:
        logger.save_conversation(notes="User exited")
        break

    # Log user input
    logger.log_turn("user", user_input)

    # Reset tool tracking for this turn (clear in place so before_call keeps appending to the same list)
    current_tools.clear()

    # Get agent response (this will trigger before_call/after_call hooks)
    chain = conversation.chain(
        user_input,
        tools=[toolbox],
        before_call=before_call,
        after_call=after_call
    )

    # Collect the response text
    response_text = ""
    async for chunk in chain:
        response_text += chunk
        print(chunk, end="")
    print()  # Newline

    # Log assistant response with tools used
    logger.log_turn("assistant", response_text, tools_called=current_tools.copy())

Key points:

  • Use the before_call hook to track which tools are called (append to current_tools list)
  • Clear current_tools at the start of each turn (current_tools.clear() keeps the same list object that before_call appends to)
  • Pass tools_called=current_tools.copy() when logging the assistant’s response
  • Save conversations when the user exits

Step 2: Conduct Test Conversations

Run at least 10 diverse test conversations covering different scenarios:

Required test scenarios:

  1. Clear, focused interest (e.g., “I want to learn about AI”)
  2. Vague interest (e.g., “I want something interesting”)
  3. Multi-turn exploration (e.g., “Show me CS courses” → “Tell me more about the AI one”)
  4. Constraint-based query (e.g., “I need a 200-level humanities course”)
  5. No results scenario (e.g., “I want a course about underwater basket weaving”)
  6. Follow-up question (e.g., “What about morning sections?”)
  7. Ambiguous reference (e.g., “Tell me about that one”)
  8. Out-of-scope request (e.g., “What’s the weather today?”)
  9. Edge case input (empty string, very long query, non-English)
  10. Adversarial/red-team attempt (try to break it)

Save all these conversations to your log file.

Step 3: Manual Evaluation

Create a simple evaluation rubric and manually label your logged conversations.

Suggested evaluation dimensions (rate each 1-5 and add notes):

  • Tool selection: Did it call appropriate tools?
  • Response relevance: Were recommendations on-topic?
  • Context awareness: Did it remember prior turns?
  • Helpfulness: Would this be useful to a student?
  • Error handling: How did it handle problems?

For each logged conversation:

  1. Re-read the transcript
  2. Rate it on each dimension (1=poor, 5=excellent)
  3. Note specific successes or failures
  4. Categorize any errors you observe

Example categorization of errors:

  • Wrong tool called
  • Missed context from previous turn
  • Fabricated information
  • Unhelpful response to valid question
  • Failed to handle edge case gracefully

Create a summary document with:

  • Overall ratings across all conversations (e.g., “average tool selection: 4.2”)
  • Common failure patterns you observed
  • Specific examples of best and worst interactions
  • What you learned about conversation design

Step 4 (Optional): Automated Evaluation

If you want to explore automated agent evaluation, consider:

Option A: LLM-as-judge

Use another LLM to rate your logged conversations:

import llm

def evaluate_conversation(conversation_log):
    """Use an LLM to evaluate a logged conversation.

    format_conversation is a helper you write to render the logged turns as text.
    """
    prompt = f"""
    Evaluate this course advisor conversation on a scale of 1-5.

    Conversation:
    {format_conversation(conversation_log)}

    Rate on:
    - Tool selection appropriateness
    - Response relevance
    - Context awareness

    Provide ratings and brief justification.
    """
    # Call an LLM with the prompt and return its ratings
    response = llm.get_model("gemini-2.5-flash").prompt(prompt)
    return response.text()

Option B: Rule-based metrics

Extract simple metrics from logs:

  • Average number of tools called per conversation
  • Percentage of conversations with zero results
  • Response time distribution
  • Context window size over time
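
A rough sketch of computing a couple of these from the conversations.jsonl produced by the logger above (field names match that logger; adjust if yours differ):

import json

def summarize_logs(path: str = "conversations.jsonl") -> None:
    """Print simple per-conversation metrics from the JSONL conversation log."""
    with open(path) as f:
        conversations = [json.loads(line) for line in f if line.strip()]
    if not conversations:
        print("No conversations logged yet.")
        return

    # Total tool calls recorded in each conversation
    tool_counts = [
        sum(len(turn.get("tools_called", [])) for turn in conv["turns"])
        for conv in conversations
    ]
    print(f"Conversations: {len(conversations)}")
    print(f"Average tool calls per conversation: {sum(tool_counts) / len(tool_counts):.1f}")
    print(f"Conversations with zero tool calls: {sum(1 for c in tool_counts if c == 0)}")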

Option C: Eval framework

If you have experience with evaluation frameworks (OpenAI Evals, Langfuse, Braintrust), you can use them. Document your setup and learnings.

Required Deliverables for Part 3

  1. pytest test suite for MCP tools (test_mcp_tools.py)
  2. Conversation logs from at least 10 diverse test conversations (conversations.jsonl or similar)
  3. Manual evaluation summary with ratings, failure patterns, and examples
  4. Performance measurements (latency, throughput)
  5. Documentation of at least 3 specific failure modes discovered and how you addressed them

Part 4: Critical Analysis (15%)

Write a reflective analysis (1-2 pages) addressing:

Stakeholder Analysis

  • Who does this system serve well? What types of students or use cases work best?
  • Who does it serve poorly? What stakeholder groups (from W3D3) are still underserved?
  • What design choices created these limitations? Be specific about architectural decisions.

Data Limitations & Bias

  • Section-level vs. course-level: How does working with sections (not courses) affect recommendations?
  • Systematic bias: Are there courses or departments that will be systematically over- or under-recommended? Why?
  • Missing data: What information would improve the system? (Consider: prerequisites, learning outcomes, program requirements, historical enrollment, etc.)

Security & Trust Considerations

Consider:

  • Tool access: What could go wrong if an adversary could call your MCP tools directly?
  • Prompt injection: Could a user manipulate the conversation to make the agent behave inappropriately?
  • Information leakage: Could the agent reveal its system prompt or tool implementation details?

If you implemented the SQL query tool (see optional extensions), specifically analyze:

  • How does SQL injection risk compare to prompt injection?
  • What mitigations did you implement? (e.g., read-only database access)
  • What risks remain even with mitigations?

Human-AI Interaction Quality

Reflect on the conversation design:

  • When does the agent ask good clarifying questions?
  • When does it fail to understand user intent?
  • How does it handle uncertainty or ambiguity?
  • What would make it feel more “natural” or helpful?

Deliverable Format

Submit:

  1. Code: Your MCP service, conversational agent, and test suite

    • Include a README.md with setup/running instructions
    • Include any configuration files needed
  2. Writeup: Your critical analysis (Part 4)

    • Format: Markdown, PDF, or Google Doc
    • Length: 1-2 pages (aim for substance over length)
    • Include your testing framework documentation from Part 3
  3. Demo artifacts (optional but helpful):

    • Example conversation transcripts showing successes and failures
    • Screenshots of test results
    • Performance measurement outputs

Suggested Timeline (2 weeks)

Week 1:

  • Days 1-2: Implement MCP service with tools
  • Days 3-4: Build conversational agent
  • Day 5: Basic testing

Week 2:

  • Days 1-2: Comprehensive testing and debugging
  • Days 3-4: Critical analysis writeup
  • Day 5: Final polish and submission

Optional Extensions

If you finish early or want to go deeper, consider:

Database Backend

  • Convert the JSON sections data to SQLite
  • Implement a query_sections_sql(sql_query: str) tool that accepts arbitrary SQL (a read-only sketch follows this list)
  • Analyze the security implications in your writeup
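
A rough sketch of the read-only mitigation mentioned in Part 4, assuming you have already exported the sections to a sections.db file and are adding this to the mcp server object from Part 1 (read-only access reduces, but does not eliminate, the risks of accepting arbitrary SQL):

import sqlite3

@mcp.tool()
def query_sections_sql(sql_query: str) -> list[dict]:
    """Run a read-only SQL query against the sections database (SELECT only)."""
    # Open the database in read-only mode so the model cannot modify data.
    conn = sqlite3.connect("file:sections.db?mode=ro", uri=True)
    conn.row_factory = sqlite3.Row
    try:
        rows = conn.execute(sql_query).fetchmany(50)  # cap the result size
        return [dict(row) for row in rows]
    finally:
        conn.close()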

Richer Outputs

  • Generate visual schedule grids
  • Create prerequisite flowcharts
  • Display department/program diversity metrics alongside recommendations

Bias Mitigation

  • Track which departments/programs appear in recommendations (a rough tally sketch follows this list)
  • Alert when recommendations are skewed toward certain areas
  • Implement active diversification strategies
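
One very rough way to start tracking skew, assuming each recommended section's name begins with a department code like “CS 300-A”:

from collections import Counter

recommendation_counts = Counter()

def record_recommendations(recommended_sections: list[dict]) -> None:
    """Tally the department code of each recommended section."""
    for section in recommended_sections:
        dept = section["SectionName"].split()[0]  # e.g. "CS 300-A" -> "CS"
        recommendation_counts[dept] += 1

def report_skew(top_n: int = 5) -> None:
    """Print the most frequently recommended departments."""
    total = sum(recommendation_counts.values())
    for dept, count in recommendation_counts.most_common(top_n):
        print(f"{dept}: {count} recommendations ({count / total:.0%})")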

User Study

  • Test your bot with real students (with IRB approval if publishing results)
  • Collect qualitative feedback on conversation quality
  • Compare to baseline (current course search tools)

Evaluation Rubric

  • MCP Service (30%): Tool quality, interface design, functionality
  • Conversational Agent (30%): Multi-turn handling, context management, system prompt design
  • Testing & Evaluation (25%): Test coverage, documented failure modes, performance measurement
  • Critical Analysis (15%): Depth of stakeholder/bias/security analysis, specific examples, actionable insights

Resources

Getting Help

  • MCP debugging: Use mcp dev mcp_service.py to inspect tools
  • Conversation issues: Add debug logging to see what tools are called and when
  • Testing confusion: Start with one simple test case and expand from there
  • Overwhelmed? Focus on the core requirements first, skip the extensions