
Testing AI-Integrated Products with Test Automation: Complexities and Opportunities


Artificial Intelligence is becoming a core part of modern applications, especially in client-facing UIs. From chatbots to recommendation systems, AI now powers user experiences that were unthinkable just a few years ago.

But as testers and engineers, we face a difficult question: how do we test something that doesn’t always give the same response?

The Challenge of Non-Deterministic AI

Traditional test automation works best in deterministic systems. For example:

  • If I click “Add to Cart” on an e-commerce site, I can assert that the cart count increases by one — always.

  • If I request /api/products, the schema and values are predictable.
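The deterministic case can be sketched with a pure model (a hypothetical cart, not the article's app): the same input always produces the same output, so an exact assertion is safe.

```typescript
// Hypothetical cart model: deterministic behavior allows exact assertions.
type Cart = { items: string[] };

function addToCart(cart: Cart, productId: string): Cart {
  // Adding an item always grows the count by exactly one.
  return { items: [...cart.items, productId] };
}

const before: Cart = { items: [] };
const after = addToCart(before, "sku-123");

// Deterministic: this holds on every single run.
console.log(after.items.length); // 1
```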

With AI-driven systems, outputs vary:

  • The same chatbot question might return different answers.

  • A recommendation engine may surface different items over time.

  • AI copilots in IDEs might generate multiple correct-but-different code snippets.

This variability makes rigid assertions fragile. If we assert on exact strings or specific items, tests fail even though the system is behaving correctly.

Where Assertions Still Work

Automation is not useless in AI contexts; it just shifts focus. We can still validate outputs meaningfully at several levels:

  1. E2E UI tests

    • Ensure the flow works: inputs trigger AI responses, responses render in the UI.

  2. API/schema validation

    • Check response structure, presence of required fields, and metadata like response time or confidence scores.

    • Example: "title", "highlights", "season" must always exist.

  3. Functional guardrails

    • Validate content is relevant, safe, and aligned with the product’s intent.

    • Example: a “budget travel” query should never suggest private jets.

Assertions here are generic and flexible, focusing on presence, formatting, and intent rather than exact wording.
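These structure-and-intent checks can be sketched as plain functions (the response shape and blocklist below are illustrative, not the article's exact schema):

```typescript
// Presence/type checks instead of exact-string matching.
type TripPlan = { destination?: unknown; highlights?: unknown; season?: unknown };

function checkShape(plan: TripPlan): string[] {
  const problems: string[] = [];
  if (typeof plan.destination !== "string") problems.push("destination missing or not a string");
  if (!Array.isArray(plan.highlights) || plan.highlights.length < 2 || plan.highlights.length > 4)
    problems.push("highlights must be an array of 2-4 items");
  if (typeof plan.season !== "string") problems.push("season missing or not a string");
  return problems;
}

// Functional guardrail: a budget-travel plan should never surface luxury-only options.
function violatesBudgetGuardrail(plan: { highlights: string[] }): boolean {
  const banned = ["private jet", "yacht charter"]; // illustrative blocklist
  return plan.highlights.some(h => banned.some(b => h.toLowerCase().includes(b)));
}

const sample = { destination: "Lisbon", highlights: ["Tram 28", "Belém pastries"], season: "Spring" };
console.log(checkShape(sample)); // []
console.log(violatesBudgetGuardrail(sample)); // false
```

Note that these assertions survive wording changes in the AI output: only a missing field, a wrong count, or a guardrail breach fails the test.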

Using AI to Test AI

A powerful emerging approach is to let AI help validate AI.

Imagine this workflow:

  1. The test framework triggers an AI request in UI/API.

  2. The product’s AI generates a response.

  3. The test sends both the request and response to another AI (validator model) with a prompt:
    “Does this response make sense for the given request?”

  4. The validator AI returns a pass/fail verdict with reasoning.

This “AI validating AI” approach simulates user judgment much better than brittle assertions. Importantly, the validator should be a different model (or configured differently) to avoid bias and rubber-stamping.
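The contract between the test and the validator model can be pinned down as a small verdict type plus defensive parsing (a sketch; the names are illustrative), since even a JSON-mode model can occasionally reply with something unparseable:

```typescript
// The verdict shape the validator model is asked to return.
type Verdict = { pass: boolean; reason?: string };

function parseVerdict(raw: string): Verdict {
  try {
    const v = JSON.parse(raw);
    if (typeof v.pass === "boolean") return v;
  } catch {
    // fall through to the failure verdict below
  }
  // Treat malformed validator output as a failure so it surfaces in CI
  // instead of silently passing.
  return { pass: false, reason: "validator returned malformed output" };
}

console.log(parseVerdict('{"pass": true}').pass); // true
console.log(parseVerdict("not json").pass);       // false
```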

A Sandbox Project Example

To explore this, I built a simple Trip Planner Sandbox:

  • Frontend: React + TypeScript + Vite form asking travel preferences.
  • Backend: Express + OpenAI SDK, prompting the AI to generate a trip plan:
app.post("/api/plan-trip", async (req, res) => {
  const { preferences } = req.body;

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "You are a helpful travel planner. Always respond in strict JSON with keys: destination (string), highlights (array of 2–4 strings), season (string: best season to visit)."
      },
      {
        role: "user",
        content: `Plan a trip based on these preferences:\n${JSON.stringify(preferences, null, 2)}`
      }
    ],
    response_format: { type: "json_object" }
  });

  const content = completion.choices[0]?.message?.content;
  if (!content) {
    // Guard against an empty completion instead of a non-null assertion.
    return res.status(502).json({ success: false, error: "Empty AI response" });
  }
  res.json({ success: true, data: JSON.parse(content) });
});

and return structured JSON:

{
  "destination": "Appalachian Mountains",
  "highlights": ["Hiking", "Camping"],
  "season": "Fall"
}
  • Tests: Playwright with Page Object Model and a custom AI Validator utility that checks whether the trip plan is plausible:
export async function validateTripPlan(
  prefs: Preferences,
  plan: TripResult
): Promise<{ pass: boolean; reason?: string }> {
  const systemPrompt = `
    You are a validator. Respond in JSON with:
    - pass (boolean)
    - reason (string if not valid)
  `;

  const userPrompt = `
    Request: ${JSON.stringify(prefs)}
    Response: ${JSON.stringify(plan)}

    Rules:
    1. Destination must be plausible.
    2. Highlights: 2–4 relevant activities.
    3. Season: best time to visit.
  `;

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // Or a different model for validation
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userPrompt }
    ],
    response_format: { type: "json_object" }
  });

  const content = completion.choices[0]?.message?.content;
  if (!content) {
    // Fail the validation rather than crash on a missing completion.
    return { pass: false, reason: "Validator returned an empty response" };
  }
  return JSON.parse(content);
}

Then, in a test case, we can use this validation utility to check the AI-generated response:

test("AI trip suggestions make sense", async ({ page }) => {
  const tripPage = new TripPlannerPage(page);
  await tripPage.goto();

  const prefs = {
    preference: "mountains",
    budget: "$1000",
    companions: "family",
    climate: "mild",
    duration: "1 week"
  };

  await tripPage.fillPreferences(prefs);
  await tripPage.clickOnSubmitButton();

  const result = await tripPage.getTripResult();
  const validation = await validateTripPlan(prefs, result);

  expect(validation.pass, validation.reason).toBeTruthy();
});

The validator enforces rules like:

  • Destination must be a real, plausible place.

  • Highlights must be 2–4 relevant activities.

  • Season must reflect the best time to visit.

This sandbox showed how we can integrate AI-assisted validation into a modern test automation workflow.

Pros of This Approach

  • Simulation of real user experience: Instead of checking raw text, we validate the meaning and relevance.

  • Relevancy testing: Helps catch off-topic, nonsensical, or unsafe AI outputs.

  • True E2E flow: Covers request, AI generation, and user-facing response validation.

Cons and Limitations

  • ⏱️ Execution time: Each test requires at least two AI calls (generation + validation).

  • ⚙️ Complexity: Needs a fully integrated product and supporting infrastructure for AI-based assertions.

  • 💸 Maintenance cost: Teams must ensure consistency, handle evolving AI behavior, and manage test flakiness.

  • 📉 Flakiness: Sometimes the AI Validator can be overly strict. Test data must not only be valid but also realistic. As models evolve, results may shift due to improved reasoning. For example, during one test run, what seemed like a valid input was rejected with the following message:

    • “Error: Validation failed. Reason: The budget might be insufficient for a family trip to Maui, especially during peak season without any accommodation or transport details provided.”
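One way to soften this flakiness (a sketch, not part of the sandbox) is to run the validator several times and take a majority vote rather than trusting a single verdict:

```typescript
// Aggregate several validator verdicts into one majority decision.
type Verdict = { pass: boolean; reason?: string };

function majorityVerdict(verdicts: Verdict[]): Verdict {
  const passes = verdicts.filter(v => v.pass).length;
  const pass = passes > verdicts.length / 2;
  // Keep the first failure reason so a rejected run is still debuggable.
  const reason = pass ? undefined : verdicts.find(v => !v.pass)?.reason;
  return { pass, reason };
}

const votes: Verdict[] = [
  { pass: true },
  { pass: false, reason: "budget seems too low" },
  { pass: true },
];
console.log(majorityVerdict(votes).pass); // true
```

The trade-off is straightforward: each extra vote is another AI call, so this multiplies the execution-time cost noted above.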

Conclusion

AI-assisted testing of AI systems isn’t a silver bullet, but it’s a powerful complement to traditional automation. While it introduces cost and complexity, it can provide valuable insights into user experience and AI response quality: things that deterministic tests alone cannot cover.

For many teams, the best use case is to integrate this approach into nightly or exploratory test runs, where slower but more meaningful validations are acceptable. Ultimately, if your product heavily relies on AI for user-facing functionality, this type of testing may be worth the investment.

🔗 Resources & Connect

If you’d like to explore the full Trip Planner Sandbox project, the code is available here:
👉 GitHub Repository

I’d also love to connect and discuss more about AI testing, automation, and software engineering.
👉 Connect with me on LinkedIn