- Platform Experiments - Configure the experiment in LangWatch, then trigger it from CI/CD with a single line
- Experiments via SDK - Define the entire experiment in code and run it in CI/CD
| Approach | Best For |
|---|---|
| Platform Experiments | Non-technical team members can modify experiments; configuration lives in LangWatch |
| Experiments via SDK | Version control your experiment config; full flexibility in code |
Option 1: Platform Experiments
Configure your experiment once in the LangWatch UI, then trigger it from CI/CD.
Setup
- Create your experiment in the Experiments via UI:
  - Add your dataset
  - Configure targets (prompts, models, or API endpoints)
  - Select evaluators
  - Run it once to verify it works
- Get your experiment slug from the URL, or click the CI/CD button in the experiment toolbar.
- Run it from CI/CD (see the sketch below):
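Both the Python and TypeScript SDKs can trigger a platform experiment; the snippet below is a minimal Python sketch. The `langwatch.experiments.run` helper and the slug value are placeholders for illustration, so check the SDK reference for the exact call; `print_summary()` is the documented summary method.

```python
# Minimal sketch, not the confirmed SDK API: the trigger helper name is an
# assumption. LANGWATCH_API_KEY is expected in the environment (set it from
# your CI secrets).
import langwatch

# Slug copied from the experiment URL or from the CI/CD button in the toolbar.
results = langwatch.experiments.run("my-experiment-slug")  # hypothetical helper

# Prints a CI-friendly summary and exits non-zero if any evaluation failed.
results.print_summary()
```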
GitHub Actions Example
Options
Option 2: Experiments via SDK
Define your entire experiment in code. This gives you full control and version control over your experiment configuration.
Basic Example
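A minimal Python sketch of an SDK-defined experiment (a TypeScript equivalent exists as well). The `langwatch.experiment.init` / `experiment.evaluate` names and the `exact_match` evaluator are illustrative assumptions; the Experiments via SDK guide documents the real API.

```python
# Illustrative sketch only -- the initializer and evaluate() call below are
# assumptions, not the confirmed LangWatch SDK surface.
import langwatch

def my_pipeline(question: str) -> str:
    """Stand-in for your actual LLM application code."""
    return "Paris" if "France" in question else "4"

dataset = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is 2 + 2?", "expected_output": "4"},
]

experiment = langwatch.experiment.init("ci-regression-suite")  # assumed initializer

for row in dataset:
    answer = my_pipeline(row["input"])
    experiment.evaluate(                      # assumed evaluator call
        "exact_match",                        # example evaluator name
        input=row["input"],
        output=answer,
        expected_output=row["expected_output"],
    )

# Structured CI summary; exits with code 1 if any evaluation failed.
experiment.print_summary()
```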
GitHub Actions Example
`scripts/run_evaluation.py` contains your full experiment code, such as the Basic Example above.
Comparing Multiple Configurations
SDK experiments shine when comparing different configurations:
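As a hedged illustration using the same assumed API as the Basic Example, you can run one dataset through several configurations and let each one report separately:

```python
# Sketch comparing two configurations -- SDK call names remain assumptions.
import langwatch

def run_pipeline(question: str, model: str, temperature: float) -> str:
    """Stand-in for your application, parameterised by the config under test."""
    return f"(answer from {model} at temperature {temperature})"

configs = {
    "baseline-gpt-4o": {"model": "gpt-4o", "temperature": 0.0},
    "candidate-gpt-4o-mini": {"model": "gpt-4o-mini", "temperature": 0.0},
}

dataset = [{"input": "What is 2 + 2?", "expected_output": "4"}]

for name, config in configs.items():
    experiment = langwatch.experiment.init(f"ci-comparison-{name}")  # assumed API
    for row in dataset:
        answer = run_pipeline(row["input"], **config)
        experiment.evaluate(
            "exact_match",
            input=row["input"],
            output=answer,
            expected_output=row["expected_output"],
        )
    # exit_on_failure=False so every configuration reports before CI decides.
    experiment.print_summary(exit_on_failure=False)
```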
Results Summary
Both approaches output a CI-friendly summary. The print_summary() method:
- Outputs results in a structured format
- Returns exit code 1 if any evaluations failed (unless exit_on_failure=False)
- Provides a link to view detailed results in LangWatch
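For example, assuming an `experiment` handle like the ones in the sketches above:

```python
# Default: fail the CI step when any evaluation fails.
experiment.print_summary()

# Report without failing the pipeline, e.g. for informational nightly runs.
experiment.print_summary(exit_on_failure=False)
```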
CI Platform Examples
GitLab CI
- Platform experiment: trigger the experiment configured in LangWatch from a pipeline job
- Via SDK: run your experiment script (e.g. `scripts/run_evaluation.py`) as a pipeline job
CircleCI
- Platform experiment: trigger the experiment from a CircleCI job step
- Via SDK: run your experiment script as a CircleCI job step
Error Handling
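A Python sketch of how a CI step might separate infrastructure errors from evaluation failures; the trigger helper is the same assumed placeholder as above, and a TypeScript version would follow the same pattern.

```python
import sys

import langwatch

try:
    # Hypothetical trigger helper, as in the platform example above.
    results = langwatch.experiments.run("my-experiment-slug")
except Exception as exc:
    # Missing API key, unknown slug, network problems, etc.
    print(f"Could not run the experiment: {exc}", file=sys.stderr)
    sys.exit(2)  # distinguish setup errors from evaluation failures

# Evaluation failures are reported by print_summary(), which exits with code 1.
results.print_summary()
```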
REST API (Platform Experiments)
For custom integrations, you can use the REST API directly.
Start a Run
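The exact endpoint and payload are not reproduced here; the sketch below only shows the general shape using Python's requests library, with an assumed path, auth header, and response field. Use the LangWatch API reference for the real contract.

```python
# The path, auth header name, and response fields below are assumptions.
import os

import requests

BASE_URL = os.environ.get("LANGWATCH_ENDPOINT", "https://app.langwatch.ai")
headers = {"X-Auth-Token": os.environ["LANGWATCH_API_KEY"]}  # header name may differ

response = requests.post(
    f"{BASE_URL}/api/experiments/my-experiment-slug/runs",  # assumed path
    headers=headers,
    timeout=30,
)
response.raise_for_status()
run_id = response.json()["run_id"]  # assumed response field
```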
Poll for Status
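Continuing the sketch above, polling could look like this (the path and status values are assumptions):

```python
import time

# Poll until the run reaches a terminal state.
while True:
    status_response = requests.get(
        f"{BASE_URL}/api/experiments/my-experiment-slug/runs/{run_id}",
        headers=headers,
        timeout=30,
    )
    status_response.raise_for_status()
    status = status_response.json()["status"]  # assumed field and values
    if status in ("completed", "failed"):
        break
    time.sleep(10)

print(f"Run {run_id} finished with status: {status}")
```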
Next Steps
- Experiments via UI: Create experiments in the platform UI
- Experiments via SDK: Full guide to SDK experiments
- Evaluators: Browse available evaluators
- Datasets: Manage your test datasets