Testing Master Manual - Single Source of Truth

This document is the authoritative guide for all testing, architecture rules, and stability loop workflows.

Table of Contents

  1. Executive Summary & Objectives
  2. The Architecture
  3. The Trace Schema
  4. The Workflow (The "Loop")
  5. Coding Standards (The Strategy Pattern)
  6. Control API & Configuration
  7. Monitoring & Debugging

Executive Summary & Objectives

Project background: The tool sends an original contract clause to the LLM, which reviews it and returns a revised clause. The tool then applies the differences between the original and revised clauses as tracked changes via the MS Word API.

Goal: Stability Against Non-Deterministic LLM Output

The primary goal is Stability Testing - ensuring the system handles non-deterministic LLM output consistently and correctly.

Key Objectives:

  1. Validate document restoration - Ensure document is restored to clean state before each test
  2. Verify original test clause - Confirm same original clause is passed to LLM each time
  3. Achieve consistent results - Get 5 consecutive passes where diff-ed output matches LLM output
  4. Confirm anchor behavior - Verify correct anchor finding (no failures, no fallbacks)
  5. Record & Replay - Capture exact Word API behavior in traces for offline debugging

Concept: Record & Replay (Not Simulation)

Instead of traditional mocks, we use Record & Replay:

  1. Record: WordAdapter records exact API inputs/outputs to trace-log.json during live Word execution
  2. Replay: ReplayWordAdapter replays these exact API calls offline to reproduce bugs deterministically
  3. Stability: This handles non-deterministic LLM output by capturing the exact Word behavior that led to failures

Why Record & Replay?

  • LLM Output is Non-Deterministic: The same stability test may produce different LLM outputs, making live runs unreliable as a verification step
  • Word Behavior is Unpredictable: MS Word has hidden characters, edge cases, and behaviors that are impossible to simulate accurately
  • Exact Reproduction: Traces capture exact Word API behavior, allowing deterministic bug reproduction
  • No Guessing: Don't simulate Word behavior - replay exact reality from traces
  • Offline Debugging: No need for live Word instance - traces can be replayed in Jest tests
  • Fast Feedback: ReplayWordAdapter tests run quickly without Word overhead
  • Regression Prevention: Traces can be replayed after code changes to verify fixes don't break existing behavior

Critical Principle: You are not writing tests to simulate the issue. You are writing tests to host the recording of the issue. This is the only way to reliably fix logic bugs caused by Word's unpredictability.

Key Concept: Coding Agent Controls Loop

CRITICAL: When MS Word shows "waiting for trigger", it means:

  • ✅ The stability loop in Word has paused and is waiting
  • ✅ The coding agent (assistant) controls when the next iteration starts
  • ✅ The loop will NOT auto-continue - it waits for the coding agent to trigger it
  • ✅ The coding agent will automatically analyze logs, create Jest tests, fix code, verify tests pass, and trigger the next iteration

No manual intervention is needed - the coding agent handles everything automatically after each iteration completes.


The Architecture

Components

The stability testing system consists of three components:

  1. Word Add-in (taskpane.js): Runs stability loop, records traces on failure, pushes logs to dev server
  2. Dev Server (webpack.config.cjs): Receives logs, stores trace-log.json artifacts
  3. Coding Agent: Analyzes traces, uses ReplayWordAdapter to reproduce bugs offline, fixes code, triggers next iteration

The Loop: Trace Generation on Failure

Normal Flow

  1. Word Add-in runs stability test iteration
  2. WordAdapter executes Word API calls (with tracing enabled)
  3. Test completes → If pass, continue to next iteration
  4. If failure → Save trace to trace-log.json and pause loop

Trace Generation

When a test failure occurs:

  1. WordAdapter has been recording all API calls to its internal trace array
  2. On failure, the trace is serialized and saved to logs/trace-log-{testRunNumber}.json
  3. Trace format follows strict JSON schema (see below)
  4. Loop pauses and waits for coding agent to analyze and fix
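A minimal sketch of this save-on-failure flow as it might appear in taskpane.js. The dev-server URL is an assumption (adjust to your setup); the payload fields match the trace schema below.

// Illustrative sketch: push the recorded trace to the dev server on failure.
async function saveTraceOnFailure(wordAdapter, runContext) {
  const payload = {
    testRunNumber: runContext.testRunNumber,
    testId: runContext.testId,
    originalText: runContext.originalText,
    expectedText: runContext.expectedText,
    finalText: runContext.finalText,
    trace: wordAdapter.getTrace(),          // exact API calls recorded during the run
    timestamp: new Date().toISOString()
  };
  // The dev server stores this as logs/trace-log-{testRunNumber}.json
  await fetch('http://localhost:3000/api/trace-log', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload)
  });
}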

ReplayWordAdapter Modes

Mode A: Replay (Stability Debugging)

When loadTrace() is called:

  • All API calls are compared against trace entries
  • Method name must match exactly
  • Arguments must match exactly (Range objects compared by start/end/text)
  • Returns result from trace (not computed)
  • Throws TraceDeviationError on any mismatch
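For illustration, a sketch of how the per-call replay check might look inside ReplayWordAdapter (the real internals may differ; _serializeArgs is a hypothetical helper):

// Hypothetical sketch: compare each incoming call against the next trace entry.
_replayCall(methodName, args) {
  const entry = this.trace[this.traceIndex];
  if (!entry || entry.method !== methodName) {
    throw new TraceDeviationError(
      `Trace deviation at index ${this.traceIndex}: expected ` +
      `${entry ? entry.method : 'end of trace'}, got ${methodName}`
    );
  }
  // Range arguments are compared by their serialized {start, end, text} form
  if (JSON.stringify(this._serializeArgs(args)) !== JSON.stringify(entry.args)) {
    throw new TraceDeviationError(
      `Trace deviation at index ${this.traceIndex}: ${methodName}() called with unexpected arguments`
    );
  }
  this.traceIndex += 1;
  return entry.result; // always from the trace, never computed
}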

When loadTrace() is never called:

  • Uses string manipulation to simulate Word API behavior
  • Preserves existing unit tests that don't use traces
  • ⚠️ WARNING: Simulation mode is unreliable - always prefer trace replay for Stability Loop failures
  • Useful for general integration testing without specific traces (but trace replay is preferred)
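For completeness, a usage sketch of Simulation Mode (no trace loaded). Trace replay remains the required mode for Stability Loop failures:

import ReplayWordAdapter from './src/lib/replay-word-adapter.js';

// No loadTrace() call, so the adapter simulates Word via string manipulation.
const adapter = new ReplayWordAdapter('The quick brown fox.');
const matches = await adapter.searchWithinRange(
  { start: 0, end: 20, text: 'The quick brown fox.' },
  'quick',
  { ignorePunct: false, ignoreSpace: false }
);
// matches are computed by string manipulation, not replayed from a trace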

File Structure

word-ai-redliner/
├── src/
│   ├── lib/
│   │   ├── word-adapter.js           # Records traces during live execution
│   │   └── replay-word-adapter.js    # Replays traces offline
│   └── taskpane/
│       └── taskpane.js                # Stability loop (enables tracing, saves traces on failure)
├── tests/
│   └── tests_index.md                 # Dynamic inventory of all test files (must be updated when adding tests)
├── logs/
│   ├── trace-log-1.json               # Trace from test run 1 (if failure)
│   ├── trace-log-2.json               # Trace from test run 2 (if failure)
│   ├── e2e-test-logs.json             # Human-readable test logs
│   └── fix-logs.json                  # Code fix logs
└── webpack.config.cjs                  # Dev server (stores trace files)

The Trace Schema

Trace File Format

Trace files saved to logs/trace-log-{testRunNumber}.json contain complete test context:

{
  "testRunNumber": 5,
  "testId": "test-doc-1234567890-abc",
  "originalText": "Original document text...",
  "expectedText": "Expected LLM output text...",
  "finalText": "Final document text after operations...",
  "trace": [
    {
      "timestamp": 1703123456789,
      "method": "searchWithinRange",
      "args": [
        {
          "start": 0,
          "end": 100,
          "text": "Sample document text..."
        },
        "search query",
        {
          "ignorePunct": false,
          "ignoreSpace": false
        }
      ],
      "result": [
        {
          "start": 10,
          "end": 25,
          "text": "search query"
        }
      ]
    },
    {
      "timestamp": 1703123456790,
      "method": "insertTextAtRange",
      "args": [
        {
          "start": 10,
          "end": 25,
          "text": "search query"
        },
        "replacement text",
        "replace"
      ],
      "result": null
    },
    {
      "timestamp": 1703123456791,
      "method": "getWholeDocumentText",
      "args": [],
      "result": "Sample document with replacement text..."
    },
    {
      "timestamp": 1703123456792,
      "method": "applyOperation",
      "args": [
        {
          "start": 0,
          "end": 100,
          "text": "Sample document text..."
        },
        {
          "type": "REPLACE",
          "origText": "old text",
          "newText": "new text",
          "beforeContext": "context before",
          "afterContext": "context after"
        }
      ],
      "result": {
        "success": true,
        "strategy": "PrimaryAnchor"
      }
    }
  ],
  "timestamp": "2025-11-20T07:19:39.047Z"
}

Trace File Schema

Top-Level Fields:

  • testRunNumber: Test run number (used in filename: trace-log-{testRunNumber}.json)
  • testId: Unique test identifier
  • originalText: Original document text (for test setup - use with new ReplayWordAdapter(originalText))
  • expectedText: Expected LLM output (for verification)
  • finalText: Final document text after operations (for comparison)
  • trace: Array of trace entries with exact API calls (use with adapter.loadTrace(trace))
  • timestamp: When trace was saved (ISO format)

Trace Entry Schema (within trace array)

Each entry in the trace array records exact API inputs and outputs:

  • timestamp: Unix timestamp in milliseconds
  • method: Method name (e.g., "searchWithinRange", "insertTextAtRange")
  • args: Array of serialized arguments
    • Range objects are serialized as {start, end, text}
    • Primitives (strings, numbers, booleans) are preserved as-is
    • Objects are JSON-serialized
  • result: Serialized result
    • Range objects are serialized as {start, end, text}
    • Arrays of ranges are arrays of serialized ranges
    • Errors are recorded as {error: "message", code: "errorCode"}
    • null or undefined for void methods
    • For applyOperation method: Result includes {success: boolean, strategy: string} when anchor is found
      • success: true if operation succeeded, false if anchor not found
      • strategy: Name of the strategy that found the anchor (e.g., "PrimaryAnchor", "FuzzyScanNormalized")
      • Rationale: Logging which strategy succeeded enables observability and prevents "silent degradation"
      • Why critical: If _searchByFuzzyScan starts matching 90% of anchors, it indicates the Primary strategy is broken, even if tests are passing

Trace Recording in WordAdapter

Enabling Tracing

import WordAdapter from './src/lib/word-adapter.js';
import fs from 'fs';

const wordAdapter = new WordAdapter();

// Enable tracing before test run
wordAdapter.enableTracing();

// Run test operations...
await wordAdapter.searchWithinRange(range, query, options);
await wordAdapter.insertTextAtRange(range, text, 'replace');

// On failure, save trace
const trace = wordAdapter.getTrace();
fs.writeFileSync('logs/trace-log-5.json', JSON.stringify(trace, null, 2));

Automatic Recording

All public methods in WordAdapter automatically record traces when tracingEnabled is true:

  • searchWithinRange()
  • insertTextAtRange()
  • deleteRange()
  • getRangeText()
  • getWholeDocumentText()
  • applyOperation()
  • And all other public methods
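A sketch of how this automatic recording might be wired inside one public method (illustrative; _doSearchWithinRange and _serializeRange are hypothetical internals):

// Hypothetical sketch: record args and result whenever tracing is enabled.
async searchWithinRange(range, query, options) {
  const result = await this._doSearchWithinRange(range, query, options);
  if (this.tracingEnabled) {
    this.trace.push({
      timestamp: Date.now(),
      method: 'searchWithinRange',
      args: [this._serializeRange(range), query, options],
      result: result.map(r => this._serializeRange(r))
    });
  }
  return result;
}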

Trace Deviation Errors

When ReplayWordAdapter detects a mismatch:

TraceDeviationError: Trace deviation at index 5: searchWithinRange() called with unexpected arguments
  methodName: 'searchWithinRange'
  expected: { start: 0, end: 100, text: '...' }
  actual: { start: 0, end: 150, text: '...' }
  traceIndex: 5

This indicates:

  • Expected: What was recorded in the trace
  • Actual: What the code is trying to call now
  • Root cause: Code behavior changed, or trace is from different scenario

The Workflow (The "Loop")

Step 1: Failure in Word (Trace Generation)

When a stability test fails:

  1. WordAdapter has been recording all API calls
  2. Trace is saved to logs/trace-log-{testRunNumber}.json
  3. Loop pauses and waits for coding agent

Step 2: Pause & Agent Takeover

The loop MUST pause after each iteration and will NOT auto-continue on its own.

The loop:

  • Calls /api/e2e-loop/pause after each iteration
  • Polls /api/e2e-loop/status every 2 seconds
  • Waits indefinitely until coding agent triggers next iteration
  • Will NOT auto-continue if server is unreachable
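A minimal sketch of this pause-and-poll behavior, assuming the dev server on localhost:3000:

// After each iteration: pause, then poll until the coding agent triggers.
async function waitForTrigger() {
  await fetch('http://localhost:3000/api/e2e-loop/pause', { method: 'POST' });
  for (;;) {
    try {
      const res = await fetch('http://localhost:3000/api/e2e-loop/status');
      const { canProceed } = await res.json();
      if (canProceed) return; // coding agent called /api/e2e-loop/trigger
    } catch (e) {
      // Server unreachable: keep waiting, never auto-continue
    }
    await new Promise(resolve => setTimeout(resolve, 2000)); // poll every 2 seconds
  }
}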

Step 3: Offline Reproduction (Loading Traces into ReplayWordAdapter)

The coding agent uses ReplayWordAdapter to reproduce the bug offline:

import ReplayWordAdapter from './src/lib/replay-word-adapter.js';
import { applyAmendment } from './src/lib/diff-orchestrator.js';
import fs from 'fs';

// Load trace (includes originalText, expectedText, and trace array)
const traceData = JSON.parse(fs.readFileSync('logs/trace-log-5.json', 'utf8'));

// Create ReplayWordAdapter with initial document state from trace
const adapter = new ReplayWordAdapter(traceData.originalText);

// Load trace array for Replay Mode
// CRITICAL: This replays EXACT Word behavior, not simulated behavior
adapter.loadTrace(traceData.trace);

// Now replay the exact operations that caused the failure
const bodyRange = {
  start: 0,
  end: traceData.originalText.length,
  getText: () => adapter.document
};

// Replay exact API calls from trace
await applyAmendment(
  bodyRange,
  traceData.originalText,
  traceData.expectedText,
  null,
  adapter
);

// Verify final state matches expected
const finalText = await adapter.getWholeDocumentText();
expect(finalText).toBe(traceData.expectedText);

Key Points:

  • ✅ Load trace file which includes originalText, expectedText, and trace array
  • ✅ Use traceData.originalText for initial document state
  • ✅ Use traceData.trace (not traceData itself) for loadTrace()
  • ✅ Replay exact operations - don't simulate or guess Word behavior
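The same replay can be wrapped in a Jest test so the failure becomes a permanent regression guard (file path is illustrative; remember to register new test files in tests/tests_index.md):

// tests/replay/trace-log-5.test.js (illustrative path)
import fs from 'fs';
import ReplayWordAdapter from '../../src/lib/replay-word-adapter.js';
import { applyAmendment } from '../../src/lib/diff-orchestrator.js';

test('trace-log-5: replayed operations produce expected output', async () => {
  const traceData = JSON.parse(fs.readFileSync('logs/trace-log-5.json', 'utf8'));
  const adapter = new ReplayWordAdapter(traceData.originalText);
  adapter.loadTrace(traceData.trace);

  const bodyRange = {
    start: 0,
    end: traceData.originalText.length,
    getText: () => adapter.document
  };

  await applyAmendment(bodyRange, traceData.originalText, traceData.expectedText, null, adapter);
  expect(await adapter.getWholeDocumentText()).toBe(traceData.expectedText);
});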

Step 4: The Fix (Reference Coding Standards Below)

CRITICAL: All fixes to anchor finding logic MUST follow the Strategy Pattern. Ad-hoc patching is forbidden.

4.1: Identify the Pattern

Analyze the trace to understand why the anchor search failed:

  • Punctuation issues? (e.g., Word ignores punctuation in search)
  • Spacing issues? (e.g., missing spaces, malformed text)
  • Missing context? (e.g., operation has no beforeContext/afterContext)
  • New Word behavior? (e.g., Word splits words across table cells)
  • Malformed text? (e.g., camelCase words stuck together from previous operations)

4.2: Select or Create Strategy

Option A: Tune Existing Strategy

If an existing Strategy (e.g., _searchByPrimaryAnchor) should have caught it but didn't:

  1. Identify the gap: What specific condition did the strategy miss?
  2. Enhance the strategy: Add logic to handle the new condition within that strategy method
  3. Verify: Ensure the trace replay passes with the enhanced strategy

Example: If _searchByPrimaryAnchor fails because of malformed text, add normalization logic to that strategy (not a new fallback).

Option B: Create New Strategy

If the failure represents a completely new class of issue:

  1. Create new private method: _searchBy[Name]() following the naming convention
  2. Implement specific logic: Handle only this new class of issue
  3. Fail fast: Return null immediately if strategy doesn't apply
  4. Return strategy name: Return { range: result, strategy: 'StrategyName' } or null

Example: If Word splits words across table cells, create _searchAcrossTableCells() strategy.

4.3: Register Strategy

Add the new method to the strategies pipeline array in word-adapter.js:

// In _findAnchorInContext method
const strategies = [
  this._searchByPrimaryAnchor,
  this._searchByContextCombination,
  this._searchByOrigTextOnly,
  this._searchByFuzzyScan,
  this._searchByInsertionPoints,
  this._searchAcrossTableCells  // ← New strategy added here
];

Placement: Add strategies in order of preference (most specific → least specific).

4.4: Verify Fix

  1. Replay trace: Create a ReplayWordAdapter with traceData.originalText and load the trace array with adapter.loadTrace(traceData.trace)
  2. Run Jest test: Ensure trace replay passes
  3. Check strategy name: Verify the correct strategy is being used (check trace logs)
  4. Run all tests: npm test to ensure no regressions

Success Criteria:

  • Trace replay passes
  • Strategy name is logged correctly
  • No regressions in existing tests

No Second Round of Live Testing Needed: Once trace replay succeeds and Jest tests pass, the fix is considered verified. The next stability iteration will naturally test the fix in the live Word environment, but we don't need to wait for a second successful run because LLM output variance makes it unreliable as a verification step.

Step 5: Verification & Trigger Next

After fixes are verified:

  1. Coding agent triggers next iteration via /api/e2e-loop/trigger
  2. Loop continues with fixed code
  3. Process repeats until 5 consecutive passes
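Triggering the next iteration is a single POST (dev-server URL assumed; src/e2e/trigger-next-iteration.js wraps this call):

// Coding agent: resume the paused loop after fixes are verified.
await fetch('http://localhost:3000/api/e2e-loop/trigger', { method: 'POST' });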

Coding Agent Responsibilities

The coding agent (assistant) is fully responsible for the following tasks after each stability test iteration:

  1. Analyze Logs: Read logs/e2e-test-logs.json to identify failures and root causes
  2. Load Trace: If failure occurred, load logs/trace-log-{testRunNumber}.json to get exact Word API behavior
  3. Reproduce with ReplayWordAdapter (Trace Replay):
    • Create ReplayWordAdapter instance with initial document state
    • Call loadTrace(traceData.trace) to enable Replay Mode
    • CRITICAL: Do NOT simulate or guess Word behavior - always replay from traces
    • Reproduce the bug offline by replaying exact API calls from trace
    • Write Jest test that wraps the trace file
  4. Fix Code: The coding agent MUST fix the code (WordAdapter and related logic) until trace replay succeeds
  5. Verify Fix: Replay trace again to ensure fix works, then run npm test to ensure no regressions
  6. Automatically Trigger Next Iteration: After verification, the coding agent MUST automatically call /api/e2e-loop/trigger

Critical Requirements:

  • The coding agent is responsible for creating all Jest tests - no manual test creation needed
  • The coding agent MUST use trace replay, not simulation - always load traces with loadTrace() and replay exact Word behavior
  • The coding agent MUST NOT simulate or guess Word behavior - always replay from traces captured during failures
  • The coding agent is responsible for fixing all code - fixes are applied automatically
  • Fixes cannot involve hardcoding of specific words or tokens - fixes must be able to generalize to other clauses
  • The coding agent is responsible for verifying Jest tests pass - verification happens automatically
  • The user does NOT need to manually trigger the next iteration - the coding agent handles everything automatically

Note: All these steps are performed automatically by the coding agent. The user does not need to manually create tests, fix code, verify tests, or trigger iterations - the coding agent handles the entire workflow automatically.


Coding Standards (The Strategy Pattern)

Anti-Pattern (Forbidden)

DO NOT use ad-hoc patching or numbered fallbacks:

Forbidden Patterns:

  • Numbered fallbacks (e.g., "Fallback 2a", "Fallback 3h")
  • Nested if/else blocks for specific edge cases inside main methods
  • Ad-hoc boolean flags (if (specialCase) { ... }) scattered throughout code
  • Inline special-case handling within _findAnchorInContext or similar methods

Why Forbidden?

  • Creates "spaghetti code" that becomes unmaintainable
  • Makes debugging nearly impossible (non-linear logic flow)
  • Leads to silent degradation (fixes mask root causes)
  • Violates single responsibility principle

Requirement: Standalone Strategy Methods

Every search logic modification MUST be implemented as a Standalone Strategy Method.

Required Pattern:

  • Each search strategy is a private method with a single, clear responsibility
  • Strategies are named by intent/logic, not by history or order
  • Strategies return { range: Range | null, strategy: string } or null
  • Strategies are registered in an ordered pipeline array

Naming Convention

Strategies must be named by intent/logic, not by history:

BAD:

tryFallback4()
tryFallback2a()
_searchByFallback3h()

GOOD:

_searchByPrimaryAnchor()      // Searches full anchor (before+orig+after)
_searchByContextCombination()  // Searches context combinations
_searchByOrigTextOnly()        // Searches origText alone
_searchByFuzzyScan()           // Normalized text, wildcards, partial matches
_searchByInsertionPoints()     // INSERT-specific logic
_searchAcrossTableCells()      // Handles Word splitting words across table cells

Rationale: Strategy names should describe what they do, not when they were added or where they appear in the fallback chain.

Strategy Pipeline

All strategies are registered in a single, ordered array in _findAnchorInContext:

const strategies = [
  this._searchByPrimaryAnchor,        // Most specific - try first
  this._searchByContextCombination,   // Context combinations
  this._searchByOrigTextOnly,         // OrigText alone
  this._searchByFuzzyScan,            // Fuzzy matching
  this._searchByInsertionPoints       // INSERT-specific
];

for (const strategy of strategies) {
  const result = await strategy.call(this, selectionRange, op, context);
  if (result && result.range) {
    return result; // Return both range and strategy name
  }
}

Benefits:

  • Linear, predictable execution flow
  • Easy to add new strategies (just add to array)
  • Easy to reorder strategies (change array order)
  • Each strategy is isolated and testable
  • Strategy name is logged and recorded in traces for observability

Observability Requirement

CRITICAL: Every strategy MUST return its name so we can track which strategy succeeded:

// ✅ GOOD: Returns strategy name
return { range: result, strategy: 'PrimaryAnchor' };

// ❌ BAD: Returns only range
return result;

Why? If _searchByFuzzyScan starts matching 90% of anchors, it indicates our Primary strategy is broken, even if tests are passing. This prevents "silent degradation."

Strategy Template

Use this template when creating a new strategy:

/**
 * Strategy: [Descriptive Name]
 * Context: Handles cases where [specific Word behavior or edge case]
 * 
 * @private
 * @param {Object} selectionRange - Word Range to search within
 * @param {Object} op - Operation object
 * @param {Object} context - Word context
 * @returns {Promise<Object|null>} { range, strategy } object if the anchor is found, otherwise null
 */
async _searchBy[Name](selectionRange, op, context) {
  const logger = (await import('./logger.js')).default;
  const MAX_SEARCH_LENGTH = 250;
  
  // 1. Check preconditions (fail fast if strategy doesn't apply)
  if (!this._isApplicable(op)) {
    return null; // Strategy doesn't apply - return immediately
  }
  
  // 2. Execute specific search logic
  try {
    // Build search query
    const query = this._buildQuery(op);
    
    // Truncate if needed (use shared helper)
    const truncatedQuery = this._truncateAnchor(query, MAX_SEARCH_LENGTH);
    
    // Search with fallback (use shared helper)
    const result = await this._searchWithFallback(selectionRange, truncatedQuery, context);
    
    if (result) {
      logger.debug(`[Strategy: [Name]] Found anchor using [description]`);
      return { range: result, strategy: '[Name]' }; // Return strategy name!
    }
  } catch (error) {
    logger.debug(`[Strategy: [Name]] Error: ${error.message}`);
  }
  
  // 3. Return null if not found
  return null;
}

Key Principles

  1. Fail Fast: Return null immediately if strategy doesn't apply
  2. Use Shared Helpers: Always use _truncateAnchor() and _searchWithFallback()
  3. Return Strategy Name: Always return { range, strategy } format (not just range)
  4. Log Strategy Name: Include [Strategy: Name] prefix in debug logs
  5. Single Responsibility: Each strategy handles ONE specific class of issue

Control API & Configuration

Endpoints

GET /api/e2e-loop/status

  • Returns: { canProceed: boolean, waitingForTrigger: boolean, lastIteration: number }
  • Used by: Loop polling for trigger

POST /api/e2e-loop/trigger

  • Action: Sets canProceed = true, waitingForTrigger = false
  • Used by: Coding agent to trigger next iteration

POST /api/e2e-loop/pause

  • Action: Sets canProceed = false, waitingForTrigger = true
  • Used by: Loop to pause after each iteration

POST /api/trace-log

  • Action: Receives trace logs from Word add-in
  • Stores: Trace files to logs/trace-log-{testRunNumber}.json

POST /api/fix-log

  • Action: Receives fix logs from coding agent
  • Stores: Fix logs to logs/fix-logs.json

Fix Log Schema:

Each fix entry must follow this JSON structure:

{
  "timestamp": "ISO-8601 String",
  "level": "INFO",
  "message": "Code Fix Applied",
  "metadata": {
    "type": "fix-applied",
    "file": "src/lib/word-adapter.js",
    "issue": "Description of the bug",
    "fix": "Description of the strategy added",
    "testRunNumber": 123,
    "strategyName": "_searchByTableCells"
  }
}

Required Fields:

  • timestamp: ISO-8601 formatted timestamp string
  • level: Log level (typically "INFO")
  • message: Fixed string, always "Code Fix Applied"
  • metadata.type: Must be "fix-applied"
  • metadata.file: File path that was modified
  • metadata.issue: Description of the bug being fixed
  • metadata.fix: Description of the fix/strategy applied
  • metadata.testRunNumber: Test run number when fix was applied (optional but recommended)
  • metadata.strategyName: Name of strategy method added (optional but recommended)
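A minimal sketch of posting a fix log entry (dev-server URL assumed; src/e2e/log-fix.js provides a ready-made wrapper, and the metadata values here are illustrative):

// Illustrative fix-log POST matching the schema above.
await fetch('http://localhost:3000/api/fix-log', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    timestamp: new Date().toISOString(),
    level: 'INFO',
    message: 'Code Fix Applied',
    metadata: {
      type: 'fix-applied',
      file: 'src/lib/word-adapter.js',
      issue: 'Anchor search failed when Word split words across table cells',
      fix: 'Added _searchAcrossTableCells strategy',
      testRunNumber: 5,
      strategyName: '_searchAcrossTableCells'
    }
  })
});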

POST /log

  • Action: Receives general logs from Word add-in
  • Stores: Logs to logs/e2e-test-logs.json

GET /logs

  • Returns: All logs from logs/e2e-test-logs.json

POST /logs/clear

  • Action: Clears logs (one-time at start)

Loop Flow

  1. Loop runs test iteration
  2. Loop pauses (calls /api/e2e-loop/pause) - CRITICAL: Loop MUST pause after each iteration
  3. Loop polls /api/e2e-loop/status every 2 seconds (waits until triggered)
  4. Coding agent automatically:
    a. Analyzes logs from logs/e2e-test-logs.json
    b. Creates Jest tests (ReplayWordAdapter with trace replay) to reproduce issues found in Stability Loop logs
    c. Fixes code until Jest tests pass (offline verification)
    d. Automatically calls /api/e2e-loop/trigger after Jest tests pass
  5. Loop detects canProceed = true
  6. Loop continues to next iteration

IMPORTANT: The loop will NOT auto-continue on its own - it waits for the coding agent, and it keeps waiting even if the server is unreachable. The coding agent will AUTOMATICALLY trigger the next iteration after:
- Analyzing logs
- Creating Jest tests
- Fixing code
- Verifying Jest tests pass

No manual intervention is needed - the coding agent handles the entire workflow automatically.

Monitoring & Debugging

Key Metrics to Monitor

  1. Test Run Count: Should increment sequentially
  2. Consecutive Passes: Should increment on success, reset on failure
  3. Validation Pass Rate: Percentage of successful validations
  4. Anchor Behavior: Number of anchor failures per run
  5. Strategy Distribution: Which strategies are succeeding (from trace logs)
  6. Log Hash Patterns: Detect identical runs
  7. Error Patterns: Track common error types
  8. Fix History: Track what fixes were attempted and their outcomes

Debugging Tools

  1. Log Analysis: src/e2e/analyze-logs.js
  2. Test Runner: src/e2e/test-runner.js
  3. Show Clean Copy: src/e2e/show-clean-copy.js
  4. Trigger Control: src/e2e/trigger-next-iteration.js
  5. Run All Tests: src/e2e/run-all-tests.js
  6. Fix Logging: src/e2e/log-fix.js (for coding agent to log fixes)

Success Criteria Summary

Test passes when:

  • Diff-ed output (clean copy) exactly matches LLM output
  • No anchor failures detected
  • Primary strategies preferred; secondary strategies (fuzzy/scan) logged but accepted
  • Document restored cleanly (no tracked changes)
  • Original test clause validated correctly

Overall success when:

  • 5 consecutive test runs pass
  • All 5 runs have correct anchor behavior
  • No stop conditions triggered

Stop Conditions

The loop stops when:

  • A. Max Runs: testRunNumber > 500
  • B. Word Errors: Invalid document state errors detected
  • C. Timeout: No logs for 5 minutes (after retry)
  • D. Oscillation: Same edit → same LLM output → same error pattern repeats
  • E. Identical Hashes: Same log hash appears 6 consecutive times
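For condition E, a minimal sketch of how a log hash might be computed, assuming Node's built-in crypto (the actual detection logic may differ):

import crypto from 'crypto';

// Hash a run's log messages; six identical consecutive hashes trigger stop condition E.
function logHash(logEntries) {
  const normalized = JSON.stringify(logEntries.map(e => e.message));
  return crypto.createHash('sha256').update(normalized).digest('hex');
}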

Summary

This manual consolidates all testing documentation into a single source of truth. Key principles:

  1. Record & Replay: Always replay exact Word behavior from traces, never simulate
  2. Strategy Pattern: All fixes must use standalone strategy methods, no ad-hoc patching
  3. Coding Agent Control: The agent automatically handles the entire workflow
  4. Observability: Strategy names are logged to detect silent degradation
  5. Trace-Based Debugging: All failures are debugged offline using recorded traces

For test file organization and Jest test details, see tests/tests_index.md.