Documenting my test plan requirements and the resulting implementation.

1. Before each test run, the testing script should check that the original test clause is still present after each document refresh and that the same original clause is being passed to the LLM. If either check fails, throw an error and stop the loop.

2. After additional tests are created and pass, all previous tests should be re-run and must pass (if not, debug until all tests pass). I recall there were unit tests and integration tests previously (28 original, 40+ after a few manual test runs). Add on to those, and number them so I know how many tests have been run.

3. Clear the existing logs. Going forward, each run in the logs should be numbered (e.g. test run 1, test run 2, etc.) with date and time stamps, and should contain the diff-ed output from that run. Output the diff-ed output in this chat window too.

4. Add another final success condition: in addition to the 5 successful runs (where the diff-ed output matches the LLM output), the logs must confirm correct anchor behaviour.

5. Conditions for stopping the loop (hitting any condition should stop the loop):

A. The number of run loops (the run counter from point 3 above) hits 500.

B. Any error indicating the Word document is no longer in a valid editable state, e.g. a rich API call failed, a property was not loaded, track changes mode not supported, failure to insert or delete content controls, failure to replace tokens or apply formatting, or the selection collapsed unexpectedly.

C. No new logs from the MS Word machine detected within 5 minutes (timeout) after the dev machine triggers a new LLM run, with 1 retry allowed after 2 minutes (see the sketch after this list).

D. The same edit -> same LLM output -> same error repeatedly, or oscillation between two or more diff patterns (e.g. A -> B -> A -> B cycles).

E. Identical log hashes for 6 consecutive runs, i.e. the system produces the exact same log hash for 6 test runs in a row.
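
Condition C is the trickiest of these to enforce, so here is a minimal sketch of the dev-machine side. The helpers `getLatestLogTimestamp()` and `retriggerLlmRun()` are hypothetical placeholders for however the dev machine polls the Word machine's logs and re-triggers a run:

```js
// Hedged sketch of stop condition C: 5-minute timeout on fresh Word-machine
// logs, with a single retry of the LLM run after 2 minutes of silence.
const RETRY_AFTER_MS = 2 * 60 * 1000;
const TIMEOUT_MS = 5 * 60 * 1000;

async function waitForWordLogs(runStartedAt, getLatestLogTimestamp, retriggerLlmRun) {
  let retried = false;
  for (;;) {
    const latest = await getLatestLogTimestamp(); // epoch ms of newest log line
    if (latest > runStartedAt) return true;       // new logs arrived; keep going

    const elapsed = Date.now() - runStartedAt;
    if (!retried && elapsed >= RETRY_AFTER_MS) {
      await retriggerLlmRun();                    // the single allowed retry
      retried = true;
    }
    if (elapsed >= TIMEOUT_MS) return false;      // condition C: stop the loop

    await new Promise((r) => setTimeout(r, 5000)); // poll every 5 seconds
  }
}
```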


Implemented features

1. Validation checks

  • Validates original test clause after document refresh
  • Validates original test clause before LLM call
  • Stops loop with error if validation fails
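
A minimal sketch of the Word-side check, assuming the Office.js Word API running inside the add-in; `ORIGINAL_CLAUSE` is a placeholder for the actual test clause, and a real check may need to normalize whitespace before comparing:

```js
// Hedged sketch: verify the original test clause is still present after a
// document refresh, and that the exact same clause is about to go to the LLM.
const ORIGINAL_CLAUSE = "..."; // placeholder for the clause under test

async function assertOriginalClausePresent() {
  await Word.run(async (context) => {
    const body = context.document.body;
    body.load("text");
    await context.sync();
    if (!body.text.includes(ORIGINAL_CLAUSE)) {
      // Stop the loop: the document no longer contains the original clause.
      throw new Error("Validation failed: original clause missing after refresh");
    }
  });
}

function assertClauseSentToLlm(clauseForLlm) {
  if (clauseForLlm !== ORIGINAL_CLAUSE) {
    throw new Error("Validation failed: clause sent to LLM differs from original");
  }
}
```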

2. Test re-running logic

  • Created src/e2e/run-all-tests.js to run all unit and integration tests
  • Updated src/e2e/run-analysis.js to automatically run all tests after generating new test cases
  • Tests are numbered and tracked
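
A sketch of the shape src/e2e/run-all-tests.js could take; the test directory layout and the plain `node <file>` invocation are assumptions, not the project's actual runner:

```js
// Hedged sketch of src/e2e/run-all-tests.js: run every unit and integration
// test in order and number them. TEST_DIRS and the node invocation are
// assumptions about the project layout.
const { execFileSync } = require("node:child_process");
const { readdirSync } = require("node:fs");
const path = require("node:path");

const TEST_DIRS = ["src/unit", "src/integration"]; // assumed locations

let testNumber = 0;
for (const dir of TEST_DIRS) {
  for (const file of readdirSync(dir).filter((f) => f.endsWith(".test.js"))) {
    testNumber += 1;
    console.log(`TEST ${testNumber}: ${path.join(dir, file)}`);
    // A failing test exits non-zero, which throws here and halts the run.
    execFileSync("node", [path.join(dir, file)], { stdio: "inherit" });
  }
}
console.log(`All ${testNumber} tests passed`);
```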

3. Improved logging

  • Logs are cleared at start
  • Each run is numbered: "TEST RUN 1", "TEST RUN 2", etc.
  • Timestamps included (ISO format)
  • Diff-ed output logged and displayed in chat window (first 500 chars)
  • Full diff-ed output included in structured logs
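
A sketch of a log writer matching these rules; the field names and log path are assumptions, not the exact schema:

```js
// Hedged sketch: one structured entry per test run, full diff in the log,
// truncated diff echoed to the chat window.
const fs = require("node:fs");

const LOG_FILE = "logs/test-runs.log"; // assumed path

function logTestRun(runNumber, diffOutput) {
  const entry = {
    label: `TEST RUN ${runNumber}`,
    timestamp: new Date().toISOString(), // ISO-format timestamp
    diff: diffOutput,                    // full diff-ed output
  };
  fs.appendFileSync(LOG_FILE, JSON.stringify(entry) + "\n");
  // The chat window only gets the first 500 chars.
  console.log(`${entry.label} @ ${entry.timestamp}\n${diffOutput.slice(0, 500)}`);
}
```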

4. Anchor behavior success condition

  • Added checkAnchorBehavior() function
  • Checks logs for anchor failures and fallback strategies
  • Success requires 5 consecutive passes AND correct anchor behavior
  • Resets consecutive passes if anchor behavior issues detected
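
A minimal sketch of what checkAnchorBehavior() could scan for; the marker strings and the consecutive-pass bookkeeping are assumptions about the real implementation:

```js
// Hedged sketch of checkAnchorBehavior(): flag any recent log line that
// mentions an anchor failure or a fallback strategy.
const ANCHOR_ISSUE_MARKERS = ["anchor failure", "fallback strategy"]; // assumed

function checkAnchorBehavior(logLines) {
  const issues = logLines.filter((line) =>
    ANCHOR_ISSUE_MARKERS.some((m) => line.toLowerCase().includes(m))
  );
  return { ok: issues.length === 0, issues };
}

// In the main loop (sketch): a run only counts toward the 5 consecutive
// passes when the diff matches AND anchor behavior is clean; otherwise the
// counter resets to zero.
// consecutivePasses = diffMatches && checkAnchorBehavior(logs).ok
//   ? consecutivePasses + 1 : 0;
```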

5. Stop conditions

All stop conditions implemented:

  • A. Maximum runs: Stops at 500 test runs
  • B. Word document errors: Detects invalid state errors (property not loaded, API failures, track changes issues, etc.)
  • C. Timeout: Stops if no logs from Word machine for 5 minutes (with 1 retry after 2 minutes)
  • D. Repetition/oscillation: Detects repeated same edit → same LLM output → same error loops, and oscillating diff patterns
  • E. Identical log hashes: Stops if same log hash appears 6 consecutive times
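
Conditions D and E both reduce to tracking a short window of per-run log hashes. A sketch, assuming SHA-256 over each run's log text and a period-2 cycle check for the oscillation case:

```js
// Hedged sketch of stop conditions D and E: hash each run's logs, then look
// for 6 identical consecutive hashes (E) or an A -> B -> A -> B cycle (D).
const { createHash } = require("node:crypto");

const recentHashes = [];

function recordRun(logText) {
  recentHashes.push(createHash("sha256").update(logText).digest("hex"));
  if (recentHashes.length > 12) recentHashes.shift(); // keep a short window
}

function stopReason() {
  if (recentHashes.length < 6) return null;
  const last6 = recentHashes.slice(-6);
  // Condition E: the same hash 6 runs in a row.
  if (last6.every((h) => h === last6[0])) return "identical-logs";
  // Condition D: strict period-2 oscillation, e.g. A, B, A, B, A, B.
  if (last6.every((h, i) => h === last6[i % 2]) && last6[0] !== last6[1]) {
    return "oscillation";
  }
  return null;
}
```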

Additional improvements

  • Log hash tracking for oscillation detection
  • Error pattern tracking for debugging
  • Enhanced metadata in all log entries (testRunNumber, timestamps)
  • Better error messages with context