Bot Arena

Failure report

Playwright vs Bot Arena

Five plain Playwright tests, one per level. Each one tries to sign in. Every test fails. For each level, this report shows what the test does, the error Playwright surfaces, and the detection signals that caused the failure.

Tests run: 5
Failed: 5
Detections that caught it: 9

Headless Chromium driven by @playwright/test running against bot-arena.jhero.app. Source: playwright/levels.spec.ts.

Level 1

The honest tell

· Passive webdriver flags
Playwright: failed · AIVA: also fails

What the test does

test('Level 1 sign in', async ({ page }) => {
  await page.goto('/level/1/');
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: expect(locator).toBeVisible() failed

Locator:  getByText('Access granted')
Expected: visible
Received: hidden
Timeout:  5000ms

Plain-English explanation

The problem

Browsers volunteer a lot about themselves to every site they visit — what version they are, what extensions are loaded, whether they are being controlled by an automation program. When Playwright drives a browser, the browser honestly admits "I am being automated" through a flag called navigator.webdriver that any site can read in a single line of JavaScript. Stock Playwright also has no plugins installed, no notification permissions set, and identifies itself as "HeadlessChrome" in its version string. Each of these is a yes/no question a site can ask in milliseconds.

Why a VNC-driven real browser passes

The browser inside a VNC session is a regular, fully-fledged Chrome that a regular user started. Nothing is automating it from the inside — the automation happens outside the browser, at the operating-system level, by moving a mouse and pressing keys on a remote desktop. The browser does not know it is being driven, so none of these flags get set, and it reports back the same values any real human visitor would.

Playwright context: could this test be fixed in Playwright?

Verdict: technically patchable, but it's an arms race that the page always wins eventually.

Each of the five Level 1 signals can in principle be spoofed from Playwright:

  • navigator.webdriver can be hidden via --disable-blink-features=AutomationControlled plus an addInitScript that redefines the property.
  • The User-Agent can be spoofed with --user-agent="..." to strip the HeadlessChrome token.
  • navigator.plugins, navigator.languages, and the Notification.permission / permissions.query pair can all be patched via Object.defineProperty in an init script.

Off-the-shelf stealth bundles (playwright-extra + puppeteer-extra-plugin-stealth) ship most of these patches already. The catch: every Chrome release introduces new tells, and commercial bot-detection vendors (Cloudflare, DataDome, PerimeterX, Imperva) maintain fingerprint databases of every known stealth-plugin signature. You spend more time updating your evasions than writing tests, and you only ever win temporarily.

AIVA context: what would need to change in AIVA to pass this

Fix complexity: 1/5
Trivial with an init script
~30 minutes (vibe-codable in one prompt)
5-line patch in browser.ts · No architectural change

AIVA fails this level because of one signal: navigator.webdriver = true. AIVA launches Chrome via Puppeteer in aiva-node/src/control-server/src/browser.ts:204 (puppeteer.launch({...})), and any browser attached via CDP has this flag set automatically by Chrome itself.

The pragmatic fix is a single init script. Add this to AIVA's page-setup flow (e.g., next to the existing hideCursorScript wiring):

await page.evaluateOnNewDocument(() => {
  Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined,
    configurable: true,
  });
});

Bot Arena's L1 check is literally navigator.webdriver === true → FAIL, so returning undefined makes the check pass. Stealth plugins (puppeteer-extra-plugin-stealth, playwright-extra-stealth, etc.) apply the same kind of property patch.

The original "multi-week refactor" estimate was for the architecturally pure fix: replacing Puppeteer/CDP entirely with a non-CDP control plane. That is the right answer if you need to pass sophisticated bot-detection vendors that fingerprint the shape of navigator.webdriver (own vs prototype descriptor, getter behaviour, and the fact that an unautomated Chrome reports false rather than undefined). For Bot Arena and most "naive equality check" detection layers, the 5-line patch is sufficient.

Trade-off: the init-script patch is detectable by sites that audit property descriptors. If AIVA's target customers operate sites with enterprise-grade detection, the architectural path becomes the right long-term investment. For this demo and a wide class of real-world cases, the patch is the right answer today.
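
The descriptor audit mentioned in the trade-off can be illustrated with a short sketch. It relies on one fact about stock Chrome: webdriver is exposed as a getter on Navigator.prototype, so the navigator object itself has no own property of that name. The function name webdriverLooksPatched and the FakeNavigator stand-in are hypothetical, introduced only so the example runs outside a browser; real vendors likely combine this with other descriptor checks.

```typescript
// True when `webdriver` exists as an own property of the navigator object,
// which is what an Object.defineProperty(navigator, ...) patch leaves behind.
function webdriverLooksPatched(nav: object): boolean {
  return Object.getOwnPropertyDescriptor(nav, 'webdriver') !== undefined;
}

// Hypothetical stand-in for the browser's Navigator: the getter lives on the
// prototype, mirroring how the real property is exposed.
class FakeNavigator {
  get webdriver(): boolean {
    return true;
  }
}

const stock = new FakeNavigator();
const patched = new FakeNavigator();
// The same kind of patch as the init script, applied to the stand-in.
Object.defineProperty(patched, 'webdriver', {
  get: () => undefined,
  configurable: true,
});

console.log(webdriverLooksPatched(stock));   // false
console.log(webdriverLooksPatched(patched)); // true
```

In a page the call would be webdriverLooksPatched(navigator). A patch that redefines the getter on the prototype instead would evade this particular audit, which is part of why the arms race continues.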

Why it failed — Detection Log

  • fail webdriver — navigator.webdriver = true
  • fail plugins — navigator.plugins.length = 0 (expected > 0)
  • pass languages — navigator.languages = [en-US]
  • fail ua-headless — User-Agent contains "HeadlessChrome/148.0.7778.96"
  • pass notif-permission — Notification.permission and permissions.query agreed
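
As a rough illustration of how cheap these passive checks are, the first four log entries above could be produced by something like the sketch below. This is not Bot Arena's source: the NavigatorLike and CheckResult types and the runPassiveChecks name are hypothetical, and the pass rule for languages (a non-empty list) is an assumption, since the log only shows a passing value. The notification-permission check is omitted because it needs the asynchronous permissions.query call.

```typescript
// Minimal shape of the navigator fields the checks read (hypothetical type).
interface NavigatorLike {
  webdriver?: boolean;
  userAgent: string;
  plugins: { length: number };
  languages: readonly string[];
}

interface CheckResult {
  id: string;
  pass: boolean;
  detail: string;
}

function runPassiveChecks(nav: NavigatorLike): CheckResult[] {
  return [
    {
      id: 'webdriver',
      pass: nav.webdriver !== true,
      detail: `navigator.webdriver = ${String(nav.webdriver)}`,
    },
    {
      id: 'plugins',
      pass: nav.plugins.length > 0,
      detail: `navigator.plugins.length = ${nav.plugins.length}`,
    },
    {
      // Assumption: an empty language list is treated as a failure.
      id: 'languages',
      pass: nav.languages.length > 0,
      detail: `navigator.languages = [${nav.languages.join(', ')}]`,
    },
    {
      id: 'ua-headless',
      pass: !nav.userAgent.includes('HeadlessChrome'),
      detail: nav.userAgent,
    },
  ];
}
```

In a page this would be called as runPassiveChecks(navigator); every check is a synchronous property read.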

Level 2

CDP attached

· Headless / CDP-only tells
Playwright: failed · AIVA: also fails

What the test does

test('Level 2 sign in', async ({ page }) => {
  await page.goto('/level/2/');
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: expect(locator).toBeVisible() failed

Locator:  getByText('Access granted')
Expected: visible
Received: hidden
Timeout:  5000ms

Plain-English explanation

The problem

What "CDP" means: CDP stands for Chrome DevTools Protocol — the low-level remote-control interface Chrome exposes for tools like Chrome's own DevTools panel, Puppeteer, and Playwright. It is how those tools "drive" Chrome from outside the browser: clicking buttons, typing text, reading the DOM, taking screenshots — all without using a real keyboard and mouse. When CDP is attached, Chrome behaves slightly differently in measurable ways, and many of those differences also coincide with "I am running headless."

A real laptop has visible chrome around every browser window (toolbars at the top, tabs, a Windows taskbar at the bottom, a macOS menu bar), and this chrome takes up real pixels. The browser can ask the screen "how much of you is mine, versus the OS's?" and the answer comes back in pixels. A headless automated browser has no chrome and no visible window at all, so the honest answer is zero. The reported numbers can be overridden from script, but there are no real toolbars behind them, and the overrides have to agree with every other measurement the page takes.

Why a VNC-driven real browser passes

A VNC session streams a real, fully visible Chrome window running on a real desktop. There are real toolbars, a real taskbar, real OS chrome. Every measurement the page makes returns the same numbers any human visitor on any laptop would produce. Crucially, the automation happens outside the browser (at the OS level, moving a real cursor) — no CDP is attached, so Chrome behaves like an ordinary Chrome being used by an ordinary person.

Playwright context: could this test be fixed in Playwright?

Verdict: partially patchable in script; the pixel measurements require effectively rebuilding what VNC-AIVA already is.

The JavaScript-level signals (chrome.app, chrome.csi, driver shims, toString integrity) can be polyfilled with an addInitScript at page load. Easy.

The window/screen pixel measurements are different. outerHeight - innerHeight = 0 is true because the headless browser literally has no toolbars. Two ways out, neither great:

  1. Run headed (headless: false) on a server with Xvfb/Xvnc. But then you need a real desktop environment with a window manager and a panel to populate screen.availHeight < screen.height, plus you need Chrome to actually display its chrome (not --kiosk). At that point, you have rebuilt the AIVA architecture from scratch.
  2. Spoof the values from JS — override window.outerHeight, screen.height, etc. via addInitScript. But the spoofs need to be internally consistent across signals: if you claim a 1080-pixel screen with a 40-pixel taskbar, the browser viewport's actual height needs to plausibly fit inside that. Cross-signal correlation catches these mismatches.
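
The cross-signal correlation described in option 2 can be sketched as a small consistency check over the reported numbers. The first two rules mirror the browser-chrome-height and screen-taskbar entries in this level's detection log; the third rule and its name, viewport-exceeds-available-screen, are hypothetical additions to show what an inconsistency test might look like. A production detector would need tolerances for page zoom, multi-monitor setups, and platform quirks, which this sketch ignores.

```typescript
// The four measurements a page can read from window and screen.
interface WindowGeometry {
  innerHeight: number;       // window.innerHeight
  outerHeight: number;       // window.outerHeight
  screenHeight: number;      // screen.height
  screenAvailHeight: number; // screen.availHeight
}

function geometryFindings(g: WindowGeometry): string[] {
  const findings: string[] = [];
  // No toolbars or tabs: the window is exactly as tall as the viewport.
  if (g.outerHeight - g.innerHeight <= 0) findings.push('browser-chrome-height');
  // No taskbar or panel: the OS reserves no screen pixels.
  if (g.screenAvailHeight >= g.screenHeight) findings.push('screen-taskbar');
  // Hypothetical cross-signal rule: the viewport cannot be taller than the
  // screen area the OS claims to leave available.
  if (g.innerHeight > g.screenAvailHeight) findings.push('viewport-exceeds-available-screen');
  return findings;
}

// Headless values from the detection log: everything is 720 px.
console.log(geometryFindings({ innerHeight: 720, outerHeight: 720, screenHeight: 720, screenAvailHeight: 720 }));
// A spoof that fixes the first two signals but contradicts the real viewport.
console.log(geometryFindings({ innerHeight: 900, outerHeight: 985, screenHeight: 768, screenAvailHeight: 728 }));
```

The second call shows the trap: the spoofed toolbar and taskbar values look fine individually, but the real 900-pixel viewport does not fit inside the claimed 728 available pixels.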

In practice: an automation team trying to fix L2 with Playwright ends up reinventing AIVA badly.

AIVA context: what would need to change in AIVA to pass this

Fix complexity: 2/5
Easy
Half a day
Config: drop 2 flags · Image: add desktop env

AIVA currently fails this level for two reasons:

  1. No visible browser chrome. AIVA's browserArgs.ts passes both --start-fullscreen and --kiosk. Both flags hide the toolbars, tabs, and address bar that any real Chrome window displays. With them dropped, outerHeight - innerHeight jumps from 0 px to the usual 80–120 px. Flags to drop: --start-fullscreen, --kiosk.
  2. No taskbar — this one is outside Chrome's launch flags. AIVA's VNC session (Xvfb/Xvnc) has no window manager or desktop panel reserving screen pixels, so the X server reports screen.availHeight === screen.height. Adding a lightweight desktop environment to the AIVA image — XFCE, LXDE, or even just OpenBox + tint2 — with a panel/dock visible at the bottom of the screen would close this gap.

Why it failed — Detection Log

  • pass driver-shims — no cdc_* globals (Playwright is not Selenium)
  • pass tostring-integrity — Function.prototype.toString is native
  • fail chrome-surface — window.chrome.app and chrome.csi both missing (app=false, csi=false)
  • fail browser-chrome-height — outerHeight - innerHeight = 0px (no toolbars/tabs visible)
  • fail screen-taskbar — screen.availHeight = screen.height = 720 (no taskbar reserved)

Level 3

Mouse trajectory

· Behavioural — mouse path and keystroke cadence
Playwright: failed · AIVA: passes

What the test does

test('Level 3 sign in', async ({ page }) => {
  await page.goto('/level/3/');
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: expect(locator).toBeVisible() failed

Locator:  getByText('Access granted')
Expected: visible
Received: hidden
Timeout:  5000ms

Plain-English explanation

The problem

When a human clicks a button on a web page, the mouse pointer travels there — left a bit, up a bit, curving naturally. That path leaves a trail of dozens of "I moved here" events along the way. Playwright does not do that. When you tell Playwright "click this button," the pointer instantly appears at the button's exact pixel and clicks. No travel, no curve. A page that records every mouse event notices that this click came out of nowhere — no human operates a computer like that.

Why a VNC-driven real browser passes

A VNC operator moves a real mouse cursor on a real operating system, generating the same continuous stream of mouse events any human would. Because the path is a physical movement (the cursor is dragged across the screen by a person or by image-recognition automation steering it), it has the same natural variation and curvature as any other user's.

Playwright context: could this test be fixed in Playwright?

Verdict: bypassable for the basic checks Bot Arena does, but only by hand-rolling humanized interactions everywhere — and any sophisticated behavioural model still wins.

Playwright does expose lower-level mouse APIs that can generate intermediate moves:

  • page.mouse.move(x, y, { steps: 30 }) emits 30 intermediate mousemove events along a straight line.
  • Wrap that in a Bezier-curve helper with randomized jitter and you produce trajectories with the right shape and curvature.
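  • page.keyboard.type(text, { delay }) dispatches one key at a time, but the delay is a single fixed value for the whole call. For randomized inter-key delays (say 80 to 200 ms), type one character per call and pick a new delay each time.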

The catch: every interaction in the test suite needs this treatment. A one-line page.click() becomes a thirty-line "humanize" helper. And advanced behavioural fingerprinting (used by serious bot-detection vendors) trains ML models on real human mouse telemetry — they pick up on acceleration curves, overshoot-and-correct patterns, pause-before-click latency, and dozens of other features that synthetic Bezier curves don't replicate. So: bypassable here, in this demo. Increasingly hard against production-grade defenders.
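
To give a sense of what such a "humanize" helper involves, here is a minimal sketch of the two pieces described above: a curved, jittered pointer path and per-key delays. The names humanPath and keyDelays, the 20% bow limit, and the 1-pixel jitter are all illustrative assumptions, not values taken from any real stealth library. The random source is injectable so the output can be made reproducible.

```typescript
interface Point {
  x: number;
  y: number;
}

// Points along a quadratic Bezier curve from `from` to `to`. The control point
// is the midpoint pushed sideways by up to 20% of the straight-line distance,
// and every point except the last gets up to 1 px of jitter.
function humanPath(
  from: Point,
  to: Point,
  steps = 30,
  rand: () => number = Math.random,
): Point[] {
  const dx = to.x - from.x;
  const dy = to.y - from.y;
  const bow = (rand() - 0.5) * 0.4;
  const ctrl = { x: from.x + dx / 2 - dy * bow, y: from.y + dy / 2 + dx * bow };
  const path: Point[] = [];
  for (let i = 1; i <= steps; i++) {
    const t = i / steps;
    const u = 1 - t;
    const jitter = i === steps ? 0 : 1; // land exactly on the target
    path.push({
      x: u * u * from.x + 2 * u * t * ctrl.x + t * t * to.x + (rand() - 0.5) * 2 * jitter,
      y: u * u * from.y + 2 * u * t * ctrl.y + t * t * to.y + (rand() - 0.5) * 2 * jitter,
    });
  }
  return path;
}

// One delay per character, uniformly distributed between minMs and maxMs.
function keyDelays(
  text: string,
  minMs = 80,
  maxMs = 200,
  rand: () => number = Math.random,
): number[] {
  return Array.from(text, () => Math.round(minMs + rand() * (maxMs - minMs)));
}
```

In a test, each point would be passed to page.mouse.move(p.x, p.y) in turn, and each character typed with page.keyboard.type(ch) followed by page.waitForTimeout(delay). A uniform delay distribution and a single-arc curve are exactly the kind of simplification a trained behavioural model can pick up on.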

AIVA context: why this level already passes for AIVA

✓ No fix needed: passes by construction

AIVA passes this level natively. The mouse cursor in AIVA's VNC session moves continuously across the screen at the OS level — exactly like any human user dragging a real mouse. No code or config change is needed here; this is one of the levels where running on a real machine wins by construction.

Why it failed — Detection Log

  • info level3-armed — recorder armed at page load
  • fail mouse-trajectory — only 1 mousemove point recorded between load and click (need ≥5 for a human-shaped curve)
  • pass keystroke-cadence — 0 keystrokes — page.fill() bypasses key events, so this check abstains
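
The mouse-trajectory entry above suggests a simple rule over the recorded mousemove points. The sketch below implements the point-count threshold quoted in the log (at least 5 points) and adds one hypothetical extra rule, a straight-line test, to show why page.mouse.move with steps alone may not be enough. The names judgeTrajectory and maxDeviationFromChord and the 1-pixel deviation threshold are assumptions, not Bot Arena's implementation.

```typescript
interface MovePoint {
  x: number;
  y: number;
  t: number; // timestamp in ms
}

const MIN_POINTS = 5;       // threshold quoted in the detection log
const MIN_DEVIATION_PX = 1; // hypothetical: flatter than this counts as a straight line

// Largest perpendicular distance of any point from the line joining the
// first and last points.
function maxDeviationFromChord(points: MovePoint[]): number {
  const a = points[0];
  const b = points[points.length - 1];
  if (!a || !b) return 0;
  const len = Math.hypot(b.x - a.x, b.y - a.y);
  if (len === 0) return 0;
  let max = 0;
  for (const p of points) {
    const d = Math.abs((b.x - a.x) * (a.y - p.y) - (a.x - p.x) * (b.y - a.y)) / len;
    if (d > max) max = d;
  }
  return max;
}

function judgeTrajectory(points: MovePoint[]): { pass: boolean; detail: string } {
  if (points.length < MIN_POINTS) {
    return { pass: false, detail: `only ${points.length} mousemove point(s) recorded` };
  }
  if (maxDeviationFromChord(points) < MIN_DEVIATION_PX) {
    return { pass: false, detail: 'path is a perfectly straight line' };
  }
  return { pass: true, detail: `${points.length} points with visible curvature` };
}
```

A page would fill the points array from a mousemove listener armed at load, then call judgeTrajectory when the sign-in button is clicked.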

Level 4

Fingerprint battery

· Canvas, audio, WebGL renderer, font set
Playwright: failed · AIVA: passes

What the test does

test('Level 4 sign in', async ({ page }) => {
  await page.goto('/level/4/');
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: expect(locator).toBeVisible() failed

Locator:  getByText('Access granted')
Expected: visible
Received: hidden
Timeout:  5000ms

Plain-English explanation

The problem

Real computers have real graphics cards from real manufacturers (Intel, NVIDIA, AMD), and each draws images and text in subtly different but characteristic ways. Real computers also have real font files installed by the operating system. Headless automated browsers have neither — they use a software-only graphics stack called SwiftShader that produces an obviously-different visual fingerprint, and they ship with a stripped-down set of fonts. A page can render a tiny invisible test image and hash the pixels; that single hash is usually enough to tell whether the browser is running on real silicon or a CI runner.

Why a VNC-driven real browser passes

A VNC session runs on a real machine with a real graphics stack and a real set of fonts. The fingerprints it produces match those of millions of other real desktop Chrome installations.

Playwright context: could this test be fixed in Playwright?

Verdict: effectively impossible to fix from inside Playwright without a real GPU and real fonts.

Every individual fingerprint can be spoofed in isolation with an addInitScript hook:

  • Override WebGLRenderingContext.prototype.getParameter to return a fake renderer string like "Intel Iris Xe Graphics".
  • Patch HTMLCanvasElement.prototype.toDataURL and getImageData to return a pre-computed "real GPU" hash.
  • Replace OfflineAudioContext.prototype.startRendering so that it returns a pre-recorded waveform.
  • Spoof document.fonts.check and the font-width measurement trick to claim the right font set.

The trap: internal consistency. If you claim "Intel Iris Xe Graphics" for WebGL, your canvas pixel hash needs to match what an actual Intel iGPU produces — and that hash depends on subtle floating-point rounding, anti-aliasing kernels, and driver-specific quirks. Without the actual hardware you cannot reproduce it. Detection services maintain databases of valid combinations: GPU X must produce canvas hash within set Y for fonts Z. Spoofing one signal in isolation creates a contradiction with the others, which is itself a stronger signal than the original tell.

This is where Playwright fundamentally loses against any site doing real fingerprint-based bot detection.

AIVA context: why this level already passes for AIVA

Fix complexity: 1/5
Trivial (hardening, not required)
A few hours, only if hardening is desired
Config: drop 3 flags · Operational: harvest denylist hashes

AIVA passes this level — but partially by accident. AIVA's browserArgs.ts includes --disable-gpu, --disable-webgl, and --disable-features=Vulkan,webgpu, which make the WebGL renderer query return nothing. Bot Arena reports an empty renderer as INFO rather than FAIL, so AIVA slips past. Canvas, audio, and font fingerprints come from a real Linux Chrome on a real machine and look like any other desktop user.
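
The three outcomes described here (a software rasteriser fails, an empty renderer is only INFO, anything else passes) can be sketched as a small classifier. This is an illustration rather than the code in src/detections/level4.ts: the classifyRenderer name and the Verdict type are hypothetical, and llvmpipe is an added assumption (it is Mesa's software renderer), while SwiftShader comes from this level's detection log. In a page, the renderer string would typically be read through the WEBGL_debug_renderer_info extension and gl.getParameter(ext.UNMASKED_RENDERER_WEBGL).

```typescript
type Verdict = 'pass' | 'fail' | 'info';

// Substrings that indicate software rendering. SwiftShader is from the
// detection log; llvmpipe is an assumption added for illustration.
const SOFTWARE_RENDERERS = ['swiftshader', 'llvmpipe'];

function classifyRenderer(renderer: string | null): Verdict {
  // WebGL disabled or renderer string unavailable: reported, not failed.
  if (!renderer) return 'info';
  const lower = renderer.toLowerCase();
  return SOFTWARE_RENDERERS.some((s) => lower.includes(s)) ? 'fail' : 'pass';
}

console.log(classifyRenderer('ANGLE (Google, Vulkan 1.3.0 (SwiftShader Device))')); // fail
console.log(classifyRenderer(''));                                                  // info
```

Treating an empty renderer as a failure instead would be a one-line change, which is why the current pass is fragile.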

Latent risk: if Bot Arena's canvas/audio denylists in src/detections/level4.ts were populated with hashes harvested from AIVA's Chrome (which is the operational follow-up flagged in the implementation plan), this level would fail for AIVA too. Long-term, AIVA should consider whether --disable-gpu/--disable-webgl are still needed: they are a tell to fingerprint-aware sites, because most real Chrome installations have a GPU.

Why it failed — Detection Log

  • fail webgl-renderer — WebGL renderer = "ANGLE (Google, Vulkan 1.3.0 (SwiftShader Device …))" — software rasteriser, no GPU
  • pass canvas-fp — sha256 = f66453e0… (not on denylist — denylist is empty in v1)
  • pass audio-fp — sha256 = 543fb8e0… (not on denylist — denylist is empty in v1)
  • pass font-probe — Segoe UI Emoji, Arial Black, Comic Sans MS — UA-consistent for the Windows runner

Level 5

Cloudflare Turnstile

· Real third-party challenge
Playwright: failed · AIVA: also fails

What the test does

test('Level 5 sign in', async ({ page }) => {
  await page.goto('/level/5/');
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: expect(locator).toBeVisible() failed

Locator:  getByText('Access granted')
Expected: visible
Received: hidden
Timeout:  5000ms

Plain-English explanation

The problem

Cloudflare Turnstile is the modern, invisible replacement for "click all the bicycle pictures" CAPTCHAs. When a page asks for it, Turnstile silently runs all the kinds of checks the previous four levels illustrate — plus an additional stack of private signals only Cloudflare knows about — and decides whether the visitor looks human enough to be issued a one-time "yes, this is a human" token. For automated browsers it simply refuses to issue the token. The server-side login check then sees no token and rejects the submission before it ever reaches the application code.

Why a VNC-driven real browser passes

A real Chrome session with a real fingerprint, real interaction history, and real mouse movement looks like any other human visitor to Turnstile. The token gets issued silently, exactly the same way it would for someone working from a coffee-shop laptop.

Playwright context: could this test be fixed in Playwright?

Verdict: functionally impossible to bypass from inside Playwright. The only working "fix" is to outsource the problem.

Turnstile's logic is intentionally closed-source. It runs every kind of signal the previous four levels illustrate, plus a stack of private checks Cloudflare keeps to itself, plus IP reputation, plus behavioural analysis trained on the firehose of real human traffic across the Cloudflare network. Even a Playwright author who perfectly fixed levels 1-4, ran from a residential IP, and hand-rolled humanized interactions should expect to be classified as automated much of the time, because the behavioural model draws on far more signals than a test suite can imitate.

The "solution" used in the wild is paid CAPTCHA-solver services (2Captcha, anti-captcha, CapMonster, etc.). They route the challenge through real-browser farms — either real humans or sophisticated stealth setups — and return a valid token in a few seconds, for a few cents each. Wire one of those into your test:

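// `solver` stands in for a CAPTCHA-solver client; the method name is
// illustrative and will differ between services.
const token = await solver.solveTurnstile({
  sitekey: '0x4AAAAAADOBZMoei4aG9CNO',
  url: 'https://bot-arena.jhero.app/level/5/',
});
// Write the token into the hidden field the Turnstile widget adds to the form.
await page.evaluate((t) => {
  const input = document.querySelector<HTMLInputElement>('input[name="cf-turnstile-response"]');
  if (!input) throw new Error('Turnstile response field not found');
  input.value = t;
}, token);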

This works — but it has defeated the original purpose of using Playwright. You have paid a third-party service to act as the human in front of the human-detector. Your "automated" tests now have a per-run cost and a human-in-the-loop dependency. This is exactly the kind of corner that VNC-AIVA, by being a real browser session at the OS level, avoids without any third-party dependency.

AIVA context: what would need to change in AIVA to pass this

Fix complexity: 4/5
Hard, partially externally-bound
Inherits L1 + infra work; Cloudflare ML remains uncertain
Blocked on L1 · Blocked on L2 · Residential IP infrastructure

AIVA currently fails this level as a cascading consequence of L1 and L2. Cloudflare Turnstile silently runs many of the same signals — navigator.webdriver, browser-chrome dimensions, fingerprint plausibility — plus its own private checks, plus IP reputation. Two contributing causes inside AIVA's control:

  1. Signal leakage from L1 and L2. Fixing the Puppeteer/CDP attachment, dropping --incognito/--disable-extensions, and dropping --kiosk/--start-fullscreen would all reduce Turnstile's confidence that the visitor is automated. Closing L1 + L2 likely moves Turnstile from "refuse / interactive challenge" to "silent pass" for many sites.
  2. IP reputation. If AIVA runs on a datacenter or cloud-region IP, Turnstile downgrades by default. Running through a residential proxy or from end-user infrastructure improves the score meaningfully — and is independent of any AIVA code change.

Turnstile's logic is closed-source, so even a perfectly-configured AIVA may occasionally fail. This level is the only one where success isn't fully under AIVA's control.

Why it failed — Detection Log

  • fail turnstile — no token — widget did not solve. Cloudflare refused to issue a token for the automated browser; server-side siteverify never called.
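
The server-side behaviour in this log entry (no token, so siteverify is never called) might look roughly like the sketch below. The endpoint URL and the secret/response form fields follow Cloudflare's documented siteverify API; everything else is an assumption for illustration, including the loginAllowed name and the injected postForm helper, which exists only so the logic can be exercised without network access.

```typescript
const SITEVERIFY_URL = 'https://challenges.cloudflare.com/turnstile/v0/siteverify';

// Only the field this sketch reads; the real response carries more.
interface SiteverifyResponse {
  success: boolean;
}

// Hypothetical transport: posts form fields and resolves with the parsed JSON.
type PostForm = (url: string, form: Record<string, string>) => Promise<SiteverifyResponse>;

async function loginAllowed(
  token: string | undefined,
  secret: string,
  postForm: PostForm,
): Promise<boolean> {
  // No token means the widget never solved: reject without calling Cloudflare.
  if (!token) return false;
  const result = await postForm(SITEVERIFY_URL, { secret, response: token });
  return result.success === true;
}
```

In a real handler postForm would wrap an HTTP POST, and the credential check would run only after loginAllowed resolves to true.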

What changes with VNC AIVA

Point the same five tests at the classic AIVA — a real headed Chrome on a Linux host, clicked through VNC at the OS level instead of CDP. None of the signals above fire. Every test passes. The form shows ✓ Access granted on every level.

The difference is not patches, plugins, or stealth tricks. It's that VNC-AIVA is a real browser session, driven by real OS-level input.