Building a Scalable Browser Automation Platform for Accessibility Scanning

Automated accessibility testing requires real browsers. The WCAG spec cares about the rendered DOM, computed styles, and layout — things you can't get from a curl request. But running headless browsers at scale is notoriously painful: memory leaks, zombie processes, flaky CDP connections, and the ever-present risk of OOM kills.

This post covers the browser automation layer of A11yNow — the subsystem I built at BarrierBreak to manage Chromium, Firefox, and WebKit instances across hundreds of concurrent scans without falling over. Every pattern here was forged in production fire.

1. The two-layer pooling architecture

There are two layers of resource pooling: browser instances (heavy, ~300MB each) and pages (light, ~10MB each). They're managed differently because they have different costs and lifetimes.

┌────────────────────────────────────────────────────────┐
│ CoreScannerService                                     │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │ LRU Cache: BrowserManager instances              │  │
│  │ Key: projectId:browserType:device                │  │
│  │ Max: 12, TTL: 10 min                             │  │
│  └────────────────────────┬─────────────────────────┘  │
│                           │                            │
│  ┌────────────────────────▼─────────────────────────┐  │
│  │ BrowserManager (per project)                     │  │
│  │                                                  │  │
│  │  ┌────────────────────────────────────────────┐  │  │
│  │  │ Active Page Pool (max 5)                   │  │  │
│  │  │ Request Queue (max 20, 120s)               │  │  │
│  │  └────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘

Why per-project?

Different projects need different browser configurations. Project A might test a public marketing site (no auth, desktop only). Project B might test an authenticated admin dashboard (cookie auth, mobile viewport, ad blocking). Sharing one browser between them would mean constantly tearing down and rebuilding browser contexts — roughly as expensive as launching a new browser.

Instead, each project gets its own browser manager with a fixed configuration. It lives in an LRU cache so inactive projects get evicted, freeing memory for active ones.

2. LRU cache with safe disposal

The cache is a standard lru-cache instance with one production-hardened twist — the dispose handler:

this.projectBrowserManagers = new LRUCache<string, IBrowserManager>({
  max: parseInt(process.env.BROWSER_POOL_MAX_SIZE || '12', 10),
  ttl: 1000 * 60 * parseInt(process.env.BROWSER_POOL_TTL_MINUTES || '10', 10),
  ttlAutopurge: true,
  updateAgeOnGet: true, // Reset TTL on access — prevents mid-scan eviction
 
  dispose: (value, key) => {
    // LRUCache's dispose is synchronous, but shutdown is async.
    // Track the promise so graceful shutdown can await it.
    const p = (async () => {
      try {
        logger.info('LRU evicting browser manager, shutting down', { projectId: key });
        await value.shutdown();
      } catch (error) {
        logger.error('Error shutting down during LRU eviction', { projectId: key, error });
      } finally {
        this.pendingDisposals.delete(p);
      }
    })();
    this.pendingDisposals.add(p);
  },
});

Three details matter here:

updateAgeOnGet: true. Every acquirePage() call touches the cache entry, resetting its TTL, so a project actively receiving scan requests won't have its browser expire by TTL mid-scan.
max is the real limit, not ttl. TTL auto-purge cleans up stale browsers, but max: 12 is the hard cap: inserting a 13th entry evicts the least-recently-used one, even if it is still in use. Because the key is projectId:browserType:device, a single project can hold several entries.
pendingDisposals set. The dispose callback is synchronous, but shutdown() is async. If the process receives SIGTERM during a disposal, the graceful-shutdown handler calls awaitPendingDisposals() so the process doesn't exit while browser processes are still closing.
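
The awaitPendingDisposals() side isn't shown in this post, so here is a minimal sketch of how such tracking could look in isolation. DisposalTracker, track(), and awaitPending() are hypothetical names, and the timeout is an assumption added so that a hung browser cannot block shutdown forever.

```typescript
// Hypothetical helper: tracks async cleanups that were started from a synchronous callback.
class DisposalTracker {
  private readonly pending = new Set<Promise<void>>();

  // Start an async cleanup and remember it until it settles.
  track(cleanup: () => Promise<void>): void {
    // Promise.resolve().then(...) also captures synchronous throws from cleanup.
    const p: Promise<void> = Promise.resolve()
      .then(cleanup)
      .catch(() => {
        // Real code would log here; a disposal must never reject.
      })
      .then(() => {
        this.pending.delete(p);
      });
    this.pending.add(p);
  }

  get size(): number {
    return this.pending.size;
  }

  // Resolves true once all tracked cleanups settle, or false if timeoutMs elapses first.
  async awaitPending(timeoutMs: number): Promise<boolean> {
    if (this.pending.size === 0) return true;

    let timer: ReturnType<typeof setTimeout> | undefined;
    const timedOut = new Promise<boolean>(resolve => {
      timer = setTimeout(() => resolve(false), timeoutMs);
    });
    // Tracked promises never reject (see track), so Promise.all is safe here.
    const allDone = Promise.all(Array.from(this.pending)).then(() => true);

    const finished = await Promise.race([allDone, timedOut]);
    if (timer !== undefined) clearTimeout(timer);
    return finished;
  }
}
```

In a setup like the one above, the dispose handler would call track(() => value.shutdown()), and the SIGTERM handler would clear the cache and then await awaitPending() before exiting.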

Concurrent creation guard

A subtle race: two scans for the same project arrive simultaneously. Without a guard, both would see a cache miss and create two browser-manager instances:

private getProjectBrowserManager(projectId, browserType, devicePreset) {
  const cacheKey = `${projectId}:${browserType}:${devicePreset || 'desktop'}`;
 
  if (this.projectBrowserManagers.has(cacheKey)) {
    return Promise.resolve(this.projectBrowserManagers.get(cacheKey)!);
  }
 
  // Check for an in-flight creation promise
  if (this.projectManagerPromises.has(cacheKey)) {
    return this.projectManagerPromises.get(cacheKey)!;
  }
 
  const creationPromise = (async () => {
    try {
      const browserConfig = await this.buildBrowserConfig(projectId, ...);
      const manager = new BrowserManager(browserConfig);
      this.projectBrowserManagers.set(cacheKey, manager);
      return manager;
    } finally {
      this.projectManagerPromises.delete(cacheKey);
    }
  })();
 
  this.projectManagerPromises.set(cacheKey, creationPromise);
  return creationPromise;
}

The first caller creates the promise and stores it in projectManagerPromises. Any caller that arrives before the promise settles gets the same promise. Once it resolves, the cache holds the entry. The finally block removes the promise-map entry either way, so a failed creation is not cached and the next caller starts a fresh attempt.
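
The same guard can be pulled out into a small reusable helper. The sketch below is a generic version under assumed names (SingleFlight, run); it only deduplicates in-flight work and leaves caching of the result to the caller, as the LRU cache does above.

```typescript
// Hypothetical generic form of the creation guard: one in-flight promise per key.
class SingleFlight<K, V> {
  private readonly inFlight = new Map<K, Promise<V>>();

  run(key: K, factory: () => Promise<V>): Promise<V> {
    const existing = this.inFlight.get(key);
    if (existing) return existing; // someone is already creating this value

    // Deferring through Promise.resolve() turns synchronous throws into rejections.
    const created = Promise.resolve().then(factory);
    this.inFlight.set(key, created);

    // Remove the entry once settled, on success or failure, so errors are not cached.
    const cleanup = (): void => {
      if (this.inFlight.get(key) === created) this.inFlight.delete(key);
    };
    created.then(cleanup, cleanup);

    return created;
  }
}
```

Two simultaneous run() calls with the same key invoke the factory once and receive the same promise; a call made after it settles invokes the factory again.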

3. Page pooling with request queuing

Each browser manager keeps up to 5 active pages. When a scan needs a page and the pool is full, it doesn't throw — it queues:

async acquirePage(): Promise<Page> {
  this.lastActivityTime = Date.now();
 
  // Under capacity: create a new page
  if (this.activePagesSet.size < this.maxPoolSize) {
    const browser = await this.getBrowser();
    const pageContext = await browser.newContext(this.buildContextOptions());
    const page = await pageContext.newPage();
 
    page.on('close', () => {
      pageContext.close().catch(e =>
        logger.warn('Failed to close browser context', { error: e.message })
      );
    });
 
    this.activePagesSet.add(page);
 
    const blocker = await this.getAdBlocker();
    if (blocker) {
      await blocker.enableBlockingInPage(page);
    }
 
    return page;
  }
 
  // Pool full — queue with timeout
  return new Promise((resolve, reject) => {
    const timeout = setTimeout(() => {
      const idx = this.pageRequestQueue.findIndex(req => req.timeout === timeout);
      if (idx !== -1) this.pageRequestQueue.splice(idx, 1);
      reject(new ScanError(
        ErrorCode.BROWSER_LAUNCH_FAILED,
        `Timed out waiting for available page (${this.maxQueueWaitTimeMs / 1000}s)`,
        false
      ));
    }, this.maxQueueWaitTimeMs);
 
    this.pageRequestQueue.push({ resolve, reject, timestamp: Date.now(), timeout });
  });
}
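
One thing to watch in acquirePage(): the capacity check and activePagesSet.add() are separated by several awaits, so concurrent callers could each pass the check before any of them is counted. A possible hardening, sketched here with hypothetical names (SlotPool, tryAcquire) and a generic create callback standing in for newContext() plus newPage(), is to reserve the slot synchronously before the first await.

```typescript
// Hypothetical minimal pool that reserves capacity before doing async work.
class SlotPool<T> {
  private readonly active = new Set<T>();
  private reserved = 0; // slots claimed by creations that are still in flight
  private readonly maxSize: number;
  private readonly create: () => Promise<T>;

  constructor(maxSize: number, create: () => Promise<T>) {
    this.maxSize = maxSize;
    this.create = create;
  }

  get hasCapacity(): boolean {
    return this.active.size + this.reserved < this.maxSize;
  }

  get size(): number {
    return this.active.size;
  }

  // Returns null when the pool is full; the caller would queue in that case.
  async tryAcquire(): Promise<T | null> {
    if (!this.hasCapacity) return null;
    this.reserved++; // claim the slot synchronously, before any await
    try {
      const item = await this.create();
      this.active.add(item);
      return item;
    } finally {
      this.reserved--; // the slot is now either in `active` or free again
    }
  }

  release(item: T): void {
    this.active.delete(item);
  }
}
```

With a cap of 2, five simultaneous tryAcquire() calls create exactly two items and return null for the rest, and a failed creation gives its slot back.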

Queue processing is event-driven + periodic

When a page is released, the queue is processed immediately:

async releasePage(page: Page): Promise<void> {
  this.activePagesSet.delete(page);
  await this.closePage(page);
 
  // Don't wait for the 5s timer — process now
  this.processQueue();
}

But there's also a fallback 5-second interval timer. Why both? If a page release triggers an error during close, processQueue() might not be called. The interval ensures queued requests don't starve.
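
The interval itself isn't shown here, so the following is only a sketch of how such a fallback could be wired. QueueTicker is a hypothetical name, and the tick callback stands in for processQueue(); the 5000 ms default mirrors the interval mentioned above.

```typescript
// Hypothetical wrapper around the periodic fallback that re-runs queue processing.
class QueueTicker {
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly tick: () => void;
  private readonly intervalMs: number;

  constructor(tick: () => void, intervalMs: number = 5000) {
    this.tick = tick;
    this.intervalMs = intervalMs;
  }

  start(): void {
    if (this.timer !== undefined) return; // already running
    this.timer = setInterval(() => {
      try {
        this.tick();
      } catch {
        // Swallow so a failing tick does not surface as an uncaught exception;
        // real code would log here.
      }
    }, this.intervalMs);

    // In Node.js the handle has unref(), which lets the process exit while the timer is armed.
    const handle: unknown = this.timer;
    if (typeof handle === 'object' && handle !== null && 'unref' in handle) {
      (handle as { unref: () => void }).unref();
    }
  }

  stop(): void {
    if (this.timer === undefined) return;
    clearInterval(this.timer);
    this.timer = undefined;
  }
}
```

A manager would call start() in its constructor and stop() during shutdown, so no tick fires against a browser that is being closed.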

processQueue() itself is simple: dequeue the oldest request, clear its timeout, call acquirePage() (which will now have capacity), and resolve/reject the promise:

private processQueue(): void {
  if (this.isShuttingDown || this.pageRequestQueue.length === 0 ||
      this.activePagesSet.size >= this.maxPoolSize) {
    return;
  }
 
  const request = this.pageRequestQueue.shift()!;
  clearTimeout(request.timeout);
 
  this.acquirePage()
    .then(page => request.resolve(page))
    .catch(error => request.reject(error));
}

4. Page isolation via browser contexts

Every page gets its own Playwright BrowserContext — not just its own page. This means:

Isolated cookies, localStorage, and sessionStorage — one scan's login state can't leak into another.
Per-page viewport, device scale, and user agent — mobile scan contexts use isMobile: true and touch emulation, desktop contexts don't.
Automatic cleanup — the page's close event auto-closes its context.

private buildContextOptions(): Record<string, any> {
  // Typed loosely so optional keys (userAgent, isMobile, hasTouch) can be added below
  const opts: Record<string, any> = {
    viewport: this.config.viewport,
    deviceScaleFactor: this.config.deviceScaleFactor ?? 1,
    javaScriptEnabled: true,
    ignoreHTTPSErrors: true,
  };
 
  if (this.config.userAgent) opts.userAgent = this.config.userAgent;
 
  // Browser-type awareness
  const webkitOnLinux = this.browserType === 'webkit' && process.platform === 'linux';
 
  // isMobile: skipped for Firefox (not supported there) and for WebKit on Linux
  if (this.browserType !== 'firefox' && !webkitOnLinux) {
    opts.isMobile = this.config.isMobile ?? false;
  }
 
  // hasTouch: skipped only for WebKit on Linux
  if (!webkitOnLinux) {
    opts.hasTouch = this.config.hasTouch ?? false;
  }
 
  return opts;
}
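
To make the browser and platform matrix easier to see, here is the same branching pulled into a standalone pure function. The function name emulationFlags and the two interfaces are hypothetical; the rules are the ones from buildContextOptions() above.

```typescript
type BrowserType = 'chromium' | 'firefox' | 'webkit';

interface DeviceConfig {
  isMobile?: boolean;
  hasTouch?: boolean;
}

interface EmulationFlags {
  isMobile?: boolean;
  hasTouch?: boolean;
}

// Hypothetical pure version of the branching: which emulation keys get set at all.
function emulationFlags(
  browserType: BrowserType,
  platform: string,
  config: DeviceConfig,
): EmulationFlags {
  const flags: EmulationFlags = {};
  const webkitOnLinux = browserType === 'webkit' && platform === 'linux';

  // isMobile is omitted for Firefox and for WebKit on Linux.
  if (browserType !== 'firefox' && !webkitOnLinux) {
    flags.isMobile = config.isMobile ?? false;
  }

  // hasTouch is omitted only for WebKit on Linux.
  if (!webkitOnLinux) {
    flags.hasTouch = config.hasTouch ?? false;
  }

  return flags;
}

const mobile: DeviceConfig = { isMobile: true, hasTouch: true };
console.log(emulationFlags('chromium', 'linux', mobile)); // both keys set
console.log(emulationFlags('firefox', 'linux', mobile));  // hasTouch only
console.log(emulationFlags('webkit', 'linux', mobile));   // neither key set
```

Keeping the rules in a pure function like this also makes them easy to unit test without launching a browser.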

5. Ad blocking: shared engine, per-page enablement

Ad and cookie-banner blocking uses @ghostery/adblocker-playwright. The engine is a 30MB parsed filter list — creating one per page would be catastrophic. So it's shared at the module level:

// Module-level cache — one engine per filter list combination
const SHARED_BLOCKER_CACHE = new Map<string, Promise<PlaywrightBlocker>>();
 
private async getAdBlocker(): Promise<PlaywrightBlocker | null> {
  const filterUrls: string[] = [];
  if (this.blockAds) filterUrls.push('https://easylist.to/easylist/easylist.txt');
  if (this.blockCookieBanners) {
    filterUrls.push('https://secure.fanboy.co.nz/fanboy-cookiemonster.txt');
    filterUrls.push('https://secure.fanboy.co.nz/fanboy-annoyance.txt');
  }
  if (this.blockTrackers) filterUrls.push('https://easylist.to/easylist/easyprivacy.txt');
 
  if (filterUrls.length === 0) return null;
 
  const cacheKey = [...filterUrls].sort().join('|');
  let cached = SHARED_BLOCKER_CACHE.get(cacheKey);
  if (!cached) {
    cached = PlaywrightBlocker.fromLists(fetch, filterUrls).catch(error => {
      SHARED_BLOCKER_CACHE.delete(cacheKey); // Don't cache failures
      throw error;
    });
    SHARED_BLOCKER_CACHE.set(cacheKey, cached);
  }
  return cached;
}

The engine is then enabled per-page via enableBlockingInPage(page). This means 20 concurrent pages for the same project share one 30MB filter engine, not 600MB.

6. Multi-auth: five strategies, one interface

Authenticated scanning supports five strategies through a unified config:

Strategy	Config shape
basic	`{ type: 'basic', username, password }`
bearer	`{ type: 'bearer', token }`
cookie	`{ type: 'cookie', cookies: [{ name, value }] }`
ntlm	`{ type: 'ntlm', username, password }`
ui	`{ type: 'ui', usernameSelector, passwordSelector }` or `{ type: 'ui', steps: [...] }`

Each config is validated before use:

private validateAuthConfig(config: unknown): boolean {
  // Narrow `unknown` once so the property checks below type-check
  if (typeof config !== 'object' || config === null) return false;
  const cfg = config as Record<string, any>;
 
  switch (cfg.type) {
    case 'basic':
    case 'ntlm':
      return typeof cfg.username === 'string'
        && typeof cfg.password === 'string';
    case 'bearer':
      return typeof cfg.token === 'string';
    case 'cookie':
      return Array.isArray(cfg.cookies)
        && cfg.cookies.every((c: any) => Boolean(c?.name && c?.value));
    case 'ui':
      // Boolean(...) so a truthy selector string is not returned as-is
      return Boolean(
        (cfg.usernameSelector && cfg.passwordSelector) ||
        (Array.isArray(cfg.steps) && cfg.steps.length > 0)
      );
    default:
      return false;
  }
}
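
Once a config has passed validation, the five shapes from the table can be modeled as a discriminated union so the compiler checks each branch. This is a sketch, not the types used in A11yNow: the interface names and describeAuth() are hypothetical, and steps is left as unknown[] because the step format isn't described here.

```typescript
interface BasicAuth { type: 'basic'; username: string; password: string; }
interface NtlmAuth { type: 'ntlm'; username: string; password: string; }
interface BearerAuth { type: 'bearer'; token: string; }
interface CookieAuth { type: 'cookie'; cookies: { name: string; value: string }[]; }
interface UiAuth {
  type: 'ui';
  usernameSelector?: string;
  passwordSelector?: string;
  steps?: unknown[];
}

type AuthConfig = BasicAuth | NtlmAuth | BearerAuth | CookieAuth | UiAuth;

// Hypothetical helper: a log-safe summary that never includes secrets.
function describeAuth(config: AuthConfig): string {
  switch (config.type) {
    case 'basic':
    case 'ntlm':
      return `${config.type} as ${config.username}`;
    case 'bearer':
      return 'bearer token';
    case 'cookie':
      return `${config.cookies.length} cookie(s)`;
    case 'ui':
      return config.steps
        ? `ui flow with ${config.steps.length} step(s)`
        : 'ui form login';
    default: {
      // Exhaustiveness check: adding a sixth strategy makes this line a compile error.
      const unreachable: never = config;
      return unreachable;
    }
  }
}
```

The switch on type narrows each branch, so config.token is only accessible in the bearer case and config.cookies only in the cookie case.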

Session caching via Redis

After a successful login, the browser state (cookies, localStorage) is saved to Redis with a TTL:

async authenticate(page, url, authConfig, sessionId) {
  // 1. Try saved session first
  if (sessionId) {
    const hasSession = await authService.hasAuthSession(sessionId);
    if (hasSession) {
      const restored = await authService.restoreAuthSession(page, sessionId);
      if (restored.success) {
        await authService.refreshAuthSession(sessionId); // Extend TTL
        return { success: true, page: restored.page };
      }
      // Session stale — delete and fall through
      await authService.deleteAuthSession(sessionId);
    }
  }
 
  // 2. Fresh authentication
  await page.goto('about:blank'); // Clean slate
  const result = await authService.authenticate(page, url, authConfig);
 
  if (!result.success) {
    return { success: false, page, error: { code: 'AUTH_FAILED', ... } };
  }
 
  // 3. Cache the session for next scan
  if (sessionId) {
    await authService.saveAuthSession(result.page, sessionId);
  }
 
  return { success: true, page: result.page };
}
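
The authService internals aren't shown, so the following is only one possible shape for the storage side. It assumes the saved state is a JSON-serializable object like the one Playwright's storageState() returns (cookies plus per-origin localStorage), and it codes against a small hypothetical KeyValueStore interface rather than a specific Redis client. InMemoryStore, AuthSessionCache, and the auth-session: key prefix are all illustrative.

```typescript
// Reduced shape of a Playwright storage state: cookies plus per-origin localStorage.
interface StoredState {
  cookies: { name: string; value: string; domain: string; path: string }[];
  origins: { origin: string; localStorage: { name: string; value: string }[] }[];
}

// The subset of a Redis-like client this sketch relies on (hypothetical interface).
interface KeyValueStore {
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  get(key: string): Promise<string | null>;
  del(key: string): Promise<void>;
}

// In-memory stand-in so the sketch runs without a Redis server.
class InMemoryStore implements KeyValueStore {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: behave as if it was never there
      return null;
    }
    return entry.value;
  }

  async del(key: string): Promise<void> {
    this.entries.delete(key);
  }
}

class AuthSessionCache {
  private readonly store: KeyValueStore;
  private readonly ttlSeconds: number;

  constructor(store: KeyValueStore, ttlSeconds: number) {
    this.store = store;
    this.ttlSeconds = ttlSeconds;
  }

  private key(sessionId: string): string {
    return `auth-session:${sessionId}`;
  }

  async save(sessionId: string, state: StoredState): Promise<void> {
    await this.store.set(this.key(sessionId), JSON.stringify(state), this.ttlSeconds);
  }

  async load(sessionId: string): Promise<StoredState | null> {
    const raw = await this.store.get(this.key(sessionId));
    if (raw === null) return null;
    try {
      return JSON.parse(raw) as StoredState;
    } catch {
      await this.store.del(this.key(sessionId)); // corrupt entry: drop it
      return null;
    }
  }

  // Extend the TTL by rewriting the entry; returns false if it already expired.
  async refresh(sessionId: string): Promise<boolean> {
    const raw = await this.store.get(this.key(sessionId));
    if (raw === null) return false;
    await this.store.set(this.key(sessionId), raw, this.ttlSeconds);
    return true;
  }
}
```

Coding against the narrow interface keeps the caching logic testable with a fake clock, and a Redis-backed implementation only has to provide the same three methods.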

7. Stealth: evading bot detection

Many sites block headless browsers. For local Chromium, I use playwright-extra with the stealth plugin:

// Applied once at browser-manager construction
if (this.browserType === 'chromium') {
  playwrightExtraChromium.use(stealth());
}
 
// Later, at launch time:
if (this.browserType === 'chromium') {
  browser = await playwrightExtraChromium.launch({
    headless: this.config.headless,
    args: ['--no-sandbox', '--disable-setuid-sandbox',
           '--disable-dev-shm-usage', '--disable-gpu'],
  });
}

The stealth plugin patches navigator.webdriver, navigator.plugins, navigator.languages, window.chrome, and other fingerprints that sites use to detect automation. For remote Browserless deployments, stealth isn't needed — Browserless itself presents as a real browser.

8. Error resilience: discriminated errors + smart retry

Not all errors should be retried. A 401 Unauthorized won't fix itself. An ECONNRESET on the CDP connection probably will. The system uses discriminated scan errors:

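// Const object + derived union: usable as a value (ErrorCode.AUTH_FAILED) and as a type
const ErrorCode = {
  BROWSER_LAUNCH_FAILED: 'BROWSER_LAUNCH_FAILED', // Retryable: browser processes crash
  PAGE_LOAD_FAILED: 'PAGE_LOAD_FAILED',           // Retryable: transient network issues
  AUTH_FAILED: 'AUTH_FAILED',                     // NOT retryable: bad credentials
  SCAN_TIMEOUT: 'SCAN_TIMEOUT',                   // Retryable: slow page, might load next time
  SCAN_CANCELLED: 'SCAN_CANCELLED',               // NOT retryable: user-requested
  ADBLOCKER_INIT_FAILED: 'ADBLOCKER_INIT_FAILED', // Depends: retryable if network, not if config
  UNKNOWN_ERROR: 'UNKNOWN_ERROR',                 // Conservative: NOT retryable
} as const;
type ErrorCode = (typeof ErrorCode)[keyof typeof ErrorCode];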

The retry handler uses exponential backoff with 30% jitter:

new RetryHandler({
  maxRetries: 3,
  baseDelay: 1000,  // 1s
  maxDelay: 8000,   // 8s (1s × 2^3)
  retryableErrors: [
    ErrorCode.BROWSER_LAUNCH_FAILED,
    ErrorCode.PAGE_LOAD_FAILED,
    ErrorCode.SCAN_TIMEOUT
  ]
});
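
RetryHandler's internals aren't shown, so here is a sketch of what exponential backoff with 30% jitter could look like. It assumes the jitter is symmetric (plus or minus 30% of the computed delay) and that the result is clamped to maxDelay; backoffDelay() and withRetry() are hypothetical names, and sleep and random are injectable only to keep the sketch testable.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped, then jittered.
function backoffDelay(
  attempt: number,
  baseDelay: number,
  maxDelay: number,
  jitterRatio: number = 0.3,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(maxDelay, baseDelay * Math.pow(2, attempt));
  const jitter = exponential * jitterRatio * (random() * 2 - 1); // within +/- jitterRatio
  return Math.min(maxDelay, Math.round(exponential + jitter));
}

interface RetryOptions {
  maxRetries: number;
  baseDelay: number;
  maxDelay: number;
  retryableCodes: string[];
  sleep?: (ms: number) => Promise<void>;
  random?: () => number;
}

// Runs op, retrying only errors whose `code` is listed as retryable.
async function withRetry<T>(op: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep
    ?? ((ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms)));

  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (error) {
      const code = (error as { code?: string } | null)?.code;
      const retryable = code !== undefined && opts.retryableCodes.includes(code);
      if (!retryable || attempt >= opts.maxRetries) throw error;
      await sleep(backoffDelay(attempt, opts.baseDelay, opts.maxDelay, 0.3, opts.random));
    }
  }
}
```

With the configuration above (base 1s, three retries), the un-jittered delays are 1s, 2s, and 4s, so the 8s cap only comes into play if maxRetries is raised.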

At the browser-launch level, there's a separate retry loop with configuration fallbacks:

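// Browser launch retry loop with config fallbacks
const maxAttempts = 3;
let lastError: unknown;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  try {
    return await playwrightExtraChromium.launch({
      headless: this.config.headless,
      args: chromiumArgs,
      timeout: launchTimeout,
    });
  } catch (error) {
    lastError = error;
    if (attempt === maxAttempts) break; // Out of attempts: skip the sleep, rethrow below
    if (attempt === maxAttempts - 1 && this.browserType === 'chromium') {
      // Final attempt: try the newer headless mode
      chromiumArgs.push('--headless=new');
    }
    await sleep(2000 * Math.pow(2, attempt - 1)); // 2s, then 4s
  }
}
throw lastError; // Surface the failure instead of silently returning undefined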

9. Remote Browserless: CDP with protocol fallback

In production, browsers run on a separate Browserless cluster. The connection code handles both CDP (Browserless native) and standard Playwright WebSocket protocols:

private async connectToRemoteBrowser(endpoint: string): Promise<Browser> {
  const wsEndpoint = endpoint
    .replace('http://', 'ws://')
    .replace('https://', 'wss://');
 
  const maxRetries = 3;
  let lastError: unknown;
 
  for (let retry = 0; retry <= maxRetries; retry++) {
    try {
      // Try CDP first (Browserless speaks this natively)
      const browser = await playwrightChromium.connectOverCDP(wsEndpoint, { timeout: 30000 });
      this.setupBrowserListeners(browser);
      return browser;
    } catch (cdpError) {
      lastError = cdpError; // If the fallback also fails, the CDP error is the one reported
      const msg = cdpError instanceof Error ? cdpError.message : String(cdpError);
      if (msg.includes('Protocol error') || msg.includes('undefined')) {
        // Not Browserless: try standard Playwright connect
        try {
          const browser = await playwrightChromium.connect(wsEndpoint, { timeout: 30000 });
          this.setupBrowserListeners(browser);
          return browser;
        } catch {
          // Both protocols failed; fall through to the backoff below
        }
      }
    }
 
    if (retry < maxRetries) {
      // Without this, the loop would never reach a second attempt.
      await sleep(1000 * Math.pow(2, retry)); // Base delay here is illustrative
    }
  }
 
  throw lastError; // Every attempt failed
}

The retry loop (one initial attempt plus up to 3 retries, with exponential backoff between them) handles transient connection failures. The disconnection listener cleans up state:

browser.on('disconnected', () => {
  logger.warn(`Browser disconnected`, {
    endpoint: this.lastRemoteEndpoint,
    activePages: this.activePagesSet.size,
    queueLength: this.pageRequestQueue.length,
  });
  if (this.browser === browser) {
    this.browser = undefined; // Force reconnection on next acquirePage
  }
});

10. Idle detection: auto-shutdown inactive browsers

A browser with no active pages for 5 minutes gets shut down:

constructor(config, maxPoolSize = 5, idleTimeoutMs = 300000) {
  this.idleCheckTimer = setInterval(
    () => this.checkIdleTimeout(idleTimeoutMs),
    60000 // Check every minute
  );
  this.idleCheckTimer.unref(); // Don't keep the process alive
}
 
private checkIdleTimeout(idleTimeoutMs: number): void {
  if (this.isShuttingDown || this.activePagesSet.size > 0) return;
 
  const idleTime = Date.now() - this.lastActivityTime;
  if (idleTime > idleTimeoutMs && this.browser) {
    const browserToClose = this.browser;
    this.browser = undefined; // Clear immediately
 
    browserToClose.close()
      .then(() => logger.info('Closed idle browser'))
      .catch(err => logger.error('Failed to close idle browser', { err }));
  }
}

The .unref() on the timer is important — it prevents the idle check from keeping the entire Node.js process alive if there's nothing else running.

Key takeaways

Two-layer pooling (browser LRU + page queues) matches the cost structure. Browsers are expensive, so they are shared across scans; pages are cheap, so each acquisition gets a fresh page and context that are closed on release.
LRU dispose must handle async. Browser closing is async, but the LRU's dispose is sync. Track shutdown promises in a Set and await them during graceful shutdown.
A module-level adblocker cache saves ~30MB × N. One engine per filter config, shared across all pages via enableBlockingInPage().
Auth sessions in Redis eliminate re-logins across batch scans — critical for rate-limited or MFA-protected sites.
Error discrimination enables selective retry. Don't retry AUTH_FAILED (wrong password); do retry BROWSER_LAUNCH_FAILED (CDP hiccup).
CDP-first, Playwright-fallback connection handles both Browserless and generic Playwright servers without configuration flags.
