Automated accessibility testing requires real browsers. The WCAG spec cares about the rendered DOM, computed styles, and layout — things you can't get from a curl request. But running headless browsers at scale is notoriously painful: memory leaks, zombie processes, flaky CDP connections, and the ever-present risk of OOM kills.
This post covers the browser automation layer of A11yNow — the subsystem I built at BarrierBreak to manage Chromium, Firefox, and WebKit instances across hundreds of concurrent scans without falling over. Every pattern here was forged in production fire.
1. The two-layer pooling architecture
There are two layers of resource pooling: browser instances (heavy, ~300MB each) and pages (light, ~10MB each). They're managed differently because they have different costs and lifetimes.
┌────────────────────────────────────────────────────────┐
│  CoreScannerService                                    │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │  LRU Cache: BrowserManager instances             │  │
│  │  Key: projectId:browserType:device               │  │
│  │  Max: 12, TTL: 10 min                            │  │
│  └────────────────────────┬─────────────────────────┘  │
│                           │                            │
│  ┌────────────────────────▼─────────────────────────┐  │
│  │  BrowserManager (per project)                    │  │
│  │                                                  │  │
│  │  ┌────────────────────────────────────────────┐  │  │
│  │  │  Active Page Pool (max 5)                  │  │  │
│  │  │  Request Queue (max 20, 120s)              │  │  │
│  │  └────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
Why per-project?
Different projects need different browser configurations. Project A might test a public marketing site (no auth, desktop only). Project B might test an authenticated admin dashboard (cookie auth, mobile viewport, ad blocking). Sharing one browser between them would mean constantly tearing down and rebuilding browser contexts — roughly as expensive as launching a new browser.
Instead, each project gets its own browser manager with a fixed configuration. It lives in an LRU cache so inactive projects get evicted, freeing memory for active ones.
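To make the eviction behaviour concrete, here is a minimal sketch of LRU eviction with a dispose hook. It is not the lru-cache library that A11yNow uses: TinyLru and the example keys are hypothetical, and TTL handling is left out entirely.

```typescript
// Hypothetical, simplified LRU: relies on Map preserving insertion order.
class TinyLru<V> {
  private readonly entries = new Map<string, V>();

  constructor(
    private readonly max: number,
    private readonly dispose: (value: V, key: string) => void,
  ) {}

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used one.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.max) {
      // The first key in iteration order is the least recently used.
      const oldestKey = this.entries.keys().next().value as string;
      const oldestValue = this.entries.get(oldestKey) as V;
      this.entries.delete(oldestKey);
      this.dispose(oldestValue, oldestKey);
    }
  }
}
```

With a cap of 2, setting project A and project B, reading A, and then setting project C would evict B, because A was touched more recently.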
2. LRU cache with safe disposal
The cache is a standard lru-cache instance with one production-hardened twist — the dispose handler:
this.projectBrowserManagers = new LRUCache<string, IBrowserManager>({
max: parseInt(process.env.BROWSER_POOL_MAX_SIZE || '12', 10),
ttl: 1000 * 60 * parseInt(process.env.BROWSER_POOL_TTL_MINUTES || '10', 10),
ttlAutopurge: true,
updateAgeOnGet: true, // Reset TTL on access — prevents mid-scan eviction
dispose: (value, key) => {
// LRUCache's dispose is synchronous, but shutdown is async.
// Track the promise so graceful shutdown can await it.
const p = (async () => {
try {
logger.info('LRU evicting browser manager, shutting down', { projectId: key });
await value.shutdown();
} catch (error) {
logger.error('Error shutting down during LRU eviction', { projectId: key, error });
} finally {
this.pendingDisposals.delete(p);
}
})();
this.pendingDisposals.add(p);
},
});
Three details matter here:
- updateAgeOnGet: true. Every acquirePage() call touches the cache entry, resetting its TTL. A project actively receiving scan requests won't have its browser evicted mid-scan.
- max is the real limit, not ttl. TTL auto-purge cleans up stale browsers, but max: 12 is the hard cap. When a 13th project arrives, the least-recently-used entry is evicted.
- pendingDisposals set. The dispose callback is synchronous, but the browser shutdown it triggers is async. If the process receives SIGTERM during a disposal, the graceful-shutdown handler calls awaitPendingDisposals() to avoid leaking browser processes.
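The post doesn't show awaitPendingDisposals() itself, so the following is only a sketch of how such a method might look. DisposalTracker and track() are hypothetical names; only pendingDisposals and awaitPendingDisposals come from the text above.

```typescript
// Hypothetical tracker for async work started from synchronous callbacks.
class DisposalTracker {
  private readonly pendingDisposals = new Set<Promise<void>>();

  // Start an async shutdown and remember it until it settles.
  track(shutdown: () => Promise<void>): void {
    const p: Promise<void> = shutdown()
      .catch(() => { /* a real handler would log the failure here */ })
      .then(() => { this.pendingDisposals.delete(p); });
    this.pendingDisposals.add(p);
  }

  // Resolve once every tracked shutdown has settled.
  async awaitPendingDisposals(): Promise<void> {
    // Loop because new disposals can be registered while we are waiting.
    while (this.pendingDisposals.size > 0) {
      await Promise.all(Array.from(this.pendingDisposals));
    }
  }

  get size(): number {
    return this.pendingDisposals.size;
  }
}
```

A SIGTERM handler would presumably await this method before exiting, so that evictions already in flight get a chance to close their browsers.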
Concurrent creation guard
A subtle race: two scans for the same project arrive simultaneously. Without a guard, both would see a cache miss and create two browser-manager instances:
private getProjectBrowserManager(projectId, browserType, devicePreset) {
const cacheKey = `${projectId}:${browserType}:${devicePreset || 'desktop'}`;
if (this.projectBrowserManagers.has(cacheKey)) {
return Promise.resolve(this.projectBrowserManagers.get(cacheKey)!);
}
// Check for an in-flight creation promise
if (this.projectManagerPromises.has(cacheKey)) {
return this.projectManagerPromises.get(cacheKey)!;
}
const creationPromise = (async () => {
try {
const browserConfig = await this.buildBrowserConfig(projectId, ...);
const manager = new BrowserManager(browserConfig);
this.projectBrowserManagers.set(cacheKey, manager);
return manager;
} finally {
this.projectManagerPromises.delete(cacheKey);
}
})();
this.projectManagerPromises.set(cacheKey, creationPromise);
return creationPromise;
}
The first caller creates the promise and stores it in projectManagerPromises. Any subsequent caller that arrives before the promise settles gets the same promise. Once it resolves, the cache holds the entry and the promise-map entry is removed.
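The same guard can be pulled out into a small generic helper. This is a sketch under the assumption that keys are plain strings; SingleFlight and run() are hypothetical names, not part of A11yNow.

```typescript
// Concurrent callers for the same key share one in-flight promise.
class SingleFlight<T> {
  private readonly inFlight = new Map<string, Promise<T>>();

  run(key: string, create: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing;

    const p = create().then(
      value => {
        this.inFlight.delete(key); // settled: later callers start fresh
        return value;
      },
      error => {
        this.inFlight.delete(key); // failures are not remembered either
        throw error;
      },
    );
    this.inFlight.set(key, p);
    return p;
  }
}
```

Note that the helper only deduplicates work that is still in flight. Remembering the finished result remains the job of the cache in front of it.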
3. Page pooling with request queuing
Each browser manager keeps up to 5 active pages. When a scan needs a page and the pool is full, it doesn't throw — it queues:
async acquirePage(): Promise<Page> {
this.lastActivityTime = Date.now();
// Under capacity: create a new page
if (this.activePagesSet.size < this.maxPoolSize) {
const browser = await this.getBrowser();
const pageContext = await browser.newContext(this.buildContextOptions());
const page = await pageContext.newPage();
page.on('close', () => {
pageContext.close().catch(e =>
logger.warn('Failed to close browser context', { error: e.message })
);
});
this.activePagesSet.add(page);
const blocker = await this.getAdBlocker();
if (blocker) {
await blocker.enableBlockingInPage(page);
}
return page;
}
// Pool full — queue with timeout
return new Promise((resolve, reject) => {
const timeout = setTimeout(() => {
const idx = this.pageRequestQueue.findIndex(req => req.timeout === timeout);
if (idx !== -1) this.pageRequestQueue.splice(idx, 1);
reject(new ScanError(
ErrorCode.BROWSER_LAUNCH_FAILED,
`Timed out waiting for available page (${this.maxQueueWaitTimeMs / 1000}s)`,
false
));
}, this.maxQueueWaitTimeMs);
this.pageRequestQueue.push({ resolve, reject, timestamp: Date.now(), timeout });
});
}
Queue processing is event-driven + periodic
When a page is released, the queue is processed immediately:
async releasePage(page: Page): Promise<void> {
this.activePagesSet.delete(page);
await this.closePage(page);
// Don't wait for the 5s timer — process now
this.processQueue();
}
But there's also a fallback 5-second interval timer. Why both? If a page release triggers an error during close, processQueue() might not be called. The interval ensures queued requests don't starve.
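The architecture diagram lists a queue cap of 20, which the acquirePage() excerpt doesn't show. One way such a cap could be enforced alongside the per-request timeout is sketched below. BoundedWaitQueue and its method names are hypothetical, and rejecting immediately when the queue is full is an assumption about the desired behaviour.

```typescript
interface Waiter<T> {
  resolve: (value: T) => void;
  reject: (reason: Error) => void;
  timer: ReturnType<typeof setTimeout>;
}

// Hypothetical bounded FIFO of waiting requests, each with its own timeout.
class BoundedWaitQueue<T> {
  private readonly waiters: Waiter<T>[] = [];

  constructor(
    private readonly maxLength: number,
    private readonly maxWaitMs: number,
  ) {}

  get length(): number {
    return this.waiters.length;
  }

  enqueue(): Promise<T> {
    if (this.waiters.length >= this.maxLength) {
      // Fail fast rather than letting the backlog grow without bound.
      return Promise.reject(new Error(`Queue full (${this.maxLength} waiting)`));
    }
    return new Promise<T>((resolve, reject) => {
      const waiter: Waiter<T> = {
        resolve,
        reject,
        timer: setTimeout(() => {
          const idx = this.waiters.indexOf(waiter);
          if (idx !== -1) this.waiters.splice(idx, 1);
          reject(new Error(`Timed out after ${this.maxWaitMs}ms`));
        }, this.maxWaitMs),
      };
      this.waiters.push(waiter);
    });
  }

  // Hand a value to the oldest waiter; returns false if nobody is waiting.
  fulfilNext(value: T): boolean {
    const waiter = this.waiters.shift();
    if (!waiter) return false;
    clearTimeout(waiter.timer);
    waiter.resolve(value);
    return true;
  }
}
```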
processQueue() itself is simple: dequeue the oldest request, clear its timeout, call acquirePage() (which will now have capacity), and resolve/reject the promise:
private processQueue(): void {
if (this.isShuttingDown || this.pageRequestQueue.length === 0 ||
this.activePagesSet.size >= this.maxPoolSize) {
return;
}
const request = this.pageRequestQueue.shift()!;
clearTimeout(request.timeout);
this.acquirePage()
.then(page => request.resolve(page))
.catch(error => request.reject(error));
}
4. Page isolation via browser contexts
Every page gets its own Playwright BrowserContext — not just its own page. This means:
- Isolated cookies, localStorage, and sessionStorage — one scan's login state can't leak into another.
- Per-page viewport, device scale, and user agent: mobile scan contexts use isMobile: true and touch emulation, desktop contexts don't.
- Automatic cleanup: the page's close event auto-closes its context.
private buildContextOptions(): Record<string, any> {
  // Typed as an open record so optional keys can be added below
  const opts: Record<string, any> = {
    viewport: this.config.viewport,
    deviceScaleFactor: this.config.deviceScaleFactor ?? 1,
    javaScriptEnabled: true,
    ignoreHTTPSErrors: true,
  };
  if (this.config.userAgent) opts.userAgent = this.config.userAgent;
  // Browser-type awareness
  const isLinuxWebKit = this.browserType === 'webkit' && process.platform === 'linux';
  // isMobile is skipped on Firefox and on Linux WebKit
  if (this.browserType !== 'firefox' && !isLinuxWebKit) {
    opts.isMobile = this.config.isMobile ?? false;
  }
  // hasTouch is skipped only on Linux WebKit
  if (!isLinuxWebKit) {
    opts.hasTouch = this.config.hasTouch ?? false;
  }
  return opts;
}
The browser-type gating avoids runtime errors. Setting isMobile on Firefox or hasTouch on Linux WebKit would cause Playwright to throw, so those flags are left out of the context options for those browsers.
5. Ad blocking: shared engine, per-page enablement
Ad and cookie-banner blocking uses @ghostery/adblocker-playwright. The engine is a 30MB parsed filter list — creating one per page would be catastrophic. So it's shared at the module level:
// Module-level cache — one engine per filter list combination
const SHARED_BLOCKER_CACHE = new Map<string, Promise<PlaywrightBlocker>>();
private async getAdBlocker(): Promise<PlaywrightBlocker | null> {
const filterUrls: string[] = [];
if (this.blockAds) filterUrls.push('https://easylist.to/easylist/easylist.txt');
if (this.blockCookieBanners) {
filterUrls.push('https://secure.fanboy.co.nz/fanboy-cookiemonster.txt');
filterUrls.push('https://secure.fanboy.co.nz/fanboy-annoyance.txt');
}
if (this.blockTrackers) filterUrls.push('https://easylist.to/easylist/easyprivacy.txt');
if (filterUrls.length === 0) return null;
const cacheKey = [...filterUrls].sort().join('|');
let cached = SHARED_BLOCKER_CACHE.get(cacheKey);
if (!cached) {
cached = PlaywrightBlocker.fromLists(fetch, filterUrls).catch(error => {
SHARED_BLOCKER_CACHE.delete(cacheKey); // Don't cache failures
throw error;
});
SHARED_BLOCKER_CACHE.set(cacheKey, cached);
}
return cached;
}
The engine is then enabled per page via enableBlockingInPage(page). Because the cache lives at module level, 20 concurrent pages that use the same filter lists share one 30MB engine instead of holding 20 copies (600MB).
6. Multi-auth: five strategies, one interface
Authenticated scanning supports five strategies through a unified config:
| Strategy | Config shape |
|---|---|
| basic | { type: 'basic', username, password } |
| bearer | { type: 'bearer', token } |
| cookie | { type: 'cookie', cookies: [{ name, value }] } |
| ntlm | { type: 'ntlm', username, password } |
| ui | { type: 'ui', usernameSelector, passwordSelector } or { type: 'ui', steps: [...] } |
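The table maps naturally onto a discriminated union. The sketch below is one way to type it. AuthConfig and describeAuth are hypothetical names, the field types are assumptions (the table only gives field names), and the element type of steps is left as unknown because the post elides it.

```typescript
// Field names come from the table above; the types are assumed.
type AuthConfig =
  | { type: 'basic'; username: string; password: string }
  | { type: 'bearer'; token: string }
  | { type: 'cookie'; cookies: { name: string; value: string }[] }
  | { type: 'ntlm'; username: string; password: string }
  | { type: 'ui'; usernameSelector: string; passwordSelector: string }
  | { type: 'ui'; steps: unknown[] };

// Switching on `type` lets the compiler narrow to the matching shape.
function describeAuth(config: AuthConfig): string {
  switch (config.type) {
    case 'basic':
    case 'ntlm':
      return `${config.type} credentials for ${config.username}`;
    case 'bearer':
      return 'bearer token';
    case 'cookie':
      return `${config.cookies.length} cookie(s)`;
    case 'ui':
      return 'steps' in config
        ? `${config.steps.length} scripted step(s)`
        : 'form login';
  }
}
```

Static types only help with configs written in code. Configs loaded from a database or an API still arrive as untyped data, so a runtime check remains necessary.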
Each config is validated before use:
private validateAuthConfig(config: unknown): boolean {
  // `unknown` does not allow property access, so narrow once up front
  if (typeof config !== 'object' || config === null) return false;
  const c = config as Record<string, any>;
  switch (c.type) {
    case 'basic':
    case 'ntlm':
      return typeof c.username === 'string'
        && typeof c.password === 'string';
    case 'bearer':
      return typeof c.token === 'string';
    case 'cookie':
      return Array.isArray(c.cookies)
        && c.cookies.every((cookie: any) => Boolean(cookie?.name && cookie?.value));
    case 'ui':
      // Boolean() keeps the declared return type honest
      return Boolean(
        (c.usernameSelector && c.passwordSelector) ||
        (Array.isArray(c.steps) && c.steps.length > 0)
      );
    default:
      return false;
  }
}
Session caching via Redis
After a successful login, the browser state (cookies, localStorage) is saved to Redis with a TTL:
async authenticate(page, url, authConfig, sessionId) {
// 1. Try saved session first
if (sessionId) {
const hasSession = await authService.hasAuthSession(sessionId);
if (hasSession) {
const restored = await authService.restoreAuthSession(page, sessionId);
if (restored.success) {
await authService.refreshAuthSession(sessionId); // Extend TTL
return { success: true, page: restored.page };
}
// Session stale — delete and fall through
await authService.deleteAuthSession(sessionId);
}
}
// 2. Fresh authentication
await page.goto('about:blank'); // Clean slate
const result = await authService.authenticate(page, url, authConfig);
if (!result.success) {
return { success: false, page, error: { code: 'AUTH_FAILED', ... } };
}
// 3. Cache the session for next scan
if (sessionId) {
await authService.saveAuthSession(result.page, sessionId);
}
return { success: true, page: result.page };
}
This matters because a project might run 50 scheduled scans in a batch. Without session caching, every single scan would re-login, triggering rate limits, audit logs, and potentially MFA prompts.
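The save, refresh, and expire cycle can be illustrated without a Redis server. The in-memory store below is only a stand-in: SessionStore, InMemorySessionStore, and the injected clock are hypothetical names, not the authService API, and the session state is reduced to an opaque string.

```typescript
// Hypothetical store contract mirroring has/restore, save, refresh, delete.
interface SessionStore {
  get(id: string): string | undefined;   // serialized browser state, if still valid
  set(id: string, state: string): void;  // save after a successful login
  touch(id: string): void;               // extend the TTL after a successful restore
  delete(id: string): void;              // drop a stale session
}

class InMemorySessionStore implements SessionStore {
  private readonly entries = new Map<string, { state: string; expiresAt: number }>();

  constructor(
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now, // injectable for testing
  ) {}

  get(id: string): string | undefined {
    const entry = this.entries.get(id);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(id); // expired: behave as if it was never stored
      return undefined;
    }
    return entry.state;
  }

  set(id: string, state: string): void {
    this.entries.set(id, { state, expiresAt: this.now() + this.ttlMs });
  }

  touch(id: string): void {
    const entry = this.entries.get(id);
    if (entry) entry.expiresAt = this.now() + this.ttlMs;
  }

  delete(id: string): void {
    this.entries.delete(id);
  }
}
```

In this model a batch of scans keeps the session alive as long as each restore calls touch() before the TTL runs out, and the first scan after expiry falls back to a fresh login.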
7. Stealth: evading bot detection
Many sites block headless browsers. For local Chromium, I use playwright-extra with the stealth plugin:
// Applied once at browser-manager construction
if (this.browserType === 'chromium') {
playwrightExtraChromium.use(stealth());
}
// Later, at launch time:
if (this.browserType === 'chromium') {
browser = await playwrightExtraChromium.launch({
headless: this.config.headless,
args: ['--no-sandbox', '--disable-setuid-sandbox',
'--disable-dev-shm-usage', '--disable-gpu'],
});
}
The stealth plugin patches navigator.webdriver, navigator.plugins, navigator.languages, window.chrome, and other fingerprints that sites use to detect automation. For remote Browserless deployments, stealth isn't needed, because Browserless itself presents as a real browser.
8. Error resilience: discriminated errors + smart retry
Not all errors should be retried. A 401 Unauthorized won't fix itself. An ECONNRESET on the CDP connection probably will. The system uses discriminated scan errors:
// An enum rather than a type alias, so ErrorCode.X also exists as a runtime value
enum ErrorCode {
  BROWSER_LAUNCH_FAILED = 'BROWSER_LAUNCH_FAILED', // Retryable: browser processes crash
  PAGE_LOAD_FAILED = 'PAGE_LOAD_FAILED',           // Retryable: transient network issues
  AUTH_FAILED = 'AUTH_FAILED',                     // NOT retryable: bad credentials
  SCAN_TIMEOUT = 'SCAN_TIMEOUT',                   // Retryable: slow page, might load next time
  SCAN_CANCELLED = 'SCAN_CANCELLED',               // NOT retryable: user-requested
  ADBLOCKER_INIT_FAILED = 'ADBLOCKER_INIT_FAILED', // Depends: retryable if network, not if config
  UNKNOWN_ERROR = 'UNKNOWN_ERROR',                 // Conservative: NOT retryable
}
The retry handler uses exponential backoff with 30% jitter:
new RetryHandler({
maxRetries: 3,
baseDelay: 1000, // 1s
maxDelay: 8000, // 8s (1s × 2^3)
retryableErrors: [
ErrorCode.BROWSER_LAUNCH_FAILED,
ErrorCode.PAGE_LOAD_FAILED,
ErrorCode.SCAN_TIMEOUT
]
});
At the browser-launch level, there's a separate retry loop with configuration fallbacks:
// Browser launch retry loop with config fallbacks
let retryCount = 0;
let lastError: unknown;
while (retryCount <= 2) {
  try {
    const browser = await playwrightExtraChromium.launch({
      headless: this.config.headless,
      args: chromiumArgs,
      timeout: launchTimeout,
    });
    return browser;
  } catch (error) {
    lastError = error;
    retryCount++;
    if (retryCount > 2) break; // Out of attempts: no point sleeping again
    if (retryCount === 2 && this.browserType === 'chromium') {
      // Final attempt: try the newer headless mode
      chromiumArgs.push('--headless=new');
    }
    await sleep(2000 * Math.pow(2, retryCount - 1));
  }
}
throw lastError; // Surface the failure instead of silently returning undefined
The --headless=new fallback is important: some sites detect and block the old headless mode, but the newer mode, which runs the regular Chrome browser code without showing a window, often gets through.
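Returning to the RetryHandler settings above, the delay schedule they imply can be sketched as a pure function. The formula is an assumption: the post only states "exponential backoff with 30% jitter", so both the symmetric jitter and the backoffDelay name are illustrative.

```typescript
// Assumed formula: min(maxDelay, baseDelay * 2^attempt), then +/- jitterRatio.
function backoffDelay(
  attempt: number, // 0 for the first retry
  baseDelay: number = 1000,
  maxDelay: number = 8000,
  jitterRatio: number = 0.3,
  random: () => number = Math.random, // injectable so the result can be pinned in tests
): number {
  const exponential = Math.min(maxDelay, baseDelay * Math.pow(2, attempt));
  // random() is in [0, 1), so the jitter factor falls in [-1, 1)
  const jitter = exponential * jitterRatio * (random() * 2 - 1);
  return Math.round(exponential + jitter);
}
```

With the jitter pinned to zero, attempts 0 through 3 give 1000, 2000, 4000, and 8000 ms, and later attempts stay capped at 8000 ms. The jitter spreads retries out so that scans which failed together don't all hit the browser pool again at the same instant.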
9. Remote Browserless: CDP with protocol fallback
In production, browsers run on a separate Browserless cluster. The connection code handles both CDP (Browserless native) and standard Playwright WebSocket protocols:
private async connectToRemoteBrowser(endpoint: string): Promise<Browser> {
  const wsEndpoint = endpoint
    .replace('http://', 'ws://')
    .replace('https://', 'wss://');
  const maxAttempts = 3;
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Try CDP first (Browserless speaks this natively)
      const browser = await playwrightChromium.connectOverCDP(wsEndpoint, { timeout: 30000 });
      this.setupBrowserListeners(browser);
      return browser;
    } catch (cdpError) {
      lastError = cdpError;
      const msg = cdpError instanceof Error ? cdpError.message : String(cdpError);
      if (msg.includes('Protocol error') || msg.includes('undefined')) {
        // Not Browserless: try standard Playwright connect
        try {
          const browser = await playwrightChromium.connect(wsEndpoint, { timeout: 30000 });
          this.setupBrowserListeners(browser);
          return browser;
        } catch {
          // Both protocols failed: keep the original CDP error for reporting
        }
      }
    }
    if (attempt < maxAttempts) {
      // Same backoff shape as the launch loop: 2s, then 4s
      await sleep(2000 * Math.pow(2, attempt - 1));
    }
  }
  throw lastError;
}
The retry loop (3 attempts with exponential backoff) handles transient connection failures. The disconnection listener cleans up state:
browser.on('disconnected', () => {
logger.warn(`Browser disconnected`, {
endpoint: this.lastRemoteEndpoint,
activePages: this.activePagesSet.size,
queueLength: this.pageRequestQueue.length,
});
if (this.browser === browser) {
this.browser = undefined; // Force reconnection on next acquirePage
}
});
10. Idle detection: auto-shutdown inactive browsers
A browser with no active pages for 5 minutes gets shut down:
constructor(config, maxPoolSize = 5, idleTimeoutMs = 300000) {
this.idleCheckTimer = setInterval(
() => this.checkIdleTimeout(idleTimeoutMs),
60000 // Check every minute
);
this.idleCheckTimer.unref(); // Don't keep the process alive
}
private checkIdleTimeout(idleTimeoutMs: number): void {
if (this.isShuttingDown || this.activePagesSet.size > 0) return;
const idleTime = Date.now() - this.lastActivityTime;
if (idleTime > idleTimeoutMs && this.browser) {
const browserToClose = this.browser;
this.browser = undefined; // Clear immediately
browserToClose.close()
.then(() => logger.info('Closed idle browser'))
.catch(err => logger.error('Failed to close idle browser', { err }));
}
}
The .unref() on the timer is important: it prevents the idle check from keeping the entire Node.js process alive if there's nothing else running.
Key takeaways
- Two-layer pooling (browser LRU + page queues) matches the cost structure. Browsers are expensive and shared across scans; pages are cheap and recycled within a scan batch.
- LRU dispose must handle async. Browser closing is async, but the LRU's dispose is sync. Track shutdown promises in a Set and await them during graceful shutdown.
- A module-level adblocker cache saves ~30MB × N. One engine per filter config, shared across all pages via enableBlockingInPage().
- Auth sessions in Redis eliminate re-logins across batch scans, which is critical for rate-limited or MFA-protected sites.
- Error discrimination enables selective retry. Don't retry AUTH_FAILED (wrong password); do retry BROWSER_LAUNCH_FAILED (CDP hiccup).
- CDP-first, Playwright-fallback connection handles both Browserless and generic Playwright servers without configuration flags.
