Tags: accessibility, architecture, TypeScript, infrastructure, engineering

Building a Production-Grade Accessibility Scanning Platform

An architectural deep-dive into A11yNow — the distributed web accessibility auditing backend I built at BarrierBreak. Browser pooling, distributed scheduling, multi-auth, error resilience, and a 600MB Docker image.

By Criston Mascarenhas, Senior Software Engineer | 13 min read
Architecture deep-dive header showing the scan pipeline: HTTP to Postgres to Redis and BullMQ to Worker and Browser to Results

When your job is to scan thousands of web pages for WCAG 2.1 compliance across multiple browsers, device viewports, and authentication schemes — all while handling scheduled cron jobs, queued workloads, distributed locking, and screenshots — you quickly learn that the hard part isn't the accessibility rules. It's the infrastructure.

This post walks through the architecture of A11yNow, the backend I designed and built at BarrierBreak to automate accessibility auditing at scale. Every decision here was driven by real production pain: browser memory leaks, duplicate scheduled executions, OOM kills, and flaky auth sessions.

1. The scan pipeline (at 10,000 feet)

A scan request flows through five stages:

HTTP POST /scan  →  PostgreSQL (record)  →  Redis/BullMQ (queue)  →  Worker (browser + scan engine)  →  PostgreSQL (results)

Here's the exact call chain:

  1. The controller validates input, checks usage quotas, and fetches project settings once (no N+1).
  2. The core scanner service writes a scan-result row in PostgreSQL with status pending, then enqueues the job into BullMQ.
  3. BullMQ stores the job in Redis with 3 retry attempts, exponential backoff, and bounded retention (1h completed / 24h failed).
  4. The scan worker unpacks the job payload into a scan context and calls into the scanner service.
  5. The scanner service runs the real work:
private async executeScan(context: ScanContext, persistence: IScanPersistence) {
  // 1. Get or create a per-project browser manager (LRU-cached)
  const browserManager = await this.getProjectBrowserManager(
    context.projectId, context.browserType, context.devicePreset
  );
  // let, not const: auth and screenshot steps may hand back a different page
  let page = await browserManager.acquirePage();
 
  // 2. Authenticate if needed (basic, bearer, cookie, NTLM, or multi-step UI)
  if (authConfig) {
    const result = await this.authHandler.authenticate(page, url, authConfig, sessionId);
    page = result.page;
  }
 
  // 3. Navigate and execute the accessibility engine
  await page.goto(url);
  const result = await this.scanExecutor.execute(page, context, this.config);
 
  // 4. Store issues with SHA-256 fingerprinting (deduplication)
  const createdIssues = await storeIssues(result.issues, projectId, pageId, ...);
 
  // 5. Capture screenshots of each issue (sequential — can't parallelise DOM highlights)
  if (context.takeScreenshot) {
    page = await this.processScreenshotsSequential(page, createdIssues, context);
  }
 
  // 6. Update status to COMPLETED only after all data is persisted
  await persistence.updateStatus(context.scanId, ScanStatus.COMPLETED);
}

The key insight: status is only set to COMPLETED after issues are stored and screenshots are uploaded to S3. There's no "partial success" state — the scan is either fully done or it's still in progress.
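
The retry and retention settings from step 3 could be expressed as BullMQ job options roughly as follows. This is a sketch, not the production configuration: the option names (attempts, backoff, removeOnComplete, removeOnFail) are BullMQ's, but the 2-second base delay is an assumed value, and the local ScanJobOptions type exists only to keep the example self-contained. Note that BullMQ's attempts counts the first run as well.

```typescript
// Local type mirroring the subset of BullMQ job options used here,
// so the sketch compiles without the bullmq package installed.
interface ScanJobOptions {
  attempts: number;
  backoff: { type: 'exponential' | 'fixed'; delay: number };
  removeOnComplete: { age: number }; // age is in seconds
  removeOnFail: { age: number };     // age is in seconds
}

const HOUR_IN_SECONDS = 60 * 60;

const scanJobOptions: ScanJobOptions = {
  attempts: 3,
  backoff: { type: 'exponential', delay: 2000 }, // base delay is an assumption
  removeOnComplete: { age: 1 * HOUR_IN_SECONDS }, // keep completed jobs for 1h
  removeOnFail: { age: 24 * HOUR_IN_SECONDS },    // keep failed jobs for 24h
};
```

An object like this would be passed as the third argument to queue.add(name, data, options).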

2. Browser pooling with LRU + page queues

You can't launch a new Chromium instance for every scan. A single browser process eats 300MB or more. With dozens of concurrent scans, you'd hit OOM kills in minutes. My solution: per-project browser pooling with LRU eviction and a page request queue.

LRU cache of browser managers

Each project gets its own browser manager instance, cached by projectId:browserType:devicePreset:

this.projectBrowserManagers = new LRUCache<string, IBrowserManager>({
  max: 12,                // Max 12 concurrent browser instances
  ttl: 1000 * 60 * 10,    // 10-minute TTL
  ttlAutopurge: true,     // Auto-evict stale browsers
  updateAgeOnGet: true,   // Reset TTL on access (prevent mid-scan eviction)
  dispose: (value, key) => {
    // Async shutdown tracked via pendingDisposals set
    const p = value.shutdown().finally(() => this.pendingDisposals.delete(p));
    this.pendingDisposals.add(p);
  },
});

When the LRU reaches max, the least-recently-used browser is evicted and gracefully shut down. The dispose handler tracks shutdown promises so the graceful-shutdown handler can await them.

Page request queue (not just another pool)

Each browser manager maintains a configurable pool of Playwright Page objects (default 5). When the pool is full, requests are queued rather than throwing:

async acquirePage(): Promise<Page> {
  if (this.activePagesSet.size < this.maxPoolSize) {
    const browser = await this.getBrowser();
    const pageContext = await browser.newContext(contextOptions);
    const page = await pageContext.newPage();
    this.activePagesSet.add(page);
    return page;
  }
 
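  // Pool full: queue the request and reject it if no page frees up in time
  // (maxQueueWaitTimeMs, 120s)
  return new Promise<Page>((resolve, reject) => {
    const timeout = setTimeout(() => {
      // Remove this request from the queue so a later release can't resolve it
      const index = this.pageRequestQueue.findIndex((r) => r.timeout === timeout);
      if (index !== -1) this.pageRequestQueue.splice(index, 1);
      reject(new Error('Timed out waiting for available page'));
    }, this.maxQueueWaitTimeMs);
 
    this.pageRequestQueue.push({ resolve, reject, timestamp: Date.now(), timeout });
  });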
}

When a page is released, the queue is processed immediately (not just on a timer). Requests are served in under 5ms when capacity is available, and the queue can hold up to 20 waiting requests before the manager starts logging backpressure warnings.
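
The release side of that queue can be sketched like this. It is a simplified stand-in rather than the real browser manager: it pools generic resources instead of Playwright pages, and HandoffPool and all of its members are hypothetical names. The point it illustrates is that release() hands capacity straight to the oldest waiter instead of waiting for a timer.

```typescript
type Waiter<T> = {
  resolve: (resource: T) => void;
  timeout: ReturnType<typeof setTimeout>;
};

class HandoffPool<T> {
  private readonly active = new Set<T>();
  private readonly waiters: Waiter<T>[] = [];
  private readonly create: () => T;
  private readonly maxSize: number;
  private readonly maxWaitMs: number;

  constructor(create: () => T, maxSize: number, maxWaitMs: number) {
    this.create = create;
    this.maxSize = maxSize;
    this.maxWaitMs = maxWaitMs;
  }

  acquire(): Promise<T> {
    if (this.active.size < this.maxSize) {
      const resource = this.create();
      this.active.add(resource);
      return Promise.resolve(resource);
    }
    // Pool full: park the caller until release() or the timeout fires
    return new Promise<T>((resolve, reject) => {
      const timeout = setTimeout(() => {
        const index = this.waiters.findIndex((w) => w.timeout === timeout);
        if (index !== -1) this.waiters.splice(index, 1);
        reject(new Error('Timed out waiting for available resource'));
      }, this.maxWaitMs);
      this.waiters.push({ resolve, timeout });
    });
  }

  release(resource: T): void {
    this.active.delete(resource);
    const next = this.waiters.shift(); // oldest waiter first (FIFO)
    if (!next) return;
    clearTimeout(next.timeout);
    const fresh = this.create(); // a fresh resource, so no state leaks between users
    this.active.add(fresh);
    next.resolve(fresh);
  }
}
```

Creating a fresh resource for each waiter mirrors the one-context-per-page isolation described in this section; a pool that reuses resources would resolve the waiter with the released one instead.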

Why not a generic connection pool?

Generic pools (like generic-pool) work for database connections. Browser pages are different:

  • Each page needs its own browser context (isolated cookies, localStorage).
  • Ad blocking is enabled per-page via a shared PlaywrightBlocker engine (30MB, cached at module level to avoid N× duplication).
  • Page-close events auto-close their context for clean teardown.
  • Context options are browser-type aware (Firefox skips mobile emulation, Linux WebKit skips touch).

3. The scheduling system: distributed locks done right

Scheduled scans were the hardest production bug to fix. The original implementation had race conditions: two instances or two cron ticks would both pick up the same "due" schedule and queue duplicate jobs. The fix had three prongs (3a to 3c), plus a startup guard (3d).

3a. Distributed locking via Redis SET NX

async acquire(options: LockOptions = {}): Promise<boolean> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const result = await redis.set(
      `lock:${this.lockKey}`,
      this.lockValue,
      'PX', ttl,  // Millisecond expiry
      'NX'        // Only set if not exists
    );
    if (result === 'OK') { this.acquired = true; return true; }
    await this.sleep(retryDelay);
  }
  return false;
}
 
async release(): Promise<boolean> {
  // Lua script: atomic compare-and-delete (only the lock owner can release)
  const script = `
    if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("del", KEYS[1])
    else return 0 end
  `;
  // Use the same prefixed key that acquire() set, otherwise the delete never matches
  const result = await redis.eval(script, 1, `lock:${this.lockKey}`, this.lockValue);
  // ...
}

Lock acquisition uses SET NX PX — Redis's native atomic "set if not exists with expiry." The release uses a Lua script for atomic compare-and-delete, preventing a stale client from releasing someone else's lock.
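
A lock like this is easiest to use safely through a small wrapper that always releases in a finally block. The sketch below assumes only the acquire()/release() shape shown above; withLock and the DistributedLock interface are hypothetical names, not part of the codebase.

```typescript
interface DistributedLock {
  acquire(): Promise<boolean>;
  release(): Promise<boolean>;
}

// Runs task only if the lock was acquired; returns undefined when another
// instance already holds it. The lock is released even if task throws.
async function withLock<T>(
  lock: DistributedLock,
  task: () => Promise<T>,
): Promise<T | undefined> {
  if (!(await lock.acquire())) return undefined;
  try {
    return await task();
  } finally {
    await lock.release();
  }
}
```

The TTL on the key still matters with a wrapper like this: if the process dies before the finally block runs, expiry is the only thing that frees the lock.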

3b. Idempotent job IDs

Each scheduled execution gets a deterministic job ID:

const jobId = ScheduleCalculator.generateJobId(schedule.id, now);
// e.g. "schedule-abc123-2026-06-16"
 
const existingExecution = await prisma.scheduleExecution.findUnique({
  where: { jobId }
});
 
if (existingExecution) {
  logger.info('Schedule already executed today, skipping');
  return;
}

Even if the lock fails, the database enforces idempotency: the execution record has a unique constraint on jobId.
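
For illustration, a deterministic ID of that shape could be derived as below. This is a hypothetical stand-in for ScheduleCalculator.generateJobId, assuming one execution per schedule per UTC day, which is what the example ID and the "already executed today" log message suggest.

```typescript
// Hypothetical sketch: the same schedule on the same UTC day always
// yields the same ID, so a second tick collides on the unique constraint.
function generateJobId(scheduleId: string, now: Date): string {
  const day = now.toISOString().slice(0, 10); // "YYYY-MM-DD" in UTC
  return `schedule-${scheduleId}-${day}`;
}

const morningTick = generateJobId('abc123', new Date('2026-06-16T09:00:00Z'));
const eveningTick = generateJobId('abc123', new Date('2026-06-16T21:30:00Z'));
// Both ticks produce "schedule-abc123-2026-06-16"
```

A schedule that runs more than once a day would need a finer time bucket in the ID (for example the hour), otherwise its later runs would be skipped as duplicates.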

3c. Atomic nextRunAt update (before queuing)

The critical race is in the ordering of two steps: updating nextRunAt and queuing the scans. If the update happens after queuing and the process crashes in between, the schedule appears "due" again on the next tick.

// Transaction: atomically update schedule + create execution record
const execution = await prisma.$transaction(async (tx) => {
  await tx.scanSchedule.update({
    where: { id: schedule.id },
    data: { lastRunAt: now, nextRunAt }
  });
  return await tx.scheduleExecution.create({
    data: { scheduleId: schedule.id, status: 'PENDING', jobId, ... }
  });
});
 
// Only NOW queue the actual scans
for (const page of schedule.project.pages) {
  await this.scanQueue.addScan({ ... });
}

nextRunAt is advanced before anything is queued, inside the same database transaction as the execution record. If the process dies mid-queue, the next tick sees a future nextRunAt and skips it.

3d. First-run skip

On startup, the cron fires immediately (within 1 minute). Without protection, every deployment would queue every due schedule:

if (this.isFirstRun) {
  logger.info('First run - skipping all schedules');
  this.isFirstRun = false;
  return;
}

4. Multi-auth: five authentication strategies

Authenticated pages are common in enterprise a11y testing — internal dashboards, staging environments, client portals. The authentication handler supports five strategies:

Type    Mechanism                      Validation
basic   HTTP Basic Auth header         username + password required
bearer  Authorization: Bearer header   token required
cookie  Set named cookies              array of {name, value} objects
ntlm    Windows Integrated Auth        username + password (CNTLM proxy)
ui      Multi-step form login          usernameSelector + passwordSelector, or a steps[] array
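
The validation column maps naturally onto a discriminated union. The sketch below is one way to model it, not the actual handler: the type names, the token and cookies field names, and the error messages are assumptions made for illustration.

```typescript
type AuthConfig =
  | { type: 'basic'; username: string; password: string }
  | { type: 'bearer'; token: string }
  | { type: 'cookie'; cookies: { name: string; value: string }[] }
  | { type: 'ntlm'; username: string; password: string }
  | { type: 'ui'; usernameSelector?: string; passwordSelector?: string; steps?: unknown[] };

// Returns a list of problems; an empty list means the config is usable.
function validateAuthConfig(config: AuthConfig): string[] {
  const errors: string[] = [];
  switch (config.type) {
    case 'basic':
    case 'ntlm':
      if (!config.username) errors.push('username is required');
      if (!config.password) errors.push('password is required');
      break;
    case 'bearer':
      if (!config.token) errors.push('token is required');
      break;
    case 'cookie':
      if (config.cookies.length === 0) errors.push('at least one cookie is required');
      if (config.cookies.some((c) => !c.name)) errors.push('every cookie needs a name');
      break;
    case 'ui': {
      const hasSelectors = Boolean(config.usernameSelector && config.passwordSelector);
      const hasSteps = Array.isArray(config.steps) && config.steps.length > 0;
      if (!hasSelectors && !hasSteps) {
        errors.push('provide usernameSelector + passwordSelector, or a steps[] array');
      }
      break;
    }
  }
  return errors;
}
```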

Sessions are persisted to Redis with a TTL. On the next scan, the saved session is restored — no need to re-login:

if (sessionId) {
  const hasSavedSession = await authService.hasAuthSession(sessionId);
  if (hasSavedSession) {
    const restored = await authService.restoreAuthSession(page, sessionId);
    if (restored.success) return { success: true, page: restored.page };
    // Session stale — delete and fall through to fresh login
    await authService.deleteAuthSession(sessionId);
  }
}
 
// Fresh authentication
await page.goto('about:blank'); // Clean slate
const authResult = await authService.authenticate(page, url, authConfig);
if (authResult.success && sessionId) {
  await authService.saveAuthSession(authResult.page, sessionId);
}

5. Error resilience: discriminated errors + retry with jitter

Errors in browser automation are messy: a page load might fail because of a network hiccup (retryable), or because the URL is a 404 (not retryable). The system uses discriminated scan errors:

type ErrorCode =
  | 'BROWSER_LAUNCH_FAILED'
  | 'PAGE_LOAD_FAILED'
  | 'AUTH_FAILED'
  | 'SCAN_TIMEOUT'
  | 'SCAN_CANCELLED'
  | 'ADBLOCKER_INIT_FAILED'
  | 'UNKNOWN_ERROR';
 
interface ScanError {
  code: ErrorCode;
  message: string;
  retryable: boolean;
}

The retry handler wraps scan execution with exponential backoff + 30% jitter:

const result = await this.retryHandler.execute(
  async () => this.executeScan(context, persistence),
  `scan-${context.scanId}`,
  {
    maxRetries: 3,
    retryableErrors: [
      // ErrorCode is a string-literal union (see above), so codes are plain strings
      'BROWSER_LAUNCH_FAILED',
      'PAGE_LOAD_FAILED',
      'SCAN_TIMEOUT'
    ]
  }
);
  • BROWSER_LAUNCH_FAILED is retryable — browser processes crash, CDP endpoints drop.
  • AUTH_FAILED is not retryable — wrong credentials won't fix themselves.
  • Local browser launch has its own retry loop (2 retries), falling back to --headless=new on the final attempt.
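
The delay calculation behind "exponential backoff + 30% jitter" can be sketched as follows. The 1-second base and the symmetric (plus or minus 30%) jitter are assumptions; the post only gives the jitter percentage, and backoffDelayMs is a hypothetical name. The random source is injectable so the function can be tested deterministically.

```typescript
function backoffDelayMs(
  attempt: number,             // 0 for the first retry
  baseMs = 1000,               // assumed base delay
  jitterRatio = 0.3,           // 30% jitter
  random: () => number = Math.random,
): number {
  const exponential = baseMs * 2 ** attempt;         // base, 2x base, 4x base, ...
  const spread = exponential * jitterRatio;          // largest deviation allowed
  const jitter = spread * (random() * 2 - 1);        // uniform in [-spread, +spread)
  return Math.round(exponential + jitter);
}
```

Jitter matters here because several scans tend to fail at the same moment (for example when a browser crashes), and without it they would all retry on the same tick.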

6. Multi-browser + remote Browserless

The scanner supports Chromium, Firefox, and WebKit via Playwright. Browser selection is environment-driven:

BROWSER_PROVIDER=auto         # prefers remote Browserless, falls back to local
BROWSER_PROVIDER=local        # always launches locally
BROWSER_PROVIDER=browserless  # always connects to remote CDP
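
Read as a decision function, those three modes look roughly like this. It is a sketch of the configuration-time choice only: resolveProvider is a hypothetical name, treating an unset variable as auto is an assumption, and any runtime fallback after a failed remote connection is not modelled.

```typescript
type ProviderTarget = 'local' | 'remote';

function resolveProvider(
  provider: string | undefined, // value of BROWSER_PROVIDER
  endpoint: string | undefined, // Browserless WebSocket endpoint, if configured
): ProviderTarget {
  switch (provider ?? 'auto') {
    case 'local':
      return 'local';
    case 'browserless':
      if (!endpoint) {
        throw new Error('BROWSER_PROVIDER=browserless requires a Browserless endpoint');
      }
      return 'remote';
    default:
      // auto: prefer remote when an endpoint is configured, otherwise launch locally
      return endpoint ? 'remote' : 'local';
  }
}
```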

Remote mode connects to Browserless via connectOverCDP():

if (shouldUseRemoteBrowser() && browserlessEndpoint) {
  // CDP first (Browserless native), Playwright protocol fallback
  try {
    const browser = await playwrightChromium.connectOverCDP(wsEndpoint, { timeout: 30000 });
    return browser;
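  } catch (cdpError) {
    // Normalise the caught value to a message string before inspecting it
    const errorMessage = cdpError instanceof Error ? cdpError.message : String(cdpError);
    if (errorMessage.includes('Protocol error')) {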
      // Not Browserless — try standard Playwright connect
      return await playwrightChromium.connect(wsEndpoint, { timeout: 30000 });
    }
    throw cdpError;
  }
}

Local Chromium uses playwright-extra with the stealth plugin to evade bot detection — critical for scanning sites that block headless browsers. Context options are browser-aware:

private buildContextOptions() {
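  // Typed as Playwright's BrowserContextOptions so isMobile can be assigned below
  const opts: BrowserContextOptions = { viewport, deviceScaleFactor, javaScriptEnabled: true, ignoreHTTPSErrors: true };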
  if (this.browserType === 'firefox') {
    // Skip isMobile — Firefox doesn't support it
  } else if (this.browserType === 'webkit' && process.platform === 'linux') {
    // Skip hasTouch — Linux WebKit doesn't support it
  } else {
    opts.isMobile = this.config.isMobile;
  }
  return opts;
}

7. The data model: issue fingerprinting for deduplication

With 35+ Prisma models, the schema is comprehensive. The core innovation is issue fingerprinting:

model Issue {
  id              String   @id @default(cuid())
  fingerprint     String?  // SHA-256 hash of code + selector + context
  code            String   // rule code (e.g. "BB10447")
  selector        String   // CSS selector of offending element
  context         String?  // Surrounding HTML snippet
  severity        String   // Critical | Major | Minor
  successCriteria String?  // WCAG SC reference (e.g. "1.1.1")
  screenshotKey   String?  // S3 key
  screenshotUrl   String?  // Presigned URL
  assignee        Int[]    @default([])
  reviewStatus    String   @default("open")
  activities      Json[]   @default([])
 
  occurrences     IssueOccurrence[] // Which scans found this issue
  project         Project   @relation(...)
  page            Page      @relation(...)
}

The fingerprint is a SHA-256 of code + selector + context. When the same <img> without alt text appears in scan #47, it's recognized as the same issue from scan #1 — no duplicate row, just a new IssueOccurrence record.
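
A fingerprint of that kind can be computed with Node's built-in crypto module. The sketch below is an illustration, not the production function: issueFingerprint is a hypothetical name, and serialising the three fields as a JSON array before hashing is my assumption, chosen so that field boundaries stay unambiguous.

```typescript
import { createHash } from 'node:crypto';

function issueFingerprint(code: string, selector: string, context: string | null): string {
  // JSON-encode the tuple so ("ab", "c") and ("a", "bc") hash differently
  const payload = JSON.stringify([code, selector, context ?? '']);
  return createHash('sha256').update(payload).digest('hex');
}

const scan1 = issueFingerprint('BB10447', 'main > img.hero', '<img src="hero.png">');
const scan47 = issueFingerprint('BB10447', 'main > img.hero', '<img src="hero.png">');
// scan1 === scan47, so scan #47 adds an IssueOccurrence instead of a new Issue
```

Because the selector and the HTML context are part of the hash, a markup change around the element produces a new fingerprint, so the same underlying problem can resurface as a new issue after a redesign.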

The scan-result model carries a batchId for grouping related scans, browserType and devicePreset for environment tracking, and a computed batchStatus for efficient querying.

8. The Docker build: 5 stages to 600MB

A Node.js app with Playwright, Prisma, adblocker engines, and three workspace packages is heavy. The monolithic node_modules alone can exceed 1GB. The Dockerfile uses five stages to strip it to ~600MB:

Stage 1 (base):         Node 22 Alpine + build tools (python3, make, g++)
Stage 2 (dependencies): Install ALL deps, skip Playwright browser downloads
Stage 3 (builder):      Compile TypeScript, build workspace packages, generate Prisma client
Stage 4 (prod-deps):    yarn workspaces focus --production, then aggressively clean
Stage 5 (production):   Node 22 Alpine runtime, non-root user, no browser binaries

The aggressive cleaning in stage 4 is worth looking at:

RUN yarn workspaces focus --production && \
    find node_modules -name "*.md" -delete && \
    find node_modules -name "*.ts" ! -name "*.d.ts" -delete && \
    find node_modules -name "*.map" -delete && \
    find node_modules -type d -name "test" -exec rm -rf {} + && \
    find node_modules -type d -name "tests" -exec rm -rf {} + && \
    find node_modules -type d -name "docs" -exec rm -rf {} + && \
    find node_modules -type d -name "examples" -exec rm -rf {} +

TypeScript source, source maps, tests, docs, examples, benchmarks, changelogs — all stripped. The final image:

  • Runs as a non-root nodejs user (UID 1001).
  • Has a health-check endpoint (curl /health).
  • Uses NODE_OPTIONS="--max-old-space-size=4096" for memory headroom.
  • Binds to 0.0.0.0 on the configured port.
  • Removes Playwright browser binaries (uses remote Browserless).

9. Graceful shutdown

Production deploys aren't polite. SIGTERM arrives with a deadline. The shutdown handler runs a five-step sequence:

const gracefulShutdown = async (signal: string) => {
  // 1. Stop polling services (JIRA, GitHub)
  jiraPollingService.stop();
  githubPollingService.stop();
 
  // 2. Stop JIRA sync worker
  await workerServices.jiraSyncWorker.stop();
 
  // 3. Stop the scheduler (prevents new cron triggers)
  workerServices.schedulerService.stop();
 
  // 4. Close HTTP server (stop accepting new requests)
  await new Promise(resolve => server.close(resolve));
 
  // 5. Wait 5s for active jobs to finish
  await new Promise(resolve => setTimeout(resolve, 5000));
 
  process.exit(0);
};
 
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));

The 5-second grace period gives BullMQ workers time to mark their current job as completed or failed before Redis loses the lock. In-flight browser shutdowns from LRU eviction can also be awaited before exit.
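
Awaiting those in-flight shutdowns without risking a hung exit could look like the sketch below. drainWithDeadline is a hypothetical helper; it assumes a collection of promises such as the pendingDisposals set from section 2 and caps the wait so a stuck browser can't block the deploy.

```typescript
async function drainWithDeadline(
  pending: Iterable<Promise<unknown>>,
  deadlineMs: number,
): Promise<'drained' | 'timed-out'> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<'timed-out'>((resolve) => {
    timer = setTimeout(() => resolve('timed-out'), deadlineMs);
  });
  // allSettled: a failed shutdown should not abort the wait for the others
  const drained = Promise.allSettled(Array.from(pending)).then(() => 'drained' as const);
  try {
    return await Promise.race([drained, deadline]);
  } finally {
    clearTimeout(timer); // don't keep the event loop alive after draining
  }
}
```

In the shutdown handler this would sit just before process.exit(0), for example as await drainWithDeadline(pendingDisposals, 5000), where pendingDisposals is whatever collection tracks the shutdown promises.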

10. What I'd do differently

  1. Structured concurrency. The active-scan and pending-disposal sets work but are fragile. A proper structured-concurrency primitive (like Effect, or explicit Promise.race with cleanup) would be safer.
  2. Browser manager as a standalone service. Right now, browser managers live inside the scanner service's LRU. A separate browser-pool service with its own lifecycle would be cleaner — the scanner shouldn't own browser state.
  3. More aggressive TypeScript strictness. strict: false was pragmatic for velocity, but it's hiding bugs. Gradual adoption of strictNullChecks would catch null-pointer issues that currently surface as runtime errors.
  4. Observability. Structured logging (Pino) is there, but there are no OpenTelemetry traces. A scan that takes 30 seconds should be traceable through queue → worker → browser → engine → DB — right now you grep logs.
  5. Worker autoscaling. Worker concurrency is static (default 5). Real scan workloads spike — a KEDA or custom autoscaler based on queue depth would handle bursts better than a fixed pool.

Key takeaways

  • LRU-cached browser pooling beats launching per-scan. ~300MB per browser process adds up fast.
  • Distributed locks need three layers: Redis SET NX (speed), idempotent job IDs (correctness), and atomic nextRunAt updates (race prevention).
  • Update state before side effects. Advancing nextRunAt in a transaction before queuing scans prevents duplicate execution on crash.
  • Docker image size matters for CI/CD velocity. Stripping TypeScript, docs, and tests from node_modules cut the image by ~40%.
  • Error discrimination enables smart retries. Not all failures should be retried — bad auth shouldn't, but a CDP disconnect should.