Building Production AI: A Multi-Tenant Platform Architecture

ai-platform · multi-tenant · production-architecture · cost-control · circuit-breakers · observability

Most AI products start with something simple: a single provider, hardcoded prompts, and minimal error handling. These prototypes demo well but collapse under scale: costs spiral, outages cascade, and teams lose confidence in making changes.

Matter & Gas was built as a multi-tenant AI platform that treats these problems as first-class concerns. From cost enforcement to prompt versioning, from real-time streaming to tenant isolation, the architecture is production-ready by design.

Multi-Provider Model Registry

Instead of scattering model configs across the codebase, Matter & Gas centralizes them in a Model Registry:

// amplify/functions/workflow-runner/src/modelCapabilities.ts
export const MODEL_REGISTRY: Record<string, ModelCapability> = {
  "gpt-4o": {
    provider: "openai",
    pricing: { inputCostPerUnit: 0.005, outputCostPerUnit: 0.02, unit: "1K tokens" },
    contextWindow: 128000,
    apiConventions: { supportsStreaming: true, supportsJSONMode: true }
  },
  "anthropic.claude-3-7-sonnet-20250219-v1:0": {
    provider: "anthropic",
    pricing: { inputCostPerUnit: 0.003, outputCostPerUnit: 0.015, unit: "1K tokens" },
    contextWindow: 200000,
    apiConventions: { supportsStreaming: true }
  }
};

Each entry defines pricing, context windows, tokenizer behavior, and capability flags.

  • A global DEFAULT_MODEL_ID is set in backend.ts and used if a workflow omits a modelId
  • If a modelId is invalid or missing from the registry, the runner fails fast (no silent fallback)
  • Bedrock-style IDs are normalized (e.g. us.anthropic... → anthropic...), as sketched below
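
A minimal resolution sketch (the helper name and the exact prefix rule are assumptions, since the normalization code isn't shown here):

// Hypothetical helper illustrating registry lookup; not the platform's actual code.
function resolveModel(modelId: string = DEFAULT_MODEL_ID): ModelCapability {
  // Normalize Bedrock-style regional IDs, e.g. "us.anthropic..." → "anthropic..."
  const normalized = modelId.replace(/^(us|eu|ap)\./, "");
  const capability = MODEL_REGISTRY[normalized];
  if (!capability) {
    // Fail fast: never silently substitute a different model
    throw new Error(`Unknown modelId: ${modelId}`);
  }
  return capability;
}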

This abstraction makes switching providers straightforward while preserving correctness.

Circuit Breaker Protection

Every model in the registry is wrapped with circuit breakers:

  • Thresholds → breaker opens if error rates exceed ~5% over 5 minutes
  • Fail-open → if health-check logic itself errors, traffic is allowed (prefer availability)
  • Persistence → state stored in DynamoDB (ModelCircuitBreaker), surviving cold starts
  • Manual overrides → operators can trip/reset with audit logging

This prevents cascading failures during provider instability.
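
A sketch of the check path, assuming one DynamoDB-backed state item per model (the table name follows the description above; the attribute names are illustrative):

import { DynamoDBClient, GetItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

// Illustrative breaker check; not the platform's actual code.
async function isModelAvailable(modelId: string): Promise<boolean> {
  try {
    const { Item } = await ddb.send(new GetItemCommand({
      TableName: "ModelCircuitBreaker",   // persisted state survives cold starts
      Key: { modelId: { S: modelId } },
    }));
    // An OPEN breaker rejects traffic for this model
    return Item?.state?.S !== "OPEN";
  } catch {
    // Fail open: if the health check itself errors, allow traffic
    return true;
  }
}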

Token Budget Enforcement

Uncontrolled token use is one of the fastest ways to lose money.

The TokenBudget enforces limits before requests are sent:

const result = await TokenBudget.enforce(
  modelConfig,
  requestedOutputTokens,
  estimatedInputTokens
);

if (!result.allowed) {
  throw new Error(`Budget exceeded: ${result.reason}`);
}

  • Violations → TOKEN_LIMIT_EXCEEDED, COST_LIMIT_EXCEEDED, CONTEXT_WINDOW_EXCEEDED
  • Estimation → provider-specific tokenizers when available (e.g. tiktoken); otherwise registry fallbacks
  • Truncation strategy → preserve system + user input, drop oldest memory next, truncate system only last (see the sketch after this list)
  • Buffer margin → ~10% headroom ensures no overflow at runtime
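
A sketch of that truncation order (the segment shape and helper are illustrative, not the platform's actual code):

interface PromptSegments {
  system: string;
  memory: string[];   // conversation memory, oldest first
  userInput: string;
}

function fitToBudget(
  segments: PromptSegments,
  maxInputTokens: number,
  countTokens: (text: string) => number,  // provider tokenizer or registry fallback
): PromptSegments {
  const budget = Math.floor(maxInputTokens * 0.9);  // ~10% headroom
  let { system } = segments;
  const memory = [...segments.memory];
  const used = () =>
    countTokens(system) +
    memory.reduce((n, m) => n + countTokens(m), 0) +
    countTokens(segments.userInput);

  // 1) Drop oldest memory first; system + user input are preserved
  while (used() > budget && memory.length > 0) memory.shift();

  // 2) Truncate the system prompt only as a last resort
  while (used() > budget && system.length > 0) system = system.slice(0, -100);

  return { system, memory, userInput: segments.userInput };
}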

Prompt Management (Git-Like Versioning)

Prompts are versioned artifacts, not inline strings.

The system includes:

  • BasePromptVersion → immutable, SHA-256 hashed, stored in DynamoDB or S3 (KMS encrypted)
  • ActivePromptPointer → mutable selector, with rollback to previousVersionId
  • CAS Update Lambda → atomic Compare-And-Set for safe deployments (sketched after this list)
  • PromptArchiveBucket → S3 archival with 1-year retention
  • AuditLog → every create/update/rollback recorded
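
The CAS flip itself can be expressed as a single conditional update; this sketch assumes the pointer lives in a DynamoDB table keyed by pointerId (attribute names are illustrative):

import { DynamoDBClient, UpdateItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

// Atomic Compare-And-Set: the write fails with ConditionalCheckFailedException
// if another deployment changed the pointer first.
async function activatePromptVersion(
  pointerId: string,
  expectedVersionId: string,  // what the caller believes is currently active
  newVersionId: string,
): Promise<void> {
  await ddb.send(new UpdateItemCommand({
    TableName: "ActivePromptPointer",
    Key: { pointerId: { S: pointerId } },
    ConditionExpression: "activeVersionId = :expected",
    UpdateExpression: "SET activeVersionId = :new, previousVersionId = :expected",
    ExpressionAttributeValues: {
      ":expected": { S: expectedVersionId },
      ":new": { S: newVersionId },
    },
  }));
}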

Resolution hierarchy:

  1. Tenant + Workflow + Model
  2. Workflow + Model
  3. Tenant + Model
  4. Model (global)
  5. Emergency fallback → neutral base prompt for DEFAULT_MODEL_ID

Resolution never fails: if nothing matches, the neutral base prompt is returned.
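
A sketch of that fallthrough (the key shapes and NEUTRAL_BASE_PROMPT are assumptions about code not shown):

// Illustrative resolution order; `lookup` stands in for the real store query.
async function resolvePrompt(
  tenantId: string,
  workflowId: string,
  modelId: string,
  lookup: (key: string) => Promise<string | undefined>,
): Promise<string> {
  const candidates = [
    `${tenantId}#${workflowId}#${modelId}`,  // 1. Tenant + Workflow + Model
    `${workflowId}#${modelId}`,              // 2. Workflow + Model
    `${tenantId}#${modelId}`,                // 3. Tenant + Model
    modelId,                                 // 4. Model (global)
  ];
  for (const key of candidates) {
    const prompt = await lookup(key);
    if (prompt !== undefined) return prompt;
  }
  return NEUTRAL_BASE_PROMPT;  // 5. Emergency fallback: resolution never fails
}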

Asynchronous Document Processing

Documents move through a fully async pipeline:

  1. Upload via GraphQL API
  2. Storage in S3 + enqueue job in SQS
  3. Processing worker:
    • Textract for OCR
    • Titan v2 embeddings for vectors
    • Pinecone upsert for storage
  4. Tracking → DynamoDB updates document status
  5. Resilience → DLQs at every stage (DocumentProcessingDLQ, TextractProcessingDLQ)

All failures are captured and retriable — no silent data loss.
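
A worker skeleton following those steps (the four helpers are hypothetical stand-ins for the Textract, Titan, Pinecone, and DynamoDB integrations described above):

import type { SQSHandler } from "aws-lambda";

declare function ocrWithTextract(s3Key: string): Promise<string>;
declare function embedWithTitan(text: string): Promise<number[][]>;
declare function upsertToPinecone(tenantId: string, documentId: string, vectors: number[][]): Promise<void>;
declare function markStatus(documentId: string, status: string): Promise<void>;

export const handler: SQSHandler = async (event) => {
  for (const record of event.Records) {
    const { documentId, s3Key, tenantId } = JSON.parse(record.body);
    try {
      await markStatus(documentId, "PROCESSING");             // DynamoDB tracking
      const text = await ocrWithTextract(s3Key);              // OCR
      const vectors = await embedWithTitan(text);             // Titan v2 embeddings
      await upsertToPinecone(tenantId, documentId, vectors);  // tenant namespace
      await markStatus(documentId, "COMPLETE");
    } catch (err) {
      await markStatus(documentId, "FAILED");
      throw err;  // SQS retries; exhausted retries land in the DLQ
    }
  }
};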

Real-Time Streaming & Collections

User experience is stream-first:

  • StreamToClient nodes send incremental responses with metadata:
    • tokensUsed
    • generationTimeMs
    • chunkNumber
    • isStreamingChunk
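
A chunk under this scheme might look like the following (field types are inferred from the list above; the exact shape is illustrative):

interface StreamChunk {
  content: string;            // incremental response text
  isStreamingChunk: boolean;  // false on the final, complete message
  chunkNumber: number;        // ordinal for client-side reassembly
  tokensUsed?: number;        // populated once usage is known
  generationTimeMs?: number;  // generation latency so far
}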

For retrieval, Collections enforce enterprise-grade ACLs:

  • Docs can belong to multiple collections
  • VectorSearch filters results against state.allowedDocumentIds (see the sketch after this list)
  • deleteCollectionWithCascade ensures no orphaned data (cleans S3, Pinecone, DynamoDB together)
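
A minimal version of that ACL filter (a post-filter sketch; the real implementation may push the filter into the vector query itself):

interface SearchHit { documentId: string; score: number; }

// Drop any result the caller is not entitled to see.
function filterByAcl(hits: SearchHit[], allowedDocumentIds: Set<string>): SearchHit[] {
  return hits.filter((h) => allowedDocumentIds.has(h.documentId));
}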

Multi-Tenant Security

Tenant boundaries are enforced consistently:

  • S3 prefixes → documents/{tenantId}/*
  • Pinecone namespaces → one per tenant
  • WorkflowAccess table → governs workflow visibility/sharing
  • Encryption → all S3 objects use KMS
  • Secrets Manager → OpenAI + Pinecone keys injected at runtime

This ensures strong isolation and safe collaboration.
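
Two of those tenancy rules, expressed as code (hypothetical helpers; the real key layout may differ):

// Every S3 object key is scoped under the tenant's prefix.
function documentKey(tenantId: string, documentId: string): string {
  return `documents/${tenantId}/${documentId}`;
}

// Each tenant reads and writes only its own Pinecone namespace.
function pineconeNamespace(tenantId: string): string {
  return tenantId;
}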

Workflow Graph Execution

Workflows are directed graphs with nine node types:

  • ModelInvoke, VectorSearch, Router, SlotTracker, ConversationMemory, Format, StreamToClient, VectorWrite, IntentClassifier

The runner enforces:

  • Virtual START/END edges are added automatically
  • Schema validation: all nodes/edges valid, all router/slot targets connected
  • Router expressions use a safe DSL (e.g., state.intent === 'greeting'); a sketch follows this list
  • SlotTracker supports partial slots and fallback routes
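
One way such a DSL stays safe is to recognize only a restricted pattern instead of calling eval(); this sketch handles a single equality form and fails closed on anything else (the real DSL is presumably richer):

const ROUTE_EXPR = /^state\.(\w+)\s*===\s*'([^']*)'$/;

// Evaluate a router expression against workflow state without eval().
function evaluateRoute(expr: string, state: Record<string, unknown>): boolean {
  const match = ROUTE_EXPR.exec(expr.trim());
  if (!match) throw new Error(`Unsupported router expression: ${expr}`);
  const [, field, literal] = match;
  return state[field] === literal;
}

// evaluateRoute("state.intent === 'greeting'", { intent: "greeting" }) → true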

Developer Safety Nets

  • Schema validation → invalid workflows fail fast
  • CAS protection → prevents race conditions on prompt updates
  • Rollback paths → ActivePromptPointer tracks previousVersionId
  • Emergency fallbacks everywhere → prompts, breakers, and streaming

This allows rapid iteration without outages.

Observability & Monitoring

Every Lambda uses AWS Powertools (Logger, Metrics, Tracer):

  • Metrics: workflow runs, CAS conflicts, token enforcement, costs
  • Tracing: API Gateway → Lambda → Bedrock/OpenAI/Pinecone
  • Annotations: workflowId, modelId, requestId for correlation
  • Prompt cache: hit/miss ratios, evictions logged

Audit trails cover prompts, breaker operations, and document processing.
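
Typical wiring looks like this (assuming Powertools for TypeScript v2; the metric and annotation names mirror the list above but are illustrative):

import { Logger } from "@aws-lambda-powertools/logger";
import { Metrics, MetricUnit } from "@aws-lambda-powertools/metrics";
import { Tracer } from "@aws-lambda-powertools/tracer";

const logger = new Logger({ serviceName: "workflow-runner" });
const metrics = new Metrics({ namespace: "MatterAndGas" });
const tracer = new Tracer({ serviceName: "workflow-runner" });

function recordRun(workflowId: string, modelId: string, costUsd: number): void {
  tracer.putAnnotation("workflowId", workflowId);  // trace correlation
  tracer.putAnnotation("modelId", modelId);
  metrics.addMetric("WorkflowRuns", MetricUnit.Count, 1);
  logger.info("workflow run recorded", { workflowId, modelId, costUsd });
  // metrics.publishStoredMetrics() flushes buffered metrics at the end of the invocation
}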

Production Lessons

Running AI in production requires discipline:

  • Without budgets → costs spike
  • Without versioning → prompts drift silently
  • Without circuit breakers → outages cascade

Matter & Gas solves these problems as first-class concerns:

  • Model Registry for portability
  • Token budgets for cost control
  • Circuit breakers + DLQs for resilience
  • Collections + tenancy enforcement for enterprise safety
  • Streaming for real-time UX
  • Observability for operational visibility

Result: A foundation that scales from prototype to production without rewrites, enabling fast iteration and safe deployments.

Have questions or want to collaborate?

We'd love to hear from you about this technical approach or discuss how it might apply to your project.
