Building Production AI: A Multi-Tenant Platform Architecture

ai-platform · multi-tenant · production-architecture · cost-control · circuit-breakers · observability

Most AI products start with something simple: a single provider, hardcoded prompts, and minimal error handling. These prototypes demo well but collapse under scale: costs spiral, outages cascade, and teams lose confidence in making changes.

Matter & Gas was built as a multi-tenant AI platform that treats these problems as first-class concerns. From cost enforcement to prompt versioning, from real-time streaming to tenant isolation, the architecture is production-ready by design.

Multi-Provider Model Registry

Instead of scattering model configs across the codebase, Matter & Gas centralizes them in a Model Registry:

// amplify/functions/workflow-runner/src/modelCapabilities.ts
export const MODEL_REGISTRY: Record<string, ModelCapability> = {
  "gpt-4o": {
    provider: "openai",
    pricing: { inputCostPerUnit: 0.005, outputCostPerUnit: 0.02, unit: "1K tokens" },
    contextWindow: 128000,
    apiConventions: { supportsStreaming: true, supportsJSONMode: true }
  },
  "anthropic.claude-3-7-sonnet-20250219-v1:0": {
    provider: "anthropic",
    pricing: { inputCostPerUnit: 0.003, outputCostPerUnit: 0.015, unit: "1K tokens" },
    contextWindow: 200000,
    apiConventions: { supportsStreaming: true }
  }
};

Each entry defines pricing, context windows, tokenizer behavior, and capability flags.

  • A global DEFAULT_MODEL_ID is set in backend.ts and used if a workflow omits a modelId
  • If a modelId is invalid or missing from the registry, the runner fails fast (no silent fallback)
  • Bedrock-style IDs are normalized (e.g. us.anthropic... → anthropic...), as sketched below
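
A minimal resolution sketch (the helper name and the exact prefix rule are assumptions, since the normalization code isn't shown here):

// Hypothetical helper illustrating registry lookup; not the platform's actual code.
function resolveModel(modelId: string = DEFAULT_MODEL_ID): ModelCapability {
  // Normalize Bedrock-style regional IDs, e.g. "us.anthropic..." → "anthropic..."
  const normalized = modelId.replace(/^(us|eu|ap)\./, "");
  const capability = MODEL_REGISTRY[normalized];
  if (!capability) {
    // Fail fast: never silently substitute a different model
    throw new Error(`Unknown modelId: ${modelId}`);
  }
  return capability;
}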

This abstraction makes switching providers straightforward while preserving correctness.

Circuit Breaker Protection

Every model in the registry is wrapped with circuit breakers:

  • Thresholds → breaker opens if error rates exceed ~5% over 5 minutes
  • Fail-open → if health-check logic itself errors, traffic is allowed (prefer availability)
  • Persistence → state stored in DynamoDB (ModelCircuitBreaker), surviving cold starts
  • Manual overrides → operators can trip/reset with audit logging

This prevents cascading failures during provider instability.
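
A sketch of the check path, assuming one DynamoDB-backed state item per model (the table name follows the description above; the attribute names are illustrative):

import { DynamoDBClient, GetItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

// Illustrative breaker check; not the platform's actual code.
async function isModelAvailable(modelId: string): Promise<boolean> {
  try {
    const { Item } = await ddb.send(new GetItemCommand({
      TableName: "ModelCircuitBreaker",   // persisted state survives cold starts
      Key: { modelId: { S: modelId } },
    }));
    // An OPEN breaker rejects traffic for this model
    return Item?.state?.S !== "OPEN";
  } catch {
    // Fail open: if the health check itself errors, allow traffic
    return true;
  }
}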

Token Budget Enforcement

Uncontrolled token use is one of the fastest ways to lose money.

The TokenBudget enforces limits before requests are sent:

const result = await TokenBudget.enforce(
  modelConfig,
  requestedOutputTokens,
  estimatedInputTokens
);

if (!result.allowed) {
  throw new Error(`Budget exceeded: ${result.reason}`);
}

  • Violations → TOKEN_LIMIT_EXCEEDED, COST_LIMIT_EXCEEDED, CONTEXT_WINDOW_EXCEEDED
  • Estimation → provider-specific tokenizers when available (e.g. tiktoken); otherwise registry fallbacks
  • Truncation strategy → preserve system + user input, drop oldest memory next, truncate system only last (see the sketch after this list)
  • Buffer margin → ~10% headroom ensures no overflow at runtime
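
A sketch of that truncation order (the segment shape and helper are illustrative, not the platform's actual code):

interface PromptSegments {
  system: string;
  memory: string[];   // conversation memory, oldest first
  userInput: string;
}

function fitToBudget(
  segments: PromptSegments,
  maxInputTokens: number,
  countTokens: (text: string) => number,  // provider tokenizer or registry fallback
): PromptSegments {
  const budget = Math.floor(maxInputTokens * 0.9);  // ~10% headroom
  let { system } = segments;
  const memory = [...segments.memory];
  const used = () =>
    countTokens(system) +
    memory.reduce((n, m) => n + countTokens(m), 0) +
    countTokens(segments.userInput);

  // 1) Drop oldest memory first; system + user input are preserved
  while (used() > budget && memory.length > 0) memory.shift();

  // 2) Truncate the system prompt only as a last resort
  while (used() > budget && system.length > 0) system = system.slice(0, -100);

  return { system, memory, userInput: segments.userInput };
}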

Prompt Management (Git-Like Versioning)

Prompts are versioned artifacts, not inline strings.

The system includes:

  • BasePromptVersion → immutable, SHA-256 hashed, stored in DynamoDB or S3 (KMS encrypted)
  • ActivePromptPointer → mutable selector, with rollback to previousVersionId
  • CAS Update Lambda → atomic Compare-And-Set for safe deployments (sketched after this list)
  • PromptArchiveBucket → S3 archival with 1-year retention
  • AuditLog → every create/update/rollback recorded
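
The CAS flip itself can be expressed as a single conditional update; this sketch assumes the pointer lives in a DynamoDB table keyed by pointerId (attribute names are illustrative):

import { DynamoDBClient, UpdateItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

// Atomic Compare-And-Set: the write fails with ConditionalCheckFailedException
// if another deployment changed the pointer first.
async function activatePromptVersion(
  pointerId: string,
  expectedVersionId: string,  // what the caller believes is currently active
  newVersionId: string,
): Promise<void> {
  await ddb.send(new UpdateItemCommand({
    TableName: "ActivePromptPointer",
    Key: { pointerId: { S: pointerId } },
    ConditionExpression: "activeVersionId = :expected",
    UpdateExpression: "SET activeVersionId = :new, previousVersionId = :expected",
    ExpressionAttributeValues: {
      ":expected": { S: expectedVersionId },
      ":new": { S: newVersionId },
    },
  }));
}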

Resolution hierarchy:

  1. Tenant + Workflow + Model
  2. Workflow + Model
  3. Tenant + Model
  4. Model (global)
  5. Emergency fallback → neutral base prompt for DEFAULT_MODEL_ID

Resolution never fails: if nothing matches, the neutral base prompt is returned.
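
A sketch of that fallthrough (the key shapes and NEUTRAL_BASE_PROMPT are assumptions about code not shown):

// Illustrative resolution order; `lookup` stands in for the real store query.
async function resolvePrompt(
  tenantId: string,
  workflowId: string,
  modelId: string,
  lookup: (key: string) => Promise<string | undefined>,
): Promise<string> {
  const candidates = [
    `${tenantId}#${workflowId}#${modelId}`,  // 1. Tenant + Workflow + Model
    `${workflowId}#${modelId}`,              // 2. Workflow + Model
    `${tenantId}#${modelId}`,                // 3. Tenant + Model
    modelId,                                 // 4. Model (global)
  ];
  for (const key of candidates) {
    const prompt = await lookup(key);
    if (prompt !== undefined) return prompt;
  }
  return NEUTRAL_BASE_PROMPT;  // 5. Emergency fallback: resolution never fails
}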

Asynchronous Document Processing

Documents move through a fully async pipeline:

  1. Upload via GraphQL API
  2. Storage in S3 + enqueue job in SQS
  3. Processing worker:
    • Textract for OCR
    • Titan v2 embeddings for vectors
    • Pinecone upsert for storage
  4. Tracking → DynamoDB updates document status
  5. Resilience → DLQs at every stage (DocumentProcessingDLQ, TextractProcessingDLQ)

All failures are captured and retriable — no silent data loss.
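
A worker skeleton following those steps (the four helpers are hypothetical stand-ins for the Textract, Titan, Pinecone, and DynamoDB integrations described above):

import type { SQSHandler } from "aws-lambda";

declare function ocrWithTextract(s3Key: string): Promise<string>;
declare function embedWithTitan(text: string): Promise<number[][]>;
declare function upsertToPinecone(tenantId: string, documentId: string, vectors: number[][]): Promise<void>;
declare function markStatus(documentId: string, status: string): Promise<void>;

export const handler: SQSHandler = async (event) => {
  for (const record of event.Records) {
    const { documentId, s3Key, tenantId } = JSON.parse(record.body);
    try {
      await markStatus(documentId, "PROCESSING");             // DynamoDB tracking
      const text = await ocrWithTextract(s3Key);              // OCR
      const vectors = await embedWithTitan(text);             // Titan v2 embeddings
      await upsertToPinecone(tenantId, documentId, vectors);  // tenant namespace
      await markStatus(documentId, "COMPLETE");
    } catch (err) {
      await markStatus(documentId, "FAILED");
      throw err;  // SQS retries; exhausted retries land in the DLQ
    }
  }
};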

Real-Time Streaming & Collections

User experience is stream-first:

  • StreamToClient nodes send incremental responses with metadata:
    • tokensUsed
    • generationTimeMs
    • chunkNumber
    • isStreamingChunk
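
A chunk under this scheme might look like the following (field types are inferred from the list above; the exact shape is illustrative):

interface StreamChunk {
  content: string;            // incremental response text
  isStreamingChunk: boolean;  // false on the final, complete message
  chunkNumber: number;        // ordinal for client-side reassembly
  tokensUsed?: number;        // populated once usage is known
  generationTimeMs?: number;  // generation latency so far
}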

For retrieval, Collections enforce enterprise-grade ACLs:

  • Docs can belong to multiple collections
  • VectorSearch filters results against state.allowedDocumentIds (see the sketch after this list)
  • deleteCollectionWithCascade ensures no orphaned data (cleans S3, Pinecone, DynamoDB together)
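
A minimal version of that ACL filter (a post-filter sketch; the real implementation may push the filter into the vector query itself):

interface SearchHit { documentId: string; score: number; }

// Drop any result the caller is not entitled to see.
function filterByAcl(hits: SearchHit[], allowedDocumentIds: Set<string>): SearchHit[] {
  return hits.filter((h) => allowedDocumentIds.has(h.documentId));
}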

Multi-Tenant Security

Tenant boundaries are enforced consistently:

  • S3 prefixes → documents/{tenantId}/*
  • Pinecone namespaces → one per tenant
  • WorkflowAccess table → governs workflow visibility/sharing
  • Encryption → all S3 objects use KMS
  • Secrets Manager → OpenAI + Pinecone keys injected at runtime

This ensures strong isolation and safe collaboration.
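
Two of those tenancy rules, expressed as code (hypothetical helpers; the real key layout may differ):

// Every S3 object key is scoped under the tenant's prefix.
function documentKey(tenantId: string, documentId: string): string {
  return `documents/${tenantId}/${documentId}`;
}

// Each tenant reads and writes only its own Pinecone namespace.
function pineconeNamespace(tenantId: string): string {
  return tenantId;
}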

Workflow Graph Execution

Workflows are directed graphs with nine node types:

  • ModelInvoke, VectorSearch, Router, SlotTracker, ConversationMemory, Format, StreamToClient, VectorWrite, IntentClassifier

The runner enforces:

  • Virtual START/END edges are added automatically
  • Schema validation: all nodes/edges valid, all router/slot targets connected
  • Router expressions use a safe DSL (e.g., state.intent === 'greeting'); a sketch follows this list
  • SlotTracker supports partial slots and fallback routes
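
One way such a DSL stays safe is to recognize only a restricted pattern instead of calling eval(); this sketch handles a single equality form and fails closed on anything else (the real DSL is presumably richer):

const ROUTE_EXPR = /^state\.(\w+)\s*===\s*'([^']*)'$/;

// Evaluate a router expression against workflow state without eval().
function evaluateRoute(expr: string, state: Record<string, unknown>): boolean {
  const match = ROUTE_EXPR.exec(expr.trim());
  if (!match) throw new Error(`Unsupported router expression: ${expr}`);
  const [, field, literal] = match;
  return state[field] === literal;
}

// evaluateRoute("state.intent === 'greeting'", { intent: "greeting" }) → true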

Developer Safety Nets

  • Schema validation → invalid workflows fail fast
  • CAS protection → prevents race conditions on prompt updates
  • Rollback paths → ActivePromptPointer tracks previousVersionId
  • Emergency fallbacks everywhere → prompts, breakers, and streaming

This allows rapid iteration without outages.

Observability & Monitoring

Every Lambda uses AWS Powertools (Logger, Metrics, Tracer):

  • Metrics: workflow runs, CAS conflicts, token enforcement, costs
  • Tracing: API Gateway → Lambda → Bedrock/OpenAI/Pinecone
  • Annotations: workflowId, modelId, requestId for correlation
  • Prompt cache: hit/miss ratios, evictions logged

Audit trails cover prompts, breaker operations, and document processing.
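
Typical wiring looks like this (assuming Powertools for TypeScript v2; the metric and annotation names mirror the list above but are illustrative):

import { Logger } from "@aws-lambda-powertools/logger";
import { Metrics, MetricUnit } from "@aws-lambda-powertools/metrics";
import { Tracer } from "@aws-lambda-powertools/tracer";

const logger = new Logger({ serviceName: "workflow-runner" });
const metrics = new Metrics({ namespace: "MatterAndGas" });
const tracer = new Tracer({ serviceName: "workflow-runner" });

function recordRun(workflowId: string, modelId: string, costUsd: number): void {
  tracer.putAnnotation("workflowId", workflowId);  // trace correlation
  tracer.putAnnotation("modelId", modelId);
  metrics.addMetric("WorkflowRuns", MetricUnit.Count, 1);
  logger.info("workflow run recorded", { workflowId, modelId, costUsd });
  // metrics.publishStoredMetrics() flushes buffered metrics at the end of the invocation
}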

Production Lessons

Running AI in production requires discipline:

  • Without budgets → costs spike
  • Without versioning → prompts drift silently
  • Without circuit breakers → outages cascade

Matter & Gas solves these problems as first-class concerns:

  • Model Registry for portability
  • Token budgets for cost control
  • Circuit breakers + DLQs for resilience
  • Collections + tenancy enforcement for enterprise safety
  • Streaming for real-time UX
  • Observability for operational visibility

Result: A foundation that scales from prototype to production without rewrites, enabling fast iteration and safe deployments.

Have questions or want to collaborate?

We'd love to hear from you about this technical approach or discuss how it might apply to your project.
