Most AI products start with something simple: a single provider, hardcoded prompts, and minimal error handling. These prototypes demo well but collapse at scale: costs spiral, outages cascade, and teams lose confidence in making changes.
Matter & Gas was built as a multi-tenant AI platform that treats these problems as first-class concerns. From cost enforcement to prompt versioning, from real-time streaming to tenant isolation, the architecture is production-ready by design.
Multi-Provider Model Registry
Instead of scattering model configs across the codebase, Matter & Gas centralizes them in a Model Registry:
```typescript
// amplify/functions/workflow-runner/src/modelCapabilities.ts
export const MODEL_REGISTRY: Record<string, ModelCapability> = {
  "gpt-4o": {
    provider: "openai",
    pricing: { inputCostPerUnit: 0.005, outputCostPerUnit: 0.02, unit: "1K tokens" },
    contextWindow: 128000,
    apiConventions: { supportsStreaming: true, supportsJSONMode: true }
  },
  "anthropic.claude-3-7-sonnet-20250219-v1:0": {
    provider: "anthropic",
    pricing: { inputCostPerUnit: 0.003, outputCostPerUnit: 0.015, unit: "1K tokens" },
    contextWindow: 200000,
    apiConventions: { supportsStreaming: true }
  }
};
```
Each entry defines pricing, context windows, tokenizer behavior, and capability flags.
- A global `DEFAULT_MODEL_ID` is set in `backend.ts` and used if a workflow omits a `modelId`
- If a `modelId` is invalid or missing from the registry, the runner fails fast (no silent fallback)
- Bedrock-style IDs are normalized (e.g. `us.anthropic...` → `anthropic...`)
This abstraction makes switching providers straightforward while preserving correctness.
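As a rough sketch, the fail-fast lookup and ID normalization described above might look like the following (the helper names here are illustrative, not the actual implementation):

```typescript
// Hypothetical helper: strips a Bedrock regional prefix (e.g. "us.", "eu.")
// so "us.anthropic..." resolves to the registry key "anthropic...".
function normalizeModelId(modelId: string): string {
  return modelId.replace(/^(us|eu|ap)\./, "");
}

// Resolve a workflow's modelId against the registry: fall back to the
// global default only when no modelId was given, and fail fast on
// unknown IDs rather than silently substituting another model.
function resolveModel(
  modelId: string | undefined,
  registry: Record<string, unknown>,
  defaultId: string
): { id: string; entry: unknown } {
  const id = normalizeModelId(modelId ?? defaultId);
  const entry = registry[id];
  if (!entry) {
    throw new Error(`Unknown modelId: ${id}`); // no silent fallback
  }
  return { id, entry };
}
```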
Circuit Breaker Protection
Every model in the registry is wrapped with circuit breakers:
- Thresholds → breaker opens if error rates exceed ~5% over 5 minutes
- Fail-open → if health-check logic itself errors, traffic is allowed (prefer availability)
- Persistence → state stored in DynamoDB (`ModelCircuitBreaker`), surviving cold starts
- Manual overrides → operators can trip/reset with audit logging
This prevents cascading failures during provider instability.
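A minimal in-memory sketch of this policy (the real state lives in the `ModelCircuitBreaker` DynamoDB table, and the rolling 5-minute window is elided here for brevity):

```typescript
type BreakerState = "CLOSED" | "OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private calls = 0;
  private errors = 0;

  constructor(private errorRateThreshold = 0.05, private minCalls = 20) {}

  // Record a call outcome; trip the breaker once the error rate
  // exceeds the threshold over a meaningful sample size.
  record(success: boolean): void {
    this.calls += 1;
    if (!success) this.errors += 1;
    if (this.calls >= this.minCalls && this.errors / this.calls > this.errorRateThreshold) {
      this.state = "OPEN";
    }
  }

  allowRequest(healthCheck?: () => boolean): boolean {
    try {
      if (healthCheck && !healthCheck()) return false;
    } catch {
      return true; // fail-open: if health-check logic itself errors, prefer availability
    }
    return this.state === "CLOSED";
  }

  // Manual operator override, analogous to the audited reset described above.
  reset(): void {
    this.state = "CLOSED";
    this.calls = 0;
    this.errors = 0;
  }
}
```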
Token Budget Enforcement
Uncontrolled token use is one of the fastest ways to lose money.
The TokenBudget enforces limits before requests are sent:
```typescript
const result = await TokenBudget.enforce(
  modelConfig,
  requestedOutputTokens,
  estimatedInputTokens
);

if (!result.allowed) {
  throw new Error(`Budget exceeded: ${result.reason}`);
}
```
- Violations → `TOKEN_LIMIT_EXCEEDED`, `COST_LIMIT_EXCEEDED`, `CONTEXT_WINDOW_EXCEEDED`
- Estimation → provider-specific tokenizers when available (e.g. tiktoken); otherwise registry fallbacks
- Truncation strategy → preserve system + user input, drop oldest memory next, truncate system only last
- Buffer margin → ~10% headroom ensures no overflow at runtime
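The truncation order above can be sketched as follows, assuming a naive chars/4 estimate as the registry fallback (real code would prefer a provider tokenizer such as tiktoken when available):

```typescript
interface PromptParts {
  system: string;
  memory: string[]; // oldest entries first
  userInput: string;
}

// Registry-fallback estimate: roughly 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function fitToContext(parts: PromptParts, contextWindow: number, bufferRatio = 0.1): PromptParts {
  const budget = Math.floor(contextWindow * (1 - bufferRatio)); // ~10% headroom
  const fixed = estimateTokens(parts.system) + estimateTokens(parts.userInput);

  // 1) Preserve system + user input; drop the oldest memory entries first.
  const memory = [...parts.memory];
  while (memory.length > 0 && fixed + memory.reduce((n, m) => n + estimateTokens(m), 0) > budget) {
    memory.shift();
  }

  // 2) Truncate the system prompt only as a last resort.
  let system = parts.system;
  if (fixed > budget) {
    const allowedSystemTokens = Math.max(0, budget - estimateTokens(parts.userInput));
    system = system.slice(0, allowedSystemTokens * 4);
  }

  return { system, memory, userInput: parts.userInput };
}
```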
Prompt Management (Git-Like Versioning)
Prompts are versioned artifacts, not inline strings.
The system includes:
- BasePromptVersion → immutable, SHA-256 hashed, stored in DynamoDB or S3 (KMS encrypted)
- ActivePromptPointer → mutable selector, with rollback to `previousVersionId`
- CAS Update Lambda → atomic Compare-And-Set for safe deployments
- PromptArchiveBucket → S3 archival with 1-year retention
- AuditLog → every create/update/rollback recorded
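The CAS semantics can be illustrated with an in-memory stand-in; in DynamoDB this maps to a conditional `UpdateItem` with a `ConditionExpression` on the current version:

```typescript
interface ActivePromptPointer {
  activeVersionId: string;
  previousVersionId?: string;
}

// Compare-And-Set: the write succeeds only if the pointer still references
// the version the caller last read; otherwise a concurrent deployment won
// and the caller must re-read before retrying.
function casUpdatePointer(
  pointer: ActivePromptPointer,
  expectedVersionId: string,
  newVersionId: string
): { ok: boolean; pointer: ActivePromptPointer } {
  if (pointer.activeVersionId !== expectedVersionId) {
    return { ok: false, pointer };
  }
  return {
    ok: true,
    // Keep the old version as the rollback target.
    pointer: { activeVersionId: newVersionId, previousVersionId: pointer.activeVersionId },
  };
}
```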
Resolution hierarchy:
- Tenant + Workflow + Model
- Workflow + Model
- Tenant + Model
- Model (global)
- Emergency fallback → neutral base prompt for `DEFAULT_MODEL_ID`
Resolution never fails: if nothing matches, the neutral base prompt is returned.
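A sketch of the four-level lookup plus the neutral fallback (the key shapes are assumptions about how the real table is indexed):

```typescript
type PromptStore = Map<string, string>;

// Walk the hierarchy from most to least specific; "*" marks a wildcard slot.
function resolvePrompt(
  store: PromptStore,
  tenantId: string,
  workflowId: string,
  modelId: string,
  neutralBasePrompt: string
): string {
  const keys = [
    `${tenantId}#${workflowId}#${modelId}`, // Tenant + Workflow + Model
    `*#${workflowId}#${modelId}`,           // Workflow + Model
    `${tenantId}#*#${modelId}`,             // Tenant + Model
    `*#*#${modelId}`,                       // Model (global)
  ];
  for (const key of keys) {
    const prompt = store.get(key);
    if (prompt !== undefined) return prompt;
  }
  return neutralBasePrompt; // resolution never fails
}
```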
Asynchronous Document Processing
Documents move through a fully async pipeline:
- Upload via GraphQL API
- Storage in S3 + enqueue job in SQS
- Processing worker:
  - Textract for OCR
  - Titan v2 embeddings for vectors
  - Pinecone upsert for storage
- Tracking → DynamoDB updates document status
- Resilience → DLQs at every stage (`DocumentProcessingDLQ`, `TextractProcessingDLQ`)
All failures are captured and retriable — no silent data loss.
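The retry-then-dead-letter behavior can be sketched like this (simplified and synchronous; in production, SQS redelivery and the queue's `maxReceiveCount` drive the DLQ routing):

```typescript
// Retry a job up to maxAttempts; on final failure, hand it to the
// dead-letter sink instead of dropping it, so nothing is lost silently.
function processWithDlq<T>(
  job: T,
  handler: (job: T) => void,
  maxAttempts: number,
  deadLetter: (job: T, err: unknown) => void
): boolean {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      handler(job);
      return true;
    } catch (err) {
      if (attempt === maxAttempts) deadLetter(job, err);
    }
  }
  return false;
}
```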
Real-Time Streaming & Collections
User experience is stream-first:
- StreamToClient nodes send incremental responses with metadata: `tokensUsed`, `generationTimeMs`, `chunkNumber`, `isStreamingChunk`
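A hypothetical chunk shape and splitter, mirroring the metadata fields above (the interface is an assumption, not the wire format):

```typescript
interface StreamChunk {
  content: string;
  chunkNumber: number;
  isStreamingChunk: boolean;  // false on the final chunk
  tokensUsed?: number;        // typically only on the final chunk
  generationTimeMs?: number;  // typically only on the final chunk
}

// Split a response into fixed-size chunks, numbering each and marking
// all but the last as streaming.
function* chunkResponse(text: string, size: number): Generator<StreamChunk> {
  let n = 0;
  for (let i = 0; i < text.length; i += size) {
    n += 1;
    yield {
      content: text.slice(i, i + size),
      chunkNumber: n,
      isStreamingChunk: i + size < text.length,
    };
  }
}
```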
For retrieval, Collections enforce enterprise-grade ACLs:
- Docs can belong to multiple collections
- VectorSearch filters results against `state.allowedDocumentIds`
- `deleteCollectionWithCascade` ensures no orphaned data (cleans S3, Pinecone, DynamoDB together)
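The ACL filter reduces to a set-membership check over the search hits (the hit shape here is an assumption):

```typescript
interface SearchHit {
  documentId: string;
  score: number;
}

// Drop any hit the caller's state does not explicitly allow; hits for
// documents outside allowedDocumentIds never reach the model or the user.
function filterByAcl(hits: SearchHit[], allowedDocumentIds: Set<string>): SearchHit[] {
  return hits.filter((h) => allowedDocumentIds.has(h.documentId));
}
```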
Multi-Tenant Security
Tenant boundaries are enforced consistently:
- S3 prefixes → `documents/{tenantId}/*`
- Pinecone namespaces → one per tenant
- WorkflowAccess table → governs workflow visibility/sharing
- Encryption → all S3 objects use KMS
- Secrets Manager → OpenAI + Pinecone keys injected at runtime
This ensures strong isolation and safe collaboration.
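Tenant scoping of S3 keys can be sketched with two small helpers (the prefix scheme comes from the text; the function names are illustrative):

```typescript
// Build a tenant-scoped object key under the documents/{tenantId}/ prefix.
function documentKey(tenantId: string, documentId: string): string {
  return `documents/${tenantId}/${documentId}`;
}

// Defense in depth: reject any key that falls outside the caller's prefix,
// even if an upstream check was bypassed.
function assertTenantOwnsKey(tenantId: string, key: string): void {
  if (!key.startsWith(`documents/${tenantId}/`)) {
    throw new Error("Cross-tenant access denied");
  }
}
```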
Workflow Graph Execution
Workflows are directed graphs with nine node types:
- ModelInvoke, VectorSearch, Router, SlotTracker, ConversationMemory, Format, StreamToClient, VectorWrite, IntentClassifier
The runner enforces:
- Virtual START/END edges auto-added
- Schema validation: all nodes/edges valid, all router/slot targets connected
- Router expressions use a safe DSL (e.g., `state.intent === 'greeting'`)
- SlotTracker supports partial slots and fallback routes
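The safety of such a router DSL comes from parsing a restricted grammar rather than calling `eval()`. A minimal evaluator supporting only one comparison form, as a sketch of the approach (the real DSL presumably supports more operators):

```typescript
// Accepts only expressions of the form: state.<field> === '<literal>'
// (or !==). Anything else is rejected, so arbitrary code can never run.
function evalRouterExpr(expr: string, state: Record<string, unknown>): boolean {
  const m = expr.match(/^state\.(\w+)\s*(===|!==)\s*'([^']*)'$/);
  if (!m) throw new Error(`Unsupported router expression: ${expr}`);
  const [, field, op, literal] = m;
  return op === "===" ? state[field] === literal : state[field] !== literal;
}
```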
Developer Safety Nets
- Schema validation → invalid workflows fail fast
- CAS protection → prevents race conditions on prompt updates
- Rollback paths → ActivePromptPointer tracks `previousVersionId`
- Emergency fallbacks everywhere → prompts, breakers, and streaming
This allows rapid iteration without outages.
Observability & Monitoring
Every Lambda uses AWS Powertools (Logger, Metrics, Tracer):
- Metrics: workflow runs, CAS conflicts, token enforcement, costs
- Tracing: API Gateway → Lambda → Bedrock/OpenAI/Pinecone
- Annotations: `workflowId`, `modelId`, `requestId` for correlation
- Prompt cache: hit/miss ratios, evictions logged
Audit trails cover prompts, breaker operations, and document processing.
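The correlation pattern can be illustrated with a small stand-in for the Powertools Logger, whose `appendKeys` attaches fields to every subsequent log entry:

```typescript
// Minimal stand-in for the structured-logging pattern: correlation keys
// (workflowId, modelId, requestId) are appended once per invocation and
// emitted with every entry, so logs across services can be joined.
class StructuredLogger {
  private keys: Record<string, string> = {};
  public entries: string[] = [];

  appendKeys(keys: Record<string, string>): void {
    Object.assign(this.keys, keys);
  }

  info(message: string): void {
    this.entries.push(JSON.stringify({ level: "INFO", message, ...this.keys }));
  }
}
```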
Production Lessons
Running AI in production requires discipline:
- Without budgets → costs spike
- Without versioning → prompts drift silently
- Without circuit breakers → outages cascade
Matter & Gas solves these problems as first-class concerns:
- Model Registry for portability
- Token budgets for cost control
- Circuit breakers + DLQs for resilience
- Collections + tenancy enforcement for enterprise safety
- Streaming for real-time UX
- Observability for operational visibility
Result: A foundation that scales from prototype to production without rewrites, enabling fast iteration and safe deployments.