Performance & Scalability

ICE uses a stateless kernel design: compute nodes share nothing with one another and keep all state in a common data layer (PostgreSQL + Redis), enabling horizontal compute scaling without session stickiness.

1. Scalability Architecture

| Dimension | Design |
| --- | --- |
| Compute Scaling | Stateless nodes: add/remove without downtime or session migration |
| Data Layer | Centralized PostgreSQL (Semantic Ledger) + Redis (Hot-Cache) cluster shared across nodes |
| Session Affinity | None required: any node can serve any session |
| Streaming | SSE passthrough: ICE does not buffer LLM output tokens |
| Retrieval Path | Async pgvector HNSW query, executed before prompt assembly |
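
To illustrate the streaming row, here is a minimal sketch of unbuffered SSE passthrough. It is not ICE's implementation: `upstream_tokens` is a hypothetical stand-in for the upstream LLM stream, and `sse_passthrough` is an illustrative name. The point is only that each token is framed and forwarded as soon as it arrives, with nothing accumulated in between.

```python
import asyncio
from typing import AsyncIterator, List


async def upstream_tokens() -> AsyncIterator[str]:
    """Hypothetical stand-in for an upstream LLM token stream."""
    for token in ["Hello", " ", "world"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def sse_passthrough(source: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each token in one SSE `data:` frame and forward it immediately.

    Assumes tokens contain no newlines; a token with newlines would need
    one `data:` line per text line to remain a valid SSE event.
    """
    async for token in source:
        yield f"data: {token}\n\n"


async def collect_frames() -> List[str]:
    # Collected into a list here only so the result can be inspected;
    # a real handler would write each frame to the response as it is produced.
    return [frame async for frame in sse_passthrough(upstream_tokens())]


frames = asyncio.run(collect_frames())
```

Because the generator holds at most one token at a time, memory use per stream stays flat regardless of response length.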

2. Resource Governance

ICE enforces hard resource caps via environment variables. These prevent runaway processes under load.

  • ICE_MEMORY_CAP_GB: Hard RAM ceiling for the ICE process. Engine terminates cleanly if exceeded.
  • ICE_MAX_STITCH_CONCURRENCY: Maximum parallel context assembly operations. Prevents CPU saturation during high-concurrency bursts.
  • ICE_POST_COMPRESSION_LIMIT: Maximum final prompt size (tokens) submitted to the upstream LLM. Enforced unconditionally.
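
The caps above could be wired up roughly as follows. This is a sketch under stated assumptions, not ICE's code: the helper names (`read_int_env`, `enforce_prompt_limit`, `run_stitches`) and the default values are hypothetical, the concurrency cap is modeled with an `asyncio.Semaphore`, and the prompt limit is modeled as truncation (the source does not say whether ICE truncates or rejects). The memory cap is omitted because its enforcement mechanism is not described.

```python
import asyncio
import os
from typing import List, Mapping


def read_int_env(name: str, default: int, env: Mapping[str, str] = os.environ) -> int:
    """Read an integer setting from the environment (default is hypothetical)."""
    raw = env.get(name)
    return int(raw) if raw is not None else default


def enforce_prompt_limit(tokens: List[str], limit: int) -> List[str]:
    """Assumption: oversize prompts are truncated to the first `limit` tokens."""
    return tokens[:limit]


async def run_stitches(jobs: int, max_concurrency: int) -> int:
    """Run dummy assembly jobs; return the peak number running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)
    running = 0
    peak = 0

    async def stitch() -> None:
        nonlocal running, peak
        async with semaphore:  # blocks once max_concurrency jobs are active
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0)  # placeholder for real assembly work
            running -= 1

    await asyncio.gather(*(stitch() for _ in range(jobs)))
    return peak


max_stitch = read_int_env("ICE_MAX_STITCH_CONCURRENCY", default=4)
prompt_limit = read_int_env("ICE_POST_COMPRESSION_LIMIT", default=8192)
```

For example, `asyncio.run(run_stitches(10, max_stitch))` never reports a peak above `max_stitch`, which is the property the concurrency cap is meant to guarantee.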

3. High Availability

Fallback Mode

If PostgreSQL or Redis becomes unreachable, ICE bypasses context injection and routes the raw prompt directly to the upstream LLM. API availability is maintained, and context augmentation is suspended until the backing services recover.
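
The fallback behavior can be sketched as a try/except around context retrieval. Everything named here is hypothetical (`BackingStoreUnavailable`, `build_prompt`, and the two example stores), and the way context is joined to the prompt is an assumption made only for illustration.

```python
from typing import Callable, List


class BackingStoreUnavailable(Exception):
    """Hypothetical error raised when PostgreSQL or Redis cannot be reached."""


def build_prompt(raw_prompt: str, fetch_context: Callable[[str], List[str]]) -> str:
    """Return an augmented prompt, or the raw prompt if the stores are down."""
    try:
        context = fetch_context(raw_prompt)
    except BackingStoreUnavailable:
        # Fallback mode: skip context injection, keep the API available.
        return raw_prompt
    if not context:
        return raw_prompt
    # Assumption: context is prepended, separated from the prompt by a blank line.
    return "\n".join(context) + "\n\n" + raw_prompt


def healthy_store(prompt: str) -> List[str]:
    """Example store that returns two context snippets."""
    return ["fact A", "fact B"]


def failing_store(prompt: str) -> List[str]:
    """Example store that simulates an outage."""
    raise BackingStoreUnavailable("redis unreachable")


augmented = build_prompt("hi", healthy_store)
degraded = build_prompt("hi", failing_store)
```

In the degraded case the caller receives the raw prompt unchanged, so the request still reaches the upstream LLM without augmentation.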

Horizontal Scaling

Add compute nodes and point them at the shared PostgreSQL + Redis cluster. No coordination required between nodes. Load balancing is handled at the network layer (e.g., K8s Service or a reverse proxy).
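
As a toy model of why no coordination is needed, the sketch below routes one session's requests round-robin across three nodes that share a single store. `SharedStore` stands in for the PostgreSQL + Redis cluster and `Node` for an ICE compute node; both are hypothetical names, and `itertools.cycle` plays the role of the network-layer load balancer.

```python
from itertools import cycle
from typing import Dict, List


class SharedStore:
    """Stand-in for the shared PostgreSQL + Redis cluster."""

    def __init__(self) -> None:
        self.sessions: Dict[str, List[str]] = {}


class Node:
    """Stateless compute node: holds a store reference, no session data."""

    def __init__(self, name: str, store: SharedStore) -> None:
        self.name = name
        self.store = store

    def handle(self, session_id: str, message: str) -> int:
        """Append the message to the session and return the history length."""
        history = self.store.sessions.setdefault(session_id, [])
        history.append(message)
        return len(history)


store = SharedStore()
nodes = [Node("node-a", store), Node("node-b", store), Node("node-c", store)]
balancer = cycle(nodes)  # round-robin, as a reverse proxy might do

# Four requests from the same session land on different nodes.
counts = [next(balancer).handle("session-1", f"msg {i}") for i in range(4)]
```

Each node sees the full session history even though no two consecutive requests hit the same node, which is what makes adding or removing nodes safe.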