Performance & Scalability

ICE uses a stateless kernel design: compute nodes share no state with one another and keep all persistent data in a common data layer (PostgreSQL + Redis). This enables horizontal compute scaling without session stickiness.

1. Scalability Architecture

Dimension	Design
Compute Scaling	Stateless nodes — add/remove without downtime or session migration
Data Layer	Centralized PostgreSQL (Semantic Ledger) + Redis (Hot-Cache) cluster shared across nodes
Session Affinity	None required — any node can serve any session
Streaming	SSE passthrough — ICE does not buffer LLM output tokens
Retrieval Path	Async pgvector HNSW query, executed before prompt assembly

2. Resource Governance

ICE enforces hard resource caps via environment variables. These prevent runaway processes under load.

ICE_MEMORY_CAP_GB: Hard RAM ceiling for the ICE process. Engine terminates cleanly if exceeded.
ICE_MAX_STITCH_CONCURRENCY: Maximum parallel context assembly operations. Prevents CPU saturation during high-concurrency bursts.
ICE_POST_COMPRESSION_LIMIT: Maximum final prompt size (tokens) submitted to the upstream LLM. Enforced unconditionally.
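
The following is a minimal sketch of how caps like these could be read and applied, not a description of ICE's internals. The default values, the helper names (`read_caps`, `enforce_prompt_limit`, `stitch_all`), and the choice to truncate an oversized prompt rather than reject it are all assumptions made for illustration. Memory-cap enforcement is platform-specific and is left out.

```python
import asyncio
import os


def read_caps(env=os.environ):
    """Read the three caps; the fallback defaults here are illustrative only."""
    return {
        "memory_cap_gb": float(env.get("ICE_MEMORY_CAP_GB", "4")),
        "max_stitch_concurrency": int(env.get("ICE_MAX_STITCH_CONCURRENCY", "8")),
        "post_compression_limit": int(env.get("ICE_POST_COMPRESSION_LIMIT", "8192")),
    }


def enforce_prompt_limit(tokens, limit):
    """Assumed policy: truncate so the final prompt never exceeds the limit."""
    return tokens[:limit]


async def stitch_all(jobs, max_concurrency):
    """Run context-assembly coroutines with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather preserves the order of the input jobs in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

A semaphore is one simple way to bound parallel assembly work: excess jobs wait instead of piling onto the CPU during a burst.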

3. High Availability

Fallback Mode

If PostgreSQL or Redis becomes unreachable, ICE bypasses context injection and routes the raw prompt directly to the upstream LLM. API availability is maintained, and context augmentation is suspended until the backing services recover.
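
The fallback behavior can be sketched roughly as follows. The exception class, the `build_prompt` and `fetch_context` names, and the way context is prepended to the prompt are hypothetical. Only the control flow (skip augmentation on a backing-service failure, pass the raw prompt through) reflects the description above.

```python
import asyncio


class BackingServiceUnavailable(Exception):
    """Hypothetical error raised when PostgreSQL or Redis cannot be reached."""


async def build_prompt(raw_prompt, fetch_context):
    """Return (prompt, augmented); fall back to the raw prompt on an outage."""
    try:
        context = await fetch_context(raw_prompt)
    except (BackingServiceUnavailable, ConnectionError, asyncio.TimeoutError):
        # Backing service is down: skip context injection, keep the API serving.
        return raw_prompt, False
    # Illustrative assembly: context first, then the user's prompt.
    return f"{context}\n\n{raw_prompt}", True
```

Returning the `augmented` flag alongside the prompt lets a caller log or expose whether a response was produced in fallback mode.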

Horizontal Scaling

Add compute nodes and point them at the shared PostgreSQL + Redis cluster. No coordination required between nodes. Load balancing is handled at the network layer (e.g., K8s Service or a reverse proxy).
