
Core Memory Architecture

ICE implements a 3-tier storage architecture that decouples the LLM's physical context window from the application's session history. The active context window is assembled on each request from whichever tier holds the relevant data.

1. Storage Tiers

| Tier | Component | Purpose |
|---|---|---|
| Hot-Cache | Redis | Active session prefix and recent tool-call state. |
| Semantic Ledger | PostgreSQL + pgvector | Long-term vectorized interaction history and ingested document fragments. |
| Cold Archive | Object storage (via webhook) | Archived sessions and audit logs after the retention window expires. |
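
The tier lookup can be pictured with a minimal sketch. It assumes a fall-through order (Hot-Cache first, then Semantic Ledger, then Cold Archive), which is not stated above, and it uses in-memory dictionaries in place of the real backends. `DictTier` and `lookup` are hypothetical names for illustration, not part of ICE.

```python
from typing import Dict, Optional, Sequence, Tuple


class DictTier:
    """In-memory stand-in for a storage tier (hypothetical, illustration only)."""

    def __init__(self, name: str, data: Optional[Dict[str, str]] = None) -> None:
        self.name = name
        self._data = dict(data or {})

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


def lookup(key: str, tiers: Sequence[DictTier]) -> Optional[Tuple[str, str]]:
    """Return (tier_name, value) from the first tier that holds the key.

    Assumption: tiers are consulted in the order given, fastest first.
    """
    for tier in tiers:
        value = tier.get(key)
        if value is not None:
            return tier.name, value
    return None


if __name__ == "__main__":
    tiers = [
        DictTier("hot-cache", {"session:42:prefix": "system prompt + pinned state"}),
        DictTier("semantic-ledger", {"doc:7": "ingested fragment"}),
        DictTier("cold-archive", {"session:1:audit": "archived log"}),
    ]
    print(lookup("doc:7", tiers))
```

In a real deployment each `get` would be a network call with very different latency per tier, which is the reason for trying the fastest tier first in this sketch.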

2. Retrieval Mechanics

Tiered Prompt Injection

ICE assembles the final prompt using a multi-rank retrieval system. High-relevance fragments are placed at the start and end of the prompt, the positions that models tend to attend to most reliably, while lower-ranked content is compressed or summarized and placed in between.
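
One way to picture this placement is the sketch below. It is an illustration under stated assumptions, not ICE's actual ranking code: `arrange_by_rank` is a hypothetical helper, relevance scores are assumed to be given, and plain truncation stands in for whatever compression or summarization ICE applies.

```python
from typing import List, Tuple


def arrange_by_rank(
    fragments: List[Tuple[str, float]],
    keep_full: int = 2,
    max_chars: int = 40,
) -> List[str]:
    """Order fragments so the highest-scored ones sit at the edges.

    fragments: (text, relevance_score) pairs; higher score = more relevant.
    keep_full: how many top-ranked fragments are kept verbatim (assumed knob).
    max_chars: truncation length used as a stand-in for summarization.
    """
    ranked = sorted(fragments, key=lambda f: f[1], reverse=True)
    front: List[str] = []
    back: List[str] = []
    for i, (text, _score) in enumerate(ranked):
        if i >= keep_full and len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."
        # Alternate between the two ends: rank 0 opens the prompt,
        # rank 1 closes it, and later ranks fill in toward the middle.
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]


if __name__ == "__main__":
    frags = [("A", 0.9), ("B", 0.8), ("C", 0.5), ("D", 0.2)]
    print(arrange_by_rank(frags))
```

With four fragments ranked A > B > C > D, the result starts with A and ends with B, leaving C and D in the middle.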

KV-Cache Alignment

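The engine maintains stable prompt prefixes to maximize KV-Cache hit rates in inference engines (vLLM, TGI). On a cache hit, the inference engine reuses the attention keys and values already computed for the shared prefix, so only the new suffix needs prefill. This lowers Time-To-First-Token (TTFT) for repeated session interactions.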
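
The effect of prefix stability can be sketched as follows. This is an illustration, not ICE's prompt builder: `build_prompt` and `shared_prefix_len` are hypothetical helpers, and the character-level comparison is only a proxy, since real engines match cached prefixes on tokens.

```python
from typing import List


def build_prompt(system: str, history: List[str], volatile: str) -> str:
    """Assemble a prompt with stable content first and volatile content last.

    Assumed layout: static system text, then append-only session history,
    then anything that changes per request (timestamps, fresh retrievals).
    """
    return "\n".join([system, *history, volatile])


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading run of two prompts (character proxy)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


if __name__ == "__main__":
    system = "You are helpful."
    turn1 = build_prompt(system, ["user: hi"], "time: 10:00")
    turn2 = build_prompt(system, ["user: hi", "assistant: hello"], "time: 10:01")
    # The system text and the first turn are shared between both requests.
    print(shared_prefix_len(turn1, turn2), "of", len(turn2), "characters shared")
```

Placing the per-request timestamp at the front instead would make the two prompts diverge within the first few characters, leaving almost nothing for the cache to reuse.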

Recursive Completion

For outputs that exceed the model's physical output limit, ICE executes recursive re-prompting. It captures the partial output, re-injects the current state, and triggers a continuation. This repeats until the task completes or the ICE_MAX_CONTINUATIONS limit is reached.
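
The continuation loop can be sketched as below. This is a minimal illustration under assumptions: `generate` is a hypothetical callable that returns the produced text plus a flag saying whether the model finished on its own, and `max_continuations` stands in for ICE_MAX_CONTINUATIONS (how ICE reads that setting, and its default value, are not specified here).

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: prompt in, (text, finished) out.
Generate = Callable[[str], Tuple[str, bool]]


def complete_with_continuations(
    prompt: str,
    generate: Generate,
    max_continuations: int = 3,  # assumed default, for illustration only
) -> Tuple[str, bool]:
    """Call the model repeatedly until it finishes or the limit is hit.

    Returns the concatenated output and whether the task completed.
    """
    text, finished = generate(prompt)
    parts: List[str] = [text]
    continuations = 0
    while not finished and continuations < max_continuations:
        # Re-inject the original prompt plus everything produced so far,
        # so the model can pick up where the partial output stopped.
        text, finished = generate(prompt + "".join(parts))
        parts.append(text)
        continuations += 1
    return "".join(parts), finished


if __name__ == "__main__":
    target = "abcdefghijkl"

    def fake_model(p: str) -> Tuple[str, bool]:
        # Toy model: emits at most 5 characters of `target` per call.
        produced = p[len("Q:"):]
        chunk = target[len(produced):len(produced) + 5]
        return chunk, len(produced) + len(chunk) >= len(target)

    print(complete_with_continuations("Q:", fake_model, max_continuations=5))
```

Returning the completion flag alongside the text lets the caller distinguish a finished answer from one that was cut off at the continuation limit.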

3. Data Integrity

  • Context Pinning: Explicitly pins critical session metadata or recent tool results to the active window.
  • Semantic Partitioning: Automatic chunking and overlap management for high-fidelity retrieval.
  • Prefetching: Asynchronous pre-retrieval of likely context based on session trajectory.
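
The overlap management mentioned under Semantic Partitioning can be illustrated with a fixed-size sliding window. This sketch is an assumption about the general technique, not ICE's chunker: `chunk_with_overlap` is a hypothetical helper, and the `size` and `overlap` values are arbitrary.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_with_overlap(tokens: Sequence[T], size: int, overlap: int) -> List[List[T]]:
    """Split tokens into windows of `size` that share `overlap` items.

    Each window starts `size - overlap` items after the previous one, so
    content near a chunk boundary appears in both neighbouring chunks.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog again".split()
    for chunk in chunk_with_overlap(words, size=4, overlap=1):
        print(chunk)
```

A larger overlap reduces the chance that a sentence relevant to a query is split across two chunks, at the cost of storing and embedding more duplicated text.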