Multimodal Ingestion

The ICE ingestion pipeline decomposes multimodal data into a unified Semantic Ledger, providing a persistent substrate for context-augmented reasoning.

1. Knowledge Extraction Pipeline

ICE natively processes diverse file formats and URLs, converting them into structured ContentLists for vectorization and retrieval.

Supported Formats

  • Documentation: PDF, DOCX, TXT, MD
  • Structured Data: JSON, YAML, XLSX, CSV
  • Source Code: Python, JavaScript, TypeScript, C++, Rust, etc.
  • Web Content: Direct URL ingestion (including YouTube transcripts).
  • Multimodal: Image-to-text extraction and diagram analysis.

2. Technical Mechanics

Semantic Partitioning

The engine chunks text automatically according to ICE_CHUNK_SIZE and ICE_CHUNK_OVERLAP. It maintains the document hierarchy so that headers, tables, and nested structures are preserved in the retrieval index.
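
The interaction between chunk size and overlap can be illustrated with a minimal sliding-window sketch. This is not ICE's actual chunker: it assumes both settings are measured in characters, ignores the hierarchy preservation described above, and uses hypothetical names (chunk_text, load_chunk_settings) and illustrative defaults.

```python
import os


def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


def load_chunk_settings(env=None) -> tuple[int, int]:
    """Read the two settings from the environment.

    The fallback values are illustrative assumptions, not documented ICE defaults.
    """
    env = os.environ if env is None else env
    size = int(env.get("ICE_CHUNK_SIZE", "1200"))
    overlap = int(env.get("ICE_CHUNK_OVERLAP", "100"))
    return size, overlap


if __name__ == "__main__":
    print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
    # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last `overlap` characters of its predecessor, which is what lets a retrieval hit near a chunk boundary still carry some surrounding context.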

Multimodal Extraction

  • ICE_INGEST_ENABLE_IMAGES: Extracts semantic data from diagrams and charts.
  • ICE_INGEST_ENABLE_TABLES: Converts complex tables to Markdown with high fidelity.
  • ICE_INGEST_ENABLE_EQUATIONS: Extracts mathematical equations (LaTeX support).
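
As a rough sketch of how a deployment script might read these three switches before starting an ingestion job: the accepted truthy strings, the disabled-by-default behavior, and the names env_flag, ExtractionFlags, and load_extraction_flags are assumptions made for illustration, not part of the ICE SDK.

```python
import os
from dataclasses import dataclass

# Assumed set of truthy spellings; ICE may accept a different set.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, env, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean switch."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


@dataclass(frozen=True)
class ExtractionFlags:
    images: bool
    tables: bool
    equations: bool


def load_extraction_flags(env=None) -> ExtractionFlags:
    """Collect the multimodal extraction switches into one immutable object."""
    env = os.environ if env is None else env
    return ExtractionFlags(
        images=env_flag("ICE_INGEST_ENABLE_IMAGES", env),
        tables=env_flag("ICE_INGEST_ENABLE_TABLES", env),
        equations=env_flag("ICE_INGEST_ENABLE_EQUATIONS", env),
    )


if __name__ == "__main__":
    print(load_extraction_flags())
```

Gathering the flags into one frozen object makes it easy to log the effective configuration alongside each ingestion run.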

Configurable Parsers

ICE integrates with specialized parsing engines for complex document layouts.

  • Docling: Optimized for enterprise PDFs and high-fidelity layout preservation.
  • MinerU: Specialized for academic papers and mathematically dense content.

3. Implementation Specification

Ingestion Request

Ingestion is triggered via the SDK or directly through the REST API. Local files must be located within the sandboxed ICE_UPLOAD_DIR; cloud URIs (s3://, gs://) are not subject to this restriction.

# Ingestion into the Semantic Ledger: local file
ice.ingest(
    session_id="research_01",
    file_path="market_report.pdf",
    metadata={"source": "q3_projections"},
)

# Ingestion from cloud object storage: AWS S3
ice.ingest(
    session_id="research_01",
    uri="s3://my-enterprise-bucket/project_alpha_docs/",
    metadata={"source": "q3_projections"},
)

# Ingestion from cloud object storage: Google Cloud Storage
ice.ingest(
    session_id="research_01",
    uri="gs://my-gcp-bucket/project_alpha_docs/",
    metadata={"source": "q3_projections"},
)

4. Operational Guardrails

  • Sandboxed Uploads: Local file ingestion is restricted to ICE_UPLOAD_DIR. Cloud URI ingestion (s3://, gs://) bypasses this restriction; credentials are resolved from the host environment.
  • Compliance Scrubbing: PII is redacted automatically during the ingestion phase (ICE_INGEST_ENABLE_COMPLIANCE).
  • Atomic Transactions: The Semantic Ledger is updated only after vectorization and indexing succeed.
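
The sandbox rule for local files can be pre-checked on the client side before calling the SDK. The sketch below is one way such a check might look, not ICE's own enforcement logic: resolve_in_sandbox and is_cloud_uri are hypothetical helpers, and the example upload directory is a placeholder.

```python
from pathlib import Path
from urllib.parse import urlparse

# Schemes the guardrails above describe as bypassing the upload sandbox.
CLOUD_SCHEMES = {"s3", "gs"}


def is_cloud_uri(source: str) -> bool:
    """Return True for s3:// and gs:// sources."""
    return urlparse(source).scheme in CLOUD_SCHEMES


def resolve_in_sandbox(upload_dir: str, file_path: str) -> Path:
    """Resolve file_path against upload_dir and reject anything that escapes it.

    Resolving first collapses '..' segments and symlinks, so a path such as
    '../secrets.txt' is caught even though it starts inside the sandbox.
    Requires Python 3.9+ for Path.is_relative_to.
    """
    root = Path(upload_dir).resolve()
    candidate = (root / file_path).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError(f"{file_path!r} is outside {root}")
    return candidate


if __name__ == "__main__":
    upload_dir = "/tmp/ice_uploads"  # placeholder for the real ICE_UPLOAD_DIR value
    for source in ("market_report.pdf", "s3://my-enterprise-bucket/project_alpha_docs/"):
        if is_cloud_uri(source):
            print("cloud source, sandbox check skipped:", source)
        else:
            print("local source resolved to:", resolve_in_sandbox(upload_dir, source))
```

Failing fast on the client avoids a round trip for requests the server would reject anyway, but it does not replace the server-side restriction.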