Multimodal Ingestion
The ICE ingestion pipeline decomposes multimodal data into a unified Semantic Ledger, providing a persistent substrate for context-augmented reasoning.
1. Knowledge Extraction Pipeline
ICE natively processes diverse file formats and URLs, converting them into structured ContentLists for vectorization and retrieval.
Supported Formats
- Documentation: PDF, DOCX, TXT, MD
- Structured Data: JSON, YAML, XLSX, CSV
- Source Code: Python, JavaScript, TypeScript, C++, Rust, etc.
- Web Content: Direct URL ingestion (including YouTube transcripts).
- Multimodal: Image-to-text extraction and diagram analysis.
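The categories above can be pictured as a lookup from file extension to format family. The following is a minimal sketch, not ICE's own routing logic: the `classify` helper, the category labels, and the exact extension table are hypothetical and cover only the formats named in the list.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical lookup table built from the formats listed above.
CATEGORY_BY_SUFFIX = {
    ".pdf": "documentation", ".docx": "documentation",
    ".txt": "documentation", ".md": "documentation",
    ".json": "structured_data", ".yaml": "structured_data",
    ".yml": "structured_data", ".xlsx": "structured_data",
    ".csv": "structured_data",
    ".py": "source_code", ".js": "source_code", ".ts": "source_code",
    ".cpp": "source_code", ".rs": "source_code",
}


def classify(source: str) -> str:
    """Return an assumed format family for a file name or URL."""
    # HTTP(S) sources are treated as web content regardless of suffix.
    if urlparse(source).scheme in ("http", "https"):
        return "web"
    suffix = PurePosixPath(source).suffix.lower()
    return CATEGORY_BY_SUFFIX.get(suffix, "unsupported")


for name in ("market_report.PDF", "main.rs", "https://example.com/post", "archive.zip"):
    print(name, "->", classify(name))
```

Image inputs are left out of the table because the list does not name specific image extensions.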
2. Technical Mechanics
Semantic Partitioning
The engine chunks text automatically according to ICE_CHUNK_SIZE and ICE_CHUNK_OVERLAP. It maintains document hierarchy, so headers, tables, and nested structures are preserved in the retrieval index.
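To illustrate how a chunk size and an overlap interact, here is a minimal sketch of sliding-window chunking. It assumes both settings are measured in characters, which this page does not state (ICE may count tokens instead), and it does not model the hierarchy preservation described above. The `chunk_text` function is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary remains retrievable from at least one chunk.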
Multimodal Extraction
- ICE_INGEST_ENABLE_IMAGES: Extracts semantic data from diagrams and charts.
- ICE_INGEST_ENABLE_TABLES: Converts complex tables to Markdown with high fidelity.
- ICE_INGEST_ENABLE_EQUATIONS: Extracts mathematical equations (LaTeX support).
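The three toggles above could be read as boolean settings along the following lines. This is a sketch under stated assumptions: that the flags are environment variables, that values such as "1" or "true" enable a feature, and that an unset flag means disabled. None of this is confirmed by this page, and `read_flags` is a hypothetical helper.

```python
import os

FLAG_NAMES = (
    "ICE_INGEST_ENABLE_IMAGES",
    "ICE_INGEST_ENABLE_TABLES",
    "ICE_INGEST_ENABLE_EQUATIONS",
)

# Assumed spellings of "enabled"; the real accepted values may differ.
TRUTHY = {"1", "true", "yes", "on"}


def read_flags(env=None) -> dict[str, bool]:
    """Map each extraction flag to True/False, treating unset as disabled."""
    env = os.environ if env is None else env
    return {
        name: env.get(name, "").strip().lower() in TRUTHY
        for name in FLAG_NAMES
    }


# A plain dict stands in for the process environment in this example.
flags = read_flags({"ICE_INGEST_ENABLE_TABLES": "true", "ICE_INGEST_ENABLE_IMAGES": "0"})
print(flags)
```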
Configurable Parsers
ICE integrates with specialized parsing engines for complex document layouts.
- Docling: Optimized for enterprise PDFs and high-fidelity layout preservation.
- MinerU: Specialized for academic papers and mathematically dense content.
3. Implementation Specification
Ingestion Request
Ingestion is triggered via the SDK or the direct REST API. Local files must be located within the sandboxed ICE_UPLOAD_DIR.
# Ingestion into the Semantic Ledger: local file
ice.ingest(
    session_id="research_01",
    file_path="market_report.pdf",
    metadata={"source": "q3_projections"},
)

# Ingestion from cloud object storage: AWS S3
ice.ingest(
    session_id="research_01",
    uri="s3://my-enterprise-bucket/project_alpha_docs/",
    metadata={"source": "q3_projections"},
)

# Ingestion from cloud object storage: Google Cloud Storage
ice.ingest(
    session_id="research_01",
    uri="gs://my-gcp-bucket/project_alpha_docs/",
    metadata={"source": "q3_projections"},
)
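The sandbox requirement on local paths can be checked on the client side before a request is sent. The sketch below shows one common way to do this with `pathlib`. It is not ICE's own enforcement code, `resolve_in_sandbox` is a hypothetical helper, and the upload directory shown is a placeholder.

```python
from pathlib import Path


def resolve_in_sandbox(upload_dir: str, file_path: str) -> Path:
    """Resolve file_path against upload_dir and reject paths that escape it."""
    root = Path(upload_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks where they exist.
    candidate = (root / file_path).resolve()
    if not candidate.is_relative_to(root):  # requires Python 3.9+
        raise ValueError(f"{file_path!r} is outside the upload directory")
    return candidate


# Placeholder directory; substitute the value of ICE_UPLOAD_DIR.
upload_dir = "/tmp/ice_uploads"
print(resolve_in_sandbox(upload_dir, "market_report.pdf").name)

try:
    resolve_in_sandbox(upload_dir, "../secrets.txt")
except ValueError as exc:
    print("rejected:", exc)
```

Resolving both paths before comparing them is what defeats `..` traversal. A plain string-prefix comparison would accept a path such as `/tmp/ice_uploads_other/file.pdf`.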
4. Operational Guardrails
- Sandboxed Uploads: Local file ingestion is restricted to ICE_UPLOAD_DIR. Cloud URI ingestion (s3://, gs://) bypasses this restriction; credentials are resolved from the host environment.
- Compliance Scrubbing: Automatic PII redaction during the ingestion phase (ICE_INGEST_ENABLE_COMPLIANCE).
- Atomic Transactions: Ensures the Semantic Ledger is only updated upon successful vectorization and indexing.
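The atomic-transaction guarantee can be illustrated with a stage-then-commit pattern: all work happens on a staging list, and the ledger is touched only after every chunk has been processed. This is a toy sketch, not ICE's implementation. The `Ledger` class and `toy_embed` function are hypothetical stand-ins for the Semantic Ledger and the vectorization step.

```python
class Ledger:
    """Toy in-memory stand-in for the Semantic Ledger (hypothetical)."""

    def __init__(self):
        self.entries: list[tuple[str, list[float]]] = []

    def ingest_atomically(self, chunks, embed) -> int:
        """Embed every chunk first; commit only if none of them fails."""
        staged = []
        for chunk in chunks:
            staged.append((chunk, embed(chunk)))  # may raise; ledger untouched so far
        # Reached only when every chunk was embedded successfully.
        self.entries.extend(staged)
        return len(staged)


def toy_embed(chunk: str) -> list[float]:
    """Placeholder vectorizer that fails on empty input."""
    if not chunk:
        raise ValueError("cannot embed an empty chunk")
    return [float(len(chunk))]


ledger = Ledger()
ledger.ingest_atomically(["alpha", "beta"], toy_embed)

try:
    ledger.ingest_atomically(["gamma", ""], toy_embed)  # second chunk fails
except ValueError:
    pass

print(len(ledger.entries))  # → 2
```

The failed batch leaves no partial entries behind: "gamma" was embedded but never committed, so the ledger still holds only the two entries from the first call.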