Concepts

The Universal Intermediate Representation

helix_ir represents all data in one universal intermediate representation: a typed, lineage-aware, schema-annotated DAG. This collapses the modern data stack from four tools (ingestion, transformation, DDL, lineage) into one library.

The Type System

Every value is classified into a HelixType — an Apache Arrow logical type extended with Helix metadata:

  • null_ratio — observed fraction of nulls

  • cardinality_estimate — HyperLogLog estimate of unique values

  • confidence — how certain the inference engine is about this type

  • pii_class — PII classification (email, phone, PAN, etc.)

  • semantic — semantic hint (email, url, uuid, enum)

The Type Lattice

When the inference engine sees the same field take two different types across documents, it merges them using the type lattice:

  • join(Int32, Int64)Int64 (numeric widening)

  • join(String, Int64)String (string fallback)

  • join(Struct, Struct) → merged Struct (recursive)

  • join(T1, T2)Union(T1, T2) if no rule applies

The Schema

A Schema is a named, ordered, immutable collection of (field_name, HelixType) pairs. Schemas are hashable, JSON-serializable, and Arrow-compatible.

Normalization Strategies

Strategy

Behaviour

1nf

Strict first normal form — all arrays and complex structs become child tables with foreign keys

mongo

Structs stored as JSON blobs, arrays split

``inline_small``| Structs with ≤ N leaves inlined; larger ones split

custom

Caller-supplied rules per path