Concepts
The Universal Intermediate Representation
helix_ir represents all data in one universal intermediate representation:
a typed, lineage-aware, schema-annotated DAG. This collapses the modern data
stack from four tools (ingestion, transformation, DDL, lineage) into one library.
The Type System
Every value is classified into a HelixType — an
Apache Arrow logical type extended with Helix metadata:
null_ratio — observed fraction of nulls
cardinality_estimate — HyperLogLog estimate of unique values
confidence — how certain the inference engine is about this type
pii_class — PII classification (email, phone, PAN, etc.)
semantic — semantic hint (email, url, uuid, enum)
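
To make the metadata fields above concrete, here is a minimal sketch of how such an annotated type could be modelled. The class name `HelixTypeSketch`, the `arrow_type` string field, and the assumption that `null_ratio` and `confidence` lie in [0, 1] are illustrative only; they are not taken from the helix_ir API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HelixTypeSketch:
    """Hypothetical stand-in for a HelixType; not the library's real class."""

    arrow_type: str                   # name of the Arrow logical type, e.g. "string"
    null_ratio: float = 0.0           # observed fraction of nulls (assumed in [0, 1])
    cardinality_estimate: int = 0     # approximate count of unique values
    confidence: float = 1.0           # inference certainty (assumed in [0, 1])
    pii_class: Optional[str] = None   # e.g. "email", "phone", "PAN"
    semantic: Optional[str] = None    # e.g. "email", "url", "uuid", "enum"

    def __post_init__(self):
        # Reject ratios outside the assumed [0, 1] range.
        if not 0.0 <= self.null_ratio <= 1.0:
            raise ValueError("null_ratio must be in [0, 1]")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


# Example values are made up for illustration.
email_type = HelixTypeSketch(
    arrow_type="string",
    null_ratio=0.02,
    cardinality_estimate=9800,
    confidence=0.97,
    pii_class="email",
    semantic="email",
)
```

Freezing the dataclass keeps the annotation immutable and hashable, which is convenient if types are used as dictionary keys or compared during merging.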
The Type Lattice
When the inference engine sees the same field take two different types across documents, it merges them using the type lattice:
join(Int32, Int64) → Int64 (numeric widening)
join(String, Int64) → String (string fallback)
join(Struct, Struct) → merged Struct (recursive)
join(T1, T2) → Union(T1, T2) if no rule applies
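
The four join rules can be sketched as a small recursive function. This is a simplified model, not helix_ir's implementation: primitive types are plain strings, structs are dicts mapping field names to types, and unions are `("Union", a, b)` tuples. Two behaviours are assumptions beyond what the rules state: the string fallback is applied to any integer type (the rules only list `Int64`), and a field present in only one struct is carried over unchanged.

```python
# Bit widths used for numeric widening (only the two types named in the rules).
INT_WIDTH = {"Int32": 32, "Int64": 64}


def is_int(t):
    """True if t is one of the known integer type names."""
    return isinstance(t, str) and t in INT_WIDTH


def join(a, b):
    """Merge two observed types into their least common type (sketch)."""
    if a == b:
        return a
    # Numeric widening: keep the wider integer type.
    if is_int(a) and is_int(b):
        return a if INT_WIDTH[a] >= INT_WIDTH[b] else b
    # String fallback: a string mixed with an integer becomes a string.
    if "String" in (a, b) and (is_int(a) or is_int(b)):
        return "String"
    # Recursive struct merge: join shared fields, keep the rest as-is (assumption).
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for name, field_type in b.items():
            merged[name] = join(merged[name], field_type) if name in merged else field_type
        return merged
    # No rule applies: fall back to a union of both types.
    return ("Union", a, b)


widened = join("Int32", "Int64")                                      # "Int64"
fallback = join("String", "Int64")                                    # "String"
merged = join({"id": "Int32"}, {"id": "Int64", "name": "String"})     # {"id": "Int64", "name": "String"}
union = join("Bool", "Int64")                                         # ("Union", "Bool", "Int64")
```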
The Schema
A Schema is a named, ordered, immutable collection
of (field_name, HelixType) pairs. Schemas are hashable, JSON-serializable,
and Arrow-compatible.
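
As a rough illustration of the named, ordered, immutable, hashable, and JSON-serializable properties, the sketch below models a schema with a frozen dataclass. `SchemaSketch`, `to_json`, and `from_json` are hypothetical names, field types are reduced to plain strings, and Arrow compatibility is not covered.

```python
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SchemaSketch:
    """Hypothetical model of a schema; not the helix_ir Schema class."""

    name: str
    fields: Tuple[Tuple[str, str], ...]  # ordered (field_name, type_name) pairs

    def to_json(self) -> str:
        # Serialize fields as a list of pairs so their order is preserved.
        return json.dumps({"name": self.name, "fields": [list(p) for p in self.fields]})

    @classmethod
    def from_json(cls, text: str) -> "SchemaSketch":
        raw = json.loads(text)
        return cls(raw["name"], tuple((n, t) for n, t in raw["fields"]))


users = SchemaSketch("users", (("id", "Int64"), ("email", "String")))
restored = SchemaSketch.from_json(users.to_json())

# Equal schemas hash equally, so they can serve as cache or dictionary keys.
cache = {users: "compiled plan"}
```

Because field order is part of the tuple, two schemas with the same fields in a different order compare as unequal in this sketch.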
Normalization Strategies
| Strategy | Behaviour |
|---|---|
| | Strict first normal form: all arrays and complex structs become child tables with foreign keys |
| | Structs stored as JSON blobs, arrays split |
| ``inline_small`` | Structs with ≤ N leaves inlined; larger ones split |
| | Caller-supplied rules per path |
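
The ``inline_small`` rule can be sketched as a leaf count compared against a threshold. The helper names `count_leaves` and `plan_inline_small`, the `max_leaves` parameter (standing in for N), and the `"inline"`, `"split"`, and `"column"` labels are hypothetical. The sketch assumes a leaf is any non-struct value and only decides the fate of struct fields; it does not model how arrays are handled.

```python
def count_leaves(value):
    """Count non-struct values nested inside a struct (modelled as a dict)."""
    if isinstance(value, dict):
        return sum(count_leaves(v) for v in value.values())
    return 1


def plan_inline_small(record, max_leaves):
    """Decide, per top-level field, whether a struct is inlined or split."""
    plan = {}
    for field, value in record.items():
        if isinstance(value, dict):
            # Structs with at most max_leaves leaves are inlined; larger ones are split.
            plan[field] = "inline" if count_leaves(value) <= max_leaves else "split"
        else:
            # Non-struct values stay as ordinary columns in this sketch.
            plan[field] = "column"
    return plan


record = {
    "id": 1,
    "address": {"city": "Oslo", "zip": "0150"},                 # 2 leaves
    "profile": {"age": 30, "lang": "no", "prefs": {"a": 1, "b": 2}},  # 4 leaves
}
plan = plan_inline_small(record, max_leaves=3)
# plan == {"id": "column", "address": "inline", "profile": "split"}
```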