helix_ir.infer
helix_ir.infer — schema inference from document streams.
- helix_ir.infer.infer(documents: Iterable[dict[str, Any]], name: str = 'inferred', sample_size: int = 2000, seed: int | None = None, detect_pii: bool = True, pii_locale: str = 'in', max_union_members: int = 4, fail_on_empty: bool = True) → Schema
Infer a Schema from an iterable of documents.
- Parameters:
documents – Iterable of dicts representing rows/documents.
name – Name for the resulting Schema.
sample_size – Maximum number of documents to sample from the stream, using reservoir sampling (Algorithm R).
seed – Random seed for reproducible sampling.
detect_pii – If True, annotate fields with PII classification.
pii_locale – Locale string for PII detection ('in', 'us', 'eu', 'all').
max_union_members – Maximum number of union members before falling back to JsonBlob.
fail_on_empty – If True, raise EmptySourceError when input is empty.
- Returns:
A Schema inferred from the document stream.
- Raises:
EmptySourceError – If documents is empty and fail_on_empty is True.
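The sample_size and seed parameters refer to Algorithm R, the classic single-pass reservoir sampling scheme. The sketch below illustrates that technique in isolation; the helper name reservoir_sample is hypothetical and is not part of helix_ir, and the library's actual sampling code may differ in its details.

```python
from __future__ import annotations

import random
from typing import Any, Iterable


def reservoir_sample(
    documents: Iterable[dict[str, Any]],
    sample_size: int = 2000,
    seed: int | None = None,
) -> list[dict[str, Any]]:
    """Illustrative Algorithm R: uniform sample of at most sample_size items.

    Hypothetical helper, not the helix_ir implementation.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    reservoir: list[dict[str, Any]] = []
    for index, doc in enumerate(documents):
        if index < sample_size:
            # Fill the reservoir with the first sample_size documents.
            reservoir.append(doc)
        else:
            # Pick a slot in [0, index]; replace only if it lands in the reservoir.
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = doc
    return reservoir


docs = ({"id": i} for i in range(10_000))
sample = reservoir_sample(docs, sample_size=100, seed=7)
print(len(sample))
```

Because the stream is consumed once and only sample_size documents are held at a time, this approach works on generators and other inputs whose length is not known in advance. Passing the same seed over the same input order yields the same sample.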