helix_ir.infer

helix_ir.infer — schema inference from document streams.

helix_ir.infer.infer(documents: Iterable[dict[str, Any]], name: str = 'inferred', sample_size: int = 2000, seed: int | None = None, detect_pii: bool = True, pii_locale: str = 'in', max_union_members: int = 4, fail_on_empty: bool = True) -> Schema

Infer a Schema from an iterable of documents.

Parameters:
  • documents – Iterable of dicts representing rows/documents.

  • name – Name for the resulting Schema.

  • sample_size – Maximum number of documents to sample, using reservoir sampling (Algorithm R).

  • seed – Random seed for reproducible sampling.

  • detect_pii – If True, annotate fields with PII classification.

  • pii_locale – Locale string for PII detection ('in', 'us', 'eu', 'all').

  • max_union_members – Maximum number of union members allowed before the field type falls back to JsonBlob.

  • fail_on_empty – If True, raise EmptySourceError when input is empty.
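
To illustrate how sample_size and seed interact, here is a minimal sketch of Algorithm R (reservoir sampling) in plain Python. It is not helix_ir's implementation; the function name reservoir_sample is hypothetical, and only the parameter names and defaults are taken from the signature above.

```python
import random
from typing import Any, Iterable, Optional


def reservoir_sample(
    documents: Iterable[dict[str, Any]],
    sample_size: int = 2000,
    seed: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Return at most sample_size documents, chosen uniformly in a single pass."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    reservoir: list[dict[str, Any]] = []
    for index, doc in enumerate(documents):
        if index < sample_size:
            # Fill the reservoir with the first sample_size documents.
            reservoir.append(doc)
        else:
            # Draw a slot in [0, index]; the document replaces an existing
            # entry only when the slot lands inside the reservoir.
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = doc
    return reservoir
```

Because the stream is read once and only sample_size documents are held in memory, this approach works on generators of unknown length. If the stream has fewer documents than sample_size, every document is kept in its original order.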

Returns:

A Schema inferred from the document stream.

Raises:

EmptySourceError – If documents is empty and fail_on_empty is True.
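
Detecting an empty input is less obvious when documents is a one-shot iterable such as a generator, since checking for a first element consumes it. The sketch below shows one way such a check can be written without losing that element. It is an assumption-laden illustration, not helix_ir's code: require_nonempty is a hypothetical helper, and the EmptySourceError class defined here is a local stand-in for the exception named above.

```python
from itertools import chain
from typing import Any, Iterable, Iterator


class EmptySourceError(Exception):
    """Local stand-in for the documented exception (hypothetical definition)."""


def require_nonempty(
    documents: Iterable[dict[str, Any]],
) -> Iterator[dict[str, Any]]:
    """Raise EmptySourceError if documents is empty, else return all of them."""
    iterator = iter(documents)
    try:
        first = next(iterator)  # peek at the first document
    except StopIteration:
        raise EmptySourceError("no documents in source") from None
    # Put the peeked document back in front of the rest of the stream.
    return chain([first], iterator)
```

The function is deliberately not a generator, so the check runs at call time instead of being deferred until the result is first iterated.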