helix_ir.sources

helix_ir.sources — data source connectors.

class helix_ir.sources.Source(*args, **kwargs)

Bases: Protocol

Protocol for all Helix IR data sources.

read() -> Iterable[dict[str, Any]]

Yield documents from this source.

schema_hint() -> dict[str, Any] | None

Return an optional schema hint dict, or None.
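
Because Source is a Protocol, any class that provides both methods should satisfy it structurally, without inheriting from it. The following is a minimal sketch under that assumption: the local `Source` class is a stand-in that mirrors the two documented methods rather than the real `helix_ir.sources.Source`, and `InMemorySource` is a hypothetical name used only for illustration.

```python
from __future__ import annotations

from typing import Any, Iterable, Protocol, runtime_checkable


@runtime_checkable
class Source(Protocol):
    """Local stand-in mirroring the two documented methods (assumption)."""

    def read(self) -> Iterable[dict[str, Any]]: ...

    def schema_hint(self) -> dict[str, Any] | None: ...


class InMemorySource:
    """Hypothetical source that serves documents from a list."""

    def __init__(self, docs: list[dict[str, Any]]) -> None:
        self._docs = docs

    def read(self) -> Iterable[dict[str, Any]]:
        # Yield one document at a time, as the protocol describes.
        yield from self._docs

    def schema_hint(self) -> dict[str, Any] | None:
        # No hint is available here; the protocol allows None.
        return None


source = InMemorySource([{"id": 1}, {"id": 2}])
docs = list(source.read())
```

Note that `InMemorySource` never subclasses `Source`; matching the method names is what makes it compatible.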

class helix_ir.sources.JSONSource(path: str, format: str = 'auto')

Bases: object

Read documents from a JSON or NDJSON file.

read() -> Iterable[dict[str, Any]]

schema_hint() -> dict[str, Any] | None
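
The difference between the two file layouts can be sketched with the standard library alone. This is not JSONSource's implementation: how `format='auto'` is resolved is not documented here, so the extension-based guess below is an assumption, and `iter_json_documents` is a hypothetical helper name.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Iterator


def iter_json_documents(path: str, format: str = "auto") -> Iterator[dict[str, Any]]:
    """Yield documents from a JSON or NDJSON file (illustrative only)."""
    if format == "auto":
        # Assumption: pick the layout from the file extension.
        is_ndjson = Path(path).suffix.lower() in {".ndjson", ".jsonl"}
        format = "ndjson" if is_ndjson else "json"

    with open(path, encoding="utf-8") as fh:
        if format == "ndjson":
            # NDJSON: one JSON document per non-empty line.
            for line in fh:
                line = line.strip()
                if line:
                    yield json.loads(line)
        elif format == "json":
            data = json.load(fh)
            # A top-level array yields each element; a single object yields itself.
            if isinstance(data, list):
                yield from data
            else:
                yield data
        else:
            raise ValueError(f"unsupported format: {format!r}")
```

Reading NDJSON line by line keeps memory use flat for large files, whereas a plain JSON file has to be parsed in full before the first document is available.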

class helix_ir.sources.ParquetSource(path: str, batch_size: int = 1000)

Bases: object

Read documents from a Parquet file using PyArrow.

read() -> Iterable[dict[str, Any]]

schema_hint() -> dict[str, Any] | None
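
One plausible way to turn a Parquet file into a stream of dict documents with PyArrow is to read it in record batches. The sketch below assumes that `batch_size` means rows per batch, which the reference above does not state, and `iter_parquet_rows` is a hypothetical helper, not part of helix_ir. It requires the `pyarrow` package.

```python
from __future__ import annotations

from typing import Any, Iterator


def iter_parquet_rows(path: str, batch_size: int = 1000) -> Iterator[dict[str, Any]]:
    """Yield one dict per row, reading the file batch by batch (illustrative only)."""
    # Imported lazily so the module still loads when pyarrow is not installed.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    # iter_batches streams RecordBatch objects of at most batch_size rows.
    for batch in parquet_file.iter_batches(batch_size=batch_size):
        # to_pylist converts the batch into a list of {column: value} dicts.
        yield from batch.to_pylist()
```

Reading in batches bounds memory use by the batch size rather than the file size, which is presumably why the constructor exposes `batch_size`.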

class helix_ir.sources.RestSource(url: str, headers: dict[str, str] | None = None, data_key: str | None = None, pagination: dict[str, Any] | None = None, max_pages: int = 100)

Bases: object

Read documents from a paginated REST API.

read() -> Iterable[dict[str, Any]]

schema_hint() -> dict[str, Any] | None
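
The roles of `data_key` and `max_pages` can be illustrated with a generic pagination loop. The keys accepted by the `pagination` dict are not documented here, so this sketch assumes a response that carries the URL of the next page under a `next` field. `iter_paginated`, `fetch_page`, and `next_key` are hypothetical names, and the HTTP call is replaced by a plain callable so the example runs offline.

```python
from __future__ import annotations

from typing import Any, Callable, Iterator


def iter_paginated(
    fetch_page: Callable[[str], Any],
    url: str,
    data_key: str | None = None,
    next_key: str = "next",
    max_pages: int = 100,
) -> Iterator[dict[str, Any]]:
    """Yield records page by page until there is no next page or max_pages is hit."""
    next_url: str | None = url
    pages = 0
    while next_url and pages < max_pages:
        payload = fetch_page(next_url)
        pages += 1
        # data_key selects the list of records inside the response body.
        records = payload[data_key] if data_key else payload
        if isinstance(records, dict):
            records = [records]
        yield from records
        # Assumption: the response names the next page under next_key.
        next_url = payload.get(next_key) if isinstance(payload, dict) else None


# Fake two-page API used in place of real HTTP requests.
pages = {
    "/items?page=1": {"items": [{"id": 1}, {"id": 2}], "next": "/items?page=2"},
    "/items?page=2": {"items": [{"id": 3}], "next": None},
}
records = list(iter_paginated(pages.__getitem__, "/items?page=1", data_key="items"))
```

The `max_pages` cap guards against an API that never stops returning a next-page link.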