helix_ir.sources
helix_ir.sources — data source connectors.
-
class helix_ir.sources.Source(*args, **kwargs)
Bases: Protocol
Protocol for all Helix IR data sources.
-
read() → Iterable[dict[str, Any]]
Yield documents from this source.
-
schema_hint() → dict[str, Any] | None
Return an optional schema hint dict, or None.
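
Because Source is a Protocol, any class with matching read() and schema_hint() methods should satisfy it structurally, without inheriting from it. The following is a minimal sketch under stated assumptions: the Source class here is a local stand-in that mirrors the two documented methods so the example runs without helix_ir installed, ListSource is a hypothetical name, and the runtime_checkable decorator is added only for the demonstration (the page does not say whether the library's protocol supports isinstance checks).

```python
from __future__ import annotations

from typing import Any, Iterable, Protocol, runtime_checkable


@runtime_checkable
class Source(Protocol):
    """Local stand-in mirroring the documented protocol (not the library's class)."""

    def read(self) -> Iterable[dict[str, Any]]: ...

    def schema_hint(self) -> dict[str, Any] | None: ...


class ListSource:
    """Hypothetical in-memory source; note that it does not subclass Source."""

    def __init__(self, docs: list[dict[str, Any]]) -> None:
        self._docs = docs

    def read(self) -> Iterable[dict[str, Any]]:
        # Yield one document at a time, as the protocol describes.
        yield from self._docs

    def schema_hint(self) -> dict[str, Any] | None:
        # No schema information is available for ad-hoc data.
        return None


source = ListSource([{"id": 1}, {"id": 2}])
docs = list(source.read())
```

With the real library, the same ListSource class would presumably be accepted wherever a helix_ir.sources.Source is expected, since protocols are matched by method shape rather than by base class.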
-
class helix_ir.sources.JSONSource(path: str, format: str = 'auto')
Bases: object
Read documents from a JSON or NDJSON file.
-
read() → Iterable[dict[str, Any]]
-
schema_hint() → dict[str, Any] | None
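
This page does not describe how format='auto' chooses between JSON and NDJSON. As an illustration only, the sketch below shows one plausible detection rule using the standard library. The helper name iter_json_documents, the accepted format values 'json' and 'ndjson', and the "leading bracket means JSON array" heuristic are all assumptions, not the library's documented behavior.

```python
from __future__ import annotations

import json
from typing import Any, Iterator


def iter_json_documents(path: str, format: str = "auto") -> Iterator[dict[str, Any]]:
    """Hypothetical helper: yield documents from a JSON array file or an NDJSON file."""
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    if format == "auto":
        # Assumption: a leading "[" means a single JSON array; otherwise treat as NDJSON.
        format = "json" if text.lstrip().startswith("[") else "ndjson"
    if format == "json":
        data = json.loads(text)
        # A single top-level object is treated as a one-document file.
        yield from (data if isinstance(data, list) else [data])
    elif format == "ndjson":
        for line in text.splitlines():
            if line.strip():  # skip blank lines
                yield json.loads(line)
    else:
        raise ValueError(f"unsupported format: {format!r}")
```

One limitation of this heuristic: a pretty-printed single JSON object spanning several lines would be misread as NDJSON in 'auto' mode, so passing format='json' explicitly is safer for such files. This sketch also reads the whole file into memory, which a production reader would likely avoid for large NDJSON inputs.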
-
class helix_ir.sources.ParquetSource(path: str, batch_size: int = 1000)
Bases: object
Read documents from a Parquet file using PyArrow.
-
read() → Iterable[dict[str, Any]]
-
schema_hint() → dict[str, Any] | None
-
class helix_ir.sources.RestSource(url: str, headers: dict[str, str] | None = None, data_key: str | None = None, pagination: dict[str, Any] | None = None, max_pages: int = 100)
Bases: object
Read documents from a paginated REST API.
-
read() → Iterable[dict[str, Any]]
-
schema_hint() → dict[str, Any] | None
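
The signature suggests that data_key names the field holding the documents in each response and that max_pages caps how many pages are requested, though this page does not spell out either behavior or the shape of the pagination dict. The sketch below illustrates that reading with a plain page loop. The function iter_pages and its injected fetch callable are hypothetical, page-number pagination is assumed, and no HTTP request is made, so the example runs offline.

```python
from __future__ import annotations

from typing import Any, Callable, Iterator


def iter_pages(
    fetch: Callable[[int], Any],
    data_key: str | None = None,
    max_pages: int = 100,
) -> Iterator[dict[str, Any]]:
    """Hypothetical page loop: call fetch(page) until a page is empty or the cap is hit."""
    for page in range(1, max_pages + 1):
        payload = fetch(page)
        # With data_key, documents live under that key; otherwise the payload is the list.
        items = payload.get(data_key, []) if data_key else payload
        if not items:
            break  # an empty page is taken to mean the end of the data
        yield from items


# Fake two-page API standing in for real HTTP responses.
pages = {1: {"results": [{"id": 1}, {"id": 2}]}, 2: {"results": [{"id": 3}]}}


def fake_fetch(page: int) -> dict[str, Any]:
    return pages.get(page, {"results": []})


all_docs = list(iter_pages(fake_fetch, data_key="results"))
capped_docs = list(iter_pages(fake_fetch, data_key="results", max_pages=1))
```

Injecting the fetch function keeps the loop independent of any HTTP client. A hard page cap like max_pages guards against APIs that never return an empty page.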