
extract.extractors

MaterializedEmptyList Objects

class MaterializedEmptyList(List[Any])

A list variant that will materialize tables even if an empty list was yielded.

materialize_schema_item

def materialize_schema_item() -> MaterializedEmptyList

Yield this to materialize the schema in the destination, even if there's no data.
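
For example, a resource can yield this marker on a run that produces no rows but should still create the table in the destination. A minimal sketch, assuming the marker is exposed as dlt.mark.materialize_schema_item and using duckdb as an example destination:

import dlt

@dlt.resource(name="events", columns={"id": {"data_type": "bigint"}})
def events():
    # No rows this run; yielding the marker still creates the "events"
    # table with the declared columns in the destination.
    yield dlt.mark.materialize_schema_item()

dlt.pipeline(pipeline_name="demo", destination="duckdb").run(events())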

with_file_import

def with_file_import(
file_path: str,
file_format: TLoaderFileFormat,
items_count: int = 0,
hints: Union[TResourceHints, TDataItem] = None) -> DataItemWithMeta

Marks the file under file_path to be associated with the current resource and imported into the load package as a file of type file_format.

You can provide optional hints that will be applied to the current resource. Note that you should avoid schema inference at runtime if possible; if it cannot be avoided, do it only once per extract process. Use make_hints in the mark module to create TResourceHints. You can also pass an Arrow table or a Pandas data frame, from which the schema will be taken (but the content discarded).

If the number of records in file_path is known, pass it as items_count so dlt can generate correct extract metrics.

Note that dlt does not sniff schemas from data and will not guess the right file format for you.
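
A minimal usage sketch of with_file_import together with make_hints, assuming both are exposed via dlt.mark; the file path, format, and columns below are illustrative only:

import dlt

@dlt.resource(name="orders")
def orders():
    # Build the hints once, up front, instead of inferring a schema at runtime.
    hints = dlt.mark.make_hints(
        columns={"order_id": {"data_type": "bigint"}, "amount": {"data_type": "double"}}
    )
    # Associate an existing parquet file with this resource; dlt imports it
    # into the load package as-is and uses items_count for extract metrics.
    yield dlt.mark.with_file_import(
        "data/orders_0001.parquet", "parquet", items_count=10000, hints=hints
    )

dlt.pipeline(pipeline_name="demo", destination="duckdb").run(orders())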

Extractor Objects

class Extractor()

write_items

def write_items(resource: DltResource, items: TDataItems, meta: Any) -> None

Write items to the resource, optionally computing table schemas and revalidating/filtering the data.

ObjectExtractor Objects

class ObjectExtractor(Extractor)

Extracts Python object data items into typed jsonl.

ArrowExtractor Objects

class ArrowExtractor(Extractor)

Extracts Arrow data items into parquet. Normalizes Arrow item column names. Compares the Arrow schema to the actual dlt table schema to reorder the columns and to insert missing columns (without data). Adds the _dlt_load_id column to the table if add_dlt_load_id is set to True in the normalizer config.

We do work here that the normalizer would otherwise do, so that parquet files do not need to be loaded and saved again later.

Handles the following types:

  • pyarrow.Table
  • pyarrow.RecordBatch
  • pandas.DataFrame (converted to an Arrow Table before processing)
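
For illustration, yielding any of these types from a resource routes the items through this extractor. A minimal sketch, assuming pyarrow is installed and using duckdb as an example destination:

import dlt
import pyarrow as pa

@dlt.resource(name="readings")
def readings():
    # Arrow items are written to parquet directly; column names are
    # normalized and aligned with the dlt table schema as described above.
    yield pa.table({"sensor": ["a", "b"], "value": [1.5, 2.5]})

dlt.pipeline(pipeline_name="demo", destination="duckdb").run(readings())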
