Deploy with Dagster
Introduction to Dagster
Dagster is an orchestrator designed for developing and maintaining data assets, such as tables, datasets, machine learning models, and reports. Dagster ensures these assets are built reliably and focuses on software-defined assets (SDAs) to simplify complex data management, improve code reuse, and provide a clearer understanding of data.
To read more, please refer to Dagster’s documentation.
Dagster Cloud Features
Dagster Cloud offers an enterprise-level orchestration service with serverless or hybrid deployment options. It incorporates native branching and built-in CI/CD to prioritize the developer experience, enabling scalable, cost-effective operations without the hassle of infrastructure management.
Dagster deployment options: Serverless versus Hybrid
The serverless option fully hosts the orchestration engine, while the hybrid model offers the flexibility to use your own computing resources, with Dagster managing the control plane, reducing operational overhead and ensuring security.
For more info, please refer to the Dagster Cloud docs.
Using Dagster for Free
Dagster offers a 30-day free trial during which you can explore its features, such as pipeline orchestration, data quality checks, and embedded ELT. You can try Dagster using its open-source version or by signing up for the trial.
Building Data Pipelines with dlt
dlt is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets through automatic schema inference and evolution. It simplifies building data pipelines with support for extract and load processes.
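To illustrate the declarative style, here is a minimal, standalone sketch of a dlt pipeline. The duckdb destination, the pipeline and table names, and the sample records are assumptions chosen for illustration; any supported destination works the same way:

import dlt

# sample records; in practice this could come from an API, a database, or files
data = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]

# declare the pipeline; dlt infers the schema and creates the tables
pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",
    dataset_name="quickstart_data",
)

# load the data and print a summary of what was loaded
load_info = pipeline.run(data, table_name="users")
print(load_info)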
How does dlt integrate with Dagster for pipeline orchestration?
dlt integrates with Dagster for pipeline orchestration, providing a streamlined process for building, enhancing, and managing data pipelines. This enables developers to leverage dlt's capabilities for data extraction and loading and Dagster's orchestration features to efficiently manage and monitor data pipelines.
Orchestrating a dlt pipeline on Dagster
Here's a concise guide to orchestrating a dlt pipeline with Dagster, using the project "Ingesting GitHub issues data from a repository and storing it in BigQuery" as an example. More details can be found in the article "Orchestrating unstructured data pipelines with dagster and dlt."
The steps are as follows:
First, create a dlt pipeline. For more, please refer to the documentation: Creating a pipeline.
Next, set up a Dagster project, configure resources, and define the asset as follows.
To create a Dagster project:
mkdir dagster_github_issues
cd dagster_github_issues
dagster project scaffold --name github-issues
Define dlt as a Dagster resource:
from dagster import ConfigurableResource
import dlt
class DltPipeline(ConfigurableResource):
    pipeline_name: str
    dataset_name: str
    destination: str

    def create_pipeline(self, resource_data, table_name):
        # configure the pipeline with your destination details
        pipeline = dlt.pipeline(
            pipeline_name=self.pipeline_name,
            destination=self.destination,
            dataset_name=self.dataset_name
        )
        # run the pipeline with your parameters
        load_info = pipeline.run(resource_data, table_name=table_name)
        return load_info

Define the asset as:
from dagster import asset, get_dagster_logger

@asset
def issues_pipeline(pipeline: DltPipeline):
    logger = get_dagster_logger()
    # github_issues_resource is the dlt resource that yields the issue data
    # (see the sketch after these steps)
    results = pipeline.create_pipeline(github_issues_resource, table_name='github_issues')
    logger.info(results)

For more information, please refer to Dagster’s documentation.
Next, define Dagster definitions as follows:
from dagster import Definitions, define_asset_job, load_assets_from_modules
from . import assets  # the module where issues_pipeline is defined

all_assets = load_assets_from_modules([assets])
simple_pipeline = define_asset_job(name="simple_pipeline", selection=['issues_pipeline'])

defs = Definitions(
    assets=all_assets,
    jobs=[simple_pipeline],
    resources={
        "pipeline": DltPipeline(
            pipeline_name="github_issues",
            dataset_name="dagster_github_issues",
            destination="bigquery",
        ),
    }
)

Finally, start the web server as:
dagster dev
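The asset above references github_issues_resource, the dlt resource that yields the issue data. As a point of reference, here is a minimal sketch of what such a resource could look like; the dlt-hub/dlt repository path and the use of the public, unauthenticated GitHub REST API are assumptions for illustration:

import dlt
import requests

@dlt.resource(name="github_issues", write_disposition="replace")
def github_issues_resource():
    # "dlt-hub/dlt" is a placeholder; substitute your own owner/repo
    url = "https://api.github.com/repos/dlt-hub/dlt/issues"
    response = requests.get(url, params={"state": "all", "per_page": 100})
    response.raise_for_status()
    # yield the list of issue records; dlt infers the table schema from them
    yield response.json()

Note that the resource key "pipeline" in Definitions must match the pipeline parameter name of the asset function; that is how Dagster injects the configured DltPipeline instance.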
For the complete hands-on project on "Orchestrating unstructured data pipelines with dagster and dlt", please refer to the article. The author offers a detailed overview and steps for ingesting GitHub issues data from a repository and storing it in BigQuery. You can use a similar approach to build your pipelines.
Additional Resources
- A general configurable dlt resource orchestrated on Dagster: dlt resource.
- Configure dlt pipelines for Dagster: dlt pipelines.
- Configure the MongoDB source as an asset factory: Dagster's @multi_asset declaration allows you to convert each collection in a database into a separate asset. This makes the pipeline easier to debug in case of failure and keeps the collections independent of each other (a sketch of this pattern follows below).
These are external repositories and are subject to change.
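As an illustration of the asset-factory pattern described above, here is a minimal sketch. The collection names and the mongodb_collection_resource helper are hypothetical placeholders; substitute the dlt resource you use to read a single MongoDB collection:

from dagster import AssetOut, Output, multi_asset
from .resources import DltPipeline  # assumed location of the resource defined earlier

# hypothetical collection names; in practice, list them from the database
COLLECTIONS = ["users", "orders", "events"]

def mongodb_collection_resource(name):
    # placeholder: in practice, return a dlt resource that reads the
    # MongoDB collection `name` (e.g. via dlt's verified mongodb source)
    return [{"_collection": name}]

@multi_asset(outs={name: AssetOut() for name in COLLECTIONS})
def mongodb_collections(pipeline: DltPipeline):
    # run one dlt load per collection, surfacing each as its own Dagster asset
    for name in COLLECTIONS:
        load_info = pipeline.create_pipeline(mongodb_collection_resource(name), table_name=name)
        yield Output(value=load_info, output_name=name)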