
Packaging a dlt+ Project tutorial

dlt+

This page is for dlt+, which requires a license. Join our early access program for a trial license.

Packaging a dlt+ Project simplifies distribution across teams or stakeholders, such as data analysts or data science teams, without requiring direct access to the project’s internal code. Once installed, the package can be used to run pipelines and access production data through a standardized Python interface.

In this tutorial, you will learn how to package your dlt+ project for reuse and distribution and make it pip-installable.

Prerequisites

Before you begin, make sure dlt+ is installed and a valid license is available (see the note above).

Additionally, install the required Python packages:

pip install pandas numpy pyarrow streamlit "dlt[duckdb]" uv

Packaging a project

dlt+ provides tools to help you package a project for distribution. This makes your project installable via pip and easier to share across your organization.

To create the project structure required for a package, add the --package option when initializing:

dlt project init arrow duckdb --package my_dlt_project

This creates the same basic project as in the basic tutorial, but places it inside a module named my_dlt_project and includes a basic pyproject.toml file with standard Python packaging metadata. You also get a default __init__.py file that makes the package usable after installation:

.
├── my_dlt_project/      # Your project module
│   ├── __init__.py      # Package entry point
│   ├── dlt.yml          # dlt+ project manifest
│   └── ...              # Other project files
├── .gitignore
└── pyproject.toml       # The main project manifest

Your dlt.yml works exactly the same as in non-packaged projects. The key differences are the module structure and the presence of the pyproject.toml file, which includes a special entry point setting that lets dlt+ discover your project:

[project.entry-points.dlt_package]
dlt-project = "my_dlt_project"
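
For illustration, this is roughly how packages registered in the dlt_package entry point group can be discovered with the Python standard library; it is a sketch of the mechanism, not dlt+'s actual discovery code:

# Hedged sketch: list packages registered under the "dlt_package" entry point
# group using the standard library (Python 3.10+ API).
from importlib.metadata import entry_points

for ep in entry_points(group="dlt_package"):
    # ep.name is "dlt-project", ep.value is "my_dlt_project"
    module = ep.load()  # imports the registered package
    print(ep.name, "->", module.__name__)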

You can still run the pipeline as usual with the CLI commands from the root folder:

dlt pipeline my_pipeline run

If you open the __init__.py file inside your project module, you'll see the full interface that users of your package will interact with. It is very similar to the interface available in flat (non-packaged) projects; the main difference is that it uses the access profile by default. You can customize __init__.py to your project's needs.
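
As an example of such a customization, here is a minimal sketch of a convenience helper you could add on top of the generated interface. The helper name is hypothetical, it reuses the catalog() function already defined in the generated __init__.py, and the bracket-style table access with .df() is an assumption based on dlt's dataset API:

# Hypothetical addition to my_dlt_project/__init__.py: a convenience helper
# built on the generated interface. The helper name is illustrative only.
import pandas as pd


def load_my_table() -> pd.DataFrame:
    # catalog() is the function already defined in the generated __init__.py
    dataset = catalog().dataset("my_pipeline_dataset")
    # Return the contents of "my_table" as a pandas DataFrame
    return dataset["my_table"].df()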

Using the packaged project

To demonstrate how your packaged project can be used, let's simulate a real-world scenario in which a data scientist installs and runs your project in a separate Python environment. In this example, we'll use the uv package manager, but the same steps apply with poetry or pip. You can find installation instructions here.

Assume your packaged dlt+ project is located at /Volumes/my_drive/my_folder/pyproject.toml. Navigate to a new directory and initialize a project:

uv init

Install your packaged project directly from the local path:

uv pip install /Volumes/my_drive/my_folder

Your dlt+ project is now available for use in this environment.

As an example, create a new Python file named test_project.py that imports your packaged project and sets the environment variables it needs:

# import the packaged project
import my_dlt_project
import os
import pandas as pd

os.environ["MY_PIPELINE__SOURCES__ARROW__ARROW__ROW_COUNT"] = "0"
os.environ["MY_PIPELINE__SOURCES__ARROW__ARROW__SOME_SECRET"] = "0"

if __name__ == "__main__":
    # should print "access" as defined in your dlt package
    print(my_dlt_project.config().current_profile)
    # run the pipeline from the packaged project
    my_dlt_project.runner().run_pipeline("my_pipeline")
    # should list the defined destinations
    print(my_dlt_project.config().destinations)
    # get a dataset from the catalog
    dataset = my_dlt_project.catalog().dataset("my_pipeline_dataset")
    # write a DataFrame to the "my_table" table in the dataset
    dataset.save(
        pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}),
        table_name="my_table",
    )
    # get the row counts of all tables in the dataset as a DataFrame
    print(dataset.row_counts().df())

Run the script inside the uv virtual environment:

uv run python test_project.py

Once your pipeline has run, you can explore and share the loaded data using the various access methods provided by dlt+. Learn more in Secure data access and sharing.
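
For example, a minimal sketch of reading the loaded data back through the same catalog interface; the bracket-style table access with .df() is an assumption based on dlt's dataset API:

# Sketch: read back the data written by test_project.py via the catalog.
import my_dlt_project

dataset = my_dlt_project.catalog().dataset("my_pipeline_dataset")
# "my_table" was written by dataset.save() in test_project.py
print(dataset["my_table"].df().head())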

info

In a real-world setup, a data scientist wouldn't install the package from a local path. Instead, it would typically come from a private PyPI repository or a Git URL.
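
For example (hypothetical URLs, shown for illustration only):

# install from a Git repository (hypothetical URL)
uv pip install "git+https://github.com/your-org/my-dlt-project.git"

# or install from a private package index (hypothetical index URL)
uv pip install my-dlt-project --index-url https://pypi.example.com/simple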

