Packaging a dlt+ Project tutorial
This page is for dlt+, which requires a license. Join our early access program for a trial license.
Packaging a dlt+ Project simplifies distribution across teams or stakeholders, such as data analysts or data science teams, without requiring direct access to the project’s internal code. Once installed, the package can be used to run pipelines and access production data through a standardized Python interface.
In this tutorial, you will learn how to package your dlt+ project for reuse and distribution, making it pip-installable.
Prerequisites
Before you begin, ensure the following requirements are met:
- dlt+ is installed and set up according to the installation guide
- You are familiar with the core concepts of dlt
- You have completed the basic project tutorial
Additionally, install the required Python packages:
pip install pandas numpy pyarrow streamlit dlt[duckdb] uv
Packaging a project
dlt+ provides tools to help you package a project for distribution. This makes your project installable via pip and easier to share across your organization.
To create the project structure required for a package, add the --package option when initializing:
dlt project init arrow duckdb --package my_dlt_project
This creates the same basic project as in the basic tutorial, but places it inside a module named my_dlt_project and includes a basic pyproject.toml file following PEP standards. You'll also get a default __init__.py file to make the package usable after installation:
.
├── my_dlt_project/ # Your project module
│ ├── __init__.py # Package entry point
│ ├── dlt.yml # dlt+ project manifest
│ └── ... # Other project files
├── .gitignore
└── pyproject.toml # The main project manifest
Your dlt.yml works exactly the same as in non-packaged projects. The key difference is the module structure and the presence of the pyproject.toml file, which includes a special entry point setting that lets dlt+ discover your project:
[project.entry-points.dlt_package]
dlt-project = "my_dlt_project"
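For orientation, a generated pyproject.toml might look roughly like the sketch below. The exact metadata, dependencies, and version constraints dlt+ writes for you may differ, so treat the values here as placeholders:
[project]
name = "my_dlt_project"
version = "0.1.0"
requires-python = ">=3.9"
dependencies = [
    "dlt[duckdb]",
]

[project.entry-points.dlt_package]
dlt-project = "my_dlt_project"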
You can still run the pipeline as usual with the CLI commands from the root folder:
dlt pipeline my_pipeline run
If you open the __init__.py file inside your project module, you'll see the full interface that users of your package will interact with. This interface is very similar to the current interface used in flat (non-packaged) projects; the main difference is that it automatically uses the access profile by default.
You can customize the __init__.py file to your project's needs.
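For example, since the generated __init__.py exposes the config(), runner(), and catalog() helpers used later in this tutorial, you could layer a small convenience function on top of them. The helper below is purely illustrative and not part of the generated file:
# my_dlt_project/__init__.py (illustrative excerpt)
# Assumption: catalog() is already defined or imported in this module,
# as it is in the generated package interface.

def my_pipeline_dataset():
    """Return this package's default dataset from the catalog."""
    return catalog().dataset("my_pipeline_dataset")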
Using the packaged project
To demonstrate how your packaged project can be used, let's simulate a real-world scenario where a data scientist installs and runs your project in a separate Python environment.
In this example, we'll use the uv package manager, but the same steps apply when using poetry or pip. You can find installation instructions here.
Assume your packaged dlt+ project is located at /Volumes/my_drive/my_folder/pyproject.toml.
Navigate to a new directory and initialize your project:
uv init
Install your packaged project directly from the local path:
uv pip install /Volumes/my_drive/my_folder
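If you prefer pip or Poetry over uv, the equivalent installs from the same local path would look like this (adjust the path to your setup):
pip install /Volumes/my_drive/my_folder
poetry add /Volumes/my_drive/my_folder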
Your dlt+ project is now available for use in this environment.
As an example, create a new Python file named test_project.py, use your packaged project, and define the environment variables it needs:
# Import the packaged project
import my_dlt_project

import os
import pandas as pd

# Config value and secret expected by the packaged project's arrow source
os.environ["MY_PIPELINE__SOURCES__ARROW__ARROW__ROW_COUNT"] = "0"
os.environ["MY_PIPELINE__SOURCES__ARROW__ARROW__SOME_SECRET"] = "0"

if __name__ == "__main__":
    # Should print "access", the default profile defined in your dlt package
    print(my_dlt_project.config().current_profile)

    # Run the pipeline from the packaged project
    my_dlt_project.runner().run_pipeline("my_pipeline")

    # List the defined destinations
    print(my_dlt_project.config().destinations)

    # Get a dataset from the catalog
    dataset = my_dlt_project.catalog().dataset("my_pipeline_dataset")

    # Write a DataFrame to the "my_table" table in the dataset
    dataset.save(
        pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}),
        table_name="my_table",
    )

    # Get the row counts of all tables in the dataset as a DataFrame
    print(dataset.row_counts().df())
Run the script inside the uv virtual environment:
uv run python test_project.py
Once your pipeline has run, you can explore and share the loaded data using the various access methods provided by dlt+. Learn more in Secure data access and sharing.
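For instance, you can read the table you just wrote back into a pandas DataFrame by appending the following to the script above. This assumes the dataset object exposes dlt's usual relation interface with a .df() method, as already used for row_counts() above:
# Assumption: attribute-style table access (dataset.my_table) returns a relation with .df()
print(dataset.my_table.df().head())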
In a real-world setup, a data scientist wouldn't install the package from a local path. Instead, it would typically come from a private PyPI repository or a Git URL.
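For example, installing straight from a Git repository with uv could look like this; the URL below is a placeholder for your organization's repository:
uv pip install "git+https://github.com/your-org/my_dlt_project.git"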