Welcome to the DAGWorks Platform Documentation

Here’s a brief overview of capabilities you get by using the DAGWorks platform:

  • Data & Code observability — see what’s changing in your code and when it runs
  • Lineage — understand your code and how it relates to your data
  • Catalog — see code, and what data it produces in one place

For more detailed information, see the Capabilities page. Otherwise, let’s get started!

Log in

Navigate to app.dagworks.io and log in with your credentials. If you do not have an account, click “sign up” to start a free 14-day trial of the “team” plan. After the trial, you can continue with the team plan, select another plan, or be automatically downgraded to the free community plan.

Create a Project

You should be directed to the projects page (four-squares icon). Click the + New Project button, add a name and a description, and set the visibility. You can add your team (if you have one) to the “Visible by” and “Modifiable by” sections, as well as anyone else you wish to share the project with. Once the project is created, select it and find its ID. You’ll need it later.

You can always find the name/ID of your project by going back to the projects page, which is the first page you see when you log in.

Create an API Key

Navigate to the “API Keys” page (key icon) and click + Create new key. Copy the key by clicking the copy icon and save it somewhere safe.

Be careful with your API keys! These are like passwords: store them in a safe place and never check them into version control. If you lose a key, you can always generate a new one, and you can delete any keys you no longer need.
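One common way to keep a key out of your code is to read it from an environment variable at runtime. The sketch below assumes a variable named DAGWORKS_API_KEY; that name and the load_api_key helper are hypothetical, chosen for illustration rather than prescribed by the platform.

```python
import os


def load_api_key(env_var: str = "DAGWORKS_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption for this sketch; use whatever
    name fits your own setup.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running your code."
        )
    return key
```

With this approach the key lives in your shell or secrets manager, so nothing sensitive ends up in the repository.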

Install the DAGWorks Tracking Adapter

This assumes you have pip installed and know how to manage Python environments. If not, it is worth a read here. To install the library/CLI, run:

pip install dagworks-sdk

Initialize your project

If you’re already using Hamilton in your project you can skip this and go directly to the DAGWorks Tracking Adapter page.

Now that you’ve installed the CLI, you can run the following command to generate a project from a template:

dagworks init \
  --template hello_world \
  --project-id "project_id" \
  --api-key "api_key" \
  --username "your_email" \
  --location project_dir

We have a few sample templates to choose from:

Run the code

Now you’re ready for the fun part! Navigate into the directory, and run it!

./run.sh

Go back to the UI, and click Select on the project you just created (or click the DAG icon in the table row). You should see the DAG!

A quick note on Hamilton

Hamilton is a framework that helps you write and organize Python functions. It is entirely open-source, meaning that any code that you write to work with DAGWorks can be used outside of the platform as well!

The basics are simple: you write a collection of functions, each of which has a specific shape. The parameter names refer to the upstream dependencies (either nodes in the DAG or external inputs), and the return value is the output of the function (which other functions can then refer to by the function's name).

For example, the function:

import pandas as pd

def my_data(upstream_data: pd.DataFrame, upstream_param: float) -> pd.DataFrame:
    return upstream_data[upstream_data["col"] > upstream_param]

This function will be a transformation node in a DAG. It depends on two “upstream” values, each of which is either another node in the DAG or an external input:

  1. upstream_data: a dataframe
  2. upstream_param: a float
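To make the name-based wiring concrete, here is a minimal, dependency-free sketch of the idea. It is not Hamilton's implementation: the resolve helper and the two example functions are hypothetical, and the sketch only illustrates how parameter names can be matched to other functions or to external inputs.

```python
import inspect
from typing import Any, Callable, Dict


def resolve(name: str, functions: Dict[str, Callable], inputs: Dict[str, Any],
            cache: Dict[str, Any]) -> Any:
    """Compute the value called `name`, resolving dependencies by parameter name."""
    if name in inputs:          # external input supplied by the caller
        return inputs[name]
    if name not in cache:       # node in the DAG: compute it once, then reuse it
        fn = functions[name]
        kwargs = {
            param: resolve(param, functions, inputs, cache)
            for param in inspect.signature(fn).parameters
        }
        cache[name] = fn(**kwargs)
    return cache[name]


# Two illustrative "nodes"; their parameter names declare what they depend on.
def filtered(raw_values: list, threshold: float) -> list:
    return [v for v in raw_values if v > threshold]


def total(filtered: list) -> float:
    return sum(filtered)


functions = {"filtered": filtered, "total": total}
result = resolve("total", functions, {"raw_values": [1, 5, 12, 20], "threshold": 10}, {})
print(result)  # 32
```

Here total depends on filtered because it has a parameter with that name, while raw_values and threshold are not defined as functions and so must be passed in as inputs.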

By using the “driver”, you specify the outputs you want and the inputs they need, and Hamilton will execute the DAG.

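from hamilton import driver
import my_module  # the module that defines my_data

dr = (
  driver.Builder()
    .with_config({})
    .with_modules(my_module)
    # note that integrating DAGWorks just requires adding the adapter to the line below
    #.with_adapters()
    .build()  # build() turns the Builder into a Driver that can execute
)

# load_my_data() stands in for your own function that returns a pd.DataFrame
dataframe = dr.execute(
  ['my_data'], inputs={'upstream_data' : load_my_data(), 'upstream_param': 10})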

Organizing your code in Hamilton gives you a lot of power. Because Hamilton is also very lightweight, you can:

  1. Run your code in any context
  2. Easily determine the dependencies (both direct and transitive) of any data
  3. Understand how any piece of data was generated and link it to code
  4. Unit test your code
  5. Develop just the pieces of your pipeline you care about
  6. And do a lot more, including adding data quality checks and managing documentation
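As a small illustration of the unit-testing point: because each node is a plain Python function, you can test it by calling it directly with hand-made inputs. The above_threshold function below is a hypothetical, pandas-free stand-in for a node like my_data, not code from the templates.

```python
def above_threshold(values: list, threshold: float) -> list:
    """Hypothetical node: keep only the values greater than the threshold."""
    return [v for v in values if v > threshold]


def test_above_threshold() -> None:
    # No driver or DAG is needed: call the function with hand-made inputs.
    assert above_threshold([1, 5, 12], threshold=4) == [5, 12]
    assert above_threshold([], threshold=4) == []


test_above_threshold()
```

A test runner such as pytest would pick up test_above_threshold on its own; the explicit call at the end just lets the snippet run as a script.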

Hamilton gives you, the data/ML pipeline creator, a lot of power. Read more about it here, and try it out (in the browser!) here: tryhamilton.dev.

Next Steps

Congrats, you’ve done it! Feel free to stop reading and start building. If you want to learn more, check out the following resources: