Supercharging Databricks Asset Bundles

Multi-Environment Workflows with CI/CD
In my previous post, I introduced the Databricks Asset Bundle (DAB) template project that helps you get started quickly with a structured approach to Databricks development. Today, I want to dive deeper into how DAB handles variables, parameterization, and CI/CD automation across multiple environments.

The Power of Parameterization


One of the most powerful features of Databricks Asset Bundles is the ability to parameterize nearly everything using variables. This allows us to define workflows once and deploy them to multiple environments with different configurations.

Variable Structure in DAB


The template organizes variables in a clear, hierarchical structure:


variables/
├── common.yml # Variables shared across all environments
├── {workflowName}.dev.yml # Workflow-specific variables for development
├── {workflowName}.test.yml # Workflow-specific variables for testing
└── {workflowName}.prod.yml # Workflow-specific variables for production

This organization gives us several benefits:

  • Clear separation of concerns (common vs specific)
  • Environment-specific configurations (.dev, .test, .prod)
  • Logical grouping of related variables per workflow
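
For reference, variables/common.yml holds the settings that are identical in every environment. Here is a minimal sketch; only timezone_id is actually referenced later in this post, and its description and default value are placeholder assumptions:


# In variables/common.yml
variables:
  timezone_id:
    description: "Timezone applied to all job schedules"
    default: "UTC"
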
Including Variables in Your Project


The main databricks.yml file includes these variable files based on the target environment:


include:
  - resources/**/*.yml
  - variables/common.yml
  - variables/*.dev.yml # This changes based on environment
NOTE:
Unfortunately, the Databricks CLI does not yet support the ${bundle.target} placeholder in include paths, so the variable file cannot be selected automatically per target. This is a bit of a pain, but it is a known limitation that will hopefully be addressed in a future release. For now, we have to update databricks.yml ourselves so that it includes the correct variables/*.{environment}.yml file for each environment; in the CI/CD pipeline we do this with a yq command.
When deploying to different environments, we simply swap out which environment-specific variable files to include. For example, in our CI/CD pipeline for test deployment:


# Update include path for test environment
yq -i 'with(.include[] | select(. == "variables/*.dev.yml"); . = "variables/*.test.yml")' databricks.yml
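
The include list works hand in hand with the targets section of databricks.yml, which maps each environment name to a workspace. The sketch below is an assumption for illustration: the target names dev, test, and prod match the variable-file suffixes used in this post, but the bundle name and workspace hosts are placeholders, not values from the template.


# Hypothetical excerpt from databricks.yml
bundle:
  name: my_dab_project

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-1111111111111111.11.azuredatabricks.net
  test:
    workspace:
      host: https://adb-2222222222222222.22.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-3333333333333333.33.azuredatabricks.net
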
Real-World Example: Parameterizing SharePoint Workflows


Let's look at a real example from the template. The SharePoint Excel refresh workflow connects to SharePoint, processes Excel files, and loads the data to Delta tables. Here's how we parameterize it:

  1. Define environment-specific variables:

# In variables/sharepoint.dev.yml
variables:
  sharepoint:
    type: complex
    default:
      secret_scope: "azure"
      tenant_id_key: "azure-tenant-id"
      client_id_key: "azure-app-client-id"
      client_secret_key: "azure-app-client-secret"
      site_id_key: "sharepoint-site-id"
      drive_id_key: "sharepoint-drive-id"
      modified_in_last_hours: 240
      target_catalog: "bronze"
      target_schema: "sharepoint_dev"
      sync_schedule: "0 0 0 * * ?"
      concurrency: 10

Take note of the target_schema variable: it lets us deploy the same workflow to different environments with different schema names. The same applies to target_catalog and any other variable used in the workflow definition.

  2. Reference the variables in the workflow definition:

# In resources/sharepoint/sharepoint_excel_refresh.yml
resources:
  jobs:
    sharepoint_excel_refresh:
      name: "${bundle.name} Sharepoint Excel Refresh"
      tasks:
        - task_key: sharepoint_excel_file_list
          notebook_task:
            notebook_path: "${workspace.file_path}/notebooks/sharepoint/excel_list_process"
            base_parameters:
              secret_scope: "${var.sharepoint.secret_scope}"
              tenant_id_key: "${var.sharepoint.tenant_id_key}"
              # More parameters...
              modified_in_last_hours: "${var.sharepoint.modified_in_last_hours}"
      schedule:
        quartz_cron_expression: "${var.sharepoint.sync_schedule}"
        timezone_id: "${var.timezone_id}"

By using this approach, we can deploy the same workflow to different environments with environment-specific configurations. For instance, in production we might have different:

  • Target schema names (sharepoint_prod vs sharepoint_dev)
  • Sync schedules (hourly in production, daily in dev)
  • Lookback periods (24 hours in production, 240 hours in dev)

The complex variable type lets us define an object with multiple properties, which is perfect for the SharePoint example: all of the workflow-specific settings live in one place.
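
Putting those production differences together, the prod variable file might look like the sketch below. Only the three commented values come from the list above; the remaining entries are copied from the dev file purely for illustration and could differ in a real setup.


# In variables/sharepoint.prod.yml
variables:
  sharepoint:
    type: complex
    default:
      secret_scope: "azure"
      tenant_id_key: "azure-tenant-id"
      client_id_key: "azure-app-client-id"
      client_secret_key: "azure-app-client-secret"
      site_id_key: "sharepoint-site-id"
      drive_id_key: "sharepoint-drive-id"
      modified_in_last_hours: 24 # shorter lookback in production
      target_catalog: "bronze"
      target_schema: "sharepoint_prod" # production schema
      sync_schedule: "0 0 * * * ?" # hourly instead of daily
      concurrency: 10
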

Automated CI/CD Pipeline


The real magic happens when we automate deployment across environments. The template includes GitHub Actions workflows that:


  1. Validate on PRs and feature branches:
    • Run unit tests
    • Validate DAB bundle configuration
    • Check code quality

  2. Auto-deploy to the test environment (sketched after this list):
    • Triggered on pushes to the develop branch
    • Updates variable includes for test environment
    • Authenticates with service principal
    • Deploys the DAB bundle

  3. Deploy to production:
    • Triggered on pushes to the main branch
    • Updates variable includes for production environment
    • Adds approval steps for production deployment
    • Deploys with production-specific settings
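
To make step 2 concrete, here is a rough sketch of what the test-deployment workflow could look like. This is not the template's actual workflow file: the file name, trigger branch, and secret names (DATABRICKS_HOST, SERVICE_PRINCIPAL_APP_ID, SERVICE_PRINCIPAL_SECRET) are assumptions pieced together from the snippets in this post, and the authentication step is explained in more detail in the next section.


# .github/workflows/deploy-test.yml (hypothetical file name)
name: Deploy to test

on:
  push:
    branches: [develop]

jobs:
  deploy-test:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }} # assumed secret name
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      - name: Switch variable includes to the test environment
        run: |
          yq -i 'with(.include[] | select(. == "variables/*.dev.yml"); . = "variables/*.test.yml")' databricks.yml

      - name: Authenticate and deploy
        run: |
          # Obtain an OAuth token for the service principal (details in the next section)
          response=$(curl -s -X POST \
            -u "${{ secrets.SERVICE_PRINCIPAL_APP_ID }}:${{ secrets.SERVICE_PRINCIPAL_SECRET }}" \
            "$DATABRICKS_HOST/oidc/v1/token" \
            -d "grant_type=client_credentials&scope=all-apis")
          export DATABRICKS_TOKEN=$(echo "$response" | jq -r '.access_token')

          # Deploy the bundle to the test target
          databricks bundle deploy --target test \
            --var service_principal_id=${{ secrets.SERVICE_PRINCIPAL_APP_ID }}
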
Authentication with Service Principals


A key part of the CI/CD automation is using service principals for authentication. In the GitHub workflows, we:

  1. Obtain an OAuth token using the service principal credentials
  2. Use that token for Databricks CLI authentication
  3. Pass the service principal ID as a variable during deployment

# Get OAuth token for service principal
response=$(curl -s -X POST \
  -u "${{ secrets.SERVICE_PRINCIPAL_APP_ID }}:${{ secrets.SERVICE_PRINCIPAL_SECRET }}" \
  "$DATABRICKS_HOST/oidc/v1/token" \
  -d "grant_type=client_credentials&scope=all-apis")

# Extract token and set environment variables
token=$(echo $response | jq -r '.access_token')
export DATABRICKS_TOKEN="$token"

# Deploy with service principal ID variable
databricks bundle deploy --target test --var service_principal_id=${{ secrets.SERVICE_PRINCIPAL_APP_ID }}
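
On the bundle side, service_principal_id is just a declared variable. How the template consumes it is not shown here, but one plausible pattern (an assumption, not the template's confirmed approach) is to declare it in databricks.yml and feed it into a run_as block so that deployed jobs run as the service principal:


# Hypothetical snippet in databricks.yml
variables:
  service_principal_id:
    description: "Application ID of the deployment service principal, supplied via --var"

run_as:
  service_principal_name: ${var.service_principal_id}
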
Advanced Techniques


Here are some advanced techniques you can use with this setup:

1. Dynamic Configuration Based on Branch


You can make your CI/CD pipeline smarter by adjusting configuration based on the Git branch:


- name: Set environment variables based on branch
  run: |
    if [[ "${{ github.ref }}" == "refs/heads/main" ]]; then
      echo "TARGET_ENV=prod" >> $GITHUB_ENV
    elif [[ "${{ github.ref }}" == "refs/heads/develop" ]]; then
      echo "TARGET_ENV=test" >> $GITHUB_ENV
    else
      echo "TARGET_ENV=dev" >> $GITHUB_ENV
    fi

- name: Update variable includes
  run: |
    yq -i 'with(.include[] | select(. == "variables/*.dev.yml"); . = "variables/*.${{ env.TARGET_ENV }}.yml")' databricks.yml
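
The resolved TARGET_ENV can then drive the deployment itself, so a single workflow file serves every branch. A short sketch, reusing the secret name from the authentication section above:


- name: Deploy bundle to the selected target
  run: |
    databricks bundle deploy --target ${{ env.TARGET_ENV }} \
      --var service_principal_id=${{ secrets.SERVICE_PRINCIPAL_APP_ID }}
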
2. Feature Flags via Variables


You can implement simple feature flags using variables:


variables:
  features:
    type: complex
    default:
      enable_advanced_analytics: true
      enable_real_time_processing: false

Then in your workflows:


tasks:
  - task_key: optional_analytics_step
    notebook_task:
      notebook_path: "/path/to/analytics"
    if: ${var.features.enable_advanced_analytics}
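
The if: shorthand above is meant to be illustrative. One concrete way to express the same gate inside a Databricks job, assuming the features variable from the previous block, is a condition task that the optional step depends on:


tasks:
  - task_key: check_advanced_analytics
    condition_task:
      op: EQUAL_TO
      left: "${var.features.enable_advanced_analytics}"
      right: "true"

  - task_key: optional_analytics_step
    depends_on:
      - task_key: check_advanced_analytics
        outcome: "true"
    notebook_task:
      notebook_path: "/path/to/analytics"
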
3. Template Workflows with Parameters


You can create reusable workflow templates by parameterizing common patterns:


# resources/templates/ingest_template.yml
resources:
  jobs:
    ${var.job_name}: # Dynamic job name
      name: "Ingest ${var.source_name} Data"
      tasks:
        - task_key: ingest_data
          notebook_task:
            notebook_path: "/Shared/ingest/${var.source_type}"
            base_parameters:
              source_config: ${var.source_config}
              target_table: ${var.target_table}
Conclusion


By combining DAB's parameterization capabilities with automated CI/CD pipelines, you can create a robust, maintainable system for deploying Databricks resources across environments. This approach gives you:

  • Clear separation of configuration from implementation
  • Environment-specific settings without code duplication
  • Automated testing and deployment
  • Consistent deployment process across environments
  • Version-controlled infrastructure and configuration

What's your experience with Databricks Asset Bundles? Have you found other useful patterns for managing multi-environment deployments? Let me know in the comments!

