Supercharging Databricks Asset Bundles

Multi-Environment Workflows with CI/CD
In my previous post, I introduced the Databricks Asset Bundle (DAB) template project that helps you get started quickly with a structured approach to Databricks development. Today, I want to dive deeper into how DAB handles variables, parameterization, and CI/CD automation across multiple environments.

The Power of Parameterization


One of the most powerful features of Databricks Asset Bundles is the ability to parameterize nearly everything using variables. This allows us to define workflows once and deploy them to multiple environments with different configurations.

Variable Structure in DAB


The template organizes variables in a clear, hierarchical structure:


variables/
├── common.yml # Variables shared across all environments
├── {workflowName}.dev.yml # Workflow-specific variables for development
├── {workflowName}.test.yml # Workflow-specific variables for testing
└── {workflowName}.prod.yml # Workflow-specific variables for production

This organization gives us several benefits:

  • Clear separation of concerns (common vs specific)
  • Environment-specific configurations (.dev, .test, .prod)
  • Logical grouping of related variables per workflow
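
For reference, variables/common.yml holds the settings that are identical in every environment. Here is a minimal sketch; only timezone_id is actually referenced later in this post, and its description and default value are placeholder assumptions:


# In variables/common.yml
variables:
  timezone_id:
    description: "Timezone applied to all job schedules"
    default: "UTC"
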
Including Variables in Your Project


The main databricks.yml file includes these variable files based on the target environment:


include:
  - resources/**/*.yml
  - variables/common.yml
  - variables/*.dev.yml # This changes based on environment
NOTE:
Unfortunately, the Databricks CLI does not yet support the ${bundle.target} placeholder in include paths, so the variable file cannot be selected automatically per target. This is a bit of a pain, but it is a known limitation that will hopefully be addressed in a future release. For now, we have to update databricks.yml ourselves so that it includes the correct variables/*.{environment}.yml file for each environment; in the CI/CD pipeline we do this with a yq command.
When deploying to different environments, we simply swap out which environment-specific variable files to include. For example, in our CI/CD pipeline for test deployment:


# Update include path for test environment
yq -i 'with(.include[] | select(. == "variables/*.dev.yml"); . = "variables/*.test.yml")' databricks.yml
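
The include list works hand in hand with the targets section of databricks.yml, which maps each environment name to a workspace. The sketch below is an assumption for illustration: the target names dev, test, and prod match the variable-file suffixes used in this post, but the bundle name and workspace hosts are placeholders, not values from the template.


# Hypothetical excerpt from databricks.yml
bundle:
  name: my_dab_project

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-1111111111111111.11.azuredatabricks.net
  test:
    workspace:
      host: https://adb-2222222222222222.22.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-3333333333333333.33.azuredatabricks.net
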
Real-World Example: Parameterizing SharePoint Workflows


Let's look at a real example from the template. The SharePoint Excel refresh workflow connects to SharePoint, processes Excel files, and loads the data to Delta tables. Here's how we parameterize it:

  1. Define environment-specific variables:

# In variables/sharepoint.dev.yml
variables:
  sharepoint:
    type: complex
    default:
      secret_scope: "azure"
      tenant_id_key: "azure-tenant-id"
      client_id_key: "azure-app-client-id"
      client_secret_key: "azure-app-client-secret"
      site_id_key: "sharepoint-site-id"
      drive_id_key: "sharepoint-drive-id"
      modified_in_last_hours: 240
      target_catalog: "bronze"
      target_schema: "sharepoint_dev"
      sync_schedule: "0 0 0 * * ?"
      concurrency: 10

Take note of the target_schema variable: it lets us deploy the same workflow to different environments with different schema names. The same applies to target_catalog and any other variable used in the workflow definition.

  2. Reference the variables in the workflow definition:

# In resources/sharepoint/sharepoint_excel_refresh.yml
resources:
  jobs:
    sharepoint_excel_refresh:
      name: "${bundle.name} Sharepoint Excel Refresh"
      tasks:
        - task_key: sharepoint_excel_file_list
          notebook_task:
            notebook_path: "${workspace.file_path}/notebooks/sharepoint/excel_list_process"
            base_parameters:
              secret_scope: "${var.sharepoint.secret_scope}"
              tenant_id_key: "${var.sharepoint.tenant_id_key}"
              # More parameters...
              modified_in_last_hours: "${var.sharepoint.modified_in_last_hours}"
      schedule:
        quartz_cron_expression: "${var.sharepoint.sync_schedule}"
        timezone_id: "${var.timezone_id}"

By using this approach, we can deploy the same workflow to different environments with environment-specific configurations. For instance, in production we might have different:

  • Target schema names (sharepoint_prod vs sharepoint_dev)
  • Sync schedules (hourly in production, daily in dev)
  • Lookback periods (24 hours in production, 240 hours in dev)

The complex variable type lets us define an object with multiple properties, which is perfect for the SharePoint example: all of the workflow-specific settings live in one place.
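
Putting those production differences together, the prod variable file might look like the sketch below. Only the three commented values come from the list above; the remaining entries are copied from the dev file purely for illustration and could differ in a real setup.


# In variables/sharepoint.prod.yml
variables:
  sharepoint:
    type: complex
    default:
      secret_scope: "azure"
      tenant_id_key: "azure-tenant-id"
      client_id_key: "azure-app-client-id"
      client_secret_key: "azure-app-client-secret"
      site_id_key: "sharepoint-site-id"
      drive_id_key: "sharepoint-drive-id"
      modified_in_last_hours: 24 # shorter lookback in production
      target_catalog: "bronze"
      target_schema: "sharepoint_prod" # production schema
      sync_schedule: "0 0 * * * ?" # hourly instead of daily
      concurrency: 10
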

Automated CI/CD Pipeline


The real magic happens when we automate deployment across environments. The template includes GitHub Actions workflows that:


  1. Validate on PRs and feature branches:
    • Run unit tests
    • Validate DAB bundle configuration
    • Check code quality

  2. Auto-deploy to the test environment (sketched after this list):
    • Triggered on pushes to the develop branch
    • Updates variable includes for test environment
    • Authenticates with service principal
    • Deploys the DAB bundle

  3. Deploy to production:
    • Triggered on pushes to the main branch
    • Updates variable includes for production environment
    • Adds approval steps for production deployment
    • Deploys with production-specific settings
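
To make step 2 concrete, here is a rough sketch of what the test-deployment workflow could look like. This is not the template's actual workflow file: the file name, trigger branch, and secret names (DATABRICKS_HOST, SERVICE_PRINCIPAL_APP_ID, SERVICE_PRINCIPAL_SECRET) are assumptions pieced together from the snippets in this post, and the authentication step is explained in more detail in the next section.


# .github/workflows/deploy-test.yml (hypothetical file name)
name: Deploy to test

on:
  push:
    branches: [develop]

jobs:
  deploy-test:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }} # assumed secret name
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      - name: Switch variable includes to the test environment
        run: |
          yq -i 'with(.include[] | select(. == "variables/*.dev.yml"); . = "variables/*.test.yml")' databricks.yml

      - name: Authenticate and deploy
        run: |
          # Obtain an OAuth token for the service principal (details in the next section)
          response=$(curl -s -X POST \
            -u "${{ secrets.SERVICE_PRINCIPAL_APP_ID }}:${{ secrets.SERVICE_PRINCIPAL_SECRET }}" \
            "$DATABRICKS_HOST/oidc/v1/token" \
            -d "grant_type=client_credentials&scope=all-apis")
          export DATABRICKS_TOKEN=$(echo "$response" | jq -r '.access_token')

          # Deploy the bundle to the test target
          databricks bundle deploy --target test \
            --var service_principal_id=${{ secrets.SERVICE_PRINCIPAL_APP_ID }}
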
Authentication with Service Principals


A key part of the CI/CD automation is using service principals for authentication. In the GitHub workflows, we:

  1. Obtain an OAuth token using the service principal credentials
  2. Use that token for Databricks CLI authentication
  3. Pass the service principal ID as a variable during deployment

# Get OAuth token for service principal
response=$(curl -s -X POST \
  -u "${{ secrets.SERVICE_PRINCIPAL_APP_ID }}:${{ secrets.SERVICE_PRINCIPAL_SECRET }}" \
  "$DATABRICKS_HOST/oidc/v1/token" \
  -d "grant_type=client_credentials&scope=all-apis")

# Extract token and set environment variables
token=$(echo $response | jq -r '.access_token')
export DATABRICKS_TOKEN="$token"

# Deploy with service principal ID variable
databricks bundle deploy --target test --var service_principal_id=${{ secrets.SERVICE_PRINCIPAL_APP_ID }}
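
On the bundle side, service_principal_id is just a declared variable. How the template consumes it is not shown here, but one plausible pattern (an assumption, not the template's confirmed approach) is to declare it in databricks.yml and feed it into a run_as block so that deployed jobs run as the service principal:


# Hypothetical snippet in databricks.yml
variables:
  service_principal_id:
    description: "Application ID of the deployment service principal, supplied via --var"

run_as:
  service_principal_name: ${var.service_principal_id}
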
Advanced Techniques


Here are some advanced techniques you can use with this setup:

1. Dynamic Configuration Based on Branch


You can make your CI/CD pipeline smarter by adjusting configuration based on the Git branch:


- name: Set environment variables based on branch
  run: |
    if [[ "${{ github.ref }}" == "refs/heads/main" ]]; then
      echo "TARGET_ENV=prod" >> $GITHUB_ENV
    elif [[ "${{ github.ref }}" == "refs/heads/develop" ]]; then
      echo "TARGET_ENV=test" >> $GITHUB_ENV
    else
      echo "TARGET_ENV=dev" >> $GITHUB_ENV
    fi

- name: Update variable includes
  run: |
    yq -i 'with(.include[] | select(. == "variables/*.dev.yml"); . = "variables/*.${{ env.TARGET_ENV }}.yml")' databricks.yml
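
The resolved TARGET_ENV can then drive the deployment itself, so a single workflow file serves every branch. A short sketch, reusing the secret name from the authentication section above:


- name: Deploy bundle to the selected target
  run: |
    databricks bundle deploy --target ${{ env.TARGET_ENV }} \
      --var service_principal_id=${{ secrets.SERVICE_PRINCIPAL_APP_ID }}
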
2. Feature Flags via Variables


You can implement simple feature flags using variables:


variables:
  features:
    type: complex
    default:
      enable_advanced_analytics: true
      enable_real_time_processing: false

Then in your workflows:


tasks:
  - task_key: optional_analytics_step
    notebook_task:
      notebook_path: "/path/to/analytics"
    if: ${var.features.enable_advanced_analytics}
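
The if: shorthand above is meant to be illustrative. One concrete way to express the same gate inside a Databricks job, assuming the features variable from the previous block, is a condition task that the optional step depends on:


tasks:
  - task_key: check_advanced_analytics
    condition_task:
      op: EQUAL_TO
      left: "${var.features.enable_advanced_analytics}"
      right: "true"

  - task_key: optional_analytics_step
    depends_on:
      - task_key: check_advanced_analytics
        outcome: "true"
    notebook_task:
      notebook_path: "/path/to/analytics"
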
3. Template Workflows with Parameters


You can create reusable workflow templates by parameterizing common patterns:


# resources/templates/ingest_template.yml
resources:
  jobs:
    ${var.job_name}: # Dynamic job name
      name: "Ingest ${var.source_name} Data"
      tasks:
        - task_key: ingest_data
          notebook_task:
            notebook_path: "/Shared/ingest/${var.source_type}"
            base_parameters:
              source_config: ${var.source_config}
              target_table: ${var.target_table}
Conclusion


By combining DAB's parameterization capabilities with automated CI/CD pipelines, you can create a robust, maintainable system for deploying Databricks resources across environments. This approach gives you:

  • Clear separation of configuration from implementation
  • Environment-specific settings without code duplication
  • Automated testing and deployment
  • Consistent deployment process across environments
  • Version-controlled infrastructure and configuration

What's your experience with Databricks Asset Bundles? Have you found other useful patterns for managing multi-environment deployments? Let me know in the comments!

