Skip to main content

DRY Principles for msh

This guide shows you how to follow DRY (Don't Repeat Yourself) principles when building data pipelines with msh. Learn how to eliminate repetition across hundreds of .msh files.

Overview

As your msh project grows, you'll find yourself repeating the same credentials, SQL patterns, and configurations across multiple files. Following DRY principles helps you:

  • Maintain consistency across your project
  • Reduce errors from copy-paste mistakes
  • Simplify updates by changing things in one place
  • Improve readability by keeping files focused

Currently Available DRY Features

1. Source Definitions (dbt-style)

Problem: Repeating credentials in every .msh file

# ❌ Repetitive
# models/staging/stg_orders.msh
ingest:
type: sql_database
credentials: "postgresql://user:pass@prod-db.com/sales"
table: "public.orders"

# models/staging/stg_customers.msh
ingest:
type: sql_database
credentials: "postgresql://user:pass@prod-db.com/sales" # Same credentials!
table: "public.customers"

Solution: Define sources once in msh.yaml

# ✅ DRY - Define once
# msh.yaml
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}" # Environment variable
schema: public
tables:
- name: orders
- name: customers
- name: products

# models/staging/stg_orders.msh
ingest:
source: prod_db
table: orders

# models/staging/stg_customers.msh
ingest:
source: prod_db
table: customers

Benefits:

  • ✅ Change credentials once, update everywhere
  • ✅ Use environment variables for sensitive data
  • ✅ Clear source catalog in one place

See msh.yaml Configuration Reference for complete documentation.


2. Model References ({{ ref() }})

Problem: Hard-coding table names in SQL

-- ❌ Repetitive and fragile
SELECT * FROM model_orders_a1b2
SELECT * FROM model_orders_a1b2 -- Breaks when hash changes!

Solution: Use {{ ref() }} for model dependencies

-- ✅ DRY - Automatic dependency resolution
SELECT * FROM {{ ref('stg_orders') }}
SELECT * FROM {{ ref('stg_customers') }}

Benefits:

  • ✅ Automatic dependency resolution
  • ✅ Works with Blue/Green deployment (hash changes)
  • ✅ Builds DAG automatically
  • ✅ Run upstream: msh run +fct_orders (runs all dependencies)

3. External SQL Files (transform_file)

Problem: Repeating complex SQL transformations

# ❌ Repetitive SQL
# models/staging/stg_orders.msh
transform: |
SELECT
id,
TRIM(UPPER(customer_name)) as customer_name,
CAST(amount AS DECIMAL(10,2)) as amount,
DATE_TRUNC('day', created_at) as order_date
FROM {{ source }}

# models/staging/stg_customers.msh
transform: |
SELECT
id,
TRIM(UPPER(customer_name)) as customer_name, # Same transformation!
email,
DATE_TRUNC('day', created_at) as signup_date
FROM {{ source }}

Solution: Extract common SQL to shared files

-- ✅ DRY - Shared SQL
-- sql/common_cleaning.sql
SELECT
id,
TRIM(UPPER(customer_name)) as customer_name,
CAST(amount AS DECIMAL(10,2)) as amount,
DATE_TRUNC('day', created_at) as order_date
FROM {{ source }}
# models/staging/stg_orders.msh
name: stg_orders
ingest:
source: prod_db
table: orders
transform_file: sql/common_cleaning.sql

# models/staging/stg_customers.msh
name: stg_customers
ingest:
source: prod_db
table: customers
transform_file: sql/common_cleaning.sql # Reuse!

Benefits:

  • ✅ Share complex SQL patterns
  • ✅ Update once, change everywhere
  • ✅ Keep .msh files clean and focused

4. Environment Variables

Problem: Hard-coding values that change per environment

# ❌ Hard-coded values
ingest:
type: sql_database
credentials: "postgresql://prod-user:prod-pass@prod-db.com/sales"
# Different for dev/staging/prod!

Solution: Use environment variables

# ✅ DRY - Environment-aware
# msh.yaml
sources:
- name: prod_db
type: sql_database
credentials: "${DB_CREDENTIALS}" # Set per environment
schema: "${DB_SCHEMA}" # Different per env

# .env (dev)
DB_CREDENTIALS=postgresql://dev-user:dev-pass@dev-db.com/sales
DB_SCHEMA=dev_schema

# .env (prod)
DB_CREDENTIALS=postgresql://prod-user:prod-pass@prod-db.com/sales
DB_SCHEMA=public

Benefits:

  • ✅ Same config, different environments
  • ✅ No hard-coded secrets
  • ✅ Easy environment switching

See Environments Guide for more details.


5. Layered Architecture (DAGs)

Problem: Repeating joins and transformations

-- ❌ Repetitive joins
-- models/marts/fct_orders.msh
SELECT
o.id,
o.amount,
c.name as customer_name, # Join repeated everywhere
p.name as product_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.id
JOIN {{ ref('stg_products') }} p ON o.product_id = p.id

-- models/marts/fct_revenue.msh
SELECT
o.id,
o.amount,
c.name as customer_name, # Same join again!
p.name as product_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.id
JOIN {{ ref('stg_products') }} p ON o.product_id = p.id

Solution: Build intermediate layers

# ✅ DRY - Intermediate layer
# models/intermediate/int_order_customer_product.msh
name: int_order_customer_product
transform: |
SELECT
o.id,
o.amount,
c.name as customer_name,
p.name as product_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.id
JOIN {{ ref('stg_products') }} p ON o.product_id = p.id

# models/marts/fct_orders.msh
name: fct_orders
transform: |
SELECT * FROM {{ ref('int_order_customer_product') }}

# models/marts/fct_revenue.msh
name: fct_revenue
transform: |
SELECT
customer_name,
SUM(amount) as total_revenue
FROM {{ ref('int_order_customer_product') }} # Reuse intermediate!
GROUP BY customer_name

Benefits:

  • ✅ Define joins once in intermediate layer
  • ✅ Reuse in multiple marts
  • ✅ Clear data lineage
  • ✅ Easier to maintain

6. Shared Test Suites

Problem: Repeating test definitions

# ❌ Repetitive tests
# models/staging/stg_orders.msh
tests:
- assert: "id IS NOT NULL"
- assert: "amount > 0"

# models/staging/stg_customers.msh
tests:
- assert: "id IS NOT NULL" # Same test!
- assert: "email IS NOT NULL"

Solution: Define test suites once in msh.yaml

# ✅ DRY - Define once
# msh.yaml
test_suites:
staging:
- assert: "id IS NOT NULL"
- assert: "created_at IS NOT NULL"
- unique: id
financial:
- assert: "amount > 0"

# models/staging/stg_orders.msh
name: stg_orders
test_suites:
- staging
- financial
# Automatically expands to all tests from both suites

# models/staging/stg_customers.msh
name: stg_customers
test_suites:
- staging
# Reuses the same test suite!

Benefits:

  • ✅ Define tests once, reuse everywhere
  • ✅ Consistent quality checks across similar models
  • ✅ Easy to update tests in one place
  • ✅ Can combine multiple suites

See msh.yaml Configuration Reference for complete documentation.


7. Variable Substitution in SQL

Problem: Hard-coding values that change per environment

-- ❌ Hard-coded values
SELECT * FROM {{ source }}
WHERE status = 'active' -- Hard-coded!
AND created_at > '2024-01-01' -- Changes per environment!

Solution: Use variables from msh.yaml

# ✅ DRY - Define once
# msh.yaml
vars:
active_status: "active"
start_date: "${START_DATE}" # Environment variable
batch_size: 1000
-- ✅ DRY - Use variables
SELECT * FROM {{ source }}
WHERE status = '{{ var("active_status") }}'
AND created_at > '{{ var("start_date") }}'
LIMIT {{ var("batch_size") }}

Benefits:

  • ✅ No hard-coded values in SQL
  • ✅ Environment-specific values via env vars
  • ✅ Change values in one place
  • ✅ Variables resolved at compile time

See msh.yaml Configuration Reference for complete documentation.


8. Shared Config Defaults

Problem: Repeating config values

# ❌ Repetitive configs
# models/staging/stg_orders.msh
write_disposition: merge
primary_key: id

# models/staging/stg_customers.msh
write_disposition: merge
primary_key: id # Same config!

Solution: Define defaults by layer in msh.yaml

# ✅ DRY - Define once
# msh.yaml
defaults:
staging:
write_disposition: merge
primary_key: id
test_suites:
- staging
marts:
write_disposition: replace

# models/staging/stg_orders.msh
name: stg_orders
layer: staging # Or auto-detected from models/staging/ path
ingest:
source: prod_db
table: orders
# Automatically inherits: write_disposition, primary_key, test_suites

Benefits:

  • ✅ Define configs once per layer
  • ✅ Auto-detection from directory path
  • ✅ Override defaults when needed
  • ✅ Consistent patterns across layers

See msh.yaml Configuration Reference for complete documentation.


9. SQL Macros

Problem: Repeating SQL functions/patterns

-- ❌ Repetitive functions
SELECT TRIM(UPPER(name)) FROM ...
SELECT TRIM(UPPER(email)) FROM ...
SELECT TRIM(UPPER(address)) FROM ...

Solution: Define macros in macros/ directory

-- ✅ DRY - Define once
-- macros/common.sql
{% macro clean_string(column_name) %}
TRIM(UPPER({{ column_name }}))
{% endmacro %}

{% macro format_date(column_name, format='YYYY-MM-DD') %}
DATE_FORMAT({{ column_name }}, '{{ format }}')
{% endmacro %}
-- ✅ DRY - Use macros
SELECT {{ clean_string('name') }} FROM ...
SELECT {{ clean_string('email') }} FROM ...
SELECT {{ format_date('created_at', 'YYYY-MM') }} FROM ...

Benefits:

  • ✅ Reusable SQL patterns
  • ✅ Supports parameters and defaults
  • ✅ Macros can call other macros
  • ✅ Can use variables in macros

See msh.yaml Configuration Reference for complete documentation.


Best Practices Summary

✅ DO:

  1. Use source definitions for shared credentials
  2. Use {{ ref() }} for model dependencies
  3. Extract common SQL to transform_file
  4. Use environment variables for config values
  5. Build intermediate layers for shared transformations
  6. Organize by layers (staging → intermediate → marts)
  7. Use test suites for common quality checks
  8. Use variables instead of hard-coding values in SQL
  9. Define config defaults by layer
  10. Create SQL macros for repeated patterns

❌ DON'T:

  1. Hard-code credentials in every file
  2. Hard-code table names with hashes
  3. Repeat complex SQL patterns
  4. Hard-code environment-specific values
  5. Repeat joins across multiple models
  6. Mix concerns (ingestion + transformation + business logic)

Example: Complete DRY Project Structure

my_project/
├── msh.yaml # Sources, vars, test_suites, defaults
├── .env # Environment variables
├── macros/
│ └── common.sql # SQL macros
├── models/
│ ├── staging/
│ │ ├── stg_orders.msh # References source, uses macros, inherits defaults
│ │ ├── stg_customers.msh # References source, uses macros, inherits defaults
│ │ └── stg_products.msh # References source, uses macros, inherits defaults
│ ├── intermediate/
│ │ └── int_order_customer_product.msh # Shared joins
│ └── marts/
│ ├── fct_orders.msh # Uses intermediate, inherits marts defaults
│ └── fct_revenue.msh # Uses intermediate, inherits marts defaults
└── sql/
└── common_cleaning.sql # Shared SQL patterns

Complete msh.yaml Example

# msh.yaml
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders
- name: customers

vars:
active_status: "active"
start_date: "${START_DATE}"

test_suites:
staging:
- assert: "id IS NOT NULL"
- assert: "created_at IS NOT NULL"
- unique: id

defaults:
staging:
write_disposition: merge
primary_key: id
test_suites:
- staging
marts:
write_disposition: replace

Migration Path

  1. Start: Identify repetitive patterns
  2. Extract: Move to shared locations (sources, SQL files)
  3. Refactor: Update .msh files to reference shared resources
  4. Test: Verify everything still works
  5. Iterate: Continue identifying patterns

Frequently Asked Questions

Q: Can I mix direct credentials and source references?

A: Yes! Both work in the same project. Migrate gradually.

Q: What if I need different configs per model?

A: You can override in individual .msh files. Source provides defaults.

Q: How do I share SQL between models?

A: Use transform_file to reference external SQL files.

Q: Can I use environment variables in SQL?

A: Yes! Define variables in msh.yaml using vars: section, then use {{ var("variable_name") }} in SQL. Variables can reference environment variables using ${VAR_NAME} syntax.

Q: How do I know if I'm repeating too much?

A: If you find yourself copying the same code/config to multiple files, it's time to extract it to a shared location.