DRY Principles for msh

This guide shows you how to follow DRY (Don't Repeat Yourself) principles when building data pipelines with msh. Learn how to eliminate repetition across hundreds of .msh files.

Overview

As your msh project grows, you'll find yourself repeating the same credentials, SQL patterns, and configurations across multiple files. Following DRY principles helps you:

Maintain consistency across your project
Reduce errors from copy-paste mistakes
Simplify updates by changing things in one place
Improve readability by keeping files focused

Currently Available DRY Features

1. Source Definitions (dbt-style)

Problem: Repeating credentials in every .msh file

# ❌ Repetitive
# models/staging/stg_orders.msh
ingest:
  type: sql_database
  credentials: "postgresql://user:pass@prod-db.com/sales"
  table: "public.orders"

# models/staging/stg_customers.msh
ingest:
  type: sql_database
  credentials: "postgresql://user:pass@prod-db.com/sales"  # Same credentials!
  table: "public.customers"

Solution: Define sources once in msh.yaml

# ✅ DRY - Define once
# msh.yaml
sources:
  - name: prod_db
    type: sql_database
    credentials: "${DB_PROD_CREDENTIALS}"  # Environment variable
    schema: public
    tables:
      - name: orders
      - name: customers
      - name: products

# models/staging/stg_orders.msh
ingest:
  source: prod_db
  table: orders

# models/staging/stg_customers.msh
ingest:
  source: prod_db
  table: customers

Benefits:

✅ Change credentials once, update everywhere
✅ Use environment variables for sensitive data
✅ Clear source catalog in one place

See msh.yaml Configuration Reference for complete documentation.

2. Model References (`{{ ref() }}`)

Problem: Hard-coding table names in SQL

-- ❌ Repetitive and fragile
SELECT * FROM model_orders_a1b2
SELECT * FROM model_orders_a1b2  -- Breaks when hash changes!

Solution: Use {{ ref() }} for model dependencies

-- ✅ DRY - Automatic dependency resolution
SELECT * FROM {{ ref('stg_orders') }}
SELECT * FROM {{ ref('stg_customers') }}

Benefits:

✅ Automatic dependency resolution
✅ Works with Blue/Green deployment (hash changes)
✅ Builds DAG automatically
✅ Run upstream: msh run +fct_orders (runs all dependencies)

3. External SQL Files (`transform_file`)

Problem: Repeating complex SQL transformations

# ❌ Repetitive SQL
# models/staging/stg_orders.msh
transform: |
  SELECT 
    id,
    TRIM(UPPER(customer_name)) as customer_name,
    CAST(amount AS DECIMAL(10,2)) as amount,
    DATE_TRUNC('day', created_at) as order_date
  FROM {{ source }}

# models/staging/stg_customers.msh
transform: |
  SELECT 
    id,
    TRIM(UPPER(customer_name)) as customer_name,  # Same transformation!
    email,
    DATE_TRUNC('day', created_at) as signup_date
  FROM {{ source }}

Solution: Extract common SQL to shared files

-- ✅ DRY - Shared SQL
-- sql/common_cleaning.sql
SELECT 
  id,
  TRIM(UPPER(customer_name)) as customer_name,
  CAST(amount AS DECIMAL(10,2)) as amount,
  DATE_TRUNC('day', created_at) as order_date
FROM {{ source }}

# models/staging/stg_orders.msh
name: stg_orders
ingest:
  source: prod_db
  table: orders
transform_file: sql/common_cleaning.sql

# models/staging/stg_customers.msh
name: stg_customers
ingest:
  source: prod_db
  table: customers
transform_file: sql/common_cleaning.sql  # Reuse!

Benefits:

✅ Share complex SQL patterns
✅ Update once, change everywhere
✅ Keep .msh files clean and focused

4. Environment Variables

Problem: Hard-coding values that change per environment

# ❌ Hard-coded values
ingest:
  type: sql_database
  credentials: "postgresql://prod-user:prod-pass@prod-db.com/sales"
  # Different for dev/staging/prod!

Solution: Use environment variables

# ✅ DRY - Environment-aware
# msh.yaml
sources:
  - name: prod_db
    type: sql_database
    credentials: "${DB_CREDENTIALS}"  # Set per environment
    schema: "${DB_SCHEMA}"           # Different per env

# .env (dev)
DB_CREDENTIALS=postgresql://dev-user:dev-pass@dev-db.com/sales
DB_SCHEMA=dev_schema

# .env (prod)
DB_CREDENTIALS=postgresql://prod-user:prod-pass@prod-db.com/sales
DB_SCHEMA=public

Benefits:

✅ Same config, different environments
✅ No hard-coded secrets
✅ Easy environment switching

See Environments Guide for more details.

5. Layered Architecture (DAGs)

Problem: Repeating joins and transformations

-- ❌ Repetitive joins
-- models/marts/fct_orders.msh
SELECT 
  o.id,
  o.amount,
  c.name as customer_name,  # Join repeated everywhere
  p.name as product_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.id
JOIN {{ ref('stg_products') }} p ON o.product_id = p.id

-- models/marts/fct_revenue.msh
SELECT 
  o.id,
  o.amount,
  c.name as customer_name,  # Same join again!
  p.name as product_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.id
JOIN {{ ref('stg_products') }} p ON o.product_id = p.id

Solution: Build intermediate layers

# ✅ DRY - Intermediate layer
# models/intermediate/int_order_customer_product.msh
name: int_order_customer_product
transform: |
  SELECT 
    o.id,
    o.amount,
    c.name as customer_name,
    p.name as product_name
  FROM {{ ref('stg_orders') }} o
  JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.id
  JOIN {{ ref('stg_products') }} p ON o.product_id = p.id

# models/marts/fct_orders.msh
name: fct_orders
transform: |
  SELECT * FROM {{ ref('int_order_customer_product') }}

# models/marts/fct_revenue.msh
name: fct_revenue
transform: |
  SELECT 
    customer_name,
    SUM(amount) as total_revenue
  FROM {{ ref('int_order_customer_product') }}  # Reuse intermediate!
  GROUP BY customer_name

Benefits:

✅ Define joins once in intermediate layer
✅ Reuse in multiple marts
✅ Clear data lineage
✅ Easier to maintain

6. Shared Test Suites

Problem: Repeating test definitions

# ❌ Repetitive tests
# models/staging/stg_orders.msh
tests:
  - assert: "id IS NOT NULL"
  - assert: "amount > 0"

# models/staging/stg_customers.msh
tests:
  - assert: "id IS NOT NULL"  # Same test!
  - assert: "email IS NOT NULL"

Solution: Define test suites once in msh.yaml

# ✅ DRY - Define once
# msh.yaml
test_suites:
  staging:
    - assert: "id IS NOT NULL"
    - assert: "created_at IS NOT NULL"
    - unique: id
  financial:
    - assert: "amount > 0"

# models/staging/stg_orders.msh
name: stg_orders
test_suites:
  - staging
  - financial
# Automatically expands to all tests from both suites

# models/staging/stg_customers.msh
name: stg_customers
test_suites:
  - staging
# Reuses the same test suite!

Benefits:

✅ Define tests once, reuse everywhere
✅ Consistent quality checks across similar models
✅ Easy to update tests in one place
✅ Can combine multiple suites

See msh.yaml Configuration Reference for complete documentation.

7. Variable Substitution in SQL

Problem: Hard-coding values that change per environment

-- ❌ Hard-coded values
SELECT * FROM {{ source }}
WHERE status = 'active'  -- Hard-coded!
AND created_at > '2024-01-01'  -- Changes per environment!

Solution: Use variables from msh.yaml

# ✅ DRY - Define once
# msh.yaml
vars:
  active_status: "active"
  start_date: "${START_DATE}"  # Environment variable
  batch_size: 1000

-- ✅ DRY - Use variables
SELECT * FROM {{ source }}
WHERE status = '{{ var("active_status") }}'
AND created_at > '{{ var("start_date") }}'
LIMIT {{ var("batch_size") }}

Benefits:

✅ No hard-coded values in SQL
✅ Environment-specific values via env vars
✅ Change values in one place
✅ Variables resolved at compile time

See msh.yaml Configuration Reference for complete documentation.

8. Shared Config Defaults

Problem: Repeating config values

# ❌ Repetitive configs
# models/staging/stg_orders.msh
write_disposition: merge
primary_key: id

# models/staging/stg_customers.msh
write_disposition: merge
primary_key: id  # Same config!

Solution: Define defaults by layer in msh.yaml

# ✅ DRY - Define once
# msh.yaml
defaults:
  staging:
    write_disposition: merge
    primary_key: id
    test_suites:
      - staging
  marts:
    write_disposition: replace

# models/staging/stg_orders.msh
name: stg_orders
layer: staging  # Or auto-detected from models/staging/ path
ingest:
  source: prod_db
  table: orders
# Automatically inherits: write_disposition, primary_key, test_suites

Benefits:

✅ Define configs once per layer
✅ Auto-detection from directory path
✅ Override defaults when needed
✅ Consistent patterns across layers

See msh.yaml Configuration Reference for complete documentation.

9. SQL Macros

Problem: Repeating SQL functions/patterns

-- ❌ Repetitive functions
SELECT TRIM(UPPER(name)) FROM ...
SELECT TRIM(UPPER(email)) FROM ...
SELECT TRIM(UPPER(address)) FROM ...

Solution: Define macros in macros/ directory

-- ✅ DRY - Define once
-- macros/common.sql
{% macro clean_string(column_name) %}
  TRIM(UPPER({{ column_name }}))
{% endmacro %}

{% macro format_date(column_name, format='YYYY-MM-DD') %}
  DATE_FORMAT({{ column_name }}, '{{ format }}')
{% endmacro %}

-- ✅ DRY - Use macros
SELECT {{ clean_string('name') }} FROM ...
SELECT {{ clean_string('email') }} FROM ...
SELECT {{ format_date('created_at', 'YYYY-MM') }} FROM ...

Benefits:

✅ Reusable SQL patterns
✅ Supports parameters and defaults
✅ Macros can call other macros
✅ Can use variables in macros

See msh.yaml Configuration Reference for complete documentation.

Best Practices Summary

✅ DO:

Use source definitions for shared credentials
Use {{ ref() }} for model dependencies
Extract common SQL to transform_file
Use environment variables for config values
Build intermediate layers for shared transformations
Organize by layers (staging → intermediate → marts)
Use test suites for common quality checks
Use variables instead of hard-coding values in SQL
Define config defaults by layer
Create SQL macros for repeated patterns

❌ DON'T:

Hard-code credentials in every file
Hard-code table names with hashes
Repeat complex SQL patterns
Hard-code environment-specific values
Repeat joins across multiple models
Mix concerns (ingestion + transformation + business logic)

Example: Complete DRY Project Structure

my_project/
├── msh.yaml                    # Sources, vars, test_suites, defaults
├── .env                        # Environment variables
├── macros/
│   └── common.sql              # SQL macros
├── models/
│   ├── staging/
│   │   ├── stg_orders.msh      # References source, uses macros, inherits defaults
│   │   ├── stg_customers.msh   # References source, uses macros, inherits defaults
│   │   └── stg_products.msh    # References source, uses macros, inherits defaults
│   ├── intermediate/
│   │   └── int_order_customer_product.msh  # Shared joins
│   └── marts/
│       ├── fct_orders.msh      # Uses intermediate, inherits marts defaults
│       └── fct_revenue.msh      # Uses intermediate, inherits marts defaults
└── sql/
    └── common_cleaning.sql      # Shared SQL patterns

Complete msh.yaml Example

# msh.yaml
sources:
  - name: prod_db
    type: sql_database
    credentials: "${DB_PROD_CREDENTIALS}"
    schema: public
    tables:
      - name: orders
      - name: customers

vars:
  active_status: "active"
  start_date: "${START_DATE}"

test_suites:
  staging:
    - assert: "id IS NOT NULL"
    - assert: "created_at IS NOT NULL"
    - unique: id

defaults:
  staging:
    write_disposition: merge
    primary_key: id
    test_suites:
      - staging
  marts:
    write_disposition: replace

Migration Path

Start: Identify repetitive patterns
Extract: Move to shared locations (sources, SQL files)
Refactor: Update .msh files to reference shared resources
Test: Verify everything still works
Iterate: Continue identifying patterns

Frequently Asked Questions

Q: Can I mix direct credentials and source references?

A: Yes! Both work in the same project. Migrate gradually.

Q: What if I need different configs per model?

A: You can override in individual .msh files. Source provides defaults.

Q: How do I share SQL between models?

A: Use transform_file to reference external SQL files.

Q: Can I use environment variables in SQL?

A: Yes! Define variables in msh.yaml using vars: section, then use {{ var("variable_name") }} in SQL. Variables can reference environment variables using ${VAR_NAME} syntax.

Q: How do I know if I'm repeating too much?

A: If you find yourself copying the same code/config to multiple files, it's time to extract it to a shared location.

msh.yaml Configuration Reference - Complete source definitions guide
.msh File Format Reference - How to use source references
Universal Connectivity - Connecting to various data sources
Environments Guide - Managing multiple environments

Overview​

Currently Available DRY Features​

1. Source Definitions (dbt-style)​

2. Model References ({{ ref() }})​

3. External SQL Files (transform_file)​

4. Environment Variables​

5. Layered Architecture (DAGs)​

6. Shared Test Suites​

7. Variable Substitution in SQL​

8. Shared Config Defaults​

9. SQL Macros​

Best Practices Summary​

✅ DO:​

❌ DON'T:​

Example: Complete DRY Project Structure​

Complete msh.yaml Example​

Migration Path​

Frequently Asked Questions​

Related Documentation​

Overview

Currently Available DRY Features

1. Source Definitions (dbt-style)

2. Model References (`{{ ref() }}`)

3. External SQL Files (`transform_file`)

4. Environment Variables

5. Layered Architecture (DAGs)

6. Shared Test Suites

7. Variable Substitution in SQL

8. Shared Config Defaults

9. SQL Macros

Best Practices Summary

✅ DO:

❌ DON'T:

Example: Complete DRY Project Structure

Complete msh.yaml Example

Migration Path

Frequently Asked Questions

Related Documentation