Skip to main content

msh.yaml Configuration Reference

The msh.yaml file is the central configuration file for your msh project. It allows you to define data sources once and reference them from individual .msh files, following the DRY (Don't Repeat Yourself) principle.

File Location

Place msh.yaml in the root of your msh project:

my-project/
├── msh.yaml # Project configuration
├── .env # Environment variables
└── models/ # Your .msh files
└── ...

Sources Section

The sources section defines data sources that can be referenced from .msh files. This eliminates credential repetition across hundreds of files.

Basic Structure

sources:
- name: source_name
type: source_type
credentials: "${ENV_VAR}"
# Additional type-specific fields

SQL Database Sources

Define SQL database connections once and reference tables from multiple .msh files.

Basic SQL Database Source

sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders
- name: customers
- name: products

Using Environment Variables

Always use environment variables for credentials:

sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}" # Resolved from .env
schema: public
tables:
- name: orders

In .env file:

DB_PROD_CREDENTIALS="postgresql://user:pass@prod-db.com/sales"

Schema and Table Resolution

When referencing a source in a .msh file, you can specify the table in two ways:

Option 1: Simple table name (uses source's schema)

# msh.yaml
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders

# models/stg_orders.msh
ingest:
source: prod_db
table: orders # Resolves to public.orders

Option 2: Schema-qualified table name

# models/stg_orders.msh
ingest:
source: prod_db
table: public.orders # Explicit schema.table format

Multiple Schemas

You can define multiple sources for different schemas:

sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders
- name: customers

- name: analytics_db
type: sql_database
credentials: "${DB_ANALYTICS_CREDENTIALS}"
schema: analytics
tables:
- name: events
- name: sessions

REST API Sources

Define REST API endpoints once and reference resources from multiple .msh files.

Basic REST API Source

sources:
- name: jsonplaceholder
type: rest_api
endpoint: "https://jsonplaceholder.typicode.com"
resources:
- name: users
- name: posts
- name: comments

REST API with Authentication

sources:
- name: github_api
type: rest_api
endpoint: "https://api.github.com"
credentials:
token: "${GITHUB_TOKEN}"
resources:
- name: repos
- name: issues
- name: pull_requests

In .env file:

GITHUB_TOKEN="ghp_your_token_here"

Resource Resolution

When referencing a REST API source in a .msh file:

# msh.yaml
sources:
- name: jsonplaceholder
type: rest_api
endpoint: "https://jsonplaceholder.typicode.com"
resources:
- name: users

# models/stg_users.msh
ingest:
source: jsonplaceholder
resource: users # References the 'users' resource

Environment Variable Syntax

msh supports ${VAR_NAME} syntax for environment variable substitution:

sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: "${DB_SCHEMA}" # Can use env vars for any field

Environment variables are resolved:

  1. From .env file in project root
  2. From system environment variables
  3. Recursively (nested ${VAR} references are supported)

Layered Project Example

A complete example showing staging → intermediate → marts layers:

# msh.yaml
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders
- name: customers
- name: products

- name: stripe_api
type: rest_api
endpoint: "https://api.stripe.com"
credentials:
api_key: "${STRIPE_API_KEY}"
resources:
- name: charges
- name: customers

Staging layer:

# models/staging/stg_orders.msh
name: stg_orders
ingest:
source: prod_db
table: orders
transform: |
SELECT * FROM {{ source }}

Intermediate layer:

# models/intermediate/int_order_customer.msh
name: int_order_customer
transform: |
SELECT
o.*,
c.name as customer_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.id

Marts layer:

# models/marts/fct_orders.msh
name: fct_orders
transform: |
SELECT * FROM {{ ref('int_order_customer') }}

Backward Compatibility

Direct credentials still work! You can mix source references and direct credentials in the same project:

# ✅ Using source reference
# models/stg_orders.msh
ingest:
source: prod_db
table: orders

# ✅ Using direct credentials (still works)
# models/stg_legacy.msh
ingest:
type: sql_database
credentials: "${LEGACY_DB_URI}"
table: legacy_table

This allows gradual migration:

  1. Start with existing files using direct credentials
  2. Add sources to msh.yaml
  3. Migrate .msh files one at a time to use source references
  4. Both styles can coexist

Error Handling

msh provides clear error messages for source resolution issues:

Source Not Found

[ERROR] Source 'prod_db' not found in msh.yaml.
Available sources: staging_db, analytics_db

Fix: Check the source name in msh.yaml matches what you're referencing.

Table/Resource Not Found

[ERROR] Table 'orders' not found in source 'prod_db'.
Available tables: customers, products

Fix: Add the table to the source's tables list in msh.yaml.

Mixing Source Reference with Direct Credentials

[ERROR] Cannot specify both 'source' reference and direct 'type'/'credentials' in ingest block

Fix: Use either a source reference OR direct credentials, not both.

Missing Required Fields

[ERROR] Source reference requires either 'table' (for SQL sources) or 'resource' (for API sources)

Fix: Add table: for SQL database sources or resource: for REST API sources.


Best Practices

  1. Use Environment Variables: Never hardcode credentials in msh.yaml
  2. Organize by Source: Group related tables/resources under the same source
  3. Document Sources: Add comments explaining what each source represents
  4. Version Control: Commit msh.yaml but never commit .env files
  5. Gradual Migration: Migrate from direct credentials to source references over time

Example with Comments

sources:
# Production PostgreSQL database
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders # Customer orders
- name: customers # Customer master data
- name: products # Product catalog

# Stripe API for payment data
- name: stripe_api
type: rest_api
endpoint: "https://api.stripe.com"
credentials:
api_key: "${STRIPE_API_KEY}"
resources:
- name: charges # Payment charges
- name: customers # Stripe customer records


Variables Section

The vars section defines variables that can be used in SQL transformations using {{ var("variable_name") }}. This eliminates hard-coded values that change per environment.

Basic Structure

vars:
variable_name: "value"
env_var_ref: "${ENV_VAR_NAME}"

Using Variables in SQL

Define in msh.yaml:

vars:
active_status: "active"
start_date: "${START_DATE}" # Environment variable
batch_size: 1000

Use in .msh files:

transform: |
SELECT * FROM {{ source }}
WHERE status = '{{ var("active_status") }}'
AND created_at > '{{ var("start_date") }}'
LIMIT {{ var("batch_size") }}

Environment Variable Support

Variables can reference environment variables using ${VAR_NAME} syntax:

vars:
start_date: "${START_DATE}" # From .env
end_date: "${END_DATE}" # From .env
active_status: "active" # Static value

In .env file:

START_DATE=2024-01-01
END_DATE=2024-12-31

Variable Resolution

Variables are resolved at compile time:

  1. Environment variables are resolved first
  2. Variables are then available in SQL via {{ var("name") }}
  3. Clear error if variable not found

Test Suites Section

The test_suites section defines reusable test suites that can be referenced from multiple .msh files, eliminating test repetition.

Basic Structure

test_suites:
suite_name:
- assert: "SQL assertion"
- unique: column_name
- not_null: column_name

Defining Test Suites

test_suites:
staging:
- assert: "id IS NOT NULL"
- assert: "created_at IS NOT NULL"
- unique: id

financial:
- assert: "amount > 0"
- assert: "amount IS NOT NULL"

quality:
- not_null: [id, email, created_at]

Using Test Suites

Reference test suites in .msh files:

# models/staging/stg_orders.msh
name: stg_orders
test_suites:
- staging
- financial
# Automatically expands to all tests from both suites

Combining Suites with Individual Tests

You can combine test suites with individual tests:

# models/staging/stg_orders.msh
name: stg_orders
test_suites:
- staging
- financial
tests:
- assert: "order_date >= '2024-01-01'" # Additional test
# All tests are merged together

Test Suite Benefits

  • ✅ Define tests once, reuse everywhere
  • ✅ Consistent quality checks across similar models
  • ✅ Easy to update tests in one place
  • ✅ Automatic deduplication of duplicate tests

Defaults Section

The defaults section defines default configuration values by layer (staging, marts, etc.), eliminating config repetition.

Basic Structure

defaults:
layer_name:
write_disposition: merge | replace | append
primary_key: id
tests:
- assert: "..."
test_suites:
- suite_name

Defining Layer Defaults

defaults:
staging:
write_disposition: merge
primary_key: id
test_suites:
- staging

marts:
write_disposition: replace
test_suites:
- quality

intermediate:
write_disposition: append

Using Layer Defaults

Specify layer: in .msh file or let it auto-detect from directory path:

Explicit layer:

# models/staging/stg_orders.msh
name: stg_orders
layer: staging # Inherits staging defaults
ingest:
source: prod_db
table: orders
# Automatically inherits: write_disposition, primary_key, test_suites

Auto-detected layer:

# models/staging/stg_orders.msh (in models/staging/ directory)
name: stg_orders
# Layer auto-detected as "staging" from path
ingest:
source: prod_db
table: orders

Layer Auto-Detection

msh automatically detects layers from directory paths:

  • models/staging/staging
  • models/marts/marts
  • models/intermediate/intermediate

Overriding Defaults

Asset values override defaults (deep merge):

# msh.yaml
defaults:
staging:
write_disposition: merge
primary_key: id

# models/staging/stg_orders.msh
name: stg_orders
layer: staging
primary_key: order_id # Overrides default
# Still inherits: write_disposition: merge

Complete Configuration Example

A complete msh.yaml example showing all features:

# msh.yaml

# Source definitions
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders
- name: customers

- name: stripe_api
type: rest_api
endpoint: "https://api.stripe.com"
credentials:
api_key: "${STRIPE_API_KEY}"
resources:
- name: charges
- name: customers

# Variables
vars:
active_status: "active"
start_date: "${START_DATE}"
batch_size: 1000

# Test suites
test_suites:
staging:
- assert: "id IS NOT NULL"
- assert: "created_at IS NOT NULL"
- unique: id
financial:
- assert: "amount > 0"

# Config defaults
defaults:
staging:
write_disposition: merge
primary_key: id
test_suites:
- staging
marts:
write_disposition: replace
test_suites:
- quality

SQL Macros

SQL macros allow you to define reusable SQL functions/patterns in macros/ directory and use them across multiple .msh files.

Creating Macros

Create a macros/ directory in your project root:

my-project/
├── macros/
│ └── common.sql
├── msh.yaml
└── models/

Defining Macros

Define macros in .sql files using Jinja2 syntax:

-- macros/common.sql
{% macro clean_string(column_name) %}
TRIM(UPPER({{ column_name }}))
{% endmacro %}

{% macro format_date(column_name, format='YYYY-MM-DD') %}
DATE_FORMAT({{ column_name }}, '{{ format }}')
{% endmacro %}

{% macro calculate_age(birth_date) %}
DATEDIFF(YEAR, {{ birth_date }}, CURRENT_DATE)
{% endmacro %}

Using Macros

Macros are automatically loaded and available in all .msh files:

# models/staging/stg_orders.msh
name: stg_orders
transform: |
SELECT
id,
{{ clean_string('customer_name') }} as customer_name,
{{ format_date('created_at', 'YYYY-MM') }} as created_month,
{{ calculate_age('birth_date') }} as age
FROM {{ source }}

Macro Features

  • ✅ Supports parameters (positional and keyword)
  • ✅ Supports default parameter values
  • ✅ Macros can call other macros
  • ✅ Macros can use var() function
  • ✅ Clear error if macro not found

Macro Examples

With parameters:

{% macro clean_string(column_name) %}
TRIM(UPPER({{ column_name }}))
{% endmacro %}

With default values:

{% macro format_date(column_name, format='YYYY-MM-DD') %}
DATE_FORMAT({{ column_name }}, '{{ format }}')
{% endmacro %}

Using variables:

{% macro filter_by_date(column_name) %}
{{ column_name }} > '{{ var("start_date") }}'
{% endmacro %}

Error Handling

Variable Errors

[ERROR] Variable 'active_status' not found in msh.yaml vars.
Available variables: start_date, end_date

Fix: Add the variable to vars: section in msh.yaml.

Test Suite Errors

[ERROR] Test suite 'staging' not found in msh.yaml.
Available suites: financial, quality

Fix: Check the suite name spelling or add it to test_suites: section.

Default Layer Errors

[ERROR] Layer 'staging' not found in defaults.
Available layers: marts, intermediate

Fix: Add the layer to defaults: section or check the layer: field in your .msh file.

Macro Errors

[ERROR] Error rendering macro 'clean_string': ...

Fix: Check macro syntax and ensure the macro file is in macros/ directory.