msh.yaml Configuration Reference

The msh.yaml file is the central configuration file for your msh project. It allows you to define data sources once and reference them from individual .msh files, following the DRY (Don't Repeat Yourself) principle.

File Location

Place msh.yaml in the root of your msh project:

my-project/
├── msh.yaml          # Project configuration
├── .env              # Environment variables
└── models/           # Your .msh files
    └── ...

Sources Section

The sources section defines data sources that can be referenced from .msh files. This eliminates credential repetition across hundreds of files.

Basic Structure

sources:
  - name: source_name
    type: source_type
    credentials: "${ENV_VAR}"
    # Additional type-specific fields

SQL Database Sources

Define SQL database connections once and reference tables from multiple .msh files.

Basic SQL Database Source

sources:
  - name: prod_db
    type: sql_database
    credentials: "${DB_PROD_CREDENTIALS}"
    schema: public
    tables:
      - name: orders
      - name: customers
      - name: products

Using Environment Variables

Always use environment variables for credentials:

sources:
  - name: prod_db
    type: sql_database
    credentials: "${DB_PROD_CREDENTIALS}"  # Resolved from .env
    schema: public
    tables:
      - name: orders

In .env file:

DB_PROD_CREDENTIALS="postgresql://user:pass@prod-db.com/sales"

Schema and Table Resolution

When referencing a source in a .msh file, you can specify the table in two ways:

Option 1: Simple table name (uses source's schema)

# msh.yaml
sources:
  - name: prod_db
    type: sql_database
    credentials: "${DB_PROD_CREDENTIALS}"
    schema: public
    tables:
      - name: orders

# models/stg_orders.msh
ingest:
  source: prod_db
  table: orders  # Resolves to public.orders

Option 2: Schema-qualified table name

# models/stg_orders.msh
ingest:
  source: prod_db
  table: public.orders  # Explicit schema.table format

Multiple Schemas

You can define multiple sources for different schemas:

sources:
  - name: prod_db
    type: sql_database
    credentials: "${DB_PROD_CREDENTIALS}"
    schema: public
    tables:
      - name: orders
      - name: customers

  - name: analytics_db
    type: sql_database
    credentials: "${DB_ANALYTICS_CREDENTIALS}"
    schema: analytics
    tables:
      - name: events
      - name: sessions

REST API Sources

Define REST API endpoints once and reference resources from multiple .msh files.

Basic REST API Source

sources:
  - name: jsonplaceholder
    type: rest_api
    endpoint: "https://jsonplaceholder.typicode.com"
    resources:
      - name: users
      - name: posts
      - name: comments

REST API with Authentication

sources:
  - name: github_api
    type: rest_api
    endpoint: "https://api.github.com"
    credentials:
      token: "${GITHUB_TOKEN}"
    resources:
      - name: repos
      - name: issues
      - name: pull_requests

In .env file:

GITHUB_TOKEN="ghp_your_token_here"

Resource Resolution

When referencing a REST API source in a .msh file:

# msh.yaml
sources:
  - name: jsonplaceholder
    type: rest_api
    endpoint: "https://jsonplaceholder.typicode.com"
    resources:
      - name: users

# models/stg_users.msh
ingest:
  source: jsonplaceholder
  resource: users  # References the 'users' resource

Environment Variable Syntax

msh supports ${VAR_NAME} syntax for environment variable substitution:

sources:
  - name: prod_db
    type: sql_database
    credentials: "${DB_PROD_CREDENTIALS}"
    schema: "${DB_SCHEMA}"  # Can use env vars for any field

Environment variables are resolved:

From .env file in project root
From system environment variables
Recursively (nested ${VAR} references are supported)

Layered Project Example

A complete example showing staging → intermediate → marts layers:

# msh.yaml
sources:
  - name: prod_db
    type: sql_database
    credentials: "${DB_PROD_CREDENTIALS}"
    schema: public
    tables:
      - name: orders
      - name: customers
      - name: products

  - name: stripe_api
    type: rest_api
    endpoint: "https://api.stripe.com"
    credentials:
      api_key: "${STRIPE_API_KEY}"
    resources:
      - name: charges
      - name: customers

Staging layer:

# models/staging/stg_orders.msh
name: stg_orders
ingest:
  source: prod_db
  table: orders
transform: |
  SELECT * FROM {{ source }}

Intermediate layer:

# models/intermediate/int_order_customer.msh
name: int_order_customer
transform: |
  SELECT 
    o.*,
    c.name as customer_name
  FROM {{ ref('stg_orders') }} o
  JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.id

Marts layer:

# models/marts/fct_orders.msh
name: fct_orders
transform: |
  SELECT * FROM {{ ref('int_order_customer') }}

Backward Compatibility

Direct credentials still work! You can mix source references and direct credentials in the same project:

# ✅ Using source reference
# models/stg_orders.msh
ingest:
  source: prod_db
  table: orders

# ✅ Using direct credentials (still works)
# models/stg_legacy.msh
ingest:
  type: sql_database
  credentials: "${LEGACY_DB_URI}"
  table: legacy_table

This allows gradual migration:

Start with existing files using direct credentials
Add sources to msh.yaml
Migrate .msh files one at a time to use source references
Both styles can coexist

Error Handling

msh provides clear error messages for source resolution issues:

Source Not Found

[ERROR] Source 'prod_db' not found in msh.yaml.
Available sources: staging_db, analytics_db

Fix: Check the source name in msh.yaml matches what you're referencing.

Table/Resource Not Found

[ERROR] Table 'orders' not found in source 'prod_db'.
Available tables: customers, products

Fix: Add the table to the source's tables list in msh.yaml.

Mixing Source Reference with Direct Credentials

[ERROR] Cannot specify both 'source' reference and direct 'type'/'credentials' in ingest block

Fix: Use either a source reference OR direct credentials, not both.

Missing Required Fields

[ERROR] Source reference requires either 'table' (for SQL sources) or 'resource' (for API sources)

Fix: Add table: for SQL database sources or resource: for REST API sources.

Best Practices

Use Environment Variables: Never hardcode credentials in msh.yaml
Organize by Source: Group related tables/resources under the same source
Document Sources: Add comments explaining what each source represents
Version Control: Commit msh.yaml but never commit .env files
Gradual Migration: Migrate from direct credentials to source references over time

Example with Comments

sources:
  # Production PostgreSQL database
  - name: prod_db
    type: sql_database
    credentials: "${DB_PROD_CREDENTIALS}"
    schema: public
    tables:
      - name: orders        # Customer orders
      - name: customers     # Customer master data
      - name: products      # Product catalog

  # Stripe API for payment data
  - name: stripe_api
    type: rest_api
    endpoint: "https://api.stripe.com"
    credentials:
      api_key: "${STRIPE_API_KEY}"
    resources:
      - name: charges       # Payment charges
      - name: customers     # Stripe customer records

Variables Section

The vars section defines variables that can be used in SQL transformations using {{ var("variable_name") }}. This eliminates hard-coded values that change per environment.

Basic Structure

vars:
  variable_name: "value"
  env_var_ref: "${ENV_VAR_NAME}"

Using Variables in SQL

Define in msh.yaml:

vars:
  active_status: "active"
  start_date: "${START_DATE}"  # Environment variable
  batch_size: 1000

Use in .msh files:

transform: |
  SELECT * FROM {{ source }}
  WHERE status = '{{ var("active_status") }}'
  AND created_at > '{{ var("start_date") }}'
  LIMIT {{ var("batch_size") }}

Environment Variable Support

Variables can reference environment variables using ${VAR_NAME} syntax:

vars:
  start_date: "${START_DATE}"      # From .env
  end_date: "${END_DATE}"          # From .env
  active_status: "active"          # Static value

In .env file:

START_DATE=2024-01-01
END_DATE=2024-12-31

Variable Resolution

Variables are resolved at compile time:

Environment variables are resolved first
Variables are then available in SQL via {{ var("name") }}
Clear error if variable not found

Test Suites Section

The test_suites section defines reusable test suites that can be referenced from multiple .msh files, eliminating test repetition.

Basic Structure

test_suites:
  suite_name:
    - assert: "SQL assertion"
    - unique: column_name
    - not_null: column_name

Defining Test Suites

test_suites:
  staging:
    - assert: "id IS NOT NULL"
    - assert: "created_at IS NOT NULL"
    - unique: id
  
  financial:
    - assert: "amount > 0"
    - assert: "amount IS NOT NULL"
  
  quality:
    - not_null: [id, email, created_at]

Using Test Suites

Reference test suites in .msh files:

# models/staging/stg_orders.msh
name: stg_orders
test_suites:
  - staging
  - financial
# Automatically expands to all tests from both suites

Combining Suites with Individual Tests

You can combine test suites with individual tests:

# models/staging/stg_orders.msh
name: stg_orders
test_suites:
  - staging
  - financial
tests:
  - assert: "order_date >= '2024-01-01'"  # Additional test
# All tests are merged together

Test Suite Benefits

✅ Define tests once, reuse everywhere
✅ Consistent quality checks across similar models
✅ Easy to update tests in one place
✅ Automatic deduplication of duplicate tests

Defaults Section

The defaults section defines default configuration values by layer (staging, marts, etc.), eliminating config repetition.

Basic Structure

defaults:
  layer_name:
    write_disposition: merge | replace | append
    primary_key: id
    tests:
      - assert: "..."
    test_suites:
      - suite_name

Defining Layer Defaults

defaults:
  staging:
    write_disposition: merge
    primary_key: id
    test_suites:
      - staging
  
  marts:
    write_disposition: replace
    test_suites:
      - quality
  
  intermediate:
    write_disposition: append

Using Layer Defaults

Specify layer: in .msh file or let it auto-detect from directory path:

Explicit layer:

# models/staging/stg_orders.msh
name: stg_orders
layer: staging  # Inherits staging defaults
ingest:
  source: prod_db
  table: orders
# Automatically inherits: write_disposition, primary_key, test_suites

Auto-detected layer:

# models/staging/stg_orders.msh (in models/staging/ directory)
name: stg_orders
# Layer auto-detected as "staging" from path
ingest:
  source: prod_db
  table: orders

Layer Auto-Detection

msh automatically detects layers from directory paths:

models/staging/ → staging
models/marts/ → marts
models/intermediate/ → intermediate

Overriding Defaults

Asset values override defaults (deep merge):

# msh.yaml
defaults:
  staging:
    write_disposition: merge
    primary_key: id

# models/staging/stg_orders.msh
name: stg_orders
layer: staging
primary_key: order_id  # Overrides default
# Still inherits: write_disposition: merge

Complete Configuration Example

A complete msh.yaml example showing all features:

# msh.yaml

# Source definitions
sources:
  - name: prod_db
    type: sql_database
    credentials: "${DB_PROD_CREDENTIALS}"
    schema: public
    tables:
      - name: orders
      - name: customers

  - name: stripe_api
    type: rest_api
    endpoint: "https://api.stripe.com"
    credentials:
      api_key: "${STRIPE_API_KEY}"
    resources:
      - name: charges
      - name: customers

# Variables
vars:
  active_status: "active"
  start_date: "${START_DATE}"
  batch_size: 1000

# Test suites
test_suites:
  staging:
    - assert: "id IS NOT NULL"
    - assert: "created_at IS NOT NULL"
    - unique: id
  financial:
    - assert: "amount > 0"

# Config defaults
defaults:
  staging:
    write_disposition: merge
    primary_key: id
    test_suites:
      - staging
  marts:
    write_disposition: replace
    test_suites:
      - quality

SQL Macros

SQL macros allow you to define reusable SQL functions/patterns in macros/ directory and use them across multiple .msh files.

Creating Macros

Create a macros/ directory in your project root:

my-project/
├── macros/
│   └── common.sql
├── msh.yaml
└── models/

Defining Macros

Define macros in .sql files using Jinja2 syntax:

-- macros/common.sql
{% macro clean_string(column_name) %}
  TRIM(UPPER({{ column_name }}))
{% endmacro %}

{% macro format_date(column_name, format='YYYY-MM-DD') %}
  DATE_FORMAT({{ column_name }}, '{{ format }}')
{% endmacro %}

{% macro calculate_age(birth_date) %}
  DATEDIFF(YEAR, {{ birth_date }}, CURRENT_DATE)
{% endmacro %}

Using Macros

Macros are automatically loaded and available in all .msh files:

# models/staging/stg_orders.msh
name: stg_orders
transform: |
  SELECT 
    id,
    {{ clean_string('customer_name') }} as customer_name,
    {{ format_date('created_at', 'YYYY-MM') }} as created_month,
    {{ calculate_age('birth_date') }} as age
  FROM {{ source }}

Macro Features

✅ Supports parameters (positional and keyword)
✅ Supports default parameter values
✅ Macros can call other macros
✅ Macros can use var() function
✅ Clear error if macro not found

Macro Examples

With parameters:

{% macro clean_string(column_name) %}
  TRIM(UPPER({{ column_name }}))
{% endmacro %}

With default values:

{% macro format_date(column_name, format='YYYY-MM-DD') %}
  DATE_FORMAT({{ column_name }}, '{{ format }}')
{% endmacro %}

Using variables:

{% macro filter_by_date(column_name) %}
  {{ column_name }} > '{{ var("start_date") }}'
{% endmacro %}

Error Handling

Variable Errors

[ERROR] Variable 'active_status' not found in msh.yaml vars.
Available variables: start_date, end_date

Fix: Add the variable to vars: section in msh.yaml.

Test Suite Errors

[ERROR] Test suite 'staging' not found in msh.yaml.
Available suites: financial, quality

Fix: Check the suite name spelling or add it to test_suites: section.

Default Layer Errors

[ERROR] Layer 'staging' not found in defaults.
Available layers: marts, intermediate

Fix: Add the layer to defaults: section or check the layer: field in your .msh file.

Macro Errors

[ERROR] Error rendering macro 'clean_string': ...

Fix: Check macro syntax and ensure the macro file is in macros/ directory.

.msh File Format - How to reference sources in .msh files
DRY Principles - Complete guide on eliminating repetition
Universal Connectivity - Connecting to various data sources
Troubleshooting - Common source resolution issues

File Location​

Sources Section​

Basic Structure​

SQL Database Sources​

Basic SQL Database Source​

Using Environment Variables​

Schema and Table Resolution​

Multiple Schemas​

REST API Sources​

Basic REST API Source​

REST API with Authentication​

Resource Resolution​

Environment Variable Syntax​

Layered Project Example​

Backward Compatibility​

Error Handling​

Source Not Found​

Table/Resource Not Found​

Mixing Source Reference with Direct Credentials​

Missing Required Fields​

Best Practices​

Example with Comments​

Variables Section​

Basic Structure​

Using Variables in SQL​

Environment Variable Support​

Variable Resolution​

Test Suites Section​

Basic Structure​

Defining Test Suites​

Using Test Suites​

Combining Suites with Individual Tests​

Test Suite Benefits​

Defaults Section​

Basic Structure​

Defining Layer Defaults​

Using Layer Defaults​

Layer Auto-Detection​

Overriding Defaults​

Complete Configuration Example​

SQL Macros​

Creating Macros​

Defining Macros​

Using Macros​

Macro Features​

Macro Examples​

Error Handling​

Variable Errors​

Test Suite Errors​

Default Layer Errors​

Macro Errors​

Related Documentation​

File Location

Sources Section

Basic Structure

SQL Database Sources

Basic SQL Database Source

Using Environment Variables

Schema and Table Resolution

Multiple Schemas

REST API Sources

Basic REST API Source

REST API with Authentication

Resource Resolution

Environment Variable Syntax

Layered Project Example

Backward Compatibility

Error Handling

Source Not Found

Table/Resource Not Found

Mixing Source Reference with Direct Credentials

Missing Required Fields

Best Practices

Example with Comments

Variables Section

Basic Structure

Using Variables in SQL

Environment Variable Support

Variable Resolution

Test Suites Section

Basic Structure

Defining Test Suites

Using Test Suites

Combining Suites with Individual Tests

Test Suite Benefits

Defaults Section

Basic Structure

Defining Layer Defaults

Using Layer Defaults

Layer Auto-Detection

Overriding Defaults

Complete Configuration Example

SQL Macros

Creating Macros

Defining Macros

Using Macros

Macro Features

Macro Examples

Error Handling

Variable Errors

Test Suite Errors

Default Layer Errors

Macro Errors

Related Documentation