msh.yaml Configuration Reference
The msh.yaml file is the central configuration file for your msh project. It allows you to define data sources once and reference them from individual .msh files, following the DRY (Don't Repeat Yourself) principle.
File Location
Place msh.yaml in the root of your msh project:
my-project/
├── msh.yaml # Project configuration
├── .env # Environment variables
└── models/ # Your .msh files
└── ...
Sources Section
The sources section defines data sources that can be referenced from .msh files. This eliminates credential repetition across hundreds of files.
Basic Structure
sources:
- name: source_name
type: source_type
credentials: "${ENV_VAR}"
# Additional type-specific fields
SQL Database Sources
Define SQL database connections once and reference tables from multiple .msh files.
Basic SQL Database Source
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders
- name: customers
- name: products
Using Environment Variables
Always use environment variables for credentials:
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}" # Resolved from .env
schema: public
tables:
- name: orders
In .env file:
DB_PROD_CREDENTIALS="postgresql://user:pass@prod-db.com/sales"
Schema and Table Resolution
When referencing a source in a .msh file, you can specify the table in two ways:
Option 1: Simple table name (uses source's schema)
# msh.yaml
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders
# models/stg_orders.msh
ingest:
source: prod_db
table: orders # Resolves to public.orders
Option 2: Schema-qualified table name
# models/stg_orders.msh
ingest:
source: prod_db
table: public.orders # Explicit schema.table format
Multiple Schemas
You can define multiple sources for different schemas:
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders
- name: customers
- name: analytics_db
type: sql_database
credentials: "${DB_ANALYTICS_CREDENTIALS}"
schema: analytics
tables:
- name: events
- name: sessions
REST API Sources
Define REST API endpoints once and reference resources from multiple .msh files.
Basic REST API Source
sources:
- name: jsonplaceholder
type: rest_api
endpoint: "https://jsonplaceholder.typicode.com"
resources:
- name: users
- name: posts
- name: comments
REST API with Authentication
sources:
- name: github_api
type: rest_api
endpoint: "https://api.github.com"
credentials:
token: "${GITHUB_TOKEN}"
resources:
- name: repos
- name: issues
- name: pull_requests
In .env file:
GITHUB_TOKEN="ghp_your_token_here"
Resource Resolution
When referencing a REST API source in a .msh file:
# msh.yaml
sources:
- name: jsonplaceholder
type: rest_api
endpoint: "https://jsonplaceholder.typicode.com"
resources:
- name: users
# models/stg_users.msh
ingest:
source: jsonplaceholder
resource: users # References the 'users' resource
Environment Variable Syntax
msh supports ${VAR_NAME} syntax for environment variable substitution:
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: "${DB_SCHEMA}" # Can use env vars for any field
Environment variables are resolved:
- From
.envfile in project root - From system environment variables
- Recursively (nested
${VAR}references are supported)
Layered Project Example
A complete example showing staging → intermediate → marts layers:
# msh.yaml
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders
- name: customers
- name: products
- name: stripe_api
type: rest_api
endpoint: "https://api.stripe.com"
credentials:
api_key: "${STRIPE_API_KEY}"
resources:
- name: charges
- name: customers
Staging layer:
# models/staging/stg_orders.msh
name: stg_orders
ingest:
source: prod_db
table: orders
transform: |
SELECT * FROM {{ source }}
Intermediate layer:
# models/intermediate/int_order_customer.msh
name: int_order_customer
transform: |
SELECT
o.*,
c.name as customer_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.id
Marts layer:
# models/marts/fct_orders.msh
name: fct_orders
transform: |
SELECT * FROM {{ ref('int_order_customer') }}
Backward Compatibility
Direct credentials still work! You can mix source references and direct credentials in the same project:
# ✅ Using source reference
# models/stg_orders.msh
ingest:
source: prod_db
table: orders
# ✅ Using direct credentials (still works)
# models/stg_legacy.msh
ingest:
type: sql_database
credentials: "${LEGACY_DB_URI}"
table: legacy_table
This allows gradual migration:
- Start with existing files using direct credentials
- Add sources to
msh.yaml - Migrate
.mshfiles one at a time to use source references - Both styles can coexist
Error Handling
msh provides clear error messages for source resolution issues:
Source Not Found
[ERROR] Source 'prod_db' not found in msh.yaml.
Available sources: staging_db, analytics_db
Fix: Check the source name in msh.yaml matches what you're referencing.
Table/Resource Not Found
[ERROR] Table 'orders' not found in source 'prod_db'.
Available tables: customers, products
Fix: Add the table to the source's tables list in msh.yaml.
Mixing Source Reference with Direct Credentials
[ERROR] Cannot specify both 'source' reference and direct 'type'/'credentials' in ingest block
Fix: Use either a source reference OR direct credentials, not both.
Missing Required Fields
[ERROR] Source reference requires either 'table' (for SQL sources) or 'resource' (for API sources)
Fix: Add table: for SQL database sources or resource: for REST API sources.
Best Practices
- Use Environment Variables: Never hardcode credentials in
msh.yaml - Organize by Source: Group related tables/resources under the same source
- Document Sources: Add comments explaining what each source represents
- Version Control: Commit
msh.yamlbut never commit.envfiles - Gradual Migration: Migrate from direct credentials to source references over time
Example with Comments
sources:
# Production PostgreSQL database
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders # Customer orders
- name: customers # Customer master data
- name: products # Product catalog
# Stripe API for payment data
- name: stripe_api
type: rest_api
endpoint: "https://api.stripe.com"
credentials:
api_key: "${STRIPE_API_KEY}"
resources:
- name: charges # Payment charges
- name: customers # Stripe customer records
Variables Section
The vars section defines variables that can be used in SQL transformations using {{ var("variable_name") }}. This eliminates hard-coded values that change per environment.
Basic Structure
vars:
variable_name: "value"
env_var_ref: "${ENV_VAR_NAME}"
Using Variables in SQL
Define in msh.yaml:
vars:
active_status: "active"
start_date: "${START_DATE}" # Environment variable
batch_size: 1000
Use in .msh files:
transform: |
SELECT * FROM {{ source }}
WHERE status = '{{ var("active_status") }}'
AND created_at > '{{ var("start_date") }}'
LIMIT {{ var("batch_size") }}
Environment Variable Support
Variables can reference environment variables using ${VAR_NAME} syntax:
vars:
start_date: "${START_DATE}" # From .env
end_date: "${END_DATE}" # From .env
active_status: "active" # Static value
In .env file:
START_DATE=2024-01-01
END_DATE=2024-12-31
Variable Resolution
Variables are resolved at compile time:
- Environment variables are resolved first
- Variables are then available in SQL via
{{ var("name") }} - Clear error if variable not found
Test Suites Section
The test_suites section defines reusable test suites that can be referenced from multiple .msh files, eliminating test repetition.
Basic Structure
test_suites:
suite_name:
- assert: "SQL assertion"
- unique: column_name
- not_null: column_name
Defining Test Suites
test_suites:
staging:
- assert: "id IS NOT NULL"
- assert: "created_at IS NOT NULL"
- unique: id
financial:
- assert: "amount > 0"
- assert: "amount IS NOT NULL"
quality:
- not_null: [id, email, created_at]
Using Test Suites
Reference test suites in .msh files:
# models/staging/stg_orders.msh
name: stg_orders
test_suites:
- staging
- financial
# Automatically expands to all tests from both suites
Combining Suites with Individual Tests
You can combine test suites with individual tests:
# models/staging/stg_orders.msh
name: stg_orders
test_suites:
- staging
- financial
tests:
- assert: "order_date >= '2024-01-01'" # Additional test
# All tests are merged together
Test Suite Benefits
- ✅ Define tests once, reuse everywhere
- ✅ Consistent quality checks across similar models
- ✅ Easy to update tests in one place
- ✅ Automatic deduplication of duplicate tests
Defaults Section
The defaults section defines default configuration values by layer (staging, marts, etc.), eliminating config repetition.
Basic Structure
defaults:
layer_name:
write_disposition: merge | replace | append
primary_key: id
tests:
- assert: "..."
test_suites:
- suite_name
Defining Layer Defaults
defaults:
staging:
write_disposition: merge
primary_key: id
test_suites:
- staging
marts:
write_disposition: replace
test_suites:
- quality
intermediate:
write_disposition: append
Using Layer Defaults
Specify layer: in .msh file or let it auto-detect from directory path:
Explicit layer:
# models/staging/stg_orders.msh
name: stg_orders
layer: staging # Inherits staging defaults
ingest:
source: prod_db
table: orders
# Automatically inherits: write_disposition, primary_key, test_suites
Auto-detected layer:
# models/staging/stg_orders.msh (in models/staging/ directory)
name: stg_orders
# Layer auto-detected as "staging" from path
ingest:
source: prod_db
table: orders
Layer Auto-Detection
msh automatically detects layers from directory paths:
models/staging/→stagingmodels/marts/→martsmodels/intermediate/→intermediate
Overriding Defaults
Asset values override defaults (deep merge):
# msh.yaml
defaults:
staging:
write_disposition: merge
primary_key: id
# models/staging/stg_orders.msh
name: stg_orders
layer: staging
primary_key: order_id # Overrides default
# Still inherits: write_disposition: merge
Complete Configuration Example
A complete msh.yaml example showing all features:
# msh.yaml
# Source definitions
sources:
- name: prod_db
type: sql_database
credentials: "${DB_PROD_CREDENTIALS}"
schema: public
tables:
- name: orders
- name: customers
- name: stripe_api
type: rest_api
endpoint: "https://api.stripe.com"
credentials:
api_key: "${STRIPE_API_KEY}"
resources:
- name: charges
- name: customers
# Variables
vars:
active_status: "active"
start_date: "${START_DATE}"
batch_size: 1000
# Test suites
test_suites:
staging:
- assert: "id IS NOT NULL"
- assert: "created_at IS NOT NULL"
- unique: id
financial:
- assert: "amount > 0"
# Config defaults
defaults:
staging:
write_disposition: merge
primary_key: id
test_suites:
- staging
marts:
write_disposition: replace
test_suites:
- quality
SQL Macros
SQL macros allow you to define reusable SQL functions/patterns in macros/ directory and use them across multiple .msh files.
Creating Macros
Create a macros/ directory in your project root:
my-project/
├── macros/
│ └── common.sql
├── msh.yaml
└── models/
Defining Macros
Define macros in .sql files using Jinja2 syntax:
-- macros/common.sql
{% macro clean_string(column_name) %}
TRIM(UPPER({{ column_name }}))
{% endmacro %}
{% macro format_date(column_name, format='YYYY-MM-DD') %}
DATE_FORMAT({{ column_name }}, '{{ format }}')
{% endmacro %}
{% macro calculate_age(birth_date) %}
DATEDIFF(YEAR, {{ birth_date }}, CURRENT_DATE)
{% endmacro %}
Using Macros
Macros are automatically loaded and available in all .msh files:
# models/staging/stg_orders.msh
name: stg_orders
transform: |
SELECT
id,
{{ clean_string('customer_name') }} as customer_name,
{{ format_date('created_at', 'YYYY-MM') }} as created_month,
{{ calculate_age('birth_date') }} as age
FROM {{ source }}
Macro Features
- ✅ Supports parameters (positional and keyword)
- ✅ Supports default parameter values
- ✅ Macros can call other macros
- ✅ Macros can use
var()function - ✅ Clear error if macro not found
Macro Examples
With parameters:
{% macro clean_string(column_name) %}
TRIM(UPPER({{ column_name }}))
{% endmacro %}
With default values:
{% macro format_date(column_name, format='YYYY-MM-DD') %}
DATE_FORMAT({{ column_name }}, '{{ format }}')
{% endmacro %}
Using variables:
{% macro filter_by_date(column_name) %}
{{ column_name }} > '{{ var("start_date") }}'
{% endmacro %}
Error Handling
Variable Errors
[ERROR] Variable 'active_status' not found in msh.yaml vars.
Available variables: start_date, end_date
Fix: Add the variable to vars: section in msh.yaml.
Test Suite Errors
[ERROR] Test suite 'staging' not found in msh.yaml.
Available suites: financial, quality
Fix: Check the suite name spelling or add it to test_suites: section.
Default Layer Errors
[ERROR] Layer 'staging' not found in defaults.
Available layers: marts, intermediate
Fix: Add the layer to defaults: section or check the layer: field in your .msh file.
Macro Errors
[ERROR] Error rendering macro 'clean_string': ...
Fix: Check macro syntax and ensure the macro file is in macros/ directory.
Related Documentation
- .msh File Format - How to reference sources in
.mshfiles - DRY Principles - Complete guide on eliminating repetition
- Universal Connectivity - Connecting to various data sources
- Troubleshooting - Common source resolution issues