Skip to main content

Freshness Command

Check if your data assets are up-to-date and meet freshness expectations. This command validates that data pipelines are running on schedule and alerts you to stale data.

Overview

The msh freshness command checks the last successful run time for each asset against configured freshness expectations. It queries the msh_state_history table (fast) rather than scanning raw data (slow), making it efficient for monitoring large datasets.

Usage

msh freshness

What it does:

  1. Reads freshness expectations from .msh files
  2. Queries msh_state_history table for last successful run timestamps
  3. Compares age against warn_after and error_after thresholds
  4. Displays a table with status indicators

Example Output

Data Freshness
┌──────────────┬──────────────┬────────┐
│ Asset │ Last Run │ Status │
├──────────────┼──────────────┼────────┤
│ customers │ 2.3h ago │ OK │
│ orders │ 15.2h ago │ WARN │
│ revenue │ 50.1h ago │ ERROR │
└──────────────┴──────────────┴────────┘

Status Indicators:

  • OK (green): Data is fresh, within warn_after threshold
  • WARN (yellow): Data is stale, exceeds warn_after but within error_after
  • ERROR (red): Data is very stale, exceeds error_after threshold
  • MISSING: Asset has never been successfully deployed

Configuration: Freshness Block

Add a freshness block to your .msh file to set expectations:

name: customers
ingest:
type: rest_api
endpoint: https://api.example.com/customers

freshness:
warn_after: 12h # Warn if data is older than 12 hours
error_after: 24h # Error if data is older than 24 hours

transform: |
SELECT * FROM {{ source }}

Freshness Fields

  • warn_after (optional): Duration before warning (default: 24h)

    • Format: 12h, 30m, 2d, 3600s
    • Examples: 12h, 1d, 30m, 7200s
  • error_after (optional): Duration before error (default: 48h)

    • Format: Same as warn_after
    • Must be greater than warn_after

Duration Format

Supported time units:

  • s - seconds
  • m - minutes
  • h - hours
  • d - days

Examples:

freshness:
warn_after: 6h # 6 hours
error_after: 12h # 12 hours

freshness:
warn_after: 30m # 30 minutes
error_after: 2h # 2 hours

freshness:
warn_after: 1d # 1 day
error_after: 3d # 3 days

How It Works: The Fast Query Trick

Why it's fast: Instead of scanning raw data tables (which can be terabytes), msh freshness queries the lightweight msh_state_history metadata table. This table stores:

  • Asset name
  • Deployment hash
  • Timestamp of last successful run

Query Pattern:

SELECT timestamp FROM msh_state_history 
WHERE asset = 'customers'
ORDER BY timestamp DESC
LIMIT 1

This makes freshness checks instant even for large datasets.

Examples

Example 1: High-Frequency Data

For data that should update every few hours:

name: real_time_metrics
ingest:
type: rest_api
endpoint: https://api.example.com/metrics

freshness:
warn_after: 2h # Warn after 2 hours
error_after: 6h # Error after 6 hours

transform: |
SELECT * FROM {{ source }}

Example 2: Daily Batch Data

For daily ETL jobs:

name: daily_sales
ingest:
type: sql_database
credentials: ${SOURCE_DB}
table: sales

freshness:
warn_after: 25h # Warn if not updated within 25 hours
error_after: 48h # Error if not updated within 48 hours

transform: |
SELECT * FROM {{ source }}

Example 3: Weekly Reports

For weekly aggregations:

name: weekly_summary
ingest:
type: rest_api
endpoint: https://api.example.com/weekly

freshness:
warn_after: 8d # Warn after 8 days
error_after: 10d # Error after 10 days

transform: |
SELECT * FROM {{ source }}

Integration with Monitoring

CI/CD Pipelines

Use msh freshness in CI/CD to fail builds if data is stale:

# GitHub Actions example
- name: Check Data Freshness
run: msh freshness
# Exit code 1 if any asset exceeds error_after threshold

Scheduled Monitoring

Set up a cron job to check freshness regularly:

# Check freshness every hour
0 * * * * cd /path/to/project && msh freshness

Alerting

Combine with alerting tools:

#!/bin/bash
msh freshness
if [ $? -ne 0 ]; then
# Send alert (email, Slack, PagerDuty, etc.)
send_alert "Data freshness check failed"
fi

Troubleshooting

"MISSING" Status

If an asset shows MISSING:

│ customers    │ Never         │ MISSING │

Causes:

  • Asset has never been successfully deployed
  • Asset name mismatch between .msh file and deployment history

Solution:

# Deploy the asset
msh run customers

# Verify deployment
msh status

Assets Not Showing

If assets don't appear in the freshness check:

Cause: Missing freshness block in .msh file

Solution: Add freshness configuration to your .msh file:

freshness:
warn_after: 24h
error_after: 48h

Incorrect Timestamps

If timestamps seem wrong:

Cause: Timezone differences or clock skew

Solution: Ensure your database server and local machine have synchronized clocks.

Best Practices

  1. Set Realistic Thresholds: Base warn_after and error_after on your actual update frequency
  2. Monitor in Production: Add freshness checks to production monitoring dashboards
  3. Alert on Errors: Set up alerts for ERROR status to catch pipeline failures quickly
  4. Document Expectations: Document freshness requirements in your data catalog
  • msh status: View active versions (doesn't check freshness)
  • msh versions: View deployment history for an asset
  • msh run: Deploy assets (updates freshness timestamps)

See Also