Metadata Cache

msh maintains a metadata cache in the .msh/ directory for fast AI operations and project analysis.

Location

The metadata cache is stored in the .msh/ directory at the project root:

my-project/
├── .msh/
│   ├── manifest.json   # Compiled manifest of all assets
│   ├── lineage.json    # Lineage graph (edges between assets)
│   ├── schemas.json    # Flattened view of schemas per asset
│   ├── tests.json      # Test definitions and latest statuses
│   ├── versions.json   # Deployment versions/history
│   └── glossary.json   # Cached glossary file
├── msh.yaml
└── models/

Cache Files

manifest.json

Compiled manifest of all assets in the project.

Structure:

{
  "project": {
    "id": "my-project",
    "name": "My Project",
    "warehouse": "postgres",
    "default_schema": "public"
  },
  "assets": [
    {
      "id": "revenue",
      "path": "assets/revenue.msh",
      "blocks": {...},
      "schema": {...}
    }
  ]
}

Purpose:

  • Fast AI operations (no need to re-parse all files)
  • Project-level analysis
  • Asset discovery
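
As an illustration of asset discovery, the sketch below reads the manifest and lists each asset's id and path. It assumes the structure shown above; the helper names list_assets and load_manifest are hypothetical and not part of msh.

```python
import json
from pathlib import Path


def list_assets(manifest: dict) -> list[tuple[str, str]]:
    """Return (id, path) pairs for every asset in a parsed manifest."""
    return [(asset["id"], asset["path"]) for asset in manifest.get("assets", [])]


def load_manifest(project_root: str = ".") -> dict:
    """Read .msh/manifest.json relative to the project root."""
    manifest_path = Path(project_root) / ".msh" / "manifest.json"
    with open(manifest_path, encoding="utf-8") as handle:
        return json.load(handle)
```

With a generated cache in place, list_assets(load_manifest()) would return the pairs without re-parsing any asset file.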

lineage.json

Lineage graph showing dependencies between assets.

Structure:

{
  "edges": [
    {
      "from": "stg_orders",
      "to": "revenue",
      "type": "transform"
    }
  ],
  "nodes": ["stg_orders", "revenue"]
}

Purpose:

  • Dependency analysis
  • Impact analysis
  • DAG visualization
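
Impact analysis can be sketched as a breadth-first walk over the edges. The example assumes "from" is the upstream asset and "to" the downstream one, as the stg_orders to revenue edge above suggests; the downstream function is a hypothetical helper, not an msh API.

```python
from collections import defaultdict, deque


def downstream(lineage: dict, asset_id: str) -> list[str]:
    """Return every asset reachable from asset_id, in breadth-first order."""
    # Index edges by their upstream asset (assumed to be the "from" field).
    children = defaultdict(list)
    for edge in lineage.get("edges", []):
        children[edge["from"]].append(edge["to"])

    seen = {asset_id}
    order = []
    queue = deque([asset_id])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:  # guards against revisiting shared descendants
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Passing the parsed contents of lineage.json and an asset id returns the assets that a change to that asset could affect.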

schemas.json

Flattened view of schemas per asset.

Structure:

{
  "revenue": {
    "columns": [
      {"name": "customer_id", "type": "integer"},
      {"name": "month", "type": "date"},
      {"name": "monthly_revenue", "type": "decimal"}
    ]
  }
}

Purpose:

  • Schema analysis
  • Type checking
  • Column discovery
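
Column discovery might look like the following sketch, which finds every asset that exposes a given column and reports its type. It assumes the structure shown above; assets_with_column is a hypothetical helper.

```python
def assets_with_column(schemas: dict, column_name: str) -> dict[str, str]:
    """Map asset id -> column type for each asset that has column_name."""
    matches = {}
    for asset_id, schema in schemas.items():
        for column in schema.get("columns", []):
            if column["name"] == column_name:
                matches[asset_id] = column["type"]
    return matches
```

Comparing the returned types across assets is one simple way to spot a column declared inconsistently, for example integer in one asset and text in another.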

tests.json

Test definitions and latest statuses.

Structure:

{
  "revenue": {
    "tests": [
      {"type": "unique", "columns": ["customer_id", "month"]},
      {"type": "assert", "sql": "monthly_revenue > 0"}
    ],
    "last_run": "2024-01-15T10:00:00Z",
    "status": "passed"
  }
}

Purpose:

  • Test analysis
  • Quality monitoring
  • Test coverage
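
A test coverage check could be sketched as below. It takes a list of asset ids (for example, from manifest.json) plus the parsed tests.json, and separates assets with no tests from assets whose latest status is anything other than "passed". Only the "passed" value appears in the example above, so treating every other status as not passing is an assumption, and coverage_report is a hypothetical helper.

```python
def coverage_report(asset_ids: list[str], tests: dict) -> dict[str, list[str]]:
    """Group assets into those without tests and those not currently passing."""
    untested, not_passing = [], []
    for asset_id in asset_ids:
        entry = tests.get(asset_id, {})
        if not entry.get("tests"):
            # No entry at all, or an entry with an empty test list.
            untested.append(asset_id)
        elif entry.get("status") != "passed":
            # Assumption: any status other than "passed" needs attention.
            not_passing.append(asset_id)
    return {"untested": untested, "not_passing": not_passing}
```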

versions.json

Deployment versions and history.

Structure:

{
  "revenue": [
    {
      "version": "a1b2c3d4",
      "deployed_at": "2024-01-15T10:00:00Z",
      "status": "active"
    }
  ]
}

Purpose:

  • Version tracking
  • Deployment history
  • Rollback analysis
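
For rollback analysis, one approach is to pick the most recent entry deployed before the currently active one. The sketch below assumes the per-asset list shown above and does not rely on list order; it sorts by deployed_at instead. The rollback_candidate helper is hypothetical, and so is any status value other than "active".

```python
from datetime import datetime
from typing import Optional


def parse_ts(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def rollback_candidate(history: list[dict]) -> Optional[dict]:
    """Return the latest entry deployed before the active version, or None."""
    active = [entry for entry in history if entry.get("status") == "active"]
    if not active:
        return None
    current = max(active, key=lambda entry: parse_ts(entry["deployed_at"]))
    cutoff = parse_ts(current["deployed_at"])
    earlier = [entry for entry in history if parse_ts(entry["deployed_at"]) < cutoff]
    return max(earlier, key=lambda entry: parse_ts(entry["deployed_at"]), default=None)
```

Calling rollback_candidate on the list stored under an asset id in versions.json returns the entry to roll back to, or None when the history has no earlier deployment.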

glossary.json

Cached glossary file.

Structure:

{
  "terms": [
    {
      "id": "term.customer",
      "name": "Customer",
      "description": "A customer entity"
    }
  ],
  "metrics": [...],
  "dimensions": [...],
  "policies": [...]
}

Purpose:

  • Business glossary
  • Term linking
  • Policy enforcement
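
A term lookup over the cached glossary might be sketched as follows. Only the "terms" list is used, because the shapes of "metrics", "dimensions", and "policies" are elided above. The find_terms helper is hypothetical.

```python
def find_terms(glossary: dict, query: str) -> list[dict]:
    """Return terms whose name or description contains query, ignoring case."""
    needle = query.lower()
    return [
        term
        for term in glossary.get("terms", [])
        if needle in term.get("name", "").lower()
        or needle in term.get("description", "").lower()
    ]
```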

Cache Generation

Initial Generation

Generate the cache with msh manifest:

msh manifest

This creates all cache files in the .msh/ directory.

Incremental Updates

Cache is automatically updated when:

  • Assets are modified
  • Glossary is updated
  • Tests are run

You can also update the cache manually:

msh manifest --update

Cache Invalidation

Cache is invalidated when:

  • Asset files are modified
  • msh.yaml is changed
  • Glossary is updated

Manual invalidation:

rm -rf .msh/
msh manifest
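
Before regenerating, it can help to see which source files changed after the cache was last written. The sketch below compares file modification times against .msh/manifest.json. It is a user-side check and not a description of how msh detects changes; the stale_sources helper and the default patterns (msh.yaml and everything under models/) are assumptions based on the project layout shown earlier.

```python
from pathlib import Path
from typing import Optional


def stale_sources(
    project_root,
    patterns: tuple[str, ...] = ("msh.yaml", "models/**/*"),
) -> Optional[list[str]]:
    """List source files modified after the manifest was written.

    Returns None when no manifest exists yet, meaning there is no cache.
    """
    root = Path(project_root)
    manifest = root / ".msh" / "manifest.json"
    if not manifest.exists():
        return None
    cache_mtime = manifest.stat().st_mtime

    stale = []
    for pattern in patterns:
        for path in root.glob(pattern):
            # glob also yields directories; only files are compared.
            if path.is_file() and path.stat().st_mtime > cache_mtime:
                stale.append(path.relative_to(root).as_posix())
    return sorted(stale)
```

An empty list means no matching file is newer than the manifest; None means the cache has not been generated.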

Benefits

Performance

  • Fast AI operations: No need to re-parse all files
  • Incremental updates: Only changed files are re-parsed
  • Shared cache: CLI and engine share the same cache

Consistency

  • Single source of truth: All metadata in one place
  • Version controlled: Cache can be committed (optional)
  • Reproducible: Same cache = same results

AI Optimization

  • Token limits: Cache is structured to keep AI prompts within token limits
  • Relevant data: Only necessary metadata included
  • Structured format: Easy for AI to parse

Best Practices

Version Control

Option 1: Commit cache (recommended for teams)

git add .msh/
git commit -m "Update metadata cache"

Option 2: Ignore cache (regenerate on demand)

# .gitignore
.msh/

Cache Maintenance

  • Regular updates: Run msh manifest after changes
  • Clean cache: Remove .msh/ and rerun msh manifest if the cache is corrupted
  • Monitor size: Cache files can grow large in big projects
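
To monitor cache size, a small script can report the size of each JSON file in .msh/ and flag the large ones. This is a minimal sketch: cache_sizes and oversized are hypothetical helpers, and the 5 MB default threshold is an arbitrary example rather than an msh limit.

```python
from pathlib import Path


def cache_sizes(project_root=".") -> dict[str, int]:
    """Return the size in bytes of each JSON file in the .msh/ directory."""
    cache_dir = Path(project_root) / ".msh"
    return {path.name: path.stat().st_size for path in sorted(cache_dir.glob("*.json"))}


def oversized(sizes: dict[str, int], limit_bytes: int = 5 * 1024 * 1024) -> list[str]:
    """Return the names of cache files larger than limit_bytes."""
    return [name for name, size in sizes.items() if size > limit_bytes]
```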