
Production Deployment

This guide covers deploying msh in production environments using Docker, CI/CD, and orchestration platforms.

Containerization

Dockerfile

Create a Dockerfile in your project root:

FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Install msh and dependencies
RUN pip install --no-cache-dir msh-cli dbt-core dbt-postgres dlt

# Copy project files
COPY models/ ./models/
COPY .env.production .env

# Run msh
CMD ["msh", "run"]

Build and Run

# Build the image
docker build -t msh-pipeline:latest .

# Run locally
docker run --env-file .env.production msh-pipeline:latest

# Run with volume mount for development
docker run -v $(pwd)/models:/app/models msh-pipeline:latest
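
To run the same image from CI or Kubernetes, you will usually push it to a container registry first. A minimal sketch (registry.example.com and the data/ path are placeholders for your registry):

# Tag and push the image to your container registry
docker tag msh-pipeline:latest registry.example.com/data/msh-pipeline:latest
docker push registry.example.com/data/msh-pipeline:latest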

CI/CD Integration

GitHub Actions

msh provides a command to generate a GitHub Actions workflow:

msh generate github

This creates .github/workflows/msh-deploy.yml:

name: msh Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install msh
        run: |
          pip install msh-cli dbt-core dbt-snowflake dlt

      - name: Run msh doctor
        run: msh doctor
        env:
          DESTINATION__SNOWFLAKE__CREDENTIALS__DATABASE: ${{ secrets.SNOWFLAKE_DATABASE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__HOST: ${{ secrets.SNOWFLAKE_HOST }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__WAREHOUSE: ${{ secrets.SNOWFLAKE_WAREHOUSE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__ROLE: ${{ secrets.SNOWFLAKE_ROLE }}

      - name: Run msh plan (PR only)
        if: github.event_name == 'pull_request'
        run: msh plan
        env:
          DESTINATION__SNOWFLAKE__CREDENTIALS__DATABASE: ${{ secrets.SNOWFLAKE_DATABASE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__HOST: ${{ secrets.SNOWFLAKE_HOST }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__WAREHOUSE: ${{ secrets.SNOWFLAKE_WAREHOUSE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__ROLE: ${{ secrets.SNOWFLAKE_ROLE }}

      - name: Run msh deploy (main only)
        if: github.ref == 'refs/heads/main'
        run: msh run
        env:
          DESTINATION__SNOWFLAKE__CREDENTIALS__DATABASE: ${{ secrets.SNOWFLAKE_DATABASE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__HOST: ${{ secrets.SNOWFLAKE_HOST }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__WAREHOUSE: ${{ secrets.SNOWFLAKE_WAREHOUSE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__ROLE: ${{ secrets.SNOWFLAKE_ROLE }}
          STRIPE_API_KEY: ${{ secrets.STRIPE_API_KEY }}
          SALESFORCE_USERNAME: ${{ secrets.SALESFORCE_USERNAME }}
          SALESFORCE_PASSWORD: ${{ secrets.SALESFORCE_PASSWORD }}

Setting Up GitHub Secrets

  1. Navigate to your repository → Settings → Secrets and variables → Actions
  2. Click New repository secret
  3. Add each environment variable (or script it with the GitHub CLI, as shown after the lists below):

Destination Credentials (Snowflake example):

  • SNOWFLAKE_DATABASE
  • SNOWFLAKE_USERNAME
  • SNOWFLAKE_PASSWORD
  • SNOWFLAKE_HOST
  • SNOWFLAKE_WAREHOUSE
  • SNOWFLAKE_ROLE

Source Credentials:

  • STRIPE_API_KEY
  • SALESFORCE_USERNAME
  • SALESFORCE_PASSWORD
  • SALESFORCE_SECURITY_TOKEN
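
If you prefer the command line, the same secrets can be created with the GitHub CLI instead of the web UI. A sketch, assuming gh is installed and authenticated against the repository (the values shown are placeholders):

# Create repository secrets from your shell (example values only)
gh secret set SNOWFLAKE_DATABASE --body "ANALYTICS_PROD"
gh secret set SNOWFLAKE_USERNAME --body "MSH_PROD_USER"
gh secret set SNOWFLAKE_PASSWORD --body "$SNOWFLAKE_PASSWORD"   # read from your local environment
gh secret set STRIPE_API_KEY --body "$STRIPE_API_KEY"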

GitLab CI

Create .gitlab-ci.yml:

stages:
  - validate
  - deploy

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

cache:
  paths:
    - .cache/pip

validate:
  stage: validate
  image: python:3.11-slim
  script:
    - pip install msh-cli dbt-core dbt-postgres dlt
    - msh doctor
    - msh plan
  only:
    - merge_requests

deploy:
  stage: deploy
  image: python:3.11-slim
  script:
    - pip install msh-cli dbt-core dbt-postgres dlt
    - msh run
  only:
    - main
  environment:
    name: production

Add variables in Settings → CI/CD → Variables.
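
Project variables can also be created programmatically through the GitLab API, which is convenient when bootstrapping several environments. A sketch, assuming a personal access token with API scope and your numeric project ID (both placeholders):

# Create a masked CI/CD variable via the GitLab API
curl --request POST \
  --header "PRIVATE-TOKEN: <your-access-token>" \
  --form "key=SNOWFLAKE_PASSWORD" \
  --form "value=<secret-value>" \
  --form "masked=true" \
  "https://gitlab.com/api/v4/projects/<project-id>/variables"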


Multi-Environment Strategy

Environment-Specific Configuration

Create separate .env files for each environment:

.env.dev
.env.staging
.env.production

.env.dev:

DESTINATION__POSTGRES__CREDENTIALS="postgresql://user:pass@localhost:5432/analytics_dev"

.env.staging:

DESTINATION__POSTGRES__CREDENTIALS="postgresql://user:pass@staging-db:5432/analytics_staging"

.env.production:

DESTINATION__SNOWFLAKE__CREDENTIALS__DATABASE="ANALYTICS_PROD"
DESTINATION__SNOWFLAKE__CREDENTIALS__USERNAME="MSH_PROD_USER"
# ... other Snowflake credentials

Using --env Flag

Run msh with a specific environment:

# Development
msh run --env dev

# Staging
msh run --env staging

# Production
msh run --env prod

This loads the corresponding .env.<environment> file.
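
Conceptually, this is similar to exporting the variables from the matching dotenv file before invoking msh. A rough shell equivalent for the production file (a sketch of the idea only; msh's actual loading logic may differ):

# Roughly what `msh run --env production` amounts to
set -a                      # export every variable assigned from here on
source .env.production      # load the environment-specific settings
set +a
msh run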


Orchestration with Airflow

Airflow DAG

Create dags/msh_pipeline.py:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'msh_pipeline',
    default_args=default_args,
    description='Run msh data pipeline',
    schedule_interval='0 2 * * *',  # Daily at 2 AM
    catchup=False,
)

# Health check
doctor = BashOperator(
    task_id='msh_doctor',
    bash_command='cd /opt/msh && msh doctor',
    dag=dag,
)

# Run pipeline
run = BashOperator(
    task_id='msh_run',
    bash_command='cd /opt/msh && msh run --env prod',
    dag=dag,
)

# Verify deployment
verify = BashOperator(
    task_id='verify_deployment',
    bash_command='cd /opt/msh && msh ui --verify',
    dag=dag,
)

doctor >> run >> verify

Environment Variables in Airflow

Set environment variables in Airflow:

  1. Go to the Airflow UI → Admin → Variables
  2. Add each credential as a variable
  3. Reference in your DAG using Variable.get('SNOWFLAKE_PASSWORD')

Or use Connections for database credentials.
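
Variables and connections can also be created from the Airflow CLI, which is easier to script than the UI. A sketch, assuming Airflow 2.x and placeholder credential values:

# Store a credential as an Airflow Variable
airflow variables set SNOWFLAKE_PASSWORD "$SNOWFLAKE_PASSWORD"

# Or register a database connection (user, password, host, and database are placeholders)
airflow connections add 'msh_warehouse' \
  --conn-uri 'postgres://msh_user:secret@db.internal:5432/analytics'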


Kubernetes Deployment

Kubernetes Manifests

k8s/configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: msh-config
data:
  MSH_ENV: "production"

k8s/secret.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: msh-secrets
type: Opaque
stringData:
  DESTINATION__SNOWFLAKE__CREDENTIALS__DATABASE: "ANALYTICS_PROD"
  DESTINATION__SNOWFLAKE__CREDENTIALS__USERNAME: "MSH_USER"
  DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD: "secure_password"
  DESTINATION__SNOWFLAKE__CREDENTIALS__HOST: "abc123.snowflakecomputing.com"
  DESTINATION__SNOWFLAKE__CREDENTIALS__WAREHOUSE: "COMPUTE_WH"
  DESTINATION__SNOWFLAKE__CREDENTIALS__ROLE: "TRANSFORMER"
  STRIPE_API_KEY: "sk_live_..."

k8s/cronjob.yaml:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: msh-pipeline
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: msh
              image: msh-pipeline:latest
              command: ["msh", "run", "--env", "prod"]
              envFrom:
                - configMapRef:
                    name: msh-config
                - secretRef:
                    name: msh-secrets
          restartPolicy: OnFailure

Deploy to Kubernetes

kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secret.yaml
kubectl apply -f k8s/cronjob.yaml
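
Rather than committing secret.yaml with plaintext credentials, the same Secret can be generated directly from your environment file. A sketch (the file name matches the .env.production used elsewhere in this guide):

# Create the Secret from the production dotenv file
kubectl create secret generic msh-secrets --from-env-file=.env.production

# Or render it to YAML without applying, for review
kubectl create secret generic msh-secrets --from-env-file=.env.production \
  --dry-run=client -o yaml > k8s/secret.yaml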

Monitoring and Alerting

Health Checks

Add a health check step to your deployment:

# In your CI/CD or cron job
msh doctor || exit 1

Logging

msh logs to stdout by default. Capture logs in your orchestrator:

Docker (assuming the container was started with --name msh-pipeline):

docker logs msh-pipeline > /var/log/msh/pipeline.log

Kubernetes (a CronJob creates a new Job for each scheduled run, so read logs from the Job rather than the CronJob itself):

kubectl logs -f job/<job-name>
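
Since each run creates a Job with a generated name, a small sketch like this tails the most recent one (assumes the Job names start with msh-pipeline):

# Find the newest Job spawned by the CronJob and follow its logs
JOB=$(kubectl get jobs --sort-by=.metadata.creationTimestamp -o name | grep msh-pipeline | tail -n 1)
kubectl logs -f "$JOB"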

Alerting

Set up alerts for:

  1. Pipeline Failures: Monitor exit codes from msh run
  2. State Drift: Check msh_state_history for failed deployments
  3. Performance: Track execution time (see the timing sketch below)

Example: Slack Notification on Failure

#!/bin/bash
msh run --env prod
if [ $? -ne 0 ]; then
  curl -X POST -H 'Content-type: application/json' \
    --data '{"text":"msh pipeline failed!"}' \
    "$SLACK_WEBHOOK_URL"
fi
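
To cover execution time (item 3 above), the same kind of wrapper can record how long the run took so your monitoring can alert on regressions. A minimal sketch:

#!/bin/bash
start=$(date +%s)
msh run --env prod
status=$?
duration=$(( $(date +%s) - start ))
echo "msh run exited with code ${status} after ${duration}s"
exit "$status"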

Performance Tuning

Parallel Execution

For large projects, enable parallel execution:

msh run --threads 4

Resource Limits

In Kubernetes, set resource limits:

resources:
  requests:
    memory: "2Gi"
    cpu: "1000m"
  limits:
    memory: "4Gi"
    cpu: "2000m"

Incremental Runs

Use incremental execution for large datasets:

# In your .msh file
execution: incremental
incremental:
  strategy: merge
  primary_key: id

Security Best Practices

  1. Never Commit Secrets: Use .gitignore to exclude .env files
  2. Use Secret Management: Store secrets in GitHub Secrets, AWS Secrets Manager, or HashiCorp Vault (see the sketch after this list)
  3. Least Privilege: Grant database users only necessary permissions
  4. Rotate Credentials: Regularly rotate API keys and database passwords
  5. Audit Logs: Enable audit logging in your destination database
  6. Network Security: Use VPCs and private endpoints for database connections
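
For item 2, secrets can be pulled from a manager at runtime instead of living in .env files at all. A sketch using AWS Secrets Manager (the secret name msh/prod/snowflake-password is a placeholder; requires the AWS CLI and appropriate IAM permissions):

# Fetch the Snowflake password just before running the pipeline
export DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD="$(aws secretsmanager get-secret-value \
  --secret-id msh/prod/snowflake-password \
  --query SecretString --output text)"
msh run --env prod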

Rollback Strategy

If a deployment fails in production:

# Automatic rollback on failure
msh run --env prod --auto-rollback

# Manual rollback
msh rollback --env prod

Example: Complete Production Setup

Directory Structure:

my-msh-project/
├── .github/
│   └── workflows/
│       └── msh-deploy.yml
├── k8s/
│   ├── configmap.yaml
│   ├── secret.yaml
│   └── cronjob.yaml
├── models/
│   ├── customers.msh
│   └── revenue.msh
├── .env.dev
├── .env.staging
├── .env.production
├── Dockerfile
└── .gitignore

Deployment Flow:

  1. Developer opens a pull request from a feature branch → GitHub Actions runs msh plan
  2. PR is merged to main → GitHub Actions runs msh run --env staging
  3. Manual approval → Kubernetes CronJob runs msh run --env prod daily
  4. On failure → Slack alert sent, automatic rollback triggered

Next Steps