
Production Deployment

This guide covers deploying msh in production environments using Docker, CI/CD, and orchestration platforms.

Containerization

Dockerfile

Create a Dockerfile in your project root:

FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Install msh and dependencies
RUN pip install --no-cache-dir msh-cli dbt-core dbt-postgres dlt

# Copy project files
COPY models/ ./models/
COPY .env.production .env

# Run msh
CMD ["msh", "run"]

Build and Run

# Build the image
docker build -t msh-pipeline:latest .

# Run locally
docker run --env-file .env.production msh-pipeline:latest

# Run with volume mount for development
docker run -v $(pwd)/models:/app/models msh-pipeline:latest
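
To run the same image from CI or Kubernetes, you will usually push it to a container registry first. A minimal sketch (registry.example.com and the data/ path are placeholders for your registry):

# Tag and push the image to your container registry
docker tag msh-pipeline:latest registry.example.com/data/msh-pipeline:latest
docker push registry.example.com/data/msh-pipeline:latest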

CI/CD Integration

GitHub Actions

msh provides a command to generate a GitHub Actions workflow:

msh generate github

This creates .github/workflows/msh-deploy.yml:

name: msh Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install msh
        run: |
          pip install msh-cli dbt-core dbt-snowflake dlt

      - name: Run msh doctor
        run: msh doctor
        env:
          DESTINATION__SNOWFLAKE__CREDENTIALS__DATABASE: ${{ secrets.SNOWFLAKE_DATABASE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__HOST: ${{ secrets.SNOWFLAKE_HOST }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__WAREHOUSE: ${{ secrets.SNOWFLAKE_WAREHOUSE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__ROLE: ${{ secrets.SNOWFLAKE_ROLE }}

      - name: Run msh plan (PR only)
        if: github.event_name == 'pull_request'
        run: msh plan
        env:
          DESTINATION__SNOWFLAKE__CREDENTIALS__DATABASE: ${{ secrets.SNOWFLAKE_DATABASE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__HOST: ${{ secrets.SNOWFLAKE_HOST }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__WAREHOUSE: ${{ secrets.SNOWFLAKE_WAREHOUSE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__ROLE: ${{ secrets.SNOWFLAKE_ROLE }}

      - name: Run msh deploy (main only)
        if: github.ref == 'refs/heads/main'
        run: msh run
        env:
          DESTINATION__SNOWFLAKE__CREDENTIALS__DATABASE: ${{ secrets.SNOWFLAKE_DATABASE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__HOST: ${{ secrets.SNOWFLAKE_HOST }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__WAREHOUSE: ${{ secrets.SNOWFLAKE_WAREHOUSE }}
          DESTINATION__SNOWFLAKE__CREDENTIALS__ROLE: ${{ secrets.SNOWFLAKE_ROLE }}
          STRIPE_API_KEY: ${{ secrets.STRIPE_API_KEY }}
          SALESFORCE_USERNAME: ${{ secrets.SALESFORCE_USERNAME }}
          SALESFORCE_PASSWORD: ${{ secrets.SALESFORCE_PASSWORD }}

Setting Up GitHub Secrets

  1. Navigate to your repository → Settings → Secrets and variables → Actions
  2. Click New repository secret
  3. Add each environment variable (or script it with the GitHub CLI, as shown after the lists below):

Destination Credentials (Snowflake example):

  • SNOWFLAKE_DATABASE
  • SNOWFLAKE_USERNAME
  • SNOWFLAKE_PASSWORD
  • SNOWFLAKE_HOST
  • SNOWFLAKE_WAREHOUSE
  • SNOWFLAKE_ROLE

Source Credentials:

  • STRIPE_API_KEY
  • SALESFORCE_USERNAME
  • SALESFORCE_PASSWORD
  • SALESFORCE_SECURITY_TOKEN
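
If you prefer the command line, the same secrets can be created with the GitHub CLI instead of the web UI. A sketch, assuming gh is installed and authenticated against the repository (the values shown are placeholders):

# Create repository secrets from your shell (example values only)
gh secret set SNOWFLAKE_DATABASE --body "ANALYTICS_PROD"
gh secret set SNOWFLAKE_USERNAME --body "MSH_PROD_USER"
gh secret set SNOWFLAKE_PASSWORD --body "$SNOWFLAKE_PASSWORD"   # read from your local environment
gh secret set STRIPE_API_KEY --body "$STRIPE_API_KEY"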

GitLab CI

Create .gitlab-ci.yml:

stages:
  - validate
  - deploy

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

cache:
  paths:
    - .cache/pip

validate:
  stage: validate
  image: python:3.11-slim
  script:
    - pip install msh-cli dbt-core dbt-postgres dlt
    - msh doctor
    - msh plan
  only:
    - merge_requests

deploy:
  stage: deploy
  image: python:3.11-slim
  script:
    - pip install msh-cli dbt-core dbt-postgres dlt
    - msh run
  only:
    - main
  environment:
    name: production

Add variables in Settings → CI/CD → Variables.
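
Project variables can also be created programmatically through the GitLab API, which is convenient when bootstrapping several environments. A sketch, assuming a personal access token with API scope and your numeric project ID (both placeholders):

# Create a masked CI/CD variable via the GitLab API
curl --request POST \
  --header "PRIVATE-TOKEN: <your-access-token>" \
  --form "key=SNOWFLAKE_PASSWORD" \
  --form "value=<secret-value>" \
  --form "masked=true" \
  "https://gitlab.com/api/v4/projects/<project-id>/variables"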


Multi-Environment Strategy

Environment-Specific Configuration

Create separate .env files for each environment:

.env.dev
.env.staging
.env.production

.env.dev:

DESTINATION__POSTGRES__CREDENTIALS="postgresql://user:pass@localhost:5432/analytics_dev"

.env.staging:

DESTINATION__POSTGRES__CREDENTIALS="postgresql://user:pass@staging-db:5432/analytics_staging"

.env.production:

DESTINATION__SNOWFLAKE__CREDENTIALS__DATABASE="ANALYTICS_PROD"
DESTINATION__SNOWFLAKE__CREDENTIALS__USERNAME="MSH_PROD_USER"
# ... other Snowflake credentials

Using --env Flag

Run msh with a specific environment:

# Development
msh run --env dev

# Staging
msh run --env staging

# Production
msh run --env prod

This loads the corresponding .env.<environment> file.
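
Conceptually, this is similar to exporting the variables from the matching dotenv file before invoking msh. A rough shell equivalent for the production file (a sketch of the idea only; msh's actual loading logic may differ):

# Roughly what `msh run --env production` amounts to
set -a                      # export every variable assigned from here on
source .env.production      # load the environment-specific settings
set +a
msh run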


Orchestration with Airflow

Airflow DAG

Create dags/msh_pipeline.py:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'msh_pipeline',
    default_args=default_args,
    description='Run msh data pipeline',
    schedule_interval='0 2 * * *',  # Daily at 2 AM
    catchup=False,
)

# Health check
doctor = BashOperator(
    task_id='msh_doctor',
    bash_command='cd /opt/msh && msh doctor',
    dag=dag,
)

# Run pipeline
run = BashOperator(
    task_id='msh_run',
    bash_command='cd /opt/msh && msh run --env prod',
    dag=dag,
)

# Verify deployment
verify = BashOperator(
    task_id='verify_deployment',
    bash_command='cd /opt/msh && msh ui --verify',
    dag=dag,
)

doctor >> run >> verify

Environment Variables in Airflow

Set environment variables in Airflow:

  1. Go to the Airflow UI → Admin → Variables
  2. Add each credential as a variable
  3. Reference in your DAG using Variable.get('SNOWFLAKE_PASSWORD')

Or use Connections for database credentials.
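
Variables and connections can also be created from the Airflow CLI, which is easier to script than the UI. A sketch, assuming Airflow 2.x and placeholder credential values:

# Store a credential as an Airflow Variable
airflow variables set SNOWFLAKE_PASSWORD "$SNOWFLAKE_PASSWORD"

# Or register a database connection (user, password, host, and database are placeholders)
airflow connections add 'msh_warehouse' \
  --conn-uri 'postgres://msh_user:secret@db.internal:5432/analytics'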


Kubernetes Deployment

Kubernetes Manifests

k8s/configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: msh-config
data:
  MSH_ENV: "production"

k8s/secret.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: msh-secrets
type: Opaque
stringData:
  DESTINATION__SNOWFLAKE__CREDENTIALS__DATABASE: "ANALYTICS_PROD"
  DESTINATION__SNOWFLAKE__CREDENTIALS__USERNAME: "MSH_USER"
  DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD: "secure_password"
  DESTINATION__SNOWFLAKE__CREDENTIALS__HOST: "abc123.snowflakecomputing.com"
  DESTINATION__SNOWFLAKE__CREDENTIALS__WAREHOUSE: "COMPUTE_WH"
  DESTINATION__SNOWFLAKE__CREDENTIALS__ROLE: "TRANSFORMER"
  STRIPE_API_KEY: "sk_live_..."

k8s/cronjob.yaml:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: msh-pipeline
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: msh
              image: msh-pipeline:latest
              command: ["msh", "run", "--env", "prod"]
              envFrom:
                - configMapRef:
                    name: msh-config
                - secretRef:
                    name: msh-secrets
          restartPolicy: OnFailure

Deploy to Kubernetes

kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secret.yaml
kubectl apply -f k8s/cronjob.yaml
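
Rather than committing secret.yaml with plaintext credentials, the same Secret can be generated directly from your environment file. A sketch (the file name matches the .env.production used elsewhere in this guide):

# Create the Secret from the production dotenv file
kubectl create secret generic msh-secrets --from-env-file=.env.production

# Or render it to YAML without applying, for review
kubectl create secret generic msh-secrets --from-env-file=.env.production \
  --dry-run=client -o yaml > k8s/secret.yaml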

Monitoring and Alerting

Health Checks

Add a health check step to your deployment:

# In your CI/CD or cron job
msh doctor || exit 1

Logging

msh logs to stdout by default. Capture logs in your orchestrator:

Docker (assuming the container was started with --name msh-pipeline):

docker logs msh-pipeline > /var/log/msh/pipeline.log

Kubernetes (a CronJob creates a new Job for each scheduled run, so read logs from the Job rather than the CronJob itself):

kubectl logs -f job/<job-name>
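
Since each run creates a Job with a generated name, a small sketch like this tails the most recent one (assumes the Job names start with msh-pipeline):

# Find the newest Job spawned by the CronJob and follow its logs
JOB=$(kubectl get jobs --sort-by=.metadata.creationTimestamp -o name | grep msh-pipeline | tail -n 1)
kubectl logs -f "$JOB"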

Alerting

Set up alerts for:

  1. Pipeline Failures: Monitor exit codes from msh run
  2. State Drift: Check msh_state_history for failed deployments
  3. Performance: Track execution time (see the timing sketch below)

Example: Slack Notification on Failure

#!/bin/bash
msh run --env prod
if [ $? -ne 0 ]; then
  curl -X POST -H 'Content-type: application/json' \
    --data '{"text":"msh pipeline failed!"}' \
    "$SLACK_WEBHOOK_URL"
fi
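
To cover execution time (item 3 above), the same kind of wrapper can record how long the run took so your monitoring can alert on regressions. A minimal sketch:

#!/bin/bash
start=$(date +%s)
msh run --env prod
status=$?
duration=$(( $(date +%s) - start ))
echo "msh run exited with code ${status} after ${duration}s"
exit "$status"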

Performance Tuning

Parallel Execution

For large projects, enable parallel execution:

msh run --threads 4

Resource Limits

In Kubernetes, set resource limits:

resources:
  requests:
    memory: "2Gi"
    cpu: "1000m"
  limits:
    memory: "4Gi"
    cpu: "2000m"

Incremental Runs

Use incremental execution for large datasets:

# In your .msh file
execution: incremental
incremental:
  strategy: merge
  primary_key: id

Security Best Practices

  1. Never Commit Secrets: Use .gitignore to exclude .env files
  2. Use Secret Management: Store secrets in GitHub Secrets, AWS Secrets Manager, or HashiCorp Vault (see the sketch after this list)
  3. Least Privilege: Grant database users only necessary permissions
  4. Rotate Credentials: Regularly rotate API keys and database passwords
  5. Audit Logs: Enable audit logging in your destination database
  6. Network Security: Use VPCs and private endpoints for database connections
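
For item 2, secrets can be pulled from a manager at runtime instead of living in .env files at all. A sketch using AWS Secrets Manager (the secret name msh/prod/snowflake-password is a placeholder; requires the AWS CLI and appropriate IAM permissions):

# Fetch the Snowflake password just before running the pipeline
export DESTINATION__SNOWFLAKE__CREDENTIALS__PASSWORD="$(aws secretsmanager get-secret-value \
  --secret-id msh/prod/snowflake-password \
  --query SecretString --output text)"
msh run --env prod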

Rollback Strategy

If a deployment fails in production:

# Automatic rollback on failure
msh run --env prod --auto-rollback

# Manual rollback
msh rollback --env prod

Example: Complete Production Setup

Directory Structure:

my-msh-project/
├── .github/
│   └── workflows/
│       └── msh-deploy.yml
├── k8s/
│   ├── configmap.yaml
│   ├── secret.yaml
│   └── cronjob.yaml
├── models/
│   ├── customers.msh
│   └── revenue.msh
├── .env.dev
├── .env.staging
├── .env.production
├── Dockerfile
└── .gitignore

Deployment Flow:

  1. Developer opens a pull request from a feature branch → GitHub Actions runs msh plan
  2. PR is merged to main → GitHub Actions runs msh run --env staging
  3. Manual approval → Kubernetes CronJob runs msh run --env prod daily
  4. On failure → Slack alert sent, automatic rollback triggered

Next Steps