
Smart Ingest

Smart Ingest is msh's cost-optimization engine. Unlike traditional ELT tools that load everything and then filter it down, msh analyzes your SQL transformations to determine exactly which columns are needed from the source.

How It Works

  1. Parse: msh parses your .msh (SQL) files and collects every column referenced in SELECT lists, WHERE clauses, and function calls.
  2. Resolve: It traces each referenced column back to its definition in the source schema.
  3. Fetch: It instructs the ingestion engine (dlt) to fetch only those specific columns from the API or database.

This drastically reduces:

  • Egress costs from APIs.
  • Storage costs in your warehouse.
  • Compute time for processing.
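The three stages compose into a simple pipeline. The sketch below is a hypothetical illustration of that flow, with hardcoded stubs standing in for msh's internal compiler (the function names are illustrative, not msh's actual API):

```python
def parse(model_sql: str) -> dict:
    # Stage 1: in msh, a real SQL parser builds an AST from the .msh file.
    # Hardcoded here for brevity.
    return {"source": "salesforce.Lead", "columns": ["email", "status"]}

def resolve(deps: dict) -> dict:
    # Stage 2: map the parsed columns onto the source schema.
    return {"endpoint": deps["source"], "fields": deps["columns"]}

def fetch_spec_to_query(spec: dict) -> str:
    # Stage 3: emit the narrow query handed to the ingestion engine.
    return f"SELECT {', '.join(spec['fields'])} FROM {spec['endpoint']}"

query = fetch_spec_to_query(resolve(parse("SELECT email, status FROM source('salesforce', 'Lead')")))
# query == "SELECT email, status FROM salesforce.Lead"
```

Each stage only needs the output of the previous one, which is why the optimization requires no configuration from the user.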

The SQL Parsing Logic

msh uses a multi-stage compiler to extract column dependencies:

Stage 1: AST Parsing

Your .msh file is parsed into an Abstract Syntax Tree (AST):

-- models/leads.msh
SELECT
    email,
    status,
    UPPER(company_name) AS company
FROM source('salesforce', 'Lead')
WHERE status = 'New'

The compiler identifies:

  • Selected columns: email, status, company_name
  • Filter columns: status
  • Source reference: salesforce.Lead
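msh's compiler is internal, but the extraction step can be approximated in a few lines. The following is a stdlib-only toy sketch (a regex scan rather than a true AST walk; a real parser would distinguish functions like UPPER from columns structurally instead of via a keyword list):

```python
import re

SQL = """
SELECT
    email,
    status,
    UPPER(company_name) AS company
FROM source('salesforce', 'Lead')
WHERE status = 'New'
"""

# Toy keyword list; a real AST parser would not need this.
_KEYWORDS = {"select", "from", "where", "and", "or", "not", "upper"}

def _identifiers(fragment: str) -> set:
    fragment = re.sub(r"'[^']*'", "", fragment)                  # drop string literals
    fragment = re.sub(r"\bAS\s+\w+", "", fragment, flags=re.I)   # drop output aliases
    names = re.findall(r"[A-Za-z_]\w*", fragment)
    return {n for n in names if n.lower() not in _KEYWORDS}

def extract_dependencies(sql: str) -> dict:
    select_body = re.search(r"SELECT\s+(.*?)\s+FROM", sql, re.I | re.S).group(1)
    where = re.search(r"\bWHERE\s+(.*)", sql, re.I | re.S)
    return {
        "selected": _identifiers(select_body),
        "filters": _identifiers(where.group(1)) if where else set(),
    }

deps = extract_dependencies(SQL)
# deps["selected"] → {'email', 'status', 'company_name'}
# deps["filters"]  → {'status'}
```

Note that company_name is captured even though it only appears inside a function call, and the alias company is excluded because it exists only in the model's output, not in the source.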

Stage 2: Column Resolution

msh resolves these columns to the source schema and generates an optimized fetch specification.
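Conceptually, resolution checks each parsed column against the known source schema and emits a fetch spec. A minimal sketch, assuming a hypothetical in-memory schema (real schemas would come from the source's metadata):

```python
# Hypothetical schema; in practice this would be discovered from the source.
SOURCE_SCHEMA = {
    "salesforce.Lead": {"email", "status", "company_name", "phone", "industry"},
}

def resolve_columns(source: str, requested: set) -> dict:
    """Validate requested columns against the source schema and build a fetch spec."""
    available = SOURCE_SCHEMA[source]
    unknown = requested - available
    if unknown:
        raise ValueError(f"Columns not found in {source}: {sorted(unknown)}")
    return {"source": source, "fields": sorted(requested)}

spec = resolve_columns("salesforce.Lead", {"email", "status", "company_name"})
# spec["fields"] == ['company_name', 'email', 'status']
```

Validating here, before any data moves, means a typo in a column name fails at compile time rather than mid-pipeline.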

Stage 3: Optimized Query Generation

For API sources, msh generates a field list:

# Generated dlt configuration
salesforce_resource = salesforce.Lead(
    fields=['email', 'status', 'company_name']  # Only 3 of 500+ fields
)

For SQL sources (DB-to-DB), msh generates an optimized SELECT:

-- Instead of: SELECT * FROM legacy_db.leads
-- msh generates:
SELECT
    email,
    status,
    company_name
FROM legacy_db.leads
WHERE status = 'New'
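Generating that pushed-down query from a fetch spec is mechanical. A minimal sketch (function name and signature are illustrative, not msh's internals):

```python
def build_pushdown_query(table: str, columns: list, predicate: str = None) -> str:
    """Assemble the narrowed SELECT sent to the source database."""
    cols = ",\n    ".join(columns)
    query = f"SELECT\n    {cols}\nFROM {table}"
    if predicate:
        # Predicate pushdown: filter at the source, not after loading.
        query += f"\nWHERE {predicate}"
    return query

q = build_pushdown_query(
    "legacy_db.leads",
    ["email", "status", "company_name"],
    "status = 'New'",
)
print(q)
```

Pushing the WHERE clause down alongside the column list means the source database does the filtering, so excluded rows never cross the network at all.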

Example: API Source Optimization

Consider a Salesforce Lead object with 500+ columns, of which you only need email and status.

Traditional ELT

Loads all 500 columns into a landing table, then you select 2.

API Request:

GET /services/data/v58.0/query?q=SELECT * FROM Lead

Result: 500 columns × 10,000 rows = 5M data points transferred.

msh Smart Ingest

You write:

-- models/leads.msh
SELECT
    email,
    status
FROM source('salesforce', 'Lead')
WHERE status = 'New'

Optimized API Request:

GET /services/data/v58.0/query?q=SELECT email, status FROM Lead WHERE status='New'

Result: 2 columns × 10,000 rows = 20K data points transferred.
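The arithmetic behind those two numbers, using the row count from this example:

```python
rows = 10_000
traditional = 500 * rows   # every column: 5,000,000 data points
smart = 2 * rows           # only email and status: 20,000 data points

reduction = 1 - smart / traditional
# reduction == 0.996, i.e. ~99.6% less data transferred
```

The row count is unchanged; the savings come entirely from column pruning, so the wider the source table, the larger the reduction.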

info

Smart Ingest happens automatically. You don't need to configure anything—just write your SQL.

success

Result: a ~99.6% reduction in transferred data for this wide table.

Example: SQL Source Optimization

For database-to-database replication:

# models/legacy_users.msh
source:
  type: dlt
  source_name: sql_database
  credentials: ${SOURCE_DB}
  table: users

sql: |
  SELECT
      user_id,
      email,
      created_at
  FROM source('sql_database', 'users')
  WHERE created_at > CURRENT_DATE - INTERVAL '7 days'

Without Smart Ingest:

-- Full table scan
SELECT * FROM legacy_db.users

With Smart Ingest:

-- Optimized query with column pruning and predicate pushdown
SELECT user_id, email, created_at
FROM legacy_db.users
WHERE created_at > CURRENT_DATE - INTERVAL '7 days'

This reduces both network transfer and source database load.