Smart Ingest
Smart Ingest is msh's cost-optimization engine. Unlike traditional ELT tools that load everything and then filter it down, msh analyzes your SQL transformations to determine exactly which columns are needed from the source.
How It Works
- Parse: msh parses your .msh (SQL) files and identifies the columns referenced in each SELECT statement.
- Resolve: It traces these columns back to the source definition.
- Fetch: It instructs the ingestion engine (dlt) to fetch only those specific columns from the API or database.
This drastically reduces:
- Egress costs from APIs.
- Storage costs in your warehouse.
- Compute time for processing.
The SQL Parsing Logic
msh uses a multi-stage compiler to extract column dependencies:
Stage 1: AST Parsing
Your .msh file is parsed into an Abstract Syntax Tree (AST):
-- models/leads.msh
SELECT
    email,
    status,
    UPPER(company_name) AS company
FROM source('salesforce', 'Lead')
WHERE status = 'New'
The compiler identifies:
- Selected columns: email, status, company_name
- Filter columns: status
- Source reference: salesforce.Lead
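A minimal sketch of this kind of column extraction is shown below, using the open-source sqlglot parser. sqlglot and every name in the snippet are illustrative assumptions; msh's actual compiler may work differently.
# Illustrative sketch: extract column dependencies from a model's SQL with sqlglot.
# This is not msh's implementation; it only demonstrates the idea behind Stage 1.
import sqlglot
from sqlglot import exp

MODEL_SQL = """
SELECT
    email,
    status,
    UPPER(company_name) AS company
FROM source('salesforce', 'Lead')
WHERE status = 'New'
"""

tree = sqlglot.parse_one(MODEL_SQL)

# Collect every column referenced anywhere in the statement: the SELECT list,
# expressions like UPPER(company_name), and the WHERE clause.
columns = sorted({col.name for col in tree.find_all(exp.Column)})
print(columns)  # ['company_name', 'email', 'status']
Note that company_name is included because the UPPER() expression depends on it: any column that a selected expression or filter depends on must be fetched from the source.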
Stage 2: Column Resolution
msh resolves these columns to the source schema and generates an optimized fetch specification.
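As a mental model, the output of this stage can be pictured as a small fetch specification. The shape below is hypothetical (the class and field names are not msh's internal format), but it carries exactly what Stage 3 needs: the source, the pruned column list, and any filter that can be pushed down.
# Hypothetical fetch specification; names and fields are illustrative, not msh internals.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FetchSpec:
    source_name: str                     # e.g. "salesforce"
    table: str                           # e.g. "Lead"
    columns: List[str] = field(default_factory=list)  # pruned column list
    predicate: Optional[str] = None      # filter to push down, if the source supports it

spec = FetchSpec(
    source_name="salesforce",
    table="Lead",
    columns=["email", "status", "company_name"],
    predicate="status = 'New'",
)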
Stage 3: Optimized Query Generation
For API sources, msh generates a field list:
# Generated dlt configuration
salesforce_resource = salesforce.Lead(
    fields=['email', 'status', 'company_name']  # Only 3 of 500+ fields
)
For SQL sources (DB-to-DB), msh generates an optimized SELECT:
-- Instead of: SELECT * FROM legacy_db.leads
-- msh generates:
SELECT
    email,
    status,
    company_name
FROM legacy_db.leads
WHERE status = 'New'
Example: API Source Optimization
Consider a Salesforce Lead object with 500+ columns. You only care about email and status.
Traditional ELT
Loads all 500 columns into a landing table, then you select 2.
API Request:
GET /services/data/v58.0/query?q=SELECT * FROM Lead
Result: 500 columns × 10,000 rows = 5M data points transferred.
msh Smart Ingest
You write:
-- models/leads.msh
SELECT
    email,
    status
FROM source('salesforce', 'Lead')
WHERE status = 'New'
Optimized API Request:
GET /services/data/v58.0/query?q=SELECT email, status FROM Lead WHERE status='New'
Result: 2 columns × 10,000 rows = 20K data points transferred, a 99% reduction in data volume for this wide table.
Smart Ingest happens automatically. You don't need to configure anything; just write your SQL.
Example: SQL Source Optimization
For database-to-database replication:
# models/legacy_users.msh
source:
  type: dlt
  source_name: sql_database
  credentials: ${SOURCE_DB}
  table: users
sql: |
  SELECT
    user_id,
    email,
    created_at
  FROM source('sql_database', 'users')
  WHERE created_at > CURRENT_DATE - INTERVAL '7 days'
Without Smart Ingest:
-- Full table scan
SELECT * FROM legacy_db.users
With Smart Ingest:
-- Optimized query with column pruning and predicate pushdown
SELECT user_id, email, created_at
FROM legacy_db.users
WHERE created_at > CURRENT_DATE - INTERVAL '7 days'
This reduces both network transfer and source database load.
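For illustration, here is a minimal sketch of how such a pruned, filtered extraction query could be assembled from the resolved columns and predicate. The helper function is hypothetical and not part of msh's API.
# Hypothetical helper: assemble a pruned extraction query with predicate pushdown.
from typing import List, Optional

def build_extract_query(table: str, columns: List[str], predicate: Optional[str] = None) -> str:
    select_list = ",\n    ".join(columns)
    query = f"SELECT\n    {select_list}\nFROM {table}"
    if predicate:
        query += f"\nWHERE {predicate}"
    return query

print(build_extract_query(
    "legacy_db.users",
    ["user_id", "email", "created_at"],
    "created_at > CURRENT_DATE - INTERVAL '7 days'",
))
Running this produces the same pruned, filtered SELECT shown above for the With Smart Ingest case.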