Discover Command
Auto-discover source schemas from REST APIs or SQL databases and generate `.msh` file configurations.
Overview
The msh discover command probes data sources and automatically generates production-ready .msh files with:
- Inferred schema information
- Proper source configuration
- Schema contracts based on discovered columns
- Ready-to-use transformation templates
Usage
msh discover <source> [options]
Source Types
The command automatically detects the source type:
- REST API: HTTP/HTTPS URLs
- SQL Database: Database connection strings
REST API Discovery
Discover schemas from REST API endpoints.
Basic Usage
msh discover https://api.github.com/repos/dlt-hub/dlt/issues
What it does:
- Makes a sample HTTP request to the endpoint
- Analyzes the JSON response structure
- Infers column names and data types
- Generates a `.msh` file with proper configuration
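The inference step can be approximated in plain Python. This is an illustrative sketch, not the tool's actual implementation: the `infer_type` and `infer_schema` helpers are assumptions, but the type names mirror the ones listed under Schema Inference below.

```python
import json
import re

# Hypothetical sketch of schema inference from one sample JSON record.
# The type names (string, integer, number, boolean, datetime, object)
# match the ones this command reports; the logic itself is illustrative.

ISO_DATETIME = re.compile(r"^\d{4}-\d{2}-\d{2}([T ]\d{2}:\d{2}:\d{2})?")

def infer_type(value):
    if isinstance(value, bool):   # must precede the int check: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "datetime" if ISO_DATETIME.match(value) else "string"
    return "object"               # dicts, lists, nulls, ...

def infer_schema(record):
    return {key: infer_type(val) for key, val in record.items()}

sample = json.loads('{"id": 1, "title": "Bug", "state": "open", '
                    '"created_at": "2024-01-15T09:30:00Z"}')
print(infer_schema(sample))
# {'id': 'integer', 'title': 'string', 'state': 'string', 'created_at': 'datetime'}
```

Real inference would sample several records and reconcile conflicting types; a single record is enough to show the idea.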
Example Output
$ msh discover https://api.github.com/repos/dlt-hub/dlt/issues
Detected source type: rest_api
Generated .msh configuration:
============================================================
name: issues
description: Auto-discovered from rest_api

ingest:
  type: rest_api
  endpoint: https://api.github.com/repos/dlt-hub/dlt/issues
  resource: data

contract:
  evolution: evolve
  enforce_types: true
  required_columns:
    - id
    - title
    - state
    - created_at

transform: |
  SELECT * FROM {{ source }}
============================================================
✓ Written to: models/issues.msh
You can now run: msh run issues
SQL Database Discovery
Discover schemas from SQL databases (Postgres, MySQL, DuckDB, etc.).
Basic Usage
# Discover a specific table
msh discover postgresql://user:pass@localhost:5432/mydb table:orders
# Discover from connection string (will prompt for table)
msh discover postgresql://user:pass@localhost:5432/mydb
Connection String Format
The command supports standard SQLAlchemy connection strings:
- postgresql://user:pass@host:port/dbname
- mysql://user:pass@host:port/dbname
- duckdb:///path/to/file.duckdb
- sqlite:///path/to/file.db
Table Specification
Specify the table using the `table:` prefix:
msh discover postgresql://user:pass@localhost/db table:orders
msh discover postgresql://user:pass@localhost/db table:public.users
Example Output
$ msh discover postgresql://user:pass@localhost/mydb table:orders
Detected source type: sql_database
Generated .msh configuration:
============================================================
name: orders
description: Auto-discovered from sql_database

ingest:
  type: sql_database
  credentials: postgresql://user:pass@localhost/mydb
  table: orders

contract:
  evolution: evolve
  enforce_types: true
  required_columns:
    - id
    - customer_id
    - total_amount
    - created_at

transform: |
  SELECT * FROM {{ source }}
============================================================
✓ Written to: models/orders.msh
You can now run: msh run orders
Options
--name <name>
Specify a custom asset name (defaults to inferred name from source).
msh discover https://api.example.com/users --name my_users
Inference rules:
- REST API: uses the last path segment (e.g., `issues` from `/repos/.../issues`)
- SQL Database: uses the table name
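The default-name rules described above can be sketched in a few lines of Python. This is an illustrative approximation, not the tool's actual code; the `infer_name` helper is an assumption.

```python
from urllib.parse import urlparse

# Hypothetical sketch of default asset-name inference:
# REST URLs use the last path segment, SQL sources use the
# table name (with any schema prefix such as "public." dropped).

def infer_name(source, table=None):
    if source.startswith(("http://", "https://")):
        path = urlparse(source).path.rstrip("/")
        return path.rsplit("/", 1)[-1]
    if table:
        return table.rsplit(".", 1)[-1]
    return "asset"

print(infer_name("https://api.github.com/repos/dlt-hub/dlt/issues"))       # issues
print(infer_name("postgresql://u:p@localhost/db", table="public.users"))   # users
```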
--output, -o <path>
Specify a custom output file path (defaults to models/{name}.msh).
msh discover https://api.example.com/users --output custom/users.msh
--write / --no-write
Control whether to write to file or print only (default: --write).
# Print to console only
msh discover https://api.example.com/users --no-write
# Write to file (default)
msh discover https://api.example.com/users --write
Generated Contract Blocks
The discover command generates a schema contract from the discovered columns:
contract:
  evolution: evolve        # Allow schema evolution (default)
  enforce_types: true      # Enforce type consistency
  required_columns:        # First 10 columns marked as required
    - id
    - name
    - email
    # ... up to 10 columns
Contract defaults:
- `evolution: evolve` - allows new columns to be added
- `enforce_types: true` - validates that data types match expectations
- `required_columns` - the first 10 discovered columns (or all, if fewer than 10)
You can modify these contracts after generation to match your requirements.
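For example, once a schema has stabilized you might tighten the generated defaults. A possible hand-edited variant (the column choices here are illustrative):

```yaml
contract:
  evolution: freeze        # reject new columns instead of evolving
  enforce_types: true
  required_columns:        # trimmed to the columns you actually rely on
    - id
    - created_at
```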
Schema Inference
The command infers data types from sample data:
- REST APIs: Analyzes JSON response structure
- SQL Databases: Uses database metadata (information_schema)
Supported types:
- `string` - text data
- `integer` - whole numbers
- `number` - decimal numbers
- `boolean` - true/false values
- `datetime` - date/time strings (detected by pattern)
- `object` - nested objects/arrays
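For SQL sources, the same information comes from database metadata rather than sampled rows. A self-contained sketch using Python's built-in sqlite3 (SQLite exposes column metadata via `PRAGMA table_info`; on Postgres or MySQL the equivalent query reads `information_schema.columns`, as noted above). The table and columns are made up for the example.

```python
import sqlite3

# Illustration of metadata-based type discovery. SQLite is used here
# only to keep the example self-contained; Postgres/MySQL would query
# information_schema.columns instead of PRAGMA table_info.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        total_amount REAL,
        created_at TEXT
    )
""")

# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
columns = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(orders)")}
print(columns)
# {'id': 'INTEGER', 'customer_id': 'INTEGER', 'total_amount': 'REAL', 'created_at': 'TEXT'}
```

Because the types come from declared metadata, this path does not depend on sample data being present in the table.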
Examples
Example 1: Discover GitHub Issues
msh discover https://api.github.com/repos/dlt-hub/dlt/issues --name github_issues
Generates models/github_issues.msh with GitHub API configuration.
Example 2: Discover Postgres Table
msh discover postgresql://user:pass@localhost/analytics table:customers --name prod_customers
Generates models/prod_customers.msh with SQL database configuration.
Example 3: Preview Without Writing
msh discover https://api.example.com/data --no-write
Prints the generated configuration without creating a file.
Next Steps
After discovery, you can:
- Review the generated file: `cat models/issues.msh`
- Customize the transformation: Edit the `transform` block to add your business logic.
- Adjust contracts: Modify the `contract` block to match your schema requirements.
- Run the asset: `msh run issues`
Troubleshooting
"Connection refused" or "Network error"
- Check your internet connection (for REST APIs)
- Verify database credentials and connectivity (for SQL databases)
- Ensure firewall rules allow connections
"Table not found"
- Verify the table name is correct
- Check the schema/namespace (use the `schema.table` format if needed)
- Ensure the database user has read permissions
"Invalid JSON response"
- REST API may require authentication
- Check if endpoint returns valid JSON
- Some APIs require specific headers or query parameters
"Schema inference failed"
- Source may return empty or inconsistent data
- Try with a different endpoint or table
- Manually specify columns if needed
Integration with Schema Contracts
The generated .msh files include contract blocks that:
- Validate required columns exist before ingestion
- Enforce type consistency to prevent data quality issues
- Control schema evolution (freeze vs evolve)
See Schema Contracts for more details.