Discover Command

Auto-discover source schemas from REST APIs or SQL databases and generate .msh file configurations automatically.

Overview

The msh discover command probes data sources and automatically generates production-ready .msh files with:

  • Inferred schema information
  • Proper source configuration
  • Schema contracts based on discovered columns
  • Ready-to-use transformation templates

Usage

msh discover <source> [options]

Source Types

The command automatically detects the source type:

  1. REST API: HTTP/HTTPS URLs
  2. SQL Database: Database connection strings
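The detection rule above can be sketched in a few lines. This is an illustrative sketch of the documented behavior, not the actual msh source code:

```python
def detect_source_type(source: str) -> str:
    """Classify a source string the way msh discover is documented to:
    HTTP/HTTPS URLs are REST APIs, everything else is treated as a
    SQL connection string. Illustrative sketch only."""
    if source.startswith(("http://", "https://")):
        return "rest_api"
    return "sql_database"
```

For example, `detect_source_type("https://api.github.com/repos/dlt-hub/dlt/issues")` yields `rest_api`, while a `postgresql://...` connection string yields `sql_database`.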

REST API Discovery

Discover schemas from REST API endpoints.

Basic Usage

msh discover https://api.github.com/repos/dlt-hub/dlt/issues

What it does:

  • Makes a sample HTTP request to the endpoint
  • Analyzes the JSON response structure
  • Infers column names and data types
  • Generates a .msh file with proper configuration

Example Output

$ msh discover https://api.github.com/repos/dlt-hub/dlt/issues

Detected source type: rest_api
Generated .msh configuration:
============================================================
name: issues
description: Auto-discovered from rest_api
ingest:
  type: rest_api
  endpoint: https://api.github.com/repos/dlt-hub/dlt/issues
  resource: data
contract:
  evolution: evolve
  enforce_types: true
  required_columns:
    - id
    - title
    - state
    - created_at
transform: |
  SELECT * FROM {{ source }}
============================================================

✓ Written to: models/issues.msh
You can now run: msh run issues

SQL Database Discovery

Discover schemas from SQL databases (Postgres, MySQL, DuckDB, etc.).

Basic Usage

# Discover a specific table
msh discover postgresql://user:pass@localhost:5432/mydb table:orders

# Discover from connection string (will prompt for table)
msh discover postgresql://user:pass@localhost:5432/mydb

Connection String Format

The command supports standard SQLAlchemy connection strings:

  • postgresql://user:pass@host:port/dbname
  • mysql://user:pass@host:port/dbname
  • duckdb:///path/to/file.duckdb
  • sqlite:///path/to/file.db

Table Specification

Specify the table using table: prefix:

msh discover postgresql://user:pass@localhost/db table:orders
msh discover postgresql://user:pass@localhost/db table:public.users
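Parsing the table: argument, including an optional schema qualifier, amounts to a simple string split. The helper below is hypothetical, shown only to make the schema.table convention concrete:

```python
def parse_table_spec(spec: str):
    """Split a 'table:' argument into (schema, table), where schema is
    None when no qualifier is given. Hypothetical helper, not part of
    the msh CLI itself."""
    name = spec.removeprefix("table:")
    if "." in name:
        schema, table = name.split(".", 1)
        return schema, table
    return None, name
```

So `table:orders` resolves to `(None, "orders")` and `table:public.users` to `("public", "users")`.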

Example Output

$ msh discover postgresql://user:pass@localhost/mydb table:orders

Detected source type: sql_database
Generated .msh configuration:
============================================================
name: orders
description: Auto-discovered from sql_database
ingest:
  type: sql_database
  credentials: postgresql://user:pass@localhost/mydb
  table: orders
contract:
  evolution: evolve
  enforce_types: true
  required_columns:
    - id
    - customer_id
    - total_amount
    - created_at
transform: |
  SELECT * FROM {{ source }}
============================================================

✓ Written to: models/orders.msh
You can now run: msh run orders

Options

--name <name>

Specify a custom asset name (defaults to a name inferred from the source).

msh discover https://api.example.com/users --name my_users

Inference rules:

  • REST API: Uses last path segment (e.g., issues from /repos/.../issues)
  • SQL Database: Uses table name
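These inference rules can be sketched as follows. This is a simplified illustration of the documented defaults, not the actual msh implementation:

```python
from urllib.parse import urlparse

def infer_asset_name(source, table=None):
    """Default asset name: the table name for SQL sources (minus any
    schema qualifier), otherwise the last non-empty URL path segment.
    Sketch of the documented rules, not the real msh code."""
    if table:
        return table.split(".")[-1]
    segments = [s for s in urlparse(source).path.split("/") if s]
    return segments[-1] if segments else "asset"
```

For the GitHub example, the last path segment of `/repos/dlt-hub/dlt/issues` gives the asset name `issues`.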

--output, -o <path>

Specify a custom output file path (defaults to models/{name}.msh).

msh discover https://api.example.com/users --output custom/users.msh

--write / --no-write

Control whether the configuration is written to a file or only printed to the console (default: --write).

# Print to console only
msh discover https://api.example.com/users --no-write

# Write to file (default)
msh discover https://api.example.com/users --write

Generated Contract Blocks

The discover command automatically generates a schema contract based on the discovered columns:

contract:
  evolution: evolve       # Allow schema evolution (default)
  enforce_types: true     # Enforce type consistency
  required_columns:       # First 10 columns marked as required
    - id
    - name
    - email
    # ... up to 10 columns

Contract defaults:

  • evolution: evolve - Allows new columns to be added
  • enforce_types: true - Validates data types match expectations
  • required_columns - First 10 discovered columns (or all if fewer than 10)

You can modify these contracts after generation to match your requirements.
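Assembling a contract with these defaults is straightforward. The sketch below mirrors the documented defaults; the real generator may differ in detail:

```python
def build_contract(columns):
    """Build a contract block with the documented defaults:
    evolution=evolve, enforce_types=true, and the first 10
    discovered columns marked required. Illustrative sketch."""
    return {
        "evolution": "evolve",
        "enforce_types": True,
        "required_columns": columns[:10],
    }
```

With fewer than 10 discovered columns, `columns[:10]` simply returns all of them, matching the documented behavior.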

Schema Inference

The command infers data types from sample data:

  • REST APIs: Analyzes JSON response structure
  • SQL Databases: Uses database metadata (information_schema)

Supported types:

  • string - Text data
  • integer - Whole numbers
  • number - Decimal numbers
  • boolean - True/false values
  • datetime - Date/time strings (detected by pattern)
  • object - Nested objects/arrays
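The mapping from sample JSON values to these type names can be sketched like this. The pattern used for datetime detection here is an assumption for illustration; the actual inference rules may be more elaborate:

```python
import re

# Rough ISO-8601 date/datetime pattern (assumed for illustration).
ISO_DATETIME = re.compile(r"^\d{4}-\d{2}-\d{2}([T ]\d{2}:\d{2}(:\d{2})?)?")

def infer_type(value):
    """Map a sample JSON value to one of the documented type names.
    Sketch only, not the actual msh inference logic."""
    if isinstance(value, bool):   # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, (dict, list)):
        return "object"
    if isinstance(value, str):
        return "datetime" if ISO_DATETIME.match(value) else "string"
    return "string"
```

A string like "2024-01-15T10:30:00Z" would be classified as datetime by pattern, while "open" remains a plain string.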

Examples

Example 1: Discover GitHub Issues

msh discover https://api.github.com/repos/dlt-hub/dlt/issues --name github_issues

Generates models/github_issues.msh with GitHub API configuration.

Example 2: Discover Postgres Table

msh discover postgresql://user:pass@localhost/analytics table:customers --name prod_customers

Generates models/prod_customers.msh with SQL database configuration.

Example 3: Preview Without Writing

msh discover https://api.example.com/data --no-write

Prints the generated configuration without creating a file.

Next Steps

After discovery, you can:

  1. Review the generated file:

    cat models/issues.msh
  2. Customize the transformation: Edit the transform block to add your business logic.

  3. Adjust contracts: Modify the contract block to match your schema requirements.

  4. Run the asset:

    msh run issues

Troubleshooting

"Connection refused" or "Network error"

  • Check your internet connection (for REST APIs)
  • Verify database credentials and connectivity (for SQL databases)
  • Ensure firewall rules allow connections

"Table not found"

  • Verify the table name is correct
  • Check schema/namespace (use schema.table format if needed)
  • Ensure database user has read permissions

"Invalid JSON response"

  • REST API may require authentication
  • Check if endpoint returns valid JSON
  • Some APIs require specific headers or query parameters

"Schema inference failed"

  • Source may return empty or inconsistent data
  • Try with a different endpoint or table
  • Manually specify columns if needed

Integration with Schema Contracts

The generated .msh files include contract blocks that:

  • Validate required columns exist before ingestion
  • Enforce type consistency to prevent data quality issues
  • Control schema evolution (freeze vs evolve)
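The required-columns check implied by a contract can be sketched as a simple set comparison over a sample batch of records. This is an illustrative sketch of what such validation entails, not msh's actual enforcement code:

```python
def check_required_columns(rows, required):
    """Return the required columns missing from a batch of records.
    Illustrative sketch of contract validation, not the msh internals."""
    present = set()
    for row in rows:
        present.update(row.keys())
    return [col for col in required if col not in present]
```

An empty result means the batch satisfies the contract's required_columns; any returned names would fail validation before ingestion.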

See Schema Contracts for more details.