# Polyglot Transforms
msh allows you to mix SQL and Python seamlessly within your data pipeline. This is particularly powerful for operations that are difficult or impossible in pure SQL, such as complex PII masking, calling external APIs, or using advanced libraries like Polars or Pandas.
## In-Flight Masking
A common requirement is to mask sensitive data (PII) before it lands in the data warehouse. With msh, you can intercept the data stream using a Python block.
### Example: PII Masking
```yaml
# models/customers.msh
source:
  type: dlt
  resource: users
python: |
  import polars as pl

  def transform(df: pl.DataFrame) -> pl.DataFrame:
      # Mask email addresses, keeping only the last 5 characters visible
      return df.with_columns(
          pl.col("email").map_elements(lambda x: "*****" + x[-5:], return_dtype=pl.String)
      )
sql: |
  SELECT * FROM {{ python_output }}
```
In this example:
- Data is fetched from the source.
- It is passed to the `transform` function as a Polars DataFrame.
- The `email` column is masked.
- The masked data is then available to the SQL block.
## Using Polars/Pandas
msh supports both Polars (default, faster) and Pandas. You can specify your preference in the configuration, but Polars is recommended for its performance and memory efficiency.
The transform function must accept a DataFrame and return a DataFrame.
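The same contract holds for either engine. As a sketch, a Pandas version of the masking transform could look like this (the configuration key for selecting the engine is not shown here):

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Same masking rule as the Polars example, expressed with the pandas .str accessor
    out = df.copy()
    out["email"] = "*****" + out["email"].str[-5:]
    return out

df = pd.DataFrame({"email": ["alice@example.com"]})
print(transform(df)["email"].tolist())  # ['*****e.com']
```

Note the copy before mutation: returning a new DataFrame rather than modifying the input in place keeps the transform side-effect free, matching the Polars behavior.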
## Connecting Local Logic
You can import local Python modules into your .msh files. This allows you to share logic across multiple models or keep your transformation code version-controlled and testable.
```yaml
python: |
  from my_project.utils import mask_pii

  def transform(df):
      return mask_pii(df)
```
Ensure your local modules are on the Python path (for example via `PYTHONPATH`) or installed in the environment where msh is running.