
Using Matillion variables for dynamic data integration with practical use cases

13.05.2024

Although many articles and excellent formal documentation already cover variables in Matillion ETL, my focus here is a practical presentation of our experience implementing some specific use cases built around variables. The use cases described here are simplified yet applicable solutions, intended to encourage the use of this powerful feature in a variety of situations. The final goal is to stimulate the development of more unified data pipelines, even when you work in a low-code or no-code environment.

It is worth noting from the outset that this article references the Matillion ETL product. Most of the definitions and solutions also apply to the new Matillion Data Productivity Cloud (DPC) platform. The main differences lie in terminology, so I will point out the terms that differ between the two versions while referring to the same concept. Even so, other differences may exist.

Variables

By definition, variables are name-value pairs that can be used in parameters and expressions. All scalar variables can be referenced as ${<variable_name>} in:

  • all query/SQL query sections
  • component properties (connection URL, RDS endpoints, destination db/schema/table, username/password, file storage locations etc.)
  • other variables
  • Python and Bash scripts. Here the reference is simply <variable_name>, so the variables become first-class names in the script; take care not to name variables after Python reserved words (see the short sketch after this list).
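
As a short sketch (the variable name var_target_table is hypothetical), a job variable can be read as a plain Python name and written back through the context object that Matillion exposes inside the Python Script component:

# Reading the variable inside a Python Script component
print(var_target_table)

# Writing a new value back so that downstream components see it
context.updateVariable('var_target_table', 'SALES_2024')

# In SQL queries and component properties the same variable would be
# referenced as ${var_target_table} instead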

Based on their scope, Matillion supports:

  • Environment Variables (Project Variables in DPC) – defined at the project level and visible across the project. Environment variables can have different default values per environment, which is typically used to dynamically change configurations or parameters depending on the environment (e.g. separate target locations for production, QA and development).
  • Job Variables (Pipeline Variables in DPC) – defined within the scope of a job (pipeline); a job variable overrides any environment variable of the same name within that job.

There are also two types of predefined variables in Matillion:

  • Automatic Variables – visible across Matillion ETL and acting as system-defined variables. Their values are predefined and cannot be modified. Matillion provides a full list of available automatic variables.
  • Component Exports – pseudo-variables predefined within a component (on its Export tab) but visible across the job. To use an exported value, you must assign it to an existing job or environment variable. Matillion provides a full list of available component export values.

Automatic variables and component exports are mainly used in logging and auditing solutions, where logging information at the job or component level is written to database tables.
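
As a minimal sketch of such a logging step (the audit table AUDIT.JOB_LOG and its columns are assumptions made for illustration; run_history_id and job_name are automatic variables), a Python Script component could write job-level information like this:

# Insert basic run information into a hypothetical audit table;
# context.cursor() is the database cursor available to Python scripts
cursor = context.cursor()
cursor.execute(
    "INSERT INTO AUDIT.JOB_LOG (run_id, job_name, logged_at) "
    "VALUES ('" + str(run_history_id) + "', '" + job_name + "', CURRENT_TIMESTAMP)")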

All of these variables are also known as scalar variables in Matillion. Their behaviour across branches inside a job can be set to Copied or Shared:

  • Copied – can be updated within one branch of the job without impacting instances of the same variable in other branches. This is the only choice when you run loops and iterators in concurrent mode.
  • Shared – always updated in a job-wide fashion, impacting all branches within the job.

Because all variable types are referenced the same way, it is good practice to assign a prefix to variable names to separate user-defined environment and job variables from automatic variables. I usually use the prefix env_ for environment variables and var_ for job variables.

Aside from scalar variables, Matillion also supports Grid Variables – two-dimensional array variables that hold scalar values in named columns. Grid variables can be defined only at the job level. They can be referenced only by Grid Iterator components or from Python Script components (read with print(context.getGridVariable('mygridvar')), update with context.updateGridVariable('mygridvar', [['list','of'],['lists','!']])).
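
As a quick illustration (mygridvar and the filter condition are hypothetical), a Python Script component can read a grid variable, manipulate its rows as a list of lists and write it back:

# Read the grid variable, drop unwanted rows and save the result
rows = context.getGridVariable('mygridvar')
rows = [r for r in rows if not r[0].startswith('tmp_')]
context.updateGridVariable('mygridvar', rows)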

Real-Life Use Cases

Name: Conditional METL Instance Shutdown
Variable usage: Job, environment and automatic scalar variables; assigning an API query result to a scalar variable
Use Case: A job scheduled to run periodically, check whether any other jobs are running on the Matillion ETL instance (apart from itself) and, if none are, shut down the instance while notifying a Slack channel.

Variables:  

Name | Var Type | Visibility | Behaviour | Type
var_running_jobs | Job scalar | Public | Shared | NUMERIC
env_AuditRunningJobs | Environment | Public | Shared | TEXT
job_id | Automatic | – | – | –

Because an API Query component can only ingest the resulting data into a table in the target database, you can always design a pair of components (the yellow frame in the job).

The first component executes the query using a previously defined API profile. The name of the profile is stored in the ${env_AuditRunningJobs} variable and can differ depending on the environment where the job is executing. The target table AUDIT.CHECK_RUNNING_JOBS is recreated with the result of the query every time the component runs (the where clause jobID <> ${job_id} eliminates the current job from the result set).
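
For context, the API profile referenced by ${env_AuditRunningJobs} typically wraps Matillion ETL's own REST API. Purely as an illustration of the underlying call (the endpoint path, group/project names, credentials and response fields below are assumptions, not taken from this job), a similar check could be sketched in a Python Script component, assuming the requests library is available on the instance:

import requests

# Hypothetical call to the Matillion ETL running-tasks endpoint; adjust
# host, group, project and credentials to your own instance
url = ('http://localhost/rest/v1/group/name/MyGroup'
       '/project/name/MyProject/task/running')
tasks = requests.get(url, auth=('api-user', 'api-password')).json()

# Exclude the current job (job_id is the automatic variable used above)
others = [t for t in tasks if str(t.get('jobID')) != str(job_id)]
print('Other running jobs: ' + str(len(others)))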

The other component then simply fetches the row count from that table and assigns the value to the variable var_running_jobs for the condition check in the next If component.
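
As a side note, if you prefer to skip that second step, the same row count could be fetched and pushed into the variable directly from a Python Script component. This is only a sketch using the table and variable defined above; context.cursor() is the database cursor Matillion ETL exposes to Python scripts:

# Fetch the row count and assign it to the job variable evaluated by
# the If component
cursor = context.cursor()
cursor.execute('SELECT COUNT(*) FROM AUDIT.CHECK_RUNNING_JOBS')
running = cursor.fetchone()[0]
context.updateVariable('var_running_jobs', running)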

If the value is equal to 0, the Shut Down Instance component can be executed and an appropriate message sent to the predefined Slack channel. If not, the job simply ends.

This job can be scheduled to run periodically at an appropriate frequency and shut down the instance only when all ETL jobs have finished, in order to save resources. It is important to understand that this solution applies only to Matillion ETL, as the Data Productivity Cloud is SaaS and you don’t have to worry about the infrastructure cost.

Name: S3_2_S3
Variable usage: Job and environment scalar variables plus a job grid variable; the S3 location used in the file data transfer is built dynamically depending on the environment
Use Case: Development of a common job that redirects the file transfer target location according to the environment in which the job is executed. Different folder branches for different environments (e.g. prod and non-prod) must exist as a prerequisite.

Existing S3 folder structure example:

For production environment – s3://<bucket_name>/prod/<vendor_name>/ingest/

For development environment – s3://<bucket_name>/non-prod/<vendor_name>/ingest/

This is a two-step job. In the first step, a Python script checks for the existence of the source files in the destination folder. If a file is new (it does not exist in the destination folder), its filename is appended to the grid variable. The grid variable is then used in the second step, where a Grid Iterator executes the Data Transfer component for each file and transfers it to the destination.

Variables:  

Name | Var Type | Visibility | Description
S3_Source_Bucket | Job scalar | Public | S3 source bucket name
S3_Source_Path | Job scalar | Public | S3 source path
S3_Regex | Job scalar | Public | Regex for masking the source file set
S3_Dest_Bucket | Job scalar | Public | S3 destination bucket name
S3_Dest_Path | Job scalar | Public | S3 destination path
env_PrdURL | Environment | – | The folder name at which the folder branching for different environments starts (e.g. prod, non-prod)
S3_MaxLoop | Job scalar | Public | Maximum number of files to be checked in one pass
S3_FileName | Grid | Public | One-column array (column name = NAME)
LOOP_Filename | Job scalar | Private | Loop variable that receives the current grid variable value on every iteration so it can be passed to the Data Transfer component

The following is the Python script, where you can see how the variables are referenced in Python. The last line of the code assigns the collected values to the grid variable.

At the end, the Data Transfer component’s properties are presented. The Source URL and Target URL properties in particular show the power of variables: a combination of multiple job and environment variables is used to dynamically build the destination path (a worked example follows the table).

import boto3
import os
import re  # used by the regex expression stored in S3_Regex

s3 = boto3.resource('s3')
s3dest = boto3.client('s3')
s1 = []   # rows for the grid variable (one filename per row)
cc = 0    # number of new files found

myBucket = s3.Bucket(S3_Source_Bucket)
for object_summary in myBucket.objects.filter(Prefix=S3_Source_Path):
    head_tail = os.path.split(object_summary.key)
    # keep only objects directly under the source path (no sub-folders)
    if S3_Source_Path == (head_tail[0] + '/'):
        file = head_tail[1]
        # S3_Regex holds an expression evaluated against the filename
        # to mask the source file set
        if eval(S3_Regex):
            # skip files that already exist in the destination folder
            result = s3dest.list_objects_v2(
                Bucket=S3_Dest_Bucket,
                Prefix=(env_PrdURL + S3_Dest_Path + file))
            if 'Contents' not in result:
                s1.append([file])
                cc = cc + 1
                if cc >= S3_MaxLoop:
                    break

print("To transfer: " + str(cc) + " files")
# hand the collected filenames over to the grid variable
context.updateGridVariable('S3_FileName', s1)

Grid Iterator:

Parameter | Value
Grid Variable | S3_FileName
Grid Variable Mapping | NAME, LOOP_Filename

Data Transfer Object:

Parameter | Value
Source Type | S3
Source URL | s3://${S3_Source_Bucket}/${S3_Source_Path}${LOOP_Filename}
Unpack ZIP file | No
Target Type | S3
Gzip data | No
Target Object Name | ${LOOP_Filename}
Target URL | s3://${S3_Dest_Bucket}/${env_PrdURL}${S3_Dest_Path}
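
To make the substitution concrete: with the hypothetical values S3_Dest_Bucket = my-bucket, env_PrdURL = prod/, S3_Dest_Path = acme/ingest/ and a current grid row of data_20240513.csv, the Target URL resolves to s3://my-bucket/prod/acme/ingest/ and the Target Object Name to data_20240513.csv, matching the production folder structure shown earlier. Running the same job in the development environment only changes env_PrdURL (to non-prod/); nothing else in the job needs to be touched.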

Conclusion

The concepts described in this article can be applied to a variety of other use cases where you need to dynamically control the flow and behaviour of a data pipeline. Using variables, even combining multiple variables and static text to build a URL, connection string, username or any other property of a Matillion component, is a very powerful tool in the hands of a data engineer. I hope my experience encourages you to use variables when designing your Matillion pipelines and saves you time and resources in the process.


Ranato Stamatovski
Senior Consultant