Using Matillion variables for dynamic data integration with practical use cases
13.05.2024
Although many articles and excellent formal documentation are available on the subject of variables in Matillion ETL, my main focus here is to make this blog a more practical presentation of our experience in implementing some specific use cases involving variables. The use cases described here are simplified yet applicable solutions, intended to encourage the use of this powerful feature in a variety of solutions. The final goal is to stimulate the development of more unified data pipelines, even when you work in a low-code or no-code environment.
It is good to note from the outset that this article references the Matillion ETL product. Most of the definitions and solutions are equally applicable in the new Matillion Data Productivity Cloud platform. The main differences lie in the changed terminology, so I will try to point out the terms that differ between the two versions while referring to the same concept. Even so, there can be other differences, too.
Variables
By definition, variables are name-value pairs that can be used in parameters and expressions. All scalar variables can be referenced as ${<variable_name>} by:
- all query/SQL query sections
- component properties (connection URL, RDS endpoints, destinations db/schema/table, username/password, file storage locations etc.)
- other variables
- Python and Bash scripts. Here the reference is simply <variable_name>, so the variables become first-class names inside the script (see the sketch below). That is why care should be taken not to use Python reserved words as variable names.
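For illustration, here is a minimal sketch of a Python Script component (the variable name var_target_table is hypothetical) showing the bare reference alongside the context-based update:

# Inside a Matillion Python Script component, a job variable named var_target_table
# is exposed directly as a Python name.
print(var_target_table)                                   # read the current value
context.updateVariable('var_target_table', 'SALES_STG')   # write a new value back
# In SQL sections and component properties the same variable would instead be
# referenced as ${var_target_table}.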
According to their scope, Matillion supports two kinds of user-defined variables:
- Environment Variables (Project Variables in DPC) – defined at project level and visible across the project. Environment variables can have different default values assigned for different environments, which is usually used to dynamically change configurations or parameters depending on the environment (i.e. separate target locations for production, QA and development).
- Job Variables (Pipeline Variables in DPC) – defined within the scope of a job (pipeline); a job variable overrides any environment variable of the same name within that job.
There are also two types of predefined variables in Matillion:
- Automatic Variables – visible across Matillion ETL, acting as system-defined variables. Their values are predefined and cannot be modified. Matillion provides a full list of the available automatic variables.
- Component Exports – pseudo-variables predefined within the component (in the Export tab), but visible across the job. To use the value, you must assign it to an existing job or environment variable. Matillion provides a full list of available component export values.
Automatic variables and component exports are mainly used in logging and auditing solutions, where the logging information at the job or component level is written into database tables.
All of these variables are also known as scalar variables in Matillion. They can be defined as either copied or shared, depending on the desired branch behaviour inside the job:
- Copied – can be updated within one branch of the job without impacting other instances of the same-named variable in other branches. This is the only choice when you run loops and iterators in concurrent mode.
- Shared – always updated in a job-wide fashion, impacting all branches within the job.
Because all variable types are referred to the same way, it is good practice to assign a prefix to variable names to separate user-defined environment and job variables from automatic variables. I usually use the prefix env_ for environment variables and var_ for job variables.
Aside from the scalar variables, Matillion also supports Grid Variables: two-dimensional array variables that hold scalar values in named columns. Grid variables can be defined only at the job level. They can be referenced only by grid iterator components or used from a Python Script component (read the content with print(context.getGridVariable('mygridvar')), update it with context.updateGridVariable('mygridvar', [['list','of'],['lists','!']])).
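As a minimal sketch (the grid variable name gv_files and its single column are hypothetical), a Python Script component can read a grid variable and rebuild it from a plain Python list of lists:

# Read the current contents of a grid variable named gv_files (a list of lists)
rows = context.getGridVariable('gv_files')
print(rows)
# Replace the contents; each inner list is one row, in column order
context.updateGridVariable('gv_files', [['file_a.csv'], ['file_b.csv']])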
Real Life Use Cases
| Name | Conditional METL Instance Shutdown |
| --- | --- |
| Variable usage | Job, environment and automatic scalar variables; assigning an API query result to a scalar variable |
| Use Case | A job scheduled to periodically check whether any other jobs (apart from itself) are running on the Matillion ETL instance and, if none are, shut the instance down, sending a message to a predefined Slack channel. |

Variables:
| Name | Var Type | Visibility | Behavior | Type |
| --- | --- | --- | --- | --- |
| var_running_jobs | Job scalar | Public | Shared | NUMERIC |
| env_AuditRunningJobs | Environment | Public | Shared | TEXT |
| job_id | Automatic | | | |
Because an API Query component can only ingest the resulting data into a table in the target database, you can always design a pair of components (the yellow frame in the job).
The first component executes the query using a previously defined API profile. The name of the profile is stored in the ${env_AuditRunningJobs} variable and can differ depending on the environment in which the job is executing. The target table AUDIT.CHECK_RUNNING_JOBS is recreated every time the component is executed with the result of the query (the WHERE clause jobID <> ${job_id} excludes the current job from the result set).

The other component will then simply fetch the row count from that table and assign the value to the variable var_running_jobs for the condition check in the next IF component.

If the value equals 0, the Shut Down Instance component can be executed, sending an appropriate message to the predefined Slack channel. If not, the job simply ends.
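The row-count check itself could also be collapsed into a single Python Script component. A minimal sketch, assuming AUDIT.CHECK_RUNNING_JOBS has already been populated by the API Query component and var_running_jobs is defined as a job variable:

# Python Script component: count the rows left by the API Query component
cursor = context.cursor()                 # cursor against the environment's default database
cursor.execute('SELECT COUNT(*) FROM AUDIT.CHECK_RUNNING_JOBS')
running = cursor.fetchone()[0]
# Expose the count to the downstream If component via the job variable
context.updateVariable('var_running_jobs', running)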

This job can be scheduled to start periodically with an appropriate frequency and shut down the instance only when all ETL jobs have finished, in order to save resources. It is important to understand that this solution applies only to Matillion ETL: the Data Productivity Cloud is a SaaS offering, so there is no instance infrastructure cost to worry about.
| Name | S3_2_S3 |
| --- | --- |
| Variable usage | Job and environment scalar variables plus a job grid variable; dynamically defining the S3 location for a file data transfer depending on the environment |
| Use Case | Development of a common job that redirects the file transfer target location according to the environment in which the job is executed. Different folder branches for the different environments (i.e. prod and non-prod) must exist as a prerequisite. |
Existing S3 folder structure example:
For production environment – s3://<bucket_name>/prod/<vendor_name>/ingest/
For development environment – s3://<bucket_name>/non-prod/<vendor_name>/ingest/

This is a two-step job. In the first step, a Python script checks whether the source files already exist in the destination folder. If a file is new (it does not exist in the destination folder), its filename is appended to the grid variable. This variable is used in the next step, where a grid iterator executes the Data Transfer component once for each file and transfers it to the destination.
Variables:
| Name | Var Type | Visibility | Description |
| --- | --- | --- | --- |
| S3_Source_Bucket | Job scalar | Public | S3 source bucket name |
| S3_Source_Path | Job scalar | Public | S3 source path |
| S3_Regex | Job scalar | Public | Regex for masking the source file set |
| S3_Dest_Bucket | Job scalar | Public | S3 destination bucket name |
| S3_Dest_Path | Job scalar | Public | S3 destination path |
| env_PrdURL | Environment | | The folder name at which the folder branching for the different environments starts (i.e. prod, non-prod) |
| S3_MaxLoop | Job scalar | Public | Maximum number of files to be checked in one pass |
| S3_FileName | Grid | Public | One-column array (column name = NAME) |
| LOOP_Filename | Job scalar | Private | Loop variable that holds the current grid variable value for each iteration so it can be passed to the Data Transfer component |
The following is the Python script, where you can see how the variables are referenced in Python. The last line of the code assigns the collected values to the grid variable.
After the script, the Data Transfer component's properties are presented. The Source URL and Target URL properties in particular show the power of variables: a combination of multiple job and environment variables is used to dynamically build the source and destination paths.
import boto3
import os
import re

s3 = boto3.resource('s3')
s3dest = boto3.client('s3')

s1 = []   # rows for the S3_FileName grid variable (one file name per row)
cc = 0    # number of new files collected so far

# The job variables (S3_Source_Bucket, S3_Source_Path, S3_Regex, S3_Dest_Bucket,
# S3_Dest_Path, S3_MaxLoop) and the environment variable env_PrdURL are exposed
# as first-class Python names inside the Python Script component.
myBucket = s3.Bucket(S3_Source_Bucket)

for object_summary in myBucket.objects.filter(Prefix=S3_Source_Path):
    head_tail = os.path.split(object_summary.key)
    # Only consider objects sitting directly under the source path (no sub-folders)
    if S3_Source_Path == (head_tail[0] + '/'):
        file = head_tail[1]
        # S3_Regex holds a Python expression (e.g. a re.match(...) call) evaluated against 'file'
        if eval(S3_Regex):
            # Check whether the file already exists in the environment-specific destination folder
            result = s3dest.list_objects_v2(Bucket=S3_Dest_Bucket,
                                            Prefix=(env_PrdURL + S3_Dest_Path + file))
            if 'Contents' not in result:
                s1.append([file])
                cc = cc + 1
            # Stop after the maximum number of files allowed in one pass
            if cc >= S3_MaxLoop:
                break

print("To transfer: " + str(cc) + " Files")

# Push the collected file names into the grid variable consumed by the grid iterator
context.updateGridVariable('S3_FileName', s1)

Grid Iterator:
| Parameter | Value |
| --- | --- |
| Grid Variable | S3_FileName |
| Grid Variable Mapping | NAME, LOOP_Filename |
Data Transfer Object:
| Parameter | Value |
| --- | --- |
| Source Type | S3 |
| Source URL | s3://${S3_Source_Bucket}/${S3_Source_Path}${LOOP_Filename} |
| Unpack ZIP file | No |
| Target Type | S3 |
| Gzip data | No |
| Target Object Name | ${LOOP_Filename} |
| Target URL | s3://${S3_Dest_Bucket}/${env_PrdURL}${S3_Dest_Path} |
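As a worked example with purely hypothetical values – S3_Dest_Bucket = my-data-bucket, env_PrdURL = prod/, S3_Dest_Path = acme/ingest/ and a current LOOP_Filename of sales_20240513.csv – the Target URL resolves to s3://my-data-bucket/prod/acme/ingest/ and the Target Object Name to sales_20240513.csv. In a non-production environment only env_PrdURL changes (e.g. to non-prod/), while the rest of the configuration stays identical.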
Conclusion
The concepts described in this article can be used in a variety of other use cases where you need to dynamically control the data pipeline flow and behaviour. Using variables, even as a combination of multiple variables and static characters, to build a URL, connection string, username or any other property of a Matillion component is a very powerful tool in the hands of a data engineer. I hope my experience will encourage you to use variables when designing your Matillion pipelines and that it will save you time and resources in the process.
