
Using Matillion variables for dynamic data integration with practical use cases

13.05.2024

Although many articles and excellent formal documentation already cover variables in Matillion ETL, my focus here is a practical presentation of our experience implementing some specific use cases built around variables. The use cases described here are simplified yet applicable solutions, intended to encourage the use of this powerful feature in a variety of situations. The final goal is to stimulate the development of more unified data pipelines, even when you work in a low-code or no-code environment.

It is worth noting from the outset that this article references the Matillion ETL product. Most of the definitions and solutions also apply to the new Matillion Data Productivity Cloud (DPC) platform. The main differences lie in terminology, so I will point out the terms that differ between the two versions while referring to the same concept. Even so, other differences may exist.

Variables

By definition, variables are name-value pairs that can be used in parameters and expressions. All scalar variables can be referenced as ${<variable_name>} in:

  • all query/SQL query sections
  • component properties (connection URL, RDS endpoints, destination db/schema/table, username/password, file storage locations etc.)
  • other variables
  • Python and Bash scripts. Here the reference is simply <variable_name>, so the variables become first-class names in the script; take care not to name variables after Python reserved words (see the short sketch after this list).
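
As a short sketch (the variable name var_target_table is hypothetical), a job variable can be read as a plain Python name and written back through the context object that Matillion exposes inside the Python Script component:

# Reading the variable inside a Python Script component
print(var_target_table)

# Writing a new value back so that downstream components see it
context.updateVariable('var_target_table', 'SALES_2024')

# In SQL queries and component properties the same variable would be
# referenced as ${var_target_table} instead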

Based on their scope, Matillion supports:

  • Environment Variables (Project Variables in DPC) – defined at the project level and visible across the project. Environment variables can have different default values per environment, which is typically used to dynamically change configurations or parameters depending on the environment (e.g. separate target locations for production, QA and development).
  • Job Variables (Pipeline Variables in DPC) – defined within the scope of a job (pipeline); a job variable overrides any environment variable of the same name within that job.

There are also two types of predefined variables in Matillion:

  • Automatic Variables – visible across Matillion ETL and acting as system-defined variables. Their values are predefined and cannot be modified. Matillion provides a full list of available automatic variables.
  • Component Exports – pseudo-variables predefined within a component (on its Export tab) but visible across the job. To use an exported value, you must assign it to an existing job or environment variable. Matillion provides a full list of available component export values.

Automatic variables and component exports are mainly used in logging and auditing solutions, where logging information at the job or component level is written to database tables.
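
As a minimal sketch of such a logging step (the audit table AUDIT.JOB_LOG and its columns are assumptions made for illustration; run_history_id and job_name are automatic variables), a Python Script component could write job-level information like this:

# Insert basic run information into a hypothetical audit table;
# context.cursor() is the database cursor available to Python scripts
cursor = context.cursor()
cursor.execute(
    "INSERT INTO AUDIT.JOB_LOG (run_id, job_name, logged_at) "
    "VALUES ('" + str(run_history_id) + "', '" + job_name + "', CURRENT_TIMESTAMP)")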

All of these variables are also known as scalar variables in Matillion. Their behaviour across branches inside a job can be set to Copied or Shared:

  • Copied – can be updated within one branch of the job without impacting instances of the same variable in other branches. This is the only choice when you run loops and iterators in concurrent mode.
  • Shared – always updated in a job-wide fashion, impacting all branches within the job.

Because all variable types are referenced the same way, it is good practice to assign a prefix to variable names to separate user-defined environment and job variables from automatic variables. I usually use the prefix env_ for environment variables and var_ for job variables.

Aside from scalar variables, Matillion also supports Grid Variables – two-dimensional array variables that hold scalar values in named columns. Grid variables can be defined only at the job level. They can be referenced only by Grid Iterator components or from Python Script components (read with print(context.getGridVariable('mygridvar')), update with context.updateGridVariable('mygridvar', [['list','of'],['lists','!']])).
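
As a quick illustration (mygridvar and the filter condition are hypothetical), a Python Script component can read a grid variable, manipulate its rows as a list of lists and write it back:

# Read the grid variable, drop unwanted rows and save the result
rows = context.getGridVariable('mygridvar')
rows = [r for r in rows if not r[0].startswith('tmp_')]
context.updateGridVariable('mygridvar', rows)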

Real-Life Use Cases

Name: Conditional METL Instance Shutdown
Variable usage: Job, environment and automatic scalar variables; assigning an API query result to a scalar variable
Use Case: A job scheduled to run periodically, check whether any other jobs are running on the Matillion ETL instance (apart from itself) and, if none are, shut down the instance while notifying a Slack channel.

Variables:  

Name | Var Type | Visibility | Behaviour | Type
var_running_jobs | Job scalar | Public | Shared | NUMERIC
env_AuditRunningJobs | Environment | Public | Shared | TEXT
job_id | Automatic | – | – | –

Because an API Query component can only ingest the resulting data into a table in the target database, you can always design a pair of components (the yellow frame in the job).

The first component executes the query using a previously defined API profile. The name of the profile is stored in the ${env_AuditRunningJobs} variable and can differ depending on the environment where the job is executing. The target table AUDIT.CHECK_RUNNING_JOBS is recreated with the result of the query every time the component runs (the where clause jobID <> ${job_id} eliminates the current job from the result set).
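
For context, the API profile referenced by ${env_AuditRunningJobs} typically wraps Matillion ETL's own REST API. Purely as an illustration of the underlying call (the endpoint path, group/project names, credentials and response fields below are assumptions, not taken from this job), a similar check could be sketched in a Python Script component, assuming the requests library is available on the instance:

import requests

# Hypothetical call to the Matillion ETL running-tasks endpoint; adjust
# host, group, project and credentials to your own instance
url = ('http://localhost/rest/v1/group/name/MyGroup'
       '/project/name/MyProject/task/running')
tasks = requests.get(url, auth=('api-user', 'api-password')).json()

# Exclude the current job (job_id is the automatic variable used above)
others = [t for t in tasks if str(t.get('jobID')) != str(job_id)]
print('Other running jobs: ' + str(len(others)))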

The other component then simply fetches the row count from that table and assigns the value to the variable var_running_jobs for the condition check in the next If component.
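
As a side note, if you prefer to skip that second step, the same row count could be fetched and pushed into the variable directly from a Python Script component. This is only a sketch using the table and variable defined above; context.cursor() is the database cursor Matillion ETL exposes to Python scripts:

# Fetch the row count and assign it to the job variable evaluated by
# the If component
cursor = context.cursor()
cursor.execute('SELECT COUNT(*) FROM AUDIT.CHECK_RUNNING_JOBS')
running = cursor.fetchone()[0]
context.updateVariable('var_running_jobs', running)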

If the value is equal to 0, the Shut Down Instance component can be executed and an appropriate message sent to the predefined Slack channel. If not, the job simply ends.

This job can be scheduled to run periodically at an appropriate frequency and shut down the instance only when all ETL jobs have finished, in order to save resources. It is important to understand that this solution applies only to Matillion ETL, as the Data Productivity Cloud is SaaS and you don’t have to worry about the infrastructure cost.

Name: S3_2_S3
Variable usage: Job and environment scalar variables plus a job grid variable; the S3 location used in the file data transfer is built dynamically depending on the environment
Use Case: Development of a common job that redirects the file transfer target location according to the environment in which the job is executed. Different folder branches for different environments (e.g. prod and non-prod) must exist as a prerequisite.

Existing S3 folder structure example:

For production environment – s3://<bucket_name>/prod/<vendor_name>/ingest/

For development environment – s3://<bucket_name>/non-prod/<vendor_name>/ingest/

This is a two-step job. In the first step, a Python script checks for the existence of the source files in the destination folder. If a file is new (it does not exist in the destination folder), its filename is appended to the grid variable. The grid variable is then used in the second step, where a Grid Iterator executes the Data Transfer component for each file and transfers it to the destination.

Variables:  

Name | Var Type | Visibility | Description
S3_Source_Bucket | Job scalar | Public | S3 source bucket name
S3_Source_Path | Job scalar | Public | S3 source path
S3_Regex | Job scalar | Public | Regex for masking the source file set
S3_Dest_Bucket | Job scalar | Public | S3 destination bucket name
S3_Dest_Path | Job scalar | Public | S3 destination path
env_PrdURL | Environment | – | The folder name at which the folder branching for different environments starts (e.g. prod, non-prod)
S3_MaxLoop | Job scalar | Public | Maximum number of files to be checked in one pass
S3_FileName | Grid | Public | One-column array (column name = NAME)
LOOP_Filename | Job scalar | Private | Loop variable that receives the current grid variable value on every iteration so it can be passed to the Data Transfer component

The following is the Python script, where you can see how the variables are referenced in Python. The last line of the code assigns the collected values to the grid variable.

At the end, the Data Transfer component’s properties are presented. The Source URL and Target URL properties in particular show the power of variables: a combination of multiple job and environment variables is used to dynamically build the destination path (a worked example follows the table).

import boto3
import os
import re  # used by the regex expression stored in S3_Regex

s3 = boto3.resource('s3')
s3dest = boto3.client('s3')
s1 = []   # rows for the grid variable (one filename per row)
cc = 0    # number of new files found

myBucket = s3.Bucket(S3_Source_Bucket)
for object_summary in myBucket.objects.filter(Prefix=S3_Source_Path):
    head_tail = os.path.split(object_summary.key)
    # keep only objects directly under the source path (no sub-folders)
    if S3_Source_Path == (head_tail[0] + '/'):
        file = head_tail[1]
        # S3_Regex holds an expression evaluated against the filename
        # to mask the source file set
        if eval(S3_Regex):
            # skip files that already exist in the destination folder
            result = s3dest.list_objects_v2(
                Bucket=S3_Dest_Bucket,
                Prefix=(env_PrdURL + S3_Dest_Path + file))
            if 'Contents' not in result:
                s1.append([file])
                cc = cc + 1
                if cc >= S3_MaxLoop:
                    break

print("To transfer: " + str(cc) + " files")
# hand the collected filenames over to the grid variable
context.updateGridVariable('S3_FileName', s1)

Grid Iterator:

Parameter | Value
Grid Variable | S3_FileName
Grid Variable Mapping | NAME, LOOP_Filename

Data Transfer Object:

Parameter | Value
Source Type | S3
Source URL | s3://${S3_Source_Bucket}/${S3_Source_Path}${LOOP_Filename}
Unpack ZIP file | No
Target Type | S3
Gzip data | No
Target Object Name | ${LOOP_Filename}
Target URL | s3://${S3_Dest_Bucket}/${env_PrdURL}${S3_Dest_Path}
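
To make the substitution concrete: with the hypothetical values S3_Dest_Bucket = my-bucket, env_PrdURL = prod/, S3_Dest_Path = acme/ingest/ and a current grid row of data_20240513.csv, the Target URL resolves to s3://my-bucket/prod/acme/ingest/ and the Target Object Name to data_20240513.csv, matching the production folder structure shown earlier. Running the same job in the development environment only changes env_PrdURL (to non-prod/); nothing else in the job needs to be touched.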

Conclusion

The concepts described in this article can be applied to a variety of other use cases where you need to dynamically control the flow and behaviour of a data pipeline. Using variables, even combining multiple variables and static text to build a URL, connection string, username or any other property of a Matillion component, is a very powerful tool in the hands of a data engineer. I hope my experience encourages you to use variables when designing your Matillion pipelines and saves you time and resources in the process.


Ranato Stamatovski
Senior Consultant