Snowflake and Data Mesh
5.05.2022
More than ever, the ability to use data for decision-making is critical to company success. Even so, many companies still do not give their employees easy access to the data they need. According to Zhamak Dehghani, the originator of the data mesh concept, we must start thinking outside of the box because the traditional approach to collecting and managing data is no longer sufficient.
For decades, there has been a divide between operational and analytical data with ETL as the intermediary process to get data from operational systems into the analytical data warehouse. ETL, which has always been primarily in the hands of IT developers, is perceived as a bottleneck to delivering timely analytical data. Furthermore, dimensional data models are not well suited for machine learning models that have become essential.
To overcome this, the data lake emerged around 2010. The idea of the data lake is to store vast amounts of raw and semi-structured data in object stores so that various consumers can use the data according to their needs. In practice, however, data was often dumped into the lake with little thought given to its organization, which made it hard to find and access. As a result, the data lake did not live up to its potential.
With the proliferation of cloud data platforms, such as Snowflake, we have an immense number of tools at our disposal that should allow users to access their own data as they see fit. What is still missing is a paradigm shift in architecture, organization, and technology. The data mesh architecture has emerged as a new framework to supply these missing pieces. It encompasses four principles that are elaborated in the sections below.
The Snowflake Data Cloud connects organizations and data teams with the data they need, when they need it, without silos or complexity. Snowflake has been built for ease of use, performance at scale, and governed data sharing, all features that are well aligned with the data mesh principles.
Domain ownership
Traditionally, data warehouse implementations split teams by technology: there are ETL teams, data governance teams, reporting teams, and so on. Each team focuses on a particular technological aspect, but lacks a business understanding of the data it sources from the domains, and delivers that data to users without fully understanding their needs.
An alternative to decomposing the implementation by technology is to hand ownership of the data to the people who are most familiar with it, that is, the domain that produces the data. Accountability is pushed towards the domains themselves, so that data is curated, cleansed, reshaped, and served as a reusable data structure at the source.
Data as a product
Each domain can build, maintain, and share one or more data products. This approach immediately raises the question: how do we avoid data silos? The data mesh answers this question by imposing criteria that must be fulfilled by a data product. For example, a data product must be discoverable, understandable, interoperable, valuable, trustworthy, accessible, and secure.
The data product is a unit of architecture that includes the code that sources, transforms, serves, and shares the data, as well as the metadata that defines it. The owner of the data product has long-term responsibility for the accuracy, quality, growth, and usage of the data. That responsibility includes deprecating, combining, or splitting data products when they no longer serve their original purpose.
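To make the criteria more concrete, here is a minimal Python sketch of a data product descriptor with a check for unmet criteria. It is an illustration only: the `DataProduct` class, its fields, and the `unmet_criteria` helper are hypothetical names invented for this example and are not Snowflake objects or APIs. Criteria such as "valuable" cannot be checked mechanically and are left out.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor for one data product (not a Snowflake object)."""
    name: str
    domain: str
    owner: str                                           # team with long-term responsibility
    description: str = ""                                # supports "understandable"
    schema: dict = field(default_factory=dict)           # column -> type, supports "interoperable"
    tags: list = field(default_factory=list)             # supports "discoverable"
    quality_checks: list = field(default_factory=list)   # supports "trustworthy"
    access_policy: str = ""                              # supports "secure"


def unmet_criteria(product: DataProduct) -> list:
    """Return the names of the criteria this descriptor does not yet satisfy."""
    checks = {
        "discoverable": bool(product.tags),
        "understandable": bool(product.description.strip()),
        "interoperable": bool(product.schema),
        "trustworthy": bool(product.quality_checks),
        "secure": bool(product.access_policy),
    }
    return [name for name, ok in checks.items() if not ok]


# Example: a product that still lacks quality checks and an access policy.
orders = DataProduct(
    name="daily_orders",
    domain="sales",
    owner="sales-data-team",
    description="One row per order, refreshed daily.",
    schema={"order_id": "string", "amount": "decimal"},
    tags=["sales", "orders"],
)

print(unmet_criteria(orders))
```

A check like this could run before a product is published, so that incomplete products are caught by the owning domain rather than by consumers.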
Snowflake’s platform allows domain teams to build their data products independently and then share them with each other. Each domain team can choose which data objects they want to share within their data product and publish the descriptions in a Snowflake Data Exchange, which serves as an inventory of all data products in the data mesh. Users can search that inventory to discover data products that they need. Access to data products can be obtained either instantaneously or through a request-and-approval process between the data producer and the consumer.
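The publish, discover, and request flow described above can be sketched in a few lines of Python. This is an in-memory stand-in written for illustration, assuming a simplified model of the inventory; `Listing`, `Catalog`, and their methods are hypothetical names and do not correspond to the Snowflake Data Exchange interface.

```python
from dataclasses import dataclass, field


@dataclass
class Listing:
    """Hypothetical catalog entry describing one published data product."""
    name: str
    domain: str
    description: str
    auto_approve: bool = False                    # True models instantaneous access
    consumers: set = field(default_factory=set)   # consumers with access
    pending: set = field(default_factory=set)     # requests awaiting approval


class Catalog:
    """In-memory stand-in for an inventory of data products."""

    def __init__(self):
        self._listings = {}

    def publish(self, listing: Listing) -> None:
        self._listings[listing.name] = listing

    def search(self, term: str) -> list:
        """Return names of listings whose name or description contains the term."""
        term = term.lower()
        return sorted(
            item.name
            for item in self._listings.values()
            if term in item.name.lower() or term in item.description.lower()
        )

    def request_access(self, name: str, consumer: str) -> str:
        listing = self._listings[name]
        if listing.auto_approve:
            listing.consumers.add(consumer)
            return "granted"
        listing.pending.add(consumer)
        return "pending"

    def approve(self, name: str, consumer: str) -> None:
        """Called by the producer; raises KeyError if no such request is pending."""
        listing = self._listings[name]
        listing.pending.remove(consumer)
        listing.consumers.add(consumer)

    def has_access(self, name: str, consumer: str) -> bool:
        return consumer in self._listings[name].consumers


catalog = Catalog()
catalog.publish(Listing("daily_orders", "sales", "One row per order", auto_approve=True))
catalog.publish(Listing("customer_profiles", "crm", "Customer master data"))

print(catalog.search("order"))
print(catalog.request_access("customer_profiles", "marketing"))
```

In this sketch the producer decides per listing whether access is instantaneous (`auto_approve=True`) or goes through approval, which mirrors the two access paths mentioned above.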
Self-serve data platform
Domain teams must be empowered to build and maintain their own data products independently. They need a common platform and a set of tools that are easy to use even for people without a technical data infrastructure background. There should be no need for niche technological skills or specialized data engineering resources, and domain teams should not have to concern themselves with infrastructure maintenance or resource limitations. Their focus must be on building the data product as an architectural building block to be shared and reused.
Snowflake’s platform enables domain teams to build their data products by providing ease of use, near-zero maintenance, and instantaneous scaling of resources. Each domain team can deploy and scale its own resources according to its needs without impacting others and without involving a dedicated infrastructure team. The platform supports many workloads, so all types of data (structured, semi-structured, and unstructured) can be loaded and made available in the data products.
Federated governance
Data products must have governance in place to ensure that global standards are followed while still allowing the domain teams to work independently. Federated governance standards must be set to ensure data privacy, access control, data protection, and compliance. In addition, metadata and documentation standards should be defined for every domain to follow, so that the data products remain discoverable and usable.
Standards ensure that data products from different domains can be combined easily. There should be a balance between upholding global governance policy standards and allowing individual domain teams the freedom to interpret how standards are to be implemented when creating and sharing their data products.
Snowflake provides many of the native cross-cloud governance controls needed to support federated governance. These include tracking of object dependencies, data lineage, metadata tags for data products, row-level access control, dynamic data masking for sensitive information, and other controls. In Snowflake, governance controls such as tags or access policies are defined in one step and applied to data objects in a separate step. This enables organizations to define common governance standards for the data mesh while allowing individual domain teams to apply these standards to the data in their domain as needed.
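The separation between defining a control and applying it can be illustrated with a small Python sketch. It models the idea only and is not Snowflake syntax or a Snowflake API: the registries, the function names, and the `PII_READER` role are assumptions made up for this example.

```python
from typing import Callable, Dict, Tuple

# (value, role) -> the value as that role is allowed to see it
MaskFn = Callable[[str, str], str]

# Central side: policies are defined once by the governance group.
POLICIES: Dict[str, MaskFn] = {}

# Domain side: each domain records which of its columns a policy applies to.
ATTACHMENTS: Dict[Tuple[str, str], str] = {}   # (table, column) -> policy name


def define_policy(name: str, fn: MaskFn) -> None:
    POLICIES[name] = fn


def mask_email(value: str, role: str) -> str:
    """Show the full address only to the (hypothetical) PII_READER role."""
    if role == "PII_READER":
        return value
    _, _, mail_domain = value.partition("@")
    return "***@" + mail_domain


def apply_policy(table: str, column: str, policy_name: str) -> None:
    """Attach an existing central policy to a column owned by a domain."""
    if policy_name not in POLICIES:
        raise KeyError(f"unknown policy: {policy_name}")
    ATTACHMENTS[(table, column)] = policy_name


def read_row(table: str, row: Dict[str, str], role: str) -> Dict[str, str]:
    """Return the row as seen by the given role, masking attached columns."""
    result = {}
    for column, value in row.items():
        policy = ATTACHMENTS.get((table, column))
        result[column] = POLICIES[policy](value, role) if policy else value
    return result


# The governance group defines the standard once ...
define_policy("email_mask", mask_email)
# ... and a domain team applies it to its own data.
apply_policy("sales.customers", "email", "email_mask")

row = {"id": "1", "email": "ann@example.com"}
print(read_row("sales.customers", row, "ANALYST"))
print(read_row("sales.customers", row, "PII_READER"))
```

The point of the design is that the masking rule lives in one place, while the decision about which columns it protects stays with the domain that owns the data.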
In summary, Snowflake and Data Mesh are an excellent fit. Snowflake provides individual domain teams with easy access to storage, compute power, performance, scalability, security, and governance. The data domain teams utilize these resources to build data products that they can readily share with other teams.