Moving On-Prem Data Warehouses to Snowflake: Mastering the Art of Data Migration Phase18.05.2023
We started our analytical journey more than 15 years ago, providing end-to-end solutions to our clients – from data strategy, business analytics, data warehouse implementation and data integration, to business intelligence or AI use cases. 6 years ago, we started working with Snowflake as our preferred cloud data platform. We have expanded our services to include complex data warehouse migrations to cloud platforms so that today 90% of our migration projects are focused on Snowflake.
Our migration methodology has evolved over all these years, and we have added various in-house assets that help us accelerate the process on large projects and minimize risk for our clients. In this post, we will focus on questions we receive from potential customers regarding the specific phase of migration – initial data migration and ongoing data synchronization.
In the bellow picture, we will focus on these two topics:
One short story about how it all started
Our first Snowflake project was not a small tech-savvy customer or some modern use-case, but we went all-in with one of our largest regional clients that was having problems with their existing new data warehouse platform, and after all the POCs, we got our first Snowflake customer. Then the real problems started because they had a very large IBM mainframe environment as a source system, to which we had to transfer more than 250 GB of data per day from thousands of tables. To make things more interesting, in the next 5 months they would also start using SAP Hana, and that was to become another large on-premises source for us. At the time, we were testing various existing solutions, but there were none focused on on-premise to cloud data transfers, while new vendors focused on cloud-to-cloud, such as Fivetran, were emerging.
The IBM mainframe database admins told us that we cannot »kill« the database. This feedback prompted our clever data engineers to develop a fully scalable platform that can scale linearly with resources and have the ability to send X number of queries against the specific table to enable maximum partitioning. The extraction and ingestion process worked so well that we produced some serious spikes in IBM Mainframe consumption, but that is another story :).
This platform is called antFarm because it behaves like ant colonies – it has a queen that coordinates all the workers and each worker can do different things – from extracting data, generating CSV files, encrypting, loading data into public cloud storage (S3, Azure Blob, GCP) or directly into Snowflake (in the last 6 years, we have also added support for other cloud providers as we migrated some solutions to other vendors as well). We use it now on all of our migration projects and various POCs as we can, with minimum setup, generate and transfer 1000s of tables and have all the flexibility that we need for the on-premises world – think in terms of writing optimized queries with an agreement with the client’s senior DBA.
Data migration from large Oracle Exadata customer
We migrated an Oracle Exadata database with approx. 50 TB of data and the goal was to do this over the weekend, to go live with the new solution as soon as possible. Of course, we ran numerous loads during the migration process so that multiple teams could work on other migration activities (from code conversion to ETL modernization to BI migration, etc.). We tested different configurations:
- Queen API on AWS and Workers on AWS
- a combination of Queen API on AWS and Workers on local Linux machines (this is usually the better approach, as the extraction phase can be the slowest and it is a good practice to generate files close to the source system).
In both configurations, the storage of extracted CSV files from Oracle was on S3. The EC2 virtual machine on AWS had 32 CPUs and 32 GB of RAM. In the end, the version with Linux on-premise was about 20% faster and later we deployed two instances to scale the load horizontally.
Another specific was optimizing the largest tables with more than 1TB of data. The existing partitions on Oracle tables were defined with about 50 GB of data, and in joint discussions with the client’s DBAs, we decided to do »double” partitioning on reads – first reading the data using a specific Oracle partition and then splitting the partition with an additional 10 queries for each partition to maximize the reads from Oracle and have similar chunks of data to efficiently load it later into Snowflake.
$ExtPartitionfield = (select $ExtPartitionfield from (select $ExtPartitionfield, row_number() OVER(PARTITION BY null order by $ExtPartitionfield) as indeks from ( select column_value AS $ExtPartitionfield from [sys.odcinumberlist($ExtValues), sys.odcidatelist($ExtValues),…])) where indeks = (TRUNC(($partition_num-1)/($NumOfPartitions/$ExtNumOfpartitions))+1)) [AND ABS(MOD(NVL($PartitionField,1) ,($NumOfPartitions/$ExtNumOfpartitions))) = MOD($partition_num, ($NumOfPartitions/$ExtNumOfpartitions))] $ExtPartitionfield => name of a column table is partitioned on $ExtValues => list of all partition values $ExtNumOfpartitions => count of all partition values $PartitionField => name of a column on which we can do antFarm partition $NumOfPartitions => number of antFarm partitions you want to create (it must be a multiplier of $ExtNumOfpartitions) $partition_num => sequence number of a partition
For the initial migration, we decided to classify the existing data as cold (does not change and we can load it beforehand without time pressure) and hot (can change and we need data freeze with respect to changes). The cold part had a size of 42 TB and the hot part of 8 TB. For the cold part, it took us about 8 days as there were also ongoing processes for the existing data warehouse and other business activities. For the hot part, we did a weekend freeze with the customer and transferred the data in 1.5 days, so that the new regular delta data integration process was running on Monday. Over the weekend, we transferred around 2000 tables, with a total of 56B+ rows. Based on all the measurements, we could get even more performance, if the customer could provide more than a 1 Gbit network line and isolate the network, but we had to work with these constraints.
Initial and daily data extraction from Microsoft SQL Server
One of our customers uses a large group of Microsoft SQL Servers on-premises for their core applications. The requirement was to efficiently sync the data to Snowflake using the Microsoft Change Data Capture (CDC) possibility. AntFarm has the ability to track the latest CDC IDs for all tables, so we can properly load incremental data if the customer enables this feature on the source database.
Rapid migration of data from a Microsoft SQL server is also a particular challenge as Microsoft SQL can have some obstacles when 100 queries are made for data extracts. After several trial-and-error phases, we came to a specific configuration of parallelism that works optimally without stopping the existing data warehouse. Extraction also needs to consider collations and specific data types, and that’s something our platform can configure and do automatically.
The initial sync of 4 main tables with around 1.3 TB of data, took around 6 hours and the daily increment now takes between a few and 30 minutes, depending on the size of the increment and the performance on SQL side to get the required CDC changes.
To get optimal performance, we need to adjust the parallelism to have antFarm running on Linux server at 100%. If there are bottlenecks on SQL Server side, we do not need so much parallelism / HW for the antFarm server.
Data migration from a data warehouse on Microsoft SQL Server
Another example is our customer with 5 Microsoft SQL databases that needed to be migrated from the existing data warehouse. Each database was about 1 TB in size and contained between 300 and 550 tables.
We have set up antFarm on AWS on a large EC2 instance of c6i.8xlarge with 64 GB RAM and 32 CPU cores. After the initial tries it became clear that the bottleneck was the existing SQL server, as we could not achieve all the parallelism that such EC2 instance can handle. This is why we later downsized the resources to not spend too much on AWS costs.
We needed around 6 hours for each database, and the total size ended up being about 5 TB, which is about 30 hours due to the network and the existing Microsoft SQL power.
Daily data extraction from IBM Mainframe and SAP Hana
Sometimes our customers also want to use antFarm for daily load from their core systems. This is something we can easily set up with antFarm, so we can automate the whole process and adding new tables only takes a minute of configuration.
Our first large migration customer uses antFarm daily to transfer over 300 GB of data from their IBM mainframe and SAP Hana (on weekends we transfer even more, since they have the ability to change certain ledgers backwards for a long period of time). The internet line is 100Gbit, but of course we are not alone on the network at night. It currently takes us 1 hour to transfer approximately 1.2 billion rows of data from multiple applications on IBM mainframe and SAP systems.
We perform incremental loading using custom SQL with WHERE clauses for specific tables that are dynamically calculated and applied. In addition, the client’s team can override the settings and specify their own logic for specific tables if needed.
They are also an official Snowflake reference, which can be found at: https://resources.snowflake.com/case-study/petrol-delivering-superior-data-driven-customer-service