Interview with Kent Graziano – on Snowflake, Covid-19 and the Future of Data, part 1
21.04.2020
When you work in a company that is immersed in all things data and you spend your days among some 40+ data world enthusiasts, a few things are certain. If there’s even a tiny bit of interest in IT in you, you will eventually fall for their world. You will absolutely need to learn at least a bit of SQL (enough, at least, to get the jokes you hear during lunch breaks). You will quickly learn what’s cool and what’s not, who’s in and who’s out, and certain names will start popping up: the rock stars of the data world.
This is how I first found out about Kent Graziano, Chief Technical Evangelist for Snowflake.
Along with the rest of the crew, I was honored to have him respond to our interview request. Due to the corona lockdown, we were all working from home and it seemed like the perfect time to talk about data, Snowflake and the impact it can have in these unsettling times. Hope you’ll enjoy this as much as we did.
(If you haven’t yet heard about Snowflake, here is a quick introduction: it’s the most mind-boggling cloud data platform you can currently find on the market. Founded in 2012 by three data warehousing experts, Benoit Dageville, Thierry Cruanes and Marcin Zukowski, it was built from the ground up for the cloud and as such revolutionized the data warehousing world. The rest is history. For more information, follow the link at the end of this blog.)
Kent, please tell us a bit about the path that led you to Snowflake.
I have been working in data for more than 30 years. I had already learned to program in high school and eventually got introduced to Oracle in the late 80s. I started there with a very early version of Oracle and from there got introduced to data warehousing. I got to work with Bill Inmon and Claudia Imhoff. I wrote my first book with Bill, and Claudia was the technical editor on the data warehouse section. I just fell in love with the data space and, more importantly, I saw data warehousing and business intelligence as part of the future and, apparently, I called that one right.
I worked for various organizations, leading data warehouse teams, worked as an independent consultant, as well as for some government agencies and various lines of business (telco, education, healthcare). I’ve been with Snowflake for a little over four years – I started only a few months after the first release.
In one of your interviews I’ve read that meeting Snowflake has re-energized you and that it got you thinking: “Okay, this is it. This technology solves the problems that my clients face every day.” Can you tell us something about those pain points you were constantly experiencing with former solutions?
There were two things that jumped out immediately with Snowflake. When I first heard about it, I was in the middle of an agile data warehouse project on SQL Server.
On the first day of the project the first thing that the VP asked me was: “How big a server do I need to buy?” This was hour zero, I had no information and there was no way I could answer that. I asked him why he needed to know now, and he replied: “It’s a six-week process to order new hardware, so we need to know now. It’s a three-month project, so we’ll be halfway through before we get the hardware.”
I asked him how much data they had. “Oh, we have a lot of data”, he answered. When I asked how much that was, he replied that he had no idea. It was not even known how many users there would be. It took 6 weeks before I could figure out just how big the source system was. It was less than 100 GB, so it was quite small for something that was going to feed the data warehouse.
Right in the middle of this project I found out about Snowflake, where you didn’t need to know how big the system was going to be, because it had elastic storage and you didn’t need to pre-allocate the space. Had I had Snowflake, I could have started designing as soon as I found out what the requirements were, and I would not have had to know how much data there was going to be.
Also because of the separation of compute from storage that Snowflake has, on the compute side I wouldn’t have had to know how many users there are going to be. I could easily design virtual warehouses to be of any size. And if I was wrong at the beginning, I could easily add more compute, or I could split them up and allocate a separate virtual warehouse to the ETL process and another one for dashboard users. So again, that question that the VP asked me on the first day of our project – I wouldn’t need to know the answer.
Really, already within the first 15 minutes of my first Snowflake presentation I was just marveling at it and thought – if I’d known this 6 weeks ago, I could be much farther along with my project. So those two things jumped out immediately.
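Editor's note: to make the idea of separate virtual warehouses a bit more concrete, here is a minimal sketch in Python that only builds the Snowflake SQL statements one might run. The warehouse names (etl_wh, dashboard_wh), the sizes and the auto-suspend value are our own assumptions for illustration and do not come from the interview. Nothing here connects to Snowflake; the statements are simply printed.

```python
# Illustrative sketch: build Snowflake DDL for two independent virtual warehouses.
# Warehouse names, sizes and the auto-suspend value are hypothetical examples.

def create_warehouse_sql(name: str, size: str, auto_suspend_s: int = 60) -> str:
    """Return a CREATE WAREHOUSE statement for one independent compute cluster."""
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} "
        f"WITH WAREHOUSE_SIZE = '{size}' "
        f"AUTO_SUSPEND = {auto_suspend_s} AUTO_RESUME = TRUE"
    )


def resize_warehouse_sql(name: str, size: str) -> str:
    """Return an ALTER WAREHOUSE statement that changes the compute size later on."""
    return f"ALTER WAREHOUSE {name} SET WAREHOUSE_SIZE = '{size}'"


statements = [
    # One warehouse for the ETL process ...
    create_warehouse_sql("etl_wh", "LARGE"),
    # ... and a separate one for dashboard users, so the two do not compete.
    create_warehouse_sql("dashboard_wh", "SMALL"),
    # If the first guess was wrong, the size can be changed afterwards.
    resize_warehouse_sql("dashboard_wh", "MEDIUM"),
]

for statement in statements:
    print(statement + ";")
```

The point of the sketch is that sizing becomes a setting you can change later rather than a hardware order you have to get right on day one.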
And then there was this big project for a healthcare organization, which I had prior to this where we asked for 10 terabytes of disk from the infrastructure team. They laughed at us and said: “No one has ever used 10 terabytes in your department, there’s no way you need that much.” So, they got me 2 terabytes. In 3 months, we filled them up and we were stuck. And we had to wait 6 weeks to get more space. Again, Snowflake solves that problem.
As a Data Architect and as a Data Warehouse Designer, I would no longer have to struggle with infrastructure issues, like how much disk do I need, ordering it far enough in advance and later on running the risk of running out of space.
Then, if I go all the way back to some of my early projects, in the early days of data warehousing, we always had to run our ETL at night. No matter how big a server there was, if we tried to run ETL during the day, then the queries wouldn’t run, so the users weren’t happy. Resource contention has always been a problem. Industry best practice was to run your ETL sometime between midnight and 6 AM and hope it all got done by the morning so the users could start using it. Even when I had the best systems, and the system administrators and the infrastructure team would try to convince me that there was more than enough CPU, it was never true. Whenever we ran ETL, everything else slowed down.
Even 5 years back people started to realize the importance of near real time data feeds. They needed near real time analytics, they started looking at things in the middle of the day and wanted to push things to their customers. But there was no way you could do that in a traditional world. No matter what everybody was doing, it wasn’t working. When I saw what Snowflake was capable of, what the multi-cluster architecture could do for everybody, I immediately knew that it would make a difference to organizations.
As an architect I didn’t need a system administrator or an expert DBA of any kind, and I didn’t need to be an infrastructure or a network person. I could be an architect. I could look at the business problem and design the system to solve that problem, and then I could build that system and get things started a lot faster. When you are trying to run an agile project that delivers something quickly, you normally have to do all those up-front jobs first, in what would be sprint zero of an agile project. Setting up all that infrastructure at the beginning – all that was simply gone.
If we touch on the administration part, even back at the start, you could get a Snowflake account set up in 24 hours and today you only need 15 minutes. Even with 24 hours, Snowflake was faster than anybody in the world. Nobody else could get you a data warehouse up and running so fast.
The other part of it was the VARIANT data type and being able to ingest semi-structured data, because there was certainly a lot of demand for it and some organizations were already starting to do data lakes and to join that data with the relational data. For example, Oracle, SQL Server and a few others had some minimal capability to do it, but it still required a lot of tuning, a lot of work and it was a multi-step process. The data latency was again the problem – there was no way to do data processing in real time.
And then, if we come back to my first presentation of Snowflake, where I saw how they ran queries against some JSON documents: they wrote everything with SQL. Snowflake added just a small extension to the syntax and they could pull data out. That solved the question about big data for me. Before, I had been really trying to avoid it because I didn’t see how that could work, particularly in the Oracle space that I was working in at the time. You had to have a big data machine, an Oracle machine and a lot of other things to put those two together. Snowflake really solved that problem, so now we can easily take semi-structured data, put it in a table, write queries against it and join it with relational data. And that got me probably even more excited than the architecture itself – that one little piece about data types. When I joined Snowflake, it was probably one of the first things I wrote – a blog post about how to do that, which later turned into an e-book, a webinar and a bunch of other things. It’s probably one of the most popular aspects, at least for the people who have followed me through these years, since it’s solving quite a few problems in one nice package.
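Editor's note: for readers who have not seen this in action, the sketch below shows roughly what "put JSON in a table and join it with relational data" means. The SQL string uses Snowflake's colon path notation and `::` cast on a VARIANT column, but the table names (raw_events, customers) and the JSON fields are hypothetical and invented for this example. The Python underneath is only a plain emulation of what such a query would return on three sample documents; it is not how Snowflake works internally.

```python
import json

# Hypothetical Snowflake query that the emulation below mirrors.
# Table names and JSON field names are assumptions made for this example.
SNOWFLAKE_SQL = """
SELECT c.name, e.payload:device.type::string AS device_type
FROM raw_events e
JOIN customers c ON c.id = e.payload:customer_id::number
"""

# Semi-structured side: one JSON document per row, as it might sit in a VARIANT column.
raw_events = [
    '{"customer_id": 1, "device": {"type": "phone"}}',
    '{"customer_id": 2, "device": {"type": "laptop"}}',
    '{"customer_id": 1}',  # this document has no "device" object
]

# Relational side: customer id -> customer name.
customers = {1: "Acme", 2: "Globex"}


def get_path(doc, path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


rows = []
for raw in raw_events:
    doc = json.loads(raw)
    customer_id = get_path(doc, "customer_id")
    if customer_id in customers:  # inner join on the customer id
        rows.append((customers[customer_id], get_path(doc, "device.type")))

for row in rows:
    print(row)
```

Note that the third document still joins to its customer; the missing device simply comes back as an empty value, which is the kind of flexibility that makes semi-structured data comfortable to query.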
Our experience is the same and when we present Snowflake capabilities, people are always amazed. But often it comes down to the question of security. They feel insecure putting all their precious data in a cloud that is seen as this unknown entity, which is something totally different from the traditional solutions that are there – visible, and therefore one can have the feeling they can be more easily controlled. Can you tell our readers a bit about how security is taken care of with Snowflake?
Yeah, that has been a huge question, especially when I started off with Snowflake – everybody was worried about the cloud and the security. And Snowflake’s security story is as impressive as its architecture. The work our security team has done, right off the top, something that nobody else has done, is having the data encrypted automatically. When you upload data from on-prem, we encrypt it on the client side and encrypt it along the stream. It’s industry-strength 256-bit AES encryption. The great thing about it is that it’s built in and it’s how it should be done (with a lot of other products this is only an add-on layer).
The next comment I’d get is that this will surely slow down queries. But it doesn’t – now, how can that be? And we’re back at our founders who, when they designed Snowflake, knew it had to be secure, because it was going to be in the cloud. Our founders had long years of experience and they knew what the challenges were, because they had been working with them for a very long time. They also knew what the concerns were going to be, so they were aware from the beginning that security needed to be built in – to the point that encryption is part of Snowflake.
It’s very hard for people to believe until they see it. It’s a multi-tiered encryption of keys encrypting keys, encrypting keys, all the way down to the micro-partition files that we store under the covers, and each of those is encrypted at that level. We also do a key rotation every couple of weeks, so there are different keys. Even in the same table, as it’s growing through the years, the individual files where data is stored under the covers are using different keys as well. So, that’s just an added level of security. And on top of it you can do federated authentication and we do multi-factor authentication. For example, when you log in to your Snowflake account you have to authenticate on your smartphone and put in the code. There are lots of layers of security. One of the options is Tri-Secret Secure, which allows a company to control half of the encryption keys themselves. So, even if it’s encrypted, there has to be an encryption key to decrypt it – Snowflake manages that for everybody but if you don’t want us to do all of it, you can manage half of it. It’s one of the advanced features in Business Critical Edition. Alongside, we’ve also added numerous security certifications. There’s really no way to have Snowflake not encrypted.
Stay tuned for Part 2, where we’ll talk about DWH automation tools, the future of data and warrior mentality.
Ana Mikoš, Marketing assistant @ In516ht
For more inspiration follow the links:
Learn more about Snowflake @ https://www.snowflake.com/
Follow Kent Graziano on his Data Warrior blog @ https://kentgraziano.com/
Or stay with us and learn how the In516ht team can help you jump on a cloud.