Dynamic Tables in Snowflake
Since summertime is a little quieter, I finally had a chance to play around with Snowflake Dynamic Tables (DTs), generally available since late April 2024. Just in case it's useful to anyone, let's start with a summary generated by Perplexity - but you might as well skip to the first heading 😜
Ah, Snowflake Dynamic Tables - the magical unicorns of the data world! 🦄✨ Let me paint you a whimsical picture of these delightful creatures.
Imagine you're a data wizard in a mystical land called Snowflake. You've got mountains of data to transform, but you're tired of waving your wand and muttering complex incantations every time you need to update your tables. Enter Dynamic Tables, the enchanted tables that do the work for you!
These clever tables are like self-cleaning ovens for your data. You simply tell them what you want (using a dash of SQL magic), and they keep themselves fresh and up-to-date without you having to lift a finger. It's like having a team of industrious data elves working tirelessly behind the scenes!
Now, you might be thinking, "But I already have views! What's so special about these Dynamic Tables?" Well, my curious friend, while views are like windows into your data, Dynamic Tables are more like self-replenishing cookie jars. They store the results of your query, so you don't have to crunch the numbers every time you peek inside. It's like having your cake and eating it too - but with data!
One of the best things about Dynamic Tables is that they're incredibly low-maintenance pets in your data menagerie. You don't need to feed them complex schedules or train them with intricate task routines. Just set a target freshness (like "stay fresh within 5 minutes"), and they'll do their best to keep up. It's like having a dog that walks itself!
And here's a fun little secret: Dynamic Tables can form friendships with each other, creating a delightful data conga line. They can pass information down the line, each one transforming the data a little bit before passing it on to the next. It's like a game of telephone, but without the hilarious misunderstandings at the end!
So, if you're tired of herding cats (I mean, managing complex data pipelines), why not give Dynamic Tables a whirl? They're the closest thing to "set it and forget it" in the data world. Just remember, with great power comes great responsibility - and possibly a sudden urge to name all your tables after fairy tale characters. Snow White and the Seven Dynamic Tables, anyone?
Why use DTs?
Sure, orchestrating data transformations is a lot of fun, but once you have figured out all the dependencies and everything is running smoothly, you might as well start optimizing.
I personally would generally advise starting to build a model with a couple of (interdependent) views and some (regularly and fully refreshed) materialized tables at the end. This way you verify the model and its usefulness while staying very flexible, as you can modify the model pretty quickly. However, once the model is stable and you're no longer changing too much too frequently, this approach becomes inefficient.
Then you want to start building a transformation pipeline that only refreshes newly added or updated rows by delta-loading new data into the model. There are multiple reasonable approaches to this, and Snowflake DTs are one of them.
They imho have the unique benefit of taking care of building the delta for you. You won't have to think about how to identify new/updated rows; let Snowflake take care of that. Obviously, if this black-box approach doesn't work well, you might once again needlessly waste time and resources, so keeping an eye on the whole transformation pipeline (at least for some time) is very important ⚠️
The downside compared to, say, a pipeline set up with dbt or Coalesce is the missing data-quality tests. But who has time to define tests anyway, right Monte Carlo? 😅
How to?
Just like with the interdependent views mentioned earlier, you build your layered data cake from bottom to top - or from up- to downstream. But instead of layered views, you use layered tables that move the data along sequentially.
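As a minimal sketch of what such a layer can look like (table, schema and warehouse names below are made up for illustration), the upstream DT leaves its freshness requirement to its consumers - more on that under Freshness below - while the table at the end of the chain gets an explicit target lag:

CREATE OR REPLACE DYNAMIC TABLE stg_orders
  TARGET_LAG = DOWNSTREAM        -- refresh only when a downstream DT asks for fresher data
  WAREHOUSE = transform_xs
AS
  SELECT order_id, customer_id, amount, updated_at
  FROM raw.orders;

CREATE OR REPLACE DYNAMIC TABLE fct_daily_revenue
  TARGET_LAG = '30 minutes'      -- explicit freshness target at the end of the chain
  WAREHOUSE = transform_xs
AS
  SELECT DATE_TRUNC('day', updated_at) AS order_date, SUM(amount) AS revenue
  FROM stg_orders
  GROUP BY 1;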
Limitations
Now, DTs (currently) have some limitations, particularly when you want them to refresh incrementally: docs.snowflake.com/en/user-guide/dynamic-tables-limitations#support-for-incremental-refresh
Most of them can be circumvented by simply adding another (or sometimes multiple) intermediate DTs. An example: in the screenshot above, the table on the right is a union of 4 other tables. However, this only works for UNION ALL without conflicting ROW_IDs. Those conflicts can be avoided by having 3 incrementally refreshed intermediate DTs and then UNION-ALL-ing them together with the 4th table. Another limitation I stumbled across was the (as of now) missing support for lateral joins. However, with an intermediate DT filtering the array (instead of flattening it first and then filtering the flattened table), this worked just fine, too.
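A rough sketch of that union pattern (all names are made up): rather than letting one DT union all four sources directly, three of them go into intermediate DTs first and only the final DT does the UNION ALL:

CREATE OR REPLACE DYNAMIC TABLE int_source_a
  TARGET_LAG = DOWNSTREAM
  WAREHOUSE = transform_xs
AS
  SELECT id, payload FROM source_a;

-- ... analogous intermediate DTs int_source_b and int_source_c ...

CREATE OR REPLACE DYNAMIC TABLE all_sources
  TARGET_LAG = DOWNSTREAM
  WAREHOUSE = transform_xs
AS
  SELECT id, payload FROM int_source_a
  UNION ALL SELECT id, payload FROM int_source_b
  UNION ALL SELECT id, payload FROM int_source_c
  UNION ALL SELECT id, payload FROM source_d;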
Another limitation worth mentioning: outer joins (usually left joins) with anything but an equality predicate are not allowed. Hence, for (cartesian) range joins or the like, you would again build an intermediate DT containing the cartesian join as an inner join and then use this intermediate DT downstream with an equality predicate.
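Sketched out with made-up names: the range condition moves into an inner-join DT, and the outer join downstream then only needs an equality predicate:

CREATE OR REPLACE DYNAMIC TABLE int_event_period
  TARGET_LAG = DOWNSTREAM
  WAREHOUSE = transform_m
AS
  SELECT e.event_id, p.period_id
  FROM events e
  JOIN periods p
    ON e.event_ts >= p.period_start
   AND e.event_ts <  p.period_end;   -- range predicate is fine in an inner join

CREATE OR REPLACE DYNAMIC TABLE events_with_period
  TARGET_LAG = DOWNSTREAM
  WAREHOUSE = transform_xs
AS
  SELECT e.*, m.period_id
  FROM events e
  LEFT JOIN int_event_period m
    ON e.event_id = m.event_id;      -- the outer join now uses an equality predicate only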
Oh, and: the Snowflake documentation on limitations currently (2024-07-19) contains a minor error... You can only use 10 DTs as input for another DT, not 100 🤓 Otherwise the compiler will simply time out without any indication of the issue.
Freshness
For each DT you define a target freshness: if the data of the table wasn't refreshed for X hours/days, refresh it! For most upstream tables you can simply define this target to be dependent on downstream tables. This way, you only have to set the target freshness on critical tables, i.e. tables at the very end of the pipeline or maybe a few critical intermediate tables used by many other downstream tables.
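As a sketch (table names are made up), that could look like this:

-- the critical table at the end of the pipeline gets an explicit freshness target
ALTER DYNAMIC TABLE reporting.fct_daily_revenue SET TARGET_LAG = '1 hour';

-- everything upstream simply inherits the requirement from its consumers
ALTER DYNAMIC TABLE staging.stg_orders SET TARGET_LAG = DOWNSTREAM;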
Generally, if you are used to pipeline orchestration anyway, you can also just set the target value to anything (the max value currently is '90 days', but setting it to 'downstream' for all DTs in the pipeline will effectively prevent anything from refreshing automatically) and continue using the orchestration tool of your choice, refreshing the most downstream tables in the pipeline quasi-manually:
ALTER DYNAMIC TABLE [namespace.schema.tablename] REFRESH;
Or use a combination of both: automatic freshness for some DTs (e.g. intermediate DTs for which it's not super critical to have the latest data in the model) and on-demand refreshes when you need them.
Warehouse
As you also have to pick a warehouse when defining a DT, you might as well use that to your benefit: only use larger-than-XS warehouses for DTs that actually need them. Most DTs can probably refresh just fine with an XS warehouse, but if you need some heavy lifting (like those cartesian joins mentioned earlier), use a larger warehouse just for those DTs!
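For example (again with made-up names), the heavy range-join DT from above could get its own, larger warehouse while everything else stays on XS:

ALTER DYNAMIC TABLE int_event_period SET WAREHOUSE = transform_m;     -- heavy lifting
ALTER DYNAMIC TABLE events_with_period SET WAREHOUSE = transform_xs;  -- everything else stays small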
Summary
(this time it's mine, not a generated one)
DTs can indeed improve the performance of data models by delta-loading, and you also don't have to invest too much thought into setting them up - as the old saying goes with regard to Snowflake: it just works.
This and the lineage graph you get from using DTs are really awesome! However, with the (current) limitations, the lineage can get rather large and complex. In this example I turned 9 views into a total of 73 DTs. The performance of the pipeline is significantly improved, but try to explain this chart to a business user:
OK, I couldn't have explained the views, either, I guess...