Snowflake + Polaris + PyIceberg/DuckDB/Polars + AZURE

Is the combination of PyIceberg and Azure ready for multi-engine compute, yet?


UPDATE 2025-05-12: With Snowflake bundle 2025_03 this is finally possible, as it uses a more modern scheme when creating Iceberg tables 🥳


TL;DR: no.


Ever since Julien Hurault talked about his vision for a multi-engine data stack last year, I've been waiting for a quiet week to give it a shot. Then February came around, and here we are.

The idea sounds rather intriguing: compute is fairly expensive in Snowflake, so it can be preferable to run certain operations outside of it, e.g. in a local DuckDB or Polars process. However, replicating data just to work with it on a different engine causes friction - and, thanks to Iceberg, that can be avoided!

Of course, Julien has a post about this, too.

To store data in an Iceberg table, a catalog is required. If both Snowflake and PyIceberg can access that catalog, the data in the Iceberg table can be processed either with Snowflake or with some other engine via PyIceberg, e.g. DuckDB or Polars. A Snowflake-managed Polaris catalog (which for some reason Snowflake now calls “Open Catalog”) is fairly easy to spin up. Buuut…

Sure, you could use Spark, too, but who has the resources to maintain that?
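
To make the setup a bit more concrete, here's roughly what the PyIceberg side looks like - a minimal sketch, assuming an Open Catalog account reachable via its REST endpoint; the URI, credential, warehouse and table names below are placeholders:

```python
# Minimal sketch: read an Iceberg table from a Polaris/Open Catalog account and hand
# it to DuckDB and Polars. All URIs, credentials and names are placeholders.
import duckdb
import polars as pl
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "open_catalog",
    **{
        "type": "rest",
        "uri": "https://<orgname>-<account>.snowflakecomputing.com/polaris/api/catalog",
        "credential": "<client_id>:<client_secret>",
        "warehouse": "my_open_catalog",   # catalog name inside the Open Catalog account
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)

# Pull the Iceberg table once as Arrow ...
tbl = catalog.load_table("analytics.raw_events")
arrow_table = tbl.scan().to_arrow()

# ... and process it with whichever local engine you like.
duckdb.sql("SELECT count(*) AS n FROM arrow_table").show()  # DuckDB scans the Arrow table in place
df = pl.from_arrow(arrow_table)                             # same data as a Polars DataFrame
```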

Challenge one: write directions

To be able to write data from Snowflake to a Polaris-cataloged Iceberg table, the catalog must be managed by Snowflake; basically, Snowflake then synchronizes a Snowflake table to an Iceberg table automatically.

This still sounds a lot like duplicating the data to me, but at least someone else (Snowflake) manages it.
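
For the record, the Snowflake side boils down to something like the sketch below, assuming an external volume pointing at the Azure storage account and a catalog integration for Open Catalog already exist. All object names are made up, and the CATALOG_SYNC clause is how I understand the sync to Open Catalog is configured - check the docs before copying this:

```python
# Rough sketch of the Snowflake side; all names and credentials are placeholders and the
# CATALOG_SYNC parameter is my understanding of how the Open Catalog sync is wired up.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<orgname>-<account>",
    user="<user>",
    password="<password>",
    role="SYSADMIN",
    warehouse="COMPUTE_WH",
)

conn.cursor().execute(
    """
    CREATE ICEBERG TABLE analytics.public.raw_events (
        event_id  STRING,
        payload   STRING,
        loaded_at TIMESTAMP_NTZ
    )
    CATALOG = 'SNOWFLAKE'                    -- Snowflake manages the Iceberg metadata
    EXTERNAL_VOLUME = 'azure_iceberg_vol'    -- Parquet + metadata files land in Azure storage
    BASE_LOCATION = 'raw_events'
    CATALOG_SYNC = 'open_catalog_sync_int'   -- mirror the table into the Open Catalog account
    """
)
```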

However, PyIceberg can’t write to Snowflake-managed Polaris tables, so for the other direction, we’ll need a second, Polaris-internal catalog, from which Snowflake can then read.

OK, two catalogs then: one for data eventually leaving Snowflake, one for data to be read by Snowflake. Since we’d use two tables (one with the raw data to be processed, one with the processed data) anyway, this adds a little overhead, but nothing too serious.
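
In PyIceberg terms, the two catalogs would look something like this (again just a sketch with made-up names; the Snowflake-managed one is only ever read from outside Snowflake, the Polaris-internal one is only ever written to):

```python
# Sketch of the two-catalog setup; URIs, credentials and names are placeholders.
from pyiceberg.catalog import load_catalog

common = {
    "type": "rest",
    "uri": "https://<orgname>-<account>.snowflakecomputing.com/polaris/api/catalog",
    "credential": "<client_id>:<client_secret>",
    "scope": "PRINCIPAL_ROLE:ALL",
}

# Catalog 1: Snowflake-managed ("external" from Polaris' point of view) - data leaving Snowflake.
outbound = load_catalog("snowflake_managed", **common, warehouse="snowflake_managed_catalog")

# Catalog 2: Polaris-internal - data going back into Snowflake via a catalog integration.
inbound = load_catalog("polaris_internal", **common, warehouse="internal_catalog")

raw = outbound.load_table("analytics.raw_events").scan().to_arrow()  # read the raw data
# ... process locally, then append the result to a table in `inbound` (see the last sketch below)
```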

If we want to use Snowflake-internal processing as a fallback, however, we now have two potential processed-data tables: the Snowflake-internal table or the Iceberg table. Hence: either the pipeline knows which of the two tables holds the latest processed data during the current run, or we can really only use this workflow for ad-hoc processing. Still, not a blocker, IMHO.

Challenge two: Azure

Since my Snowflake instance is Azure-hosted, I can only write Iceberg tables to Azure storage accounts with a Snowflake-managed Polaris catalog. Not too bad; the process is well documented.

However, Polaris creates Iceberg tables using the Azure-specific wasbs scheme, which PyIceberg version 0.8.1 can’t read. I expect this to change in version 0.9.0, but right now it’s a huge blocker.
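
You can see the problem by just looking at the location Polaris hands back for such a table (a sketch with placeholder names; the exact exception depends on which FileIO implementation PyIceberg picks):

```python
# Sketch of the blocker; catalog, credential and table names are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "open_catalog",
    **{
        "type": "rest",
        "uri": "https://<orgname>-<account>.snowflakecomputing.com/polaris/api/catalog",
        "credential": "<client_id>:<client_secret>",
        "warehouse": "snowflake_managed_catalog",
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)

tbl = catalog.load_table("analytics.raw_events")
print(tbl.location())
# -> wasbs://container@account.blob.core.windows.net/...  (the Azure-specific scheme)

# Actually reading the data then fails, because PyIceberg 0.8.1's FileIO
# implementations don't map the wasbs:// scheme to a filesystem:
tbl.scan().to_arrow()  # raises here with 0.8.1
```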

It’s possible to circumvent this issue by simply not using an Azure-hosted Snowflake instance and Polaris catalog, or by replicating the data in Snowflake across regions and clouds to an AWS-hosted Snowflake account and Polaris catalog, but then data replication is in play again. And if replication isn’t an issue, why not simply export the Snowflake data to a local CSV/Parquet file and work with that instead?

The reverse direction, however, works rather smoothly: PyIceberg can write to Azure Blob Storage (using a different scheme than wasbs) with a (non-Snowflake-managed) Polaris catalog, and Snowflake can read from it. But then you’d either have two Polaris catalog accounts (one on AWS to get data out of Snowflake, one on Azure to get data into Snowflake), or you’d have to replicate the data in the AWS-hosted Snowflake account back to the Azure-hosted account. To me, this sounds a lot like: let’s wait for PyIceberg 0.9.0 😅
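
For completeness, the working direction looks roughly like this - a sketch, where names, credentials and the adls.* properties are placeholders, and depending on your setup the catalog may vend the storage credentials itself:

```python
# Sketch of the working direction: PyIceberg writes to a Polaris-internal catalog backed
# by an Azure storage account, Snowflake reads the result via a catalog integration.
# All names, credentials and the adls.* keys are placeholders.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris_internal",
    **{
        "type": "rest",
        "uri": "https://<orgname>-<account>.snowflakecomputing.com/polaris/api/catalog",
        "credential": "<client_id>:<client_secret>",
        "warehouse": "internal_catalog",
        "scope": "PRINCIPAL_ROLE:ALL",
        # FileIO credentials for the Azure storage account behind the catalog
        "adls.account-name": "<storage_account>",
        "adls.account-key": "<storage_account_key>",
    },
)

# The locally processed result, e.g. coming out of DuckDB or Polars as Arrow
processed = pa.table({"event_id": ["a", "b"], "event_count": [3, 5]})

# Assumes the "processed" namespace already exists in the internal catalog
tbl = catalog.create_table_if_not_exists("processed.daily_counts", schema=processed.schema)
tbl.append(processed)  # writes Parquet + Iceberg metadata with an abfss://-style location

# On the Snowflake side, a catalog integration pointing at the internal catalog plus an
# externally managed Iceberg table make processed.daily_counts queryable again.
```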

Summary

As intriguing as the idea might sound, with the currently available stack and Snowflake hosted on Azure, Iceberg is out of reach (for me at least). I’ll update this once PyIceberg is updated, though 😜