Infrastructure

dltHub Pro is here. So is my static-egress proxy VM.

How I kept IP‑locked Snowflake and Azure databases happy with a managed runtime that runs “wherever Python runs”

TL;DR: dltHub Pro went GA on May 19. The managed runtime is excellent, but its jobs run from different cloud IPs every time, while my Azure databases and Snowflake account are locked down by IP. I solved it by parking a tiny proxy VM with one static egress IP in front of everything and routing dltHub’s traffic through that.

dltHub Pro went GA, and I hit a wall.

dltHub Pro went generally available on May 19. I had already been running dlt pipelines on the managed dltHub runtime during the beta phase, because it fits neatly into my stack: agents build the pipelines on my laptop, dltHub Pro runs them in production on a schedule. One command to deploy.

Introducing dltHub Pro: Claude/Codex/Cursor-native data engineering

The GA announcement. 91% of new dlt pipelines are now built by agents. Pro is the runtime that ships them to production.

I migrated five existing pipelines (four Postgres sources, one MySQL source, all writing into Snowflake) to the new runtime in about 10 minutes. And then I hit the wall this post is about.

The wall: dltHub jobs run from everywhere

My Azure databases are locked down by IP allow-list. That's also why those pipelines don't run in Snowpark Container Services (SPCS) in the first place: SPCS egress IPs are not static either.

Also, my Snowflake account has a NETWORK POLICY. That's fine when the consumer is a known VM with a known address. It's not fine when the consumer is a managed runtime that schedules jobs on whatever capacity is free.

To quantify it, I wrote a tiny dlt pipeline that only calls api.ipify.org, writes the result to a row, and repeats. I ran it on a schedule for 24 hours and saw 110 distinct egress IPs across various cloud service providers. IP allow‑listing per job was a non‑starter.
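A minimal sketch of such a probe, using only the standard library. The function names and the injectable fetch parameter are illustrative and not the actual pipeline; in the real version a dlt resource would yield the same rows into a table.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Iterable, Iterator


def fetch_public_ip(url: str = "https://api.ipify.org?format=json") -> str:
    """Ask ipify which public IP this process egresses from."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["ip"]


def probe_rows(samples: int, fetch: Callable[[], str] = fetch_public_ip) -> Iterator[dict]:
    """Yield one row per sample, stamped with the UTC time of the observation."""
    for _ in range(samples):
        yield {"ip": fetch(), "seen_at": datetime.now(timezone.utc).isoformat()}


def distinct_ips(rows: Iterable[dict]) -> set:
    """Collapse the collected rows into the set of distinct egress IPs."""
    return {row["ip"] for row in rows}
```

Counting distinct_ips over a day of scheduled runs is what produces a number like the 110 above.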

To be clear, this is by design: dltHub's pitch is "runs where Python runs". The runtime is supposed to use cheap, ephemeral capacity.

The fix: one VM, two proxies, one static IP

The architecture is embarrassingly simple:

dltHub (dynamic IPs)
    ├──> HAProxy:5432x  (TCP passthrough)  → Azure Postgres / MySQL
    └──> Squid:31280    (HTTP CONNECT)     → Snowflake / Azure Blob staging

I run one small VM on elest.io (Hetzner under the hood), Ubuntu, with Docker Compose and two containers in network_mode: host. HAProxy handles the database connections as raw TCP passthrough, Squid handles Snowflake as an HTTP CONNECT proxy, and every Azure DB firewall plus the Snowflake NETWORK POLICY only need to know one IP - that of the VM.
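Since the HAProxy side is nothing more than one TCP listener per database, the config can be generated. Here is a rough sketch of a renderer; the section names, local ports and Azure hostnames are made-up examples, not my real setup.

```python
from typing import Mapping, Tuple


def render_tcp_listeners(backends: Mapping[str, Tuple[int, str, int]]) -> str:
    """Render one HAProxy `listen` section per database.

    `backends` maps a section name to (local_port, upstream_host, upstream_port).
    """
    blocks = []
    for name, (local_port, host, port) in sorted(backends.items()):
        # mode tcp = raw passthrough, so TLS stays end-to-end with the database
        blocks.append(
            f"listen {name}\n"
            f"    bind *:{local_port}\n"
            f"    mode tcp\n"
            f"    server upstream {host}:{port}\n"
        )
    return "\n".join(blocks)


# Hypothetical example values
print(render_tcp_listeners({
    "pg_orders": (54321, "example-pg.postgres.database.azure.com", 5432),
    "mysql_shop": (33061, "example-mysql.mysql.database.azure.com", 3306),
}))
```

Each database gets its own local port on the VM, and the pipeline's connection string points at the VM's IP and that port instead of the Azure FQDN.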

Why elest.io specifically? Any small cloud VM with a stable public IP would work the same way, but not having to manage the deployment and hosting myself is reason enough for me. (I had briefly confused "Elastio", the backup vendor, with "elest.io", the managed PaaS. They are not the same company 😅)

Why two proxies?

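HAProxy is enough for Postgres and MySQL. Clients connect over TCP, TLS is end-to-end, and the proxy never sees the plaintext. The only caveat: the client must use sslmode=require instead of sslmode=verify-full, because the cert CN is the original Azure FQDN, not the proxy hostname. TLS encryption is still enforced, but note that libpq's require mode skips certificate validation altogether, not just the hostname check; sslmode=verify-ca with the right root certificate keeps chain validation and relaxes only the hostname binding.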
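On the client side this boils down to a connection URL that targets the proxy. A small sketch, assuming a libpq-style URL; the helper name, credentials and the 203.0.113.10 address are placeholders. sslmode is a parameter so that a stricter mode such as verify-ca can be swapped in.

```python
from urllib.parse import quote, urlencode


def proxied_pg_url(user, password, proxy_host, proxy_port, database, sslmode="require"):
    """Build a libpq-style URL that targets the proxy instead of the Azure FQDN."""
    if sslmode == "verify-full":
        # The server cert names the Azure FQDN, so a hostname check against
        # the proxy address can never succeed.
        raise ValueError("verify-full cannot work through the TCP proxy")
    query = urlencode({"sslmode": sslmode})
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{proxy_host}:{proxy_port}/{quote(database, safe='')}?{query}"
    )


# Hypothetical values: proxy VM address and the HAProxy listener port
print(proxied_pg_url("loader", "p@ss", "203.0.113.10", 54321, "orders"))
```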

Snowflake is different. The Snowflake Python connector speaks HTTPS to *.snowflakecomputing.com and does strict hostname verification on the cert. With TCP passthrough on a non-matching hostname, the handshake fails, and there's no clean knob to disable hostname binding the way psycopg has one.

The standard answer for HTTPS-with-strict-hostname is an HTTP CONNECT proxy. The client tells the proxy CONNECT <account>.snowflakecomputing.com:443, the proxy opens a TCP tunnel, and the client does the TLS handshake straight through it. The cert validates against the real hostname because that's the hostname the client used. Squid does exactly this, and the Snowflake connector (plus requests, urllib3, botocore, azure-storage-blob) all honor HTTP_PROXY / HTTPS_PROXY natively.
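To make the CONNECT handshake concrete, here is a sketch of the plaintext exchange that happens before any TLS. The helper names and the account hostname are invented for illustration; real clients do this for you.

```python
def build_connect_request(target_host: str, target_port: int = 443) -> bytes:
    """Return the plaintext request a client sends to an HTTP CONNECT proxy."""
    authority = f"{target_host}:{target_port}"
    lines = [
        f"CONNECT {authority} HTTP/1.1",
        f"Host: {authority}",
        "",  # an empty line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def tunnel_established(status_line: bytes) -> bool:
    """True if the proxy answered with a 200, i.e. the tunnel is open."""
    parts = status_line.split(None, 2)
    return len(parts) >= 2 and parts[1] == b"200"


# Hypothetical account hostname
print(build_connect_request("myaccount.snowflakecomputing.com").decode())
```

After the 200 response the proxy just shovels bytes in both directions, and the TLS handshake that follows is between the client and Snowflake. In the standard library, http.client.HTTPSConnection.set_tunnel() performs the same exchange.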

I lock Squid down to two domain suffixes so it cannot be abused as a general‑purpose open proxy:

acl snowflake_domain dstdomain .snowflakecomputing.com
acl azure_blob_domain dstdomain .blob.core.windows.net

http_access allow CONNECT snowflake_domain
http_access allow CONNECT azure_blob_domain
http_access deny all
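To sanity-check which hosts that allow-list lets through, the suffix matching can be approximated in a few lines. This is only a rough model of Squid's dstdomain behavior for entries with a leading dot, not Squid's actual implementation, and the hostnames are examples.

```python
ALLOWED_SUFFIXES = (".snowflakecomputing.com", ".blob.core.windows.net")


def squid_would_allow(host: str, suffixes=ALLOWED_SUFFIXES) -> bool:
    """Approximate a dstdomain match: the domain itself or any subdomain of it."""
    host = host.lower().rstrip(".")
    for suffix in suffixes:
        bare = suffix.lstrip(".")
        if host == bare or host.endswith("." + bare):
            return True
    return False


for candidate in (
    "myaccount.snowflakecomputing.com",   # allowed
    "stage123.blob.core.windows.net",     # allowed
    "evilsnowflakecomputing.com",         # denied: not a subdomain
    "example.online.tableau.com",         # denied: not on the list
):
    print(candidate, squid_would_allow(candidate))
```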

Don't proxy what doesn't need proxying

My first attempt set HTTP_PROXY globally at the top of the dltHub deployment module. That broke two things at once:

A pipeline reading from Tableau Online started failing because Squid refused CONNECT to *.online.tableau.com.
The Snowflake pipelines themselves failed in a post-success step with botocore.exceptions.ProxyConnectionError. It turns out dltHub uploads its own runtime artifacts to an internal S3 bucket after each job, and that traffic was also getting routed through Squid and rejected.

The fix was a small context manager in __deployment__.py that scopes the proxy env vars to Snowflake jobs only (via dltHub job trampolines), plus a NO_PROXY list for dltHub's own telemetry/S3 buckets and localhost:

from contextlib import contextmanager
import os

@contextmanager
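def _snowflake_proxy():
    """Temporarily route HTTP(S) traffic through the static-egress Squid proxy."""
    keys = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")
    # Remember the current values so they can be restored afterwards.
    old = {k: os.environ.get(k) for k in keys}

    os.environ["HTTP_PROXY"] = "http://<my_static_ip>:31280"
    os.environ["HTTPS_PROXY"] = "http://<my_static_ip>:31280"
    # Keep dltHub's own traffic and its S3 artifact uploads off the proxy.
    os.environ["NO_PROXY"] = ".dlthub.com,.amazonaws.com,127.0.0.1,localhost"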

    try:
        yield
    finally:
        for k, v in old.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v

Non-Snowflake jobs (Tableau, Azure Graph, etc.) are completely unaffected.
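The same scoping idea can also be expressed as a decorator. This is a variant sketch, not what I deployed: it swaps the hand-rolled save/restore for unittest.mock.patch.dict, which snapshots os.environ and restores it on exit. The job function and the 203.0.113.10 address are placeholders.

```python
import functools
import os
from unittest.mock import patch

PROXY_URL = "http://203.0.113.10:31280"  # placeholder address, not a real proxy
PROXY_ENV = {
    "HTTP_PROXY": PROXY_URL,
    "HTTPS_PROXY": PROXY_URL,
    "NO_PROXY": ".dlthub.com,.amazonaws.com,127.0.0.1,localhost",
}


def via_snowflake_proxy(job):
    """Run `job` with the proxy variables set, then restore the environment."""
    @functools.wraps(job)
    def wrapper(*args, **kwargs):
        # patch.dict restores os.environ on exit, even if the job raises
        with patch.dict(os.environ, PROXY_ENV):
            return job(*args, **kwargs)
    return wrapper


@via_snowflake_proxy
def load_to_snowflake():
    # Hypothetical job body; here it only reports which proxy it would use.
    return os.environ.get("HTTPS_PROXY")
```

Either way, the environment is process-global, so this only isolates jobs that run one after another in the same process, not concurrent threads.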

The HAProxy config trap

elest.io ships HAProxy with a stock haproxy.cfg that includes a dataplane API block for their web-based config editor:

userlist haproxy-dataplaneapi
  user admin insecure-password ...

  program api
    ...

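That program directive is indented with two spaces, which makes it look like a nested block, but HAProxy ignores indentation. As far as I can tell, the actual cause is that the program section was deprecated in HAProxy 3.1 and removed in 3.3, so HAProxy 3.3 no longer treats program as the start of a new section. It parses the line as a keyword inside the preceding userlist section and promptly rejects it: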

[ALERT] config : unknown keyword 'program' in 'userlist' section
[ALERT] config : Fatal errors found in configuration.

I ripped the entire block out of the rendered config in my deploy script. The elest.io web editor stops working, but I'm managing the config over SSH from a Python script anyway, so I don't miss it.
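The stripping step in the deploy script can look roughly like this. It is a sketch under assumptions: the set of section keywords is a partial list I picked for illustration, and the sample config in the usage note is invented rather than the real elest.io file.

```python
SECTION_KEYWORDS = {
    "global", "defaults", "frontend", "backend", "listen",
    "userlist", "program", "peers", "resolvers",
}


def strip_sections(cfg: str, unwanted: set) -> str:
    """Drop whole sections whose header (e.g. 'program api') is in `unwanted`."""
    kept, skipping = [], False
    for line in cfg.splitlines():
        words = line.split()
        # Match on the first word so indented section headers are caught too.
        if words and words[0] in SECTION_KEYWORDS:
            skipping = " ".join(words[:2]) in unwanted
        if not skipping:
            kept.append(line)
    return "\n".join(kept) + "\n"
```

Called as strip_sections(rendered_cfg, {"userlist haproxy-dataplaneapi", "program api"}), it removes both headers and everything beneath them up to the next section, and leaves the listeners untouched.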

MySQL has its own hostname problem

PyMySQL was the last one to surprise me. The SSL context that ssl.create_default_context() returns defaults to check_hostname=True, and there's no DSN-level sslmode=require equivalent. So the MySQL pipeline blew up with a cert hostname mismatch even though I'd already learned this lesson for Postgres.

The fix is the same idea, just expressed differently in code:

import ssl

_mysql_ssl = ssl.create_default_context()
_mysql_ssl.check_hostname = False  # cert validation is still on; only hostname binding is relaxed

engine_kwargs = {
    "connect_args": {
        "ssl": _mysql_ssl,
        # ...
    }
}
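If you want to convince yourself that the comment in that snippet holds, a quick standard-library check does it. The helper name is just for illustration.

```python
import ssl


def relaxed_hostname_context() -> ssl.SSLContext:
    """TLS context that still validates the cert chain but skips the hostname match."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    return ctx


ctx = relaxed_hostname_context()
# verify_mode is untouched, so the server must still present a trusted cert
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

Chain validation only goes away if verify_mode is explicitly set to ssl.CERT_NONE, which this setup never does.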

What's left on the TODO

After adjusting all the network configs in Azure and applying the modified NETWORK POLICY in Snowflake, the whole thing has been running stably for a few hours now.

I'm not sure how stable the public IP of the proxy is, particularly across VM reboots. I'll have to keep an eye on this 😜

The bigger picture

dltHub Pro is a real shift. The "agents build, runtime ships" loop genuinely works, and I'm now running production pipelines that I never had to write a Dockerfile for. But "runs where Python runs" comes with the trade-off that the egress side of the network is no longer mine, and any destination that firewalls by IP needs a story for that.

I'm almost certain the kind folks at dltHub will eventually come up with a native solution for this, but until then a $5/month proxy VM is a perfectly fine story. If anything, it forced me to clean up a years-old mess of per-source firewall rules into one tidy allowlist. I'll take that. 😎