dlt on a 4 GB pod: loading multi-GB datasets without the SIGKILL

Exit code 137 is dlt's way of saying "your RAM and I need to talk"

Share
dlt on a 4 GB pod: loading multi-GB datasets without the SIGKILL

TL/DR: Download in parallel, split huge datasets into chunks, then load one chunk at a time with a fresh dlt.pipeline(). That's the whole trick to running multi-GB loads on a tiny VM.


A while back I moved a Salesforce full load (~56 objects, the biggest CSV weighing in at 5 GB) into Snowflake using dlt.

Can you run dlt inside Snowflake? (Part 2/2: SPCS)
In my last post I explored whether dlt could run in a Snowflake UDF... and failed. But can it still run inside Snowflake using a Container? TL/DR: yes, it does ๐Ÿ˜Ž

Locally on a 64 GB laptop: piece of cake.

On a high-mem โ„๏ธ SPCS container with 28 GB memory: also fine.

On a 4 GB Modal pod: dead. No Python traceback, no stack, just the container restarting and an exit code: 137 in the logs.

Why does this matter? Well, dltHub Pro runs on Modal pods and (as of this writing) there is no option to size the pod it runs on. Attempting to process a "too large" dataset needs a few workarounds...

The product is still young, I expect this to not be an issue anymore in the very near future. But today that 137 is SIGKILL: the kernel's OOM killer picked dlt as the biggest target and shot it. Here's what was actually eating the memory, and the pattern that fixed it for good.

Why it OOMs

The default pipeline.run(source) reads everything into RAM during extract and normalize, then keeps the load-package metadata around too. dlt has some mechanisms to spill data from memory to disk if that's necessary, but the sneakier problem is pipeline state.

Every run() against a pipeline object adds schema deltas, load packages, and cursor values to its in-memory state. The Schema and LoadInfo graphs grow for the entire lifetime of that object. Loop over 100 chunks with one pipeline and you OOM somewhere around chunk 15, not chunk 1, which is what makes it confusing.

# the trap: one pipeline, many runs -> state grows every iteration
pipeline = dlt.pipeline(...)
for resource in all_resources:
    pipeline.run(resource)

Fresh pipeline per object

The fix is a fresh dlt.pipeline() per source object instead of one for everything. Per object, with an explicit del when dlt is done with it.

for object_name, chunks in by_object.items():
    pipeline = dlt.pipeline(...)
    for chunk in chunks:
        pipeline.run(source_from_chunk(chunk))
    del pipeline  # state drops to zero, GC reclaims it

Pipeline creation costs about a second, so doing this O(number of objects) is negligible compared to the extract/load time. Memory stays O(chunk size) and drops back down after each object.

One detail that paid off: sort objects heaviest-first (by chunk count). If something is going to OOM, it does so in the first minute, not two hours in on the last tiny table.

Split the Source without reading it all

The source I was struggling with is Salesforce's Bulk API (V2). It returns strictly formatted CSV files after async SOQL queries, which is actually a useful format in this case ๐Ÿ˜Ž

After downloading such a 5 GB file, attempting to read().splitlines() or similar it into a list of lines, peak memory is essentially the whole file. The fix is a streaming two-pass split: count rows first, then stream them into chunk files, flushing every few MB.

def _split_csv(path: str, n_chunks: int) -> list[str]:
    with open(path) as src:
        header = src.readline()
        total = sum(1 for _ in src)  # generator, never a list

    chunk_size = (total + n_chunks - 1) // n_chunks
    # ... open n_chunks files, write the header to each ...

    with open(path) as src:
        src.readline()  # skip header
        batch, batch_bytes, current, rows = [], 0, 0, 0
        for line in src:
            batch.append(line); batch_bytes += len(line); rows += 1
            if rows >= chunk_size:
                handles[current].writelines(batch)
                batch.clear(); batch_bytes = 0; rows = 0; current += 1
            elif batch_bytes >= 4 * 1024 * 1024:  # 4 MB flush
                handles[current].writelines(batch); batch.clear(); batch_bytes = 0

Peak memory is now ~4 MB for the write buffer, regardless of file size. One warning, though:

Don't split CSVs by naive string splitting. Quoted fields containing newlines will get torn apart. The line-oriented approach above works because Salesforce Bulk API CSVs are well-formed; if yours aren't, use a proper csv reader/writer that understands quoting.

I only chunk above 500k rows, targeting ~500 MB per chunk, capped at 16 chunks as a safety net. That keeps chunk counts manageable while staying well within the 4 GB podโ€™s limits.

Stage to blob, not local disk

If I stage to local disk and then PUT to Snowflake, both the CSV chunk and the staging files sit on the pod's disk at the same time. Modal's ephemeral storage is small, so the pod fills up and gets killed... same symptom (SIGKILL), different resource (disk).

Instead, I point dlt's staging straight at blob storage (Azure in my case, but S3 etc. work equally well):

pipeline = dlt.pipeline(
    pipeline_name="salesforce_to_snowflake",
    destination=dlt.destinations.snowflake(),
    staging=dlt.destinations.filesystem(bucket_url="az://stage/salesforce"),
    dataset_name="salesforce",
)

Now dlt normalizes into small JSONL (or Parquet-files, but JSONL works smoother with VARIANT columns) batches on local temp, uploads them to Azure, and Snowflake runs COPY INTO from the az:// URI. Nothing large accumulates on the pod.

Two pitfalls that cost me hours

pipeline.default_schema.tables drifts over time. After a few runs, that property lists every table ever loaded through that pipeline name, not just the current object's. I used it to decide which tables to drop for a full load and happily wiped tables belonging to other objects. Now I keep my own explicit object list and drop from that instead ๐Ÿ˜…

Manual worker config was slower. I assumed capping extract/normalize/load workers would help a memory-starved pod. It didn't: The same dataset took 47 minutes with manual limits vs 34 minutes with pure dlt defaults. When reading from local files, dlt's defaults already balance worker counts well.

Bonus footgun: log_level="CRITICAL" hides silent auth failures. If an expired token returns HTTP 200 with a login-page body, dlt extracts 0 rows and reports success. Use INFO or WARNING so you actually see whatโ€™s happening.

And that's it: same 4 GB pod, same 5 GB CSVs, zero SIGKILLs. The pattern is fairly boring: fresh pipeline per object, stream the split, stage to blob, and let dlt's defaults do their thing. It took me a while to finally find the beauty in that boringness ๐Ÿ˜Ž