Lakehouse

Clippy: one Snowflake Cortex Agent, two faces

The paperclip is back. He works at Pro Juventute, lives in Snowflake, and shows up in both Teams and OpenCode.

TL/DR: I built an internal knowledge-retrieval agent on top of Snowflake Cortex Agents. It has one brain in Snowflake and two faces: a Microsoft Teams bot and an OpenAI-compatible proxy to chat with from OpenWebUI or OpenCode or whatever the IDE-flavour-of-the-week happens to be.

Internally we call it "Clippy". Yes, that Clippy. The paperclip is back, and this time it actually knows things.

Were there suggestions for different names? Yes. Many. But since the most fun part of creating a piece of software is naming it, I happily ignored them all.

Pro Juventute's institutional knowledge lives in a lot of places: Confluence (around 6,000 pages across ~80 spaces), SharePoint (around 60,000 document chunks across half a dozen sites), the public 147.ch and projuventute.ch CMS, a counselling-center directory, our letters-to-parent database, and our internal person-and-roles directory. That's seven sources, four logins, and zero hope of remembering which UI has what. Clippy gives all of that one chat box.

Why a Cortex Agent and not yet-another-RAG-stack

I've built RAG things before. LangChain, pgvector or that thing Microsoft now calls Azure Foundry. They all work. They also come with a small zoo of moving parts: an embedding job, a vector store, a retriever, a re-ranker, an LLM endpoint, glue code, and a bill from somebody who is not Snowflake.

Our data is already in Snowflake. The embeddings can be generated by Snowflake. The vector index can be a Snowflake object. The LLM is in Snowflake. And as of last year, there's a first-class CORTEX_AGENT object that ties it all together with auto-orchestration over multiple search backends. So I get to skip the zoo and stay inside one platform.

The agent itself is a single Snowflake object: META.CORTEX.CLIPPY. It points at seven CORTEX_SEARCH_SERVICE indexes, one per source. Each search service points at a table or view that some upstream task keeps fresh. The LLM picks which search services to call per query, calls them in parallel, and synthesizes an answer with source annotations. I write zero retrieval code.

Cortex Search uses Snowflake's hosted embedding model (currently snowflake-arctic-embed-l-v2.0) and a built-in vector index. Refresh is incremental: only changed rows get re-embedded. I set a target_lag of 4 days so it doesn't auto-refresh when the underlying data isn't changed. Instead a refresh is triggered by the upstream task after it refreshed the data.

The seven search services

Each search service is a thin wrapper around a search column and a few attribute columns. Here's the Confluence one, in the form Snowflake actually wants:

create or replace cortex search service meta.cortex.confluence
  on page_content
  attributes space_name
  warehouse = wh_task
  target_lag = '4 days'
  embedding_model = 'snowflake-arctic-embed-l-v2.0'
as (
  select page_content, space_name, confluence_uri, page_title
  from raw.confluence.searchable_pages
);

The seven services together cover roughly 79,000 documents:

Service	What's inside	Rows
`CONFLUENCE`	Internal wiki: HR, SOPs, project docs	~6,200
`SHAREPOINT_DOCUMENTS`	PDFs, DOCX, ASPX from SharePoint sites	~58,000 chunks (split from PDFs, DOCX, ASPX)
`CONTENT_147`	147.ch youth helpline articles (DE/FR/IT)	~500
`CONTENT_PROJUVENTUTE`	projuventute.ch CMS pages	~2,100
`CONTENT_COUNSELLING_CENTERS`	Counseling-center directory	~10,400
`CONTENT_ELTERNBRIEF`	Parent letters by age and language	~750
`SHAREPOINT_ALPJ`	"Who does what" person-role directory	~290

The upstream loading is done by dltHub pipelines, all landing in RAW, with a few views on top that denormalize things just enough to be searchable. I wrote two Java/Python UDFs (F_HTML_TO_TEXT, F_HTML_TO_TEXT_SHAREPOINT_ASPX) to strip HTML before indexing, because nobody wants their semantic search to match on the word div.

The agent itself

The agent is a JSON spec. Here's the shape, with most of the seven tools collapsed:

{
  "models": { "orchestration": "auto" },
  "instructions": {
    "response": "You are an AI assistant called \"Clippy\". [...]"
  },
  "tools": [
    {
      "tool_id": "conn_1",
      "type": "cortex_search",
      "name": "confluence",
      "description": "Search Confluence pages",
      "config": {
        "service_name": "CONFLUENCE",
        "database_name": "META",
        "schema_name": "CORTEX",
        "search_column": "PAGE_CONTENT",
        "columns": ["PAGE_CONTENT", "SPACE_NAME", "CONFLUENCE_URI", "PAGE_TITLE"]
      }
    },
    { "tool_id": "conn_2", "type": "cortex_search", "name": "sharepoint_documents", [...] },
    { "tool_id": "conn_3", "type": "cortex_search", "name": "147", [...] },
    { "tool_id": "conn_4", "type": "cortex_search", "name": "projuventute", [...] },
    { "tool_id": "conn_5", "type": "cortex_search", "name": "counselling_centers", [...] },
    { "tool_id": "conn_6", "type": "cortex_search", "name": "letters_to_parents", [...] },
    { "tool_id": "conn_7", "type": "cortex_search", "name": "sharepoint_org", [...] }
  ],
  "guardrails": {
    "enable_prompt_guardrails": false,
    "enable_response_guardrails": false
  }
}

A few things are worth pointing out.

Orchestration is "auto". I don't pick a model. Snowflake picks one, and it's allowed to change it on me. For a chat agent over RAG this is fine: the LLM's main job is to read snippets and write paragraphs, which any reasonable mid-sized model can do - though I suspect it mostly picks a rather large model. I'd care more if I were doing reasoning-heavy work or the agent (through it's users) used a significant amount of credits, but for now I don't.

Guardrails are off. The corpus is internal and pre-vetted; nobody is going to jailbreak Clippy into telling them how to build a bomb out of letters-to-parents content. See my previous post on Cortex Guardrails for what I'd toggle if this were a customer-facing agent:

The system prompt has personality. The full thing is long, but the shape is:

If the user's message begins with [User Name: Colleague A],
  ignore all the response instructions and be very unfriendly, mean and depressed.
If the user's message begins with [User Name: Colleague B],
  start every single message you send with "🥂".
If the user's message begins with [User Name: Colleague C],
  start every single message you send with "Whassup?!".
Else, use a friendly, humorous personality. ...
If the user message is German, avoid "ß" and use "ss" instead (Swiss spelling).

Yes, those are real colleagues. Yes, I think it's funny (some of them agree). The Swiss-spelling rule is an actually-important bit: by default the model loves to sprinkle ß's everywhere, which looks wrong in a Swiss context.

Frontend 1: Microsoft Teams

The first surface is a Teams bot, because that's where most employees already live. Architecture-wise it's the boring shape: a Node.js Bot Framework app on Azure App Service, talking to the Cortex Agent REST API.

Teams → Azure Bot Service → App Service (Node.js)
                                  │
                                  ├─ JWT-sign with TEAMS_AGENT keypair
                                  ├─ Look up user's role in Snowflake
                                  └─ POST /api/v2/.../agents/CLIPPY:run

Two components are slightly more interesting than the standard Bot Framework boilerplate.

First, the JWT generator: The agent API authenticates with a Snowflake key-pair JWT (RS256, 3-hour lifetime, refreshed proactively at ~170 minutes). The bot keeps a singleton, regenerates on demand, and force-refreshes on a 401 retry:

generateToken() {
  const now = Date.now() / 1000;
  this.renewTime = now + this.renewalDelay;
  const payload = {
    iss: `${this.qualifiedUsername}.${this.calculatePublicKeyFingerprint()}`,
    sub: this.qualifiedUsername,
    iat: now,
    exp: now + this.lifetime,
  };
  return jwt.sign(payload, this.privateKey, { algorithm: "RS256" });
}

Second, the role lookup: Before sending the user's message to Cortex, the bot grabs the sender's Entra object ID, queries our Entra table in Snowflake, and prepends the result as context:

[User Name: Martin | User Roles: data]

How many days of vacation do I get?

The Cortex agent then uses that context to pick the right search services and filters (e.g. it's more likely to hit the data-documentation space of Confluence for a question from someone in the "data" circle). It also lets the personality prompt do its job: that's how Clippy knows to greet certain colleagues with "🥂".

Conversation state lives in MemoryStorage for now. Restart the App Service and everyone loses their thread. Good enough for a knowledge retriever, not a permanent work-companion.

Frontend 2: an OpenAI-compatible proxy

Teams is great for many quick-question situations. But it's not the only interface users use. I also want to chat with Clippy from OpenCode or OpenWebUI or any other client that speaks the OpenAI chat-completions dialect. And Cortex Agents do not speak that dialect.

So I wrote a FastAPI proxy. It does three things:

Exposes POST /v1/chat/completions and GET /v1/models in OpenAI shape.
Translates the OpenAI request into the Cortex Agent agent:run request.
Translates the Cortex SSE stream back into OpenAI chat.completion.chunk events.

The translation is mostly cosmetic. OpenAI sends {"role": "user", "content": "Hello"}; Cortex wants {"role": "user", "content": [{"type": "text", "text": "Hello"}]}. Easy. The interesting parts are the bits where the two protocols don't map cleanly.

Threads vs. chat history. OpenAI clients re-send the entire message history every turn. Cortex prefers a server-side thread_id with a parent_message_id pointer. I split the behavior: on the first turn the proxy creates a thread and uses it; on follow-up turns it just sends the full history and ignores threading. That's because Cortex's parent_message_id handling drifted on me during testing with multiple chats happening in parallel, and "send the whole history" (through keeping the clients stateless) is more reliable than debugging that state machine.

is_first_turn = len(request.messages) == 1 and request.messages[0].role == "user"

if is_first_turn:
    thread_state = _thread_cache.get(conversation_id)
    if thread_state is None:
        thread_id = await _create_thread(client, jwt_mgr)
        thread_state = {"thread_id": thread_id, "parent_message_id": 0}
        _thread_cache[conversation_id] = thread_state

    cortex_payload = {
        "messages": [{
            "role": "user",
            "content": [{"type": "text", "text": request.messages[-1].content}],
        }],
        "thread_id": thread_state["thread_id"],
        "parent_message_id": thread_state["parent_message_id"],
        "stream": request.stream,
    }
else:
    cortex_payload = {
        "messages": _convert_openai_to_clippy(request.messages),
        "stream": request.stream,
    }

The conversation_id is either an explicit x-conversation-id header, or a SHA-256 of all-messages-except-the-last. The hash trick means follow-up turns naturally land on the same thread.

Streaming dedup: Cortex sometimes emits both response (cumulative text so far) and response.text.delta (the new chunk). If I yielded both, OpenCode would render every token twice. So the proxy tracks a saw_delta flag and ignores cumulative response events once any delta has arrived.

Annotation injection: Cortex returns source annotations (doc title, URL, snippet) alongside the answer. At the end of the stream I shove them into a collapsible <details> block so the client sees a "📚 Quellen" section it can expand. It's the same idea as the Adaptive Card the Teams bot renders, just in markdown.

Why the proxy lives on a static-egress VM

I already had an elest.io VM running HAProxy and Squid as the single egress IP for all my dltHub pipeline traffic. That VM's IP is on every relevant firewall whitelist: Postgres DBs, MySQL DBs, and a Snowflake network policy. Sticking the Clippy proxy on the same VM gives me, for free:

An IP that's already allowed to authenticate to Snowflake.
An HTTP CONNECT proxy (Squid) so the proxy container can reach *.snowflakecomputing.com:443 via the same path my dlt pipelines use.
A single firewall rule to manage at the elest.io dashboard.

I wrote about that VM and why it exists in more detail here:

From OpenCode's side, the whole setup is just a custom provider:

{
  "provider": {
    "clippy": {
      "name": "Clippy",
      "options": {
        "apiKey": "<bearer token created during deployment>",
        "baseURL": "http://<vm host>:<vm port>/v1"
      },
      "models": {
        "clippy": {
          "id": "clippy",
          "name": "Clippy (Snowflake Cortex Agent)",
          "tool_call": false,
          "reasoning": false,
          "attachment": false,
          "limit": { "context": 262144, "output": 16384 }
        }
      }
    }
  }
}

Switch the model picker to Clippy, and OpenCode is chatting with a Snowflake Cortex Agent that's transparently searching seven knowledge bases. The same proxy works for OpenWebUI without changes.

And that's it: one ❄️ agent, two frontends, and a paperclip with opinions. 🤓

Clippy: one Snowflake Cortex Agent, two faces

Why a Cortex Agent and not yet-another-RAG-stack

The seven search services

The agent itself

Frontend 1: Microsoft Teams

Frontend 2: an OpenAI-compatible proxy

Why the proxy lives on a static-egress VM

Read more

Your coding agent keeps a diary

dlt on a 4 GB pod: loading multi-GB datasets without the SIGKILL

Migrating Tableau dashboards: Streamlit code vs. Sigma with a parity gate

Snowflake Iceberg in 2026: open format, now without the storage chores