
If a chat assistant generates an image but users only see a naked URL, the magic is lost. At AI Tech Inspire, we’ve been seeing more teams hit this exact snag when wiring up custom tools with GPT-style models: the model streams text beautifully, but image outputs arrive as plain links. The good news? This is fixable with a few pragmatic patterns that keep the stream flowing while your UI renders images natively.
What’s happening in this setup
- The app uses a streaming chat interface with an OpenAI model and tool/function calls.
- The standard flow: provide tools, receive a tool call, run code, send tool output back, and then the model returns a final response (or more tool calls).
- This works well for tools returning text data; the model’s final response streams cleanly to the user.
- One custom tool returns JSON containing both text and an image URL.
- Result: the model includes that image URL in its streamed text, so users see a link instead of an actual image.
- Goal: treat the image output as an image component in the UI, not just text.
Why models turn image results into URLs
Most chat completion models stream text and tool_calls. They don't stream image binaries or typed "image" blocks as assistant content by default. When a tool returns JSON with an image URL, the assistant tends to reference it in text unless given a clear protocol or a UI-aware tool. Without that protocol, your downstream renderer has no signal to switch from text streaming to an image component.
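For concreteness, a tool might hand back a payload like the one below (the field names are illustrative, not part of any particular API). Given no other instruction, the assistant will usually fold imageUrl straight into its prose as a clickable link:

{ "text": "Here is the logo you asked for.", "imageUrl": "https://example.com/logo.png" }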
“If you want images to render, give your UI the signal—don’t make the UI guess.”
Below are battle-tested ways to move from "link in text" to "image in chat" while keeping the streaming experience.
Solution blueprint: from textual link to visual artifact
1) Render tool output directly as UI artifacts
This is the simplest, highest-leverage change. When your app executes the tool and gets { text, imageUrl }, emit two UI events:
- A text bubble (if needed) using text.
- An image bubble component using imageUrl (plus caption/alt where appropriate).
Then, when posting the tool result back to the model, include a system nudge: "If a tool returns an image URL, do not echo the link; instead summarize what the image shows." This keeps the assistant's stream readable and avoids duplicate links.
This pattern treats tool outputs as first-class UI events—parallel to the model’s text stream. It’s a clean separation of concerns: the model orchestrates; your UI decides presentation.
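As a minimal sketch of this pattern, assuming a generic runTool helper and a ui object with appendText and renderImage methods (all placeholder names rather than any specific framework's API):

// Pattern 1 sketch: the app renders tool output itself, then hands the raw
// result back to the model. runTool and ui are placeholders for your own code.
async function executeToolAndRender(call, ui) {
  const result = await runTool(call.name, call.args); // e.g. { text, imageUrl }

  if (result.text) ui.appendText(result.text);          // text bubble
  if (result.imageUrl) ui.renderImage(result.imageUrl); // image bubble, shown immediately

  // The raw JSON still goes back to the model as the tool message so it can
  // summarize; the system nudge above keeps it from repeating the URL.
  return { role: "tool", tool_call_id: call.id, content: JSON.stringify(result) };
}

The key design choice: presentation never depends on what the model says next, because the image is already on screen by the time the narration streams in.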
2) Add a display_image tool the model must call
Create a no-op UI tool, something like:
{
  name: "display_image",
  description: "Render an image in the chat UI",
  parameters: {
    type: "object",
    properties: {
      url: { type: "string", format: "uri" },
      alt: { type: "string" },
      title: { type: "string" },
      width: { type: "number" },
      height: { type: "number" }
    },
    required: ["url"]
  }
}
In your system prompt, instruct: "When a tool returns an image, call display_image with the URL rather than printing the link." Your client intercepts this tool call and renders the image component. This approach keeps the model in the loop while preserving a structured, UI-friendly contract.
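Wired into the Chat Completions API, the definition above sits inside a { type: "function", function: {...} } wrapper, and the client intercepts the call instead of doing any server work. A rough sketch (the ui object is again a placeholder):

// display_image registered as a Chat Completions tool
const displayImageTool = {
  type: "function",
  function: {
    name: "display_image",
    description: "Render an image in the chat UI",
    parameters: {
      type: "object",
      properties: {
        url: { type: "string", format: "uri" },
        alt: { type: "string" }
      },
      required: ["url"]
    }
  }
};

// Client-side interception: just a render plus an acknowledgement back to the model
function handleDisplayImage(toolCall, ui) {
  // Arguments arrive as a JSON string once the tool call has finished streaming
  const args = JSON.parse(toolCall.function.arguments);
  ui.renderImage(args.url, args.alt);
  return { role: "tool", tool_call_id: toolCall.id, content: "Image rendered in the UI." };
}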
3) Use structured output to emit UI directives
Another option: request structured responses via a JSON schema (e.g., response_format set to a schema). The schema could include { type: "text" | "image", payload: {...} } blocks, such as:
{
  "type": "object",
  "properties": {
    "actions": {
      "type": "array",
      "items": {
        "oneOf": [
          { "type": "object", "properties": { "type": {"const": "text"}, "content": {"type": "string"} }, "required": ["type","content"] },
          { "type": "object", "properties": { "type": {"const": "image"}, "url": {"type": "string", "format": "uri"}, "alt": {"type": "string"} }, "required": ["type","url"] }
        ]
      }
    }
  },
  "required": ["actions"]
}
Your renderer consumes this stream of actions and decides whether to render a text bubble or an image. This is great for complex apps that need consistency across varied outputs.
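A sketch of the request side, assuming the schema above is stored in uiActionsSchema; exact response_format options vary across SDK versions, and this example skips streaming to keep the parsing simple:

// Structured-output sketch: the model returns { actions: [...] } matching the schema
const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages,
  response_format: {
    type: "json_schema",
    json_schema: { name: "ui_actions", schema: uiActionsSchema }
  }
});

const { actions } = JSON.parse(completion.choices[0].message.content);
for (const action of actions) {
  if (action.type === "text") ui.appendText(action.content);
  if (action.type === "image") ui.renderImage(action.url, action.alt);
}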
4) Multiplex your stream: text channel + event channel
Keep the token stream flowing for conversational text, but emit separate UI events (Server-Sent Events or WebSocket messages) when tool outputs are ready. Many developer stacks already support this. If you’re using the Vercel AI SDK, look into its tool/annotation streams; similar ideas exist across agent frameworks like LangChain and the ecosystem around Hugging Face.
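One way to sketch the server side of that multiplexing, assuming an Express-style app and a hypothetical runConversation helper that exposes token and tool-result callbacks:

// Multiplexed SSE channel: tokens and UI events share one connection but use
// different event names, so the client can route them to different components.
app.get("/chat/stream", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");

  const send = (event, data) =>
    res.write(`event: ${event}\ndata: ${JSON.stringify(data)}\n\n`);

  await runConversation(req.query.prompt, {
    onToken: (text) => send("token", { text }),
    onToolResult: (output) => {
      if (output.imageUrl) send("ui", { type: "image", url: output.imageUrl });
    }
  });

  res.end();
});

On the client, an EventSource listener for "token" appends text while a listener for "ui" mounts the image component.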
5) Fallback: URL auto-detection
As a safety net, detect image URLs in streamed text with a simple regex and render them as images inline. It’s not as clean as the approaches above, but it improves UX if the assistant forgets to call your display tool.
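A conservative sketch of that safety net; because streamed tokens can split a URL across chunks, you would typically buffer to a message or sentence boundary before matching:

// Fallback: pull image URLs out of completed text and render them as images
const IMAGE_URL_RE = /https?:\/\/\S+\.(?:png|jpe?g|gif|webp|svg)(?:\?\S*)?/gi;

function renderWithImageFallback(text, ui) {
  const urls = text.match(IMAGE_URL_RE) || [];
  for (const url of urls) ui.renderImage(url);
  // Strip the URLs from the visible transcript so users don't see both link and image
  ui.appendText(text.replace(IMAGE_URL_RE, "").trim());
}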
A minimal end-to-end flow (pseudocode)
// 1) Send user + system + tools (generateLogo and displayImage are tool
//    definitions in the { type: "function", function: {...} } shape)
const messages = [
  { role: "system", content: "If a tool returns an image, call display_image and summarize rather than echoing URLs." },
  { role: "user", content: "Create a logo and show it here." }
];
const res = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  stream: true,
  messages,
  tools: [generateLogo, displayImage]
});

// 2) Stream tokens and tool calls. Simplified: deltas live at
//    chunk.choices[0].delta, and tool-call arguments arrive in fragments
//    that must be accumulated before they can be parsed.
const pendingToolCalls = [];
for await (const chunk of res) {
  const delta = chunk.choices[0]?.delta;
  if (delta?.tool_calls) collectToolCallFragments(pendingToolCalls, delta.tool_calls); // merge by index (helper not shown)
  if (delta?.content) ui.appendText(delta.content);
}
await handleToolCalls(pendingToolCalls);

async function handleToolCalls(calls) {
  for (const call of calls) {
    const args = JSON.parse(call.function.arguments || "{}");

    if (call.function.name === "generate_logo") {
      const output = await generateLogoOnServer(args);
      // output: { imageUrl, caption }
      // UI: render immediately
      ui.renderImage(output.imageUrl, output.caption);

      // 3) Send the tool result back so the model can narrate or continue.
      //    The tool message must follow the assistant message that carried the call.
      messages.push({ role: "assistant", tool_calls: [call] });
      messages.push({ role: "tool", tool_call_id: call.id, content: JSON.stringify(output) });
      await openai.chat.completions.create({ model: "gpt-4o-mini", stream: true, messages });
    }

    if (call.function.name === "display_image") {
      ui.renderImage(args.url, args.title || args.alt);
    }
  }
}
Two essential details to notice:
- Render images as soon as your tool produces them—don’t wait for the assistant’s narrative.
- Be explicit in your system instructions so the model calls display_image (or sticks to a structured schema) instead of echoing raw URLs.
How this compares to other ecosystems
Agent frameworks often bake in the notion of artifacts. For example, some LangChain agents return observations that UIs can render differently, and the broader ecosystem around Hugging Face includes Spaces and demos where outputs are strongly typed (images, audio, text). In generative image flows (think Stable Diffusion servers behind a tool), developers usually pass back a URL or a signed path. The difference is UX discipline: define a protocol so the UI knows when to show an image. This is less about the model and more about the contract between your tool layer and the chat renderer.
Why this matters for developers
- Better UX: Users expect images to appear inline, not as links they have to click.
- Control: A UI-aware tool or schema lets you standardize rendering across features (images now, audio/video later).
- Scalability: As more tools emit non-text outputs, a structured protocol avoids brittle parsing and regex hacks.
It’s the same lesson many learned when moving from plain-text logs to structured logs: once there’s a schema, everything gets easier.
Security, performance, and production tips
- Sanitize and proxy: Don't embed arbitrary external URLs. Proxy through your server for CORS, auth, and content filtering; a sketch follows this list.
- Expiry-aware links: If using signed URLs, handle refresh and show placeholders on expiration.
- Alt text and captions: For accessibility and clarity, require alt in your display_image schema.
- Caching: Cache images (CDN) and consider a low-res preview for faster perceived performance.
- Robust fallbacks: If the assistant forgets your protocol, auto-detect image URLs and render gracefully.
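Here is the proxy idea from the first tip as a rough sketch, assuming an Express server, Node 18+ for the built-in fetch, and example hostnames in the allowlist:

// Allowlist-based image proxy: the browser only ever talks to your origin
const ALLOWED_HOSTS = new Set(["images.example.com", "cdn.example.com"]);

app.get("/proxy/image", async (req, res) => {
  let target;
  try {
    target = new URL(String(req.query.url));
  } catch {
    return res.status(400).end();
  }
  if (!ALLOWED_HOSTS.has(target.hostname)) return res.status(403).end();

  const upstream = await fetch(target);
  if (!upstream.ok) return res.status(502).end();

  res.setHeader("Content-Type", upstream.headers.get("content-type") || "image/png");
  res.setHeader("Cache-Control", "public, max-age=86400");
  res.send(Buffer.from(await upstream.arrayBuffer()));
});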
Practical prompts you can copy
System message snippet:
You are a helpful assistant in a chat UI. When any tool returns an image URL, do not print the URL in text. Instead, call the tool display_image with the appropriate { url, alt, title }. Then continue with a short textual summary of the image.
Developer instruction to the model (as an extra system or developer message):
When returning images, prefer the display_image tool or emit JSON matching the provided schema. Keep text responses concise and avoid duplicating image URLs in the transcript.
The bottom line
Models stream text. Your tools produce artifacts. To bridge the gap, establish a protocol that turns tool outputs into renderable UI events—via direct rendering, a display_image tool, or a structured JSON schema. The payoff is immediate: a smoother experience where users see images, not just links, without sacrificing the streaming feel that makes chat UIs so engaging.
At AI Tech Inspire, the most successful teams we observe treat their UI contract as a first-class API. Once that contract exists, adding new modalities—audio snippets, charts, even 3D assets—becomes a matter of adding new action types, not rewriting your chat logic.