
If a chat assistant generates an image but users only see a naked URL, the magic is lost. At AI Tech Inspire, we’ve been seeing more teams hit this exact snag when wiring up custom tools with GPT-style models: the model streams text beautifully, but image outputs arrive as plain links. The good news? This is fixable with a few pragmatic patterns that keep the stream flowing while your UI renders images natively.
What’s happening in this setup
- The app uses a streaming chat interface with an OpenAI model and tool/function calls.
- The standard flow: provide tools, receive a tool call, run code, send tool output back, and then the model returns a final response (or more tool calls).
- This works well for tools returning text data; the model’s final response streams cleanly to the user.
- One custom tool returns JSON containing both text and an image URL.
- Result: the model includes that image URL in its streamed text, so users see a link instead of an actual image.
- Goal: treat the image output as an image component in the UI, not just text.
Why models turn image results into URLs
Most chat completion models stream text and tool_calls. They don't stream image binaries or typed "image" blocks as assistant content by default. When a tool returns JSON with an image URL, the assistant tends to reference it in text unless given a clear protocol or a UI-aware tool. Without that protocol, your downstream renderer has no signal to switch from text streaming to an image component.
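For concreteness, a tool might hand back a payload like the one below (the field names are illustrative, not part of any particular API). Given no other instruction, the assistant will usually fold imageUrl straight into its prose as a clickable link:

{ "text": "Here is the logo you asked for.", "imageUrl": "https://example.com/logo.png" }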
“If you want images to render, give your UI the signal—don’t make the UI guess.”
Below are battle-tested ways to move from "link in text" to "image in chat" while keeping the streaming experience.
Solution blueprint: from textual link to visual artifact
1) Render tool output directly as UI artifacts
This is the simplest, highest-leverage change. When your app executes the tool and gets { text, imageUrl }, emit two UI events:
- A text bubble (if needed) using text.
- An image bubble component using imageUrl (plus caption/alt where appropriate).
Then, when posting the tool result back to the model, include a system nudge: "If a tool returns an image URL, do not echo the link; instead summarize what the image shows." This keeps the assistant's stream readable and avoids duplicate links.
This pattern treats tool outputs as first-class UI events—parallel to the model’s text stream. It’s a clean separation of concerns: the model orchestrates; your UI decides presentation.
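As a minimal sketch of this pattern, assuming a generic runTool helper and a ui object with appendText and renderImage methods (all placeholder names rather than any specific framework's API):

// Pattern 1 sketch: the app renders tool output itself, then hands the raw
// result back to the model. runTool and ui are placeholders for your own code.
async function executeToolAndRender(call, ui) {
  const result = await runTool(call.name, call.args); // e.g. { text, imageUrl }

  if (result.text) ui.appendText(result.text);          // text bubble
  if (result.imageUrl) ui.renderImage(result.imageUrl); // image bubble, shown immediately

  // The raw JSON still goes back to the model as the tool message so it can
  // summarize; the system nudge above keeps it from repeating the URL.
  return { role: "tool", tool_call_id: call.id, content: JSON.stringify(result) };
}

The key design choice: presentation never depends on what the model says next, because the image is already on screen by the time the narration streams in.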
2) Add a display_image tool the model must call
Create a no-op UI tool, something like:
{
  name: "display_image",
  description: "Render an image in the chat UI",
  parameters: {
    type: "object",
    properties: {
      url: { type: "string", format: "uri" },
      alt: { type: "string" },
      title: { type: "string" },
      width: { type: "number" },
      height: { type: "number" }
    },
    required: ["url"]
  }
}
In your system prompt, instruct: "When a tool returns an image, call display_image with the URL rather than printing the link." Your client intercepts this tool call and renders the image component. This approach keeps the model in the loop while preserving a structured, UI-friendly contract.
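Wired into the Chat Completions API, the definition above sits inside a { type: "function", function: {...} } wrapper, and the client intercepts the call instead of doing any server work. A rough sketch (the ui object is again a placeholder):

// display_image registered as a Chat Completions tool
const displayImageTool = {
  type: "function",
  function: {
    name: "display_image",
    description: "Render an image in the chat UI",
    parameters: {
      type: "object",
      properties: {
        url: { type: "string", format: "uri" },
        alt: { type: "string" }
      },
      required: ["url"]
    }
  }
};

// Client-side interception: just a render plus an acknowledgement back to the model
function handleDisplayImage(toolCall, ui) {
  // Arguments arrive as a JSON string once the tool call has finished streaming
  const args = JSON.parse(toolCall.function.arguments);
  ui.renderImage(args.url, args.alt);
  return { role: "tool", tool_call_id: toolCall.id, content: "Image rendered in the UI." };
}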
3) Use structured output to emit UI directives
Another option: request structured responses via a JSON schema (e.g., response_format set to a schema). The schema could include { type: "text" | "image", payload: {...} } blocks, such as:
{
  "type": "object",
  "properties": {
    "actions": {
      "type": "array",
      "items": {
        "oneOf": [
          { "type": "object", "properties": { "type": {"const": "text"}, "content": {"type": "string"} }, "required": ["type","content"] },
          { "type": "object", "properties": { "type": {"const": "image"}, "url": {"type": "string", "format": "uri"}, "alt": {"type": "string"} }, "required": ["type","url"] }
        ]
      }
    }
  },
  "required": ["actions"]
}
Your renderer consumes this stream of actions and decides whether to render a text bubble or an image. This is great for complex apps that need consistency across varied outputs.
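A sketch of the request side, assuming the schema above is stored in uiActionsSchema; exact response_format options vary across SDK versions, and this example skips streaming to keep the parsing simple:

// Structured-output sketch: the model returns { actions: [...] } matching the schema
const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages,
  response_format: {
    type: "json_schema",
    json_schema: { name: "ui_actions", schema: uiActionsSchema }
  }
});

const { actions } = JSON.parse(completion.choices[0].message.content);
for (const action of actions) {
  if (action.type === "text") ui.appendText(action.content);
  if (action.type === "image") ui.renderImage(action.url, action.alt);
}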
4) Multiplex your stream: text channel + event channel
Keep the token stream flowing for conversational text, but emit separate UI events (Server-Sent Events or WebSocket messages) when tool outputs are ready. Many developer stacks already support this. If you’re using the Vercel AI SDK, look into its tool/annotation streams; similar ideas exist across agent frameworks like LangChain and the ecosystem around Hugging Face.
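One way to sketch the server side of that multiplexing, assuming an Express-style app and a hypothetical runConversation helper that exposes token and tool-result callbacks:

// Multiplexed SSE channel: tokens and UI events share one connection but use
// different event names, so the client can route them to different components.
app.get("/chat/stream", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");

  const send = (event, data) =>
    res.write(`event: ${event}\ndata: ${JSON.stringify(data)}\n\n`);

  await runConversation(req.query.prompt, {
    onToken: (text) => send("token", { text }),
    onToolResult: (output) => {
      if (output.imageUrl) send("ui", { type: "image", url: output.imageUrl });
    }
  });

  res.end();
});

On the client, an EventSource listener for "token" appends text while a listener for "ui" mounts the image component.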
5) Fallback: URL auto-detection
As a safety net, detect image URLs in streamed text with a simple regex and render them as images inline. It’s not as clean as the approaches above, but it improves UX if the assistant forgets to call your display tool.
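A conservative sketch of that safety net; because streamed tokens can split a URL across chunks, you would typically buffer to a message or sentence boundary before matching:

// Fallback: pull image URLs out of completed text and render them as images
const IMAGE_URL_RE = /https?:\/\/\S+\.(?:png|jpe?g|gif|webp|svg)(?:\?\S*)?/gi;

function renderWithImageFallback(text, ui) {
  const urls = text.match(IMAGE_URL_RE) || [];
  for (const url of urls) ui.renderImage(url);
  // Strip the URLs from the visible transcript so users don't see both link and image
  ui.appendText(text.replace(IMAGE_URL_RE, "").trim());
}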
A minimal end-to-end flow (pseudocode)
// 1) Send user + system + tools (generateLogo and displayImage are tool
//    definitions in the { type: "function", function: {...} } shape)
const messages = [
  { role: "system", content: "If a tool returns an image, call display_image and summarize rather than echoing URLs." },
  { role: "user", content: "Create a logo and show it here." }
];
const res = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  stream: true,
  messages,
  tools: [generateLogo, displayImage]
});

// 2) Stream tokens and tool calls. Simplified: deltas live at
//    chunk.choices[0].delta, and tool-call arguments arrive in fragments
//    that must be accumulated before they can be parsed.
const pendingToolCalls = [];
for await (const chunk of res) {
  const delta = chunk.choices[0]?.delta;
  if (delta?.tool_calls) collectToolCallFragments(pendingToolCalls, delta.tool_calls); // merge by index (helper not shown)
  if (delta?.content) ui.appendText(delta.content);
}
await handleToolCalls(pendingToolCalls);

async function handleToolCalls(calls) {
  for (const call of calls) {
    const args = JSON.parse(call.function.arguments || "{}");

    if (call.function.name === "generate_logo") {
      const output = await generateLogoOnServer(args);
      // output: { imageUrl, caption }
      // UI: render immediately
      ui.renderImage(output.imageUrl, output.caption);

      // 3) Send the tool result back so the model can narrate or continue.
      //    The tool message must follow the assistant message that carried the call.
      messages.push({ role: "assistant", tool_calls: [call] });
      messages.push({ role: "tool", tool_call_id: call.id, content: JSON.stringify(output) });
      await openai.chat.completions.create({ model: "gpt-4o-mini", stream: true, messages });
    }

    if (call.function.name === "display_image") {
      ui.renderImage(args.url, args.title || args.alt);
    }
  }
}
Two essential details to notice:
- Render images as soon as your tool produces them—don’t wait for the assistant’s narrative.
- Be explicit in your system instructions so the model calls display_image (or sticks to a structured schema) instead of echoing raw URLs.
How this compares to other ecosystems
Agent frameworks often bake in the notion of artifacts. For example, some LangChain agents return observations that UIs can render differently, and the broader ecosystem around Hugging Face includes Spaces and demos where outputs are strongly typed (images, audio, text). In generative image flows (think Stable Diffusion servers behind a tool), developers usually pass back a URL or a signed path. The difference is UX discipline: define a protocol so the UI knows when to show an image. This is less about the model and more about the contract between your tool layer and the chat renderer.
Why this matters for developers
- Better UX: Users expect images to appear inline, not as links they have to click.
- Control: A UI-aware tool or schema lets you standardize rendering across features (images now, audio/video later).
- Scalability: As more tools emit non-text outputs, a structured protocol avoids brittle parsing and regex hacks.
It’s the same lesson many learned when moving from plain-text logs to structured logs: once there’s a schema, everything gets easier.
Security, performance, and production tips
- Sanitize and proxy: Don't embed arbitrary external URLs. Proxy through your server for CORS, auth, and content filtering; a sketch follows this list.
- Expiry-aware links: If using signed URLs, handle refresh and show placeholders on expiration.
- Alt text and captions: For accessibility and clarity, require alt in your display_image schema.
- Caching: Cache images (CDN) and consider a low-res preview for faster perceived performance.
- Robust fallbacks: If the assistant forgets your protocol, auto-detect image URLs and render gracefully.
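Here is the proxy idea from the first tip as a rough sketch, assuming an Express server, Node 18+ for the built-in fetch, and example hostnames in the allowlist:

// Allowlist-based image proxy: the browser only ever talks to your origin
const ALLOWED_HOSTS = new Set(["images.example.com", "cdn.example.com"]);

app.get("/proxy/image", async (req, res) => {
  let target;
  try {
    target = new URL(String(req.query.url));
  } catch {
    return res.status(400).end();
  }
  if (!ALLOWED_HOSTS.has(target.hostname)) return res.status(403).end();

  const upstream = await fetch(target);
  if (!upstream.ok) return res.status(502).end();

  res.setHeader("Content-Type", upstream.headers.get("content-type") || "image/png");
  res.setHeader("Cache-Control", "public, max-age=86400");
  res.send(Buffer.from(await upstream.arrayBuffer()));
});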
Practical prompts you can copy
System message snippet:
You are a helpful assistant in a chat UI. When any tool returns an image URL, do not print the URL in text. Instead, call the tool display_image with the appropriate { url, alt, title }. Then continue with a short textual summary of the image.
Developer instruction to the model (as an extra system or developer message):
When returning images, prefer the display_image tool or emit JSON matching the provided schema. Keep text responses concise and avoid duplicating image URLs in the transcript.
The bottom line
Models stream text. Your tools produce artifacts. To bridge the gap, establish a protocol that turns tool outputs into renderable UI events—via direct rendering, a display_image tool, or a structured JSON schema. The payoff is immediate: a smoother experience where users see images, not just links, without sacrificing the streaming feel that makes chat UIs so engaging.
At AI Tech Inspire, the most successful teams we observe treat their UI contract as a first-class API. Once that contract exists, adding new modalities—audio snippets, charts, even 3D assets—becomes a matter of adding new action types, not rewriting your chat logic.