Microsoft Lens in ComfyUI — Basic Video Guide

Overview

Microsoft Lens is presented as a small but capable text-to-image model. The video’s main point is that Lens, at only 3.8B parameters, can still produce detailed images at high resolutions and flexible aspect ratios inside ComfyUI.

Simple summary: Lens is not pitched as the biggest or most powerful image model. It is pitched as an efficient model that can create surprisingly detailed images for its size.

Key names, tools, and locations

Model and files

Model: Microsoft Lens
Model size: 3.8B parameters
Dataset mentioned: Lens-800M
Main download location: Comfy-Org/Lens on Hugging Face
Versions: normal Lens and Turbo Lens

Workflow/tools mentioned

ComfyUI — local node workflow
RunningHub — online ComfyUI workflow platform
UNETLoader — loads the main model
GPT-OSS — text encoder
FLUX2 VAE — VAE used in the workflow
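
As a rough illustration of how the three loading pieces above might appear in a ComfyUI API-format workflow (a JSON-like dict of nodes), here is a minimal sketch. The file names are hypothetical placeholders, since these notes do not list them. The use of the CLIPLoader and VAELoader nodes and the exact type string "lens" are assumptions based on the "CLIP type: Lens" setting mentioned in these notes, so check the dropdown values in your own ComfyUI build.

```python
# Sketch only: loader nodes in ComfyUI API (prompt) format.
# All file names below are hypothetical placeholders, not the real file names.
HYPOTHETICAL_UNET = "lens_model.safetensors"
HYPOTHETICAL_TEXT_ENCODER = "gpt_oss_text_encoder.safetensors"
HYPOTHETICAL_VAE = "flux2_vae.safetensors"


def build_loader_nodes(unet_name, clip_name, vae_name, clip_type="lens"):
    """Return a dict of loader nodes keyed by node id.

    clip_type="lens" is an assumed spelling of the "Lens" CLIP type.
    """
    return {
        "1": {
            "class_type": "UNETLoader",  # loads the main model
            "inputs": {"unet_name": unet_name, "weight_dtype": "default"},
        },
        "2": {
            "class_type": "CLIPLoader",  # assumed loader for the GPT-OSS text encoder
            "inputs": {"clip_name": clip_name, "type": clip_type},
        },
        "3": {
            "class_type": "VAELoader",  # assumed loader for the FLUX2 VAE
            "inputs": {"vae_name": vae_name},
        },
    }


loader_nodes = build_loader_nodes(
    HYPOTHETICAL_UNET, HYPOTHETICAL_TEXT_ENCODER, HYPOTHETICAL_VAE
)
```

Swapping in the Turbo model would only change the unet_name value in this sketch; the text encoder and VAE entries stay the same.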

Basic ComfyUI settings mentioned

Use a recent ComfyUI build, because Lens support appears in newer versions.
Set CLIP type to Lens; otherwise encoding may not work correctly.
Normal Lens: about 20 sampling steps.
Turbo Lens: about 4 steps for faster preview-style output.
Sampler/scheduler used: Euler sampler and simple scheduler.
Example CFG: 5.0.
Example denoise: 1.0.
Square test resolution: 1440 × 1440.
Vertical test resolution: 1024 × 1536.
Important nodes to watch: ModelSamplingFlux and CFGNorm.
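
The sampling values above can be kept in one place as presets. The sketch below is a hypothetical helper, not something shown in the video: it stores the normal and Turbo step counts alongside the example CFG, denoise, sampler, and scheduler, and returns them under the input names used by ComfyUI's KSampler node. The notes give only one example CFG, so reusing 5.0 for Turbo is an assumption; the node links (model, positive, negative, latent_image) are left out.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SamplerSettings:
    steps: int
    cfg: float = 5.0  # example CFG from the notes; assumed for both variants
    sampler_name: str = "euler"
    scheduler: str = "simple"
    denoise: float = 1.0


# Step counts are the approximate values mentioned in the notes.
PRESETS = {
    "normal": SamplerSettings(steps=20),
    "turbo": SamplerSettings(steps=4),
}


def ksampler_inputs(variant, seed=0):
    """Return KSampler-style inputs for "normal" or "turbo" (links omitted)."""
    if variant not in PRESETS:
        raise ValueError(f"unknown variant: {variant!r}")
    inputs = asdict(PRESETS[variant])
    inputs["seed"] = seed
    return inputs


normal_inputs = ksampler_inputs("normal", seed=42)
turbo_inputs = ksampler_inputs("turbo", seed=42)
```

Keeping the presets frozen makes it harder to mix up the two variants by accident, for example running the normal model at 4 steps.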

Easy-to-miss point: when changing image size, do not change only EmptyLatentImage. ModelSamplingFlux also takes width and height, so the workflow should keep both nodes on the same values.
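
One way to avoid a size mismatch when editing a workflow outside the graph editor is to write the size into both nodes in one step. The following is a minimal sketch over a ComfyUI API-format dict; the helper names and the tiny example workflow are hypothetical, and it assumes both nodes expose inputs called width and height. If your workflow uses a different latent node, add its class name to SIZE_NODES.

```python
# Node classes that carry width/height in this sketch (per the notes above).
SIZE_NODES = {"EmptyLatentImage", "ModelSamplingFlux"}


def set_image_size(workflow, width, height):
    """Write width/height into every size-carrying node; return updated ids."""
    updated = []
    for node_id, node in workflow.items():
        if node.get("class_type") in SIZE_NODES:
            node["inputs"]["width"] = width
            node["inputs"]["height"] = height
            updated.append(node_id)
    return updated


def distinct_sizes(workflow):
    """Return the set of (width, height) pairs found across size nodes."""
    return {
        (node["inputs"]["width"], node["inputs"]["height"])
        for node in workflow.values()
        if node.get("class_type") in SIZE_NODES
    }


# Hypothetical two-node fragment that starts at the square test size.
example_workflow = {
    "10": {"class_type": "EmptyLatentImage",
           "inputs": {"width": 1440, "height": 1440, "batch_size": 1}},
    "11": {"class_type": "ModelSamplingFlux",
           "inputs": {"width": 1440, "height": 1440}},
}

# Switch to the vertical test resolution in both places at once.
set_image_size(example_workflow, 1024, 1536)
is_consistent = len(distinct_sizes(example_workflow)) == 1
```

Inside the ComfyUI graph editor, the same effect comes from feeding both nodes from shared width and height values rather than typing the numbers twice.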

Test scenes and locations from the video

Realistic photo test: a small independent watch repair shop at midnight, with an old watchmaker, tiny golden gears, rain outside, and red/blue neon reflections.
Chinese prompt/location test: a rainy night in Chongqing, China, with wet stairs, old residential buildings, street food stalls, lanterns, neon signs, a delivery rider, an electric scooter, river lights, and ferries.
English text test: a travel notebook cover with readable title text, hand-drawn map elements, seashells, watercolor tape, and sunlight.
Product text test: a black perfume bottle with a minimal label, reflective tabletop, and soft spotlighting.
Fantasy large-scene test: an ancient floating harbor above the clouds with wooden docks, sailing ships, mechanical cranes, travelers, and whale-like airships.
Sci-fi concept art test: a futuristic data cathedral with magnetic panels, glowing data pillars, glass floors, engineers in black robes, holograms, and optical fibers.
Object-count test: an overhead desktop scene with exactly four pencils, two ceramic cups, one silver laptop, and one square sketchbook.

Main takeaways

Lens seems strong at detail density and long prompt understanding for a small model.
It handles non-square aspect ratios well, so workflows are not limited to square output.
English text generation is described as strong.
Chinese prompts can work for overall atmosphere, but English prompts are still recommended when possible.
Chinese text inside generated images is not reliable.
Exact object counts are still unreliable, which is common for text-to-image models.
The normal model is slower but higher quality; Turbo is the faster option.

Sources

Primary YouTube video — Veteran AI overview and tests of Microsoft Lens in ComfyUI.
Comfy-Org/Lens model files — Main model, Turbo model, text encoder, and VAE resources referenced in the video.
RunningHub — Online ComfyUI workflow platform mentioned in the video.
Example workflow JSON — Workflow file linked in the video description.

Source notes sidecar: microsoft-lens-comfyui-basic-guide-2026-05-28.sources.md