Autonomous Image Editing
Where Aesthetic Intent Becomes Visual Reality
1Multimedia Laboratory, The Chinese University of Hong Kong 2Shanghai AI Laboratory
3University of Science and Technology of China 4Institute of Science Tokyo
5CPII under InnoHK
PhotoAgent turns photos into professionally edited results through exploratory visual aesthetic planning—no step-by-step prompts required. One goal, one click, autonomous enhancement.
Editing, Reimagined
Traditional photo editing demands expertise, endless parameter tuning, and exhausting trial-and-error. PhotoAgent replaces this fragile human-in-the-loop pipeline with an autonomous agent-in-the-loop system — one that perceives, plans, executes, and evaluates like a seasoned professional.
From instruction-following editors to an autonomous editing agent that plans, executes, and evaluates—aligned with human aesthetics.
Closed-loop planning: perceive–plan–execute–evaluate cycle with memory and visual feedback. No open-loop, single-shot edits.
MCTS-based planner explores editing trajectories and avoids short-sighted or irreversible decisions.
Any existing editing tool can serve as a building block — GPT-Image-1, Flux.1 Kontext, Step1X-Edit, Nano Banana, ZImage, and more. The agent selects and orchestrates the best one for each step.
UGC-Edit dataset and reward model for user-generated photos—evaluation that matches real user preferences.
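The MCTS planner's selection step can be illustrated with the standard UCT rule. This is a generic sketch: the `Node` fields, the exploration constant `c`, and the function names are assumptions for illustration, not the paper's implementation.

```python
import math

class Node:
    """A candidate editing action in the search tree (illustrative fields)."""
    def __init__(self, action):
        self.action = action       # e.g. "increase contrast", "sky replacement"
        self.visits = 0
        self.total_score = 0.0     # sum of aesthetic scores from simulated edits
        self.children = []

def uct_select(parent, c=1.4):
    """Pick the child balancing exploitation (mean score) and exploration."""
    def uct(child):
        if child.visits == 0:
            return float("inf")    # always try unvisited actions first
        mean = child.total_score / child.visits
        explore = c * math.sqrt(math.log(parent.visits) / child.visits)
        return mean + explore
    return max(parent.children, key=uct)

def backpropagate(path, score):
    """After simulating and scoring an edit, update every node on the path."""
    for node in path:
        node.visits += 1
        node.total_score += score
```

Because unvisited actions score infinitely high, the planner is forced to consider every proposed edit at least once before committing — which is how MCTS avoids the short-sighted, irreversible choices mentioned above.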
Four core components form a single closed-loop system — perceiving, planning, executing, and evaluating in continuous cycles until the optimal result emerges.
Vision-Language Model
VLM (e.g. Qwen3-VL) interprets the image and proposes K diverse, atomic editing actions. Supports fully autonomous or user-guided editing.

Monte Carlo Tree Search
MCTS explores candidate actions via selection, expansion, simulation, and backpropagation. Selects top-K actions for execution.

Tool Orchestration
Runs selected actions with traditional or generative tools. Retains the highest-scoring result as the next state.

Multi-metric Scoring
Ensemble of no-reference metrics, CLIP-based and instruction-following scores, plus UGC reward model. Drives re-planning when needed.

PhotoAgent maintains a full editing history across every iteration. Before proposing new actions, the Perceiver reviews all previously applied operations — preventing redundant edits, avoiding conflicting adjustments, and ensuring each step moves the image toward a genuinely better state.
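The cycle above can be sketched as a control loop. Everything here is a toy stand-in: the image is reduced to a single quality number, and `perceive`, `execute`, and `score` are placeholder stubs for the VLM, the tool backends, and the multi-metric scorer — purely to show the flow of memory, top-K execution, and early stopping.

```python
def perceive(state, memory):
    """Stand-in for the VLM: propose atomic actions not yet applied."""
    all_actions = {"color_grade": 1.0, "denoise": 0.5, "crop": 0.25}
    return {a: g for a, g in all_actions.items() if a not in memory}

def execute(state, action, gain):
    """Stand-in for a tool backend: apply the edit, return the new state."""
    return state + gain

def score(state):
    """Stand-in for the multi-metric scorer (NR metrics + reward model)."""
    return min(state, 5.0)

def edit_loop(quality, max_rounds=5, k=2, target=4.5):
    memory = []                                   # editing history across rounds
    state, best = quality, score(quality)
    for _ in range(max_rounds):
        proposals = perceive(state, memory)       # perceive: review image + memory
        top_k = sorted(proposals.items(), key=lambda kv: -kv[1])[:k]  # plan
        candidates = []
        for action, gain in top_k:                # execute the top-K candidates
            new_state = execute(state, action, gain)
            candidates.append((score(new_state), new_state, action))
        if not candidates:
            break
        s, new_state, action = max(candidates)    # evaluate: keep the best result
        if s <= best:                             # no improvement: stop
            break
        state, best = new_state, s
        memory.append(action)                     # remember the applied operation
        if best >= target:                        # good enough: finish early
            break
    return state, memory
```

A real run would carry actual image tensors and invoke the MCTS planner inside the planning step; the control flow — memory consulted before planning, only the best candidate retained per round — is the point here.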
Not all photos are created equal. PhotoAgent first classifies each input into a fine-grained scene category, then activates a scene-specific editing strategy — selecting the most appropriate tools, adjusting aesthetic targets, and tailoring evaluation criteria to match the unique characteristics of each scene type.
Skin retouching, facial lighting, background bokeh, and expression-preserving enhancements tailored for human subjects.
Sky replacement, color vibrancy, dynamic range expansion, and atmospheric depth for outdoor scenes.
Perspective correction, structural clarity, night scene lighting, and geometric detail enhancement for cityscapes.
Color saturation, texture sharpening, warm tone grading, and appetizing presentation for close-up subjects.
Noise reduction, exposure recovery, tone mapping, and light source enhancement for challenging lighting conditions.
White balance correction, shadow fill, detail enhancement, and ambient lighting adjustment for interior scenes.
Each scene category activates a specialized prompt template and tool preference profile, ensuring the agent's editing strategy is never generic — it is always adapted to the content at hand.
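Scene-conditioned routing might look like the following sketch. The profile contents, prompt strings, and the keyword-based `classify_scene` stub are illustrative assumptions standing in for the VLM classifier and the real prompt templates.

```python
# Per-scene editing profiles: preferred operations plus a prompt template.
# Category names follow the section above; contents are illustrative.
PROFILES = {
    "portrait":  {"ops": ["skin_retouch", "face_relight", "background_bokeh"],
                  "prompt": "enhance the subject; preserve identity and expression"},
    "landscape": {"ops": ["sky_replace", "vibrance", "hdr_expand"],
                  "prompt": "deepen atmosphere and dynamic range"},
    "urban":     {"ops": ["perspective_fix", "clarity", "night_relight"],
                  "prompt": "straighten geometry; sharpen structure"},
    "food":      {"ops": ["saturation", "texture_sharpen", "warm_grade"],
                  "prompt": "make the dish look appetizing"},
    "low_light": {"ops": ["denoise", "exposure_recover", "tone_map"],
                  "prompt": "recover detail without washing out light sources"},
    "indoor":    {"ops": ["white_balance", "shadow_fill", "ambient_adjust"],
                  "prompt": "neutralize color casts; lift shadows"},
}

def classify_scene(description: str) -> str:
    """Toy stand-in for the VLM classifier (keyword match, not a real model)."""
    keywords = {"portrait": ["person", "face"],
                "food": ["dish", "meal"],
                "low_light": ["night", "dark"]}
    for scene, words in keywords.items():
        if any(w in description for w in words):
            return scene
    return "landscape"

def strategy_for(description: str) -> dict:
    """Classify the photo, then return its scene-specific editing profile."""
    scene = classify_scene(description)
    return {"scene": scene, **PROFILES[scene]}
```

The lookup makes the key property concrete: two photos never share a generic recipe — each first resolves to a category, and only that category's tools, prompt, and targets are activated.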
State-of-the-art on instruction adherence and visual quality; preferred in user studies.
Rather than applying a fixed recipe, PhotoAgent reasons about the aesthetic intent behind each image and autonomously selects the most suitable operations. It respects the photographer's original narrative while unlocking creative transformations that elevate visual quality — all without explicit step-by-step instructions.
Intelligent cropping and framing guided by the rule of thirds, leading lines, and subject emphasis — tightening the visual story.
Selectively blurs backgrounds or foregrounds to create cinematic depth-of-field, directing attention to what matters most.
Refines white balance, contrast curves, and color grading to evoke the right mood — warm golden-hour warmth, cool twilight serenity, or vivid punch.
Subtly enriches scene elements — adding reflections, enhancing textures, or refining details — to amplify the image's storytelling power.
Identifies and removes visual clutter — stray objects, blemishes, unwanted passersby — leaving a cleaner, more focused composition.
Injects environmental mood — light rays, mist, motion trails, or luminous particles — breathing life and emotion into static scenes.
Every edit is guided by a single principle: preserve the photographer's intent while maximizing aesthetic quality. The agent decides what to do and how far to push each operation — no two photos receive the same treatment.
Gallery (interactive on the live page): click any image to view it at full size, and step through or auto-play the iteration viewer to see how PhotoAgent refines the image each round.
Existing image quality metrics are designed for generic images and ill-suited for user-generated photos. We introduce a comprehensive evaluation framework comprising a UGC-specific preference dataset, a learned aesthetic reward model, and a real-world editing benchmark.
Pipeline for constructing UGC-Edit and training the reward model. Source images from LAION and RealQA are classified by Qwen3-VL, filtered by human annotators, and used to train a reward model via GRPO for fine-grained aesthetic scoring.
~7,000 authentic user-generated photos sourced from LAION Aesthetic and RealQA. A two-stage filtering process—VLM-based classification followed by manual verification—retains only genuine UGC. All aesthetic scores are normalized to a unified 1–5 scale.
Initialized from a pretrained VLM (Qwen2.5-VL) and optimized with Group Relative Policy Optimization (GRPO). Learns from relative rankings within image groups, improving robustness to annotation noise and capturing subtle aesthetic cues.
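The core of GRPO is a group-relative advantage: each sample's reward is compared against its group's mean and normalized by the group's spread, so training depends on relative rankings rather than absolute scores — which is what makes it robust to annotation noise. A minimal sketch of that normalization (simplified; not the paper's training code):

```python
def group_relative_advantages(scores, eps=1e-8):
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one group.

    Samples scored above the group mean get positive advantages, those
    below get negative ones; shifting every score by a constant changes
    nothing, so only the ranking within the group matters.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n   # population variance
    std = var ** 0.5
    return [(s - mean) / (std + eps) for s in scores]
```

In training, these advantages weight the policy-gradient update for each image in the group, steering the reward model toward the annotators' preference ordering.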
1,017 real-world photographs covering portraits, landscapes, urban scenes, food, low-light, and more. Provides a diverse and challenging testbed for end-to-end evaluation of autonomous photo editing.
Project page: https://github.com/mdyao/PhotoAgent