Autonomous Image Editing
Where Aesthetic Intent Becomes Visual Reality
1Multimedia Laboratory, The Chinese University of Hong Kong 2Shanghai AI Laboratory
3University of Science and Technology of China 4Institute of Science Tokyo
5CPII under InnoHK
PhotoAgent turns photos into professionally edited results through exploratory visual aesthetic planning—no step-by-step prompts required. One goal, one click, autonomous enhancement.
Editing, Reimagined
Traditional photo editing demands expertise, endless parameter tuning, and exhausting trial-and-error. PhotoAgent replaces this fragile human-in-the-loop pipeline with an autonomous agent-in-the-loop system — one that perceives, plans, executes, and evaluates like a seasoned professional.
From instruction-following editors to an autonomous editing agent that plans, executes, and evaluates—aligned with human aesthetics.
Closed-loop planning: perceive–plan–execute–evaluate cycle with memory and visual feedback. No open-loop, single-shot edits.
MCTS-based planner explores editing trajectories and avoids short-sighted or irreversible decisions.
Any existing editing tool can serve as a building block — GPT-Image-1, Flux.1 Kontext, Step1X-Edit, Nano Banana, ZImage, and more. The agent selects and orchestrates the best one for each step.
UGC-Edit dataset and reward model for user-generated photos—evaluation that matches real user preferences.
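The MCTS planner's selection step can be illustrated with the standard UCT rule. This is a generic sketch: the `Node` fields, the exploration constant `c`, and the function names are assumptions for illustration, not the paper's implementation.

```python
import math

class Node:
    """A candidate editing action in the search tree (illustrative fields)."""
    def __init__(self, action):
        self.action = action       # e.g. "increase contrast", "sky replacement"
        self.visits = 0
        self.total_score = 0.0     # sum of aesthetic scores from simulated edits
        self.children = []

def uct_select(parent, c=1.4):
    """Pick the child balancing exploitation (mean score) and exploration."""
    def uct(child):
        if child.visits == 0:
            return float("inf")    # always try unvisited actions first
        mean = child.total_score / child.visits
        explore = c * math.sqrt(math.log(parent.visits) / child.visits)
        return mean + explore
    return max(parent.children, key=uct)

def backpropagate(path, score):
    """After simulating and scoring an edit, update every node on the path."""
    for node in path:
        node.visits += 1
        node.total_score += score
```

Because unvisited actions score infinitely high, the planner is forced to consider every proposed edit at least once before committing — which is how MCTS avoids the short-sighted, irreversible choices mentioned above.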
Four core components form a single closed-loop system — perceiving, planning, executing, and evaluating in continuous cycles until the optimal result emerges.
Vision-Language Model
VLM (e.g. Qwen3-VL) interprets the image and proposes K diverse, atomic editing actions. Supports fully autonomous or user-guided editing.

Monte Carlo Tree Search
MCTS explores candidate actions via selection, expansion, simulation, and backpropagation. Selects top-K actions for execution.

Tool Orchestration
Runs selected actions with traditional or generative tools. Retains the highest-scoring result as the next state.

Multi-metric Scoring
Ensemble of no-reference metrics, CLIP-based and instruction-following scores, plus UGC reward model. Drives re-planning when needed.

PhotoAgent maintains a full editing history across every iteration. Before proposing new actions, the Perceiver reviews all previously applied operations — preventing redundant edits, avoiding conflicting adjustments, and ensuring each step moves the image toward a genuinely better state.
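The cycle above can be sketched as a control loop. Everything here is a toy stand-in: the image is reduced to a single quality number, and `perceive`, `execute`, and `score` are placeholder stubs for the VLM, the tool backends, and the multi-metric scorer — purely to show the flow of memory, top-K execution, and early stopping.

```python
def perceive(state, memory):
    """Stand-in for the VLM: propose atomic actions not yet applied."""
    all_actions = {"color_grade": 1.0, "denoise": 0.5, "crop": 0.25}
    return {a: g for a, g in all_actions.items() if a not in memory}

def execute(state, action, gain):
    """Stand-in for a tool backend: apply the edit, return the new state."""
    return state + gain

def score(state):
    """Stand-in for the multi-metric scorer (NR metrics + reward model)."""
    return min(state, 5.0)

def edit_loop(quality, max_rounds=5, k=2, target=4.5):
    memory = []                                   # editing history across rounds
    state, best = quality, score(quality)
    for _ in range(max_rounds):
        proposals = perceive(state, memory)       # perceive: review image + memory
        top_k = sorted(proposals.items(), key=lambda kv: -kv[1])[:k]  # plan
        candidates = []
        for action, gain in top_k:                # execute the top-K candidates
            new_state = execute(state, action, gain)
            candidates.append((score(new_state), new_state, action))
        if not candidates:
            break
        s, new_state, action = max(candidates)    # evaluate: keep the best result
        if s <= best:                             # no improvement: stop
            break
        state, best = new_state, s
        memory.append(action)                     # remember the applied operation
        if best >= target:                        # good enough: finish early
            break
    return state, memory
```

A real run would carry actual image tensors and invoke the MCTS planner inside the planning step; the control flow — memory consulted before planning, only the best candidate retained per round — is the point here.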
Not all photos are created equal. PhotoAgent first classifies each input into a fine-grained scene category, then activates a scene-specific editing strategy — selecting the most appropriate tools, adjusting aesthetic targets, and tailoring evaluation criteria to match the unique characteristics of each scene type.
Skin retouching, facial lighting, background bokeh, and expression-preserving enhancements tailored for human subjects.
Sky replacement, color vibrancy, dynamic range expansion, and atmospheric depth for outdoor scenes.
Perspective correction, structural clarity, night scene lighting, and geometric detail enhancement for cityscapes.
Color saturation, texture sharpening, warm tone grading, and appetizing presentation for close-up subjects.
Noise reduction, exposure recovery, tone mapping, and light source enhancement for challenging lighting conditions.
White balance correction, shadow fill, detail enhancement, and ambient lighting adjustment for interior scenes.
Each scene category activates a specialized prompt template and tool preference profile, ensuring the agent's editing strategy is never generic — it is always adapted to the content at hand.
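Scene-conditioned routing might look like the following sketch. The profile contents, prompt strings, and the keyword-based `classify_scene` stub are illustrative assumptions standing in for the VLM classifier and the real prompt templates.

```python
# Per-scene editing profiles: preferred operations plus a prompt template.
# Category names follow the section above; contents are illustrative.
PROFILES = {
    "portrait":  {"ops": ["skin_retouch", "face_relight", "background_bokeh"],
                  "prompt": "enhance the subject; preserve identity and expression"},
    "landscape": {"ops": ["sky_replace", "vibrance", "hdr_expand"],
                  "prompt": "deepen atmosphere and dynamic range"},
    "urban":     {"ops": ["perspective_fix", "clarity", "night_relight"],
                  "prompt": "straighten geometry; sharpen structure"},
    "food":      {"ops": ["saturation", "texture_sharpen", "warm_grade"],
                  "prompt": "make the dish look appetizing"},
    "low_light": {"ops": ["denoise", "exposure_recover", "tone_map"],
                  "prompt": "recover detail without washing out light sources"},
    "indoor":    {"ops": ["white_balance", "shadow_fill", "ambient_adjust"],
                  "prompt": "neutralize color casts; lift shadows"},
}

def classify_scene(description: str) -> str:
    """Toy stand-in for the VLM classifier (keyword match, not a real model)."""
    keywords = {"portrait": ["person", "face"],
                "food": ["dish", "meal"],
                "low_light": ["night", "dark"]}
    for scene, words in keywords.items():
        if any(w in description for w in words):
            return scene
    return "landscape"

def strategy_for(description: str) -> dict:
    """Classify the photo, then return its scene-specific editing profile."""
    scene = classify_scene(description)
    return {"scene": scene, **PROFILES[scene]}
```

The lookup makes the key property concrete: two photos never share a generic recipe — each first resolves to a category, and only that category's tools, prompt, and targets are activated.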
State-of-the-art on instruction adherence and visual quality; preferred in user studies.
Rather than applying a fixed recipe, PhotoAgent reasons about the aesthetic intent behind each image and autonomously selects the most suitable operations. It respects the photographer's original narrative while unlocking creative transformations that elevate visual quality — all without explicit step-by-step instructions.
Intelligent cropping and framing guided by the rule of thirds, leading lines, and subject emphasis — tightening the visual story.
Selectively blurs backgrounds or foregrounds to create cinematic depth-of-field, directing attention to what matters most.
Refines white balance, contrast curves, and color grading to evoke the right mood — warm golden-hour warmth, cool twilight serenity, or vivid punch.
Subtly enriches scene elements — adding reflections, enhancing textures, or refining details — to amplify the image's storytelling power.
Identifies and removes visual clutter — stray objects, blemishes, unwanted passersby — leaving a cleaner, more focused composition.
Injects environmental mood — light rays, mist, motion trails, or luminous particles — breathing life and emotion into static scenes.
Every edit is guided by a single principle: preserve the photographer's intent while maximizing aesthetic quality. The agent decides what to do and how far to push each operation — no two photos receive the same treatment.
Gallery (interactive on the live page): click any image to view it at full size, and step through or auto-play the iteration viewer to see how PhotoAgent refines the image each round.
Existing image quality metrics are designed for generic images and ill-suited for user-generated photos. We introduce a comprehensive evaluation framework comprising a UGC-specific preference dataset, a learned aesthetic reward model, and a real-world editing benchmark.
Pipeline for constructing UGC-Edit and training the reward model. Source images from LAION and RealQA are classified by Qwen3-VL, filtered by human annotators, and used to train a reward model via GRPO for fine-grained aesthetic scoring.
~7,000 authentic user-generated photos sourced from LAION Aesthetic and RealQA. A two-stage filtering process—VLM-based classification followed by manual verification—retains only genuine UGC. All aesthetic scores are normalized to a unified 1–5 scale.
Initialized from a pretrained VLM (Qwen2.5-VL) and optimized with Group Relative Policy Optimization (GRPO). Learns from relative rankings within image groups, improving robustness to annotation noise and capturing subtle aesthetic cues.
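The core of GRPO is a group-relative advantage: each sample's reward is compared against its group's mean and normalized by the group's spread, so training depends on relative rankings rather than absolute scores — which is what makes it robust to annotation noise. A minimal sketch of that normalization (simplified; not the paper's training code):

```python
def group_relative_advantages(scores, eps=1e-8):
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one group.

    Samples scored above the group mean get positive advantages, those
    below get negative ones; shifting every score by a constant changes
    nothing, so only the ranking within the group matters.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n   # population variance
    std = var ** 0.5
    return [(s - mean) / (std + eps) for s in scores]
```

In training, these advantages weight the policy-gradient update for each image in the group, steering the reward model toward the annotators' preference ordering.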
1,017 real-world photographs covering portraits, landscapes, urban scenes, food, low-light, and more. Provides a diverse and challenging testbed for end-to-end evaluation of autonomous photo editing.
Project page: https://github.com/mdyao/PhotoAgent