FronTalk

Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

1University of California, Los Angeles,   2Meta,   3Duke University

Introduction

We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. We curate 100 multi-turn dialogues derived from real-world websites. Each turn features a textual instruction and an equivalent visual instruction, both expressing the same user intent. For evaluation, we propose a novel agent-based framework in which a web agent simulates users exploring the website, measuring both functional correctness and user experience.

Evaluation of 20 models reveals two key challenges: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs).

To tackle the forgetting issue, we propose a strong baseline, AceCoder, which uses an autonomous web agent to critique the implementation of every past instruction. This approach reduces forgetting to nearly zero and improves performance by up to 9.3 points (56.0%→65.3%).

Leaderboard


Single = single-turn; Text / Visual = multi-turn with textual / visual instructions. Models are ranked by multi-turn textual PR (🏅); "–" marks text-only models, which are not evaluated with visual instructions.

| # | Model | Single PR↑ | Single UX↑ | Text 🏅PR↑ | Text FR↓ | Text UX↑ | Visual PR↑ | Visual FR↓ | Visual UX↑ |
|---|-------|-----------|-----------|-----------|---------|---------|-----------|-----------|-----------|
| 1 | Gemini-2.5-Pro 🥇 | 57.2 | 56.8 | 75.0 | 4.5 | 71.0 | 68.7 | 6.2 | 73.8 |
| 2 | Qwen3-Coder-480B 🥈 | 67.0 | 64.0 | 62.5 | 15.4 | 62.3 | – | – | – |
| 3 | GPT-OSS-20B 🥉 | 64.3 | 68.5 | 61.6 | 15.8 | 71.5 | – | – | – |
| 4 | Qwen3-Coder-30B | 54.1 | 49.8 | 61.0 | 8.6 | 54.3 | – | – | – |
| 5 | Qwen3-8B | 54.3 | 45.8 | 59.9 | 13.9 | 57.8 | – | – | – |
| 6 | Qwen3-235B | 62.9 | 57.8 | 59.0 | 20.6 | 61.3 | – | – | – |
| 7 | GPT-OSS-120B | 68.0 | 73.5 | 57.9 | 9.2 | 66.8 | – | – | – |
| 8 | GPT-4o | 51.4 | 50.0 | 56.0 | 21.4 | 55.0 | 55.0 | 8.0 | 52.5 |
| 9 | Qwen3-VL-30B | 54.6 | 50.0 | 54.1 | 4.7 | 50.3 | 39.1 | 12.3 | 35.8 |
| 10 | Qwen2.5-VL-72B | 43.0 | 33.8 | 52.8 | 17.8 | 46.5 | 44.6 | 15.4 | 37.3 |
| 11 | Claude-4-Sonnet | 77.5 | 71.8 | 45.5 | 22.8 | 60.8 | 59.3 | 10.6 | 64.8 |
| 12 | Gemma3-12B | 47.1 | 43.5 | 43.8 | 30.3 | 43.0 | 30.4 | 22.1 | 38.5 |
| 13 | Gemma3-27B | 36.7 | 37.2 | 37.2 | 42.1 | 39.5 | 35.5 | 16.7 | 41.0 |
| 14 | Ovis2.5-9B | 51.5 | 49.3 | 35.3 | 12.2 | 34.3 | 27.0 | 10.5 | 39.5 |
| 15 | GLM-4.5V-108B | 24.4 | 23.5 | 35.2 | 9.8 | 40.5 | 7.9 | 27.7 | 7.8 |
| 16 | Qwen2.5-VL-32B | 44.5 | 38.0 | 33.2 | 46.9 | 37.5 | 36.9 | 25.7 | 36.5 |
| 17 | Llama-3.3-70B | 38.4 | 35.8 | 28.9 | 46.3 | 40.8 | – | – | – |
| 18 | DeepSeek-R1-685B | 60.0 | 53.5 | 24.7 | 34.5 | 40.0 | – | – | – |
| 19 | Qwen2.5-VL-7B | 24.5 | 19.5 | 23.7 | 28.2 | 26.8 | 16.1 | 39.3 | 17.0 |
| 20 | GLM-4.1V-Thinking-9B | 36.6 | 32.3 | 18.2 | 5.8 | 41.5 | 15.3 | 4.3 | 32.0 |

🚨 To submit your results to the leaderboard, please send a zip file of your outputs to this email.

Task and Dataset

FronTalk aims to pioneer the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. Each conversation in the dataset consists of a sequence of user intents, and each intent is paired with a set of test cases. During inference, at each turn, a user simulator transforms the static intent into a context-aware instruction (textual or visual) grounded in the website generated in prior turns.
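As a concrete (hypothetical) illustration, the helper below assembles the kind of chat request a simulator LLM could receive: the static intent plus the current site, with the visual variant attaching an image that conveys the same intent. All names, the model id, and the prompt wording here are placeholders, not the benchmark's actual simulator.

```python
import base64

def build_turn_request(intent: str, current_html: str, modality: str,
                       screenshot_png: bytes | None = None) -> dict:
    """Assemble a chat request asking a simulator LLM to voice a static
    intent as a follow-up instruction grounded in the current website."""
    system = ("You are simulating a website owner giving feedback to a "
              "developer. Rephrase the intent as a natural follow-up "
              "request that refers to the current page.")
    user_parts = [{"type": "text",
                   "text": f"Current site HTML:\n{current_html}\n\nIntent: {intent}"}]
    if modality == "visual" and screenshot_png is not None:
        # For visual turns, the same intent is conveyed through an image
        # (e.g., an annotated screenshot or a reference mock-up).
        b64 = base64.b64encode(screenshot_png).decode()
        user_parts.append({"type": "image_url",
                           "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return {"model": "simulator-llm",  # placeholder model id
            "messages": [{"role": "system", "content": system},
                         {"role": "user", "content": user_parts}]}
```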

We curate 100 conversations containing 1,000 turns of user intents from real websites in the C4 dataset, spanning diverse domains such as e-commerce platforms, financial service sites, and digital art portfolios. The test cases are first automatically generated by LLMs and then manually refined, yielding 3,676 test cases in total. The dataset is available on Hugging Face.

Figure: Inference pipeline.

Figure: Website topic distribution.

To evaluate the quality of the generated websites, we compute two metrics: (1) pass rate (PR), which measures instruction following and is calculated against the predefined test cases; and (2) usability (UX), which measures user experience, defined as how easily a new user can learn to navigate the website and accomplish tasks. To quantify the forgetting issue, we additionally compute a forgetting rate (FR): how much functionality correctly implemented in prior turns is lost in the final output.
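The sketch below shows one way PR and FR could be computed from per-turn test-case outcomes. It is our reading of the definitions above, not the official scoring code; UX is omitted because it comes from the agent's judgment rather than a closed-form formula. `per_turn_results[t][c]` holds whether test case `c` of turn `t` passed immediately after turn `t`, and `final_results` holds the same cases re-run against the final website.

```python
def pass_rate(final_results: list[list[bool]]) -> float:
    """PR: fraction of all test cases that the final website passes."""
    flat = [ok for turn in final_results for ok in turn]
    return 100 * sum(flat) / len(flat)

def forgetting_rate(per_turn_results: list[list[bool]],
                    final_results: list[list[bool]]) -> float:
    """FR: among test cases that passed right after their own turn,
    the fraction that no longer pass in the final output."""
    passed_then, lost = 0, 0
    for t, turn_cases in enumerate(per_turn_results):
        for c, ok_then in enumerate(turn_cases):
            if ok_then:
                passed_then += 1
                if not final_results[t][c]:
                    lost += 1  # feature worked at turn t but was overwritten
    return 100 * lost / passed_then if passed_then else 0.0
```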

We employ an interactive web agent that evaluates the website through direct interaction. For pass rate, it acts as an informed expert, using the full conversation context to verify each test case. For usability, it simulates a naive user who explores the site and attempts self-proposed tasks. We adapt the WebVoyager framework and extend it with a suite of image-manipulation tools to improve the agent's visual perception.
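As a rough illustration, the loop below drives a browser with Playwright in the spirit of WebVoyager. `ask_vlm` is a placeholder for the model call that chooses the next action, and the action schema is our assumption, not the benchmark's exact implementation.

```python
from playwright.sync_api import sync_playwright

def ask_vlm(screenshot: bytes, goal: str) -> dict:
    """Placeholder: send the screenshot and goal to a VLM and parse its
    JSON reply, e.g. {"action": "click", "selector": "#buy"} or
    {"action": "answer", "pass": True}."""
    raise NotImplementedError

def verify_test_case(url: str, test_case: str, max_steps: int = 10) -> bool:
    """Let the agent interact with the site until it can give a verdict."""
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(url)
        for _ in range(max_steps):
            step = ask_vlm(page.screenshot(), test_case)
            if step["action"] == "answer":
                return step["pass"]
            if step["action"] == "click":
                page.click(step["selector"])
            elif step["action"] == "type":
                page.fill(step["selector"], step["text"])
        return False  # step budget exhausted without a verdict
```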

Analysis

AceCoder: Improved Baseline

First, AceCoder generates a candidate website (line 1 of the algorithm). A web agent then interacts with this site to verify that all features, from both current and past instructions, are correctly implemented (lines 3-8). The gathered critiques are used to augment the original instructions, and this enhanced prompt is fed back to the model to regenerate an improved website (line 10).
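Read as pseudocode, one turn of this loop might look like the sketch below. `generate_site` and `agent_critique` are hypothetical stand-ins for the code model and the web agent, and the regeneration prompt wording is ours, not AceCoder's exact prompt.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    ok: bool
    feedback: str

def generate_site(model, instructions: list[str]) -> str:
    """Placeholder: prompt the code model with all instructions, return HTML."""
    raise NotImplementedError

def agent_critique(site_html: str, instruction: str) -> Critique:
    """Placeholder: a web agent interacts with the site and reports whether
    the features requested by this instruction work."""
    raise NotImplementedError

def acecoder_turn(model, past_instructions: list[str], new_instruction: str) -> str:
    instructions = past_instructions + [new_instruction]
    site = generate_site(model, instructions)  # candidate website (line 1)
    critiques = []
    for i, inst in enumerate(instructions):
        report = agent_critique(site, inst)    # check current AND past turns
        if not report.ok:
            critiques.append(f"Turn {i + 1}: {report.feedback}")
    if critiques:
        # Augment the original instructions with the critiques and regenerate.
        fix = ("Fix the following issues while keeping all other features:\n"
               + "\n".join(critiques))
        site = generate_site(model, instructions + [fix])
    return site
```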

Table: Main results of AceCoder with both textual (T) and visual (V) instructions.

Table: Ablation study of AceCoder, evaluated on GPT-4o with textual instructions.

BibTeX

@article{wu2025frontalk,
    title={FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback},
    author={Wu, Xueqing and Xue, Zihan and Yin, Da and Zhou, Shuyan and Chang, Kai-Wei and Peng, Nanyun and Wen, Yeming},
    year={2025},
    month={oct},
    url={https://github.com/shirley-wu/frontalk}
}