AI Agents Struggle with Real-World Online Tasks, Highlighting Need for User-Centred Design
Category: User-Centred Design · Effect: Strong effect · Year: 2026
Current AI agents demonstrate significant limitations in autonomously completing common, multi-step online tasks across diverse platforms, indicating a gap between AI capabilities and user needs for practical assistance.
Design Takeaway
Designers should focus on developing AI agents that can handle the inherent complexity and variability of real-world online interactions, prioritizing user needs for reliable and comprehensive digital assistance.
Why It Matters
This research underscores that while AI can handle isolated functions, its ability to act as a general-purpose assistant is still nascent. Designers and developers must focus on bridging this gap by creating AI systems that are more robust, adaptable, and aligned with the complexities of human workflows and user expectations for seamless digital interactions.
Key Finding
AI agents, even advanced ones, cannot yet reliably handle the majority of common online tasks that humans perform regularly, revealing a significant shortfall in their practical utility as general assistants.
Key Findings
- AI agents can only complete a small fraction of everyday online tasks.
- Current AI models struggle with tasks requiring information extraction from user documents, multi-step navigation across diverse platforms, and extensive form filling.
Research Evidence
Aim: Can AI agents reliably complete a broad spectrum of everyday online tasks that require navigating multiple platforms and complex workflows?
Method: Empirical evaluation using a benchmark framework.
Procedure: An evaluation framework named ClawBench was developed, comprising 153 everyday online tasks across 144 live platforms. Agents attempted each task against live production sites, with only the final submission requests intercepted to prevent real-world side effects. Performance was measured as the task completion success rate.
Sample Size: 7 frontier AI models were evaluated.
Context: Online task completion, AI agent capabilities, digital assistants.
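The interception-based evaluation described above can be sketched in a few lines. ClawBench's actual implementation is not detailed in this summary, so the harness below is hypothetical: the `Action`/`Task` schema, the `verify` callback, and the rule that the last submit action is captured rather than executed are all illustrative assumptions, not the benchmark's real API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str      # e.g. "navigate", "fill_form", "submit" (hypothetical vocabulary)
    payload: dict

@dataclass
class Task:
    name: str
    platform: str
    # The verifier inspects the intercepted submission instead of letting it
    # reach the live site, mirroring the interception layer described above.
    verify: Callable[[Action], bool]

def evaluate(agent: Callable[[Task], list[Action]], tasks: list[Task]) -> float:
    """Run each task, intercept the final submission, and return the success rate."""
    successes = 0
    for task in tasks:
        actions = agent(task)
        # Interception: the final "submit" action is captured and checked,
        # never actually sent, so no real-world side effects occur.
        submits = [a for a in actions if a.kind == "submit"]
        if submits and task.verify(submits[-1]):
            successes += 1
    return successes / len(tasks)

# Toy agent and tasks to illustrate the harness.
def toy_agent(task: Task) -> list[Action]:
    return [Action("navigate", {"url": task.platform}),
            Action("fill_form", {"field": "q", "value": "test"}),
            Action("submit", {"form": {"q": "test"}})]

tasks = [
    Task("search", "example.com", lambda a: a.payload["form"].get("q") == "test"),
    Task("book", "example.org", lambda a: "date" in a.payload["form"]),
]
rate = evaluate(toy_agent, tasks)  # 1 of 2 toy tasks succeeds, so rate is 0.5
```

In a real harness the interception would sit at the network layer (e.g. a browser-automation request filter) rather than in the action list, but the measurement logic is the same: compare the captured submission against a per-task success criterion.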
Design Principle
AI systems designed for user assistance must be evaluated and developed within the context of real-world, dynamic environments, not just isolated simulations.
How to Apply
When designing AI-powered tools or assistants, rigorously test their performance on a wide range of realistic, multi-step tasks across different platforms, reproducing actual user behaviour and the complexity of live environments.
Limitations
The evaluation framework intercepts final submission requests, which may not capture every failure point in a live transaction. The benchmark also focuses on relatively simple tasks, and the complexity of 'everyday' tasks varies significantly between users and platforms.
Student Guide (IB Design Technology)
Simple Explanation: AI assistants aren't very good yet at doing everyday online jobs for you, like buying things or booking appointments, because the internet is complicated and changes a lot.
Why This Matters: This shows that just making AI smart isn't enough; it needs to be able to handle the messy, real world of the internet to be truly helpful to people.
Critical Thinking: Given the current limitations, what are the most critical design considerations for developing AI agents that can effectively and safely assist users with complex online tasks?
IA-Ready Paragraph: Research indicates that current AI agents exhibit significant limitations in autonomously completing a wide array of everyday online tasks, often failing in scenarios requiring multi-platform navigation and complex data input. This highlights a critical gap between AI capabilities and the demands of real-world user workflows, underscoring the necessity for design approaches that prioritize robustness and user-centred evaluation in dynamic digital environments.
Project Tips
- When designing an AI assistant, think about all the different websites and steps a user might go through.
- Test your AI on real websites, not just pretend ones, to see if it really works.
How to Use in IA
- Use this research to justify the need for robust testing of AI agents in your design project, especially if you are developing or integrating AI functionalities.
- Cite this study when discussing the current limitations of AI in practical applications and the importance of user-centred evaluation.
Examiner Tips
- Consider the real-world applicability and limitations of AI tools you are proposing or evaluating.
- Demonstrate an understanding of the challenges in creating AI that can navigate complex, dynamic online environments.
Independent Variable: Type of AI agent, complexity of online task.
Dependent Variable: Task completion rate, accuracy of form filling, success in multi-step workflows.
Controlled Variables: Live production websites, specific task definitions, interception layer mechanism.
Strengths
- Utilizes live, production websites, reflecting real-world complexity.
- Covers a broad range of everyday tasks across multiple categories and platforms.
Critical Questions
- How can the benchmark be expanded to include more nuanced or subjective tasks?
- What are the ethical implications of AI agents performing these tasks, even with submission interception?
Extended Essay Application
- Investigate the potential for AI agents to assist in specific user-centred design research tasks, such as synthesizing user feedback or generating personas, and evaluate their current efficacy.
- Explore the design of user interfaces that can better support or guide AI agents in completing complex online tasks.
Source
ClawBench: Can AI Agents Complete Everyday Online Tasks? · arXiv preprint · 2026