WebAgent - Agents based on large language models

Large language models (LLMs) can solve a variety of natural language tasks, such as arithmetic, commonsense and logical reasoning, question answering, text generation, and interactive decision-making. Recently, LLMs have also achieved great success in autonomous web navigation, in which an agent controls a computer or browses the Internet and, drawing on HTML understanding and multi-step reasoning, performs a sequence of operations to satisfy a given natural language instruction.

However, web navigation on real-world websites still suffers from the following problems:

(1) Lack of a predefined action space.

(2) HTML observations are much longer than in simulators.

(3) LLMs lack HTML domain knowledge.

Given the openness of real-world websites and the complexity of instructions, it is challenging to define an appropriate action space in advance. Furthermore, although several studies suggest that instruction fine-tuning or reinforcement learning from human feedback improves HTML understanding and web navigation accuracy, the designs of recent LLMs are not always optimal for processing HTML documents: the context length of most LLMs is shorter than the average HTML document on real-world websites, and they do not incorporate HTML-specific domain knowledge.

To address these problems, the researchers introduced WebAgent, an LLM-driven agent that completes navigation tasks on real websites according to user instructions by combining canonical web actions. WebAgent plans by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and operates websites through generated Python programs. WebAgent combines two LLMs: Flan-U-PaLM for grounded code generation, and the newly introduced HTML-T5, a pre-trained LLM for long HTML documents, for planning and summarization.
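The plan, summarize, and act loop described above can be sketched roughly as follows. This is only a toy illustration under stated assumptions, not the researchers' implementation: the names `run_agent`, `Step`, `toy_plan`, `toy_summarize`, and `toy_generate_program` are all hypothetical, the three callables are simple stand-ins for the roles the article assigns to HTML-T5 (planning and summarization) and Flan-U-PaLM (code generation), and the "program" is a placeholder string rather than real browser automation code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    sub_instruction: str  # sub-instruction chosen by the planner
    snippet: str          # task-relevant HTML fragment
    program: str          # Python source that would act on the page


def run_agent(
    instruction: str,
    html: str,
    plan: Callable[[str, List[str], str], str],
    summarize: Callable[[str, str], str],
    generate_program: Callable[[str, str], str],
    max_steps: int = 5,
) -> List[Step]:
    """Hypothetical control loop: plan, summarize, then generate code."""
    history: List[str] = []
    steps: List[Step] = []
    for _ in range(max_steps):
        sub = plan(instruction, history, html)
        if sub == "DONE":  # assumed stop signal from the planner
            break
        snippet = summarize(html, sub)
        program = generate_program(sub, snippet)
        steps.append(Step(sub, snippet, program))
        history.append(sub)
    return steps


# --- Toy stand-ins for the two models (assumptions, not real model calls) ---

def toy_plan(instruction: str, history: List[str], html: str) -> str:
    # A fixed script of sub-instructions; a real planner would be a model.
    queue = ["type 'laptop' into the search box", "click the search button"]
    return queue[len(history)] if len(history) < len(queue) else "DONE"


def toy_summarize(html: str, sub_instruction: str) -> str:
    # Keep only HTML lines that mention a longer word from the sub-instruction.
    words = {w.strip("'\".,").lower() for w in sub_instruction.split()}
    keywords = {w for w in words if len(w) >= 5}
    kept = [line for line in html.splitlines()
            if any(k in line.lower() for k in keywords)]
    return "\n".join(kept)


def toy_generate_program(sub_instruction: str, snippet: str) -> str:
    # Placeholder source text; a real system would emit executable actions.
    return f"# placeholder program\nperform({sub_instruction!r})"


PAGE = """<div id="banner">Welcome</div>
<input id="search-box" type="text">
<button id="search-btn">Go</button>
<footer>Contact us</footer>"""

steps = run_agent("search the shop for a laptop", PAGE,
                  toy_plan, toy_summarize, toy_generate_program)
for step in steps:
    print(step.sub_instruction)
    print(step.snippet)
```

The point of the sketch is the division of labor: only the short snippet, not the full page, is handed to the code generator, which is how a long HTML document can fit within a limited context length.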

Experiments show that this approach improves the success rate on real websites by more than 50%, and that HTML-T5 is currently the best model for solving HTML-based tasks: on the MiniWoB web navigation benchmark, its success rate is 14.9% higher than that of the previous state-of-the-art method, and it also achieves better accuracy on offline task planning evaluation.
