NATURAL PLAN: Benchmarking LLMs on natural language planning

Have you ever tried juggling multiple schedules to plan a get-together with friends or meticulously organizing a trip itinerary down to the finest detail? If you have, you know that planning can be a real brain teaser, requiring you to balance numerous constraints and anticipate potential hiccups along the way. It’s a skill we humans tend to take for granted, but it turns out that even the most advanced AI models find this task incredibly challenging.

Recently, our friends over at Google DeepMind introduced NATURAL PLAN, a benchmark designed to evaluate how well large language models (LLMs) can handle planning tasks posed in natural language. That matters because the next big step for AI is moving beyond simple chatbot functions to actually managing tasks across various platforms on our behalf. Imagine an AI that can not only chat but also arrange meetings, plan vacations, and schedule work, all from a brief command. Sounds amazing, right? But it turns out we’re not there just yet.

What is NATURAL PLAN?

NATURAL PLAN tests AI models on three specific planning tasks:

1. Trip Planning: Creating a detailed travel itinerary considering constraints like flight availability and destination specifics.
2. Meeting Planning: Scheduling meetups with several people at different locations, accounting for their availability and the travel time between locations.
3. Calendar Scheduling: Coordinating work meetings between various individuals, all while navigating their existing schedules and constraints.

To test the models, the researchers used few-shot prompting: each model was shown five examples of tasks along with their correct solutions, then given new, progressively more complex tasks to solve.
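If you're curious what few-shot prompting looks like in practice, here's a minimal sketch. It simply stacks a handful of solved examples in front of the new task so the model can pick up the pattern. The function name, the "TASK"/"SOLUTION" labels, and the placeholder examples are all my own assumptions for illustration, not the format the researchers actually used.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], new_task: str) -> str:
    """Stack solved (task, solution) pairs, then append the unsolved task."""
    parts = []
    for task, solution in examples:
        parts.append(f"TASK: {task}\nSOLUTION: {solution}")
    # The final block leaves the solution empty for the model to complete.
    parts.append(f"TASK: {new_task}\nSOLUTION:")
    return "\n\n".join(parts)


# Hypothetical placeholder examples (five of them, mirroring the 5-shot setup).
examples = [(f"example task {i}", f"example solution {i}") for i in range(1, 6)]
prompt = build_few_shot_prompt(examples, "Plan a 3-day trip from New York to Paris.")
print(prompt)
```

The resulting string would then be sent to whichever model you're testing; that part depends on the provider, so it's left out here.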

Here’s an example of a prompt and solution provided to the models (I’m keeping it simple—just imagine we’re planning a quaint weekend getaway):

Prompt: “Plan a trip from New York to Paris, considering a 3-day stay and return. The flights must depart after 6 PM New York time and arrive in Paris before noon local time.”

Solution: “Day 1: Depart from JFK at 7 PM, arrive in Paris at 10 AM local time. Day 2-4: Stay in Paris, visit the Eiffel Tower, Louvre Museum, and enjoy local cuisine. Day 4: Return flight departs from Charles de Gaulle at 5 PM, arrives in New York at 8 PM local time.”
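Part of what makes these tasks tricky is that every constraint has to hold at once. As a rough sketch, here's how you might check the two flight constraints from the sample prompt above. The function name is made up for this post, and this is not how the benchmark scores answers; it only illustrates the kind of check a correct plan has to pass.

```python
from datetime import time


def satisfies_flight_constraints(depart_ny: time, arrive_paris: time) -> bool:
    """Check the two constraints from the sample prompt.

    Each time is compared against a limit in its own local time zone,
    so no time zone conversion is needed here.
    """
    departs_after_6pm = depart_ny > time(18, 0)
    arrives_before_noon = arrive_paris < time(12, 0)
    return departs_after_6pm and arrives_before_noon


# The sample solution: depart JFK at 7 PM, arrive in Paris at 10 AM local time.
print(satisfies_flight_constraints(time(19, 0), time(10, 0)))  # True
# A 5 PM departure would break the first constraint.
print(satisfies_flight_constraints(time(17, 0), time(10, 0)))  # False
```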

The Results: A Mixed Bag

The models tested included GPT-3.5, GPT-4, GPT-4o, Gemini 1.5 Flash, and Gemini 1.5 Pro. Unfortunately, none of these models aced the tests, although Gemini 1.5 Pro did outshine the rest. When the tasks got tougher (think scheduling a five-person meeting across two continents), the accuracy of all models nosedived.

Interestingly, the researchers found that giving models more examples (many-shot prompting) could improve their planning accuracy, but only if they had a large enough context window to fit those examples. For instance, Gemini 1.5 Pro showed significant improvement in trip planning accuracy when the number of examples increased from one to 800: a leap from a meager 2.7% to an impressive 39.9%.
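To put that jump in perspective, here's a quick back-of-the-envelope calculation using only the two accuracy figures quoted above (the variable names are mine):

```python
one_shot_accuracy = 2.7    # trip planning accuracy (%) with 1 example
many_shot_accuracy = 39.9  # trip planning accuracy (%) with 800 examples

# Absolute gain in percentage points, and the relative improvement factor.
absolute_gain = round(many_shot_accuracy - one_shot_accuracy, 1)
relative_gain = round(many_shot_accuracy / one_shot_accuracy, 1)

print(f"Gain: {absolute_gain} percentage points, about {relative_gain}x")
# Gain: 37.2 percentage points, about 14.8x
```

That's a big relative improvement, but even the best figure still means the model got roughly six out of ten trip plans wrong.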

Another curveball? GPT-4o was particularly bad at trip planning, especially in understanding and adhering to flight connectivity and travel date constraints. Self-correcting, where models were asked to double-check their work, also led to more mistakes. Surprisingly, stronger models like GPT-4 and Gemini 1.5 Pro experienced more significant performance drops than the more modest GPT-3.5 when tasked with self-correction.

So, What’s the Bottom Line?

While the idea of agentic AI that can effortlessly manage complex planning tasks is incredibly exciting, the findings from NATURAL PLAN suggest that we’ve got some ground to cover before AIs can step into roles as proficient trip planners or personal assistants. As put by the DeepMind researchers, “NATURAL PLAN is very hard for state-of-the-art models to solve.”

Yet, don’t lose hope! We’re inching closer to a future where AI can do more than just respond to our queries—it can act as an efficient assistant, managing our schedules and planning trips with aplomb. For now, though, it looks like you might still need to consult your trusty travel agent or personal calendar.

What do you think? Would you trust an AI to plan your next vacation or organize your busy workweek? Share your thoughts in the comments below!
