Customer service represents a significant application for large language models (LLMs) and AI agents. A substantial portion of these interactions occur via phone, necessitating that customer service bots comprehend voice interactions. Phone conversations can be complex, often involving hostility, interruptions, background noise, and general unpredictability. Salesforce is addressing this challenge by simulating such chaotic scenarios to enhance the responsiveness of its voice agents in real-world phone calls. Silvio Savarese, Chief Scientist and Head of AI Research at Salesforce, discussed the development of eVerse, a simulation tool designed to rigorously test AI agents without involving actual customers.

Some might view AI voice agents and simulated training environments as an overly complex solution for tasks that phone button menus handle adequately. However, Silvio Savarese explains that while phone menus suffice for simple, scripted interactions like checking a balance, they become ineffective for complex, multi-step customer problems that deviate from predefined scripts. This approach also falls short from a user experience standpoint.
AI agents, conversely, can capture the nuance in human language, which extends beyond the nature of the request itself. Customers might struggle to articulate their issues or require clarifying questions. Many edge cases exist where a simple button press is insufficient, prompting individuals to call for assistance. This highlights the importance of simulation environments like eVerse, which can create synthetic representations of numerous edge cases to ensure the best possible customer experience. Should human intervention still be necessary, the conversation can be seamlessly transferred, retaining all previously collected context.
Regarding how aspects of real conversations are chosen for simulation, and why simulating complex scenarios is preferred over simply engineering mitigations, Savarese notes that synthetic data generation offers extreme variety, creating scenarios that might not otherwise be conceived, and at scale. Because this synthetic data is generated independently from the agent's training data, the agent cannot anticipate the nature of the problem. While an agent should ideally handle environmental factors like wind, the primary goal is to train agents for unpredictable scenarios. These could involve various types of noise, language-related challenges, or entirely different issues. Different businesses also face unique challenges, such as ordering at a drive-through or changing a flight in a busy airport. Synthetic data generation allows a small amount of sample data to be extrapolated into many different permutations. Savarese expects that as the eVerse simulation loop covers more and more corner cases, these simulation environments will eventually fulfill their purpose and become less necessary.
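The permutation idea can be illustrated with a short sketch. Everything below is hypothetical — the seed utterances, perturbation lists, and function names are illustrative, not part of eVerse — but it shows how a handful of sample interactions can be crossed with noise and caller-behavior conditions to yield many distinct test scenarios:

```python
import itertools
import random

# Hypothetical seeds and perturbations (illustrative only, not eVerse data):
# a few sample customer utterances get crossed with environmental noise
# and caller-behavior conditions to produce many distinct test scenarios.
SEED_UTTERANCES = [
    "I need to change my flight to tomorrow.",
    "My bill is wrong; I was charged twice.",
]
NOISE_CONDITIONS = ["clean", "wind", "busy-airport", "drive-through-static"]
BEHAVIORS = ["calm", "hostile", "interrupts", "vague-phrasing"]

def generate_scenarios(seeds, noises, behaviors, sample=None, rng_seed=0):
    """Cross seeds with perturbations; optionally down-sample to `sample` items."""
    combos = [
        {"utterance": u, "noise": n, "behavior": b}
        for u, n, b in itertools.product(seeds, noises, behaviors)
    ]
    if sample is not None:
        combos = random.Random(rng_seed).sample(combos, sample)
    return combos

scenarios = generate_scenarios(SEED_UTTERANCES, NOISE_CONDITIONS, BEHAVIORS)
print(len(scenarios))  # 2 seeds x 4 noise conditions x 4 behaviors = 32
```

Even this toy cross-product turns two seed utterances into 32 scenarios; in a real pipeline each combination would drive audio synthesis and dialogue simulation rather than remain a dictionary.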
Clarifying the distinction between simulation training data and agent training data, Savarese explains that LLMs are pre-trained on vast amounts of general data, much of which is not relevant to specific enterprise scenarios. Agents are sophisticated frameworks built around these LLMs, with capabilities like Agentforce allowing them to be dialed for determinism or creativity based on the use case. Simulation data fundamentally differs from LLM pre-training data. It takes small amounts of real enterprise data to generate realistic synthetic scenarios that LLMs would be highly unlikely to have encountered during pre-training. This approach ensures the agent learns to generalize rather than merely memorizing responses.
Identifying and fixing gaps is central to eVerse's operation. After simulating large volumes of synthetic scenarios, agent responses are measured through benchmarking. Some measures are quantitative, such as whether the agent took the correct action, while others are qualitative, like assessing the satisfaction of the simulated customer. Human annotators are also employed to validate agent responses for critical scenarios. The feedback gathered from evaluating agent performance in these simulated scenarios drives continuous improvement. Crucially, edge cases are not static; they evolve with changes in customer behavior, regulations, and business rules. Similar to how flight simulators remain essential for experienced pilots, eVerse becomes increasingly valuable as agents scale, providing a safe environment to test changes where the cost of production failure is too high.
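A benchmarking pass of this kind might look like the sketch below. All names are hypothetical, and the satisfaction scores would in practice come from a judge model or human annotator rather than being hard-coded:

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    scenario_id: str
    expected_action: str  # ground-truth action for the simulated scenario
    agent_action: str     # what the agent actually did
    satisfaction: float   # 0.0-1.0, from a judge model or human annotator

def benchmark(results):
    """Aggregate a quantitative pass rate and a qualitative satisfaction
    average, and flag failed scenarios for human review."""
    n = len(results)
    passed = sum(r.agent_action == r.expected_action for r in results)
    return {
        "action_accuracy": passed / n,
        "mean_satisfaction": sum(r.satisfaction for r in results) / n,
        "needs_review": [r.scenario_id for r in results
                         if r.agent_action != r.expected_action],
    }

report = benchmark([
    ScenarioResult("s1", "issue_refund", "issue_refund", 0.9),
    ScenarioResult("s2", "escalate_to_human", "issue_refund", 0.4),
])
print(report["action_accuracy"], report["needs_review"])  # 0.5 ['s2']
```

The split between `action_accuracy` (quantitative) and `mean_satisfaction` (qualitative) mirrors the two kinds of measurement described above, and the `needs_review` list is where human annotators would enter the loop.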
Addressing the inclusion of less-than-kind customer interactions in simulations without the AI agent responding inappropriately, Savarese highlights that in a simulation, no real humans can have their feelings hurt by a rogue agent. If an agent exhibits inappropriate behavior in certain situations, the simulation environment is precisely where such behavior should be discovered and corrected. The aim is also to ensure agents can handle difficult situations optimally, building empathy and helping to defuse conversations as an important aspect of customer service. Situations where agents might "curse back" can be detected either by human review to assess inappropriate responses or by using "judge agents/models" trained on sentiment detection to automatically identify such behavior.
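The judge-model idea can be sketched minimally. A real deployment would use a trained sentiment or toxicity classifier; the keyword lexicon and function below are purely a stand-in to show the shape of the check:

```python
# Purely illustrative stand-in for a judge model: a production system would
# run a trained sentiment/toxicity classifier, not a keyword lexicon.
HOSTILE_MARKERS = {"shut up", "idiot", "stupid"}  # hypothetical lexicon

def judge_agent_turn(agent_text: str) -> dict:
    """Flag agent turns that mirror the caller's hostility instead of defusing it."""
    lowered = agent_text.lower()
    hits = sorted(m for m in HOSTILE_MARKERS if m in lowered)
    return {"inappropriate": bool(hits), "matched": hits}

print(judge_agent_turn("Just shut up and listen."))
print(judge_agent_turn("I hear your frustration; let me fix this for you."))
```

Flagged turns would be routed back into the simulation loop as new edge cases to retrain against, rather than simply discarded.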
Regarding the "Move 37" analogy from the Go match between Lee Sedol and AlphaGo, and how to ensure simulations remain rooted in real-world human interactions without producing baffling but effective moves, Savarese notes that Go, an ancient board game with an immense number of possible moves, saw an unprecedented "Move 37" from AlphaGo that baffled experts. However, rather than trying to prevent such surprising moves, Go players now leverage AI to learn and improve their own game, viewing AI as a tool for enhancement. This reflects AI's potential in business scenarios: a tool to improve the performance of salespeople, service personnel, and other organizational functions. It is also crucial to establish guardrails that ensure agents operate within proper, trusted boundaries and avoid off-chart behavior. This can be enforced by using judge agents/models or by implementing determinism, as is being done in the new release of Agentforce.
Salesforce's partnership with UCSF Health involves testing eVerse in a medical/billing environment. Savarese emphasizes the healthcare space as extremely important, offering a significant opportunity for AI to alleviate pressure on physicians and other workers. The collaboration with UCSF Health began with billing use cases, a major pain point for patients due to the numerous systems required to provide answers and the knowledge often trapped within subject matter experts. The pilot is showing promising results. By creating a Learning Engine with eVerse, AI agents can involve humans when they lack an answer, preventing "hallucinations" and allowing humans to "teach" the AI the correct way to handle a situation. Industry data suggests that 60-70% of inbound calls to healthcare contact centers are routine inquiries that can be fully automated. For the more complex 30-40% of cases, eVerse continuously improves performance through human-in-the-loop feedback, gradually expanding coverage. The results indicate a potential increase in coverage from the 60-70% range to 84-88%. This means that new skills taught by human experts can be generalized and retained by the Learning Engine, improving coverage and allowing humans to focus on the most complex tasks.
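The coverage figures quoted above are internally consistent under a simple model: if human-taught skills let the Learning Engine absorb a fixed share of the residual (non-routine) cases, both endpoints of the reported range imply the same share. The function below is an arithmetic check of that reading, not a Salesforce formula:

```python
def projected_coverage(baseline: float, learned_share_of_residual: float) -> float:
    """Coverage after the Learning Engine generalizes human-taught skills
    across a share of the residual (non-routine) cases."""
    return baseline + learned_share_of_residual * (1.0 - baseline)

# Back out the implied residual share from the figures in the text:
# 60% -> 84% and 70% -> 88% both imply the same 60% of residual cases learned.
for baseline, target in [(0.60, 0.84), (0.70, 0.88)]:
    share = (target - baseline) / (1.0 - baseline)
    print(f"{baseline:.0%} -> {target:.0%}: {share:.0%} of residual cases")
```

Under this reading, the pilot's gains amount to the Learning Engine automating roughly 60% of the calls that previously required a human, at both ends of the reported range.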

