This article presents insights from two interviews conducted at AWS re:Invent. The first segment features Stefano Ermon, CEO and Co-founder of Inception, discussing diffusion language models. The second segment includes Aldo Luévano, Chairman of Roomie, who elaborates on purpose-built AI models for both physical and software applications, with a focus on return on investment.

Inception specializes in researching and developing diffusion language models aimed at achieving faster and more efficient AI.
Roomie operates as a robotics and enterprise AI company, offering an ROI-first platform designed to monitor the effectiveness of its AI solutions.
Both Stefano Ermon and Aldo Luévano can be found on LinkedIn.
Diffusion Language Models: A New Approach to Text Generation
Diffusion models are currently the leading method for generating images, video, and audio. Historically, their application to text and code generation has been challenging. Inception is developing the first large-scale, commercial-grade diffusion language models. These models operate distinctly from traditional large language models (LLMs), which generate text sequentially, one token at a time from left to right. This sequential process creates a bottleneck, making it inherently slow. In contrast, diffusion language models generate multiple tokens in parallel. The process begins with an initial approximation of the answer, which is then iteratively refined. While still utilizing large neural networks, each network evaluation in a diffusion model can modify multiple tokens simultaneously, leading to significantly faster generation. These models can be 5 to 10 times faster than autoregressive models of comparable quality.
The generation process for diffusion language models mirrors that of image diffusion models. It starts with random tokens, akin to pure noise. Through an iterative diffusion process, this noise is progressively removed, allowing the model to accurately determine token values. This refinement chain continues until a high-quality answer is produced.
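The loop described above can be sketched in miniature. Everything below is a toy stand-in, not the actual model: the "denoiser" simply nudges a random draft toward a fixed target with some probability per position, which mimics the shape of the process (start from noise, update every position in parallel, iterate until clean) rather than real neural denoising.

```python
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
TARGET = ["the", "cat", "sat", "on", "a", "mat"]  # stands in for the model's preferred answer

def toy_denoiser(tokens):
    """One 'network evaluation': every position can change in the same step.
    Here each wrong token is corrected with probability 0.5."""
    return [tok if tok == tgt or random.random() < 0.5 else tgt
            for tok, tgt in zip(tokens, TARGET)]

# Start from pure noise: a random token at every position.
draft = [random.choice(VOCAB) for _ in TARGET]
steps = 0
while draft != TARGET:
    draft = toy_denoiser(draft)
    steps += 1
print(f"converged in {steps} parallel refinement steps")
```

The key structural point is that each pass through the loop may rewrite many positions at once, whereas an autoregressive decoder would need one pass per token.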
Modern image diffusion models predominantly utilize transformers. While early diffusion models, pioneered at Stanford in 2019, used ConvNets for dense image prediction, the field has largely transitioned to diffusion transformers. These transformer-based neural networks are trained by adding noise to an image and then teaching the transformer to predict or remove that noise. A similar approach is used for diffusion language models. The underlying neural network is a large transformer. It is trained by taking clean text or code, intentionally introducing errors to disrupt its structure, and then training the network to reconstruct the original, clean signal. Essentially, the network learns to correct mistakes rather than predicting the next token. During inference, the model iteratively fixes as many errors as possible through a denoising process until the output is sufficiently clean for the user.
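The training setup described above, corrupting clean data and learning to reconstruct it, can be illustrated with a minimal sketch of the masked-corruption flavor of this objective. The mask token and noise rate are illustrative assumptions; real diffusion language models use learned noise schedules and train a neural denoiser with a loss over the corrupted positions, which is omitted here.

```python
import random

random.seed(1)
MASK = "<mask>"

def corrupt(tokens, noise_rate=0.4):
    """Forward (noising) process: destroy a random subset of tokens."""
    return [MASK if random.random() < noise_rate else tok for tok in tokens]

clean = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
noisy = corrupt(clean)

# One training pair: the network sees `noisy` and must reconstruct `clean`;
# the reconstruction loss is taken over the corrupted positions.
targets = [(i, tok) for i, (n, tok) in enumerate(zip(noisy, clean)) if n == MASK]
print(noisy)
print(targets)
```

In other words, the supervision signal is "what was here before the noise," not "what token comes next."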
Diffusion language models are statistical, generative models, similar to autoregressive models. They learn a probability distribution representing the data-generating process, based on extensive datasets of code or text. The goal is not merely to memorize training data but to generalize, enabling the generation of new, unseen content. When presented with a novel prompt, the model applies its learned probability distribution to produce a reasonable, generalized response.
While the denoising approach trains the model to correct errors, providing a built-in error correction mechanism, diffusion language models are not yet perfect and can still hallucinate. Unlike autoregressive models, which cannot retract a token once emitted, diffusion models are designed to fix mistakes, and this inherent error correction is expected to eventually yield higher reliability. Currently, these models are comparable in accuracy to the frontier labs’ leading speed-optimized models, such as the ‘mini’ tiers, Gemini Flash, and Anthropic’s Haiku, on tasks such as question answering and programming.
Diffusion language models offer flexible approaches to compute utilization during inference. Research is ongoing to determine the optimal balance between computational resources and output quality. Efforts are focused on developing reasoning capabilities that allow the model to autonomously determine the necessary number of refinement steps. This reasoning methodology differs significantly from that found in traditional autoregressive models, presenting novel capabilities.
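One simple way a model could decide its own number of refinement steps is a confidence-threshold stopping rule. The sketch below is purely illustrative and is not Inception's published method; `denoise` and `confidence` are toy stand-ins, with a "token" represented by its own confidence score.

```python
def refine_until_confident(denoise, confidence, draft, threshold=0.9, max_steps=20):
    """Keep refining until every token clears the confidence threshold,
    so easy prompts exit early and hard ones spend more compute."""
    for step in range(1, max_steps + 1):
        draft = denoise(draft)
        if min(confidence(tok) for tok in draft) >= threshold:
            return draft, step
    return draft, max_steps

# Toy stand-ins: each denoising step raises every score by 0.25 (capped at 1.0).
denoise = lambda draft: [min(1.0, tok + 0.25) for tok in draft]
confidence = lambda tok: tok

answer, steps_used = refine_until_confident(denoise, confidence, [0.1, 0.5, 0.3])
print(steps_used)
```

This is the trade-off the article describes: the threshold and step budget are the knobs that balance compute against output quality.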
The underlying stack for diffusion language models is distinct. While transformers are still employed, the training objective and inference process are entirely different. Inference involves efficiently processing and denoising multiple tokens simultaneously, requiring careful determination of the optimal number of denoising steps. Inception has developed proprietary technology, including a complex serving engine, to manage these models at scale, incorporating continuous batching and various caching strategies for maximum efficiency in serving numerous customers.
Many design choices optimized for autoregressive models prove suboptimal for diffusion language models. This presents significant opportunities for improvement by re-evaluating and adjusting these design parameters. The rapid progress of diffusion language models in matching the performance of leading autoregressive models, despite these suboptimal initial choices, indicates substantial future potential.
A common issue observed in text diffusion models, akin to the ‘six-finger problem’ in image generation, involves degeneracies where the model repeatedly generates similar phrases, creating a recursive loop. This phenomenon was also noted in early demonstrations of Google’s Gemini diffusion model. While largely addressed, its recurrence suggests a potential inherent characteristic of the diffusion approach to text generation.
Handling variable-length content poses a significant technical challenge in developing diffusion language models. A core difficulty lies in adapting continuous diffusion mathematics, which relies on partial differential equations, to the discrete nature of text tokens. Bridging this gap required developing specialized mathematical frameworks. Further challenges include optimizing variable-length generation and implementing system-level optimizations for efficient model serving in practical applications.
Diffusion language models still require substantial memory due to their large neural networks. However, a key advantage is their reduced bottleneck from memory bandwidth. In autoregressive generation, memory bandwidth—the speed at which data moves across memory hierarchies—is often the primary limitation, more so than raw compute power. Diffusion models are designed for greater memory bandwidth efficiency by processing multiple elements in parallel. This allows model weights to be loaded once and applied to many tokens simultaneously, leading to significantly higher arithmetic efficiency and faster practical performance compared to traditional autoregressive models.
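A back-of-envelope calculation makes the bandwidth argument concrete. The figures are assumptions for illustration: a hypothetical 7B-parameter model served in fp16, and the common approximation of 2 FLOPs per parameter per generated token.

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of weights read) for
# one forward pass of a hypothetical 7B-parameter model served in fp16.
params = 7e9
bytes_per_param = 2              # fp16 weights
flops_per_token = 2 * params     # common approximation: 2 FLOPs/parameter/token

def arithmetic_intensity(tokens_per_pass):
    weight_bytes = params * bytes_per_param  # weights are streamed once per pass
    return flops_per_token * tokens_per_pass / weight_bytes

# Autoregressive decode produces one token per pass over the weights.
print(arithmetic_intensity(1))   # memory-bandwidth bound
# A parallel denoising pass can update, say, 64 tokens with the same weight reads.
print(arithmetic_intensity(64))  # far higher compute utilization
```

Under these assumptions, arithmetic intensity scales directly with the number of tokens updated per pass over the weights, which is why parallel denoising spends less of its time waiting on memory.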
While current production models are transformer-based, prototype models are exploring alternative architectures. Research into state-space models, for instance, offers architectures that avoid quadratic scaling with context length. These alternative architectures can be trained in a diffusion manner, functioning as denoisers rather than next-token predictors. This represents an algorithmic improvement that, when combined with architectural advancements, offers orthogonal pathways to enhance efficiency.
Many of the most effective world models currently in existence are based on diffusion technology. This approach is favored for its higher accuracy and significant speed advantages. World models are typically used to predict future events, informing decision-making in applications like autonomous driving. Their efficiency makes diffusion models particularly suitable for this domain. Similarly, this technology is applied to language generation, where inference speed is a critical bottleneck. While hardware advancements contribute to increased token serving capacity, software innovations like diffusion language models offer up to a 10x improvement, independent of hardware gains.
Roomie: Purpose-Built AI with an ROI-First Approach
Roomie offers an enterprise AI platform that leverages various LLMs to automate diverse back-office processes for top-tier and mid-market clients. A key differentiator is its ‘ROI-first’ approach. While terms like ‘AI-first’ and ‘data-first’ are prevalent, organizations primarily seek tangible return on investment. To address this, the platform includes a core module designed to track the ROI for every dollar invested in AI. This platform also integrates with physical AI, allowing the agentic world to connect with physical use cases involving humans, robots (including bipedal robots), or AI on edge smart devices for processes in factories, distribution centers, and other organizations.
The ROI model is LLM-based. It first calculates the current Total Cost of Ownership (TCO) for manual or semi-automated processes, potentially involving legacy technology. Following this, it forecasts the future TCO after implementing the enterprise AI solution. By comparing these TCOs, the final ROI for an organization is determined. This process involves consultants engaging in conversations with clients to understand their business needs, strategy, and company structure, which informs the ROI estimation for implementation. Each module within the platform begins with the application of this ROI-first core module.
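The TCO comparison described above reduces to simple arithmetic. The sketch below uses invented figures and a generic net-ROI formula; Roomie's actual model is LLM-based and proprietary, so this only illustrates the shape of the calculation.

```python
def roi(current_tco, future_tco, ai_investment):
    """Net ROI: savings from the TCO reduction, relative to what was
    invested in the AI implementation. All figures are illustrative."""
    savings = current_tco - future_tco
    return (savings - ai_investment) / ai_investment

# Hypothetical engagement: $2.0M annual TCO today, $1.2M forecast after
# automation, $500K invested in the implementation.
print(f"{roi(2_000_000, 1_200_000, 500_000):.0%}")  # prints 60%
```

The hard part in practice is estimating the two TCO figures, which is where the consultant conversations and historical implementation data come in.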
With over a decade of market experience, Roomie has developed numerous project-based, tailor-made solutions for clients across financial services, banking, consumer packaged goods, retail, and the public sector. This extensive experience has provided a deep understanding of customer businesses, enabling the training of the ROI model with rich historical data. Currently, the model calculates ROI for well-understood use cases, with plans to integrate more use cases and their associated ROI calculations into the platform. The model’s training is rooted in a thorough understanding of business needs and historical implementation data.
The Roomie platform includes eight distinct modules, one of which addresses legacy systems. While many modern coding tools focus on developing new applications with contemporary stacks built on languages like JavaScript, a significant portion of the software development market remains tied to legacy systems such as mainframes and COBOL-based systems, which process millions of transactions daily. Roomie identifies an opportunity to leverage AI to create new functionalities for these legacy systems using natural language, acting as a ‘vibe-coding’ tool powered by a specialized LLM or small language model (SLM). A challenge for general LLMs, often trained on public data like Wikipedia, is generating code from natural language for proprietary legacy systems. Roomie overcomes this by utilizing over a decade of access to client source code, enabling the development of enterprise AI solutions specifically for legacy environments.
Monolithic legacy applications often run remarkably fast compared to multi-layered, service-oriented, or cloud-based applications. Consequently, many large banks and financial services organizations are opting to keep their operations on legacy systems. Roomie’s enterprise AI solution primarily focuses on providing maintenance and support for these systems. While clients can choose to migrate to new architectures, the platform enables them to retain legacy solutions, develop new functionalities, and maintain existing modules using natural language.
While the platform offers migration options for clients seeking new architectures, its primary strength lies in legacy system maintenance. A significant industry challenge is the aging workforce of COBOL developers and the lack of new talent entering the field. Tools like Roomie’s ROI-first enterprise platform are crucial for accelerating the development of new code within legacy mainframe infrastructures.
Roomie includes a module entirely dedicated to physical AI. The company originated as a B2B robotics startup but later diversified into artificial intelligence solutions due to the early stage of robotics adoption, despite advancements in humanoids. While AI development currently drives growth, continued investment in robotics is maintained, anticipating future market opportunities. This physical AI module acts as an integrator, connecting the ROI-first platform’s use cases with physical applications. This involves operating and controlling physical devices, such as humanoids. For instance, a computer vision model facilitates self-checkout and picking tasks in factories or consumer packaged goods organizations. The mobility of humanoid robots expands the scope of agentic computer vision applications, enabling opportunities in anomaly detection, out-of-stock alerts, and checkout solutions.
Many computer vision models are readily available, with Convolutional Neural Networks (CNNs) being a primary approach. Roomie’s distinct advantage lies in its agentic approach, where computer vision is not merely for alerting personnel. Instead, the inference from CNNs is integrated with a system that enables specific actions and tasks to resolve particular industry use cases.
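The agentic distinction can be sketched as a routing layer from detections to actions. The labels, fields, and actions below are invented for illustration; Roomie's actual system is not public, so this only shows the difference between "alert a person" and "trigger a task."

```python
# A detection from the vision model is routed to a concrete action rather
# than just an operator alert. Labels, fields, and actions are invented.
ACTIONS = {
    "out_of_stock": lambda d: f"create restock task for shelf {d['shelf']}",
    "anomaly":      lambda d: f"dispatch inspection to zone {d['zone']}",
}

def handle(detection):
    action = ACTIONS.get(detection["label"])
    return action(detection) if action else "log only"

print(handle({"label": "out_of_stock", "shelf": "A3"}))
# prints: create restock task for shelf A3
```

In a pure-alerting setup the pipeline would stop at the detection; the agentic version attaches a resolution step to each use case.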
Today, a robot cannot perform everyday tasks like doing your taxes. The robotics industry is largely focused on general-purpose humanoids, attracting significant venture capital despite a lack of immediately defined specific use cases. While potential applications include factory picking or home companionship, the technology's ambition to replicate human capabilities such as object manipulation, locomotion, and conversation is vast. Developing clear ROI or unit economics for humanoid deployment remains challenging. The prevailing industry strategy is to secure funding for humanoids and showcase general-purpose capabilities, with meaningful deployment anticipated in 5 to 10 years.
The ethical implications of humanoid robots, particularly regarding workforce reduction, are a significant concern. The drive towards hyper-automation in industries aims for cost and payroll reduction, inevitably impacting the workforce. This transformation extends beyond physical robots; agentic AI can automate roles without physical presence. Roomie’s ROI-first approach explicitly includes workforce reduction as a dimension. While many claim AI will only enhance human capabilities without job displacement, Roomie acknowledges that its solutions aim to reduce organizational payroll, a reality that requires careful management.
The common association of AI with robots often fuels anxiety, partly due to science fiction narratives. While job displacement is an inevitable part of societal and industrial evolution, new roles will emerge, such as those related to teleoperating and training robots. The current landscape makes it challenging for a single use case to serve as a primary product. The democratization of AI through no-code capabilities allows companies to rapidly develop diverse use cases. Roomie’s strategy is to provide an enterprise layer with a broad range of use cases, rather than specializing in a single one. This approach aims to democratize AI access, initially for Latin American companies, with plans to extend to the United States. The ability to rapidly develop technology with modern software development approaches supports this comprehensive offering, encompassing both physical and enterprise AI.

