Provided by Capital One
Generative AI, particularly large language models (LLMs), will play a vital role in the future of customer and employee experiences, software development, and more. Building a solid foundation in machine learning operations (MLOps) will be critical for companies to effectively deploy and scale LLMs, and generative AI capabilities more broadly. In this uncharted territory, improper management can result in complexities organizations may not be equipped to handle.
Back to basics for emerging AI
To develop and scale enterprise-grade LLMs, companies should exhibit five core characteristics of a successful MLOps program, starting with deploying ML models consistently. Standardized, consistent processes and controls should monitor production models for drift as well as data and feature quality. Companies should be able to replicate and retrain ML models with confidence, from quality assurance and governance processes through to deployment, without heavy manual work or rewriting. Finally, they must ensure their ML infrastructure is resilient (ensuring multiregional availability and failure recovery), consistently scanned for cyber vulnerabilities, and well managed.
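As a concrete illustration of the monitoring piece, the sketch below checks a single production feature for drift against its training distribution using a population stability index. The bin count, the 0.2 threshold, and the sample data are illustrative rule-of-thumb assumptions, not a prescribed method.

import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare two samples of a numeric feature; higher PSI means more drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_counts, _ = np.histogram(expected, bins=edges)
    act_counts, _ = np.histogram(actual, bins=edges)
    # Convert counts to proportions, avoiding division by zero with a small floor.
    exp_pct = np.clip(exp_counts / exp_counts.sum(), 1e-6, None)
    act_pct = np.clip(act_counts / act_counts.sum(), 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

training_sample = np.random.normal(0.0, 1.0, 10_000)    # feature values at training time
production_sample = np.random.normal(0.3, 1.0, 5_000)   # feature values seen in production

psi = population_stability_index(training_sample, production_sample)
if psi > 0.2:  # commonly cited rule-of-thumb threshold
    print(f"Feature drift detected (PSI={psi:.3f}); consider retraining.")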
Once these components are in place, the more complex challenges of LLMs require nuanced approaches and considerations, from infrastructure to capabilities, risk mitigation, and talent.
Deploying LLMs as a backend
Inferencing with traditional ML models typically involves packaging a model object as a container and deploying it on an inferencing server. As the demands on the model increase (more requests and more customers require more run-time decisions, meaning higher QPS within a latency bound), all it takes to scale the model is to add more containers and servers. In most enterprise settings, CPUs work fine for traditional model inferencing. But hosting LLMs is a far more complex process that requires additional considerations.
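For contrast, a minimal sketch of that traditional pattern might look like the following: a model object wrapped in a lightweight HTTP service that can be containerized and replicated behind a load balancer as request volume grows. FastAPI, the toy model, and the payload shape are illustrative assumptions; a real deployment would load a serialized model artifact instead.

import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.linear_model import LogisticRegression

# Stand-in for a trained model; in practice this would be loaded from storage.
model = LogisticRegression().fit(np.array([[0.0], [1.0], [2.0], [3.0]]), [0, 0, 1, 1])

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest):
    # Each container replica serves a share of the traffic; scaling out means
    # simply running more copies behind a load balancer.
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}

# Run with: uvicorn server:app --host 0.0.0.0 --port 8080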
LLMs operate on tokens, the fundamental units of a word that the model uses to generate human-like language. They typically make predictions on a token-by-token basis in an autoregressive manner, based on previously generated tokens, until a stop word is reached. The process can become cumbersome quickly: tokenizations vary based on the model, task, language, and computational resources. Engineers deploying LLMs need not only infrastructure experience, such as deploying containers in the cloud; they also need to know the latest techniques to keep inferencing costs manageable and to meet performance SLAs.
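A minimal sketch of that autoregressive loop, using a small open model (GPT-2 here, purely as an illustration) with greedy decoding: each step predicts one token from the tokens generated so far, stopping at the end-of-sequence token or a length cap.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The customer asked about", return_tensors="pt").input_ids

for _ in range(20):  # cap the generation length
    with torch.no_grad():
        logits = model(input_ids).logits
    # Pick the most likely next token given everything generated so far.
    next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(input_ids[0]))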
Vector databases as knowledge repositories
Deploying LLMs in an enterprise context means vector databases and other knowledge bases must be established, so that they can work together in real time with document repositories and language models to produce reasonable, contextually relevant, and accurate outputs. For example, a retailer may use an LLM to power a conversation with a customer over a messaging interface. The model needs access to a database with real-time business data to call up accurate, up-to-date details about recent interactions, the product catalog, conversation history, company policies regarding returns, recent promotions and ads in the market, customer service guidelines, and FAQs. These knowledge repositories are increasingly developed as vector databases for fast retrieval against queries via vector search and indexing algorithms.
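A minimal sketch of that retrieval step, assuming a sentence-embedding model and a small in-memory index: production systems would use a dedicated vector database, but the flow is the same—embed the documents, embed the query, retrieve the nearest neighbors, and pass them to the LLM as context. The document contents and model name are illustrative.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Returns are accepted within 30 days with a receipt.",
    "The fall promotion offers 20% off outerwear.",
    "Customer support hours are 8am to 8pm Eastern.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query, k=2):
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector  # cosine similarity on normalized vectors
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

context = retrieve("What is the return policy?")
print(context)  # retrieved passages would be injected into the LLM prompt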
Training and fine-tuning with hardware accelerators
LLMs pose a further challenge: fine-tuning for optimal performance against specific enterprise tasks. Large enterprise language models can have billions of parameters. This requires more sophisticated approaches than traditional ML models, including a persistent compute cluster with high-speed network interfaces and hardware accelerators such as GPUs (see below) for training and fine-tuning. Once trained, these large models also need multi-GPU nodes for inferencing, with memory optimizations and distributed computing enabled.
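One common fine-tuning pattern, sketched minimally below, wraps a pretrained causal language model with low-rank adapters (LoRA) so that only a small fraction of the weights are trained. The model name and LoRA hyperparameters are illustrative; in an enterprise setting this would run on a multi-GPU cluster under a distributed training framework, which is omitted here.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the adapter matrices
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically a fraction of a percent of all weights

# The adapted model can then be passed to a standard training loop or
# transformers.Trainer along with a tokenized, task-specific dataset.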
To meet these computational demands, organizations will need to make more extensive investments in specialized GPU clusters or other hardware accelerators. These programmable hardware devices can be customized to speed up specific computations such as matrix-vector operations. Public cloud infrastructure is an important enabler for these clusters.
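A quick sketch of why these accelerators matter: the dense matrix operations at the heart of transformer workloads run far faster on a GPU than a CPU. The matrix sizes are arbitrary, and the GPU branch assumes a CUDA device is available.

import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

start = time.time()
_ = a @ b
print(f"CPU matmul: {time.time() - start:.3f}s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()
    start = time.time()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()
    print(f"GPU matmul: {time.time() - start:.3f}s")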
A new approach to governance and guardrails
Risk mitigation is paramount throughout the entire lifecycle of the model. Observability, logging, and tracing are core components of MLOps processes, which help monitor models for accuracy, performance, data quality, and drift after their release. This is critical for LLMs too, but there are additional infrastructure layers to consider.
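A minimal sketch of the logging layer: every inference call is recorded as a structured event with its latency and a request identifier, so accuracy, performance, and drift can be analyzed downstream. The field names and the stand-in model function are illustrative assumptions.

import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("inference")

def logged_inference(model_fn, payload):
    request_id = str(uuid.uuid4())
    start = time.time()
    output = model_fn(payload)
    # Emit one structured record per request for downstream monitoring.
    logger.info(json.dumps({
        "request_id": request_id,
        "latency_ms": round((time.time() - start) * 1000, 2),
        "input_chars": len(str(payload)),
        "output_chars": len(str(output)),
    }))
    return output

# Example usage with a stand-in model function:
print(logged_inference(lambda text: text.upper(), "what is my balance?"))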
LLMs can “hallucinate,” occasionally outputting false knowledge. Organizations need proper guardrails (controls that enforce a specific format or policy) to ensure LLMs in production return acceptable responses. Traditional ML models rely on quantitative, statistical approaches to perform root cause analysis of model inaccuracy and drift in production. With LLMs, this is more subjective: it may involve running a qualitative scoring of the LLM's outputs, then checking them against an API with pre-set guardrails to ensure an acceptable answer.
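A minimal, illustrative sketch of such a guardrail layer: before a response is returned to a user, it is checked against simple policy rules and a qualitative relevance score. Real deployments use much richer checks (often including a second model as a judge); the patterns, threshold, and fallback message here are assumptions.

import re

BLOCKED_PATTERNS = [r"\b\d{16}\b", r"\bssn\b"]  # e.g. card numbers, sensitive terms

def passes_guardrails(response: str, relevance_score: float, threshold: float = 0.7) -> bool:
    if any(re.search(p, response, re.IGNORECASE) for p in BLOCKED_PATTERNS):
        return False                      # policy violation: never return this response
    return relevance_score >= threshold   # qualitative score from a grader

def answer(llm_fn, grade_fn, prompt: str) -> str:
    response = llm_fn(prompt)
    if passes_guardrails(response, grade_fn(prompt, response)):
        return response
    return "I'm not able to help with that; let me connect you to an agent."

# Example usage with stand-in model and grading functions:
print(answer(lambda p: "Your order ships tomorrow.", lambda p, r: 0.9, "Where is my order?"))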
Governance of enterprise LLMs is both an art and a science, and many organizations are still working out how to codify it into actionable risk thresholds. With new advances emerging rapidly, it is wise to experiment with both open-source and commercial solutions that can be tailored for specific use cases and governance requirements. This requires a very flexible ML platform, especially a control plane with high levels of abstraction as a foundation. That allows the platform team to add or remove capabilities, and keep pace with the broader ecosystem, without impacting its users and applications. Capital One views building out a scaled, well-managed platform control plane with high levels of abstraction and multitenancy as critical to meeting these requirements.
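To make the abstraction point concrete, here is a minimal sketch of the general idea: applications call a single interface, and the platform team can register or swap model providers (open source or commercial) behind it without changing callers. The names and providers are illustrative, not a description of any particular platform.

from typing import Callable, Dict

class ModelRegistry:
    """Maps logical model names to provider callables that generate text."""

    def __init__(self) -> None:
        self._providers: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, provider: Callable[[str], str]) -> None:
        self._providers[name] = provider

    def generate(self, name: str, prompt: str) -> str:
        return self._providers[name](prompt)

registry = ModelRegistry()
registry.register("chat-default", lambda prompt: f"[open-source model reply to: {prompt}]")
# The platform team can later re-point "chat-default" at a different backend
# without any change to the applications calling registry.generate(...).
print(registry.generate("chat-default", "Summarize my recent transactions."))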
Recruiting and retaining specialized talent
Depending on how much context the LLM is trained on and the tokens it generates, performance can vary significantly. Training or fine-tuning very large models and serving them in production at scale poses significant scientific and engineering challenges. This requires companies to recruit and retain a broad range of AI experts, engineers, and researchers.
For instance, deploying LLMs and vector databases for a service agent assistant used by tens of thousands of employees across an organization means bringing together engineers experienced in a wide range of domains such as low-latency/high-throughput serving, distributed computing, GPUs, guardrails, and well-managed APIs. LLMs also depend on well-tailored prompts to provide accurate answers, which requires sophisticated prompt engineering expertise.
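A minimal sketch of that prompt-engineering point: a templated prompt that combines instructions, retrieved context, and the user's question before it is sent to the model. The wording of the template is an illustrative assumption; in practice such templates are iterated on and evaluated extensively.

PROMPT_TEMPLATE = """You are a customer service assistant.
Answer only using the context below. If the answer is not in the context, say so.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context_passages, question):
    # Join the retrieved passages and slot them into the template.
    return PROMPT_TEMPLATE.format(context="\n".join(context_passages), question=question)

print(build_prompt(
    ["Returns are accepted within 30 days with a receipt."],
    "Can I return an item I bought last week?",
))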
A deep bench of AI research experts is required to stay abreast of the latest developments, build and fine-tune models, and contribute research to the AI community. This virtuous cycle of open contribution and adoption is essential to a successful AI strategy. Long-term success for any AI program will involve a diverse set of talent and experience combining data science, research, design, product, risk, legal, and engineering experts who keep the human user at the center.
Balancing opportunity with safeguards
While it is still early days for enterprise LLMs and new technical capabilities evolve daily, one of the keys to success is a solid foundational ML and AI infrastructure.
AI will continue to accelerate rapidly, particularly in the LLM space. These advances promise to be transformative in ways that haven't been possible before. As with any emerging technology, the potential benefits must be balanced with well-managed operational practices and risk management. A targeted MLOps strategy that considers the complete spectrum of models can offer a comprehensive approach to accelerating broader AI capabilities.