Product strategies from Classical ML to adapt (or ditch) for the generative AI world
Years ago, the first piece of advice my boss at Opendoor gave me was succinct: “Invest in backtesting. AI product teams succeed or fail based on the quality of their backtesting.” At the time, this advice was tried-and-true; it had been learned the hard way by teams across search, recommendations, life sciences, finance, and other high-stakes products. It’s advice I held dear for the better part of a decade.
But I’ve come to believe it’s not axiomatic for building generative AI products. A year ago, I switched from classical ML products (which produce simple output: numbers, categories, ordered lists) to generative AI products. Along the way, I found that many principles from classical ML no longer serve me and my teams.
Through my work at Tome, where I’m Head of Product, and conversations with leaders at generative AI startups, I’ve identified three behaviors that distinguish the teams shipping the most powerful, useful generative AI features. These teams:
- Concurrently work backwards (from user problems) and forwards (from technology opportunities)
- Design low-friction feedback loops from the outset
- Reconsider the research and development tools from classical ML
These behaviors require “unlearning” a number of things that remain best practices for classical ML. Some may seem counter-intuitive at first. However, they apply to generative AI applications broadly, ranging from horizontal to vertical software, and from startups to incumbents. Let’s dive in!
(Wondering why automated backtesting is no longer a tenet for generative AI application teams? And what to replace it with? Read on to Principle 3.)
(More interested in tactics, rather than process, for how generative AI apps’ UI/UX should differ from classical ML products? Check out this blog post.)
“Working backwards” from user problems is a credo in many product and design circles, made famous by Amazon. Study users, size their pain points, write UX requirements to mitigate the top one, identify the best technology to implement, then rinse and repeat. In other words, figure out “This is the most important nail for us to hit, then which hammer to use.”
This approach makes less sense when enabling technologies are advancing very rapidly. ChatGPT was not built by working backwards from a user pain point. It took off because it offered a powerful, new enabling technology through a simple, open-ended UI. In other words: “We’ve invented a new hammer, let’s see which nails users will hit with it.”
The best generative AI application teams work backwards and forwards concurrently. They do the user research and understand the breadth and depth of pain points. But they don’t simply progress through a ranked list sequentially. Everyone on the team, PMs and designers included, is deeply immersed in new AI advances. They connect these unfolding technological opportunities to user pain points in ways that are often more complex than one-to-one mappings. For example, a team will see that user pain points #2, #3, and #6 could all be mitigated via model breakthrough X. Then it may make sense for the next project to focus on “working forwards” by incorporating model breakthrough X, rather than “working backwards” from pain point #1.
Deep immersion in new AI advances means understanding how they apply to your real-world application, not just reading research papers. This requires prototyping. Until you’ve tried a new technology in your application environment, estimates of user benefit are just speculation. The elevated importance of prototyping requires flipping the traditional spec → prototype → build process to prototype → spec → build. More prototypes are discarded, but that’s the only way to consistently spec features that match valuable new technologies to broad, deep user needs.
Feedback for system improvement
Classical ML products produce relatively simple output types: numbers, categories, ordered lists. And users tend to accept or reject these outputs: you click a link on the Google search results page, or mark an email as spam. Each user interaction provides data that is fed directly back into model retraining, so the link between real-world use and model improvement is strong (and mechanical).
Unfortunately, most generative AI products tend not to produce new, ground-truth training data with each user interaction. This challenge is tied to what makes generative models so powerful: their ability to produce complex artifacts that combine text, images, video, audio, code, etc. For a complex artifact, it’s rare for a user to “take it or leave it”. Instead, most users refine the model output, either with more/different AI or manually. For example, a user may copy ChatGPT output into Word, edit it, and then send it to a colleague. This behavior prevents the application (ChatGPT) from “seeing” the final, desired form of the artifact.
One implication is to allow users to iterate on output within your application. But that doesn’t eliminate the problem: when a user doesn’t iterate on an output, does that mean “wow” or “woe”? You can add a sentiment indicator (e.g. thumbs up/down) to every AI response, but interaction-level feedback response rates tend to be very low. And the responses that are submitted tend to be biased towards the extremes. Users mostly perceive sentiment collection as additional friction, since it rarely helps them immediately get to a better output.
A better strategy is to identify a step in the user’s workflow that signifies “this output is now good enough”. Build that step into your app and make sure to log what the output looked like at that point. For Tome, where we help users craft presentations with AI, the key step is sharing a presentation with another person. To bring this into our app, we’ve invested heavily in sharing features. And then we evaluate which AI outputs were “shareable” and which required massive manual editing to become shareable.
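As an illustration, a minimal sketch of that kind of logging in Python (the `analytics_client` interface, event name, and field names are hypothetical, not Tome’s actual schema):

```python
import hashlib
import json
import time

def log_share_event(user_id: str, artifact: dict, analytics_client) -> None:
    """Snapshot the artifact at the moment the user shares it.

    Sharing is treated as the implicit signal that "this output is now
    good enough." The analytics_client interface and field names are
    illustrative placeholders.
    """
    snapshot = json.dumps(artifact, sort_keys=True)
    analytics_client.track(
        event="artifact_shared",
        properties={
            "user_id": user_id,
            "timestamp": time.time(),
            "artifact_snapshot": snapshot,
            # A hash makes it cheap to later diff the shared version against
            # the raw AI output and estimate how much manual editing happened.
            "artifact_hash": hashlib.sha256(snapshot.encode()).hexdigest(),
        },
    )
```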
Feedback for user assistance
Free text has emerged as the dominant way users want to interact with generative AI applications. But free text is a Pandora’s box: give a user free text input to AI, and they’ll ask the product to do all sorts of things it cannot. Free text is a notoriously difficult input mechanism through which to convey a product’s constraints; in contrast, an old-fashioned web form makes it very clear what information can and must be submitted, and in exactly what format.
But users don’t want forms when doing creative or complex work. They want free text — and guidance on how to craft great prompts, specific to their task at hand. Tactics for assisting users include example prompts or templates, and guidance around optimal prompt length and formatting (should they include few-shot examples?). Human-readable error messages are also key (for example: “This prompt was in language X, but we only support languages Y and Z.”)
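To make this concrete, here is a minimal sketch of prompt-side guardrails that return human-readable messages rather than failing silently. It assumes the langdetect package for language detection; the supported-language set, length limit, and wording are all illustrative, not a description of any particular product.

```python
from langdetect import detect  # pip install langdetect

SUPPORTED_LANGUAGES = {"en": "English", "es": "Spanish"}  # hypothetical set
MAX_PROMPT_CHARS = 2000  # illustrative limit

def validate_prompt(prompt: str) -> list[str]:
    """Return human-readable guidance; an empty list means the prompt looks fine."""
    messages = []
    if len(prompt.strip()) < 10:
        messages.append(
            "Try a longer prompt: a sentence or two about your topic and audience works best."
        )
    if len(prompt) > MAX_PROMPT_CHARS:
        messages.append(
            f"This prompt is over {MAX_PROMPT_CHARS} characters; try trimming it down."
        )
    try:
        lang = detect(prompt)
        if lang not in SUPPORTED_LANGUAGES:
            supported = " and ".join(SUPPORTED_LANGUAGES.values())
            messages.append(
                f"This prompt appears to be in '{lang}', but we only support {supported}."
            )
    except Exception:
        pass  # too little text to detect a language reliably
    return messages
```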
One upshot of free text inputs is that unsupported requests can be a fantastic source of inspiration for what to build next. The trick is being able to identify and cluster what users are trying to do in free text. More on that in the next section…
Something to build, something to keep, something to discard
Build: natural language analytics
Many generative AI applications allow users to pursue very different workflows from the same entry point: an open-ended, free-text interface. Users are not choosing from a drop-down “I’m brainstorming” or “I want to solve a math problem” — their desired workflow is implicit in their text input. So understanding users’ desired workflows requires segmenting that free text input. Some segmenting approaches are likely to be enduring — at Tome, we’re always interested in desired language and audience type. There are also ad hoc segmentations, to answer specific questions on the product roadmap — for example, how many prompts request a visual element like an image, video, table, or chart, and thus which visual element should we invest in?
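One low-lift way to do this kind of segmentation is to vectorize prompts and cluster them. Below is a minimal sketch using scikit-learn (TF-IDF plus k-means); the prompts, cluster count, and labels are placeholders, and a production version might swap in LLM-based classification or embedding models.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

prompts = [
    "make a pitch deck for my seed round",
    "create a slide with a bar chart of Q3 revenue",
    "brainstorm names for a coffee brand",
    "outline a product strategy presentation for executives",
    "add an image of a mountain sunrise to slide 2",
    # ...in practice, thousands of logged prompts pulled from your warehouse
]

# Turn free-text prompts into vectors and group them into rough themes.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(prompts)

n_clusters = 3  # illustrative; tune via silhouette score or manual review
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

# Print the top terms per cluster so a human can name each segment.
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(kmeans.cluster_centers_):
    top_terms = [terms[j] for j in center.argsort()[::-1][:5]]
    print(f"cluster {i}: {top_terms}")
```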
Natural language analytics should complement, not supplant, traditional research approaches. NLP is especially powerful when paired with structured data (e.g., traditional SQL). Plenty of key data is not free text: when did the user sign up, what are the user’s attributes (organization, job, geography, etc.). At Tome, we tend to look at language clusters by job function, geography, and free/paid user status — all of which require traditional SQL.
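Pairing the two might look like the sketch below: joining per-prompt cluster labels to user attributes that would normally come from the warehouse via SQL. All data and column names here are invented for illustration.

```python
import pandas as pd

# Cluster labels from the free-text segmentation step (illustrative data).
prompt_clusters = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "cluster": ["pitch_deck", "data_viz", "brainstorm", "pitch_deck"],
})

# In practice these attributes come from SQL, e.g.
#   SELECT user_id, job_function, plan_type FROM users;
users = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "job_function": ["sales", "analyst", "marketing", "founder"],
    "plan_type": ["free", "paid", "free", "paid"],
})

# Which prompt segments come from which kinds of users?
merged = prompt_clusters.merge(users, on="user_id", how="left")
print(merged.groupby(["job_function", "cluster"]).size())
```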
And quant insights should never be relied on without qualitative insights. I’ve found that watching a user navigate our product live can sometimes generate 10x the insight of a user interview (where the user discusses their product impressions post-hoc). And I’ve found scenarios where one good user interview unlocked 10x the insight of quant analysis.
Keep: tooling for low-code prototyping
Two tooling types enable high-velocity, high-quality generative AI app development: prototyping tools and output quality assessment tools.
There are many different ways to improve an ML application, but one strategy that is both fast and accessible is prompt engineering. It’s fast because it doesn’t require model retraining; it’s accessible because it involves natural language, not code. Allowing non-engineers to manipulate prompt engineering approaches (in a dev or local environment) can dramatically increase velocity and quality. Often this can be implemented via a notebook. The notebook may contain a lot of code, but a non-engineer can make significant advances by iterating on the natural language prompts without touching the code.
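In practice, a notebook cell like the sketch below keeps the prompt as a plain-text constant that a PM or designer can edit and re-run without touching the surrounding code. It assumes the OpenAI Python client with an API key in the environment; the template, helper name, and model choice are illustrative.

```python
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY to be set

client = OpenAI()

# A non-engineer can iterate on this template without touching the code below it.
PROMPT_TEMPLATE = """You are a presentation-writing assistant.
Write an outline for a presentation about: {topic}
Audience: {audience}
Keep it to 6 slides, with one short title per slide."""

def generate_outline(topic: str, audience: str, model: str = "gpt-4o-mini") -> str:
    """Render the template and ask the model for an outline."""
    prompt = PROMPT_TEMPLATE.format(topic=topic, audience=audience)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_outline("Q3 sales review", "executive team"))
```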
Assessing prototype output quality is often quite hard, especially when building a net-new feature. Rather than investing in automated quality measurement, I’ve found it significantly faster and more useful to poll colleagues or users in a “beta tester program” for 10–100 structured evaluations (scores + notes). The enabling technology for a “polling approach” can be light: a notebook to generate input/output examples at modest scale and pipe them into a Google Sheet. This allows manual evaluation to be parallelized, and it’s usually easy to get ~100 examples evaluated, across a handful of people, in under a day. Evaluators’ notes, which offer insights into patterns of failure or excellence, are an added perk; the notes tend to be more useful for identifying what to fix or build next than the numeric scores.
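A light version of that pipeline might look like the sketch below: generate input/output pairs, write them to a CSV that can be imported into a Google Sheet (or pushed there directly with a library like gspread), and then aggregate the scores and notes evaluators fill in. It reuses the hypothetical `generate_outline` helper from the previous snippet; file names and columns are illustrative.

```python
import csv

import pandas as pd

test_cases = [
    ("Q3 sales review", "executive team"),
    ("intro to our design system", "new engineers"),
    # ...expand to ~50-100 cases covering the feature's main use cases
]

# Step 1: generate examples and dump them to a CSV for manual scoring.
with open("eval_batch.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["topic", "audience", "output", "score_1_to_5", "notes"])
    for topic, audience in test_cases:
        writer.writerow([topic, audience, generate_outline(topic, audience), "", ""])

# Step 2, after evaluators fill in scores and notes in the shared sheet:
results = pd.read_csv("eval_batch_scored.csv")
print(results["score_1_to_5"].describe())                   # overall quality distribution
print(results.loc[results["score_1_to_5"] <= 2, "notes"])   # where and why it failed
```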
Discard: automated, backtested measures of quality
A tenet of classical ML engineering is to invest in a robust backtest. Teams retrain classical models regularly (weekly or daily), and a good backtest ensures only good new candidates are released to production. This makes sense for models outputting numbers or categories, which can be scored against a ground-truth set easily.
But scoring accuracy is harder with complex (perhaps multi-modal) output. You may have a text that you consider great, and thus you’re inclined to call it “ground truth”, but if the model output deviates from it by one word, is that meaningful? By one sentence? What if the facts are all the same, but the structure is different? What if it’s text and images together?
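To make the problem concrete, here is a tiny, standard-library-only illustration: two outputs a human would likely judge as equally good score very differently under a naive similarity metric against a single “ground truth” text. The example strings are invented.

```python
from difflib import SequenceMatcher

ground_truth = "Revenue grew 20% in Q3, driven by strong enterprise demand."
candidate_a = "Revenue grew 20% in Q3, driven by strong enterprise demand!"   # near-identical
candidate_b = "Strong enterprise demand drove a 20% revenue increase in Q3."  # same facts, restructured

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()

print(similarity(ground_truth, candidate_a))  # close to 1.0
print(similarity(ground_truth, candidate_b))  # much lower, despite identical facts
```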
But not all is lost. Humans tend to find it easy to evaluate whether generative AI output meets their quality bar. That doesn’t mean it’s easy to transform bad output into good, just that users can usually judge whether text, an image, audio, etc. is “good or bad” in a few seconds. Furthermore, most generative AI systems at the application layer are not retrained on a daily, weekly, or even monthly basis, due to compute costs and/or the long timelines needed to acquire sufficient user signal to warrant retraining. So we don’t need quality evaluation processes that run every day (unless you’re Google or Meta or OpenAI).
Given the ease with which humans can evaluate generative AI output, and the infrequency of retraining, it often makes sense to evaluate new model candidates based on internal, manual testing (e.g. the polling approach described in the subsection above) rather than an automated backtest.