Why AI cannot simulate your customers' behaviour
"Founders would literally rather boil the ocean than talk to customers."
There’s a new paper on LLMs that’s making the rounds, and obviously, it’s the most mindblowing thing that you’ve ever seen, at least according to LinkedIn: “Omg we can simulate customers!”
Anyways…
The title of the paper is:
“LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings”
or, in simpler words, including an AI em dash:
“AI Models Can Now Predict What Customers Would Buy—Almost Like Humans”
Let’s examine the paper to see if its claims are valid and why I think you should know about it, because it will affect product and growth in tech, regardless of whether it’s right or wrong.
Let’s map it out.
What the paper claims & how it works
The good
The not so good
Human Panels (suck)
Intent vs Action
Limited coverage
Reality
Oversimplification
Future
Model Poisoning
Degrading Data Quality
Conclusion
1. What the paper claims
The relevant section from the abstract:
“Consumer research costs companies billions annually yet suffers from panel biases and limited scale. Large language models (LLMs) offer an alternative by simulating synthetic consumers, but produce unrealistic response distributions when asked directly for numerical ratings. […] Additionally, these synthetic respondents provide rich qualitative feedback explaining their ratings. This framework enables scalable consumer research simulations while preserving traditional survey metrics and interpretability.”
The experiment
The researchers wanted to see if large language models (LLMs) like GPT-4o and Gemini 2.0 could act like real people in purchase-intent surveys, basically, to check whether an AI could respond to product ideas in the same way humans do.
They had access to 57 real consumer surveys conducted by a major personal care brand (think toothpaste, deodorant, skincare), covering about 9,300 real human participants in total.
Each survey asked people to rate “How likely are you to buy this product?” on a 1–5 scale (a typical “Likert” rating).
They showed the LLMs the same product concepts that the human participants saw.
The AI was also "told" what kind of person it was supposed to be (for example, a 35-year-old woman from California with medium income).
They tested three different ways for the AI to answer:
Direct Likert Rating (DLR): simply asking the AI to choose a number between 1 and 5.
→ Result: too robotic; the models gave mostly 3s.
Follow-up Likert Rating (FLR): first the AI writes how it feels ("Seems good but a bit expensive"), then rates it 1–5.
→ Better, but still limited.
Semantic Similarity Rating (SSR): the AI gives a short opinion in text, which is then matched (using embedding similarity) against five example statements representing each Likert rating, like "Definitely won't buy" vs. "Definitely will buy." The AI's response is scored based on how close it is to these anchors.
→ This was the main innovation and worked the best.
The SSR method reproduced human purchase intent almost as well as people themselves, achieving about 90% of human test-retest reliability.
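To make the setup more tangible, here is a rough sketch of what the persona conditioning and the three elicitation styles could look like as prompts. This is my paraphrase for illustration; the persona, product concept, and wording are assumptions, not the paper's exact prompts.

```python
# Illustrative paraphrase of the three elicitation styles (not the paper's exact prompts).
PERSONA = "You are a 35-year-old woman from California with a medium household income."
CONCEPT = "A whitening toothpaste in a recyclable aluminium tube, priced at $6.99."

# DLR: ask for the number directly.
DLR_PROMPT = f"""{PERSONA}
Here is a product concept: {CONCEPT}
How likely are you to buy it? Answer only with a number from 1 (definitely not) to 5 (definitely yes)."""

# FLR: ask for a short reaction first, then the number.
FLR_PROMPT = f"""{PERSONA}
Here is a product concept: {CONCEPT}
In one or two sentences, describe how you feel about it, then give a rating from 1 to 5."""

# SSR: ask only for a natural-language reaction; the number is derived afterwards
# by comparing the reply against Likert anchor statements (see the sketch further down).
SSR_PROMPT = f"""{PERSONA}
Here is a product concept: {CONCEPT}
In one or two sentences, how do you feel about buying it? Do not give a number."""

print(SSR_PROMPT)
```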
The SSR Method
The authors of the study used a framework called SSR when asking an AI to simulate users:
In simple terms, Semantic Similarity Rating (SSR) is a smarter way to get numerical survey results (like a 1–5 “purchase intent” score) from an AI without directly asking it for numbers.
Here’s how it works step-by-step:
Ask for a natural answer, not a number.
Instead of telling the AI “Rate this product from 1 to 5,” you ask it to describe its opinion, like a human would:
"I'd probably buy this; it looks useful and fairly priced."
Compare meaning, not words.
That answer is then turned into a numerical embedding (a mathematical representation of its meaning). The model checks how similar this response is to several reference statements that represent each Likert scale point, for example:
"Definitely would not buy" → 1
"Probably not" → 2
"Maybe" → 3
"Likely to buy" → 4
"Definitely would buy" → 5
If the AI’s answer is closest in meaning to the “likely to buy” reference, it gets rated as a 4.
End result:
You get a 1-5 rating (or a distribution of likely ratings) that reflects the meaning of the AI’s reply, not just a random number.
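To make this concrete, here is a minimal sketch of the scoring step in Python. It is not the authors' implementation: the anchor wording, the embedding model (sentence-transformers), and the softmax over similarities are my assumptions for illustration; the paper's exact mapping may differ.

```python
# Minimal sketch of SSR-style scoring: map a free-text answer to a Likert distribution
# via embedding similarity to anchor statements. Anchors, model choice, and the
# temperature are illustrative assumptions, not the paper's exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer

ANCHORS = {
    1: "I definitely would not buy this product.",
    2: "I probably would not buy this product.",
    3: "I might or might not buy this product.",
    4: "I would probably buy this product.",
    5: "I definitely would buy this product.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works

def ssr_score(answer: str, temperature: float = 0.05) -> dict[int, float]:
    """Return a soft distribution over the 1-5 Likert points for a free-text answer."""
    emb = model.encode([answer] + list(ANCHORS.values()), normalize_embeddings=True)
    sims = emb[1:] @ emb[0]               # cosine similarity to each anchor
    weights = np.exp(sims / temperature)  # sharpen and normalize into probabilities
    probs = weights / weights.sum()
    return {point: round(float(p), 3) for point, p in zip(ANCHORS, probs)}

print(ssr_score("I'd probably buy this; it looks useful and fairly priced."))
# Should put most of the probability mass around 4, with some spill into 3 and 5.
```

Keeping the soft distribution (instead of a hard argmax) is what lets aggregated synthetic responses look like a realistic spread of ratings rather than a spike on one number.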
2. The good
It avoids robotic or uniform answers. LLMs asked directly for numbers tend to overuse the middle (like “3”) and produce unnatural results. SSR keeps the richness of natural language but still turns it into measurable data.
It keeps the "why." You not only get the score but also the reasoning behind it, which can explain user attitudes more clearly. That's useful for product teams or marketers exploring motivation and perception under certain, very limited conditions (like simple consumer products).
It's more human-like. By working through meaning and emotion rather than pure scoring, the AI mimics how people think and express opinions, including along demographic lines like age and income (which are well represented in the study's datasets).
3. The not so good
The following points are particularly relevant if you’re in a company that sells a B2B SaaS product.
3.1 Human panels (suck)
This study simulated a human survey panel, which is one of many tools to get customer sentiment in a market.
The point of a human panel is to see whether people like something before you build it out. This might be relevant for hardware products and physical consumer products because the ratio between building it and having people react to it is so extreme.
But in tech, this is not a good approach:
AI made human panels for software services even less relevant in general because even the best human panel cannot beat exposing a flow or product to real users. Building those flows and products has never been easier, so why even bother with a human panel?
They are also impossible to put together for B2B products because you commonly deal with teams, compliance, processes, and not individuals who want to buy a shampoo.
3.2 Intent vs. Action
This is probably my biggest gripe with this paper. The study measures stated intent, not revealed actions. (Like a human panel would, and that’s why they suck)
The number one rule when interviewing people is to focus on their past actions rather than their future plans. Humans are comically bad at predicting their future but comparatively accurate when describing things they did do in the past.
In software, “intent” may not (and very often does not) equal trial adoption due to integration complexity, lack of knowledge, switching costs, or compliance fears - barriers that SSR cannot infer without real behavioral data.
3.3 Limited coverage
The original experiments succeeded because personal care products are heavily represented in public discourse. B2B SaaS decision-making appears sparsely in general web data, meaning model priors are likely misaligned unless fine-tuned or augmented with internal CRM or call transcripts.
And even if they are fine-tuned with customer data, it remains to be seen how practical the generated insights can be due to data bias (your customers are not an accurate reflection of the market) and oversimplification (see next chapter).
4. Reality
I can almost guarantee you that you will hear variations of the following sentences in the next weeks or months in your company or uttered by leadership:
"AI (through SSR) can now simulate consumer behaviour; we should use that to speed up our A/B experimentation or run AI through our experiments before we ship them."
“We can use AI to predict consumer purchase intent. We should use it to inform/speed up our strategy.”
Both of these statements are highly problematic because a) they are untrue and b) they will be used to make sweeping decisions that damage companies on a fundamental level and create unrealistic expectations towards their GTM teams.
When you trace back problems in most companies today, they have very little to do with process improvements or whether their departments can ship stuff fast enough.
My biggest challenge to this day is to convince founders that they and their leadership (and PMs) should talk to more real, existing people rather than just look at the quantified data dashboards.
If they don’t, they fast-track themselves into a red queen company death. (No more innovation)
It is impossible to come up with a good strategy unless it contains something that is hard to detect for your competition: underserved needs (differentiators) in a big enough, monetizable market segment (ICP).
It's not that the current models are simply not good enough yet to come up with a great strategy that checks this box:
Anything you can easily find with an AI, especially in publicly available data, will not lead you to differentiators, because underserved needs are almost never present in that dataset (sometimes they show up in market studies, but those are questionable as well and generally extremely broad).
Also, by definition, if it’s easy for you to find, it’s easy for others, so it’s probably not a differentiator you’re looking for.
A primer on what table stakes and differentiators are and how to find them:
An LLM can definitely help you flesh out table stakes, and it handles objectively verifiable problems like UX, grammar, and consistency checks (are we using the same CTAs everywhere, etc.) really well. It won't, however, find underserved needs without additional data about your industry.
But differentiators are what a company needs to position itself and stay alive in any market environment.
4.1 Oversimplification
Leadership teams have a strong tendency to oversimplify findings like these, and the danger with overusing AI (and overestimating its effect) is that it produces results that sound extremely substantiated but are often confidently incorrect.
This is partly due to how models are currently trained: we "reward" correct outcomes and "punish" "I don't know" answers, which, ironically, leads them to say something rather than nothing. Hallucinations are the result.
Two years down the line, these people will write LinkedIn posts and point to other factors as the reason their business went under.
You should always verify differentiators against a specific segment of prospects to ensure the need is real and comes up in qualitative conversations before you tell the whole company to go after it.
The way to do that is not to produce more spaghetti with AI and throw it at the wall in the hopes that the revenue from customers will validate it for you (you probably don’t have enough prospects to test against); the correct approach is:
Formulate a hypothesis by talking to real customers and understanding their problems. (Don’t ask them what they want, know what they struggle with. Don’t talk about your company at all.)
Try to size the problem against the market, not your existing customers. What percentage of that segment do we think is struggling with this problem?
Verify the percentage on a larger sample (e.g., survey) if possible. If the segment is too small, redefine it or find a different problem.
Develop a potential solution or prototype for the problem and then introduce it to the market.
If you outsource steps 1 - 3 to an AI, you will most likely develop something that has no chance in the market before you code the first line.
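On step 3: a plain confidence interval for a proportion is often enough to sanity-check whether your survey sample is big enough to trust the percentage. A minimal sketch; all numbers below are made up.

```python
# Back-of-the-envelope check for step 3: how precise is a problem-prevalence estimate
# from n survey respondents? The counts below are made up for illustration.
import math

def proportion_ci(hits: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Point estimate and 95% normal-approximation confidence interval for a proportion."""
    p = hits / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# e.g. 42 of 120 surveyed ICP prospects report the problem
p, low, high = proportion_ci(42, 120)
print(f"~{p:.0%} of the segment, 95% CI {low:.0%}-{high:.0%}")
# With n=120 the interval is roughly +/- 9 points; decide whether that precision is
# good enough before the whole company goes after the segment.
```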
People and businesses don’t buy good products because they are good; they buy them if those products solve an underserved need, which means they have to be decisively better than what they use at the time.
Surfacing these needs happens by talking to real people. Treat them like a data source accessible only to you.
Not good: Not talking to ICPs and customers.
Still not good but better: Feeding an AI your customer interviews and sales call transcripts.
Really good: Talking to prospects yourself regularly, developing a hypothesis based on your growing product sense, and then using AI to help you build it out.
This applies to all PMs, Founders, and the leadership team (at a minimum, product, marketing, and growth).
5. Future
I'm skeptical about how good future models will be at outcome- and reality-based performance (how true a statement is when it cannot be independently verified), for two reasons:
5.1 Model poisoning
It's incredibly easy to poison models (source: Anthropic research) and skew their results, no matter how small the false information is as a proportion of the total dataset (meaning a bigger model is not protected from this; if anything, it's even more vulnerable):
This is especially problematic when AI is used for insights whose value depends on how close the outcome is to reality, rather than on following a ruleset (like playing a game or correcting grammar) or on pure opinion.
5.2 Data quality is degrading, not increasing
The existing data that we had a couple of years ago on the internet was not always correct, but it was human-generated.
We are quickly heading toward a future where most content (including papers) is AI-generated, not just AI-assisted. And that content is geared towards generating attention, not necessarily representing the truth.
It's also becoming more and more indistinguishable from "real" insights.
This is a huge problem because the popularity of content (how many people have upvoted a video, for instance) is not a good indicator of objective truth for models, but is often used as one.
Kurzgesagt made a great video on this topic that explains why the internet as a primary data source is getting worse, even as the models themselves become better under the hood:
Conclusion
There's not much in this paper that we can use for software development. While its abstract is technically correct, it's also phrased in a way that is very easy to misunderstand.
What the paper "proves" is that LLMs can mimic human panels to a high degree, not human behaviour. And human panels are an old method that doesn't work well anyway, for the reasons above.
No. We cannot use synthetic people to simulate real ones in tech, broadly speaking.
It’s a great example of how some leaders will try to overfit those findings and use them as an excuse to talk to customers even less.
I leave you with a comment on LinkedIn that made me laugh because of how accurate it is in this situation:
"I would also point out that the experiment had to (by definition) rely on LLMs producing output that it already had in the training data. The outputs of the reference panel groups are very likely available one way or another.
Since LLMs are just guessing the most likely answers, their output will generally narrow the bell curve and cut out the long tails on both sides.
So we add insult to injury. Not only do we generate a signal on 'what people say, not what they do,' but we also narrow it down to the most likely answers.
If we happen to aspire to innovate in any niche, then it will literally get us going backwards. It would be like a classic description of Ford's 'If I had asked people what they wanted, they would have said faster horses.'
I read that paper with a sceptically raised eyebrow and I'm very glad someone who fully understands how it all works under the hood is here to explain it properly."