What Product Managers Need to Know About AI in 2025
You do not need to understand transformer architecture. You do need to understand these seven things.
Every month, at least one PM asks me what courses they should take to "learn AI." They are usually thinking about machine learning theory — gradient descent, attention mechanisms, loss functions. My answer is always the same: you do not need to understand how transformers work any more than you need to understand TCP/IP to build a web application. What you need is a mental model for what AI can do, what it cannot do, what it costs, and how to evaluate whether it is working.
I have spent over fifteen years building enterprise products, and the last several years building specifically with and around large language models. Here are the seven things that actually matter for product managers working with AI in 2025. Not the theory. The practice.
1. What LLMs Can and Cannot Do
LLMs are pattern-completion engines. They predict the next token based on patterns in their training data. Understanding this at a functional level is the single most important thing a PM can learn about AI.
What they do well: generation. Drafting emails, writing code, summarizing documents, translating languages, creating structured data from unstructured input. Generation is the sweet spot because the model is doing what it was designed to do — producing plausible next tokens.
What they do adequately: reasoning. LLMs can perform multi-step reasoning, but they do it by pattern-matching against similar reasoning chains in their training data. For common patterns — syllogisms, arithmetic, code logic — they are reliable. For novel reasoning that requires genuine abstraction, they are inconsistent. Think of LLM reasoning as "junior analyst" level: correct for straightforward problems, unreliable for edge cases.
What they do poorly: retrieval of specific facts. LLMs do not have a database. When you ask for a specific date, number, or name, the model is generating what it predicts should come next, not looking it up. This is why LLMs hallucinate: the model is not "making things up" so much as predicting the most plausible next token, and sometimes that token is wrong.
The PM takeaway: Generation features are the sweet spot. Reliable fact retrieval requires RAG or a different architecture. Novel reasoning requires extensive testing and fallback design.
2. Prompt Engineering Is Product Design
For AI-powered features, the prompt is the product specification. It is not a technical implementation detail. It is a product decision that determines the user experience.
The prompt defines the personality, constraints, format, tone, guardrails, and failure behavior. Changing a single sentence can shift the product from helpful to useless, from professional to casual, from safe to dangerous.
I have seen teams spend three months building an AI feature only to discover that the prompt — written by an engineer in twenty minutes — produces output that does not match the product vision. The fix took two days of prompt iteration by the PM. The engineering work was fine. The twenty-minute prompt was the bottleneck.
Practical implications: You should be writing and iterating on prompts yourself. You should have a prompt versioning system. You should test changes against a consistent set of inputs before deploying them. Treat the system prompt as a product artifact that requires the same review process as a UI mockup.
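To make "test changes against a consistent set of inputs" concrete, here is a minimal sketch of a prompt regression check. The model call is stubbed out; in practice you would swap in your provider's API client. The prompt text, test inputs, and required/forbidden terms are all hypothetical.

```python
# Hypothetical prompt under version control. A change to this string should
# trigger a regression run before deployment.
PROMPT_V2 = "Summarize the ticket in one sentence. Be professional. Never promise refunds."

TEST_CASES = [
    # (input, required_substrings, forbidden_substrings)
    ("Customer says the export button is broken.", ["export"], ["refund"]),
    ("Customer demands money back for a late delivery.", [], ["guarantee a refund"]),
]

def call_model(system_prompt: str, user_input: str) -> str:
    """Stub standing in for a real LLM API call."""
    return f"Summary: {user_input.rstrip('.')}."

def run_regression(system_prompt: str) -> list[str]:
    """Return failure descriptions; an empty list means the prompt passed."""
    failures = []
    for user_input, required, forbidden in TEST_CASES:
        output = call_model(system_prompt, user_input).lower()
        for term in required:
            if term not in output:
                failures.append(f"missing {term!r} for input {user_input!r}")
        for term in forbidden:
            if term in output:
                failures.append(f"forbidden {term!r} appeared for input {user_input!r}")
    return failures

failures = run_regression(PROMPT_V2)
print("PASS" if not failures else failures)
```

Even a harness this crude catches the most common failure mode: a prompt tweak that fixes one case while silently breaking another.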
Prompt engineering is not a technical skill. It is a communication skill — specifying desired behavior clearly and completely. PMs are, in my experience, better at this than most engineers.
3. Evaluation Is the Hardest Part
In traditional software, the button either opens the modal or it does not. In AI features, "works" is probabilistic. The model generates a good summary 85% of the time and a mediocre one 15% of the time. Defining "good," measuring it consistently, and improving it is the hardest problem in AI product management.
You need three evaluation approaches.
Automated evaluation uses a second AI model to score the first. Fast, scalable, useful for catching regressions — but the evaluator has its own biases. Necessary but not sufficient.
Human evaluation uses real people to rate output on defined criteria. Slow and expensive, but the only way to measure whether output meets user needs. Use a structured rubric with 3 to 5 criteria, at least 100 examples per iteration cycle.
Production metrics measure what users actually do. Do they accept or reject? Edit extensively or minimally? Come back tomorrow? These matter most but take weeks to accumulate.
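The three approaches can feed one weekly quality report. Below is a hedged sketch of the aggregation step; the rubric criteria, weights, ratings, and the two production-side rates are all illustrative placeholders, not real data.

```python
from statistics import mean

def rubric_score(ratings: list[dict[str, int]], weights: dict[str, float]) -> float:
    """Average weighted human-rubric score across rated examples (1-5 scale)."""
    total_weight = sum(weights.values())
    def weighted(r: dict[str, int]) -> float:
        return sum(r[c] * w for c, w in weights.items()) / total_weight
    return mean(weighted(r) for r in ratings)

# Illustrative rubric with 3 criteria (the text suggests 3 to 5).
weights = {"accuracy": 0.5, "tone": 0.25, "format": 0.25}
ratings = [
    {"accuracy": 5, "tone": 4, "format": 5},
    {"accuracy": 3, "tone": 5, "format": 4},
]

judge_pass_rate = 0.88   # share of outputs the automated LLM judge approved
acceptance_rate = 0.74   # share of suggestions users kept in production

print(f"human rubric: {rubric_score(ratings, weights):.2f}/5")
print(f"judge pass rate: {judge_pass_rate:.0%}, acceptance: {acceptance_rate:.0%}")
```

The point is not the formula but the habit: all three signals land in one place, every iteration cycle, so divergence between the automated judge and real users is visible early.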
The PM takeaway: Build an evaluation pipeline before you build the feature. Budget 20 to 30% of AI feature development time for evaluation.
4. Cost Structure: Tokens, Latency, and Model Selection
AI features have a fundamentally different cost structure than traditional software. Understanding this is critical for pricing, margin analysis, and architecture.
The unit economics are per-query, not per-user. A traditional SaaS feature costs the same whether the user clicks once or a hundred times. An AI feature costs money on every request. The primary cost drivers are input tokens (data sent to the model) and output tokens (the response generated).
The numbers matter. A typical API call to a frontier model costs $0.01 to $0.08. A support triage feature processing 10,000 tickets per day at $0.03 per ticket costs $9,000 per month in API fees alone. A content generation feature producing 500 drafts per day at $0.06 per draft costs $900 per month.
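The arithmetic above is worth encoding once so the team reasons from the same model. A sketch, with prices expressed per million tokens as most providers quote them; the token counts and rates here are illustrative, so check your provider's current price sheet.

```python
def cost_per_query(tokens_in: int, tokens_out: int,
                   price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one API call given per-million-token prices."""
    return tokens_in / 1e6 * price_in_per_m + tokens_out / 1e6 * price_out_per_m

def monthly_cost(queries_per_day: int, per_query: float, days: int = 30) -> float:
    return queries_per_day * per_query * days

# Hypothetical triage call: 6k input tokens, 500 output tokens,
# at $3/M input and $15/M output.
per_ticket = cost_per_query(6_000, 500, 3.0, 15.0)
print(f"per ticket: ${per_ticket:.4f}")

# Reproducing the support triage example from the text:
print(f"monthly: ${monthly_cost(10_000, 0.03):,.0f}")  # $9,000
```

Note that input tokens often dominate: long system prompts and pasted context are billed on every single request.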
Latency is the hidden cost. Frontier models take 2 to 8 seconds to generate a response. Fine for content drafting. Not fine for real-time suggestion features targeting 200-millisecond response time. Faster models exist but are less capable.
Model selection is a product decision. There is no single best model — only a frontier of trade-offs between capability, speed, cost, and context window size. For most enterprise use cases, the right answer is not the most powerful model. It is the cheapest model that meets the quality bar.
5. RAG vs. Fine-Tuning: When to Use Each
When the base model does not know enough about your domain, you have two options: give it information at query time (RAG), or train it on your data (fine-tuning). This decision shapes architecture, cost, and maintenance.
RAG finds relevant documents from your data, inserts them into the prompt, and lets the model generate an informed response. It is the right choice 80% of the time. No model training required, works with any base model, updates instantly when data changes, and sources are traceable.
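The mechanism is simple enough to sketch. Below is a deliberately naive version: documents are scored by word overlap with the query, and the best matches are pasted into the prompt. Real systems use embedding search and chunking; the document contents here are made up.

```python
DOCS = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Shipping: standard delivery takes 3-5 business days.",
    "Passwords can be reset from the account settings page.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by crude word overlap with the query; return the top k."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    """Insert retrieved documents into the prompt at query time."""
    context = "\n".join(retrieve(query, DOCS))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

print(build_prompt("How long do refunds take?"))
```

Notice the traceability benefit: because the answer is grounded in retrieved text, you can show users exactly which documents informed it.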
Fine-tuning trains the model on your data so it internalizes domain patterns, terminology, and style. Use it when you need consistent tone and format, when domain knowledge is about patterns rather than facts, or when you need to reduce per-query cost by eliminating the retrieval step.
The decision framework:
| Criterion | Choose RAG | Choose Fine-Tuning |
|---|---|---|
| Data changes frequently | Yes | No |
| Need to cite sources | Yes | No |
| Domain has specialized patterns/style | No | Yes |
| Budget for training runs ($500-5,000+) | No | Yes |
| Time to first deployment | Days | Weeks |
| Ongoing maintenance | Data pipeline | Retraining pipeline |
The PM takeaway: Default to RAG. Move to fine-tuning only when RAG cannot achieve your quality bar on style, tone, or domain-specific patterns.
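The table above can be collapsed into a rough decision helper. This is a simplification, not a substitute for judgment; any single "choose RAG" signal falls through to RAG, matching the default-to-RAG rule.

```python
def choose_approach(data_changes_frequently: bool,
                    need_source_citations: bool,
                    needs_domain_style: bool,
                    has_training_budget: bool) -> str:
    """Rough encoding of the RAG vs. fine-tuning decision table."""
    if data_changes_frequently or need_source_citations:
        return "RAG"
    if needs_domain_style and has_training_budget:
        return "fine-tuning"
    return "RAG"  # the default when no strong fine-tuning signal exists

print(choose_approach(False, False, True, True))  # fine-tuning
print(choose_approach(True, False, True, True))   # RAG
```

The asymmetry is deliberate: fine-tuning has to earn its place, because it adds a training budget and a retraining pipeline that RAG avoids.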
6. The Build-vs-Buy Decision for AI Features
This decision is more nuanced for AI because the technology moves so fast that what you build today may be commoditized in six months.
Build when the AI feature is your core differentiator, you have proprietary data that creates a moat, or the use case is novel enough that no vendor serves it well.
Buy when the AI feature is table stakes, the vendor's solution is 80%+ of what you need, or you need to ship in weeks rather than months.
The trap I see most often: companies building commodity AI features from scratch. If you are building a meeting summarizer or a support chatbot, there are a dozen vendors who have spent millions of dollars on exactly that problem. Your engineering team will not build a better one in three months.
My rule of thumb: if the AI feature is why customers buy your product, build it. If it is why they do not leave, buy it.
7. AI Products Need Different Success Metrics
Traditional product metrics — DAU, retention, NPS — still matter. But they are not sufficient. AI products introduce new dimensions, and ignoring them leads to products that appear healthy while quietly degrading.
Accuracy (or quality score): What percentage of outputs meet your quality bar? Track weekly. For most enterprise use cases, 85% is the minimum viable accuracy. Below that, users lose trust faster than you can build it.
Latency at the 95th percentile matters more than average latency. If your feature responds in 2 seconds on average but takes 12 seconds for 5% of queries, those users will be disproportionately vocal.
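A quick synthetic example shows why the tail dominates. With 5% of queries taking 12 seconds, the average barely moves while the p95 tells the real story. The percentile function below uses a simple index-based definition rather than interpolation.

```python
# Synthetic latency distribution: 95% fast queries, 5% slow ones.
latencies = [2.0] * 95 + [12.0] * 5  # seconds

def percentile(values: list[float], pct: float) -> float:
    """Simple index-based percentile (no interpolation), good enough for dashboards."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(pct / 100 * len(ordered)))
    return ordered[idx]

avg = sum(latencies) / len(latencies)
print(f"mean: {avg:.1f}s, p95: {percentile(latencies, 95):.1f}s")
# mean: 2.5s, p95: 12.0s
```

A dashboard showing only the 2.5-second mean would look healthy while one in twenty users waits 12 seconds.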
Cost per query determines unit economics. Track by feature, model, and user segment. A feature costing $0.02 per query at launch can creep to $0.08 as prompts grow.
Rejection rate — the percentage of AI suggestions users discard — is your real-time quality signal. A code suggestion tool with a 70% acceptance rate is transforming productivity. A meeting summarizer with a 40% rejection rate is saving no one time.
Fallback rate — how often the AI fails entirely — determines reliability. A 5% fallback rate means roughly one in twenty requests fails, which is enough that your heaviest users hit a broken experience nearly every session. Traditional uptime metrics will not catch it.
The Meta-Lesson
The through-line across all seven points: AI product management requires the same core skills as any product management — understanding users, defining problems, measuring outcomes — applied to a technology with fundamentally different characteristics.
AI is probabilistic, not deterministic. It is expensive per-query, not per-user. It requires evaluation, not just testing. And it improves through iteration on prompts and data, not just code.
The PMs who thrive will not be the ones who understand transformer architecture. They will be the ones who build evaluation pipelines on day one, treat prompts as product artifacts, track cost per query alongside DAU, and make build-vs-buy decisions based on strategic positioning rather than engineering ego.
Onil Gunawardana is the founder of Business of AI and a product management executive with 15+ years building enterprise software. He writes about AI strategy, product development, and practical business use cases for AI.