
Inside the Audit (09/12): Whether AI Actually Recommends Your App

The AI Discovery Engine queries ChatGPT, Gemini, Claude, Perplexity, and Google AI Overviews to measure if and how LLMs recommend your app.

Yevhen Tarasenko & Thomas Purnell-Fisher

Apptonomy is an ASO intelligence and execution platform. Paste an App Store or Google Play URL, and the platform runs a full audit across multiple specialized engines, delivering scored findings and prioritized recommendations in minutes. For the full picture of how an audit works, read What You Get From an Apptonomy Audit.

The Orchestrator

Behind every audit is the Audit Engine, an orchestrator that spins up each specialized subengine in parallel and synthesizes their findings into a single unified report with an ASO Readiness Score (0-100). The current subengines:

  • Keyword Engine
  • Store Text Engine
  • Screenshot Engine
  • Icon Engine
  • Sentiment Engine
  • Competitor Discovery Engine
  • Policy Checker
  • Content Engine
  • AI Discovery Engine
  • Search Term Engine
  • Intent Engine

This post covers the AI Discovery Engine and what happens when it evaluates your app’s visibility to AI assistants.

The ASO Problem: A New Discovery Layer You Cannot See

A growing share of users now ask an AI assistant for app recommendations before they visit the App Store or Google Play. The pattern is straightforward: “What’s the best running app for beginners?” goes into ChatGPT, Gemini, or Perplexity, and the user gets a shortlist with reasoning. By the time they open the store, their decision is mostly made. If your app was not on that shortlist, you lost the install before traditional ASO even had a chance to work.

The challenge for ASO teams is that this discovery layer is invisible to existing tools. No standard ASO dashboard tells you whether ChatGPT mentions your app when a user asks for a recommendation in your category. And no dashboard shows you why it does or does not.

LLM recommendations are also inconsistent. Ask the same question twice and you may get different answers. A single test query tells you almost nothing. What matters is your recommendation frequency across many queries, across multiple LLMs, across the full range of user intents your app serves. That is a measurement problem that manual spot-checking cannot solve.

Our position: AI discoverability is a new layer on top of traditional ASO. The apps that get recommended by LLMs tend to have clear problem-solution language in their metadata, verifiable claims, consistent messaging across store listing and website, and strong community signals. These are ASO fundamentals, reframed for a different audience. The AI Discovery Engine measures where you stand and tells you exactly what to fix. This is a gap across the entire ASO tooling landscape. No existing platform provides systematic AI discoverability measurement.

Here is what the AI Discovery Engine found for a fitness app: mentioned in 4 out of 50 AI queries (8% presence), with a share of voice of 8% against a category leader at 52%. The engine identified three high-value user intents where no LLM recommended the app, and flagged a messaging gap between the store listing and website. Every technical section below explains how the engine produces findings like these.

Under the Hood

The AI Discovery Engine runs a multi-step pipeline: generate realistic user prompts, query five LLM providers, parse every response for app mentions and positioning, compute share of voice against competitors, map intent coverage, score across five weighted dimensions, and compare your store listing against your website for consistency. Here is what each stage produces.

Prompt Battery Generation

The problem: To know whether AI recommends your app, you need to ask the right questions. A single generic prompt like “best fitness app” gives you one data point. Users ask LLMs in many different ways: by category (“top budgeting apps”), by problem (“how do I track my spending”), by audience (“best finance app for college students”), by comparison (“Mint vs YNAB”), and by switching intent (“alternative to Mint”). Your measurement needs to cover all of these query patterns.

How the engine handles it: The engine generates a battery of 10 natural-language prompts using an AI model that takes your app’s category, primary use case, competitors, and user intents (from the Intent Engine if available) as context. Prompts are distributed across seven types: direct category, problem-solution, audience-specific, comparative, feature-specific, switching, and brand-aware. When intent data is available, prompts are mapped to specific user intents for later coverage analysis. For non-English locales, prompts are wrapped with language instructions so the LLMs respond in the target market’s language. For teams operating across multiple markets, the engine generates locale-specific prompt batteries for each target market, capturing regional differences in how users query AI assistants. A user in Japan asks for app recommendations differently than a user in Germany, both linguistically and in terms of category conventions. If AI generation fails, the engine falls back to a template system with pre-built prompt patterns.

What you get: A diverse battery of realistic user queries that simulate how people actually ask AI assistants for app recommendations in your category.
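The template fallback described above can be sketched roughly as follows. This is an illustration, not the engine's actual code: the template wording, the `fallback_battery` function, the `context` keys, the app name "ExampleBudget", and the language-instruction sentence are all assumptions. Only the seven prompt types, the battery size of 10, and the idea of wrapping prompts for non-English locales come from the description above.

```python
# Hypothetical templates: two variants for each of the seven prompt types.
TEMPLATES = {
    "direct_category": ["What are the best {category} apps?",
                        "Which {category} apps would you recommend?"],
    "problem_solution": ["How do I {use_case}?",
                         "What app can help me {use_case}?"],
    "audience_specific": ["What is the best {category} app for {audience}?",
                          "Which {category} apps work well for {audience}?"],
    "comparative": ["{competitor} vs {app}: which is better?",
                    "How does {app} compare to {competitor}?"],
    "feature_specific": ["Which {category} apps have {feature}?",
                         "What is the best {category} app with {feature}?"],
    "switching": ["What is a good alternative to {competitor}?",
                  "I want to switch from {competitor}. What should I try?"],
    "brand_aware": ["Is {app} a good {category} app?",
                    "What do people think of {app}?"],
}


def fallback_battery(context, size=10, language=None):
    """Build `size` prompts by rotating through the seven prompt types."""
    types = list(TEMPLATES)
    battery = []
    for i in range(size):
        prompt_type = types[i % len(types)]
        variants = TEMPLATES[prompt_type]
        # Use the next variant on each full pass so repeated types differ.
        text = variants[(i // len(types)) % len(variants)].format(**context)
        if language:
            # Assumed wording for the locale wrapper.
            text = f"{text}\n\nPlease answer in {language}."
        battery.append({"type": prompt_type, "prompt": text})
    return battery


context = {
    "app": "ExampleBudget",          # hypothetical app name
    "category": "budgeting",
    "use_case": "track my spending",
    "audience": "college students",
    "competitor": "Mint",
    "feature": "shared budgets",
}
for item in fallback_battery(context):
    print(item["type"], "->", item["prompt"])
```

Rotating through the types first, then through the variants, keeps every type represented even in a small battery.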

Multi-LLM Query Execution

The problem: Different LLMs recommend different apps. ChatGPT, Gemini, Claude, and Perplexity each draw from different training data and use different retrieval mechanisms. Google AI Overviews are a separate channel entirely, blending web search results with generative summaries. If you only test one LLM, you are seeing a fraction of the picture.

How the engine handles it: The engine queries five targets in parallel: ChatGPT (GPT-4.1-nano), Gemini (2.5 Flash Lite), Claude (Haiku 4.5), Perplexity (Sonar), and Google AI Overviews. These models were selected for cost-efficient batch querying at scale; recommendation patterns are validated against flagship model outputs during engine calibration. Each prompt is sent to all five. Rate limiting is applied per LLM, and each provider is queried independently, so if one provider goes down, the engine continues with the others. Perplexity and Google AI Overviews also return citations (the URLs they reference in their answers), which are collected for later analysis.

What you get: Raw text responses from up to five AI providers for every prompt in the battery, capturing the full spectrum of how different LLMs handle your category.
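To make the "one provider down, the rest continue" behavior concrete, here is a minimal sketch using Python's `asyncio`. Everything in it is an assumption for illustration: `query_provider` is a stub standing in for real API clients, the Gemini outage is simulated, and a per-provider semaphore (a concurrency cap) stands in for whatever rate limiting the engine actually uses.

```python
import asyncio

PROVIDERS = ["chatgpt", "gemini", "claude", "perplexity", "google_ai_overviews"]


async def query_provider(provider, prompt):
    """Hypothetical stub; a real version would call the provider's API."""
    await asyncio.sleep(0)
    if provider == "gemini":
        raise ConnectionError("simulated outage")
    return f"[{provider}] answer to: {prompt}"


async def limited_query(provider, prompt, limits):
    # Each provider has its own limit, so a slow or failing provider
    # never consumes another provider's capacity.
    async with limits[provider]:
        return await query_provider(provider, prompt)


async def run_battery(prompts, max_concurrent=2):
    limits = {p: asyncio.Semaphore(max_concurrent) for p in PROVIDERS}
    jobs = [(provider, prompt) for prompt in prompts for provider in PROVIDERS]
    # return_exceptions=True turns failures into values instead of
    # cancelling the whole batch.
    outcomes = await asyncio.gather(
        *(limited_query(provider, prompt, limits) for provider, prompt in jobs),
        return_exceptions=True,
    )
    results = []
    for (provider, prompt), outcome in zip(jobs, outcomes):
        if isinstance(outcome, Exception):
            continue  # skip the failed call, keep everything else
        results.append({"provider": provider, "prompt": prompt, "text": outcome})
    return results


results = asyncio.run(run_battery(["best running app for beginners?"]))
print(sorted(r["provider"] for r in results))
```

With the simulated outage, the run still returns responses from the other four providers, which is why the post describes the output as coming from "up to five" providers.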

Response Analysis

The problem: A raw text response from an LLM is unstructured. You need to know: was your app mentioned? In what position? With what sentiment? Which competitors also appeared? Which user intents were addressed?

How the engine handles it: The engine runs a two-stage analysis. First, a heuristic pre-filter checks whether your app name appears in each response, handling name variations automatically. Responses that clearly do not mention your app are classified without an LLM call. Responses that pass the filter are analyzed by an AI model that extracts: mention position in the recommendation list (1st, 2nd, 3rd, etc.), mention context, sentiment (positive, neutral, negative), recommendation strength (strong, moderate, weak, absent), competitor co-mentions with positions, and intent matches.

What you get: Structured analysis for every query result: mention detection, position ranking, sentiment classification, competitor co-occurrence, and intent matching.
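The first-stage pre-filter can be approximated with simple string matching. The sketch below is an assumption about how such a filter might work, not the engine's implementation: the variant rules (dropping a subtitle, removing spaces) and the app name "Pace Mate: Running Coach" are hypothetical.

```python
import re


def name_variants(app_name):
    """Generate a few lowercase variants of a store listing name."""
    base = app_name.lower().strip()
    # Drop a subtitle after ":" or a spaced "-" / "|" separator.
    short = re.split(r"\s*:\s+|\s+[-|]\s+", base)[0]
    return {base, short, short.replace(" ", "")}


def mentions_app(response, app_name):
    """Cheap check that runs before any LLM-based analysis."""
    text = response.lower()
    for variant in name_variants(app_name):
        # Match whole words only, so a short name does not match
        # inside a longer unrelated word.
        if re.search(r"(?<!\w)" + re.escape(variant) + r"(?!\w)", text):
            return True
    return False


app = "Pace Mate: Running Coach"  # hypothetical app name
print(mentions_app("For beginners I would start with PaceMate.", app))
print(mentions_app("Most runners pick one of the big-name trackers.", app))
```

Only responses where this check returns True would go on to the more expensive AI analysis step; the rest can be classified as "absent" immediately.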

Share of Voice

The problem: Knowing your app was mentioned in some queries tells you something. Knowing how often it was mentioned relative to your competitors tells you much more. If ChatGPT recommends Competitor X in 70% of relevant queries and your app in 15%, that gap is the real finding.

How the engine handles it: The engine counts your app’s mentions across all query results and compares them against each competitor. Share of voice is your mention count divided by total mentions (yours plus all competitors), expressed as a percentage. Each competitor gets their own SoV, average mention position, and dominant sentiment, sorted by share of voice so the biggest threats surface first.

What you get: Your share of voice percentage, per-competitor share of voice with position and sentiment data, total responses analyzed, and a count of unique apps mentioned across all queries. When a competitor dominates share of voice, the engine’s findings feed into the Store Text Engine and Content Engine to identify the specific language patterns driving that competitor’s AI visibility, giving your team a concrete optimization target.
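The share of voice calculation is simple enough to show directly. This sketch follows the definition above (an app's mention count divided by total mentions across all apps); the input shape, the function name, and the rule of counting an app at most once per response are assumptions.

```python
from collections import Counter


def share_of_voice(mentions_per_response):
    """Return {app: share of voice in percent}, largest share first.

    `mentions_per_response` holds one list of app names per LLM response.
    """
    counts = Counter(
        app for apps in mentions_per_response for app in set(apps)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {app: 100 * n / total for app, n in ranked}


# Hypothetical data: four responses mentioning apps A, B, and C.
responses = [["A", "B"], ["B"], ["B", "C"], ["A"]]
for app, sov in share_of_voice(responses).items():
    print(f"{app}: {sov:.1f}%")
```

Because the denominator is total mentions rather than total queries, share of voice and presence measure different things: presence is about how often you appear at all, share of voice is about how much of the conversation you own.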

Intent Coverage Mapping

The problem: Your app might get recommended for “best budgeting app” but never for “track spending across multiple accounts” or “budget app for couples.” Each unmatched intent represents a segment of users who will never find your app through AI recommendations.

How the engine handles it: The engine maps prompts to user intents (from the Intent Engine) and checks which intents result in your app being recommended. For each intent, it calculates a mention rate, average position when mentioned, and which LLMs mentioned your app for that intent. The overall intent coverage rate is the percentage of intents where your app was mentioned at least once.

What you get: A per-intent coverage map showing which user needs AI associates with your app and which it does not, along with an overall coverage rate (0-100%).
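A rough sketch of the coverage calculation, assuming each analyzed response has already been reduced to a small record. The field names (`intent`, `provider`, `mentioned`, `position`) and the sample data are hypothetical.

```python
def intent_coverage(results):
    """Return (per-intent report, overall coverage rate in percent)."""
    grouped = {}
    for r in results:
        entry = grouped.setdefault(
            r["intent"],
            {"queries": 0, "mentions": 0, "positions": [], "providers": set()},
        )
        entry["queries"] += 1
        if r["mentioned"]:
            entry["mentions"] += 1
            entry["providers"].add(r["provider"])
            if r["position"] is not None:
                entry["positions"].append(r["position"])

    report = {}
    for intent, e in grouped.items():
        positions = e["positions"]
        report[intent] = {
            "mention_rate": 100 * e["mentions"] / e["queries"],
            "avg_position": sum(positions) / len(positions) if positions else None,
            "providers": sorted(e["providers"]),
        }
    # An intent counts as covered if the app was mentioned at least once.
    covered = sum(1 for v in report.values() if v["mention_rate"] > 0)
    overall = 100 * covered / len(report) if report else 0.0
    return report, overall


sample = [
    {"intent": "best budgeting app", "provider": "chatgpt", "mentioned": True, "position": 2},
    {"intent": "best budgeting app", "provider": "claude", "mentioned": False, "position": None},
    {"intent": "budget app for couples", "provider": "chatgpt", "mentioned": False, "position": None},
]
report, overall = intent_coverage(sample)
print(report)
print(overall)
```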

AI Discovery Score

The problem: You need a single number that captures your overall AI discoverability health, with enough granularity to know where to focus.

How the engine handles it: The engine computes a composite score (0-100) from five weighted dimensions:

  • Presence (30% weight): What percentage of queries resulted in your app being mentioned?
  • Position (25%): When mentioned, where do you rank? First recommendation scores 100, second scores 80, third 60, fourth 40, fifth 20, and sixth or later 10.
  • Sentiment (15%): What proportion of mentions frame your app positively versus neutrally or negatively?
  • Intent Coverage (20%): What percentage of your relevant user intents are covered by AI recommendations?
  • Competitive Standing (10%): How does your share of voice compare to the strongest competitor?

Presence carries the highest weight because an app that never appears in recommendations cannot benefit from favorable positioning or sentiment. Intent Coverage is weighted at 20%, behind only Presence and Position, because coverage breadth determines how many user segments can discover the app through AI. Competitive Standing is weighted lowest because it is a relative benchmark derived from the other dimensions. The default weights reflect cross-category analysis of AI recommendation patterns.

When community signal data is available, a sixth dimension (Community Health) is added at 10% weight and the other dimensions are rebalanced. Position scoring uses a fixed map: 1st place = 100 points, 2nd = 80, 3rd = 60, 4th = 40, 5th = 20, 6th or later = 10. Competitive standing is scaled by absolute presence, so low-SoV apps do not get inflated scores simply because their competitors are also invisible.

What you get: An overall AI Discovery Score (0-100) with a full breakdown by dimension. Each dimension score identifies exactly where your AI discoverability is strong and where it needs work.

A low Presence score means LLMs are not surfacing your app at all. The platform’s Store Text Engine and Content Engine use this signal to recommend specific metadata changes that improve AI visibility. A low Intent Coverage score pinpoints the user needs where your app is invisible, giving you a focused gap list for your next metadata update.
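The weighted composite can be sketched in a few lines. The five weights and the position point map are taken from the description above; the rest is assumption. In particular, the post does not say how the five weights are rebalanced when Community Health is added, so this sketch scales them proportionally to 90%, and it treats each dimension score (including Competitive Standing) as an input already on a 0-100 scale. Averaging position points across mentions is also an assumption.

```python
WEIGHTS = {
    "presence": 0.30,
    "position": 0.25,
    "sentiment": 0.15,
    "intent_coverage": 0.20,
    "competitive_standing": 0.10,
}
# Fixed position map from the post; 6th or later scores 10.
POSITION_POINTS = {1: 100, 2: 80, 3: 60, 4: 40, 5: 20}


def position_score(positions):
    """Average position points across all mentions (assumed aggregation)."""
    if not positions:
        return 0.0
    return sum(POSITION_POINTS.get(p, 10) for p in positions) / len(positions)


def composite_score(dimensions, community_health=None):
    """Weighted sum of 0-100 dimension scores, rounded to one decimal."""
    weights = dict(WEIGHTS)
    scores = dict(dimensions)
    if community_health is not None:
        # Assumed rebalancing: shrink the five weights proportionally
        # to make room for a 10% Community Health weight.
        weights = {name: w * 0.90 for name, w in weights.items()}
        weights["community_health"] = 0.10
        scores["community_health"] = community_health
    return round(sum(weights[name] * scores[name] for name in weights), 1)


# Hypothetical dimension scores for a low-visibility app.
dimensions = {
    "presence": 8,
    "position": position_score([3, 3, 1, 5]),
    "sentiment": 50,
    "intent_coverage": 25,
    "competitive_standing": 2,
}
print(composite_score(dimensions))
print(composite_score(dimensions, community_health=40))
```

Keeping the weights in one dictionary makes it easy to see that they sum to 100% in both the five-dimension and six-dimension cases.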

Cross-Channel Consistency Analysis

The problem: LLMs cross-reference multiple sources when deciding whether to recommend an app. If your store listing says one thing and your website says another, the conflicting signals reduce the LLM’s confidence in recommending you. If your store listing leads with AI-powered features but your website emphasizes privacy and simplicity, LLMs see mixed signals about what your app actually does.

How the engine handles it: When a website URL is available, the engine fetches your website content, extracts structured messaging (value proposition, key features, target audience, tone, claims), and does the same for your store listing. An AI model compares the two across five dimensions: value proposition alignment, feature emphasis, audience targeting, tone and voice, and claims and proof points. Each dimension gets a score (0-100). Misalignments are flagged by severity (high, medium, low) with fix recommendations.

What you get: An overall consistency score, a clarity score (how unambiguous your messaging is), and a credibility score (how well your claims are supported), plus per-dimension breakdowns and specific misalignment fixes. For teams where store listings and website content are managed by different groups, this analysis doubles as an alignment tool, surfacing messaging gaps across organizational boundaries with data rather than opinion.

Citation Analysis

The problem: Perplexity and Google AI Overviews cite their sources. Knowing which sources LLMs reference when recommending apps in your category reveals where you need to be present: editorial sites, community forums, your own website, or the store listings themselves.

How the engine handles it: The engine collects all citation URLs from Perplexity and Google AI Overview responses and classifies each by source type: community (Reddit, Stack Overflow, Hacker News), store listing (App Store, Google Play), owned (your website), editorial (TechCrunch, CNET, The Verge, and similar), or other. Citations are aggregated by category, domain, and LLM provider.

What you get: A breakdown by source category with percentages, a ranked list of the most-cited domains, and a per-LLM view of citation sources. If citations are heavily concentrated in one category, the engine flags this as a diversification opportunity.
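Classifying citations by source type comes down to matching each URL's hostname against known domains. The sketch below uses the example domains named above; the exact domain lists, the function names, and the owned domain `example-app.com` are assumptions, and a production list would be much longer.

```python
from collections import Counter
from urllib.parse import urlparse

SOURCE_DOMAINS = {
    "community": {"reddit.com", "stackoverflow.com", "news.ycombinator.com"},
    "store_listing": {"apps.apple.com", "play.google.com"},
    "editorial": {"techcrunch.com", "cnet.com", "theverge.com"},
}


def classify(url, owned_domain):
    """Return the source category for one citation URL."""
    host = (urlparse(url).hostname or "").lower()

    def matches(domain):
        # Accept the domain itself and any subdomain (www., old., etc.).
        return host == domain or host.endswith("." + domain)

    if matches(owned_domain):
        return "owned"
    for category, domains in SOURCE_DOMAINS.items():
        if any(matches(d) for d in domains):
            return category
    return "other"


def citation_breakdown(urls, owned_domain):
    """Return {category: percent of citations}, most common first."""
    counts = Counter(classify(u, owned_domain) for u in urls)
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.most_common()}


urls = [
    "https://www.reddit.com/r/running/comments/abc",
    "https://apps.apple.com/us/app/example/id000000000",
    "https://example-app.com/features",
    "https://some-blog.example.org/best-running-apps",
]
print(citation_breakdown(urls, owned_domain="example-app.com"))
```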

Bringing It Together

The result is a clear, scored picture of where your app stands in AI recommendations, which competitors are ahead of you, and what to change. The multi-LLM querying, structured response analysis, competitive benchmarking, intent mapping, consistency scoring, and citation tracking behind it provide a consistency and breadth of measurement that ad hoc spot-checking cannot match.

Running this analysis once gives you a baseline. Running it as part of every audit, with competitor data from the Competitor Discovery Engine and intent data from the Intent Engine, is where the compounding value builds. Because the engine runs on every audit, teams managing multiple apps can track AI discoverability trends over time, measuring whether metadata updates, review campaigns, or content changes move the needle on share of voice and intent coverage.

Querying individual LLMs and scanning responses works for spot-checking, but it falls apart when you need consistent measurement across providers, markets, and intents at scale. The engine delivers that measurement in minutes.

Run a free Quick Audit now: paste your App Store or Google Play URL at apptonomy.ai and see what the AI Discovery Engine finds.

