Sentiment Engine

The Sentiment Engine analyzes your app’s user reviews to extract structured insights that a star rating alone cannot provide. It classifies review sentiment (accounting for sarcasm and rating-text disagreement), extracts primary sentiment drivers with priority and frequency scores, maps sentiment to specific app features, identifies topics and flags, tracks temporal trends, and classifies every negative finding by ASO actionability. The output includes a Sentiment Health Score, an executive summary, prioritized risks and strengths, and cross-engine outputs that feed store text and screenshot recommendations.

A 4.2-star rating tells you the app is “pretty good.” It does not tell you that 35% of recent reviewers complain about the same onboarding flow, that your most loyal users consistently praise a feature you barely mention in screenshots, or that sentiment trended downward over the past 30 days even though the aggregate rating held steady. The Sentiment Engine extracts these signals from the review corpus.

Under the Hood

The Sentiment Engine runs your reviews through several layers of analysis, each designed to answer a different question about what your users are actually saying. Here is what each layer produces.

Rating Metrics and Review Sampling

The problem: Before analyzing what users say, you need statistical context. How reliable is a dataset of 47 reviews versus 4,700? How does a 4.3 rating stand relative to the ratings that competing apps carry in the store? And when your app has 15,000 reviews, processing every one of them is wasteful if a well-constructed sample captures the same signals.

How the engine handles it: The engine starts with two deterministic computations. A graded score (0-100) maps the 1-5 star rating to a normalized scale using calibrated thresholds (a 4.0 maps to 80, a 3.5 to 55), reflecting the competitive reality that anything below 4.0 is materially disadvantaged. A review volume score (0-100) measures statistical reliability across three tiers: under 30 reviews (low confidence), 30-299 (moderate), and 300 or more (high confidence), with the score capping once the corpus reaches 3,000 reviews.
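
The two computations above can be sketched as follows. This is a minimal illustration, not the engine's actual code: only the 3.5 to 55 and 4.0 to 80 anchor points and the three tier boundaries come from the description above. The remaining anchors, the linear interpolation between them, and the function names are assumptions.

```python
from bisect import bisect_right

# (star rating, graded score) anchor points. Only 3.5 -> 55 and 4.0 -> 80 are
# stated in the text; the 1.0 and 5.0 anchors are placeholder assumptions.
ANCHORS = [(1.0, 0.0), (3.5, 55.0), (4.0, 80.0), (5.0, 100.0)]


def graded_score(rating: float) -> float:
    """Map a 1-5 star rating to 0-100 by interpolating between anchors."""
    rating = min(max(rating, 1.0), 5.0)
    xs = [x for x, _ in ANCHORS]
    # Index of the anchor just above the rating, clamped to the last anchor.
    i = min(bisect_right(xs, rating), len(ANCHORS) - 1)
    (x0, y0), (x1, y1) = ANCHORS[i - 1], ANCHORS[i]
    return y0 + (y1 - y0) * (rating - x0) / (x1 - x0)


def volume_tier(review_count: int) -> str:
    """Classify statistical reliability using the three tiers in the text."""
    if review_count < 30:
        return "low"
    if review_count < 300:
        return "moderate"
    return "high"
```

Note how steep the curve is between 3.5 and 4.0 in this sketch: half a star costs 25 points, which is one way to encode the disadvantage of falling below 4.0.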

For apps with large review corpora, the engine samples reviews with a weighting toward recent ones while preserving the original star-rating distribution. The sample stays representative of the full corpus without processing every single review. The AI models process reviews in whatever language they appear, so multi-language corpora are analyzed natively.
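
One way to build such a sample is stratified sampling by star rating, with recency-weighted selection inside each stratum. The sketch below illustrates that idea under stated assumptions: the `Review` type, the 90-day half-life, and the use of Efraimidis-Spirakis keys for weighted sampling without replacement are illustrative choices, not details confirmed for the engine.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Review:  # hypothetical minimal review record
    stars: int      # 1-5 star rating
    age_days: int   # days since the review was posted
    text: str


def recency_weight(review, half_life_days=90.0):
    # Assumed exponential decay: a review loses half its weight every
    # half_life_days. The floor avoids division by zero for very old reviews.
    return max(0.5 ** (review.age_days / half_life_days), 1e-9)


def sample_reviews(reviews, sample_size, half_life_days=90.0, seed=0):
    """Sample reviews, keeping star-rating proportions and favoring recent ones."""
    if len(reviews) <= sample_size:
        return list(reviews)
    rng = random.Random(seed)
    by_stars = defaultdict(list)
    for review in reviews:
        by_stars[review.stars].append(review)

    sample = []
    for group in by_stars.values():
        # Each star rating keeps its share of the corpus. Rounding means the
        # total can differ from sample_size by a few reviews.
        quota = min(round(sample_size * len(group) / len(reviews)), len(group))
        # Weighted sampling without replacement: draw u in [0, 1), rank by
        # u ** (1 / weight), and keep the highest-ranked items.
        ranked = sorted(
            group,
            key=lambda r: rng.random() ** (1.0 / recency_weight(r, half_life_days)),
            reverse=True,
        )
        sample.extend(ranked[:quota])
    return sample
```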

What you get: A graded score, a volume confidence score, a daily rating timeseries (designed for dot-plot visualization from earliest review to present), and the score histogram from the store. These metrics become inputs to every downstream module.

Sentiment Classification and Driver Extraction

The problem: Star ratings and review text frequently disagree. A 5-star review that says “great app, crashes every time I open the camera” is negative despite its rating. A 2-star review that says “love the concept but needs more features” contains useful positive signal alongside the complaint. Rating-only analysis misses both.

How the engine handles it: The engine classifies every review as positive, neutral, or negative using AI, with text taking precedence over the star rating when they conflict. Sarcasm detection is built into the classification criteria: “Great app! Crashes every time.” is classified as negative regardless of the accompanying star count. Reviews are processed in batches of 50 with concurrency controls, and semantically similar results are deduplicated.
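
The batching described above might look like the following sketch. The batch size of 50 comes from the text. The concurrency limit of 4, the function names, and the stubbed `classify_batch` (which stands in for the AI call and returns a fixed label) are assumptions for illustration.

```python
import asyncio

BATCH_SIZE = 50  # from the text


def chunk(items, size=BATCH_SIZE):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def classify_batch(batch):
    # Hypothetical stand-in for the model call. A real implementation would
    # send the batch to a classifier that weighs text over the star rating.
    await asyncio.sleep(0)
    return ["neutral" for _ in batch]


async def classify_all(reviews, max_concurrency=4):
    """Classify every review, running at most max_concurrency batches at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(batch):
        async with semaphore:
            return await classify_batch(batch)

    # gather preserves batch order, so labels line up with the input reviews.
    results = await asyncio.gather(*(run(b) for b in chunk(reviews)))
    return [label for batch in results for label in batch]
```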

From the classified reviews, the engine extracts primary sentiment drivers: the 3-5 most significant factors driving positive and negative sentiment. Each driver gets a priority score and a frequency score so you can see at a glance which issues matter most and which are mentioned most often. Priority is scored 1-10: on the negative side, a 1 is a minor inconvenience mentioned by a single user, a 5 means users are considering alternatives, and a 9 means users report actual harm or data loss. On the positive side, a 1 is generic positive sentiment without feature attribution, a 7 means users recognize this sets the app apart, and a 10 means users consider this sacred. Frequency uses a logarithmic scale reflecting what percentage of reviews mention the topic.
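
To show what a logarithmic frequency scale does in practice, here is a small sketch. The text says only that the scale is logarithmic; the 1-10 output range (chosen to match priority) and the exact formula are assumptions.

```python
import math


def frequency_score(mention_pct: float) -> float:
    """Map the share of reviews mentioning a topic (0-100%) onto 1-10.

    Assumed formula: log-scaled so that 0% maps to 1 and 100% maps to 10.
    """
    mention_pct = min(max(mention_pct, 0.0), 100.0)
    return 1.0 + 9.0 * math.log10(1.0 + mention_pct) / math.log10(101.0)
```

With a scale like this, the jump from 5% to 10% of reviews moves the score far more than the jump from 95% to 100%, so a driver gaining early traction is not drowned out by the most common topics.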

What you get: A sentiment distribution (positive/neutral/negative percentages), a confidence score (0-100) blending volume, recency, consistency, and rating-text alignment, and ranked positive and negative drivers with priority and frequency scores.

Topic Extraction and Flag Identification

The problem: Knowing that 30% of reviews are negative tells you there is a problem. Knowing that the top negative topic is “subscription pricing” with an impact score of 82 and representative quotes like “I would pay for premium but $15/month is absurd” tells you exactly what the problem is and gives you language for the fix.

How the engine handles it: Topic extraction identifies up to 5 positive and 5 negative topics, each with a descriptive summary, 2-3 representative review excerpts, an impact score (1-100), a confidence score, and a user impact percentage. Semantically similar topics (above 80% similarity) get merged, combining evidence and summing user impact.
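
The merge step can be sketched as a greedy pass over the extracted topics. The 80% threshold, the combined evidence, and the summed user impact come from the text. The `Topic` type is hypothetical, and `difflib.SequenceMatcher` is only a string-similarity stand-in for whatever semantic similarity measure the engine uses.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Topic:  # hypothetical minimal topic record
    name: str
    user_impact_pct: float
    quotes: list = field(default_factory=list)


def similarity(a: str, b: str) -> float:
    # Stand-in: character-level similarity, not true semantic similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_topics(topics, threshold=0.80, sim=similarity):
    """Fold each topic into the first earlier topic it resembles."""
    merged = []
    for topic in topics:
        for existing in merged:
            if sim(existing.name, topic.name) > threshold:
                # Sum user impact and combine evidence without duplicates.
                existing.user_impact_pct += topic.user_impact_pct
                existing.quotes.extend(
                    q for q in topic.quotes if q not in existing.quotes
                )
                break
        else:
            merged.append(Topic(topic.name, topic.user_impact_pct, list(topic.quotes)))
    return merged
```

Passing a different `sim` function (for example, one based on embeddings) changes the similarity measure without touching the merge logic.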

Flag identification surfaces green flags and red flags in parallel. Green flags are consistent strengths (minimum 5% user impact and 60+ impact score). Red flags are critical issues: crashes, login failures, payment problems, security concerns. Red flag detection uses both AI analysis and keyword-based scanning for high-severity terms. Each flag follows the same scored, evidence-backed format as topics.
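
A rough sketch of the two deterministic pieces of flag identification follows. The green flag thresholds come from the text. The term list is an assumption derived from the red flag categories above, and the function names are illustrative; the AI-based half of red flag detection is not shown.

```python
import re

# Assumed high-severity terms, derived from the categories named in the text.
HIGH_SEVERITY_TERMS = ("crash", "login", "payment", "security")


def is_green_flag(user_impact_pct: float, impact_score: float) -> bool:
    """Green flags need at least 5% user impact and an impact score of 60+."""
    return user_impact_pct >= 5 and impact_score >= 60


def scan_red_flag_terms(reviews, terms=HIGH_SEVERITY_TERMS):
    """Count how many reviews mention each high-severity term."""
    # \w* lets "crash" also match "crashes" and "crashing".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\w*", re.IGNORECASE
    )
    hits = {}
    for text in reviews:
        # A set ensures each review counts once per term.
        for term in {m.group(1).lower() for m in pattern.finditer(text)}:
            hits[term] = hits.get(term, 0) + 1
    return hits
```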

What you get: Positive and negative topic lists with scores and evidence. Green and red flag lists with the same structure. A topic consistency score (0-100) and a review recency score.

Feature Sentiment Mapping

The problem: “Users like the app” is too vague to act on. You need to know which specific features drive satisfaction and which drive complaints, so you can highlight the right things in screenshots and address the right things in your roadmap.

How the engine handles it: The engine maps sentiment to up to 10 specific app features, each with a sentiment classification (positive, negative, or mixed), a mention count, a sentiment score, and an app screen reference when reviews indicate a specific UI context. The mapping uses differentiated thresholds: positive features need only 2 mentions to qualify (positive feedback scatters), while negative features need 3 (complaints concentrate). The classification is strict about “mixed” status. If reviews praise lesson content but criticize lesson pacing, those become two separate features with distinct classifications rather than one ambiguous entry.
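
The differentiated thresholds amount to a small filter, sketched below. The 2-mention and 3-mention minimums and the cap of 10 features come from the text. The threshold for mixed features and the ordering by mention count are assumptions, as is the dictionary shape of each candidate.

```python
# Minimum mentions per classification. The "mixed" value is an assumption;
# the text specifies only the positive and negative thresholds.
MIN_MENTIONS = {"positive": 2, "negative": 3, "mixed": 3}


def qualifying_features(candidates, limit=10):
    """Keep features that meet their threshold, most-mentioned first."""
    kept = [
        f for f in candidates
        if f["mentions"] >= MIN_MENTIONS[f["sentiment"]]
    ]
    kept.sort(key=lambda f: f["mentions"], reverse=True)
    return kept[:limit]
```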

What you get: A feature sentiment map with up to 10 features, each scored and classified. For a language learning app, the map might show: Lesson content (positive, 47 mentions), Speech recognition (negative, 31 mentions), Progress tracking (mixed, 18 mentions). The marketing team knows to highlight lesson content in screenshots. The product team knows speech recognition is the priority fix.

Temporal Trend Analysis

The problem: A healthy sentiment snapshot can mask a deteriorating trajectory. If your sentiment shifted from 75% positive to 60% positive over the past month, the current aggregate still looks acceptable, but the trend matters more than the snapshot.

How the engine handles it: The engine computes temporal analysis using a configurable window (7-day or 30-day), comparing the current period’s sentiment distribution against the prior equivalent period. It computes a sentiment change percentage and classifies the trend as improving, stable, or declining, with confidence levels (high, medium, low) based on review counts in each comparison period. When data is insufficient, the engine reports “insufficient_data” rather than producing a noisy result.
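
The period comparison can be sketched as below. The trend labels, confidence levels, and the "insufficient_data" result come from the text. Every numeric threshold here (the minimum review count, the 5-point stable band, the confidence cutoffs) and the choice to measure change as the difference in positive share are assumptions.

```python
def classify_trend(current_positive_pct, current_count,
                   prior_positive_pct, prior_count,
                   min_reviews=10, stable_band=5.0):
    """Compare the current window against the prior window of equal length."""
    smallest = min(current_count, prior_count)
    if smallest < min_reviews:
        # Too few reviews in one period to say anything reliable.
        return {"trend": "insufficient_data", "change": None, "confidence": None}

    # Assumed metric: change in positive share, in percentage points.
    change = current_positive_pct - prior_positive_pct
    if change > stable_band:
        trend = "improving"
    elif change < -stable_band:
        trend = "declining"
    else:
        trend = "stable"

    # Confidence follows the thinner of the two periods (assumed cutoffs).
    if smallest >= 100:
        confidence = "high"
    elif smallest >= 30:
        confidence = "medium"
    else:
        confidence = "low"
    return {"trend": trend, "change": change, "confidence": confidence}
```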

What you get: A trend direction, the change percentage, review counts for both comparison periods, and a confidence level. A declining trend with high confidence subtracts 10 points from the overall health score, and an improving trend adds 5. The trend direction is often more actionable than the snapshot score because it answers whether things are getting better or worse since the last cycle.

ASO Actionability Classification

The problem: Not every review complaint is something an ASO team can fix. “The app crashes on Android 14” is a product issue. “The description says it works offline but it doesn’t” is an ASO issue. “The subscription price is too high” has both ASO implications (how you frame value in the listing) and product implications (the pricing itself). Teams waste time arguing about ownership when each finding lacks a clear classification.

How the engine handles it: Every negative finding gets classified as aso_direct (addressable through keywords, descriptions, screenshots, developer responses, category selection), product_escalation (requires engineering or product changes), or hybrid (ASO can mitigate while product addresses the root cause). Each classification includes specific ASO actions and escalation briefs when product involvement is needed.
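
As a data structure, a classified finding might look like the sketch below. The three classification values come from the text. The `Finding` fields, the team names, and the routing function are hypothetical and exist only to show how the classification lets findings be routed without manual triage.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Actionability(str, Enum):
    ASO_DIRECT = "aso_direct"
    PRODUCT_ESCALATION = "product_escalation"
    HYBRID = "hybrid"


@dataclass
class Finding:  # hypothetical shape of a classified negative finding
    summary: str
    actionability: Actionability
    aso_actions: list = field(default_factory=list)
    escalation_brief: Optional[str] = None


def route(finding: Finding) -> list:
    """Return the teams (assumed names) that should receive the finding."""
    if finding.actionability is Actionability.ASO_DIRECT:
        return ["aso"]
    if finding.actionability is Actionability.PRODUCT_ESCALATION:
        return ["product"]
    return ["aso", "product"]  # hybrid: ASO mitigates, product fixes the cause
```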

What you get: Every negative topic, red flag, risk, and priority action tagged with its actionability classification and concrete next steps. Growth teams can route findings to the right people without a triage meeting. Your ASO specialist gets copy angles with evidence. Your product manager gets bug reports with user impact data and severity.

Overall Health Score and Executive Summary

The problem: All the individual findings need to roll up into something that can be read quickly and used to prioritize work.

How the engine handles it: The engine computes a Sentiment Health Score (1-100) by blending the star rating graded score (50% weight) with a sentiment score derived from the distribution (50% weight), then applying adjustments for high-impact red flags (-5 each), green flags (+3 each), and temporal trends. The score maps to a health grade: excellent (85+), good (70-84), average (50-69), poor (30-49), or critical (below 30).
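
The blend and adjustments can be worked through in a short sketch. The 50/50 weights, the per-flag adjustments, the trend adjustments, and the grade boundaries come from this page. How the sentiment score is derived from the distribution is not specified, so the formula below (full credit for positive reviews, half credit for neutral) is an assumption, as is clamping the result to 1-100.

```python
def sentiment_score(positive_pct: float, neutral_pct: float) -> float:
    # Assumed derivation: positive counts fully, neutral counts half.
    return positive_pct + 0.5 * neutral_pct


def health_score(graded, positive_pct, neutral_pct,
                 high_impact_red_flags=0, green_flags=0,
                 trend="stable", trend_confidence="low"):
    """Blend rating and sentiment 50/50, then apply flag and trend adjustments."""
    score = 0.5 * graded + 0.5 * sentiment_score(positive_pct, neutral_pct)
    score -= 5 * high_impact_red_flags
    score += 3 * green_flags
    if trend == "declining" and trend_confidence == "high":
        score -= 10
    elif trend == "improving":
        score += 5
    return max(1.0, min(100.0, score))  # assumed clamp to the 1-100 range


def health_grade(score: float) -> str:
    if score >= 85:
        return "excellent"
    if score >= 70:
        return "good"
    if score >= 50:
        return "average"
    if score >= 30:
        return "poor"
    return "critical"
```

For example, under these assumptions a graded score of 80 with a 70/10/20 positive/neutral/negative split gives a base of 77.5. One high-impact red flag, two green flags, and a high-confidence declining trend bring that to 68.5, which falls in the average band.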

From this score and all upstream data, the engine generates an AI-written executive summary (2-3 sentences), up to 3 primary risks with severity scores and recommended actions, up to 3 primary strengths with marketing angle recommendations, and up to 5 priority actions ordered by impact. A trust risk score combines the graded score (35%), sentiment health (40%), and review volume (25%) into a composite risk indicator. The competitive benchmarking module compares the app’s sentiment profile against category benchmarks, so a health score of 72 comes with context about where that stands relative to competitors in the same category.
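
The trust risk composite is a plain weighted average, sketched here. The three weights come from the text. The function name is illustrative, and the sketch assumes all three inputs are on a 0-100 scale; the text does not say whether a higher value indicates more or less risk, so the sketch makes no claim about direction.

```python
# Weights from the text, expressed as integer percentages so the arithmetic
# stays exact for whole-number inputs.
TRUST_WEIGHTS = {"graded": 35, "sentiment_health": 40, "volume": 25}


def trust_risk_score(graded: float, sentiment_health: float, volume: float) -> float:
    """Weighted composite of graded score, sentiment health, and review volume."""
    total = (
        TRUST_WEIGHTS["graded"] * graded
        + TRUST_WEIGHTS["sentiment_health"] * sentiment_health
        + TRUST_WEIGHTS["volume"] * volume
    )
    return total / 100
```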

What you get: A single health score and grade. An executive summary ready for stakeholder reporting. Prioritized risks and strengths with action recommendations. A trust risk score. All scored, all evidence-backed.

Engine Integrations: Editor and Studio Insights

The problem: Sentiment findings are most valuable when they feed directly into what you write and what you show. Key phrases from positive reviews should inform description copy. Feature-level sentiment should influence which screenshots you prioritize.

How the engine handles it: Editor insights extract key phrases and marketing angles from positive findings for store text updates. Studio insights recommend visual elements and specific app screens to showcase in screenshots based on feature sentiment. Audit alerts scan red flags for critical keywords (crash, login, payment, privacy, performance) and generate high-priority alerts.

What you get: Marketing copy angles pulled from user language. Screenshot recommendations based on praised features. Automated alerts when reviews surface issues needing immediate attention. Actionability findings include developer response recommendations where relevant. The engine flags which reviews warrant a response and what to address, though response drafting is handled outside the engine.

Understanding Your Report — how engine scores roll up into your audit
Audit Engines Overview — how the engines fit together
Intent Engine — uses review-extracted intents from this engine
Store Text Engine — editor insights feed into store text recommendations
Screenshot Engine — studio insights inform screenshot recommendations