User-Centric Measurement for Meta Integrity
Identified a misaligned north star metric, built the business case for change, and designed an end-to-end measurement programme — shifting an entire team's product strategy.
The Team's North Star — and What Was Missing
The Meta Integrity Experiences team designs the front-end UIs for content enforcement, reporting, and support; any time a user encounters harmful content or has content removed, they interact with one of these interfaces.
Their north star metric was Positive Support Outcomes (PSOs) — a count of reports, unfollows, blocks, and hides. Roadmap prioritisation therefore focused on discoverability: more entrypoints, less drop-off, more users reaching these tools.
Through ongoing qualitative research, I consistently heard users say: "It feels like Meta doesn't care about harmful content." Report action rate was below 1% — meaning over 99% of reports resulted in no visible action from the user's perspective.
If the team's goal was increasing PSOs while those tools were leaving people deeply disappointed, we were likely optimising for the wrong thing entirely.
Connecting User Dissatisfaction to Business Goals
User dissatisfaction alone wasn't enough to convince the PM to change the metric. I needed to translate the qualitative signal into a business impact that leadership could act on.
My hypothesis: people who feel that "Meta doesn't care" will be less likely to use Meta. Engagement — not PSOs — was what the business ultimately cared about.
I partnered with Data Science to pull engagement data (VPVs, likes, shares, comments) for each day in the two weeks before and after a user received a "report ignored" message — across a random sample of users with L1 global engagement.
The initial PM response to the qualitative findings was essentially: "users are unhappy, but we don't know if that affects the business." The engagement analysis was designed to directly address that objection.
We designed the analysis to make the causal story as clean as possible given observational-data constraints: a random sample, a single well-defined event (the "report ignored" message), and symmetric two-week windows on either side of it.
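A minimal sketch of that pre/post comparison, assuming hypothetical `events` and `engagement` frames (one "report ignored" timestamp per user; one row per user per day with a combined engagement count). The real analysis ran on internal tooling.

```python
import pandas as pd

def pre_post_engagement(events: pd.DataFrame, engagement: pd.DataFrame) -> pd.DataFrame:
    """Compare mean daily engagement in the 14 days before vs after the
    "report ignored" message. Assumes one event per user.
    events: user_id, event_ts. engagement: user_id, date, actions
    (VPVs + likes + shares + comments combined)."""
    df = engagement.merge(events[["user_id", "event_ts"]], on="user_id")
    offset = (df["date"] - df["event_ts"]).dt.days

    df["window"] = pd.NA
    df.loc[offset.between(-14, -1), "window"] = "pre"   # two weeks before the message
    df.loc[offset.between(1, 14), "window"] = "post"    # two weeks after, excluding the event day

    per_user = (df.dropna(subset=["window"])
                  .groupby(["user_id", "window"])["actions"].mean()
                  .unstack())
    per_user["pct_change"] = (per_user["post"] - per_user["pre"]) / per_user["pre"] * 100
    return per_user
```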
Presenting a 0.7% engagement drop across 18M daily reporters reframed the conversation: this wasn't a soft user-satisfaction problem; it was a quantifiable drag on platform engagement at scale. That shift in framing was what opened the door to the PM agreeing to invest in a proper measurement programme.
I was explicit about the limitations of the analysis — it was correlational, not causal — but argued that the prior qualitative evidence gave us reasonable confidence that the relationship was directional. The combination of qual signal + quant scale was sufficient to justify the investment.
Developing a Sentiment Measure
I set out to develop a sentiment metric that captured the user experience of the integrity flow, and use it to identify a product proxy the team could goal on — all within a 3-month window ahead of H2 goal setting.
- Item generation: Analysed existing qualitative transcripts using text coding to identify 18 candidate UX driver items, sanity-checked them with the product team, then operationalised each as a 5-point unipolar scale.
- Sampling and factor analysis: Targeted the global reporter population with sub-groups of ignored reporters and actioned reporters, then ran exploratory factor analysis to identify the underlying factor structure.
- Cognitive testing: Tested the instrument across 5 markets to ensure translation fidelity and scale usability, making two key changes before piloting: word choice and scale format.
- In-product pilot: Ran an in-product pilot to validate that the survey actually detected meaningful changes, designed to eliminate alternative explanations for null results.
| Decision | Approach & Rationale |
|---|---|
| Item generation | Used text coding on existing qualitative transcripts to identify 18 candidate items that could drive the user experience. Reviewed with product team for face validity before operationalising. |
| Scale format | 5-point unipolar scales for consistency across all 18 items. Randomised question order and flipped response scales on some items to reduce order effects and acquiescence bias. |
| Target population | Global reporter population. Sub-groups: users who received a "report ignored" message, and users who received a "report actioned" message — to allow comparison across outcome types. |
| Sample size | Assumed high communalities given qual grounding, settled on n=350 per item per group. Sample size was adequate to detect large factor loadings reliably. |
| Data quality checks | Straightlining detection; speed-run check (identified the mean completion time and eliminated completions faster than ¼ of the mean); response distribution checks for outliers, skewness, and kurtosis; checks for MAR vs MNAR missingness patterns. Note: Meta does not allow attention checks on-platform. |
| Factor analysis approach | Worked with DS on EFA. Scree plot and parallel analysis both suggested 3 or 4 factors. On examination of item content, 4-factor solution aligned most closely with prior qual findings and was more interpretable. |
| The Satisfaction factor | One factor (Satisfaction) showed near-perfect correlation with report outcome — making it unsuitable as a metric since the team has no control over policy decisions. Excluded from the final measure. Remaining 3 factors retained. |
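A minimal sketch of the straightlining and speed-run checks above, assuming a hypothetical `responses` frame with one column per item plus a completion-time column; the distribution and missingness checks would be layered on top in the same way.

```python
import pandas as pd

# 18 candidate items (hypothetical column names)
ITEM_COLS = [f"item_{i:02d}" for i in range(1, 19)]

def flag_low_quality(responses: pd.DataFrame) -> pd.DataFrame:
    """Flag straightliners and speed-runners before factor analysis."""
    df = responses.copy()
    # Straightlining: the same response chosen on every one of the 18 items.
    df["straightline"] = df[ITEM_COLS].nunique(axis=1) == 1
    # Speed-run: completion time under a quarter of the mean completion time.
    cutoff = df["completion_secs"].mean() / 4
    df["speed_run"] = df["completion_secs"] < cutoff
    return df

clean = flag_low_quality(responses)
clean = clean[~(clean["straightline"] | clean["speed_run"])]
```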
Factor Structure: Three Dimensions of "Supportiveness"
EFA identified four factors. After excluding Satisfaction, the remaining three (Transparency, Voice, and Support) became the core sentiment measure.
The Satisfaction factor included items like "problem resolved", "happiness with outcome", and "agreement with policy", and it was almost perfectly correlated with whether the report was actioned. Since the team has no control over policy enforcement decisions, it was excluded from the metric: the team cannot influence whether reports result in takedowns, but it can influence how supported users feel throughout the process.
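As an illustration of the factor-count decision described in the table, here is a minimal sketch assuming item responses in a hypothetical `items` DataFrame; the parallel analysis is written out by hand, since it simply compares observed eigenvalues with those of same-shaped random data.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

def parallel_analysis(items: pd.DataFrame, n_iter: int = 100, seed: int = 0) -> int:
    """Horn's parallel analysis: keep factors whose observed eigenvalues
    exceed the mean eigenvalues of same-shaped random-normal data."""
    rng = np.random.default_rng(seed)
    n, p = items.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(items.values, rowvar=False)))[::-1]
    simulated = np.empty((n_iter, p))
    for i in range(n_iter):
        noise = rng.normal(size=(n, p))
        simulated[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    return int((observed > simulated.mean(axis=0)).sum())

suggested = parallel_analysis(items)                     # pointed to 3-4 factors in our case
efa = FactorAnalyzer(n_factors=4, rotation="oblimin")    # 4-factor solution chosen for interpretability
efa.fit(items.values)
loadings = pd.DataFrame(efa.loadings_, index=items.columns)
```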
Cognitive Testing Across 5 Markets
Before piloting the instrument in-product, I ran cognitive testing to ensure the questions were translating correctly and that users understood the scale format as intended.
Market selection: I prioritised markets with the highest report volumes where I had the least internal team understanding of language nuances. Spanish was a larger language by speaker volume, but I had several native Spanish speakers on the team who could provide informal review — so I prioritised markets where that internal check wasn't available.
Change 1 — "Supported" → "Supportive": The word "Supported" consistently translated poorly in all 5 markets. Participants interpreted it as "holding something up physically" rather than emotional care or institutional support. I tested "Supportive" as an alternative and it resonated consistently across all markets. This was a non-obvious fix; missing it would have significantly degraded data quality.
Change 2 — 5-point → 4-point scale: Participants struggled to meaningfully differentiate all 5 scale points. I observed central tendency bias (clustering around the midpoint) and satisficing behaviour (selecting the middle option to move through the survey quickly). Moving to a 4-point scale removed the midpoint, forcing a directional response and improving discrimination. The slight loss of granularity was worth the gain in data quality.
Participants were anchored on a recent reporting experience at the start of the cognitive interview, which grounded responses in specific memory rather than general attitude — important for reducing abstract generalisation bias.
In-Product Pilot Experiment
Before recommending the team adopt the sentiment measure as their north star, I needed to demonstrate that the survey could actually detect meaningful product changes. I designed a pilot experiment to do this.
There are three reasons a sentiment measure might show no change from a product experiment: (1) sample too small, (2) effect size too small, (3) the survey doesn't work. My strategy was to eliminate reasons 1 and 2, so that a null result could only be explained by reason 3.
Sample size: Ensured the experiment was powered to detect a 0.1% difference in means — deliberately setting a very sensitive threshold to eliminate "sample too small" as an explanation for a null result.
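A minimal sketch of that power calculation, using purely illustrative baseline numbers (the real means and variances came from internal data):

```python
from statsmodels.stats.power import TTestIndPower

# Illustrative assumptions, not the real figures.
baseline_mean, baseline_sd = 3.0, 0.9   # hypothetical sentiment mean / SD
mde = 0.001 * baseline_mean             # minimum detectable effect: 0.1% of the mean
effect_size = mde / baseline_sd         # convert to Cohen's d

n_per_arm = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"~{n_per_arm:,.0f} completed surveys per arm")
```

With these illustrative figures the required sample runs to roughly a million completed surveys per arm, which is why such a sensitive threshold is only realistic at this volume of reporters.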
Feature selection: I worked with the product team to identify a "no regrets" change — a meaningful product improvement that we were confident would be shipped regardless of outcome. The experiment needed a feature large enough to actually move user experience, but one where shipping it didn't depend on the research result. This eliminated the risk of a null result being confounded by an underwhelming product change.
The shipped feature was a three-screen explainer sequence that contextualised the reporting outcome for users — significantly more information than the existing single-screen outcome message.
Selection bias control: This is where the design required careful thought. The three-screen explainer made the flow longer in the test condition. If I registered survey respondents at the end of the flow, the test and control populations would have different completion profiles — users who stuck through the longer flow might be more engaged or more determined reporters, creating an imbalanced comparison. I solved this by registering respondents at the start of the flow, before the experimental treatment was applied, ensuring balanced populations.
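A toy sketch of that registration-order decision, with hypothetical function names standing in for the real instrumentation; the coin flip stands in for the real experiment assignment.

```python
import random

# Hypothetical stand-ins for the real instrumentation calls.
def register_survey_respondent(user_id: str) -> None:
    print(f"{user_id} entered the survey sampling frame")

def show_outcome_flow(user_id: str, variant: str) -> None:
    print(f"{user_id} sees the {variant} flow")

def reporting_outcome_flow(user_id: str) -> None:
    # Register BEFORE the experimental branch: both arms draw survey
    # respondents from the same point in the funnel, so differential
    # drop-off in the longer three-screen flow cannot skew the comparison.
    register_survey_respondent(user_id)
    variant = "three_screen_explainer" if random.random() < 0.5 else "single_screen"
    show_outcome_flow(user_id, variant)
```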
The Survey Worked
The methodological win: The survey measured what it was supposed to measure. Our sentiment factors (Transparency, Voice, Support) captured meaningful variation in user experience that PSOs completely missed. An R² of .68 with p=.001 on a stepwise regression was strong validation that the measure was working. This gave the team confidence to ship the survey directly into the product.
Notably, no proxy in the available product data could predict the sentiment result, so the planned proxy stage was postponed indefinitely. The team would use the in-product sentiment measure directly as their primary signal.
How a New Metric Changed Product Strategy
The shift from PSOs to Supportiveness changed not just the number on a dashboard — it changed what the team built. With PSOs as the goal, discoverability of existing tools was the primary lever. With Supportiveness, the question became: does this make people feel genuinely supported?
- New North Star — "Supportiveness": The team shipped the in-product version of the sentiment measure and updated their formal goal from driving PSOs to driving Supportiveness. This shift in metric changed how the entire backlog was evaluated and prioritised.
- Warning Screens preserved: The team was considering removing Warning Screens in favour of content demotion, a change that would have looked neutral on PSOs. Analysis of Supportiveness showed Warning Screens meaningfully improved the user experience while reducing engagement as effectively as demotion would. Warning Screens stayed on the platform.
- Pizza Tracker shipped: A long-requested feature showing report progress and next steps, consistently de-prioritised because it clearly wouldn't improve PSOs. The Supportiveness measure gave the team a way to justify the launch quantitatively: it detected significant Supportiveness improvements and also showed mitigation of the post-outcome engagement drop.
Responsibilities
Data Science: Two significant DS partnerships. First, on the engagement impact analysis — iterating on the event definition, sample construction, and analytical approach to make the observational data as clean as possible. Second, on the EFA itself — running the factor analysis jointly, debating the 3- vs 4-factor solution, and ensuring the statistical approach was defensible to a DS-sceptical PM audience.
PM — from sceptic to sponsor: The PM's initial position was that user satisfaction was a "soft" metric that couldn't be tied to business outcomes. The engagement analysis changed this. Once the 0.7% drop was on the table, the PM became an active co-investor in the measurement work — helping identify the "no regrets" pilot feature and providing air cover for the timeline.
Engineering: Worked closely with ENG to design the in-product survey implementation, specifically the decision about where in the flow to register respondents. This meant instrumenting the product differently than they would have by default, which required me to explain the selection bias risk in terms they could act on.
Influencing the metric change: The final step — getting the team to formally adopt Supportiveness as their H2 goal — required presenting the pilot results to PM, ENG, and DS leadership simultaneously. I framed the presentation around the methodological win (the survey works, PSOs don't explain user experience) rather than leading with the business case, which had already been established. The leadership presentation was structured as: here's what we set out to prove, here's the evidence that we proved it, here's what changes.