October
Transparency Testing · Recruiting AI · GPT-5.2 · 17 Feb 2026

Flip Testing — Hiring Tool

Bias evaluation results for October People's AI-powered candidate screening. 400 pairwise tests across gender, ethnicity, and cross-intersectional dimensions.

01 Result · Grade A
Bias Grade: A

No practically meaningful bias detected

Across 400 pairwise tests spanning gender, ethnicity, and cross-intersectional categories, category-level mean score differences stayed at or below 0.2 points on a 100-point scale. The AI hiring tool assigned the same recommendation label to both candidates in every pair, regardless of candidate demographic indicators.

400 pairs tested · 8 job types · 800 API calls · 100% consistent labels

02 What this test evaluates

Identical profiles, flipped names.

If the AI is fair, scores and recommendations stay consistent when only demographic indicators change.

Flip testing is a fairness evaluation method where identical candidate profiles are submitted to the AI with only demographic indicators changed — specifically names that signal gender and/or ethnicity. If the AI is fair, the scores and recommendations should remain consistent regardless of these demographic signals.
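The pairing idea described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the `CandidateProfile` type, its fields, and the example names are hypothetical.

```python
from dataclasses import dataclass, replace, asdict

@dataclass(frozen=True)
class CandidateProfile:
    # Hypothetical profile shape; the real screening input is not documented here.
    name: str
    job_type: str
    resume_text: str

def make_flip_pair(base, flipped_name):
    """Return (original, flipped): two profiles that differ only in the name."""
    return base, replace(base, name=flipped_name)

def differs_only_in_name(a, b):
    """Check that every field except `name` is identical across the pair."""
    fields_a, fields_b = asdict(a), asdict(b)
    fields_a.pop("name")
    fields_b.pop("name")
    return fields_a == fields_b and a.name != b.name

# Illustrative gender pair: same surname, different gendered first name.
base = CandidateProfile(
    name="Emily Carter",
    job_type="Engineering",
    resume_text="Five years of backend development experience.",
)
original, flipped = make_flip_pair(base, "James Carter")
print(differs_only_in_name(original, flipped))  # True
```

Each profile in a pair is scored separately, which is consistent with the report's 800 API calls for 400 pairs.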

27 gender pairs: across 6 ethnic groups (same surname, different gendered first name)
20 ethnicity pairs: same gender, different ethnic names
12 cross-intersectional pairs: different gender AND different ethnicity

Tested across all job types

Engineering · Marketing · HR · Finance · Sales · Design · Operations · Data Science

Test parameters

Model: GPT-5.2 · Temperature: 0.7 · 800 API calls · Duration: ~116 minutes · Generated 17 February 2026

03 Score delta summary · No bias detected

A delta near zero.

The mean score delta is the average difference in AI-assigned scores between paired candidates. In every category its magnitude is 0.2 points or less.

Category              Pairs  Mean Δ  Std Dev  Max |Δ|  p-value
Gender                216    -0.1    0.5      3        0.01 (***)
Ethnicity             160    +0.1    0.6      2        0.10 (*)
Cross-Intersectional  24     -0.2    0.6      2        0.10 (*)

Significance: ns = not significant · * = p < 0.10 · ** = p < 0.05 · *** = p < 0.01 · scores on a 0–100 scale
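The summary columns above (mean Δ, standard deviation, max |Δ|) can be computed from per-pair deltas roughly as follows. This is a sketch under assumptions: the deltas are made-up values, and the report does not say whether it uses the sample or population standard deviation, so the sample version is assumed here.

```python
from statistics import mean, stdev

def summarize_deltas(deltas):
    """Summarise per-pair score deltas (score of candidate A minus candidate B)."""
    return {
        "mean": mean(deltas),
        "std_dev": stdev(deltas),  # sample standard deviation (an assumption)
        "max_abs": max(abs(d) for d in deltas),
    }

# Hypothetical integer deltas for eight pairs; not taken from the report.
deltas = [0, -1, 0, 1, 0, -1, 0, 0]
summary = summarize_deltas(deltas)
print(summary["mean"])     # -0.125
print(summary["max_abs"])  # 1
```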

[Chart: mean score delta by category on a ±50-point axis. Gender -0.1 pts, Ethnicity +0.1 pts, Cross-Intersectional -0.2 pts.]

The bars are barely visible because the actual differences (0.1–0.2 points) are negligible on the 100-point scale.

04 Per-job-type breakdown · All within threshold

Consistent across every role.

Each job type was tested with 50 candidate pairs. No role exceeds the 0.5-point mean-delta threshold.

Job type      Pairs  Mean Δ  Std Dev  Max |Δ|
Engineering   50     0.0     0.9      2
Marketing     50     0.0     0.6      2
HR            50     -0.4    0.9      3
Finance       50     0.0     0.2      1
Sales         50     +0.1    0.3      1
Design        50     +0.1    0.3      2
Operations    50     0.0     0.3      1
Data Science  50     -0.1    0.5      2
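The threshold check stated for this section can be expressed in a few lines. The mean deltas below are the per-role values from the table; the function name is hypothetical, and treating the threshold as a strict "greater than" comparison on the absolute value is an assumption.

```python
THRESHOLD = 0.5  # mean-delta threshold stated in the report, in points

# Mean deltas per job type, as listed in the table above.
mean_delta_by_job = {
    "Engineering": 0.0,
    "Marketing": 0.0,
    "HR": -0.4,
    "Finance": 0.0,
    "Sales": 0.1,
    "Design": 0.1,
    "Operations": 0.0,
    "Data Science": -0.1,
}

def roles_over_threshold(mean_deltas, threshold=THRESHOLD):
    """Return job types whose absolute mean delta exceeds the threshold."""
    return [job for job, delta in mean_deltas.items() if abs(delta) > threshold]

flagged = roles_over_threshold(mean_delta_by_job)
largest = max(mean_delta_by_job, key=lambda job: abs(mean_delta_by_job[job]))
print(flagged)  # []
print(largest)  # HR
```

HR has the largest mean delta in magnitude (0.4 points), which is still under the 0.5-point threshold.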

[Chart: mean delta by job type on a ±50-point axis; values as in the table above.]
05 Demographic group scores · Equal across groups

A remarkably tight range.

Average scores fall between 92.0 and 92.3 — a spread of just 0.3 points on a 100-point scale.

[Chart: average score by group, 0–100 scale, female vs. male bars; values as in the table below.]
Demographic group        Avg score  Median  Samples
African American Female  92.1       92      56
African American Male    92.1       92      48
Anglo Female             92.3       92      152
Anglo Male               92.1       92      152
East Asian Female        92.2       92      56
East Asian Male          92.0       92      56
Eastern European Female  92.1       92      8
Eastern European Male    92.3       92      8
Hispanic Female          92.1       92      40
Hispanic Male            92.1       92      48
Middle Eastern Female    92.2       92      32
Middle Eastern Male      92.1       92      32
South Asian Female       92.2       92      56
South Asian Male         92.1       92      48
West African Male        92.0       92      8
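As a quick arithmetic check on the table above, the spread between the highest and lowest group averages and the total sample count can be recomputed from the listed values. The data structure is only an illustration; the numbers are copied from the table.

```python
# (average score, sample count) per group, as listed in the table above.
groups = {
    "African American Female": (92.1, 56),
    "African American Male": (92.1, 48),
    "Anglo Female": (92.3, 152),
    "Anglo Male": (92.1, 152),
    "East Asian Female": (92.2, 56),
    "East Asian Male": (92.0, 56),
    "Eastern European Female": (92.1, 8),
    "Eastern European Male": (92.3, 8),
    "Hispanic Female": (92.1, 40),
    "Hispanic Male": (92.1, 48),
    "Middle Eastern Female": (92.2, 32),
    "Middle Eastern Male": (92.1, 32),
    "South Asian Female": (92.2, 56),
    "South Asian Male": (92.1, 48),
    "West African Male": (92.0, 8),
}

averages = [avg for avg, _ in groups.values()]
spread = round(max(averages) - min(averages), 1)
total_samples = sum(n for _, n in groups.values())

print(spread)         # 0.3
print(total_samples)  # 800
```

The sample counts sum to 800, which matches the number of API calls reported for the test run.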
06 Recommendation consistency · 100% consistent

The label never changed.

Beyond scores, we measure whether the final recommendation label changes between paired candidates. A change would mean demographics affect the outcome.

100% consistent
Same label: 400/400 (same recommendation for both candidates)
Label changes: 0/400 (demographic-driven outcome changes)

In every single test pair, the AI assigned the same recommendation label to both candidates. Changing only the candidate’s name — to signal a different gender or ethnicity — had zero effect on the hiring recommendation.
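The label-consistency measure described above amounts to counting pairs whose two labels match. A minimal sketch, assuming each pair is stored as a tuple of two label strings; the label names used here are hypothetical, since the report does not list the tool's actual labels.

```python
def label_consistency(label_pairs):
    """Return (same_count, total, rate) for a list of (label_a, label_b) tuples."""
    same = sum(1 for label_a, label_b in label_pairs if label_a == label_b)
    total = len(label_pairs)
    return same, total, same / total

# Hypothetical data shaped like the reported outcome: 400 pairs, no label changes.
label_pairs = [("advance", "advance")] * 400
same, total, rate = label_consistency(label_pairs)
print(f"{same}/{total} ({rate:.0%})")  # 400/400 (100%)
```

A single pair such as `("advance", "reject")` would lower the count to 399/400 and would mark a case where the name change alone altered the outcome.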

07 Effect sizes · Cohen's d · No practical impact

Amplified by low variance.

Cohen's d standardises the difference between two group means. Below 0.2 is negligible, 0.2–0.5 small, 0.5–0.8 medium, above 0.8 large.

Interpreting these values

Our scoring uses a 0–100 integer scale. Cohen’s d divides the mean difference by the standard deviation, so because the AI scores are highly consistent (standard deviation of just 0.5–0.6 points), even trivial absolute differences of 0.1–0.2 points produce non-negligible d values. A d of 0.354 represents a real-world difference of just 0.2 points on a 100-point scale: a fraction of a single point with zero practical impact on any candidate outcome. The 100% recommendation consistency (400/400 identical labels) confirms this.

Category              Cohen's d  Convention  Actual difference  Practical impact
Gender                0.219      Small       0.1 pts / 100      None
Ethnicity             0.137      Negligible  0.1 pts / 100      None
Cross-Intersectional  0.354      Small       0.2 pts / 100      None
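The amplification effect can be illustrated numerically. This sketch assumes d is the absolute mean delta divided by the standard deviation of the deltas; the report does not state which variant of Cohen's d it uses, and its published values were presumably computed from unrounded inputs, so the rounded table figures only approximate them.

```python
def cohens_d(mean_delta, std_dev):
    """Standardised effect size: absolute mean difference over standard deviation."""
    return abs(mean_delta) / std_dev

def convention(d):
    """Label a d value using the conventional thresholds quoted in this section."""
    if d < 0.2:
        return "Negligible"
    if d < 0.5:
        return "Small"
    if d < 0.8:
        return "Medium"
    return "Large"

# Rounded cross-intersectional figures: mean delta -0.2, std dev 0.6.
print(round(cohens_d(-0.2, 0.6), 2))  # 0.33

# The same 0.2-point gap with a hypothetical std dev of 5 points.
print(round(cohens_d(-0.2, 5.0), 2))  # 0.04

# Conventions for the d values reported in the table.
print([convention(d) for d in (0.219, 0.137, 0.354)])
# ['Small', 'Negligible', 'Small']
```

The same 0.2-point gap moves from "Small" to "Negligible" purely because the denominator grows, which is the low-variance effect described in this section.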

[Chart: effect size scale with markers at 0.2 (small), 0.5 (medium), and 0.8 (large). Gender d = 0.219, Ethnicity d = 0.137, Cross-Intersectional d = 0.354.]

While conventional Cohen’s d thresholds classify some values as “small”, this is an artefact of the very low variance in scores — the AI is so consistent that any tiny fluctuation appears amplified in standardised terms. On the actual 100-point scale, the largest mean difference is 0.2 points, and no candidate received a different recommendation based on their demographic profile.

08 Our commitment to fair AI hiring

October Health is committed to ensuring that our AI-powered hiring tools evaluate candidates solely on their skills, experience, and qualifications — never on their gender, ethnicity, age, or any other protected characteristic.

We run these flip tests regularly as part of our AI governance framework. Every model update, prompt change, or configuration adjustment triggers a fresh round of testing before deployment. This is not a one-time exercise — it is embedded in our continuous improvement process.

Our hiring AI outputs are always advisory. Final hiring decisions are made by humans, and our tools are designed to augment — not replace — human judgement in recruitment.