Flip Testing — Hiring Tool
Bias evaluation results for October People's AI-powered candidate screening. 400 pairwise tests across gender, ethnicity, and cross-intersectional dimensions.
No practically meaningful bias detected
Across 400 pairwise tests spanning gender, ethnicity, and cross-intersectional categories, the mean score difference between paired candidates stayed at or below 0.2 points on a 100-point scale in every category. The AI hiring tool assigned the same recommendation to both candidates in every pair, regardless of demographic indicators.
400
Test pairs
8
Job types
800
API calls
100%
Recommendation consistency
Identical profiles, flipped names.
If the AI is fair, scores and recommendations stay consistent when only demographic indicators change.
Flip testing is a fairness evaluation method where identical candidate profiles are submitted to the AI with only demographic indicators changed — specifically names that signal gender and/or ethnicity. If the AI is fair, the scores and recommendations should remain consistent regardless of these demographic signals.
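As a rough illustration of how a flip pair can be constructed, the sketch below copies a profile and changes only the name. The `CandidateProfile` fields, the helper names, and the example names are hypothetical and are not taken from the tool itself.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class CandidateProfile:
    # Hypothetical profile shape; the real tool's schema is not documented here.
    name: str
    job_type: str
    resume_text: str


def make_flip_pair(base, flipped_name):
    """Return (original, variant) where the variant differs only in its name."""
    return base, replace(base, name=flipped_name)


def differing_fields(a, b):
    """List the field names whose values differ between two profiles."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


base = CandidateProfile(
    name="James Miller",  # illustrative name only
    job_type="Engineering",
    resume_text="8 years of backend development; led a team of five.",
)
original, variant = make_flip_pair(base, "Emily Miller")
print(differing_fields(original, variant))
```

Checking `differing_fields` before scoring is a cheap guard that nothing other than the demographic signal changed between the two submissions.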
27
Gender pairs
Across 6 ethnic groups (same surname, different gendered first name)
20
Ethnicity pairs
Same gender, different ethnic names
3
Cross-intersectional pairs
Different gender AND different ethnicity
Tested across all job types
Model: GPT-5.2 · Temperature: 0.7 · 800 API calls · Duration: ~116 minutes · Generated 17 February 2026
A delta near zero.
The mean score delta is the average difference in AI-assigned scores between paired candidates. In every category the mean delta is negligible, at most 0.2 points in absolute value.
| Category | Pairs | Mean Δ | Std Dev | Max \|Δ\| | p-value |
|---|---|---|---|---|---|
| Gender | 216 | -0.1 | 0.5 | 3 | 0.01(***) |
| Ethnicity | 160 | +0.1 | 0.6 | 2 | 0.10(*) |
| Cross-Intersectional | 24 | -0.2 | 0.6 | 2 | 0.10(*) |
Mean score delta by category · ±50-point scale
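The summary columns in the table above can be reproduced from raw pair scores roughly as follows. This is a minimal sketch: the example scores are made up, and it assumes the standard deviation is the sample standard deviation of the per-pair deltas, which the report does not state.

```python
from statistics import mean, stdev


def summarize_deltas(score_pairs):
    """Summarize score differences for a list of (score_a, score_b) pairs."""
    deltas = [a - b for a, b in score_pairs]
    return {
        "pairs": len(deltas),
        "mean_delta": mean(deltas),
        "std_dev": stdev(deltas),  # sample standard deviation (assumption)
        "max_abs_delta": max(abs(d) for d in deltas),
    }


# Illustrative scores only, not data from the evaluation.
example_pairs = [(92, 92), (92, 93), (93, 92), (91, 92), (92, 92)]
summary = summarize_deltas(example_pairs)
print(summary)
```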
Consistent across every role.
Each job type was tested with 50 candidate pairs. No role exceeds the 0.5-point mean-delta threshold.
| Job type | Pairs | Mean Δ | Std Dev | Max \|Δ\| |
|---|---|---|---|---|
| Engineering | 50 | 0.0 | 0.9 | 2 |
| Marketing | 50 | 0.0 | 0.6 | 2 |
| HR | 50 | -0.4 | 0.9 | 3 |
| Finance | 50 | 0.0 | 0.2 | 1 |
| Sales | 50 | +0.1 | 0.3 | 1 |
| Design | 50 | +0.1 | 0.3 | 2 |
| Operations | 50 | 0.0 | 0.3 | 1 |
| Data Science | 50 | -0.1 | 0.5 | 2 |
Mean delta by job type · ±50-point scale
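The threshold check can be expressed in a few lines. The sketch below uses the mean deltas from the table above and assumes the 0.5-point threshold is applied to the absolute value of the mean delta; the function name is illustrative.

```python
MEAN_DELTA_THRESHOLD = 0.5  # points, as stated in the report

# Mean deltas per job type, copied from the table above.
mean_delta_by_job = {
    "Engineering": 0.0,
    "Marketing": 0.0,
    "HR": -0.4,
    "Finance": 0.0,
    "Sales": 0.1,
    "Design": 0.1,
    "Operations": 0.0,
    "Data Science": -0.1,
}


def roles_over_threshold(mean_deltas, threshold=MEAN_DELTA_THRESHOLD):
    """Return the roles whose absolute mean delta exceeds the threshold."""
    return [role for role, delta in mean_deltas.items() if abs(delta) > threshold]


flagged = roles_over_threshold(mean_delta_by_job)
closest = max(mean_delta_by_job, key=lambda role: abs(mean_delta_by_job[role]))
print(flagged, closest)
```

No role is flagged; HR, at -0.4, is the role closest to the threshold.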
A remarkably tight range.
Average scores fall between 92.0 and 92.3 — a spread of just 0.3 points on a 100-point scale.
| Demographic group | Avg score | Median | Samples |
|---|---|---|---|
| African American Female | 92.1 | 92 | 56 |
| African American Male | 92.1 | 92 | 48 |
| Anglo Female | 92.3 | 92 | 152 |
| Anglo Male | 92.1 | 92 | 152 |
| East Asian Female | 92.2 | 92 | 56 |
| East Asian Male | 92.0 | 92 | 56 |
| Eastern European Female | 92.1 | 92 | 8 |
| Eastern European Male | 92.3 | 92 | 8 |
| Hispanic Female | 92.1 | 92 | 40 |
| Hispanic Male | 92.1 | 92 | 48 |
| Middle Eastern Female | 92.2 | 92 | 32 |
| Middle Eastern Male | 92.1 | 92 | 32 |
| South Asian Female | 92.2 | 92 | 56 |
| South Asian Male | 92.1 | 92 | 48 |
| West African Male | 92.0 | 92 | 8 |
The label never changed.
Beyond scores, we measure whether the final recommendation label changes between paired candidates. A change would mean demographics affect the outcome.
400/400
same recommendation for both candidates
0/400
demographic-driven outcome changes
In every single test pair, the AI assigned the same recommendation label to both candidates. Changing only the candidate’s name — to signal a different gender or ethnicity — had zero effect on the hiring recommendation.
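A consistency count like the one above can be computed with a simple comparison of the two labels in each pair. In this sketch the label strings are hypothetical, since the report does not list the tool's actual recommendation labels.

```python
def label_consistency(label_pairs):
    """Return (same, total, rate) for a list of (label_a, label_b) pairs."""
    total = len(label_pairs)
    same = sum(1 for a, b in label_pairs if a == b)
    rate = same / total if total else 0.0
    return same, total, rate


# Hypothetical labels for illustration only.
example_labels = [
    ("Advance", "Advance"),
    ("Advance", "Advance"),
    ("Hold", "Hold"),
]
same, total, rate = label_consistency(example_labels)
print(f"{same}/{total} identical ({rate:.0%})")
```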
Amplified by low variance.
Cohen's d standardises the difference between two group means. Below 0.2 is negligible, 0.2–0.5 small, 0.5–0.8 medium, above 0.8 large.
Our scoring uses a 1–100 integer scale. Because the AI scores are highly consistent (standard deviation of just 0.5–0.6 points), even trivial absolute differences of 0.1–0.2 points produce non-negligible Cohen’s d values. A d of 0.354 represents a real-world difference of just 0.2 points on a 100-point scale — a fraction of a single point with zero practical impact on any candidate outcome. The 100% recommendation consistency (400/400 identical labels) confirms this.
| Category | Cohen's d | Convention | Actual difference | Practical impact |
|---|---|---|---|---|
| Gender | 0.219 | Small | 0.1 pts / 100 | None |
| Ethnicity | 0.137 | Negligible | 0.1 pts / 100 | None |
| Cross-Intersectional | 0.354 | Small | 0.2 pts / 100 | None |
Effect size scale
While conventional Cohen’s d thresholds classify some values as “small”, this is an artefact of the very low variance in scores — the AI is so consistent that any tiny fluctuation appears amplified in standardised terms. On the actual 100-point scale, the largest mean difference is 0.2 points, and no candidate received a different recommendation based on their demographic profile.
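The amplification effect can be seen with a short calculation. The sketch below assumes the simplest form of Cohen's d, the mean difference divided by a standard deviation; the report does not state which variant it used, and its published values come from unrounded data, so the numbers here only approximate them. The 10-point standard deviation is a hypothetical comparison case.

```python
def cohens_d(mean_difference, std_dev):
    """Standardized effect size: mean difference divided by standard deviation."""
    return mean_difference / std_dev


def classify(d):
    """Label an effect size using the conventional thresholds quoted above."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"  # treating exactly 0.8 as large is an assumption


# Same 0.2-point difference, two different spreads.
d_tight = cohens_d(0.2, 0.6)   # spread similar to the report's scores
d_wide = cohens_d(0.2, 10.0)   # hypothetical, much noisier scorer

print(round(d_tight, 3), classify(d_tight))
print(round(d_wide, 3), classify(d_wide))
```

The same 0.2-point gap is classified as small when scores vary by about 0.6 points and as negligible when they vary by 10, which is why the absolute difference is reported alongside d.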
October Health is committed to ensuring that our AI-powered hiring tools evaluate candidates solely on their skills, experience, and qualifications — never on their gender, ethnicity, age, or any other protected characteristic.
We run these flip tests regularly as part of our AI governance framework. Every model update, prompt change, or configuration adjustment triggers a fresh round of testing before deployment. This is not a one-time exercise — it is embedded in our continuous improvement process.
Our hiring AI outputs are always advisory. Final hiring decisions are made by humans, and our tools are designed to augment — not replace — human judgement in recruitment.

