Flip Testing — Hiring Tool
Bias evaluation results for October People's AI-powered candidate screening. 400 pairwise tests across gender, ethnicity, and cross-intersectional dimensions.
Overall Bias Grade
No Practically Meaningful Bias Detected
Across 400 pairwise tests spanning gender, ethnicity, and cross-intersectional categories, score variations remained within the normal statistical range. The AI hiring tool produced consistent recommendations regardless of candidate demographic indicators.
400 pairs tested · 8 job types · 800 API calls · 100% consistent labels
What This Test Evaluates
Flip testing is a fairness evaluation method where identical candidate profiles are submitted to the AI with only demographic indicators changed — specifically names that signal gender and/or ethnicity. If the AI is fair, the scores and recommendations should remain consistent regardless of these demographic signals.
- 27 gender pairs: across 6 ethnic groups (same surname, different gendered first name)
- 20 ethnicity pairs: same gender, different ethnic names
- 12 cross-intersectional pairs: different gender AND different ethnicity

Tested across all job types.
Test parameters
Model: GPT-5.2 · Temperature: 0.7 · 800 API calls · Duration: ~116 minutes · Generated 17 February 2026
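To make the method concrete, here is a minimal sketch of the flip-test loop in Python. Everything in it is illustrative: the names, the job types, and the `score_candidate` stub (which stands in for the real call to the screening model) are assumptions, not our production harness.

```python
# Minimal flip-test sketch. score_candidate is a stand-in for the real
# screening-model call; names and job types are illustrative only.
import itertools
import random

# Each template pairs two names that differ only in the demographic signal.
GENDER_PAIRS = [("James Walker", "Emily Walker")]      # same surname, flipped first name
ETHNICITY_PAIRS = [("Emily Walker", "Priya Sharma")]   # same gender, different ethnic name
JOB_TYPES = ["Engineering", "Marketing"]               # 8 job types in the full run

def score_candidate(name: str, job_type: str) -> int:
    """Stub: in production this submits the full CV to the model
    and parses a 0-100 score from the response."""
    return 92 + random.choice([-1, 0, 0, 1])           # placeholder scores

def run_flip_tests(pairs, job_types):
    results = []
    for (name_a, name_b), job in itertools.product(pairs, job_types):
        delta = score_candidate(name_a, job) - score_candidate(name_b, job)
        results.append({"job": job, "pair": (name_a, name_b), "delta": delta})
    return results

for row in run_flip_tests(GENDER_PAIRS + ETHNICITY_PAIRS, JOB_TYPES):
    print(row)
```

In the real run each pair template is expanded across the job types, giving the 400 pairs and 800 API calls reported above.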
Score Delta Summary
No bias detected. The mean score delta measures the average difference in AI-assigned scores between paired candidates. A delta near zero indicates no systematic preference. All categories show negligible mean deltas, well within acceptable bounds.
| Category | Pairs | Mean Delta | Std Dev | Max \|Delta\| | p-value |
|---|---|---|---|---|---|
| Gender | 216 | -0.1 | 0.5 | 3 | 0.01 (***) |
| Ethnicity | 160 | +0.1 | 0.6 | 2 | 0.10 (*) |
| Cross-Intersectional | 24 | -0.2 | 0.6 | 2 | 0.10 (*) |

Significance: ns = not significant, * = p < 0.10, ** = p < 0.05, *** = p < 0.01. Scores are on a 0–100 scale.
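As a sketch of how these statistics are computed: collect one delta per pair (score A minus score B), summarise, then run a one-sample t-test against zero. The deltas below are illustrative placeholders, not the raw data behind this report.

```python
# Per-pair score deltas (score_A - score_B); values here are illustrative.
import numpy as np
from scipy import stats

deltas = np.array([0, -1, 0, 0, 1, 0, -1, 0, 0, 0, 1, -1])

mean_delta = deltas.mean()        # systematic preference if far from 0
std_delta = deltas.std(ddof=1)    # sample standard deviation
max_abs = int(np.abs(deltas).max())

# One-sample t-test: is the mean delta distinguishable from zero?
t_stat, p_value = stats.ttest_1samp(deltas, popmean=0)

print(f"mean={mean_delta:+.2f}  sd={std_delta:.2f}  max|delta|={max_abs}  p={p_value:.3f}")
```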
Chart: Mean Score Delta by Category (±50 point axis). The bars are barely visible because the actual differences (0.1–0.2 points) are negligible on the 100-point scale.
Per-Job-Type Breakdown
All within threshold. Each job type was tested independently with 50 candidate pairs. Results show consistent performance across all roles, with no job type's mean delta exceeding our 0.5-point threshold.
| Job Type | Pairs | Mean Delta | Std Dev | Max \|Delta\| |
|---|---|---|---|---|
| Engineering | 50 | 0.0 | 0.9 | 2 |
| Marketing | 50 | 0.0 | 0.6 | 2 |
| HR | 50 | -0.4 | 0.9 | 3 |
| Finance | 50 | 0.0 | 0.2 | 1 |
| Sales | 50 | +0.1 | 0.3 | 1 |
| Design | 50 | +0.1 | 0.3 | 2 |
| Operations | 50 | 0.0 | 0.3 | 1 |
| Data Science | 50 | -0.1 | 0.5 | 2 |
Chart: Mean Delta by Job Type (±50 point axis).
Demographic Group Scores
Equal across groups. Average scores across all demographic groups fall within a remarkably tight range of 92.0 to 92.3 (a spread of just 0.3 points on a 100-point scale). This demonstrates that the AI evaluates candidates based on their qualifications, not their demographic background.
Average Score by Group (0–100 scale)
| Demographic Group | Avg Score | Median | Samples |
|---|---|---|---|
| African American Female | 92.1 | 92 | 56 |
| African American Male | 92.1 | 92 | 48 |
| Anglo Female | 92.3 | 92 | 152 |
| Anglo Male | 92.1 | 92 | 152 |
| East Asian Female | 92.2 | 92 | 56 |
| East Asian Male | 92.0 | 92 | 56 |
| Eastern European Female | 92.1 | 92 | 8 |
| Eastern European Male | 92.3 | 92 | 8 |
| Hispanic Female | 92.1 | 92 | 40 |
| Hispanic Male | 92.1 | 92 | 48 |
| Middle Eastern Female | 92.2 | 92 | 32 |
| Middle Eastern Male | 92.1 | 92 | 32 |
| South Asian Female | 92.2 | 92 | 56 |
| South Asian Male | 92.1 | 92 | 48 |
| West African Male | 92.0 | 92 | 8 |
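A sketch of how these group figures are derived: pool every score the model assigned, keyed by the demographic group the candidate's name signals, then aggregate. The records below are illustrative placeholders, not the underlying data.

```python
# Group-score aggregation sketch; the records are illustrative placeholders.
import pandas as pd

records = pd.DataFrame({
    "group": ["Anglo Female", "Anglo Female", "Hispanic Male", "Hispanic Male"],
    "score": [92, 93, 92, 92],
})

summary = (records.groupby("group")["score"]
           .agg(avg_score="mean", median="median", samples="count"))
print(summary.round(1))
```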
Recommendation Consistency
100% consistent. Beyond numerical scores, we measure whether the AI's final recommendation label (e.g. "Strongly Recommended", "Recommended", "Not Recommended") changes between paired candidates. A label change would indicate that demographic signals affect the hiring outcome.
400/400 pairs: same label for both candidates · 0/400 label changes
In every single test pair, the AI assigned the same recommendation label to both candidates. Changing only the candidate's name — to signal a different gender or ethnicity — had zero effect on the hiring recommendation.
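The check itself is a simple pairwise comparison; a sketch (with illustrative pair records) is below.

```python
# Label-consistency check: a pair counts as a flip if the two candidates
# receive different recommendation labels. Records are illustrative.
pairs = [
    {"label_a": "Strongly Recommended", "label_b": "Strongly Recommended"},
    {"label_a": "Recommended",          "label_b": "Recommended"},
]

flips = sum(p["label_a"] != p["label_b"] for p in pairs)
print(f"{len(pairs) - flips}/{len(pairs)} pairs with the same label, {flips} label changes")
```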
Effect Sizes (Cohen's d)
No practical impact. Cohen's d measures the standardised difference between two group means. Conventionally, values below 0.2 are negligible, 0.2–0.5 is small, 0.5–0.8 is medium, and above 0.8 is large.
Important context on interpreting these values
Our scoring uses a 0–100 integer scale. Because the AI scores are highly consistent (standard deviation of just 0.5–0.6 points), even trivial absolute differences of 0.1–0.2 points produce non-negligible Cohen's d values. A d of 0.354, for example, represents a real-world difference of just 0.2 points on a 100-point scale, a fraction of a single point that has zero practical impact on any candidate outcome. The 100% recommendation consistency (400/400 identical labels) confirms this.
| Category | Cohen's d | Convention | Actual Difference | Practical Impact |
|---|---|---|---|---|
| Gender | 0.219 | Small | 0.1 pts / 100 | None |
| Ethnicity | 0.137 | Negligible | 0.1 pts / 100 | None |
| Cross-Intersectional | 0.354 | Small | 0.2 pts / 100 | None |
Chart: Effect size scale (negligible < 0.2, small 0.2–0.5, medium 0.5–0.8, large > 0.8).
While conventional Cohen's d thresholds classify some values as "small", this is an artefact of the very low variance in scores — the AI is so consistent that any tiny fluctuation appears amplified in standardised terms. On the actual 100-point scale, the largest mean difference is 0.2 points, and no candidate received a different recommendation based on their demographic profile.
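To illustrate the amplification effect numerically, here is a sketch of Cohen's d for paired deltas, using synthetic values shaped like the cross-intersectional row (mean -0.2, sd 0.6). None of these numbers are the report's raw data.

```python
# Cohen's d for paired data: d = mean(delta) / sd(delta). A tiny sd
# inflates d even when the absolute difference is a fraction of a point.
import numpy as np

rng = np.random.default_rng(0)
deltas = rng.normal(loc=-0.2, scale=0.6, size=24)   # synthetic, shaped like the report

d = deltas.mean() / deltas.std(ddof=1)
print(f"mean delta = {deltas.mean():+.2f} points on a 100-point scale; Cohen's d = {d:.2f}")
```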
Our Commitment to Fair AI Hiring
October Health is committed to ensuring that our AI-powered hiring tools evaluate candidates solely on their skills, experience, and qualifications — never on their gender, ethnicity, age, or any other protected characteristic.
We run these flip tests regularly as part of our AI governance framework. Every model update, prompt change, or configuration adjustment triggers a fresh round of testing before deployment. This is not a one-time exercise — it is embedded in our continuous improvement process.
Our hiring AI outputs are always advisory. Final hiring decisions are made by humans, and our tools are designed to augment — not replace — human judgement in recruitment.