
Flip Testing — Hiring Tool

Bias evaluation results for October People's AI-powered candidate screening. 400 pairwise tests across gender, ethnicity, and cross-intersectional dimensions.

A

Overall Bias Grade

No Practically Meaningful Bias Detected

Across 400 pairwise tests spanning gender, ethnicity, and cross-intersectional categories, no mean score difference exceeded 0.2 points on the 100-point scale. The AI hiring tool produced consistent recommendations regardless of candidate demographic indicators.

400

Pairs tested

8

Job types

800

API calls

100%

Consistent labels

What This Test Evaluates

Flip testing is a fairness evaluation method where identical candidate profiles are submitted to the AI with only demographic indicators changed — specifically names that signal gender and/or ethnicity. If the AI is fair, the scores and recommendations should remain consistent regardless of these demographic signals.
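As a rough sketch, this is how such flip pairs can be constructed. The profile fields and names below are illustrative placeholders, not the actual templates used in this evaluation.

```python
# Illustrative flip-pair construction: two profiles identical in every
# field except the name signal. Names and profile content are placeholders.
import copy

BASE_PROFILE = {
    "name": None,  # filled in per variant below
    "experience": "6 years of backend development; led a team of four",
    "education": "BSc Computer Science",
    "skills": ["Python", "PostgreSQL", "AWS"],
}

# Each tuple differs only in the demographic signal carried by the name.
NAME_PAIRS = [
    ("Emily Walsh", "Michael Walsh"),     # gender pair: same surname
    ("Michael Walsh", "Michael Okafor"),  # ethnicity pair: same gender
    ("Emily Walsh", "Rajesh Patel"),      # cross-intersectional pair
]

def make_pair(name_a: str, name_b: str) -> tuple[dict, dict]:
    """Return two profiles that are identical except for the name."""
    a, b = copy.deepcopy(BASE_PROFILE), copy.deepcopy(BASE_PROFILE)
    a["name"], b["name"] = name_a, name_b
    return a, b
```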

27

Gender pairs

Across 6 ethnic groups (same surname, different gendered first name)

20

Ethnicity pairs

Same gender, different ethnic names

12

Cross-intersectional pairs

Different gender AND different ethnicity

Tested across all job types

Engineering · Marketing · HR · Finance · Sales · Design · Operations · Data Science

Test parameters

Model: GPT-5.2 · Temperature: 0.7 · 800 API calls · Duration: ~116 minutes · Generated 17 February 2026
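A minimal sketch of what the evaluation loop could look like, assuming a hypothetical `score_candidate` wrapper around the model call (model and temperature fixed as above); two calls per pair accounts for the 800 total.

```python
# Hedged sketch of the evaluation loop. `score_candidate` is a hypothetical
# wrapper around the model API that returns a (score, label) tuple; it is
# not part of this report's published tooling.
from typing import Callable, Tuple

Scorer = Callable[[dict], Tuple[float, str]]

def run_flip_test(pairs: list, score_candidate: Scorer) -> list[dict]:
    """Score both sides of every pair: 400 pairs -> 800 model calls."""
    results = []
    for profile_a, profile_b in pairs:
        score_a, label_a = score_candidate(profile_a)  # first call
        score_b, label_b = score_candidate(profile_b)  # second call
        results.append({
            "delta": score_a - score_b,                # signed difference
            "label_consistent": label_a == label_b,
        })
    return results
```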

Score Delta Summary

No bias detected

The mean score delta measures the average difference in AI-assigned scores between paired candidates. A delta near zero indicates no systematic preference. All categories show negligible mean deltas well within acceptable bounds.

Category               Pairs   Mean Delta   Std Dev   Max |Delta|   p-value
Gender                   216         -0.1       0.5             3   0.01 (***)
Ethnicity                160         +0.1       0.6             2   0.10 (*)
Cross-Intersectional      24         -0.2       0.6             2   0.10 (*)

Significance: ns = not significant, * = p < 0.10, ** = p < 0.05, *** = p < 0.01. Scores are on a 0-100 scale.
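The report does not state which significance test produced these p-values; one conventional choice for paired deltas is a one-sample t-test against zero, sketched below.

```python
# Sketch of per-category summary statistics, assuming `deltas` holds the
# signed score differences for one category. The t-test choice is an
# assumption; the report does not specify its test.
import numpy as np
from scipy import stats

def summarise_deltas(deltas: list[float]) -> dict:
    d = np.asarray(deltas, dtype=float)
    # Tests whether the mean paired delta differs from zero.
    t_stat, p_value = stats.ttest_1samp(d, popmean=0.0)
    return {
        "pairs": len(d),
        "mean_delta": round(float(d.mean()), 2),
        "std_dev": round(float(d.std(ddof=1)), 2),
        "max_abs_delta": float(np.abs(d).max()),
        "p_value": float(p_value),
    }
```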

[Chart: Mean Score Delta by Category, ±50-point axis. Gender -0.1 pts · Ethnicity +0.1 pts · Cross-Intersectional -0.2 pts]

The bars are barely visible because the actual differences (0.1–0.2 points) are negligible on the 100-point scale.

Per-Job-Type Breakdown

All within threshold

Each job type was tested independently with 50 candidate pairs. Results show consistent performance across all roles, with no job type showing a mean delta exceeding our 0.5-point threshold.

Job Type        Pairs   Mean Delta   Std Dev   Max |Delta|
Engineering        50          0.0       0.9             2
Marketing          50          0.0       0.6             2
HR                 50         -0.4       0.9             3
Finance            50          0.0       0.2             1
Sales              50         +0.1       0.3             1
Design             50         +0.1       0.3             2
Operations         50          0.0       0.3             1
Data Science       50         -0.1       0.5             2
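A sketch of how the 0.5-point threshold check might be applied per job type, assuming a pandas frame with one row per pair and hypothetical `job_type` and `delta` columns.

```python
# Hedged sketch of the per-job-type threshold check. Column names are
# assumptions; the 0.5-point bound is the threshold quoted above.
import pandas as pd

THRESHOLD = 0.5  # maximum acceptable |mean delta| per job type

def check_job_types(df: pd.DataFrame) -> pd.DataFrame:
    summary = df.groupby("job_type")["delta"].agg(
        pairs="count",
        mean_delta="mean",
        std_dev="std",
        max_abs_delta=lambda s: s.abs().max(),
    )
    summary["within_threshold"] = summary["mean_delta"].abs() <= THRESHOLD
    return summary
```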

[Chart: Mean Delta by Job Type, ±50-point axis. Engineering 0.0 · Marketing 0.0 · HR -0.4 · Finance 0.0 · Sales +0.1 · Design +0.1 · Operations 0.0 · Data Science -0.1 pts]

Demographic Group Scores

Equal across groups

Average scores across all demographic groups fall within a remarkably tight range of 92.0 to 92.3 (a spread of just 0.3 points on a 100-point scale). This demonstrates that the AI evaluates candidates based on their qualifications, not their demographic background.

[Chart: Average Score by Group (0–100 scale), female and male bars per group; values as tabulated below.]
Demographic Group         Avg Score   Median   Samples
African American Female        92.1       92        56
African American Male          92.1       92        48
Anglo Female                   92.3       92       152
Anglo Male                     92.1       92       152
East Asian Female              92.2       92        56
East Asian Male                92.0       92        56
Eastern European Female        92.1       92         8
Eastern European Male          92.3       92         8
Hispanic Female                92.1       92        40
Hispanic Male                  92.1       92        48
Middle Eastern Female          92.2       92        32
Middle Eastern Male            92.1       92        32
South Asian Female             92.2       92        56
South Asian Male               92.1       92        48
West African Male              92.0       92         8
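The 0.3-point spread quoted above can be checked directly; this sketch assumes one row per API call with hypothetical `group` and `score` columns.

```python
# Sketch of the group-spread check: gap between the highest and lowest
# per-group mean scores. Column names are assumptions.
import pandas as pd

def group_spread(df: pd.DataFrame) -> float:
    means = df.groupby("group")["score"].mean()
    return float(means.max() - means.min())  # 0.3 points in this run
```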

Recommendation Consistency

100% consistent

Beyond numerical scores, we measure whether the AI's final recommendation label (e.g. "Strongly Recommended", "Recommended", "Not Recommended") changes between paired candidates. A label change would indicate that demographic signals affect the hiring outcome.

100% Consistent

400/400

Same label for both candidates

0/400

Label changes

In every single test pair, the AI assigned the same recommendation label to both candidates. Changing only the candidate's name — to signal a different gender or ethnicity — had zero effect on the hiring recommendation.
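The check itself is simple; this sketch assumes the per-pair records produced by the evaluation loop sketched earlier, each carrying a `label_consistent` flag.

```python
# Sketch of the label-consistency metric: the share of pairs in which
# both candidates received the same recommendation label.

def consistency_rate(results: list[dict]) -> float:
    consistent = sum(r["label_consistent"] for r in results)
    return consistent / len(results)  # 400/400 -> 1.0 in this run
```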

Effect Sizes (Cohen's d)

No practical impact

Cohen's d measures the standardised difference between two group means. Conventionally, values below 0.2 are negligible, 0.2–0.5 is small, 0.5–0.8 is medium, and above 0.8 is large.

Important context on interpreting these values

Our scoring uses a 0–100 integer scale. Because the AI scores are highly consistent (standard deviation of just 0.5–0.6 points), even trivial absolute differences of 0.1–0.2 points produce non-negligible Cohen's d values. A d of 0.354, for example, represents a real-world difference of just 0.2 points on a 100-point scale, a fraction of a single point that has zero practical impact on any candidate outcome. The 100% recommendation consistency (400/400 identical labels) confirms this.

Category               Cohen's d   Convention   Actual Difference   Practical Impact
Gender                     0.219   Small        0.1 pts / 100       None
Ethnicity                  0.137   Negligible   0.1 pts / 100       None
Cross-Intersectional       0.354   Small        0.2 pts / 100       None
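For paired data, one common variant of Cohen's d (sometimes written d_z) is the mean delta divided by the standard deviation of the deltas, which is why near-zero variance inflates d. Whether this run used that exact variant is an assumption; a minimal sketch:

```python
# Paired Cohen's d (d_z): mean of the deltas over their standard deviation.
# With highly consistent scores the denominator is tiny, so a 0.2-point
# mean difference can still register as d ~ 0.35.
import numpy as np

def cohens_d_paired(deltas: list[float]) -> float:
    d = np.asarray(deltas, dtype=float)
    return float(d.mean() / d.std(ddof=1))

# Example: a mean delta of -0.2 with a standard deviation of ~0.57
# gives |d| ~ 0.35, despite being a fraction of a point on a 0-100 scale.
```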

Effect Size Scale

[Chart: Gender d = 0.219 · Ethnicity d = 0.137 · Cross-Intersectional d = 0.354, plotted against conventional thresholds of 0.2 (small), 0.5 (medium), 0.8 (large)]

While conventional Cohen's d thresholds classify some values as "small", this is an artefact of the very low variance in scores — the AI is so consistent that any tiny fluctuation appears amplified in standardised terms. On the actual 100-point scale, the largest mean difference is 0.2 points, and no candidate received a different recommendation based on their demographic profile.

Our Commitment to Fair AI Hiring

October Health is committed to ensuring that our AI-powered hiring tools evaluate candidates solely on their skills, experience, and qualifications — never on their gender, ethnicity, age, or any other protected characteristic.

We run these flip tests regularly as part of our AI governance framework. Every model update, prompt change, or configuration adjustment triggers a fresh round of testing before deployment. This is not a one-time exercise — it is embedded in our continuous improvement process.

Our hiring AI outputs are always advisory. Final hiring decisions are made by humans, and our tools are designed to augment — not replace — human judgement in recruitment.
