Flip Testing — Hiring Tool

Bias evaluation results for October People's AI-powered candidate screening. 400 pairwise tests across gender, ethnicity, and cross-intersectional dimensions.

Overall Bias Grade

No Practically Meaningful Bias Detected

Across 400 pairwise tests spanning gender, ethnicity, and cross-intersectional categories, mean score differences between paired candidates were at most 0.2 points on a 100-point scale. The AI hiring tool assigned the same recommendation label to both candidates in every pair, regardless of demographic indicators.

400 pairs tested · 8 job types · 800 API calls · 100% consistent labels

What This Test Evaluates

Flip testing is a fairness evaluation method where identical candidate profiles are submitted to the AI with only demographic indicators changed — specifically names that signal gender and/or ethnicity. If the AI is fair, the scores and recommendations should remain consistent regardless of these demographic signals.

Gender pairs: same surname, different gendered first name, across 6 ethnic groups

Ethnicity pairs: same gender, different ethnic names

Cross-intersectional pairs: different gender and different ethnicity

Tested across all job types

Engineering, Marketing, HR, Finance, Sales, Design, Operations, Data Science
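The pairing scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used for this report: the `Candidate` class, the `NAMES` pool, and the `build_pairs` helper are all hypothetical, and the names are placeholders chosen only to show how one CV is reused with different demographic signals.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    name: str        # the only field that differs within a pair
    gender: str      # signal carried by the first name
    ethnicity: str   # signal carried by the full name
    cv_text: str     # identical for both members of a pair


# Hypothetical name pool: (first name, surname, gender, ethnicity).
NAMES = [
    ("Emily", "Walker", "female", "Anglo"),
    ("James", "Walker", "male", "Anglo"),
    ("Mei", "Chen", "female", "East Asian"),
    ("Wei", "Chen", "male", "East Asian"),
]


def classify(a: Candidate, b: Candidate):
    """Return the pair category, or None if no demographic signal differs."""
    gender_differs = a.gender != b.gender
    ethnicity_differs = a.ethnicity != b.ethnicity
    if gender_differs and ethnicity_differs:
        return "cross-intersectional"
    if gender_differs:
        return "gender"
    if ethnicity_differs:
        return "ethnicity"
    return None


def build_pairs(cv_text: str):
    """Attach every name to the same CV and pair up the resulting candidates."""
    candidates = [
        Candidate(f"{first} {last}", gender, ethnicity, cv_text)
        for first, last, gender, ethnicity in NAMES
    ]
    pairs = []
    for a, b in combinations(candidates, 2):
        kind = classify(a, b)
        if kind is not None:
            pairs.append((kind, a, b))
    return pairs
```

Each pair would then be scored with two separate API calls, one per candidate, which matches the ratio of 800 calls to 400 pairs reported here.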

Test parameters

Model: GPT-5.2 · Temperature: 0.7 · 800 API calls · Duration: ~116 minutes · Generated 17 February 2026

Score Delta Summary

No bias detected

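The mean score delta measures the average difference in AI-assigned scores between paired candidates. A delta near zero indicates no systematic preference. All three categories show mean deltas of at most 0.2 points in magnitude. The gender delta is statistically significant (p = 0.01), but at 0.1 points on a 100-point scale it is too small to change any recommendation.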

Category	Pairs	Mean Delta	Std Dev	Max |Delta|	p-value
Gender	216	-0.1	0.5	3	0.01 (***)
Ethnicity	160	+0.1	0.6	2	0.10 (*)
Cross-Intersectional	24	-0.2	0.6	2	0.10 (*)

Significance: ns = not significant, * = p < 0.10, ** = p < 0.05, *** = p < 0.01. Scores are on a 0-100 scale.
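The columns in the table above can be reproduced from a list of per-pair score differences. The sketch below is an assumption about how such a summary might be computed: the report does not state which significance test was used, so the p-value here is a normal approximation to a paired test on the deltas, and `summarize_deltas` is a hypothetical helper name.

```python
import math
from statistics import NormalDist, mean, stdev


def summarize_deltas(deltas):
    """Summarise per-pair score differences (score_a - score_b)."""
    n = len(deltas)
    if n < 2:
        raise ValueError("need at least two pairs")
    m = mean(deltas)
    sd = stdev(deltas)  # sample standard deviation of the deltas
    if sd == 0:
        # All deltas identical: no evidence of a shift only if they are all 0.
        p = 1.0 if m == 0 else 0.0
    else:
        z = m / (sd / math.sqrt(n))  # mean divided by its standard error
        # Two-sided p-value, normal approximation (an assumption, see above).
        p = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "pairs": n,
        "mean_delta": m,
        "std_dev": sd,
        "max_abs_delta": max(abs(d) for d in deltas),
        "p_approx": p,
    }


# Hypothetical deltas for illustration only; not data from this report.
example = summarize_deltas([0, 0, -1, 0, 1, 0, 0, -1, 0, 0])
print(example)
```

With a few hundred pairs, even a mean delta of 0.1 points can reach statistical significance when the standard deviation is around 0.5, which is why the p-value should be read alongside the size of the difference.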

Figure: Mean score delta by category, plotted on a ±50 point axis. Gender: -0.1 pts; Ethnicity: +0.1 pts; Cross-Intersectional: -0.2 pts.

The bars are barely visible because the actual differences (0.1–0.2 points) are negligible on the 100-point scale.

Per-Job-Type Breakdown

All within threshold

Each job type was tested independently with 50 candidate pairs. No job type shows a mean delta exceeding our 0.5-point threshold in absolute value; the largest is HR at -0.4 points.

Job Type	Pairs	Mean Delta	Std Dev	Max |Delta|
Engineering	50	0.0	0.9	2
Marketing	50	0.0	0.6	2
HR	50	-0.4	0.9	3
Finance	50	0.0	0.2	1
Sales	50	+0.1	0.3	1
Design	50	+0.1	0.3	2
Operations	50	0.0	0.3	1
Data Science	50	-0.1	0.5	2
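The threshold check for this breakdown amounts to comparing the absolute mean delta of each job type with 0.5 points. A minimal sketch, using the mean deltas from the table above; the function names are hypothetical and the comparison rule (strictly greater than the threshold) is an assumption.

```python
THRESHOLD = 0.5  # points, as stated in the report

# Mean deltas copied from the per-job-type table.
MEAN_DELTA_BY_JOB = {
    "Engineering": 0.0,
    "Marketing": 0.0,
    "HR": -0.4,
    "Finance": 0.0,
    "Sales": 0.1,
    "Design": 0.1,
    "Operations": 0.0,
    "Data Science": -0.1,
}


def exceeding_threshold(mean_deltas, threshold=THRESHOLD):
    """Return the job types whose absolute mean delta is above the threshold."""
    return {job: d for job, d in mean_deltas.items() if abs(d) > threshold}


def largest_abs_delta(mean_deltas):
    """Return the (job type, mean delta) pair that is furthest from zero."""
    return max(mean_deltas.items(), key=lambda item: abs(item[1]))


flagged = exceeding_threshold(MEAN_DELTA_BY_JOB)
print("Flagged job types:", flagged or "none")
print("Largest absolute mean delta:", largest_abs_delta(MEAN_DELTA_BY_JOB))
```

Comparing absolute values matters here because a negative delta such as HR's would otherwise never trip a one-sided check.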

Figure: Mean delta by job type, plotted on a ±50 point axis. Values are those listed in the table above.

Demographic Group Scores

Equal across groups

Average scores across all demographic groups fall between 92.0 and 92.3, a spread of 0.3 points on a 100-point scale. This is consistent with the AI evaluating candidates on their qualifications rather than their demographic background.

Figure: Average score by group (0–100 scale). Values are those listed in the table below.

Demographic Group	Avg Score	Median	Samples
African American Female	92.1	92	56
African American Male	92.1	92	48
Anglo Female	92.3	92	152
Anglo Male	92.1	92	152
East Asian Female	92.2	92	56
East Asian Male	92.0	92	56
Eastern European Female	92.1	92	8
Eastern European Male	92.3	92	8
Hispanic Female	92.1	92	40
Hispanic Male	92.1	92	48
Middle Eastern Female	92.2	92	32
Middle Eastern Male	92.1	92	32
South Asian Female	92.2	92	56
South Asian Male	92.1	92	48
West African Male	92.0	92	8
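The spread quoted for this section can be checked directly from the table. The sketch below copies the average scores and sample counts from the table; the helper names are hypothetical.

```python
# (average score, sample count) per group, copied from the table above.
GROUP_SCORES = {
    "African American Female": (92.1, 56),
    "African American Male": (92.1, 48),
    "Anglo Female": (92.3, 152),
    "Anglo Male": (92.1, 152),
    "East Asian Female": (92.2, 56),
    "East Asian Male": (92.0, 56),
    "Eastern European Female": (92.1, 8),
    "Eastern European Male": (92.3, 8),
    "Hispanic Female": (92.1, 40),
    "Hispanic Male": (92.1, 48),
    "Middle Eastern Female": (92.2, 32),
    "Middle Eastern Male": (92.1, 32),
    "South Asian Female": (92.2, 56),
    "South Asian Male": (92.1, 48),
    "West African Male": (92.0, 8),
}


def score_spread(groups):
    """Difference between the highest and lowest group average, to 1 decimal."""
    averages = [avg for avg, _ in groups.values()]
    return round(max(averages) - min(averages), 1)


def total_samples(groups):
    """Total number of scored profiles across all groups."""
    return sum(n for _, n in groups.values())


print("Spread:", score_spread(GROUP_SCORES))
print("Total samples:", total_samples(GROUP_SCORES))
```

The sample counts add up to 800, matching the number of API calls. Groups with only 8 samples carry much less statistical weight than those with 152, so their averages are less precise.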

Recommendation Consistency

100% consistent

Beyond numerical scores, we measure whether the AI's final recommendation label (e.g. "Strongly Recommended", "Recommended", "Not Recommended") changes between paired candidates. A label change would indicate that demographic signals affect the hiring outcome.

400/400

Same label for both candidates

0/400

Label changes

In every single test pair, the AI assigned the same recommendation label to both candidates. Changing only the candidate's name — to signal a different gender or ethnicity — had zero effect on the hiring recommendation.
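Counting label changes is a simple comparison over the paired recommendations. A minimal sketch, assuming each pair is stored as a tuple of two label strings; the `label_consistency` helper and the sample data are hypothetical.

```python
from collections import Counter


def label_consistency(label_pairs):
    """Count how many pairs received the same label for both candidates."""
    total = len(label_pairs)
    if total == 0:
        raise ValueError("no pairs to evaluate")
    changes = Counter()
    for label_a, label_b in label_pairs:
        if label_a != label_b:
            # Record which transition occurred, for later inspection.
            changes[(label_a, label_b)] += 1
    changed = sum(changes.values())
    return {
        "total": total,
        "same": total - changed,
        "changed": changed,
        "consistency_rate": (total - changed) / total,
        "changes": dict(changes),
    }


# Hypothetical data shaped like the reported outcome: 400 identical labels.
pairs = [("Recommended", "Recommended")] * 400
print(label_consistency(pairs))
```

Keeping the individual transitions, rather than only the count, would show the direction of any change if one ever appeared in a future run.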

Effect Sizes (Cohen's d)

No practical impact

Cohen's d measures the standardised difference between two group means. Conventionally, values below 0.2 are negligible, 0.2–0.5 is small, 0.5–0.8 is medium, and above 0.8 is large.

Important context on interpreting these values

Our scoring uses a 0–100 integer scale. Because the AI scores are highly consistent (standard deviation of just 0.5–0.6 points), even trivial absolute differences of 0.1–0.2 points produce non-negligible Cohen's d values. A d of 0.354, for example, corresponds to a mean difference of just 0.2 points on a 100-point scale, a fraction of a single point. The 100% recommendation consistency (400/400 identical labels) confirms that differences of this size had no effect on any candidate outcome.

Category	Cohen's d	Convention	Actual Difference	Practical Impact
Gender	0.219	Small	0.1 pts / 100	None
Ethnicity	0.137	Negligible	0.1 pts / 100	None
Cross-Intersectional	0.354	Small	0.2 pts / 100	None
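The amplification effect described above can be illustrated with a short calculation. The report does not state which variant of Cohen's d was used, so this sketch assumes the paired form (mean of the deltas divided by their standard deviation). Under that assumption, 0.2 / 0.6 ≈ 0.33, which is in the same range as the reported 0.354. The function names and sample data are hypothetical.

```python
from statistics import mean, stdev


def cohens_d_paired(deltas):
    """Paired-sample effect size: mean delta over the SD of the deltas."""
    sd = stdev(deltas)
    if sd == 0:
        raise ValueError("standard deviation is zero; d is undefined")
    return mean(deltas) / sd


def convention(d):
    """Map |d| to the conventional labels quoted in this report."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Two hypothetical sets of deltas with the same mean (-0.2 points).
tight = [0, 0, 0, -1, 0, 0, 0, 0, -1, 0]           # scores barely vary
wide = [10, -10, 10, -10, 10, -10, 10, -10, -1, -1]  # scores vary a lot

for name, deltas in (("tight", tight), ("wide", wide)):
    d = cohens_d_paired(deltas)
    print(name, round(mean(deltas), 2), round(d, 3), convention(d))
```

Both sets have the same 0.2-point mean shift, yet only the low-variance set is labelled "small", which is why the absolute difference is reported next to each d value.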

Figure: Effect sizes on the Cohen's d scale (0.2 small, 0.5 medium, 0.8 large). Gender: d = 0.219; Ethnicity: d = 0.137; Cross-Intersectional: d = 0.354.

While conventional Cohen's d thresholds classify some values as "small", this is an artefact of the very low variance in scores — the AI is so consistent that any tiny fluctuation appears amplified in standardised terms. On the actual 100-point scale, the largest mean difference is 0.2 points, and no candidate received a different recommendation based on their demographic profile.

Our Commitment to Fair AI Hiring

October Health is committed to ensuring that our AI-powered hiring tools evaluate candidates solely on their skills, experience, and qualifications — never on their gender, ethnicity, age, or any other protected characteristic.

We run these flip tests regularly as part of our AI governance framework. Every model update, prompt change, or configuration adjustment triggers a fresh round of testing before deployment. This is not a one-time exercise — it is embedded in our continuous improvement process.

Our hiring AI outputs are always advisory. Final hiring decisions are made by humans, and our tools are designed to augment — not replace — human judgement in recruitment.
