Abstract
The application of large language models (LLMs) within clinical decision-support frameworks is receiving growing research attention, yet their fairness and demographic robustness remain insufficiently understood. This study introduces MedQA-Demog, a purpose-built, label-invariant extension of the MedQA-USMLE benchmark, designed to enable systematic auditing of demographic bias in medical reasoning models. Using a deterministic augmentation framework, we generated 4,659 question-answer items incorporating counterfactual variations in gender, race/ethnicity, and age, and validated them through automated integrity and balance checks. We evaluated the Mistral 7B-Instruct model under stochastic (temperature = 0.7) and deterministic (temperature = 0.0) inference settings via the Ollama local environment, applying Wilson 95% confidence intervals, χ²/z-tests, McNemar's paired analysis, and Cohen's h effect sizes to quantify fairness. Across all demographic variants, diagnostic accuracy remained consistent (Δ < 0.04; p > 0.05), and all performance gaps fell within Minimal or Low Bias thresholds. Confusion-matrix and prediction-balance analyses revealed no systematic over- or under-prediction patterns, while power analysis confirmed that observed fluctuations were below the minimum detectable effect (≈ 0.057). A stratified robustness analysis further confirms that these fairness patterns persist across question difficulty levels and are not an artefact of uniformly limited performance. These findings demonstrate that open-weight, instruction-tuned LLMs can maintain demographic stability in clinical reasoning when evaluated through reproducible, controlled pipelines. This framework provides a practical foundation for bias evaluation in open clinical LLMs, supporting their ethical integration into digital health tools and clinical decision-support systems.
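The two headline statistics in the abstract, the Wilson score interval for per-group accuracy and Cohen's h for the gap between two groups, are standard closed-form computations. The sketch below shows how they might be computed; the function names and the example counts are illustrative assumptions, not values or code taken from the paper.

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score confidence interval for a proportion k/n.

    z = 1.96 gives the 95% interval used in the study's fairness audit.
    """
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

def cohens_h(p1, p2):
    """Cohen's h effect size between two proportions.

    Uses the arcsine (variance-stabilising) transform: h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)).
    """
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Hypothetical example: accuracy of one demographic variant,
# and the effect size between two variants' accuracies.
lo, hi = wilson_ci(280, 466)      # 280 correct out of 466 items (illustrative)
h = cohens_h(0.601, 0.585)        # two illustrative group accuracies
```

By convention, |h| < 0.2 is interpreted as a small effect, which is consistent with the "Minimal or Low Bias" thresholds reported above.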
| Original language | English |
|---|---|
| Pages (from-to) | 12526-12543 |
| Number of pages | 18 |
| Journal | IEEE Access |
| Volume | 14 |
| Early online date | 20 Jan 2026 |
| DOIs | |
| Publication status | Published - 26 Jan 2026 |
Bibliographical note
Copyright © 2026 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Funding
This work was supported in part by the Sir Peter Rigby Digital Futures Institute, Aston University.
Keywords
- Large language models (LLMs); demographic bias; fairness auditing; medical question answering; MedQA benchmark; Mistral 7B-Instruct; open-weight models; Ollama; Wilson confidence interval; statistical bias evaluation; digital health; ethical AI
Title
Auditing Demographic Bias in Mistral: An Open-Source LLM’s Diagnostic Performance on the MedQA Benchmark