Qingze Gu

Health data scientist and clinical epidemiologist — real-world evidence from population-scale cohorts and electronic health records

I am a health data scientist and clinical epidemiologist who generates real-world evidence (RWE) from population-scale health data. I design and run cohort and observational studies that turn electronic health records (EHR), registries, and other real-world data into evidence supporting clinical, regulatory, and commercial decisions — using causal inference, comparative effectiveness, survival analysis, mixed-effects and latent-class models, machine learning, and clinical NLP/LLMs in R, Python, and SQL.

As a Research Fellow at NTU Singapore, I work on PRECISE-SG100K — a multi-ancestry Asian population cohort of ~100,000 participants whose deep phenotypes and whole-genome data are linked to electronic health records and analysed within the secure TRUST platform. I build reproducible endpoint-analysis pipelines for chronic diseases (cardiovascular disease, type 2 diabetes, chronic kidney disease, liver disease, and cancer) and LLM pipelines that normalise free-text medications to OMOP/RxNorm concepts.

Before NTU, my Oxford DPhil exploited hospital electronic health records to improve infection management by modelling infection-response trajectories, antibiotic prescribing, and drug dosing. As a postdoctoral health data scientist, I led a national study of rare-disease prevalence and COVID-19 burden across 62.5M people in the NHS England Secure Data Environment, and collaborated on genetic analyses of shared mechanisms between hypertension and type 2 diabetes. Earlier, I worked on the industry side of RWE at IQVIA and Oracle (Cerner Enviza).


Experience

Research Fellow · Nanyang Technological University (LKCMedicine) · Singapore · Oct 2025 – present

  • Build reproducible cohort and endpoint-analysis pipelines for PRECISE-SG100K (~100,000 multi-ancestry participants), linking research phenotypes and whole-genome data with electronic health records in the secure TRUST platform.
  • Deliver disease-specific endpoint analyses across chronic diseases (cardiovascular disease, cancer, and others) — SNOMED/ICD codelist mapping, incidence and survival modelling, and absolute-risk estimation — on a versioned, config-driven framework reused across studies.
  • Develop LLM pipelines normalising 35,000+ free-text and self-reported medications to generic ingredients and OMOP/RxNorm drug concepts.

Postdoctoral Health Data Scientist · University of Oxford · Oxford, UK · Oct 2024 – Sep 2025

  • Led a national cross-sectional study estimating the prevalence of 406 rare diseases and their COVID-19 burden across 62.5M people and 19 ethnic groups, using linked primary-care, hospital, and mortality records in the NHS England Secure Data Environment (Databricks/PySpark/R/SQL).
  • Collaborated on genetic analyses of shared mechanisms between hypertension and type 2 diabetes, accounting for adiposity.

Consultant in Clinical NLP · Laboratory of Data Discovery for Health (D24H) · Remote / Hong Kong · Jun 2025 – Oct 2025

  • Built and validated an end-to-end LLM pipeline for TNM staging of non-small cell lung cancer (NSCLC) from pseudonymised oncology clinical notes — OCR text extraction, gold-standard labelling, and feature-based staging.
  • Led data curation (annotation-schema design, gold-standard labelling) and built a component-wise accuracy-evaluation harness; the pipeline reached over 90% accuracy on each of the T, N, and M staging components.

PhD Researcher (Biomedical Data Science) · University of Oxford · Oxford, UK · Oct 2020 – Sep 2024

Thesis: “Exploiting electronic health records to improve infection management” (viva passed with no corrections).

  • Characterised pathogen-specific inflammatory-marker and vital-sign trajectories in suspected bloodstream infection via latent-class mixed models on five years of hospital EHR, deriving centile reference charts to guide infection management.
  • Applied transformers and LLMs (BERT, GPT) to free-text antibiotic indications to infer infection sources, benchmarked against ICD-10 coding.
  • Evaluated an institutional vancomycin dosing guideline using regression, survival analysis, and population-pharmacokinetic simulation.

Real-World Solutions Intern · IQVIA · Beijing, China · Jul – Aug 2024

  • Assessed regional EHR databases and built feasibility table shells to support real-world study planning; conducted desk research on disease burden, clinical trials, and patient-reported outcomes.

Real-World Evidence Intern · Cerner Enviza, an Oracle company · Shanghai, China · Sep 2023 – Mar 2024

  • Assembled treatment and marketed-drug data for survey-based RWE studies; developed case report forms and health-economics indicators; reviewed protocols on patient characteristics, disease burden, and treatment patterns.

Education

  • PhD, Clinical Medicine (Biomedical Data Science) · University of Oxford · 2020–2024
  • MSc, Pharmacology (Distinction) · University of Oxford · 2019–2020
  • BEng, Pharmaceutics and Food · Harbin Institute of Technology · 2015–2019

Selected publications

  1. Prevalence of 406 rare diseases by ethnicity and their COVID-19 burden — Gu Q, et al. medRxiv (2026). First author.
  2. Transformers and large language models are efficient feature extractors for EHR studies — Yuan K, Yoon CH, Gu Q, et al. Communications Medicine (2025). Joint first author.
  3. Distinct patterns of vital sign and inflammatory marker responses in suspected bloodstream infection — Gu Q, et al. Journal of Infection (2024). First author.
  4. Assessment of an institutional guideline for vancomycin dosing and predictive factors — Gu Q, et al. Journal of Infection (2022). First author.
