Synthetic Data in Medical Research: What, Why and How?

Submitted by S.A. Norwood on Mon, 14/04/2025 - 15:35

“Synthetic data holds great promise, but considerable uncertainties remain about how to fully and responsibly realise its potential. Urgent collaboration between medical, ethical, and machine learning communities, and the public, is essential to develop standardised governance frameworks, appropriate-use guidelines, and consensus-based definitions of quality, safety, and fairness. The time to act is now.”

Governing synthetic data in medical research: the time is now¹

Cambridge researchers Dr Daniela Boraschi, Prof Mihaela van der Schaar, Dr Alessia Costa and Dr Richard Milne urge policy-makers to take action. Their recent article 'Governing synthetic data in medical research: the time is now', published in The Lancet Digital Health, asks scientists and regulators alike to consider how synthetic data should be used.¹ But what is synthetic data? When, why and how is it currently used? And what considerations should we be taking into account in future?

What is synthetic data?

Synthetic data is information generated by artificial intelligence algorithms to mimic real-world data. It is based on existing data and aims to reproduce the characteristics and structure of the original data. Good-quality synthetic data allows researchers to expand small datasets and should produce similar results to the original data when analysed statistically.
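As a rough illustration of that last point, the sketch below fits a simple normal distribution to a handful of values and draws synthetic values from it, then compares the summary statistics. This is a toy stand-in, not the method used by any of the tools cited in this article: real generators are far more sophisticated, and the function name `fit_and_sample` and the blood pressure readings are invented for the example.

```python
import random
import statistics

def fit_and_sample(real_values, n_synthetic, seed=0):
    """Fit a normal distribution to real_values and draw synthetic values from it."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n_synthetic)]

# Hypothetical systolic blood pressure readings (mmHg); invented for illustration.
real_bp = [118, 125, 131, 122, 140, 135, 128, 119, 133, 127]
synthetic_bp = fit_and_sample(real_bp, n_synthetic=1000)

# Good-quality synthetic data should give similar summary statistics.
print(f"real mean:      {statistics.mean(real_bp):.1f}")
print(f"synthetic mean: {statistics.mean(synthetic_bp):.1f}")
print(f"real stdev:      {statistics.stdev(real_bp):.1f}")
print(f"synthetic stdev: {statistics.stdev(synthetic_bp):.1f}")
```

Matching a mean and a standard deviation is only the most basic check. A single-variable model like this one cannot capture relationships between variables, which is where more capable generative models come in.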

Why use synthetic data in biomedical research?

Synthetic data is increasingly of interest in biomedical research for its capacity to provide extensive, anonymous datasets. These datasets are essential for developing and studying machine learning algorithms and AI applications, with several project-specific and publicly available datasets accessible for researchers to download.²,³,⁴

How is synthetic data currently used?

Synthetic medical data has applications in a number of areas and could address some of the problems posed by traditional data collection. Researchers are still developing and exploring its potential uses, which could include:

Illustration of an anonymous dataset. Four white circles with question marks represent anonymous hypothetical patients. Four white lines represent AI-generated patient data.

Anonymous datasets to train AI

Synthetic datasets on cancer,⁵ cardiovascular diseases,⁶ and COVID-19⁷ have been created using patient information, to help researchers train machine learning and AI models on data that does not contain any sensitive personal information.

Illustration of a dataset completed with AI-generated data. Four dark circles and three dark lines represent real patients and their data. A fourth line starts dark with a question mark in it and is completed in white to represent synthetic data.

Replacing missing data

Real-world datasets often contain ‘censored’ data, where a complete record is not available because, for example, a patient withdraws from a study. Synthetic data can be used to ‘fill in’ the missing values and allow for better statistical analyses.⁸
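The idea of ‘filling in’ gaps can be sketched with a very simple technique: replacing each missing value with one drawn at random from the values that were observed. This is a basic random imputation, not the generative approach described in the cited work, and it does not handle censored time-to-event data properly (which would need to respect how long each patient was actually followed). The function name `fill_missing` and the cholesterol values are assumptions made for the example.

```python
import random

def fill_missing(values, seed=0):
    """Replace None entries with values drawn at random from the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to draw from")
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [v if v is not None else rng.choice(observed) for v in values]

# Hypothetical cholesterol measurements (mmol/L); None marks a missing value.
cholesterol = [5.2, 4.8, None, 6.1, 5.5, None, 4.9]
completed = fill_missing(cholesterol)

print(completed)
```

Observed values are left untouched; only the gaps are filled, and only with values that already occur in the data.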

Illustration of a dataset expanded with AI-generated data. One dark circle and line represent a real patient and their data. Three white circles and lines represent AI-generated data used to expand the dataset.

Expanding small datasets

Small datasets can hinder statistical analysis and cause biases in AI. This is a particular problem in healthcare data, where information on underrepresented and minority groups is often lacking. Synthetic data can be used to expand small datasets, allowing AI models to perform better for a wider range of people.⁹
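A minimal sketch of expanding a small group, assuming purely numeric records: existing records are resampled and each value is nudged by a small random amount, so the added records resemble the originals without copying them exactly. This is a crude stand-in for a trained generative model and adds no genuinely new information about the group. The function name `expand_group`, the `noise` parameter, and the (age, BMI) values are all hypothetical.

```python
import random

def expand_group(records, target_size, noise=0.05, seed=0):
    """Grow a small list of numeric records to target_size by resampling with jitter."""
    if not records:
        raise ValueError("cannot expand an empty group")
    rng = random.Random(seed)  # seeded so the example is reproducible
    expanded = list(records)   # keep the real records at the front
    while len(expanded) < target_size:
        base = rng.choice(records)
        # Perturb each value by up to +/- noise (relative) so copies are not exact.
        expanded.append(tuple(x * (1 + rng.uniform(-noise, noise)) for x in base))
    return expanded

# Hypothetical (age, BMI) records for an underrepresented group; invented values.
small_group = [(54, 27.1), (61, 30.4), (47, 24.8)]
expanded_group = expand_group(small_group, target_size=12)

print(len(small_group), "real records expanded to", len(expanded_group))
```

Because every added record is derived from one of the three originals, any bias in those originals is carried over, which is one reason the quality and equity questions raised later in this article matter.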

How should we use synthetic data in the future?

Synthetic data has the potential to improve the openness, fairness, and efficiency of healthcare data analysis. However, as Dr Boraschi and her colleagues highlight in their recent paper¹ and blog,¹⁰ there are a number of considerations that should be explored as the use of synthetic data becomes more common in biomedical research.

Data Quality

How can we measure the quality of a synthetic dataset and understand how well it reflects real populations and healthcare scenarios?

Privacy

How do we ensure that privacy is maintained and generative models do not reproduce any sensitive data?¹¹

Equity

How do we apply findings based on synthetic data to patients? How do we test whether a model based on synthetic data performs better for one group of people than another?

Transparency and Accountability

How should synthetic data be used compared with real-world data? How do we keep track of real vs synthetic data and ensure transparency? How do we establish accountability?
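Two of the questions above, on privacy and equity, lend themselves to simple first-pass checks. The sketch below shows one possible version of each: flagging synthetic records that are exact copies of real ones, and computing a model's accuracy separately for each group. Both are illustrative assumptions rather than established standards. In particular, an exact-match check cannot detect near-copies or more subtle privacy leaks. The function names and all data values are invented.

```python
from collections import defaultdict

def exact_copies(real_records, synthetic_records):
    """Return synthetic records that are identical to some real record."""
    real_set = set(real_records)
    return [rec for rec in synthetic_records if rec in real_set]

def accuracy_by_group(groups, y_true, y_pred):
    """Return prediction accuracy computed separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, pred in zip(groups, y_true, y_pred):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}

# Privacy check: hypothetical (age, group) records, invented for illustration.
real = [(54, "A"), (61, "B"), (47, "A")]
synthetic = [(55, "A"), (61, "B"), (40, "B")]
leaked = exact_copies(real, synthetic)
print("synthetic records copied from real data:", leaked)

# Equity check: hypothetical true outcomes and model predictions per patient.
groups = ["A", "A", "B", "B", "B"]
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
per_group = accuracy_by_group(groups, y_true, y_pred)
print("accuracy by group:", per_group)
```

In this invented example one synthetic record duplicates a real one, and the model is correct for every patient in group A but for only one in three in group B, the kind of gap that an overall accuracy figure would hide.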

In their paper the researchers call for action to start regulating synthetic data and ensure that these aspects are taken into account. Standardised frameworks for measuring the quality of synthetic data should be adopted, and guidelines for appropriate use should be produced. With these tools in hand, synthetic data will enable open, efficient, and equitable research, and the development of new AI-based technologies and practices.

Learn More

Read the full paper

Read Dr Boraschi’s interview with synthetic data researcher Dr Fergus Imrie

References & Links

1. Governing synthetic data in medical research: the time is now. The Lancet Digital Health (2025)
2. Synthetic data in medical research. BMJ Medicine (2022)
3. Synthetic data in health care: A narrative review. PLOS Digital Health (2023)
4. Synthetic data generation methods in healthcare: A review on open-source tools and methods. Computational and Structural Biotechnology Journal (2024)
5. Simulacrum v2 synthetic dataset. Simulacrum (2023)
6. CPRD cardiovascular disease synthetic dataset. CPRD (2020)
7. CPRD COVID-19 symptoms and risk factors synthetic dataset. CPRD (2021)
8. SurvivalGAN: Generating Time-to-Event Data for Survival Analysis. Proceedings of Machine Learning Research (2023)
9. Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data. NeurIPS Proceedings (2023)
10. Synthetic Data and Healthcare: Ethics Q&A. Kavli Centre for Ethics, Science, and the Public (2025)
11. Are synthetic health data ‘personal data’? The PHG Foundation (2024)

Author Information

Dr Stephanie Norwood

Coordinator and Events Manager

Engineering Biology IRC

University of Cambridge

Credits & Acknowledgements

Many thanks to the authors Dr Daniela Boraschi, Prof Mihaela van der Schaar, Dr Alessia Costa and Dr Richard Milne and to Dr Fergus Imrie for sharing their thoughts and reviewing this article.
