Handling missing data in large electronic healthcare record datasets

Status: This project is ongoing

Electronic healthcare records (EHRs) are created when healthcare professionals record information about the health of their patients. These records include variables (features likely to vary/change) such as:

disease symptoms and diagnoses
measurements such as body mass index
patient characteristics such as ethnicity
laboratory measurements

EHR data is used to support health care. Researchers also use it to explore the relationships between patients’ exposure to medicines or risk factors and their disease outcomes. As data isn’t always consistently recorded, researchers sometimes find that information from their datasets (collections of data) is missing.

Missing information can cause bias when researchers attempt to estimate the relationship between an exposure and disease outcome. To address this problem, statisticians use methods aimed at dealing with missing data.

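In statistics, imputation is the process of replacing missing values with values predicted from the other characteristics recorded for the same patient. Multiple imputation repeats this process several times to create several completed datasets, analyses each one, and combines the results. This allows for the uncertainty in predicting missing values and can reduce bias.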
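The idea of multiple imputation can be illustrated with a minimal sketch in Python. This is not the method used in this project. It assumes a synthetic dataset with one fully recorded variable (a hypothetical age) and one partly missing variable (a hypothetical body mass index), fills the gaps using a simple regression fitted to a bootstrap sample of the complete records, and combines the results with Rubin's rules. All function and variable names are made up for illustration.

```python
import numpy as np


def impute_once(x, y, rng):
    """Return a copy of y with missing entries (NaN) filled by one random draw."""
    observed = ~np.isnan(y)
    # Bootstrap the complete records so that each imputation also reflects
    # uncertainty in the fitted regression line (an approximation).
    idx = rng.choice(np.flatnonzero(observed), size=int(observed.sum()), replace=True)
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    residual_sd = np.std(y[idx] - (intercept + slope * x[idx]), ddof=2)
    filled = y.copy()
    n_missing = int((~observed).sum())
    # Predicted value plus random noise, so imputed values are not "too good".
    filled[~observed] = (
        intercept + slope * x[~observed] + rng.normal(0.0, residual_sd, n_missing)
    )
    return filled


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and variances using Rubin's rules."""
    m = len(estimates)
    pooled = float(np.mean(estimates))
    within = float(np.mean(variances))            # average sampling variance
    between = float(np.var(estimates, ddof=1))    # variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


def multiply_impute_mean(x, y, n_imputations=5, seed=0):
    """Estimate the mean of y from several imputed copies of the data."""
    rng = np.random.default_rng(seed)
    estimates, variances = [], []
    for _ in range(n_imputations):
        filled = impute_once(x, y, rng)
        estimates.append(filled.mean())
        variances.append(filled.var(ddof=1) / filled.size)
    return pool_rubin(estimates, variances)


# Synthetic example data (assumed for illustration only).
data_rng = np.random.default_rng(42)
x = data_rng.normal(50.0, 10.0, 2000)                  # hypothetical age
y = 20.0 + 0.1 * x + data_rng.normal(0.0, 2.0, 2000)   # hypothetical BMI
# BMI is more likely to be missing for older patients.
missing = data_rng.random(2000) < 1.0 / (1.0 + np.exp(-(x - 50.0) / 5.0))
y_obs = y.copy()
y_obs[missing] = np.nan

pooled_mean, total_var = multiply_impute_mean(x, y_obs)
print("Mean using complete records only:", np.nanmean(y_obs))
print("Pooled mean after multiple imputation:", pooled_mean)
print("Pooled standard error:", total_var ** 0.5)
```

In this sketch every imputation refits the model and refills the whole dataset, so the running time grows with both the number of records and the number of imputations.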

Newly available, large-scale EHR datasets provide researchers with the opportunity to address many new research questions. When answering these questions, researchers may need to use multiple imputation to reduce bias associated with missing data. However, this is hard to do when datasets contain records on many millions of people, because the analyses can take a long time even when very fast computers are available.

Project aims

During this study we will:

Explore the impact of increasing the number of variables when missing data causes bias and the dataset is very large
Explore the impact of reducing the number of imputations and using different sampling strategies when implementing multiple imputation

What we hope to achieve

Our findings will provide useful guidance for practice and will inform further research that can begin to explore the use of artificial intelligence in the selection of variables.