Exploring the complexity of real-world health data record linkage - An exemplary study linking cancer registry and claims data
  • Nadja Lendle,
  • Kollhorst B,
  • Timm Intemann
Nadja Lendle
Leibniz-Institut für Präventionsforschung und Epidemiologie - BIPS GmbH

Corresponding Author: [email protected]

Kollhorst B
Leibniz-Institut für Präventionsforschung und Epidemiologie - BIPS GmbH
Timm Intemann
Leibniz-Institut für Präventionsforschung und Epidemiologie - BIPS GmbH

Abstract

Purpose: Record linkage based on quasi-identifiers remains an important approach because not every data source provides a comprehensive unique identifier. This study examined why linkage based on quasi-identifiers fails. In addition, informed algorithms that use information on gold-standard links were developed to investigate the linkage quality potentially achievable with quasi-identifiers.

Methods: Linkage algorithms were applied to German claims and cancer registry data using information on gold-standard links. Informed linkage algorithms based on deterministic linkage, logistic regression, random forests, gradient boosting and neural networks were derived and compared. Descriptive analyses were performed to identify reasons for linkage failure, such as discrepancies between the data sources.

Results: A linkage approach based on gradient boosting performed best, reaching a precision of 77%, a recall of 81% and an F*-measure of 64%. Of 641 patients in GePaRD, 8% were not uniquely identifiable using birth year, sex, area of residence, and year and quarter of diagnosis, whereas 33% of 42,817 cancer registry patients from Bremen and Lower Saxony were not uniquely identifiable with these quasi-identifiers.

Conclusions: Linkage of German claims and cancer registry data based on quasi-identifiers results in insufficient linkage quality because subjects cannot be uniquely identified. It is advisable to use unique identifiers from a subsample, if available, to derive informed linkage algorithms for the entire sample. In this study, gradient boosting outperformed the other methods.
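The two kinds of figures reported in the abstract can be illustrated with a short Python sketch. It rests on assumptions the abstract does not confirm. First, it takes the F*-measure to be TP / (TP + FP + FN), which can be written in terms of precision and recall. The abstract does not define the measure, so this is only one plausible reading. Second, it counts a record as "not uniquely identifiable" when at least one other record in the same data source has the same quasi-identifier combination. The function names and the toy records are hypothetical and are not taken from the study.

```python
from collections import Counter


def f_star(precision, recall):
    """F* under the assumed definition TP / (TP + FP + FN).

    Expressed via precision P and recall R: F* = P*R / (P + R - P*R).
    """
    denom = precision + recall - precision * recall
    return precision * recall / denom if denom > 0 else 0.0


def share_not_unique(records):
    """Fraction of records whose quasi-identifier tuple occurs more than once."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] > 1) / len(records)


# Hypothetical toy records: (birth year, sex, area, diagnosis year, quarter)
toy_records = [
    (1950, "f", "A", 2010, 1),
    (1950, "f", "A", 2010, 1),  # same combination as the record above
    (1962, "m", "B", 2011, 3),
]

print(round(f_star(0.77, 0.81), 3))
print(share_not_unique(toy_records))
```

With the rounded precision and recall from the abstract, this definition gives roughly 0.65, in the neighborhood of the reported 64%. The small gap could stem from rounding of the inputs or from a different definition in the full paper.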
Submitted to Pharmacoepidemiology and Drug Safety
Submission Checks Completed
Assigned to Editor
Reviewer(s) Assigned
21 Jul 2024 Reviewer(s) Assigned
04 Sep 2024 Review(s) Completed, Editorial Evaluation Pending
13 Sep 2024 Editorial Decision: Revise Major
13 Nov 2024 1st Revision Received
14 Nov 2024 Submission Checks Completed
14 Nov 2024 Assigned to Editor
14 Nov 2024 Review(s) Completed, Editorial Evaluation Pending
15 Nov 2024 Reviewer(s) Assigned