Exploring the complexity of real-world health data record linkage - An
exemplary study linking cancer registry and claims data
Abstract
Purpose: Record linkage based on quasi-identifiers remains an
important approach as not every data source provides a comprehensive
unique identifier. In this study, reasons for the failure of a linkage
based on quasi-identifiers were examined. Furthermore, informed
algorithms using information on gold-standard links were developed to
investigate the potentially achievable linkage quality based on
quasi-identifiers. Methods: Linkage algorithms were applied to
German claims and cancer registry data using information on
gold-standard links. Informed linkage algorithms based on deterministic
linkage, logistic regression, random forests, gradient boosting and
neural networks were derived and compared. Descriptive analyses were
performed to identify reasons for failure of linkage such as
discrepancies between data sources. Results: A linkage approach
based on gradient boosting performed best and reached a precision of
77%, a recall of 81% and an F*-measure of 64%. Of 641 patients in
GePaRD, 8% were not uniquely identifiable using birth year, sex, area
of residence, year and quarter of diagnosis, whereas 33% of 42,817
patients in the cancer registries of Bremen and Lower Saxony were not uniquely
identifiable with these quasi-identifiers. Conclusions: Linkage
of German claims and cancer registry data based on quasi-identifiers
results in insufficient linkage quality because subjects cannot be
uniquely identified. It is advisable to use unique identifiers from a
subsample, if available, to derive informed linkage algorithms for the
entire sample. In this study, gradient boosting outperformed the other
methods compared.
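
The uniqueness figures reported in the Results can be illustrated with a minimal sketch. It counts the share of records whose combination of quasi-identifiers (birth year, sex, area of residence, year and quarter of diagnosis) is shared with at least one other record in the same data source. The field names, the function share_not_unique and the toy records are assumptions made for illustration; they are not taken from the study's data or code.

```python
from collections import Counter

# Quasi-identifiers named in the abstract; the field names are hypothetical.
QUASI_IDENTIFIERS = ("birth_year", "sex", "area", "diag_year", "diag_quarter")


def share_not_unique(records, keys=QUASI_IDENTIFIERS):
    """Return the fraction of records whose key combination occurs more than once."""
    if not records:
        return 0.0
    combos = [tuple(rec[k] for k in keys) for rec in records]
    counts = Counter(combos)
    # A record is ambiguous if at least one other record shares its combination.
    ambiguous = sum(1 for combo in combos if counts[combo] > 1)
    return ambiguous / len(records)


# Toy data, invented for illustration only.
toy_registry = [
    {"birth_year": 1950, "sex": "F", "area": "A", "diag_year": 2010, "diag_quarter": 1},
    {"birth_year": 1950, "sex": "F", "area": "A", "diag_year": 2010, "diag_quarter": 1},
    {"birth_year": 1948, "sex": "M", "area": "B", "diag_year": 2011, "diag_quarter": 3},
    {"birth_year": 1962, "sex": "F", "area": "A", "diag_year": 2010, "diag_quarter": 2},
]

print(share_not_unique(toy_registry))
```

In this toy example the first two records share all five quasi-identifiers, so half of the records are not uniquely identifiable. Any record in such a group cannot be linked with certainty on these fields alone, which is the limitation the Conclusions refer to.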