Secure Discovery of Genetic Relatives across Large-Scale and Distributed Genomic Datasets

M. Hong, David Froelicher, R. Magner, V. Popic, B. Berger, H. Cho

December, 2023

Abstract

Finding related individuals in genomic datasets is a necessary step in many genetic analysis workflows and has broader societal value as a tool for retrieving lost relatives. However, detecting such relationships is often infeasible when the dataset is distributed across multiple entities due to privacy concerns. Although cryptographic techniques for secure computation offer ways to jointly analyze distributed datasets with privacy guarantees, the sheer computational burden of the operations required for identifying kinship, such as pairwise sequence comparison of all individuals across large datasets, presents key challenges to developing a practical privacy-preserving solution for this task. We introduce SF-Relate, a secure federated algorithm for identifying genetic relatives across data silos that scales efficiently to large datasets that include hundreds of thousands of individuals. We leverage the key insight that the number of individual pairs to compare can be vastly reduced, while maintaining accurate detection, by using locality-sensitive hashing to group individuals who are likely to be related into buckets and then testing relationships only between individuals in corresponding buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with our novel bucketing strategies, is key to achieving accurate and practical private relative detection. To guarantee privacy, we devise an efficient algorithm based on multiparty homomorphic encryption (MHE) that allows the parties to cooperatively compute the relatedness coefficients between pairs of individuals, and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on real genomic datasets of varying sizes from the UK Biobank and All of Us datasets. On the largest dataset of 200K individuals split between two parties, SF-Relate securely detects 94.9% of third-degree relatives and 99.9% of relatives of second degree or closer within 15 hours of runtime. Our work enables secure identification of relatives across large-scale genomic datasets.
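The bucketing idea in the abstract can be illustrated with a minimal plaintext sketch. It is not SF-Relate's hash function or protocol: it omits encryption entirely, and the window-based key, the function names (`window_keys`, `build_buckets`, `candidate_pairs`), and the toy haplotypes are all assumptions made for illustration. The sketch only shows how hashing shared segments into buckets can shrink the set of cross-party pairs that need to be compared.

```python
from collections import defaultdict
from itertools import product


def window_keys(haplotype, window, stride):
    """Yield one bucket key per window: (start position, alleles in the window)."""
    for start in range(0, len(haplotype) - window + 1, stride):
        yield (start, tuple(haplotype[start:start + window]))


def build_buckets(party, window, stride):
    """Map each bucket key to the set of sample IDs that hash to it."""
    buckets = defaultdict(set)
    for sample_id, haplotype in party.items():
        for key in window_keys(haplotype, window, stride):
            buckets[key].add(sample_id)
    return buckets


def candidate_pairs(party_a, party_b, window=4, stride=4):
    """Return cross-party pairs that share at least one bucket key."""
    buckets_a = build_buckets(party_a, window, stride)
    buckets_b = build_buckets(party_b, window, stride)
    pairs = set()
    # Only buckets present on both sides can produce candidate pairs.
    for key in buckets_a.keys() & buckets_b.keys():
        pairs.update(product(buckets_a[key], buckets_b[key]))
    return pairs


# Toy haplotypes (hypothetical data): a1 and b1 share the first 4-site segment.
party_a = {"a1": [0, 1, 1, 0, 1, 0, 0, 1], "a2": [1, 1, 1, 1, 0, 0, 0, 0]}
party_b = {"b1": [0, 1, 1, 0, 0, 1, 1, 1], "b2": [0, 0, 0, 0, 1, 1, 1, 1]}

pairs = candidate_pairs(party_a, party_b)
all_pairs = len(party_a) * len(party_b)
print(f"{len(pairs)} candidate pair(s) instead of {all_pairs}: {sorted(pairs)}")
```

In this toy setting only the pair sharing an identical segment at the same position is kept for comparison. In the approach described above, the relatedness test on such candidate pairs would be carried out under multiparty homomorphic encryption rather than in the clear.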

Publication

RECOMB 2024 (M.H. received the best student paper award); under revision at Genome Research

David Froelicher

Research Manager