Genomic Data from Multiple Data Sets: Methods, Pros, and Cons

Monday, May 11, 2015

Jane Costello Ph.D., Duke University

 

Description

In the search for the causes of many diseases, gene discovery has moved from identifying rare high-risk variants to identifying common low-risk variants, and the sample sizes required for adequate power have increased correspondingly. In the study of gene-by-environment interaction (G-E) models of disease risk, pooling data from completed or ongoing studies is viewed as a time- and cost-effective alternative to conducting large new investigations designed to collect detailed phenotypic and “envirotypic” information. Unfortunately, attempts to pool cases from existing data sets have faced significant challenges to date, in part because studies lack consistent rules and methods for making diagnoses and for defining environmental risk.

Our project sought to develop and test a new methodology for pooling data from studies that used different measures to assess the same or similar constructs. In the present investigation, data were pooled from the National Longitudinal Study of Adolescent Health (Add Health), the Great Smoky Mountains Study (GSMS), and the Child Development Project (CDP). The proposed data harmonization methodology involves creating a calibration data set in which two or more measures of the same or similar constructs, obtained from the same participants, are compared and the scores on each measure are mapped onto the other. Calibration samples may be internal to the primary samples of scientific interest (if both measures were used in an existing data set) or external (obtained de novo); our work involved both types of samples. We hope this investigation will provide an important tool for many areas of genomic research.
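
To make the idea of a calibration data set concrete, the sketch below maps scores from one measure onto the scale of another using simple linear (mean and standard deviation) linking. This is an illustration only: the description does not state which mapping technique the project used, and linear linking is swapped in here as an assumed stand-in. The function names, the two measures "A" and "B", and all scores are hypothetical.

```python
from statistics import mean, stdev


def fit_linear_link(scores_a, scores_b):
    """Estimate a linear map from measure A's scale onto measure B's scale.

    Both lists must come from the SAME participants (the calibration
    sample), in the same order. Linear linking matches the mean and
    standard deviation of A to those of B; it is an assumed stand-in,
    not the method described in the abstract.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("calibration scores must be paired per participant")
    if len(scores_a) < 2:
        raise ValueError("need at least two participants to estimate spread")
    slope = stdev(scores_b) / stdev(scores_a)
    intercept = mean(scores_b) - slope * mean(scores_a)
    return slope, intercept


def map_a_to_b(score_a, slope, intercept):
    """Express a measure-A score on measure B's scale."""
    return slope * score_a + intercept


# Hypothetical calibration sample: five participants who completed both measures.
calib_a = [2, 4, 6, 8, 10]
calib_b = [11, 15, 19, 23, 27]

slope, intercept = fit_linear_link(calib_a, calib_b)

# Hypothetical primary-sample participants who completed only measure A.
primary_a = [3, 5]
harmonized = [map_a_to_b(s, slope, intercept) for s in primary_a]
```

Once the link is estimated from the calibration sample, it can be applied to any study that collected only measure A, so that its scores can be pooled with studies that collected measure B. In practice, a non-linear approach may be more appropriate when the two measures do not have similarly shaped score distributions.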