Researchers build SEQSpark to analyze massive genetic data sets

June 30, 2017, Baylor College of Medicine

Uncovering rare susceptibility variants that contribute to the causes of complex diseases requires large sample sizes and massively parallel sequencing technologies. The resulting data sets, often made up of exome and genome data from tens to hundreds of thousands of individuals, are frequently too large for current analytical tools to process. A team at Baylor College of Medicine, led by Dr. Suzanne Leal, professor of molecular and human genetics, has developed new software called SEQSpark to overcome this processing obstacle. A study on the new technology appears in The American Journal of Human Genetics.

"To handle these large data sets, we built the SEQSpark tool based on the commonly used Spark program, which allows SEQSpark to utilize multiple processing platforms to increase the speed and efficiency of performing data quality control, annotation and rare variant association analysis," Leal said.

To test and validate the versatility and speed of SEQSpark, Leal and her team ran benchmark analyses on the whole-genome sequence data from the UK10K, testing specifically for waist-to-hip ratios.

"The analysis and related tasks took about one and a half hours to complete, in total. This includes loading the data, annotation, principal components analysis and single and rare variant aggregate association analysis for the more than 9 million variants present in this sample set," explained Di Zhang, a postdoctoral associate in the Leal lab at Baylor and first author on the paper.

To evaluate SEQSpark's performance in a larger data set, Leal and the research team generated 50,000 simulated exomes. The SEQSpark program ran the analysis for a quantitative trait using several variant aggregate association methods in an hour and forty-five minutes.

When compared to other variant association tools, SEQSpark was consistently faster, reducing computation to a hundredth of the time in some cases.

"What is unique about SEQSpark is that it is scalable, and smaller labs can run it without super specific hardware, and it can also be run in a multi-server environment to increase its speed and capacity for large genetic data sets," Zhang said. "It is ideal for large-scale genetic epidemiological studies and is highly efficient from a computational standpoint."

"We see this software as being very useful as the demand for the analysis of massively parallel sequence data grows. SEQSpark is highly versatile, and as we analyze increasingly large sets of rare variant data, it has the potential to play a key role in furthering personalized medicine," Leal said.

In the future, Leal and her team will continue to test and extend SEQSpark's capabilities and will soon be analyzing data sets that have 500,000 samples or more.

More information: Di Zhang et al. SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies using Whole-Genome and Exome Sequence Data, The American Journal of Human Genetics (2017). DOI: 10.1016/j.ajhg.2017.05.017
