Complex Trait Genetics Studies

My phenotypic interest focuses specifically on understanding the genetic basis of type 2 diabetes, cardiovascular disease, and related cardiometabolic traits. Broadly, we collect publicly available summary data from large-scale studies of these diseases, or collaborate with groups that have access to large genetic resources. We pursue locus discovery using both existing methodologies and those we develop, but we also build methods to make inference "post-GWAS" - to identify causal variants, genes, mechanisms, tissues, and ultimately therapeutically actionable targets around which pharmacological pipelines can be developed.

Studies in the Million Veteran Program (MVP). Beyond the genetic analysis of publicly available data, I am working collaboratively with investigators at the Million Veteran Program (MVP), which has already genotyped >200,000 participants whose data are linked to health records and for whom numerous phenotypes are already available. Numerous projects are possible with these data, including GWAS, multi-phenotype mapping, causal inference via Mendelian Randomization, data integration with existing functional genomics data, and population-genetics efforts.

Studies in the Pakistani Genomics Resource (PGR). In collaboration with Danish Saleheen (Univ. of Penn, CVI, CCEB), and other groups representing populations from Southeast Asia, we can use these large-scale resources (>50k individuals with genotyping and/or resequencing variation data, many deeply phenotyped) to help with locus discovery as well as with the interpretation of causal variants and genes for cardiometabolic disease. The PGR is a unique resource, with extended tracts of homozygosity and the potential for call-back phenotyping. Our role is to develop analytical or methodological approaches that make the best use of these data for population genetics, genetic studies, or the development of leads for drug targeting.

Mendelian Randomization: Methodology and Application. Traditional observational methods in epidemiology are limited in their ability to infer causal relationships between intermediate traits or biomarkers and disease liability. While randomized controlled trials are one standard by which causal relationships can be inferred, naturally segregating genetic variation in the population can also be used to make such inference. This approach, dubbed Mendelian Randomization (MR), capitalizes on the fact that alleles segregating at meiosis assort randomly, facilitating causal inference tests that minimize issues of confounding or reverse causality in interpretation. Research in the lab seeks to extend the statistical application of this methodology and to provide resources that make the approach more accessible and sophisticated for the community, ultimately with the aim of identifying causal factors implicated in human disease. Active effort in the lab seeks to test epidemiologically correlated factors against a range of important clinical outcomes. See publications for a listing of recent applications to T2D, heart disease, and migraine with a range of intermediate traits.
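
As a rough illustration of how MR turns GWAS summary data into a causal estimate, the sketch below computes a fixed-effect inverse-variance weighted (IVW) estimate from per-variant effects on an exposure and an outcome. It is a minimal textbook-style sketch, not the lab's own methodology: the function name ivw_estimate and the toy numbers are hypothetical, and it assumes independent variants that are valid instruments with negligible uncertainty in the exposure effects.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted MR estimate.

    Each list holds one entry per genetic variant (instrument):
    the variant's effect on the exposure, its effect on the outcome,
    and the standard error of the outcome effect.
    Returns (causal_estimate, standard_error).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have one entry per variant")

    numerator = 0.0
    denominator = 0.0
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        weight = 1.0 / se ** 2          # precision of the outcome effect
        numerator += weight * bx * by
        denominator += weight * bx * bx

    if denominator == 0:
        raise ValueError("instruments have no effect on the exposure")

    return numerator / denominator, math.sqrt(1.0 / denominator)


# Toy summary statistics (made up for illustration only).
estimate, std_err = ivw_estimate(
    beta_exposure=[0.10, 0.20, 0.30],
    beta_outcome=[0.05, 0.10, 0.15],
    se_outcome=[0.01, 0.01, 0.01],
)
print(f"IVW causal estimate: {estimate:.3f} (SE {std_err:.3f})")
```

The estimate is equivalent to a weighted regression of outcome effects on exposure effects through the origin; real analyses typically add sensitivity checks for pleiotropy, which this sketch omits.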

Human Metabolic Tissue Resource. Model systems and human genetic studies consistently implicate adipose biology as central to cardiometabolic disease. A key insight that has emerged from these data is that master regulators of adipocyte differentiation are clearly fundamental to disease etiology. Yet the complete set of gene networks and the regulatory landscape that these factors control in humans, and their relevance to disease, are not fully understood. To address these fundamental questions, the community requires unbiased surveys of the genomic regulatory landscape of this important tissue. Unfortunately, we lack specific knowledge of the regions of accessible chromatin in primary adipose tissues, of how this landscape varies naturally in human populations and across adipose depots and physiologic states of obesity, and of the genetic determinants associated with accessible chromatin states. Our pilot project, in collaboration with the Soccio lab, aims to perform genomic profiling in n=48 individuals in two distinct adipose tissues (subcutaneous, visceral), including DNA variation, RNA-Seq, TF occupancy (PPARg), assays of open chromatin (ATAC-Seq), and 3D chromatin organization. These data will provide a unique resource in which human genetic associations for cardiometabolic disease can be interpreted.

Methods development. We have a basic interest in developing methods across a range of clinical problems, virtually all of which are motivated by new types of data generated by sequencing, or by challenges in data integration for biological inference. Examples include: a collaboration with Casey Brown (Genetics, UPenn) that aims to improve methods for splice-QTL discovery; methods for the analysis of single-cell ATAC-Seq data; multivariate trait GWAS mapping methods; methods that take advantage of functional genomics data in which multiple tissues are profiled with multiple assays in multiple subjects; and machine learning methods to prioritize causal variants, genes, and loci from GWAS scans.

Population Genomics

Algorithms and analysis for biological inference. My approach to problems generally begins with population genetics thinking or frameworks. There are a number of problems for which algorithms that ask basic questions in this space can be repurposed and extended for the data types we are collecting today. Moreover, an empirical understanding of basic population genetic phenomena (e.g., selection and mutation) in human genomes can provide a resource to improve the power for biological discovery and statistical inference, and can motivate new methods that use this information to ask basic questions about the role of sequence variation in disease or complex traits. Examples include empirical mutation rate models for rare-variant tests of association, and inference of background selection to create a list of 'essential' genes, or genes that are not tolerant of loss-of-function mutations, or non-coding sequences that may not tolerate new mutations. Others include algorithms and models used routinely in population genetics that could be deployed to address analogous questions in functional genomics.

Models that capture variation in the mutation rate at base-pair resolution. The rate of mutation varies substantially by position in human and mammalian genomes and fundamentally influences evolution and the incidence of genetic disease. Using a novel statistical framework that we developed and applied to large-scale human population genomics data, we showed that the three nucleotides of sequence context on either side of a polymorphic site - a seven-nucleotide window in total - explained >81% of variability in substitution probabilities and highlighted new, mutation-promoting motifs. There are many open projects available to build upon this methodology, apply it to data, and develop statistics for the analysis of complex disease. Ongoing work in the lab aims to develop these models further, incorporate genomic annotations into the inference process, characterize rates of private mutation across the genome, or develop methods to discover recurrent mutations in genomics data.
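
To make the idea of a seven-nucleotide sequence context concrete, the sketch below tabulates, for each 7-mer window in a reference sequence, the fraction of occurrences at which the central base is observed to be polymorphic for a given alternate allele. This is a deliberately naive illustration and not the statistical framework described above: the function name context_substitution_probs, the input format (a dict of 0-based positions to alternate alleles), and the toy sequence are all assumptions made for the example, and it does not fold reverse-complement contexts together.

```python
from collections import Counter


def context_substitution_probs(reference, variants, flank=3):
    """Estimate per-context substitution proportions.

    reference: reference sequence as a string of A/C/G/T.
    variants:  dict mapping 0-based position -> alternate allele.
    flank:     bases of context on each side (3 gives a 7-mer window).

    Returns a dict keyed by (context, alt_allele) whose value is the
    fraction of occurrences of that context carrying that substitution.
    """
    context_counts = Counter()
    substitution_counts = Counter()

    for i in range(flank, len(reference) - flank):
        context = reference[i - flank:i + flank + 1]
        if set(context) - set("ACGT"):
            continue  # skip windows with ambiguous bases such as N
        context_counts[context] += 1
        alt = variants.get(i)
        if alt is not None and alt != reference[i]:
            substitution_counts[(context, alt)] += 1

    return {
        (context, alt): count / context_counts[context]
        for (context, alt), count in substitution_counts.items()
    }


# Toy example: the context ACGTACG occurs three times; one occurrence
# carries a T>C polymorphism at its central base.
toy_reference = "ACGTACGTACGTACG"
toy_variants = {3: "C"}
for key, prob in context_substitution_probs(toy_reference, toy_variants).items():
    print(key, round(prob, 3))
```

A real analysis would work from genome-scale variant calls and compare nested context models statistically; the table of proportions here only shows what "substitution probability by sequence context" means operationally.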

Inference of natural selection in the human genome. I have had a long-standing interest in learning about the selective forces that have shaped the evolution of modern humans. We focus on developing intuitive statistical summaries of data that facilitate the inference of selection from the patterns of genetic variation we observe in genomes. This includes methods to detect recent positive selection or long-term balancing selection. We would also like to understand the specific targets of selection, and which phenotypes or processes may have been subject to selection in recent human history; for example, by measuring the extent to which recent positive selective sweeps are shared across populations in the human genome. Finally, inference of background and purifying selection is crucial for improving the inference of causal variants - both coding and non-coding - underlying complex human disease. To make this inference, we aim to use our improved understanding of the underlying mutation rates, based on sequence context, functional annotations, or otherwise, to determine sequences that may have been constrained in recent history.
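
As one concrete example of a statistical summary used to look for departures from neutrality, the sketch below computes Tajima's D, a classic statistic (not one of the lab's own methods) that contrasts the average number of pairwise differences with the number of segregating sites. The function name tajimas_d and the haplotype encoding (equal-length strings of '0' for the ancestral allele and '1' for the derived allele) are assumptions made for illustration.

```python
import math


def tajimas_d(haplotypes):
    """Tajima's D for a list of equal-length '0'/'1' haplotype strings.

    Returns None when there are no segregating sites or the variance
    term is not positive.
    """
    n = len(haplotypes)
    if n < 4:
        raise ValueError("need at least 4 haplotypes")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must have equal length")

    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    pi = 0.0  # average number of pairwise differences
    for j in range(length):
        derived = sum(1 for h in haplotypes if h[j] == "1")
        if 0 < derived < n:
            seg_sites += 1
            pi += derived * (n - derived) / n_pairs

    if seg_sites == 0:
        return None

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return None
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Toy data: an excess of rare variants gives a negative D, while an
# excess of intermediate-frequency variants gives a positive D.
print(tajimas_d(["100", "010", "001", "000"]))
print(tajimas_d(["111", "111", "000", "000"]))
```

Negative values are often read as consistent with a recent sweep or population expansion and positive values with balancing selection or population structure, which is why demographic history has to be considered before interpreting such a summary as evidence of selection.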

Araneae (Spider) Genomics

Next-generation sequencing technologies have delivered transformative science in human genetics, but are only just spilling over into applications in ecology and evolutionary biology. One goal of the lab is to contribute actively to these efforts, taking advantage of previous experience handling large-scale data sets and developing population genetic methods in humans:

Genomics of the order Araneae. Despite their deep mythology, historical interest, unusual sex-dimorphic trait distributions, and production of unusual macromolecules (venom and silk), little is known genetically about spider species at high molecular resolution. To close this gap in knowledge, we have generated the first draft genome of an orb-weaving spider, Nephila clavipes (the golden orb-weaver), augmented and annotated with transcriptomic profiles obtained by whole-body RNA-seq. We used this assembly to discover and characterize the repeat-pattern structure of all silk fibroin ("spidroin") and silk fibroin-like genes in the genome and their expression across morphologically distinct silk glands. In future collaborative work, we aim to generate a draft genome assembly of Caerostris darwini (Darwin's bark spider), known for the webs it builds over rivers in its native Madagascar, and whose silk draglines are among the toughest biomaterials known.

Developing a spider model system for studies of silk. A central question in silk biology is to understand the contribution, if any, of the various repetitive sequences found in spidroins to the biophysical properties that different types of silk exhibit (e.g., strength, extensibility, adhesion). There exist in vitro (e.g., minispidroin) and in vivo (e.g., silkworm) systems that can generate some types of artificial silk constructs, but these have a range of limitations. Ideally, one would like to co-opt the system of silk production from spiders directly, in terms of the glands that produce a given silk and the organs and mechanisms that the spider uses to create silk naturally. Until recently, direct genomic engineering of non-model systems has not been routinely feasible, but disruptive technologies like the CRISPR/Cas9 system could make feasible experiments that add, subtract, or otherwise transgenically modify silk genes in order to determine the sequence-based contributions to silk properties and, ultimately, the 'minimum sequence unit' required to confer a given property onto silk.