HHMI Hanna H. Gray fellow
Broadly, I am interested in how human genetic variation affects molecular phenotypes with the goal of understanding how genes regulate each other. The majority of my work is in developing computational and statistical methods for understanding modern genomic data, primarily RNA-seq.
Previously, I received my PhD from Berkeley EECS studying computational genomics in Prof. Lior Pachter’s group.
Beyond academics, I enjoy cycling, running, and food. I am a strong proponent of alternative input devices, particularly voice recognition (read more here). In fact, I use Talon to perform most of my work.
An updated version of my CV (including publications) can be found here.
Generally, I am interested in computational genomics and the role that high dimensional statistics, algorithms, and machine learning play in modern biology.
Most of my graduate work has focused on general methods for RNA-Seq analysis. I have worked on transcript level differential expression, transcript quantification, transcriptome alignment, alternative splicing estimation, intron retention estimation, and other related problems.
transcript level differential expression analysis
A major challenge in transcript level differential expression analysis with RNA-Seq is that we are often in a small sample setting (a total of 6 samples is very common), which results in unstable estimators and poor statistical power. Traditionally, methods have performed shrinkage across all of the features (e.g. genes or transcripts) in order to reduce the variability in estimates and gain some power. What these methods often fail to do is decompose the variability into biological variability and inferential variability (resulting from transcript abundance estimation). In our method for differential expression analysis, sleuth, we estimate the inferential and sampling variability using kallisto and apply shrinkage directly to the biological variability. This shrinkage estimation leads to increased power when performing transcript level differential expression analysis.
If you’d like to learn more, check out our paper in Nature Methods.
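The decomposition described above can be illustrated with a small sketch. This is not sleuth's actual implementation: the function names, the single-condition setup, and the "take the larger of the raw and trend variance" shrinkage rule are all simplifying assumptions made for illustration. The idea is that the variance observed across replicates contains both biological and inferential variability, and bootstrap replicates give an estimate of the inferential part that can be subtracted.

```python
from statistics import mean, variance


def decompose_variance(point_estimates, bootstraps):
    """Split across-sample variance for one transcript into two parts.

    point_estimates: one log-abundance estimate per sample (hypothetical input).
    bootstraps: for each sample, a list of bootstrap log-abundance estimates.
    Returns (biological_variance, inferential_variance).
    """
    # Total variance observed across the replicate samples.
    total_var = variance(point_estimates)
    # Inferential variance: average within-sample bootstrap variance.
    inferential_var = mean(variance(b) for b in bootstraps)
    # Biological variance is what remains, floored at zero.
    biological_var = max(total_var - inferential_var, 0.0)
    return biological_var, inferential_var


def shrink(raw_var, trend_var):
    """Illustrative shrinkage rule (an assumption): never go below the
    variance suggested by a trend fitted across all transcripts."""
    return max(raw_var, trend_var)


# Toy data: three replicates, two bootstraps each.
bio, inf = decompose_variance(
    [1.0, 2.0, 3.0],
    [[0.9, 1.1], [1.9, 2.1], [2.9, 3.1]],
)
shrunk = shrink(bio, trend_var=1.5)
```

Here `trend_var` stands in for a value borrowed from other transcripts with similar abundance; how that trend is fitted is outside the scope of this sketch.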
transcript abundance estimation
Kallisto is a very fast RNA-Seq transcript abundance estimation tool that eliminates the need to map reads by using a process called pseudoalignment. A pseudoalignment is simply an indicator of which transcripts a read is compatible with. After pseudoalignment, the expectation-maximization (EM) algorithm is run on a simplified likelihood based on “equivalence classes,” the groups of reads that are compatible with the same set of transcripts. This matters because the EM algorithm then scales with the complexity of the transcriptome (the number of equivalence classes) rather than the number of reads. In fact, this added efficiency is part of what makes sleuth possible. Because this representation has a small memory footprint, we can very rapidly resample the data to produce bootstrap estimates of the inferential and sampling variability inherent in RNA-Seq.
If you’d like to learn more, check out our paper in Nature Biotechnology.
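To make the equivalence class idea concrete, here is a minimal sketch of an EM loop whose per-iteration cost depends on the number of equivalence classes rather than the number of reads. It is a toy illustration, not kallisto's code: the function name, the dictionary input format, and the default of unit effective lengths are assumptions chosen for brevity.

```python
def em_equivalence_classes(eq_classes, n_transcripts, eff_lengths=None,
                           max_iter=1000, tol=1e-12):
    """Estimate transcript read proportions from equivalence class counts.

    eq_classes: dict mapping a tuple of transcript indices (the set of
        transcripts a group of reads is compatible with) to a read count.
    eff_lengths: optional effective lengths; defaults to 1.0 for all
        transcripts (a simplifying assumption).
    Returns a list of proportions, one per transcript.
    """
    if eff_lengths is None:
        eff_lengths = [1.0] * n_transcripts
    total_reads = sum(eq_classes.values())
    alpha = [1.0 / n_transcripts] * n_transcripts  # uniform start

    for _ in range(max_iter):
        assigned = [0.0] * n_transcripts
        # E-step: loop over classes, not reads.
        for transcripts, count in eq_classes.items():
            weights = [alpha[t] / eff_lengths[t] for t in transcripts]
            denom = sum(weights)
            if denom == 0.0:
                continue
            # Split this class's reads in proportion to current weights.
            for t, w in zip(transcripts, weights):
                assigned[t] += count * w / denom
        # M-step: renormalize the fractional assignments.
        new_alpha = [a / total_reads for a in assigned]
        change = max(abs(x - y) for x, y in zip(alpha, new_alpha))
        alpha = new_alpha
        if change < tol:
            break
    return alpha


# Toy example: 30 reads unique to transcript 0, 10 unique to
# transcript 1, and 60 reads compatible with both.
props = em_equivalence_classes({(0,): 30, (1,): 10, (0, 1): 60}, 2)
```

Because only the class counts are stored, a bootstrap replicate can be produced by resampling those counts and rerunning the same loop, without revisiting individual reads.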
Over the last few years there has been some interest in identifying intron retention using RNA-Seq. While there are several methods for quantifying these introns if they are already annotated, there aren’t any for identifying novel retention events. We developed a method, KeepMeAround, to identify retained introns. This project grew out of our curiosity about the classes of genes that retain introns in terminal erythropoiesis. While the method is simply an extension of transcript abundance estimation, many of the events that were identified in RNA-Seq were validated using other protocols. If you’d like to learn more, check out our paper.
You can find software that I have worked on below. My contributions vary from project to project. You can check the GitHub contributions to see how much of each package I have written.
- sleuth is a transcript level differential expression analysis tool for RNA-Seq. It is different from most tools in that it incorporates the inferential variability from transcript abundance estimation.
sleuth also allows scientists to interactively explore their data using an R Shiny application, which can be easily shared to encourage reproducibility and scientific sharing. A demo of this app can be found at the bear's lair.
- kallisto is an incredibly fast and accurate transcript abundance estimation tool for RNA-Seq. It forgoes the costly step of alignment by implementing a novel idea which we call pseudoalignment. Pseudoalignment, along with a re-parameterization of the likelihood, allows for extremely fast inference of transcript abundances, which is now possible on a laptop in about 5 minutes.
- KeepMeAround is a tool to quantify novel intron retention events. We used it to find several classes of intron retention in terminal erythropoiesis.
- fscca is a reimplementation of the NIPALS SCCA algorithm in C++ and R.
- TopHat is a spliced RNA-Seq read aligner. I implemented most of the transcriptome mapping mode since v1.4.