I am working on my Ph.D. in the Department of Statistics & Data Science at Carnegie Mellon University, under the supervision of Professor Kathryn Roeder and Professor Jing Lei. My research mainly focuses on developing statistical methods to understand complex data dependencies found in epigenetics applications. Recently, I have been researching changepoint detection and network models. I also have a growing interest in topics relating to statistical practice that formalize how data scientists interact with their datasets. The code for my research is available on GitHub.
I also have an M.S. in Machine Learning from Carnegie Mellon University. Before coming to CMU, I received my B.S.E. in Operations Research and Financial Engineering from Princeton University in 2014, along with a Certificate in Statistics and Machine Learning and a Certificate in Applications of Computing.
When taking a break from work, I am an avid fan of zumba, cooking, anime and poetry.
Common pipelines for estimating cell developmental trajectories from single-cell data typically first embed each cell into a lower-dimensional space, but these embeddings typically assume statistical models that do not model single-cell data well. In this paper, we develop an embedding for a hierarchical model where the inner product between two latent low-dimensional vectors is the natural parameter of an exponential-family-distributed random variable, and prove identifiability and convergence. When studying oligodendrocytes in fetal mouse brains, we find that oligodendrocytes mature into various cell types.
Dependency graphs encode complex pairwise patterns that are often statistically estimated, but they are hard to diagnose with visualizations due to the quadratic number of scatter plots. In this paper, we develop an interactive system in R that learns when the data scientist visually perceives dependency. Then, this system applies the learned classifier to infer a dependency graph that can be compared against the estimated graph. This paper received an honorable mention for the Student Paper Award in the ASA Section: Statistical Computing and Statistical Graphics.
Changepoint detection methods such as binary segmentation are often used in CGH analyses for copy number variation detection, but these methods lack proper downstream statistical inference. In this paper, we develop post-selection hypothesis tests for various changepoint detection methods and provide substantial practical guidelines based on simulation.
Microarray samples from brain tissue are hard to collect and also vary substantially depending on the tissue's brain region and the developmental age of its subject, hence it is hard to collect enough samples for statistical analysis. In this paper, we develop a sample selection method to find additional microarray samples that are statistically similar to the samples of our desired spatio-temporal brain tissue. We demonstrate that after applying an existing analysis pipeline to our selected samples, we detect a higher percentage of autism risk genes.
Changepoint estimators have statistical theory for how well they estimate the mean function and how well they estimate the changepoints, but existing theory often analyzes these properties separately. In this paper, we prove a near-optimal estimation rate for the fused lasso, which in turn directly proves a changepoint detection rate that is near the detection limit. We extend this logic to other estimators and settings.
Many compressed sensing algorithms are developed to be as generic as possible, but have shortcomings in specialized settings where modern optimization theory can deliver a substantial boost in computational efficiency. In this paper, we develop two compressed sensing algorithms, one specialized for extremely sparse signals and another specialized for Kronecker-structured sensing matrices. We numerically demonstrate a nearly 10-fold reduction in computation time compared to other state-of-the-art methods.
While de novo mutations within the protein-coding portion of the genome have been thoroughly studied, these mutations in the noncoding portions, which comprise 98.5% of the genome, are less well understood. In this paper, we use a bioinformatics framework to analyze 1,902 autism quartets via whole-genome sequencing (WGS) and find that the strongest signals arose from promoters -- noncoding regions that control gene transcription.
Analysis of 7,608 genomes highlights a role for promoter regions in Autism Spectrum Disorder
Science 362.6420 (2018). (link) (pdf)
Dependency diagnostic: Visually understanding pairwise variable relationships (talk).
2018 Joint Statistical Meetings (JSM), Vancouver, Canada.
A sharp error analysis for the fused lasso, with application to approximate changepoint screening (poster).
2017 Conference on Neural Information Processing Systems (NIPS), Long Beach, CA.
Hypothesis testing for simultaneous variable clustering and correlation network estimation, with applications to gene coexpression networks (talk).
2017 Joint Statistical Meetings (JSM), Baltimore, MD.
Longitudinal Gaussian graphical model for autism risk gene detection (talk).
2016 Joint Statistical Meetings (JSM), Chicago, IL.
Longitudinal Gaussian graphical model integrating gene expression and sequencing data for autism risk gene detection (talk).
2015 American Society of Human Genetics (ASHG), Baltimore, MD.
Optimization for compressed sensing: New insights and alternatives (talk).
2014 Modeling and Optimization: Theory and Applications, Bethlehem, PA.
Honorable mention in student paper competition
(For article "Dependency diagnostic: Visually understanding pairwise variable relationships")
ASA section: Statistical Computing and Statistical Graphics, January 2018
Winner of Statistical excellence for early-career writing
(For article "We, the millennials: The statistical significance of political significance")
Significance magazine in partnership with Young Statisticians Section of Royal Statistical Society, June 2017
Teaching assistant award recipient
(For "Statistical Computing" in Fall 2016)
Carnegie Mellon University, May 2017
Award recipient of Kenneth H. Condit Prize
(For excellence in service to department)
Princeton University, May 2014
- (2018 Summer) 36-350 Statistical Computing (Instructor)
- (2018 Spring) 36-350 Statistical Computing (Assistant Instructor with R. J. Tibshirani)
- (2017 Fall) 36-350 Statistical Computing (TA under P. Freeman)
- (2015 Fall, 2016 Fall) 36-350 Statistical Computing (TA under R. J. Tibshirani)
- (2015 Spring) 36-217 Probability Theory and Random Processes (TA under A. Rinaldo)
- (2014 Fall) 46-921 Financial Data Analysis I and 46-923 Financial Data Analysis II (TA under C. Schafer)
- (2014 Spring, 2013 Spring, 2012 Spring) ORF 350 Analysis of Big Data (Course designer under H. Liu)
Last Updated: August 10, 2018.