Research
This page summarizes my research projects, publications, and selected course projects. It includes manuscripts under review, manuscripts in progress, and published papers across machine learning, data science, and biomedical applications. Each project includes a brief summary and links to papers or code where available.
Selected Projects
Nonlinear Time Series Modeling Using Bernstein Polynomials and Bayesian Inference
Status: Submitted to Transactions on Machine Learning Research (TMLR).
Authors: Li Yuan and Raanju R. Sundararajan.
Current advisor: Raanju R. Sundararajan.
This project develops a Bayesian nonlinear autoregressive model using Bernstein polynomial representations and compares its forecasting performance with classical time-series models such as ARIMA and SARIMA.
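For readers unfamiliar with the building block, the Bernstein basis behind such representations can be sketched as below. This is only an illustration under stated assumptions: the function names `bernstein_basis` and `bernstein_expand` are hypothetical, the input is assumed to be rescaled to [0, 1], and the sketch does not reproduce the Bayesian autoregressive model from the manuscript.

```python
from math import comb


def bernstein_basis(k: int, n: int, x: float) -> float:
    """k-th Bernstein basis polynomial of degree n, evaluated at x in [0, 1]."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    return comb(n, k) * x**k * (1.0 - x) ** (n - k)


def bernstein_expand(coefs, x: float) -> float:
    """Evaluate sum_k coefs[k] * B_{k,n}(x), where n = len(coefs) - 1."""
    n = len(coefs) - 1
    return sum(c * bernstein_basis(k, n, x) for k, c in enumerate(coefs))
```

The basis functions are nonnegative and sum to one at every point of [0, 1], so an expansion with constant coefficients returns that constant.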
Empirical Legal Analytics: Deal Lawyer Database and Missingness Analysis
Research assistant project, Vanderbilt Law School and MS in Data Science, Vanderbilt University
Faculty collaborators: Tracey E. George and Albert Yoon.
View faculty collaborator profile
Project summary:
This research assistant project supported empirical legal scholarship on deal lawyers, law firms, and attorney biographical data in mergers-and-acquisitions transactions. I built reproducible R analyses to evaluate missingness patterns and study how missing lawyer or law-firm information related to deal characteristics, industry sectors, time, law-school rank, firm rank, and inferred attorney gender.
- Cleaned and joined deal, deal-lawyer, lawyer biographical, law-school ranking, and law-firm ranking datasets, creating encoded deal identifiers and matched analysis tables.
- Quantified missing lawyer, law-firm, and attorney-biographical information across industry sector, deal value, year, and deal type using summary tables and visualizations.
- Applied chi-square tests, two-sample Kolmogorov-Smirnov tests, and two-proportion tests to compare missing versus non-missing groups.
- Built logistic-regression models with robust standard errors and random-forest classifiers to assess predictors of attorney gender patterns and missing biographical information.
- Used SSA baby-name data and UCI gender-by-name data as supplementary sources for probabilistic gender inference in records without complete attorney biographies.
- Produced reproducible statistical reports and figures for faculty research workflows at the intersection of law, data science, and empirical social science.
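As an illustration of the two-proportion comparison mentioned above, a pooled two-sided z-test can be sketched as follows. The project's analyses were written in R; this Python version, the function name `two_proportion_ztest`, and the counts in the usage note are assumptions made for illustration and are not taken from the project data.

```python
from math import erf, sqrt


def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-sided z-test of H0: p1 == p2.

    x1, x2 are counts of "successes" (for example, records with missing
    information) out of n1 and n2 records. Returns (z, p_value).
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p1 - p2) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
    return z, 2.0 * (1.0 - cdf)
```

For example, `two_proportion_ztest(60, 100, 40, 100)` compares a hypothetical 60% missingness rate in one group with 40% in another.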
Selected Course Projects
Amazon Query Product Ranking
Social Network Analysis course project, MS in Data Science, Vanderbilt University
Authors: Li Yuan, Jay Kim, and Daron Greenblatt.
Paper summary:
This project studied product search relevance in the Amazon KDD Cup 2022 setting. Given a user query and a list of candidate products, the goal was to re-rank products so that more relevant items appeared higher in the results.
- Built and compared three transformer-based modeling approaches: Multi-Genre Natural Language Inference, Cross-Encoder Regression, and Cross-Encoder Classification.
- Framed query-product matching as a sentence-pair relevance task using product titles, descriptions, bullet points, and labels such as exact, substitute, complement, and irrelevant.
- Evaluated model performance on validation and test sets, while documenting practical challenges around model selection, class imbalance, GPU resources, and fine-tuning large language models for e-commerce ranking.
Automatic Medical Image Annotation with CLIP and CNN Models
MS in Data Science capstone project, Vanderbilt University
Author: Li Yuan.
View capstone paper View GitHub repository
Paper summary:
This capstone project explored how to reduce manual labeling effort for biomedical images by combining zero-shot vision-language models with supervised convolutional neural networks on the MedMNIST benchmark.
- Tested CLIP-style zero-shot image classification for medical image annotation, where general-domain vision-language models struggle with granular biomedical labels.
- Designed a two-stage workflow that first used CLIP to generate labels and then trained CNN models on the CLIP-labeled data to improve annotation quality.
- Reported that the combined approach improved prediction accuracy from about 80% to 95%, while also improving average annotation accuracy and F1 score compared with using CLIP alone.
- Proposed future extensions using medical-domain foundation models, domain-adapted CLIP training, and broader multi-class MedMNIST experiments.
Amazon US Customer Review and Product Analysis
Big Data Scaling course project, MS in Data Science, Vanderbilt University
Authors: Li Yuan and Matthew Flaherty.
View project paper View Hugging Face model
Paper summary:
This project analyzed large-scale Amazon customer review data to study product recommendation, sentiment prediction, and product-category clustering in an e-commerce setting.
- Processed more than 20 GB of Amazon review data across 12 product categories using Google Cloud, Hadoop file storage, PySpark, and MLlib.
- Built a recommender system with Alternating Least Squares (ALS) matrix factorization to generate customer-product recommendations from review and rating behavior.
- Fine-tuned a multilingual BERT sentiment model to predict star ratings from customer review text, then published the resulting model to Hugging Face.
- Applied K-means clustering and locality-sensitive hashing ideas to group product titles and descriptions, supporting product discovery and category-level recommendation analysis.
US Accidents Database and R Shiny Application
DS 5420: Data Management Systems final project, MS in Data Science, Vanderbilt University
Authors: Li Yuan and Matthew Flaherty.
Paper summary:
This project built a MySQL-backed R Shiny application for exploring and managing a large U.S. traffic accident dataset collected from Kaggle.
- Designed a database workflow for 4.2 million accident records with 49 features covering accidents across 49 states from February 2016 to December 2020.
- Normalized the original table into a database structure with a location table and accident table, while preserving a source mega-table for loading and data management.
- Implemented database views, stored procedures, triggers, and transactions to support reliable querying, validation, insertion, updating, and deletion of accident records.
- Built an interactive R Shiny front end with pages for accident maps, date-range trends, temperature histograms, accidents by street side, accidents by state, record search, and full CRUD operations.
- Documented the practical engineering challenges of connecting MySQL with Shiny and translating database operations into a usable web application.
Food-101 Image Classification with CLIP and Transfer Learning
Machine Learning 2 course project, MS in Data Science, Vanderbilt University
Author: Li Yuan. Advisor: Yuankai Huo.
Paper summary:
This project studied large-scale food image classification on the Food-101 dataset using both CLIP zero-shot classification and PyTorch transfer learning.
- Evaluated OpenAI CLIP models for zero-shot classification across 101 food categories, with ViT-B/16 producing the strongest reported CLIP result at about 83.82% accuracy.
- Prepared and cleaned the Food-101 image data, including train/validation organization, image preprocessing, and removal of corrupted images.
- Compared several pre-trained deep learning architectures, including ResNet-50, VGG16, EfficientNet-B7, RegNet-X, and DenseNet-161.
- Scaled the transfer-learning workflow from a smaller architecture-selection subset to the full image dataset, reaching about 70.09% validation accuracy on 101 food classes.
- Documented practical deep-learning constraints such as slow training, large image data, Google Colab GPU limits, and CUDA memory failures for larger CLIP models.
Modeling Insurance Data
STAT 6324: Computational Statistics final project, Southern Methodist University
Authors: Hunter Schuler and Li Yuan.
View project report View presentation slides
Paper summary:
This project modeled U.S. health insurance charges using a public Kaggle dataset with demographic and health-related predictors such as age, sex, BMI, number of children, smoking status, and region.
- Conducted exploratory data analysis to study variable distributions, class imbalance, and relationships between insurance charges and predictors such as smoking status, BMI, and age.
- Built baseline and reduced interaction linear regression models, using train/test/validation splits and model diagnostics to evaluate fit and assumption violations.
- Addressed imbalance in age, smoking status, and children variables through downsampling, resampling, and feature recoding before refitting models.
- Used h2o AutoML to compare machine learning models, finding that tree-based models performed best and that an XGBoost model achieved the lowest reported test-set RMSE.
- Interpreted variable importance results showing smoking status and BMI as the strongest predictors of insurance charges, while sex, children, and region had smaller effects.
Categorical Regression Models
STAT 6336: Statistical Analysis I literature review project, Southern Methodist University
Author: Li Yuan.
Paper summary:
This literature review synthesized major categorical regression models for binary, unordered multinomial, and ordered categorical response variables.
- Reviewed logistic regression as the foundational binary-response model and explained why multi-category outcomes require extensions beyond a single probability threshold.
- Summarized unordered multinomial logit models, including dummy-vector response coding, reference-category probabilities, generalized logits, and random-utility interpretations.
- Discussed ordered categorical regression through latent-variable and cumulative-model perspectives, including threshold-based category assignment and cumulative logits.
- Compared cumulative ordinal models with sequential ordinal models and described when each framework is more appropriate.
- Included multinomial-distribution and maximum-likelihood background to connect categorical response modeling with probability theory and estimation.
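The two main model families in the review can be illustrated with a short sketch that turns linear predictors into category probabilities. The function names are hypothetical, the reference category is assumed to be listed last, and the cumulative model uses the parameterization P(Y <= j) = logistic(theta_j - eta), which is one common convention among several.

```python
from math import exp


def multinomial_logit_probs(etas):
    """Category probabilities from J-1 generalized logits.

    etas[j] is log(P(Y = j) / P(Y = reference)); the reference category
    is appended last, with its linear predictor fixed at 0.
    """
    scores = list(etas) + [0.0]
    m = max(scores)  # subtract the maximum for numerical stability
    weights = [exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def cumulative_logit_probs(thresholds, eta):
    """Category probabilities under P(Y <= j) = logistic(theta_j - eta)."""
    if any(b <= a for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    cdf = [1.0 / (1.0 + exp(-(t - eta))) for t in thresholds] + [1.0]
    # Each category probability is a difference of adjacent cumulative values.
    return [cdf[0]] + [cdf[j] - cdf[j - 1] for j in range(1, len(cdf))]
```

With J-1 thresholds the cumulative model yields J ordered categories, while the multinomial model needs a separate linear predictor for each non-reference category.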
Principal Component Analysis and Nonlinear PCA
STAT 6345: Linear Regression midterm project, Southern Methodist University
Authors: Hunter Schuler and Li Yuan.
Project summary:
This project presented principal component analysis as a dimension-reduction method and compared classical linear PCA with nonlinear PCA approaches.
- Reviewed PCA theory using covariance matrices, singular value decomposition, eigenvectors, eigenvalues, component scores, and loading interpretation.
- Applied PCA to a Portuguese red-wine physicochemical dataset with 1,599 samples, 11 continuous physical-property variables, and a wine-quality score.
- Interpreted major components from the wine data, including acidity, smoothness or stability, bitterness, salt-sugar contrast, dryness, and aroma-related structure.
- Discussed limitations of ordinary PCA, including linearity assumptions, sensitivity to outliers, complete-data requirements, and restrictions to continuous variables.
- Introduced nonlinear PCA variants such as standard NLPCA, hierarchical NLPCA, circular PCA, and inverse NLPCA, including examples where nonlinear structure or missing-data estimation benefits from nonlinear methods.
Birth Weights
STAT 6345: Linear Regression final project, Southern Methodist University
Author: Li Yuan.
Paper summary:
This project compared linear-regression and generalized linear modeling strategies for predicting infant birth weight from maternal and pregnancy-related variables.
- Conducted exploratory analysis of birth-weight data, including distributions of gestational age, maternal race, smoking status, education, and birth weight in grams.
- Compared ordinary least squares, log-transformed models, Box-Cox transformations, all two-way interaction models, elastic net, ridge, lasso, robust regression, inverse Gaussian models, random forest, and iteratively reweighted least squares.
- Evaluated models using train/test splitting, test-set RMSE, BIC/AIC, and residual diagnostics, with special attention to heteroskedasticity and skewed predictors.
- Selected a final IRLS model with smoking status, gestational age, and interaction terms involving maternal race, using exhaustive subset selection guided by BIC.
- Interpreted the final model as showing strong associations between birth weight and gestational age, maternal smoking, and selected interaction effects, while checking multicollinearity with variance inflation factors.
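Among the transformations compared above, the Box-Cox family is easy to state in code. The sketch below is a generic illustration, not the project's implementation; the function name `box_cox` is hypothetical, and choosing the parameter `lam` (usually by maximum likelihood) is outside its scope.

```python
from math import log


def box_cox(y: float, lam: float) -> float:
    """Box-Cox transform of a strictly positive response value.

    Returns log(y) when lam is 0 and (y**lam - 1) / lam otherwise.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if abs(lam) < 1e-12:
        return log(y)
    return (y**lam - 1.0) / lam
```

Setting `lam = 1` only shifts the response by one, and `lam = 0` gives the log transform, so the family includes the untransformed and log-transformed models as special cases.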
Formula One Analysis: Interpreting and Predicting Race Points
STAT 6341: Sports Analytics final project, Southern Methodist University
Author: Li Yuan.
View project report View presentation slides View Shiny dashboard
Project summary:
This project analyzed Formula One race results to compare drivers and constructors, explain point outcomes, and predict future race performance.
- Built a Formula One modeling dataset from Kaggle race tables, covering seasons 1994-2023 with driver, constructor, circuit, qualifying position, age, and race-point variables.
- Developed a Bayesian linear mixed model in RStan with fixed effects for year, circuit, age, and qualifying position, plus random effects for constructors and drivers nested within constructors.
- Used posterior inference to compare constructors and drivers, identifying Mercedes and Red Bull as leading constructors and comparing estimated effects for top drivers such as Lewis Hamilton and Fernando Alonso.
- Trained frequentist and machine-learning models with tidymodels, including linear regression, random forest, and XGBoost, and compared predictive performance using train/test RMSE.
- Built an R Shiny dashboard for interactive Formula One visualization and prediction, including driver/constructor trends, nationality summaries, and model-based point predictions.
- Discussed limitations such as sparse driver/constructor levels, the 2010 point-system change, omitted race features, and the need for models better suited to discrete point outcomes.
GIDEON Project: Public Defense Event Frequency Analysis
STAT 6366: Statistical Consulting final project, Southern Methodist University
Authors: Li Yuan and Hunter Schuler.
View final report View presentation slides View Shiny dashboard
Project summary:
This consulting project supported the Deason Criminal Justice Reform Center’s GIDEON Project, which studies whether increased public-defense funding in New York changed attorney-client interactions and defense-service patterns.
- Joined three raw criminal-case data sources into a unified analysis dataset using shared case and client identifiers, producing a linked dataset with more than 390,000 rows and 200 variables.
- Cleaned and grouped case records into felony and misdemeanor categories, filtered erroneous dates, and focused analysis on client-contact, discovery, disposition, hearing, sentencing, and related event codes.
- Designed multiple event-frequency measures, including event counts per closed case and event rates per case per week to account for differences in case duration.
- Built visual analyses with ggplot2 and plotly to compare event-code patterns, case counts, case duration, and event rates over time for felony and misdemeanor cases.
- Fit Poisson regression and grouped cross-time-series models to assess whether event rates changed meaningfully over time after accounting for case duration.
- Built an R Shiny dashboard for interactive client-facing exploration of event-code definitions, case-type trends, event frequencies, event rates, and consultant/client information.
- Concluded that the current data did not show practically meaningful changes in attorney-client interaction rates, while recommending continued tracking because legal-system reforms may have delayed effects.
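The rate measure described above, events per case per week, can be sketched as follows. The project itself was built in R. This Python function, its name, and the choice to measure duration from open date to close date are assumptions for illustration, and the report may have handled same-day or still-open cases differently.

```python
from datetime import date


def events_per_week(event_count: int, opened: date, closed: date) -> float:
    """Event rate for one closed case, normalized by case duration in weeks."""
    days = (closed - opened).days
    if days <= 0:
        raise ValueError("case must close at least one day after it opens")
    return event_count / (days / 7)
```

Normalizing by duration keeps a long case with many events from looking more active than a short case with the same weekly intensity.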
Electricity Net Generation in the US and Texas
STAT 6363: Time Series Analysis final project, Southern Methodist University
Author: Li Yuan.
Paper summary:
This project analyzed monthly electricity net generation in the United States and Texas from January 2001 through early 2024 using seasonal time-series methods.
- Collected monthly net-generation data from the U.S. Energy Information Administration, covering electricity generated by coal, natural gas, nuclear, hydroelectric, wind, and solar sources.
- Compared cumulative state-level electricity generation and identified Texas as the highest-producing state, motivating a focused state-level analysis alongside the national series.
- Assessed stationarity using visual inspection, split-sample mean and standard-deviation comparisons, and Augmented Dickey-Fuller tests.
- Used additive decomposition and smoothed periodograms to identify strong 12-month seasonality in both the U.S. and Texas electricity generation series.
- Fit SARIMA(1,1,1)(0,1,1)[12] models to both series and checked residual diagnostics with ACF, Q-Q plots, and Ljung-Box tests.
- Produced 12-month forecasts showing stable seasonal patterns, with a slight downward trend in U.S. generation and a slight upward trend in Texas generation.
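For reference, the SARIMA(1,1,1)(0,1,1)[12] specification named above can be written in backshift notation. The plus signs on the moving-average terms follow the convention used by R's `arima`; other texts use minus signs, and no drift term is assumed here.

```latex
(1 - \phi_1 B)\,(1 - B)\,(1 - B^{12})\, y_t
  = (1 + \theta_1 B)\,(1 + \Theta_1 B^{12})\, \varepsilon_t,
\qquad \varepsilon_t \sim \mathrm{WN}(0, \sigma^2)
```

Here $B$ is the backshift operator ($B y_t = y_{t-1}$), $(1 - B)$ is the regular difference, and $(1 - B^{12})$ is the seasonal difference that removes the 12-month pattern.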
Fully Automatic Text Data Augmentation
CS 7322: Introduction to Natural Language Processing final project, Southern Methodist University
Author: Li Yuan. Advisor: Jing Cao, Ph.D.
View project report View paper presentation
Project summary:
This project explored fully automatic text data augmentation methods for sentiment classification and evaluated whether they improved NLP model generalization without human annotation.
- Reviewed related work on counterfactually augmented data, easy data augmentation, unsupervised data augmentation, back translation, and BERT-based adversarial examples.
- Implemented seven augmentation strategies: synonym replacement, negation addition, random insertion, random deletion, random swap, BERT mask prediction, and back translation.
- Built sentiment-classification experiments using a balanced wine-review dataset and compared original training data with one-to-one augmented training sets.
- Evaluated augmentation effects on two models: the Attention-Based Multiple Instance Classification (AMIC) model and fine-tuned bert-base-uncased.
- Found that most automatic augmentation methods produced accuracy similar to the original data; random swap slightly improved AMIC performance, while synonym replacement and random deletion slightly improved BERT performance.
- Identified negation addition as a fragile augmentation strategy because it often broke grammar or made sentiment labels ambiguous, especially for smaller training samples.
- Presented the paper “Learning the Difference That Makes a Difference with Counterfactually-Augmented Data” to connect the project to human-in-the-loop counterfactual augmentation and robustness against spurious correlations.
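Two of the simpler strategies listed above, random swap and random deletion, can be sketched in a few lines. This is an illustrative version in the spirit of easy data augmentation, not the project's code; the function names, whitespace tokenization, and parameter values are assumptions.

```python
import random


def random_swap(tokens, n_swaps: int, rng: random.Random):
    """Swap the positions of two randomly chosen tokens, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p: float, rng: random.Random):
    """Drop each token independently with probability p, keeping at least one."""
    tokens = list(tokens)
    if not tokens:
        return tokens
    kept = [t for t in tokens if rng.random() >= p]
    # If every token was dropped, fall back to one randomly chosen token.
    return kept if kept else [rng.choice(tokens)]
```

A usage example with a fixed seed: `random_swap("this wine is bright and crisp".split(), 1, random.Random(0))`. Both operations keep the original label, which is why they are cheap but can also distort meaning in short reviews.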
Bayesian Neural Networks
STAT 6390 final presentation, Southern Methodist University
Author: Li Yuan.
Project summary:
This final presentation introduced Bayesian neural networks as stochastic neural networks trained with Bayesian inference to quantify uncertainty in weights and predictions.
- Compared traditional neural networks with Bayesian neural networks, emphasizing that standard neural networks produce point estimates while Bayesian neural networks place priors on model weights.
- Explained posterior predictive inference as an ensemble-like average over neural networks sampled from the posterior distribution.
- Discussed how Bayesian priors can act as regularization, support uncertainty quantification, and help reduce overfitting in high-capacity neural networks.
- Reviewed why exact posterior inference is difficult for neural networks because of high-dimensional weight spaces and large datasets.
- Presented Monte Carlo integration, Metropolis-Hastings, variational inference, KL divergence, the evidence lower bound, and stochastic variational inference as computational strategies.
- Highlighted Python and Pyro as practical tools for Bayesian neural network implementation, including examples using MCMC and variational inference.
- Connected Bayesian neural networks to applications requiring uncertainty-aware prediction, such as autonomous vehicles, healthcare, finance, and flight trajectory prediction.
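The quantities named above are linked by two standard identities, written here with generic notation that is assumed for illustration: $w$ for the network weights, $\mathcal{D}$ for the data, and $q_\lambda(w)$ for the variational family.

```latex
\log p(\mathcal{D})
  = \mathrm{ELBO}(\lambda)
  + \mathrm{KL}\big(q_\lambda(w) \,\|\, p(w \mid \mathcal{D})\big),
\qquad
\mathrm{ELBO}(\lambda)
  = \mathbb{E}_{q_\lambda(w)}\big[\log p(\mathcal{D} \mid w)\big]
  - \mathrm{KL}\big(q_\lambda(w) \,\|\, p(w)\big)

p(y^\ast \mid x^\ast, \mathcal{D})
  = \int p(y^\ast \mid x^\ast, w)\, p(w \mid \mathcal{D})\, dw
  \approx \frac{1}{S} \sum_{s=1}^{S} p(y^\ast \mid x^\ast, w^{(s)})
```

Because the KL term in the first identity is nonnegative, maximizing the ELBO pushes $q_\lambda$ toward the posterior. In the Monte Carlo average, the draws $w^{(s)}$ come from MCMC samples of the posterior or from the fitted $q_\lambda$, which gives the ensemble-like reading of posterior prediction.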
Apply Geometric Data Analysis to the MNIST Data
STAT 6383: Statistical Shape Analysis final project, Southern Methodist University
Author: Li Yuan.
View project report View presentation slides
Project summary:
This project applied geometric data analysis and statistical shape analysis to MNIST handwritten digit images, using curve-based shape representations to compare and classify digit contours.
- Preprocessed MNIST images by separating digits by label, converting grayscale images into binary images, extracting contour points, and resampling contours through linear interpolation.
- Represented digit boundaries using the Square-Root Velocity Function (SRVF), enabling elastic shape comparison through geodesic distances on the curve manifold.
- Computed geodesic paths between handwritten digits to visualize how one digit contour deforms into another under the SRVF framework.
- Estimated Karcher means and shape variation for each digit class, using the class means as representative digit shapes.
- Explored functional principal component analysis for digit contours to study dominant shape-variation directions within each class.
- Built a nearest-Karcher-mean classifier for MNIST test images and used parallel R workflows with future and furrr to handle the high computational cost of elastic shape-distance calculations.
- Reported 74% overall classification accuracy on a balanced subset of test images, while identifying confusion among visually similar or highly variable digits such as 2, 6, 7, and 8.
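The SRVF representation and the classification rule used above can be stated compactly. The notation is assumed for illustration: $\beta : [0, 1] \to \mathbb{R}^2$ is a parameterized digit contour, $\gamma$ is a reparameterization, $\mu_c$ is the Karcher mean of class $c$, and $d$ is the elastic shape distance.

```latex
q(t) =
\begin{cases}
  \dfrac{\dot{\beta}(t)}{\sqrt{\lVert \dot{\beta}(t) \rVert}}, & \dot{\beta}(t) \neq 0 \\[2ex]
  0, & \text{otherwise}
\end{cases}
\qquad
(q, \gamma) \mapsto (q \circ \gamma)\,\sqrt{\dot{\gamma}}
\qquad
\hat{c}(\beta) = \arg\min_{c}\; d\big(q_\beta, \mu_c\big)
```

Reparameterizing a curve acts on its SRVF by the middle expression, and this action preserves the $L^2$ norm. That property is what lets the elastic metric be computed as an ordinary $L^2$-type distance after optimizing over rotations and reparameterizations, and that optimization is the source of the computational cost.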
GAIN: Missing Data Imputation Using Generative Adversarial Nets
STAT 7331: Statistical Modeling of Longitudinal and Incomplete Data final paper presentation, Southern Methodist University
Author: Li Yuan.
View presentation report View reviewed paper
Project summary:
This final paper presentation reviewed GAIN, a generative-adversarial approach for imputing missing data, and connected it to statistical modeling of incomplete data.
- Summarized the missing-data setting, including the role of missingness indicators, completed data vectors, and the MCAR assumption used in the reviewed paper’s theoretical results.
- Explained the GAIN architecture, where a generator imputes missing components and a discriminator learns to distinguish observed values from imputed values.
- Highlighted the role of the hint vector, which gives the discriminator partial information about the missingness mask so the generator can learn the target data distribution.
- Reviewed the minimax training objective, mini-batch stochastic gradient descent updates, and the use of fully connected neural networks for the generator and discriminator.
- Compared GAIN conceptually with traditional imputation methods such as MICE, MissForest, matrix completion, autoencoders, and expectation-maximization.
- Discussed reported empirical results showing that GAIN improved imputation RMSE, post-imputation prediction performance, and congeniality across several benchmark datasets.
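The role of the missingness mask in forming the completed data vector can be shown with a small sketch. It covers only the final combination step, in which observed entries are kept and missing entries are filled from the generator output. The function name and the list-based representation are assumptions, and no adversarial training is shown.

```python
def complete_with_imputations(x, mask, generated):
    """Elementwise x_hat = m * x + (1 - m) * g for a single record.

    mask[i] is 1 where x[i] is observed and 0 where it is missing;
    generated holds the generator's output for every component.
    """
    if not (len(x) == len(mask) == len(generated)):
        raise ValueError("x, mask, and generated must have equal length")
    if any(m not in (0, 1) for m in mask):
        raise ValueError("mask entries must be 0 or 1")
    return [xi if m == 1 else gi for xi, m, gi in zip(x, mask, generated)]
```

For example, `complete_with_imputations([1.0, None, 3.0], [1, 0, 1], [0.9, 2.5, 3.2])` keeps the two observed values and fills the middle entry with 2.5.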
Research Milestone
Ph.D. Qualifying Exam Research Paper
Integrate Attention-Based Neural Networks with Text Regressions in Sentiment Inference
Ph.D. Qualifying Exam research paper, Southern Methodist University.
View qualifying exam research paper
Paper summary:
This paper was completed for the Ph.D. Qualifying Exam, which assesses a student’s readiness for independent research. It synthesizes work on statistical text analysis, attention-based multiple-instance classification, and counterfactually augmented data for sentiment inference.
- Reviewed statistical approaches for using text as data in social science and applied research.
- Discussed attention-based neural network methods for interpretable sentiment analysis.
- Reproduced computational results from the attention-based multiple-instance classification model.