Student Seminar – Spring 2019

Schedule for Spring 2019


Seminars are on Wednesdays
Time: 12:00pm – 1:00pm
Location: Room 1025, 1255 Amsterdam Avenue
Contacts: Yuling Yao, Owen Ward



Wenda Zhou “Statistical Computing Essentials”




Shira Mitchell (NYC Mayor’s Office of Data Analytics)

“Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions”



Elisa Perrone (MIT)

Title: Geometric structure in dependence models and applications

Abstract: The growing availability of data makes it challenging yet crucial to model complex dependence traits. For example, hydrological and financial data typically display tail dependences, non-exchangeability, or stochastic monotonicity. Copulas serve as tools for capturing these complex traits and constructing accurate dependence models which resemble the underlying distributions of data. This talk explores the geometric properties of copulas to address dependence modeling challenges in several applications, such as hydrology and finance. In particular, we study the class of discrete copulas, i.e., restrictions of copulas to uniform grid domains, which admits representations as convex polytopes. In the first part of the talk, we give a geometric characterization of discrete copulas with desirable stochastic constraints in terms of the properties of their associated convex polytopes. In doing so, we draw connections to the popular Birkhoff polytopes, thereby unifying and extending results from both the statistics and the discrete geometry literature. In the second part of the talk, we further consolidate the statistics/discrete geometry bridge by showing how our geometric findings can be used to (1) construct entropy-copula models useful in hydrology, and (2) design test statistics for stochastic monotonicity properties of interest in finance.


Linxi Liu (Columbia)

Title: Exploring RNA-protein interactions at amino-acid level via a multinomial logistic regression model with latent responses

Abstract: In eukaryotic cells, alternative splicing occurs during RNA processing and greatly increases the biodiversity of proteins encoded by the genome. It is already known that RNA-binding proteins (RBPs) play a central role in the regulation of splicing, but at the molecular level it is still unclear how proteins interact and crosslink with RNA. The recently developed high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation (HITS-CLIP) method allows genome-wide mapping of RBP-binding footprint regions at single-nucleotide resolution. Together with information about the 3-dimensional structures of protein-RNA complexes, we can make inferences about crosslinking at the amino-acid-nucleotide level by using statistical models, even though interactions at this level can hardly be detected in experiments.

In this work, we introduce a multinomial logistic regression with latent responses to model the potential crosslinking between 20 different amino acids and the nucleotide. We also introduce a set of variable selection indicators for each category. Under the Bayesian framework, we are able to make inferences about the latent responses and the association between the explanatory variables and the response based on the posterior distribution. The results agree well with our current understanding of RBPs.

2/27/19 Timothy Jones (Columbia)

Sharon Lohr (Arizona State University)

“Measuring Crime: Behind the Statistics”

In 1915, the Chicago City Council asked statistician Edith Abbott to report “upon the frequency of murder, assault, burglary, robbery, theft and like crimes in Chicago.” Her report, drawing on published and unpublished statistics from the courts, probation office, house of correction, and police department, set the stage for subsequent collections and evaluations of crime statistics. Her conclusions, that the quality of statistics depends on the systems of data collection and that multiple sources of data are needed to study crime, hold today.

Drawing on Abbott’s insights, I set out eight questions to ask about a statistic before you rely on it. I then go through these questions for three sources of statistics about sexual assault: the Uniform Crime Reports, the National Crime Victimization Survey, and the National Intimate Partner and Sexual Violence Survey.


Andrew Davison

“Optimization for the Working Statistician”

Statistical problems are defined via optimization problems. For instance, finding an MLE (or more broadly, performing M-estimation/empirical risk minimization), finding a most powerful hypothesis test, and finding a minimax/admissible estimator in decision theory all amount to solving an optimization problem. In statistics, we usually ignore the process of finding such an optimal value and take it for granted; in some respects, we suppose that the ‘optimization error’ and the ‘statistical error’ can be neatly decoupled and we only concern ourselves with the latter. I’ll argue that this perspective is not useful for the pragmatist and, perhaps even worse, is an incorrect and uninteresting one. In particular, I’ll talk about non-convex optimization from both an optimization and a statistical point of view, and highlight some interesting aspects of both.

Spring Break


Lauren Kennedy (Columbia)

“Multilevel regression and poststratification with applications”

Multilevel regression and poststratification (MRP) is a model-based approach to using survey data to make population and subpopulation estimates. In this talk we will work through the rationale for using this method and the benefits of MRP when compared to traditional survey weighting techniques. Time permitting, we will then turn our attention to applications in complex survey designs, where we will discuss some of the advantages of MRP and some of the challenges.

Lauren is a postdoctoral research scientist at the School of Social Work, Columbia University. She obtained her PhD in psychology from the University of Adelaide in 2018 and has undergraduate degrees in both Psychology and Maths & Computer Sciences (Statistics). Her research focuses broadly on the analysis of data provided by people, and more specifically on Bayesian modelling, generalization and measurement.



Duco Veen (Utrecht University)

Experts can hold valuable information and they should not be ignored as a potential source of data. Expert knowledge can be used for many different reasons: to evaluate experts, to supplement limited data, to obtain more confident parameter estimates, or simply to enable estimation of the model. Extracting this knowledge, judging its quality, and implementing it appropriately all come with their own problems, though. I discuss elicitation, which is the construction of a probabilistic representation of expert knowledge, and possible ways to evaluate expert knowledge, and I present our approach to obtaining priors for a hierarchical model from experts with limited statistical knowledge.


Joshua Gordon (Google/TensorFlow)

Title: “What’s new in TensorFlow 2.0?”
Description: “TensorFlow 2.0 is all about ease of use. In this 30-minute talk, I’ll cover best practices for beginners and advanced users, and point you to code examples you can try for each. I’ll cover the Keras Sequential and Subclassing APIs, built-in training loops, and how to write a custom training loop using a GradientTape. I’ll also show how to accelerate models with tf.function. To wrap it up, I’ll give a quick summary of a few announcements and updates from the TensorFlow Developer Summit.”

Marco Avella (Columbia)

“Privacy-preserving parametric inference: a case for robust statistics”

Differential privacy is a cryptographically motivated definition of privacy that has become a very active field of research over the last decade in theoretical computer science and machine learning. In this paradigm we assume there is a trusted curator who holds the data of individuals in a database, and the goal of privacy is to simultaneously protect individual data while allowing statistical analysis of the database as a whole. In this setting we introduce a general framework for parametric inference with differential privacy guarantees. We first obtain differentially private estimators based on bounded influence M-estimators by leveraging their gross error sensitivity in the calibration of a noise term added to them in order to ensure privacy. We then show how a similar construction can also be applied to construct differentially private test statistics analogous to the Wald, score and likelihood ratio tests. We provide statistical guarantees for all our proposals via an asymptotic analysis. An interesting consequence of our results is to further clarify the connection between differential privacy and robust statistics. In particular, we demonstrate that differential privacy is a weaker requirement than infinitesimal robustness and show that robust M-estimators can be easily randomized in order to guarantee both differential privacy and robustness towards the presence of contaminated data. We illustrate our results on both simulated and real data.

Tian Zheng (Columbia)
5/1/19 Elections
5/8/19 Elections