Statistics Seminar Series – Spring 2014

Semester Schedule: Statistics – Spring 2014
Seminars are on Mondays
Time: 12:00 – 1:00 PM
Location: Room 903, 1255 Amsterdam Avenue
Tea and coffee will be served before the seminar at 11:30 AM, Room 1025

Feb 17: Rebecca C. Steorts (CMU)
Title: Will the Real Steve Fienberg Please Stand Up: Getting to Know a Population From Multiple Incomplete Files
Abstract: We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey, where the tracking of individuals across waves lets us objectively assess the accuracy of our record linkage.

Feb 24: Alexandra Chronopoulou, CUNY
Title: Statistical Inference for fractional SDEs and applications.
Abstract: Stochastic differential equations driven by fractional Brownian motion have an increasing presence in a wide range of applications, as they can successfully model phenomena that are characterized by long memory and/or self-similarity. In this talk, we will review their basic theoretical properties, focus on the statistical inference of their parameters and discuss particular applications in mathematical finance.

March 3: Noureddine El Karoui, UC Berkeley

March 10: Yi Yu, University of Cambridge
Title: Fused community detection
Abstract: Community detection is one of the most widely studied problems in network research. In an undirected graph, communities are regarded as tightly-knit groups of nodes with comparatively few connections between them. Popular existing techniques, such as spectral clustering and variants thereof, rely heavily on the edges being sufficiently dense and the community structure being relatively obvious. These are often not satisfactory assumptions for large-scale real-world datasets. We therefore propose a new community detection method, called fused community detection (fcd), which is designed particularly for sparse networks and situations where the community structure may be opaque. The spirit of fcd is to take advantage of the edge information, which we exploit by borrowing sparse recovery techniques from regression problems. Our method is supported by both theoretical results and numerical evidence. The algorithms are implemented in the R package fcd, which is available on CRAN. This is joint work with Dr. Yang Feng (Columbia University) and Prof. Richard J. Samworth (University of Cambridge).

March 17: Spring Recess

March 24: Zongming Ma, UPENN
Title: Estimating High-dimensional Matrices: Convex Geometry and Computational Barriers
Abstract: In this talk, we introduce a unified approach for studying estimation of high-dimensional matrices, which yields tight non-asymptotic minimax rates for a large collection of loss functions in a variety of problems. Based on the convex geometry of finite-dimensional Banach spaces, the minimax rates of the oracle (unconstrained) matrix denoising problem are determined for all unitarily invariant norms. This result is then extended to denoising with submatrix sparsity, where the excess risk depends on the sparsity constraints in a completely different manner.
The approach is also applicable to matrix completion under low-rank constraint and extends beyond the normal mean model. In addition, we study an example where attaining the minimax rate is provably hard in a complexity-theoretic sense. This observation reveals that there can exist a significant gap between the statistical fundamental limit and what can be achieved by computationally efficient procedures. This talk is based on joint work with Yihong Wu (UIUC).

March 31: Jiashun Jin, CMU
Title: Fast Network Community Detection by SCORE
Abstract: Consider a network where the nodes split into K different communities. The community labels for the nodes are unknown and it is of major interest to estimate them (i.e., community detection). The Degree Corrected Block Model (DCBM) is a popular network model. How to detect communities with the DCBM is an interesting problem, where the main challenge lies in the degree heterogeneity. We propose Spectral Clustering On Ratios-of-Eigenvectors (SCORE) as a new approach to community detection. Compared to existing spectral methods, the main innovation is to use the entry-wise ratios between the first few leading eigenvectors for community detection. The central surprise is that the effect of degree heterogeneity is largely ancillary, and can be effectively removed by taking such entry-wise ratios. We have applied SCORE to the well-known web blogs data and the statistics co-author network data which we have collected very recently. We find that SCORE is competitive both in computation and in performance. On top of that, SCORE is conceptually simple and has the potential for extensions in various directions. Additionally, we have identified several interesting communities in statisticians, including what we call the “Object Bayesian community”, “Theoretic Machine Learning Community”, and the “Dimension Reduction Community”. We develop a theoretic framework where we show that under mild regularity conditions, SCORE stably yields consistent community detection.
At the core of the analysis is the recent development in Random Matrix Theory (RMT), where the matrix-form Bernstein inequality is especially helpful.

April 7: Yee Whye Teh

April 14: Grant Weller, CMU
Title: Inference for Hidden Regular Variation in Multivariate Extremes
Abstract: A fundamental deficiency of classical multivariate extreme value theory is the inability to distinguish between asymptotic independence and exact independence. In this work, we examine multivariate threshold exceedance modeling in the framework of regular variation. Under this framework, dependence in the tail of a distribution is described by a limiting measure, which in some cases is degenerate on joint tail regions despite possible dependence in such regions at finite levels. Hidden regular variation, a higher-order tail decay on these regions, offers a refinement of the classical theory. We develop a representation of random vectors possessing hidden regular variation as the sum of independent regularly varying components. The representation is shown to be asymptotically valid via a multivariate tail equivalence result. We develop a likelihood-based estimation procedure from this representation via a Monte Carlo expectation-maximization algorithm which has been modified for tail estimation. The methodology is demonstrated on simulated data and applied to a bivariate series of air pollution measurements.

April 21: Mary Meyer, Colorado State University
Title: Variable and Shape Selection in the Generalized Additive Model
Abstract: The partial linear generalized additive model is considered, where the goal is to choose a subset of predictor variables and describe the component relationships with the response, in the case where there is very little a priori information. For each predictor, the user need only specify a set of possible shape or order restrictions.
For example, the systematic component associated with a continuous predictor might be assumed to be increasing, decreasing, convex, or concave. The effect of a treatment variable might have a tree ordering or be unordered. A model selection method chooses the nature of the relationships as well as the variables. Given a set of predictors and shape or order restrictions, the maximum likelihood estimator for the constrained generalized additive model is found using iteratively re-weighted cone projections. The cone information criterion (CIC) is used to select the best combination of variables and shapes. Because the shapes and orderings impose some degree of smoothness, no tuning parameters are required. The methods are applied to several classical data sets, and simulations show that the model selection criterion performs well in comparison to the AIC with parametric assumptions. The R package cgam contains routines and data sets for the methods and examples.

April 28: Liza Levina, University of Michigan
Title: Fast Community Detection in Large Sparse Networks
Abstract: Community detection is one of the fundamental problems in network analysis, with many diverse applications, and a lot of work has been done on models and algorithms that find communities. Perhaps the most commonly used probabilistic model for a network with communities is the stochastic block model, and many algorithms for fitting it have been proposed. Since finding communities involves optimizing over all possible assignments of discrete labels, most existing algorithms do not scale well to large networks, and many fail on sparse networks. In this talk, we propose a pseudo-likelihood approach for fitting the stochastic block model to address these shortcomings. Pseudo-likelihood is a general statistical principle that involves trading off some of the model complexity against computational efficiency.
We also derive a variant that allows for arbitrary degree distributions in the network, making it suitable for fitting the more flexible degree-corrected stochastic block model. The pseudo-likelihood algorithm scales easily to networks with millions of nodes, performs well empirically under a range of settings, including on very sparse networks, and is asymptotically consistent under reasonable conditions. If time allows, I will also discuss spectral clustering with perturbations, a new method of independent interest we use to initialize pseudo-likelihood, which works well on sparse networks where regular spectral clustering fails.

May 5: Juerg Huesler, University of Bern
Title: On high exceedances and excursions of Gaussian processes
Abstract: We consider a Gaussian process which crosses a high threshold and investigate its path behaviour above such a threshold. Such excursions are usually rather short, depending on the high crossing level u, with u → ∞. If we lower the level, one observes longer excursions. In particular, if one conditions on two exceedances above the threshold which are separated in time by a positive quantity, the excursions may be longer, not tending to 0. The pattern depends on the correlation function and on this separation. Such patterns will be discussed in detail and the limiting process will be investigated. We also consider the case where the exceedances are above two different large thresholds, which might occur, for instance, with double ruins or double shocks, such as earthquakes.