Research Group "Stochastic Algorithms and Nonparametric Statistics"
Research Seminar "Mathematical Statistics" Summer Semester 2016
Place: Weierstrass Institute for Applied Analysis and Stochastics, Erhard-Schmidt-Hörsaal, Mohrenstraße 39, 10117 Berlin
Time: Wednesdays, 10.00 a.m. - 12.30 p.m.
Date | Speaker | Location | Title
20.04.16 | Marc Hallin (L'Université Libre de Bruxelles) | | Monge-Kantorovich ranks and signs
27.04.16 | Prof. Gilles Blanchard (Universität Potsdam) & Prof. Markus Reiß (Humboldt-Universität zu Berlin) | | Is adaptive early stopping possible in statistical inverse problems?
04.05.16 | Prof. Vladimir Spokoiny (WIAS Berlin) | | Bootstrap tuned model selection
11.05.16 | Prof. Sokbae Lee (Seoul National University) | Mohrenstr. 39, R. 406 | Oracle estimation of a change point in high dimensional quantile regression
18.05.16 | Rémi Gribonval (INRIA, Rennes) | | Dictionary learning: Principles, algorithms, guarantees
25.05.16 | Dr. Yury Maximov (IITP, Russian Academy of Sciences) | | Structured semi-definite programming with applications to non-Gaussian component analysis
01.06.16 | No Seminar | |
08.06.16 | Dr. Pierre Bellec (CREST, Malakoff, France) | | Slope meets Lasso: improved oracle bounds and optimality
15.06.16 | No Seminar | Mohrenstr. 39, R. 406 |
22.06.16 | Dr. Jakob Söhl (University of Cambridge) | | Nonparametric Bayesian posterior contraction rates for discretely observed scalar diffusions
29.06.16 | No Seminar | |
06.07.16 | Arthur Gretton (Gatsby Computational Neuroscience Unit, UCL, UK) | | Nonparametric statistical tests using kernels
13.07.16 | Prof. V. Ulyanov (Lomonosov Moscow State University, Russian Federation) | | Non-asymptotic analysis - general approach
20.07.16 | Dr. Francis Bach (INRIA, Paris) | | Beyond stochastic gradient descent for large-scale machine learning
27.07.16 | | |
03.08.16 | | |
last reviewed: June 14, 2016, by Christine Schneider
Marc Hallin (L'Université Libre de Bruxelles)
Monge-Kantorovich ranks and signs
Abstract: Unlike the real line, the real space $\mathbb{R}^K$, $K \ge 2$, is not "naturally" ordered. As a consequence, such fundamental univariate concepts as quantile and distribution functions, ranks, and signs, all order-related, do not straightforwardly extend to the multivariate context. Since no universal pre-existing order exists, each distribution, each data set, has to generate its own: the rankings behind sensible concepts of multivariate quantiles, ranks, or signs will inherently be distribution-specific and, in empirical situations, data-driven. Many proposals have been made in the literature for such orderings, all extending some aspects of the univariate concepts but failing to preserve the essential properties that make classical rank-based inference a major inferential tool in the analysis of semiparametric models where the density of some underlying noise remains unspecified: (i) exact distribution-freeness and (ii) asymptotic semiparametric efficiency; see Hallin and Werker (2003).
Ranks and signs, and the resulting inference methods, are well understood and well developed, essentially, in two cases: one-dimensional observations and elliptically symmetric ones. We start by establishing the close connection, in those two cases, between classical ranks and signs and measure transportation results, showing that the rank transformation there actually reduces to an empirical version of the unique gradient of a convex function mapping a distribution to the uniform over the unit ball. That fact, along with a result by McCann (1995), itself extending Brenier's celebrated polar factorization theorem (Brenier 1991), is then exploited to define fully general concepts of ranks and signs, called the Monge-Kantorovich ranks and signs, which coincide, in the univariate and elliptical settings, with the traditional concepts and enjoy, under completely unspecified (absolutely continuous) d-dimensional distributions, the essential properties that make traditional rank-based inference an essential part of semiparametric inference.
Reference
Chernozhukov V., A. Galichon, M. Hallin, and M. Henry (2015) Monge-Kantorovich depth, quantiles,
ranks, and signs, Annals of Statistics, to appear.
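To make the empirical construction concrete, here is a small toy sketch (my own illustration, not code from the talk): the sample is optimally matched, for the squared-Euclidean cost, to a reference grid drawn uniformly on the unit ball, and the matched reference point plays the role of the Monge-Kantorovich rank/sign of each observation. The sample size, the random reference grid and the use of scipy's assignment solver are assumptions made only for this illustration.

# Toy sketch: empirical Monge-Kantorovich ranks and signs via optimal assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.standard_normal((n, d)) @ np.array([[1.0, 0.5], [0.0, 1.0]])  # sample data

# Reference grid: n points drawn uniformly on the unit ball.
U = rng.standard_normal((n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
U *= rng.uniform(0, 1, size=(n, 1)) ** (1.0 / d)

# Optimal coupling for the squared-Euclidean cost; the matched reference point
# serves as the empirical MK rank/sign of the corresponding observation.
cost = cdist(X, U, metric="sqeuclidean")
row, col = linear_sum_assignment(cost)
mk_rank = np.empty_like(U)
mk_rank[row] = U[col]

radii = np.linalg.norm(mk_rank, axis=1)   # "rank" part (distance from the center)
signs = mk_rank / radii[:, None]          # "sign" part (direction)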
Gilles Blanchard (Universität Potsdam) &
Markus Reiß (Humboldt-Universität zu Berlin)
Is adaptive early stopping possible in statistical inverse problems?
Abstract: We consider a standard setting of statistical inverse problems, taking the form of the Gaussian sequence model with D observed noisy coefficients. Consider the simple family of "keep or kill" estimators depending on a cutoff index $k_0$.
For the choice of the cutoff index, there exists a number of well-known methods achieving oracle adaptivity (i.e. a data-dependent choice of $k_0$ whose performance is comparable to the unknown optimal one), such as penalization and Lepski's method. However, they have in common that the estimators for all values of $k_0$ have to be computed first and compared to each other in some way. Contrast this to an early stopping approach, where we would like to compute the estimators iteratively for $k_0 = 1, 2, \dots$ and have to decide to stop at some point without being allowed to compute the following estimators. Is oracle adaptivity possible then? This question is motivated by settings where computing estimators for larger $k_0$ requires more computational cost; furthermore, some form of early stopping is most often used in practice. We propose a precise mathematical formulation of this question and provide upper and lower bounds on what is achievable.
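For intuition only, here is a toy sketch of the setting (the stopping rule below is a simple discrepancy-type rule chosen for illustration, not the speakers' procedure): keep-or-kill estimators in the Gaussian sequence model, with $k_0$ increased sequentially until the residual energy of the discarded coefficients is compatible with the noise level alone.

# Toy sketch: sequential (early-stopping style) choice of the cutoff index k0.
import numpy as np

rng = np.random.default_rng(1)
D, delta = 1000, 0.05
k_idx = np.arange(1, D + 1)
theta = k_idx ** (-1.0)                     # a smooth signal
Y = theta + delta * rng.standard_normal(D)  # observed noisy coefficients

def keep_or_kill(Y, k0):
    est = np.zeros_like(Y)
    est[:k0] = Y[:k0]
    return est

# Increase k0 and stop once the residual energy of the discarded coefficients
# is no larger than what the noise level alone would explain.
kappa = 1.0
for k0 in range(1, D + 1):
    residual = np.sum(Y[k0:] ** 2)
    if residual <= kappa * delta ** 2 * (D - k0):
        break

theta_hat = keep_or_kill(Y, k0)
risks = [np.sum((keep_or_kill(Y, k) - theta) ** 2) for k in range(1, D + 1)]
print("stopped at k0 =", k0, "| oracle k0 =", 1 + int(np.argmin(risks)))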
Prof. Vladimir Spokoiny
Bootstrap tuned model selection
Abstract: In the problem of model selection for a given family of linear estimators $\tilde\theta_m$, $m \in \mathcal{M}$, ordered by their variance, we offer a new "smallest accepted" approach motivated by Lepski's device and the multiple testing idea. The procedure selects the smallest model which satisfies the acceptance rule based on comparison with all larger models. The method is completely data-driven and does not use any prior information about the variance structure of the noise: its parameters are adjusted to the underlying, possibly heterogeneous, noise by the so-called "propagation condition" using a wild bootstrap method. The validity of the bootstrap calibration is proved for finite samples with an explicit error bound.
We provide a comprehensive theoretical study of the method and describe in detail the set of possible values of the selected parameter $\hat m$. We also establish precise oracle error bounds for the corresponding estimator $\hat\theta = \tilde\theta_{\hat m}$, which apply equally to estimation of the whole parameter vector, its subvectors or linear mappings, as well as estimation of a linear functional.
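A heavily simplified toy rendering of the "smallest accepted" idea (my own sketch; the polynomial basis, the nested projections and the bootstrap calibration below are illustrative choices, not the paper's exact procedure): nested projection estimators are compared with all larger models, with critical values obtained from a wild (multiplier) bootstrap of the residuals.

# Toy sketch: smallest accepted model with wild-bootstrap critical values.
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = np.linspace(0, 1, n)
f = np.sin(6 * np.pi * x) * np.exp(-2 * x)          # unknown regression function
sigma = 0.2 + 0.3 * x                                # heterogeneous noise levels
y = f + sigma * rng.standard_normal(n)

def proj(m):
    """Projection matrix onto a polynomial basis of dimension m."""
    B = np.vander(x, m, increasing=True)
    return B @ np.linalg.pinv(B)

models = list(range(2, 16))                          # ordered family of models
P = {m: proj(m) for m in models}
fit = {m: P[m] @ y for m in models}

r = y - fit[models[-1]]                              # residuals of the largest model
B_boot, alpha = 300, 0.05
W = rng.standard_normal((B_boot, n)) * r             # wild-bootstrap noise replicas

def critical(m, m2):
    diff = W @ (P[m2] - P[m]).T                      # pure-noise discrepancies
    return np.quantile(np.sum(diff ** 2, axis=1), 1 - alpha)

def accepted(m):
    return all(np.sum((fit[m2] - fit[m]) ** 2) <= critical(m, m2)
               for m2 in models if m2 > m)

m_hat = next(m for m in models if accepted(m))       # smallest accepted dimension
print("selected model dimension:", m_hat)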
Prof. Sokbae Lee (Seoul National University)
Oracle estimation of a change point in high dimensional quantile regression
Abstract: In this paper, we consider a high-dimensional quantile regression model where the sparsity structure may differ between the two sub-populations. We develop $\ell_1$-penalized estimators of both the regression coefficients and the threshold parameter. Our penalized estimators not only select covariates but also discriminate between a model with homogeneous sparsity and a model with a change point. As a result, it is not necessary to know or pretest whether the change point is present, or where it occurs. Our estimator of the change point achieves an oracle property in the sense that its asymptotic distribution is the same as if the unknown active sets of regression coefficients were known. Importantly, we establish this oracle property without perfect covariate selection, thereby avoiding the need for a minimum level condition on the signals of active covariates. Dealing with high-dimensional quantile regression with an unknown change point calls for a new proof technique, since the quantile loss function is non-smooth and, furthermore, the corresponding objective function is non-convex with respect to the change point. The technique developed in this paper is applicable to a general M-estimation framework with a change point, which may be of independent interest. The proposed methods are then illustrated via Monte Carlo experiments and an application to tipping in the dynamics of racial segregation.
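As a rough illustration of the estimation strategy (a sketch under my own assumptions on tuning, data generation and the use of scikit-learn, not the authors' implementation), one can combine a grid search over candidate change points with $\ell_1$-penalized quantile regression on the augmented design:

# Toy sketch: change-point quantile regression via grid search + l1 penalty.
import numpy as np
from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import mean_pinball_loss

rng = np.random.default_rng(4)
n, p, tau, t0 = 400, 50, 0.5, 0.3
X = rng.standard_normal((n, p))
q = rng.uniform(size=n)                          # observed threshold variable
beta = np.zeros(p); beta[:3] = 1.0               # sparse baseline coefficients
delta = np.zeros(p); delta[:2] = 1.5             # sparse shift after the change point
y = X @ beta + (X @ delta) * (q > t0) + rng.standard_normal(n)

alpha = 0.01                                     # l1 penalty level (illustrative)
def fit_at(t):
    Z = np.hstack([X, X * (q > t)[:, None]])     # augmented design for threshold t
    m = QuantileRegressor(quantile=tau, alpha=alpha, solver="highs").fit(Z, y)
    obj = mean_pinball_loss(y, m.predict(Z), alpha=tau) + alpha * np.abs(m.coef_).sum()
    return obj, m

grid = np.quantile(q, np.linspace(0.1, 0.9, 17))  # candidate change points
objs = [fit_at(t)[0] for t in grid]
t_hat = grid[int(np.argmin(objs))]
print("estimated change point:", round(float(t_hat), 3), "(true:", t0, ")")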
Rémi Gribonval (INRIA, Rennes)
Dictionary learning: Principles, algorithms, guarantees
Abstract: Sparse modeling has become highly popular in signal processing and machine learning, where
many tasks can be expressed as under-determined linear inverse problems. Together with a growing family
of low-dimensional signal models, sparse models expressed with signal dictionaries have given rise to a rich
set of algorithmic principles combining provably good performance with bounded complexity. In practice,
from denoising to inpainting and super-resolution, applications require choosing a good dictionary. This
key step can be empirically addressed through data-driven principles known as dictionary learning. In this
talk I will draw a panorama of dictionary learning for low-dimensional modeling. After reviewing the basic
empirical principles of dictionary learning and related matrix factorizations such as PCA, K-means and
NMF, we will discuss techniques to learn dictionaries with controlled computational efficiency, as well as
a series of recent theoretical results establishing the statistical significance of learned dictionaries even in
the presence of noise and outliers.
(based on joint work with F. Bach, R. Jenatton, L. Le Magoarou, M. Kleinsteuber, and M. Seibert).
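A minimal sketch of the empirical principle (the synthetic data, hyperparameters and the scikit-learn implementation are illustrative assumptions, not choices from the talk): learn a dictionary and sparse codes by alternating sparse coding and dictionary updates.

# Toy sketch: dictionary learning on synthetic sparse signals.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(5)
n_samples, n_features, n_atoms, sparsity = 2000, 64, 100, 5

# Synthetic data: each signal is a sparse combination of atoms of a hidden dictionary.
D_true = rng.standard_normal((n_atoms, n_features))
D_true /= np.linalg.norm(D_true, axis=1, keepdims=True)
codes = np.zeros((n_samples, n_atoms))
for i in range(n_samples):
    idx = rng.choice(n_atoms, size=sparsity, replace=False)
    codes[i, idx] = rng.standard_normal(sparsity)
X = codes @ D_true + 0.01 * rng.standard_normal((n_samples, n_features))

# Learn an overcomplete dictionary and sparse codes by alternating minimization.
dl = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=0.1,
                                 batch_size=64, random_state=0)
coeffs = dl.fit_transform(X)          # sparse codes of the training signals
D_hat = dl.components_                # learned dictionary (one atom per row)
print("average number of nonzero coefficients:", np.mean(np.sum(coeffs != 0, axis=1)))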
Dr. Yury Maximov (IITP, Russian Academy of Sciences)
Structured semi-definite programming with applications to non-Gaussian component analysis
Abstract: Semi-definite programming (SDP) is a popular tool for the approximation of non-convex quadratic problems arising in many statistical and computer-science applications. Although worst-case optimal, SDP is often outperformed on well-structured (practical) problems by domain-specific methods and heuristics. Another drawback of SDP is its slow computation time, which makes it hardly applicable to huge-scale problems. In this talk we try to incorporate problem structure into the semi-definite dual in order to both decrease the computation time and improve the approximation guarantees. A major application of the proposed structured semi-definite programming is a novel convex-projection-based method for non-Gaussian component analysis (NGCA), which is especially efficient for projecting sparse and low-rank high-dimensional data.
The talk is based on joint research with Yu. Nesterov (CORE/UCL, Belgium) and V. Spokoiny (WIAS, Berlin).
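For background, here is a generic (unstructured) example of the kind of semi-definite relaxation under discussion, written with cvxpy; it is not the structured dual of the talk, only the standard SDP relaxation of the non-convex quadratic problem max x'Ax over x in {-1,+1}^n, followed by a simple randomized rounding.

# Generic illustration: SDP relaxation of a non-convex quadratic problem.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
n = 20
A = rng.standard_normal((n, n)); A = (A + A.T) / 2       # symmetric cost matrix

X = cp.Variable((n, n), symmetric=True)                  # relaxation of x x'
constraints = [X >> 0, cp.diag(X) == 1]
prob = cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints)
prob.solve()

# Randomized rounding: a Gaussian sample from the relaxed solution gives a feasible x.
w, V = np.linalg.eigh(X.value)
L = V @ np.diag(np.sqrt(np.clip(w, 0, None)))            # factor X = L L'
x = np.sign(L @ rng.standard_normal(n))
print("SDP upper bound:", prob.value, "| rounded value:", x @ A @ x)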
Dr. Pierre Bellec (Centre for Research in Economics and Statistics (CREST), Malakoff, France)
Slope meets Lasso: improved oracle bounds and optimality
Abstract: We show that two polynomial-time methods, a Lasso estimator with adaptively chosen tuning parameter and a Slope estimator, adaptively achieve the exact minimax prediction and $\ell_2$ estimation rate $(s/n)\log(p/s)$ in high-dimensional linear regression on the class of $s$-sparse target vectors in $\mathbb{R}^p$. This is done under the Restricted Eigenvalue (RE) condition for the Lasso and under a slightly more constraining assumption on the design for the Slope. The main results have the form of sharp oracle inequalities accounting for the model misspecification error. The minimax optimal bounds are also obtained for the $\ell_q$ estimation errors with $1 \le q \le 2$ when the model is well-specified. The results are non-asymptotic and hold both in probability and in expectation. The assumptions that we impose on the design are satisfied with high probability for a large class of random matrices with independent and possibly anisotropically distributed rows. We give a comparative analysis of the conditions under which oracle bounds for the Lasso and Slope estimators can be obtained. In particular, we show that several known conditions, such as the RE condition and the sparse eigenvalue condition, are equivalent if the $\ell_2$-norms of the regressors are uniformly bounded.
(Joint work with Guillaume Lecué and Alexandre Tsybakov.)
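For reference, the two estimators compared in the talk can be written as follows (standard definitions; the displayed Slope weight choice is the usual one and is stated here as an assumption, up to the paper's exact normalization):
\[
\hat\beta^{\mathrm{Lasso}} \in \arg\min_{\beta\in\mathbb{R}^p} \frac{1}{n}\|y-X\beta\|_2^2 + 2\lambda\|\beta\|_1,
\qquad
\hat\beta^{\mathrm{Slope}} \in \arg\min_{\beta\in\mathbb{R}^p} \frac{1}{n}\|y-X\beta\|_2^2 + 2\sum_{j=1}^{p}\lambda_j|\beta|_{(j)},
\]
where $|\beta|_{(1)}\ge\dots\ge|\beta|_{(p)}$ are the ordered absolute entries of $\beta$ and the Slope weights are typically taken as $\lambda_j \asymp \sigma\sqrt{\log(2p/j)/n}$; both estimators attain the minimax prediction rate of order $\sigma^2(s/n)\log(p/s)$ over $s$-sparse vectors.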
Dr. Jakob Söhl (University of Cambridge)
Nonparametric Bayesian posterior contraction rates for discretely observed scalar diffusions
Abstract: We consider nonparametric Bayesian inference in a reflected diffusion model $dX_t = b (X_t)dt + \sigma(X_t) dW_t,$ with discretely sampled observations $X_0, X_\Delta, \dots, X_{n\Delta}$. We analyse the nonlinear inverse problem corresponding to the `low frequency sampling' regime where $\Delta>0$ is fixed and $n \to \infty$. A general theorem is proved that gives conditions for prior distributions $\Pi$ on the diffusion coefficient $\sigma$ and the drift function $b$ that ensure minimax optimal contraction rates of the posterior distribution over H\"older-Sobolev smoothness classes. These conditions are verified for natural examples of nonparametric random wavelet series priors. For the proofs we derive new concentration inequalities for empirical processes arising from discretely observed diffusions that are of independent interest.
Prof. Éric Moulines (École Polytechnique, France)
Stochastic optimization and high-dimensional sampling
Abstract: Recently, the problem of designing MCMC samplers adapted to high-dimensional Bayesian inference with sensible theoretical guarantees has received a lot of interest. The applications are numerous, including large-scale inference in machine learning, Bayesian nonparametrics, Bayesian inverse problems, and aggregation of experts, among others. When the log-density is L-smooth (continuously differentiable with a Lipschitz derivative), we will advocate the use of a "rejection-free" algorithm based on the Euler discretization of the Langevin diffusion with either constant or decreasing step sizes. We will present several new results on convergence to stationarity under different conditions on the log-density (from the weakest, bounded oscillations on a compact set and super-exponential tails, to the strongest, strong concavity). When the log-density is not smooth (a problem which typically appears when using sparsity-inducing priors, for example), we still suggest using an Euler discretization, but of the Moreau envelope of the non-smooth part of the log-density. An importance sampling correction may later be applied to correct the target. Several numerical illustrations will be presented to show that this algorithm (named MYULA) can be used in practice in a high-dimensional setting. Finally, non-asymptotic convergence bounds (in total variation and Wasserstein distances) are derived.
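A minimal sketch of the unadjusted ("rejection-free") Langevin algorithm on a toy smooth target (the target, step size and chain length are illustrative assumptions; MYULA additionally replaces the non-smooth part of the log-density by the gradient of its Moreau envelope):

# Toy sketch: unadjusted Langevin algorithm (ULA) with constant step size.
import numpy as np

rng = np.random.default_rng(7)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
Prec = np.linalg.inv(Sigma)

def grad_log_pi(x):
    return -Prec @ x              # gradient of the log-density of N(0, Sigma)

gamma, n_iter, burn = 0.05, 20000, 2000
x = np.zeros(2)
samples = np.empty((n_iter, 2))
for k in range(n_iter):
    x = x + gamma * grad_log_pi(x) + np.sqrt(2 * gamma) * rng.standard_normal(2)
    samples[k] = x

print("empirical covariance:\n", np.cov(samples[burn:].T))  # close to Sigma up to O(gamma) bias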
Arthur Gretton (Gatsby Computational Neuroscience Unit, UCL, UK)
Nonparametric statistical tests using kernels
Abstract: I will describe a kernel approach to hypothesis testing, based on a representation of probability
distributions in a reproducing kernel Hilbert space. I will first derive a metric as the distance between these
representations. Next, I will describe both tests of homogeneity (of whether two samples are from the same
distribution), and of independence (of whether a joint distribution factorises into a product of marginals).
More recent work will follow, including statistical tests for random processes (for instance, to compare
marginal distributions of Markov chains), tests for multi-variable interaction, and a test of goodness-of-fit
based on Stein's method. This last test can be used even if the reference distribution is known only up to
a normalising constant, making it suited to benchmarking MCMC and to statistical model criticism.
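A small self-contained sketch of a kernel two-sample test (the bandwidth choice, sample sizes and permutation calibration are illustrative assumptions, not the exact procedures of the talk): the biased MMD^2 statistic with a Gaussian kernel, compared against a permutation null.

# Toy sketch: MMD two-sample test with a Gaussian kernel and permutation p-value.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(8)
X = rng.standard_normal((100, 2))                          # sample from P
Y = rng.standard_normal((100, 2)) + np.array([0.5, 0.0])   # sample from Q

def mmd2(X, Y, bw):
    k = lambda A, B: np.exp(-cdist(A, B, "sqeuclidean") / (2 * bw ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

Z = np.vstack([X, Y])
bw = np.median(cdist(Z, Z)[np.triu_indices(len(Z), 1)])    # median heuristic bandwidth
stat = mmd2(X, Y, bw)

# Permutation test: reshuffle the pooled sample to approximate the null distribution.
n, perms, null = len(X), 500, []
for _ in range(perms):
    idx = rng.permutation(len(Z))
    null.append(mmd2(Z[idx[:n]], Z[idx[n:]], bw))
print("MMD^2 =", stat, " p-value =", np.mean(np.array(null) >= stat))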
Dr. Francis Bach (INRIA, Paris)
Beyond stochastic gradient descent for large-scale machine learning
Abstract: Many machine learning and statistics problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations ("large n") and each of these is large ("large p"). In this setting, online algorithms such as stochastic gradient descent, which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. Given n observations/iterations, the optimal convergence rates of these algorithms are O(1/\sqrt{n}) for general convex functions and reach O(1/n) for strongly convex functions. In this talk, I will show how the smoothness of loss functions may be used to design novel simple algorithms with improved behavior, both in theory and practice: in the ideal infinite-data setting, an efficient novel Newton-based stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions. (Joint work with Alexandre Defossez, Aymeric Dieuleveut, Nicolas Flammarion, and Eric Moulines.)
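As a toy illustration of one ingredient of this line of work (the data, step size and single-pass setup are my own assumptions, not the speaker's algorithm), here is constant-step-size stochastic gradient descent with Polyak-Ruppert averaging for least-squares regression, a smooth problem for which averaged SGD is known to behave well after a single pass:

# Toy sketch: single-pass SGD with iterate averaging for least squares.
import numpy as np

rng = np.random.default_rng(9)
n, p = 20_000, 10
X = rng.standard_normal((n, p))
w_star = rng.standard_normal(p)
y = X @ w_star + 0.5 * rng.standard_normal(n)

gamma = 0.5 / np.mean(np.sum(X ** 2, axis=1))    # constant step ~ 1/(2 R^2)
w = np.zeros(p)
w_bar = np.zeros(p)
for i in range(n):                               # single pass over the data
    grad = (X[i] @ w - y[i]) * X[i]              # stochastic gradient of the squared loss
    w -= gamma * grad
    w_bar += (w - w_bar) / (i + 1)               # running average of the iterates

print("error of last iterate :", np.linalg.norm(w - w_star))
print("error of averaged SGD :", np.linalg.norm(w_bar - w_star))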