Research Group "Stochastic Algorithms and Nonparametric Statistics"
Research Seminar "Mathematical Statistics" Summer Semester 2016
Place: Weierstrass Institute for Applied Analysis and Stochastics
Erhard-Schmidt-Hörsaal, Mohrenstraße 39, 10117 Berlin
Time: Wednesdays, 10.00 a.m. - 12.30 p.m.
20.04.16 
Marc Hallin (Université Libre de Bruxelles)


Monge-Kantorovich ranks and signs

27.04.16 
Prof. Gilles Blanchard (Universität Potsdam) &
Prof. Markus Reiß (Humboldt-Universität zu Berlin)


Is adaptive early stopping possible in statistical inverse problems?

04.05.16 
Prof. Vladimir Spokoiny (WIAS Berlin) 

Bootstrap tuned model selection

11.05.16 
Prof. Sokbae Lee (Seoul National University) 
Mohrenstr.39, R.406 
Oracle estimation of a change point in high dimensional quantile regression 
18.05.16 
Rémi Gribonval (INRIA, Rennes)


Dictionary learning: Principles, algorithms, guarantees

25.05.16 
Dr. Yury Maximov (IITP, Russian Academy of Sciences) 

Structured semidefinite programming with applications to non-Gaussian component analysis
01.06.16

No Seminar 


08.06.16 
Dr. Pierre Bellec (CREST, Malakoff, France) 

Slope meets Lasso: improved oracle bounds and optimality

15.06.16 
No Seminar 

22.06.16 
Dr. Jakob Söhl (University of Cambridge) 

Nonparametric Bayesian posterior contraction rates for discretely observed scalar diffusions

29.06.16 
No Seminar 


06.07.16 
Arthur Gretton (Gatsby Computational Neuroscience Unit, UCL, UK)

Nonparametric statistical tests using kernels 
13.07.16 
Prof. V. Ulyanov (Lomonosov Moscow State University, Russian Federation)

Non-asymptotic analysis: a general approach
20.07.16 
Dr. Francis Bach (INRIA, Paris) 

Beyond stochastic gradient descent for large-scale machine learning
27.07.16 



03.08.16 



last reviewed: June 14, 2016, by Christine Schneider
Marc Hallin (Université Libre de Bruxelles)
Monge-Kantorovich ranks and signs
Abstract: Unlike the real line, the real space R^K, K >= 2, is not "naturally" ordered. As a consequence, such
fundamental univariate concepts as quantile and distribution functions, ranks, and signs, all order-related, do
not straightforwardly extend to the multivariate context. Since no universal pre-existing order exists, each
distribution, each data set, has to generate its own: the rankings behind sensible concepts of multivariate
quantiles, ranks, or signs inherently will be distribution-specific and, in empirical situations, data-driven.
Many proposals have been made in the literature for such orderings, all extending some aspects of the
univariate concepts but failing to preserve the essential properties that make classical rank-based inference
a major inferential tool in the analysis of semiparametric models where the density of some underlying
noise remains unspecified: (i) exact distribution-freeness, and (ii) asymptotic semiparametric efficiency; see
Hallin and Werker (2003).
Ranks and signs, and the resulting inference methods, are well understood and well developed, essentially,
in two cases: one-dimensional observations, and elliptically symmetric ones. We start by establishing the
close connection, in those two cases, between classical ranks and signs and measure transportation results,
showing that the rank transformation there actually reduces to an empirical version of the unique gradient
of a convex function mapping a distribution to the uniform over the unit ball.
That fact, along with a result by McCann (1995), itself extending Brenier's celebrated polar factorization
theorem (Brenier 1991), is then exploited to define fully general concepts of ranks and signs, called the
Monge-Kantorovich ranks and signs, coinciding, in the univariate and elliptical settings, with the traditional
concepts, and enjoying, under completely unspecified (absolutely continuous) d-dimensional distributions,
the essential properties that make traditional rank-based inference an essential part of semiparametric
inference.
Reference
Chernozhukov, V., A. Galichon, M. Hallin, and M. Henry (2015). Monge-Kantorovich depth, quantiles,
ranks, and signs, Annals of Statistics, to appear.
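In the univariate setting described above, the monotone (gradient-of-a-convex-function) map pushing the empirical distribution to the uniform on [0, 1] is simply the empirical distribution function, so the Monge-Kantorovich ranks reduce to classical ranks divided by the sample size. A minimal numpy illustration of this reduction (the multivariate case requires solving an optimal transport problem and is not attempted here):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)   # a univariate sample

# In dimension one, the empirical CDF is the optimal (monotone) transport
# map to the uniform, so the transport-based rank of x_i is its classical
# rank divided by n.
n = len(x)
classical_ranks = np.argsort(np.argsort(x)) + 1   # ranks 1..n
mk_ranks = classical_ranks / n                    # values in (0, 1]
```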
Gilles Blanchard (Universität Potsdam) &
Markus Reiß (Humboldt-Universität zu Berlin)
Is adaptive early stopping possible in statistical inverse problems?
Abstract: We consider a standard setting of a statistical inverse problem, taking the form of the Gaussian
sequence model with D observed noisy coefficients. Consider the simple family of keep-or-kill estimators
depending on a cutoff index k0.
For the choice of the cutoff index, there exist a number of well-known methods achieving oracle adaptivity
(i.e., a data-dependent choice of k0 whose performance is comparable to the unknown optimal one), such as
penalization and Lepski's method. However, they have in common that the estimators for all values of k0
have to be computed first and compared to each other in some way. Contrast this with an early stopping
approach, where we would like to compute the estimators iteratively for k0 = 1, 2, ... and have to decide
to stop at some point, without being allowed to compute the subsequent estimators. Is oracle adaptivity
possible then? This question is motivated by settings where computing estimators for larger k0 incurs
higher computational cost; furthermore, some form of early stopping is most often used in practice. We
propose a precise mathematical formulation of this question and provide upper and lower bounds on what
is achievable.
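As a toy illustration of the early stopping question, the sketch below applies a generic discrepancy-principle rule to keep-or-kill estimators in the Gaussian sequence model: grow k0 and stop once the residual on the discarded coefficients falls to roughly its pure-noise level, never computing estimators beyond the stop. The signal, the rule, and the constant `tau` are illustrative assumptions, not the procedure analyzed in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
D, eps = 1000, 0.05
theta = 1.0 / np.arange(1, D + 1)            # a decaying "smooth" signal
y = theta + eps * rng.standard_normal(D)     # Gaussian sequence model

def early_stopped_cutoff(y, eps, tau=1.05):
    """Discrepancy-style stopping rule: stop as soon as the residual on
    the not-yet-kept coefficients is <= tau times its expected noise level."""
    D = len(y)
    residual = float(np.sum(y ** 2))         # residual for k0 = 0
    for k0 in range(D):
        if residual <= tau * eps ** 2 * (D - k0):
            return k0
        residual -= y[k0] ** 2               # "keep" one more coefficient
    return D

k0 = early_stopped_cutoff(y, eps)
theta_hat = np.where(np.arange(D) < k0, y, 0.0)   # keep-or-kill estimator
```

Note that the loop only ever inspects the residual for the current k0, so no estimator beyond the stopping index is computed.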
Prof. Vladimir Spokoiny
Bootstrap tuned model selection
Abstract: In the problem of model selection for a given family of linear estimators $\tilde\theta_m$, $m \in \mathcal{M}$,
ordered by their variance, we offer a new ``smallest accepted'' approach motivated by Lepski's device
and the multiple testing idea. The procedure selects the smallest model which satisfies the acceptance rule based
on comparison with all larger models. The method is completely data-driven and does not use any prior information about
the variance structure of the noise: its parameters are adjusted to the underlying, possibly heterogeneous, noise
by the so-called ``propagation condition'' using a wild bootstrap method. The validity of the bootstrap calibration is proved for finite samples with an explicit error bound.
We provide a comprehensive theoretical study of the method and describe in detail the set of possible values of the selected parameter $\hat m$.
We also establish some precise oracle error bounds for the corresponding estimator $\hat\theta = \tilde\theta_{\hat m}$,
which apply equally to estimation of the whole parameter vector, a subvector or a linear mapping, as well as estimation of a linear functional.
Prof. Sokbae Lee (Seoul National University)
Oracle estimation of a change point in high dimensional quantile regression
Abstract: In this paper, we consider a high-dimensional quantile regression model where the sparsity structure may differ between the two sub-populations. We develop $\ell_1$-penalized estimators of both the regression coefficients and the threshold parameter. Our penalized estimators not only select covariates but also discriminate between a model with homogeneous sparsity and a model with a change point. As a result, it is not necessary to know or pretest whether the change point is present, or where it occurs. Our estimator of the change point achieves an oracle property in the sense that its asymptotic distribution is the same as if the unknown active sets of regression coefficients were known. Importantly, we establish this oracle property without perfect covariate selection, thereby avoiding the need for a minimum signal level condition on the active covariates. Dealing with high-dimensional quantile regression with an unknown change point calls for a new proof technique, since the quantile loss function is non-smooth and, furthermore, the corresponding objective function is non-convex with respect to the change point. The technique developed in this paper is applicable to a general M-estimation framework with a change point, which may be of independent interest. The proposed methods are illustrated via Monte Carlo experiments and an application to tipping in the dynamics of racial segregation.
Rémi Gribonval (INRIA, Rennes)
Dictionary learning: Principles, algorithms, guarantees
Abstract: Sparse modeling has become highly popular in signal processing and machine learning, where
many tasks can be expressed as underdetermined linear inverse problems. Together with a growing family
of low-dimensional signal models, sparse models expressed with signal dictionaries have given rise to a rich
set of algorithmic principles combining provably good performance with bounded complexity. In practice,
from denoising to inpainting and super-resolution, applications require choosing a good dictionary. This
key step can be empirically addressed through data-driven principles known as dictionary learning. In this
talk I will draw a panorama of dictionary learning for low-dimensional modeling. After reviewing the basic
empirical principles of dictionary learning and related matrix factorizations such as PCA, K-means and
NMF, we will discuss techniques to learn dictionaries with controlled computational efficiency, as well as
a series of recent theoretical results establishing the statistical significance of learned dictionaries even in
the presence of noise and outliers.
(Based on joint work with F. Bach, R. Jenatton, L. Le Magoarou, M. Kleinsteuber, and M. Seibert.)
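For concreteness, here is a bare-bones alternating-minimization sketch of dictionary learning on synthetic sparse data: a crude hard-thresholding sparse coder alternated with a MOD-style least-squares dictionary update. Both steps are simplified stand-ins (real methods use OMP or $\ell_1$ coding and more careful updates), not the algorithms discussed in the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_atoms, n_samples, k = 20, 30, 500, 3

# Synthetic signals: k-sparse combinations of a ground-truth dictionary.
D_true = rng.standard_normal((n_features, n_atoms))
D_true /= np.linalg.norm(D_true, axis=0)
X_true = np.zeros((n_atoms, n_samples))
for j in range(n_samples):
    idx = rng.choice(n_atoms, size=k, replace=False)
    X_true[idx, j] = rng.standard_normal(k)
Y = D_true @ X_true

def sparse_code(D, Y, k):
    """Crude coder: keep, per sample, the k largest |correlations| with
    the atoms (a cheap stand-in for OMP)."""
    X = D.T @ Y
    cutoff = -np.sort(-np.abs(X), axis=0)[k - 1]   # k-th largest per column
    return np.where(np.abs(X) >= cutoff, X, 0.0)

# Alternate sparse coding and a MOD-style dictionary update.
D = rng.standard_normal((n_features, n_atoms))
D /= np.linalg.norm(D, axis=0)
for _ in range(20):
    X = sparse_code(D, Y, k)
    D = Y @ X.T @ np.linalg.pinv(X @ X.T)          # least-squares fit
    norms = np.linalg.norm(D, axis=0)
    D /= np.where(norms > 0, norms, 1.0)           # renormalize used atoms
```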
Dr. Yury Maximov (IITP, Russian Academy of Sciences)
Structured semidefinite programming with applications to non-Gaussian component analysis
Abstract: Semidefinite programming (SDP) is a popular tool for the approximation of non-convex quadratic problems arising in multiple statistical and computer science applications. Although known to be worst-case optimal, SDP is often outperformed on well-structured (practical) problems by domain-specific methods and heuristics. A further drawback of SDP is its slow computation time, which makes it hardly applicable to huge-scale problems. In this talk we incorporate problem structure into the semidefinite dual, both to decrease computation time and to improve the approximation guarantees. A major application of the proposed structured semidefinite programming is a novel convex projection-based method for non-Gaussian component analysis (NGCA), especially efficient for projecting sparse and low-rank high-dimensional data.
The talk is based on joint research with Yu. Nesterov (CORE/UCL, Belgium) and V. Spokoiny (WIAS, Berlin).
Dr. Pierre Bellec (Centre for Research in Economics and Statistics, Malakoff, France)
Slope meets Lasso: improved oracle bounds and optimality
Abstract: We show that two polynomial-time methods, a Lasso estimator with adaptively chosen tuning
parameter and a Slope estimator, adaptively achieve the exact minimax prediction and $\ell_2$ estimation rate
$(s/n)\log(p/s)$ in high-dimensional linear regression on the class of $s$-sparse target vectors in $R^p$. This is
done under the Restricted Eigenvalue (RE) condition for the Lasso and under a slightly more constraining
assumption on the design for the Slope. The main results have the form of sharp oracle inequalities
accounting for the model misspecification error. The minimax optimal bounds are also obtained for the $\ell_q$
estimation errors with $1 \le q \le 2$ when the model is well-specified. The results are non-asymptotic, and
hold both in probability and in expectation. The assumptions that we impose on the design are satisfied
with high probability for a large class of random matrices with independent and possibly anisotropically
distributed rows. We give a comparative analysis of conditions under which oracle bounds for the Lasso
and Slope estimators can be obtained. In particular, we show that several known conditions, such as the
RE condition and the sparse eigenvalue condition, are equivalent if the $\ell_2$-norms of the regressors are uniformly
bounded.
(Joint work with Guillaume Lecué and Alexandre Tsybakov.)
Dr. Jakob Söhl (University of Cambridge)
Nonparametric Bayesian posterior contraction rates for discretely observed scalar diffusions
Abstract: We consider nonparametric Bayesian inference in a reflected diffusion model $dX_t = b (X_t)dt + \sigma(X_t) dW_t,$ with discretely sampled observations $X_0, X_\Delta, \dots, X_{n\Delta}$. We analyse the nonlinear inverse problem corresponding to the `low frequency sampling' regime where $\Delta>0$ is fixed and $n \to \infty$. A general theorem is proved that gives conditions for prior distributions $\Pi$ on the diffusion coefficient $\sigma$ and the drift function $b$ that ensure minimax optimal contraction rates of the posterior distribution over H\"older-Sobolev smoothness classes. These conditions are verified for natural examples of nonparametric random wavelet series priors. For the proofs we derive new concentration inequalities for empirical processes arising from discretely observed diffusions that are of independent interest.
Prof. Éric Moulines (École Polytechnique, France)
Stochastic optimization and high-dimensional sampling
Abstract: Recently, the problem of designing MCMC samplers adapted to high-dimensional Bayesian inference with sensible theoretical guarantees has received a lot of interest. The applications are numerous, including large-scale inference in machine learning, Bayesian nonparametrics, Bayesian inverse problems, and aggregation of experts, among others. When the density is L-smooth (the log-density is continuously differentiable and its derivative is Lipschitz), we will advocate the use of a "rejection-free" algorithm, based on the Euler discretization of the Langevin diffusion with either constant or decreasing step sizes. We will present several new results allowing convergence to stationarity under different conditions on the log-density (from the weakest, bounded oscillation on a compact set and super-exponential tails, to strong concavity). When the log-density is not smooth (a problem which typically appears when using sparsity-inducing priors, for example), we still suggest using an Euler discretization, but of the Moreau envelope of the non-smooth part of the log-density. An importance sampling correction may later be applied to correct for the target. Several numerical illustrations will be presented to show that this algorithm (named MYULA) can be used in practice in a high-dimensional setting. Finally, non-asymptotic convergence bounds (in total variation and Wasserstein distances) are derived.
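The "rejection-free" algorithm in the smooth case is the unadjusted Langevin algorithm (ULA): an Euler step on the Langevin diffusion with no accept/reject correction. A minimal sketch for a standard Gaussian target, with an illustrative constant step size (the small stationary bias of the constant-step scheme is exactly what the non-asymptotic bounds above quantify):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target density proportional to exp(-U(x)) with U(x) = ||x||^2 / 2,
# i.e., a standard 2-D Gaussian; grad U(x) = x.
def grad_U(x):
    return x

def ula(n_iter=50_000, step=0.05, dim=2):
    """Unadjusted Langevin algorithm: x_{k+1} = x_k - step * grad U(x_k)
    + sqrt(2 * step) * N(0, I). No Metropolis accept/reject step."""
    x = np.zeros(dim)
    samples = np.empty((n_iter, dim))
    for k in range(n_iter):
        x = x - step * grad_U(x) + np.sqrt(2 * step) * rng.standard_normal(dim)
        samples[k] = x
    return samples

samples = ula()
burn = samples[10_000:]   # discard a burn-in period
```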
Arthur Gretton (Gatsby Computational Neuroscience Unit, UCL, UK)
Nonparametric statistical tests using kernels
Abstract: I will describe a kernel approach to hypothesis testing, based on a representation of probability
distributions in a reproducing kernel Hilbert space. I will first derive a metric as the distance between these
representations. Next, I will describe both tests of homogeneity (of whether two samples are from the same
distribution), and of independence (of whether a joint distribution factorises into a product of marginals).
More recent work will follow, including statistical tests for random processes (for instance, to compare
marginal distributions of Markov chains), tests for multivariable interaction, and a test of goodness-of-fit
based on Stein's method. This last test can be used even if the reference distribution is known only up to
a normalising constant, making it suited to benchmarking MCMC and to statistical model criticism.
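The homogeneity test above is based on the maximum mean discrepancy (MMD), the RKHS distance between mean embeddings. A small numpy sketch of the standard unbiased estimator of squared MMD with a Gaussian kernel (the bandwidth is an arbitrary illustrative choice; an actual test would calibrate a rejection threshold, e.g. by permutation):

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples X and Y
    (rows = observations) with a Gaussian RBF kernel."""
    def k(A, B):
        d2 = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-d2 / (2.0 * bandwidth**2))
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    # Diagonal terms are removed from the within-sample sums (unbiasedness).
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
Y_same = rng.standard_normal((200, 2))        # same distribution as X
Y_shift = rng.standard_normal((200, 2)) + 1.0  # mean-shifted alternative
```

On this toy data the statistic is near zero for the matched pair and clearly positive for the shifted one.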
Dr. Francis Bach (INRIA, Paris)
Beyond stochastic gradient descent for large-scale machine learning
Abstract: Many machine learning and statistics problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations ("large n") and each of them is high-dimensional ("large p"). In this setting, online algorithms such as stochastic gradient descent, which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. Given n observations/iterations, the optimal convergence rates of these algorithms are O(1/\sqrt{n}) for general convex functions and O(1/n) for strongly convex functions. In this talk, I will show how the smoothness of loss functions may be used to design novel simple algorithms with improved behavior, both in theory and in practice: in the ideal infinite-data setting, an efficient novel Newton-based stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions. (Joint work with Alexandre Defossez, Aymeric Dieuleveut, Nicolas Flammarion, and Eric Moulines.)
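As a minimal illustration of the single-pass regime discussed above, the sketch below runs constant-step-size SGD with Polyak-Ruppert iterate averaging on a least-squares problem, the smooth setting in which averaged SGD is known to attain the O(1/n) rate; the step size and problem dimensions are arbitrary illustrative choices, not constants from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 5
w_star = rng.standard_normal(p)                 # ground-truth weights
X = rng.standard_normal((n, p))
y = X @ w_star + 0.1 * rng.standard_normal(n)   # noisy linear observations

# One pass over the data: constant-step SGD plus a running average
# of the iterates (Polyak-Ruppert averaging).
w = np.zeros(p)
w_avg = np.zeros(p)
step = 0.01
for i in range(n):
    grad = (X[i] @ w - y[i]) * X[i]   # stochastic gradient of squared loss
    w -= step * grad
    w_avg += (w - w_avg) / (i + 1)    # online running average
```

The averaged iterate `w_avg` is typically much closer to `w_star` than a single SGD iterate with the same constant step size.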