5 Inference When Regularization Is Used to Simplify Fitting of High-Dimensional Models
Pages 44-61

From page 44...
... Daniela Witten (University of Washington) introduced novel methods for learning the structure of a graphical model from gene expression and neural spike train data, and she discussed the gap between statistical theory and practice in the context of theoretical results associated with high-dimensional model fitting.
From page 45...
... described the growing size of genetic and genomic data sets and associated statistical challenges and emphasized the importance of statisticians engaging early in experimental design and data collection.

LEARNING FROM TIME
Daniela Witten, University of Washington

Daniela Witten began by describing methods for learning the structure of graphical models, which represent interrelationships between multiple random variables (Figure 5.1A)
From page 46...
... Witten explained that it is not necessary to know the functional form of f_jk for structure learning beyond knowing whether it is exactly zero, which indicates that there is no regulatory relationship or corresponding graph edge. Fitting this nonparametric model is challenging, explained Witten.
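Witten's estimator is not reproduced in these pages; the following is a minimal sketch of the general idea, assuming a neighborhood-selection scheme in which each node is regressed on basis expansions of the other nodes under a group-lasso penalty, so that an edge is absent exactly when its entire coefficient group for f_jk is zero. The polynomial basis, proximal-gradient solver, and tuning value are illustrative assumptions, not the method from the talk.

```python
# Hedged sketch: nonparametric neighborhood selection via group lasso.
import numpy as np

def basis(x, degree=3):
    """Polynomial basis expansion of one variable (stand-in for splines)."""
    return np.column_stack([x**d for d in range(1, degree + 1)])

def group_lasso_regression(X_groups, y, lam, n_iter=500):
    """Proximal gradient for least squares with a group-lasso penalty."""
    X = np.hstack(X_groups)
    n = len(y)
    sizes = [g.shape[1] for g in X_groups]
    idx = np.cumsum([0] + sizes)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of smooth part
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * X.T @ (X @ b - y) / n          # gradient step
        for g in range(len(sizes)):                   # blockwise soft-threshold
            sl = slice(idx[g], idx[g + 1])
            nrm = np.linalg.norm(z[sl])
            z[sl] = 0.0 if nrm == 0 else max(0.0, 1 - step * lam / nrm) * z[sl]
        b = z
    return [b[idx[g]:idx[g + 1]] for g in range(len(sizes))]

def estimate_neighbors(data, j, lam=0.1):
    """Nodes k whose coefficient group is nonzero are estimated neighbors of j."""
    y = data[:, j] - data[:, j].mean()
    others = [k for k in range(data.shape[1]) if k != j]
    groups = [basis(data[:, k]) for k in others]
    coefs = group_lasso_regression(groups, y, lam)
    return [k for k, c in zip(others, coefs) if np.linalg.norm(c) > 1e-8]
```

The group penalty is what makes "is f_jk exactly zero?" a well-posed selection question: either all basis coefficients for an edge are killed together, or the edge is kept.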
From page 47...
... This method has theoretical guarantees similar to those in the gene expression example, specifically that the correct parent nodes are identified and that the estimated graphical models are selected with high probability from high-dimensional data sets.
From page 48...
... Preferable measures of uncertainty such as a false discovery rate or p-values may follow, but much current work does not take this first step of establishing model selection consistency. With consistency established, Kosorok described different aspects of the graphical model about which inferences can be drawn, including graph structure and edge direction, the magnitude and sign of model coefficients, and the overall prediction or classification error of the model on a new or test data set.
From page 49...
... Kosorok wondered whether the proposed methods for gene time course and neural spike train data can be generalized to allow for nonparametric interactions -- for example, by using tensor products of the bases -- and suggested that this approach could be used to interrogate the additive structure of the model.
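For concreteness, here is a tiny sketch of the tensor-product construction Kosorok mentioned, with polynomial bases standing in for the spline bases one would likely use in practice; all names and sizes are illustrative assumptions.

```python
# Tensor products of two univariate bases give a basis for a
# nonparametric interaction term f(x_j, x_k).
import numpy as np

def tensor_product_basis(Bj, Bk):
    """All pairwise column products of Bj (n x p) and Bk (n x q) -> n x (p*q)."""
    n = Bj.shape[0]
    return np.einsum('np,nq->npq', Bj, Bk).reshape(n, -1)

rng = np.random.default_rng(0)
xj, xk = rng.normal(size=100), rng.normal(size=100)
Bj = np.column_stack([xj**d for d in range(1, 4)])   # basis for x_j
Bk = np.column_stack([xk**d for d in range(1, 4)])   # basis for x_k
B_int = tensor_product_basis(Bj, Bk)  # 9 interaction features, penalized as one group
```

Fitting with and without the interaction group left in the model is one way to interrogate whether the purely additive structure suffices.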
From page 50...
... PANEL DISCUSSION Following their presentations, Daniela Witten and Michael Kosorok participated in a panel discussion. A participant commented that the skepticism expressed by Witten regarding the results of unsupervised analyses in which the model being used is known to be an oversimplification should be extended to the context of supervised analyses as well.
From page 51...
... Referring back to Genevera Allen's presentation, Witten described the best available option as generating a ranked list of graph edges based on different data samples and graph estimation techniques, though this approach still falls short of fully quantifying uncertainty. Reiterating Witten's comment that graphical model estimation from large, complex multivariate data sets should be viewed as a way of generating hypotheses for future experimental investigation, one participant asked how the field could better communicate the uncertainty associated with these exploratory analyses.
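A hedged sketch of that ranked-edge-list idea: re-estimate the graph on repeated subsamples and rank edges by how often they are selected. The Gaussian graphical lasso is used here purely as a convenient estimator and the tuning choices are assumptions, not necessarily what was used in Allen's presentation.

```python
# Stability-style ranking of graph edges across subsamples.
import numpy as np
from sklearn.covariance import GraphicalLasso

def edge_stability(data, alpha=0.1, n_subsamples=50, frac=0.5, seed=0):
    """Return (i, j, selection_frequency) tuples, most stable edges first."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    counts = np.zeros((p, p))
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        prec = GraphicalLasso(alpha=alpha).fit(data[idx]).precision_
        counts += (np.abs(prec) > 1e-8)            # edge present in this fit?
    freq = counts / n_subsamples
    iu = np.triu_indices(p, k=1)                   # off-diagonal pairs only
    order = np.argsort(freq[iu])[::-1]
    return [(iu[0][i], iu[1][i], freq[iu][i]) for i in order]
```

The selection frequencies give a ranking, but, as Witten noted, they are not calibrated probabilities, so they fall short of full uncertainty quantification.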
From page 52...
... In the context of unsupervised analyses, she continued, there is no gold standard for selecting tuning parameters. Graph estimation is useful to generate a relatively simple representation of large, complex data sets, said Witten.
From page 53...
... After splitting the data set, the square root lasso identified 11 variables; while there was some overlap with the first case described above, several differences emerged because different data were used in the two cases. Fitting a parametric Gaussian model with these variables to the unused portion of the data resulted in p-values and confidence intervals that were, in principle, justified, because the data used to select variables were independent of the data used to fit the model.
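A minimal sketch of this split-sample workflow, under stated assumptions: scikit-learn's cross-validated lasso stands in for the square root lasso used in the talk, and statsmodels supplies the unpenalized second-stage Gaussian fit whose p-values are not contaminated by selection.

```python
# Stage 1: select variables on one half; stage 2: infer on the held-out half.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV          # stand-in for the sqrt lasso
from sklearn.model_selection import train_test_split

def split_and_infer(X, y, seed=0):
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=seed)
    selected = np.flatnonzero(LassoCV(cv=5).fit(X1, y1).coef_)   # selection
    ols = sm.OLS(y2, sm.add_constant(X2[:, selected])).fit()     # inference
    return selected, ols.pvalues, ols.conf_int()
```

Because the second-stage data played no role in choosing the variables, the reported intervals are valid conditional on the (random) split, which is exactly the justification described above.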
From page 54...
... the exploratory-confirmatory analysis paradigm. X and y represent random variables before data splitting, ω represents a random split, and X1 and y1 represent the first-stage data used to select the variables E for fitting.
From page 55...
... Brown showed a few diverse examples of dynamic neuroscience data, including neuron spike train data, images produced by functional magnetic resonance imaging and diffuse optical tomography, and behavioral or cognitive performance data. Brown presented a case study from his anesthesiology research evaluating electroencephalogram (EEG)
From page 56...
... If the first eigenvalue, and thereby the global coherence, is large, it suggests there is clear directionality to the EEG signal at that frequency and time. He showed the time course of global coherence for six patients, which exhibited strong coherence (e.g., between 0.7 and 0.8)
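Global coherence, as invoked here, is the largest eigenvalue of the channel-by-channel cross-spectral matrix at a given frequency, divided by the sum of its eigenvalues; values near 1 mean one spatial mode dominates the signal. The sketch below assumes Welch-style cross-spectral estimates, and the windowing parameters are placeholders, not those from Brown's study.

```python
# Global coherence from multichannel EEG via the cross-spectral matrix.
import numpy as np
from scipy.signal import csd

def global_coherence(eeg, fs, nperseg=256):
    """eeg: (n_channels, n_samples). Returns (freqs, coherence in [0, 1])."""
    c = eeg.shape[0]
    f, _ = csd(eeg[0], eeg[0], fs=fs, nperseg=nperseg)
    S = np.empty((len(f), c, c), dtype=complex)
    for i in range(c):
        for j in range(c):
            _, S[:, i, j] = csd(eeg[i], eeg[j], fs=fs, nperseg=nperseg)
    eig = np.linalg.eigvalsh(S)            # real eigenvalues of Hermitian S(f)
    return f, eig[:, -1] / eig.sum(axis=1) # largest eigenvalue / trace
```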
From page 57...
... Brown described calculating the 95 percent confidence interval for the difference using bootstrap methods (Ramos, 1988; Hurvich and Zeger, 1987) and showed the upper and lower confidence bounds on the difference as a function of frequency.
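A toy illustration of a bootstrap interval for the difference between two spectra: resample whole trials with replacement, recompute the per-frequency difference, and take percentile bounds. This trial-resampling scheme is a deliberate simplification; the methods Brown cited (Ramos, 1988; Hurvich and Zeger, 1987) are frequency-domain bootstrap procedures.

```python
# Percentile bootstrap for a per-frequency spectral difference.
import numpy as np

def bootstrap_spectral_diff(spec_a, spec_b, n_boot=2000, seed=0):
    """spec_a, spec_b: (n_trials, n_freqs) per-trial spectra. Returns 95% CI."""
    rng = np.random.default_rng(seed)
    diffs = np.empty((n_boot, spec_a.shape[1]))
    for b in range(n_boot):
        ia = rng.integers(0, len(spec_a), len(spec_a))   # resample group A trials
        ib = rng.integers(0, len(spec_b), len(spec_b))   # resample group B trials
        diffs[b] = spec_a[ia].mean(axis=0) - spec_b[ib].mean(axis=0)
    return np.percentile(diffs, [2.5, 97.5], axis=0)     # lower, upper bounds
```

Frequencies whose interval excludes zero are where the two conditions plausibly differ, which is how the upper and lower bounds as a function of frequency are read.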
From page 58...
... The field of statistics needs a credo, Brown concluded, that there is no uncertainty that statisticians cannot quantify.

DISCUSSION OF STATISTICS AND BIG DATA CHALLENGES IN NEUROSCIENCE
Xihong Lin, Harvard University

Xihong Lin provided several examples of diverse data types that fall under the umbrella of big data, including neuroimaging, whole genome sequencing (WGS)
From page 59...
... For example, the National Human Genome Research Institute's Genome Sequencing Program contains data from 200,000 subjects consisting of approximately 1 billion SNPs. Because the majority of variants in the human genome are rare, p increases with n, and additional rare variants will be observed as more samples are sequenced.
From page 60...
... Moving to the importance of including statistical reasoning in the design of modern clinical trials, Lin described research using cell phone record data to construct networks representing social interactions and how this information can be used to improve HIV interventions. In such networks, random sampling may be less effective, as it is unlikely that a highly connected hub node is selected at random.
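A toy example of Lin's sampling point, assuming a scale-free contact network generated with networkx (an illustrative model, not the cell phone data from the talk): uniform random sampling rarely lands on the highly connected hubs, whereas degree-ranked selection targets them directly.

```python
# Random versus degree-ranked node selection in a hub-heavy network.
import random
import networkx as nx

g = nx.barabasi_albert_graph(n=1000, m=2, seed=0)   # scale-free toy network
hubs = sorted(g.degree, key=lambda nd: nd[1], reverse=True)[:10]
random_pick = random.Random(0).sample(list(g.nodes), 10)
print("top hub degrees:    ", [d for _, d in hubs])
print("random pick degrees:", sorted(g.degree[v] for v in random_pick))
```

The degree gap between the two printed lists is the intuition behind designing an intervention around hubs rather than a simple random sample.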
From page 61...
... Furthermore, she encouraged statisticians writing code to consider the emergence of cloud computing, as software that works well for clustered computing may not be effective in the cloud environment. A bioinformatics master's student commented that graduate programs want students with strong computational and statistical training; however, there was little opportunity or incentive to engage with these concepts in high school or as an undergraduate biology major.

