Psyc881
Exploratory and Graphical Data Analysis

Professor Steven M. Boker

The process by which psychological knowledge advances involves a cycle of theory development, experimental design and hypothesis testing. But after the hypothesis test either does or doesn't reject a null hypothesis, where does the idea for the next experiment come from?

Exploratory data analysis completes this research cycle by helping to form new theories and revise existing ones. After the planned hypothesis testing for an experiment is finished, exploratory data analysis can look for patterns in the data that the original hypothesis tests may have missed. Successful exploratory analyses help the researcher revise theories and modify existing experiments or design new ones with focused hypothesis tests.

A second use of exploratory data analysis is in diagnostics for hypothesis tests. There are many reasons why a hypothesis test might fail. There are even times when a hypothesis test will reject the null for an unexpected reason. By becoming familiar with data through exploratory methods, the informed researcher can understand what went wrong (or what went right for the wrong reason).

The initial part of the course will introduce the rationale and scope of exploratory data analysis. Next, we will examine how perceptual and cognitive illusions can affect our judgement when we use exploratory and graphical techniques. We will then dive in and try a variety of techniques for the presentation and graphical exploration of univariate, bivariate and multivariate data. After that, we will use these graphical techniques in the service of other exploratory methods such as data screening, outlier analysis, residual analysis, transformations, and time series analysis. The remainder of the course will be devoted to integrating these techniques into projects of interest to the students.

Computer work associated with the course will primarily involve the Splus software. Additional assignments may introduce the use of Mathematica for visualization of multivariate data. Students are expected to become familiar enough with Splus to use its available routines for interactive exploratory analyses. Students will also learn to write Splus scripts well enough to perform the data manipulations needed to apply exploratory analysis to their own research problems.

Splus and R Example Files.

  • Univariate 1 -- Download the S program file Univariate1.S and/or the R version of the same program Univariate1.R and run each of the sections. You will also need the dataset galaxy.dat.

  • Univariate 2 -- Download the S program file Univariate2.S and/or the R version of the same program Univariate2.R and run each of the sections. You will also need the dataset iris.dat.

  • Univariate 3 -- Download the S program file Univariate3.S and/or the R version of the same program Univariate3.R and run each of the sections. You will need to have saved the results of Univariate2 or you'll need to reload the Iris data.

  • Univariate 4 -- Download the S program file Univariate4.S and/or the R version of the same program Univariate4.R and run each of the sections. You will need to have saved the results of Univariate2 or you'll need to reload the Iris data.

  • Transformations 1 -- Download the S program file Transformations1.S and/or the R version of the same program Transformations1.R and run each of the sections.

  • Transformations 2 -- Download the S program file Transformations2.S and/or the R version of the same program Transformations2.R and run each of the sections.

  • Bivariate 1 -- Download the S program file Bivariate1.S and/or the R version of the same program Bivariate1.R and run each of the sections.

  • Bivariate 2 -- Download the S program file Bivariate2.S and/or the R version of the same program Bivariate2.R and run each of the sections.

  • Outliers -- Download the S program file OutliersCIs1.S and/or the R version of the same program OutliersCIs1.R and run each of the sections.

  • Smoothing 1 -- Download the S program file Smoothing1.S and/or the R version of the same program Smoothing1.R and run each of the sections. The R version of the program has additional graphs that are not in the Splus version.

  • Smoothing 2 -- Download the S program file Smoothing2.S and/or the R version of the same program Smoothing2.R and run each of the sections.

  • Three-D 1 -- Download the S program file ThreeD1.S and/or the R version of the same program ThreeD1.R and run each of the sections.

  • Three-D 2 -- Download the S program file ThreeD2.S and/or the R version of the same program ThreeD2.R and run each of the sections.

  • Three-D 3 -- Download the Mathematica program file ThreeD3MMA1.nb and run each of the sections.

  • Time Series 1 -- Download the S program file TimeSeries1.S and/or the R version of the same program TimeSeries1.R and run each of the sections.

  • Time Series 2 -- Download the S program file TimeSeries2.S and/or the R version of the same program TimeSeries2.R and run each of the sections.

  • Time Series 3 -- Download the S program file TimeSeries3.S and/or the R version of the same program TimeSeries3.R and run each of the sections.

  • Vector Fields 1 -- Download the S program file VectorFields1.S and/or the R version of the same program VectorFields1.R and run each of the sections.

  • Vector Fields 2 -- Download the S program file VectorFields2.S and/or the R version of the same program VectorFields2.R and run each of the sections.

R Homework Files.

  • Homework 1 -- Load the edaStudent data into Splus or R using the command source("edaStudent.sdd"). Create histograms and boxplots for the variables hscgpa and ztest. Create Quantile-Normal plots for these two variables. Print the graphs. Questions: 1. Do these two distributions approximate a normal distribution? 2. Does one of them more closely approximate a normal distribution? Why?

  • Homework 2 -- Create Trellis boxplots and Quantile-Normal plots for the variables hscgpa and ztest split by gender. Print the graphs. Check whether the variances are homogeneous across genders. Questions: 1. There is an apparent anomaly at hscgpa=4.0. Can you think of a reason why this might be? 2. Are there any apparent differences between genders on these two variables? 3. Does it seem reasonable to pool the variance of the variables split by gender?

  • Homework 3 -- Transform the variables leptoData and platyData using the centeredPower function to approximate a normal distribution. Print Quantile-Normal Plots for the transformed variables. Questions: 1. What exponent worked best for leptoData? 2. What exponent worked best for platyData?
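
The kinds of displays the homework asks for can be sketched in R roughly as follows. This is only an illustration under stated assumptions: the edaStudent data are not reproduced here, so R's built-in iris data frame stands in (Sepal.Length for a score variable, Species for a grouping variable), and a simulated heavy-tailed sample stands in for leptoData. The helpers signed_power and kurt are hypothetical and are not the course's centeredPower function, whose definition may differ.

```r
# Stand-in data: the built-in iris data frame replaces edaStudent (assumption).
x <- iris$Sepal.Length   # plays the role of a score variable
g <- iris$Species        # plays the role of a grouping variable

# Univariate views: histogram, boxplot, and Quantile-Normal plot
hist(x, main = "Histogram", xlab = "Sepal.Length")
boxplot(x, main = "Boxplot", ylab = "Sepal.Length")
qqnorm(x)
qqline(x)                # points close to the line suggest approximate normality

# Trellis displays split by group; fall back to base graphics if the
# lattice package is not installed
if (requireNamespace("lattice", quietly = TRUE)) {
  print(lattice::bwplot(Species ~ Sepal.Length, data = iris))
  print(lattice::qqmath(~ Sepal.Length | Species, data = iris))
} else {
  boxplot(Sepal.Length ~ Species, data = iris)
}

# Informal and formal checks of homogeneity of variance across groups
group_var <- tapply(x, g, var)
print(group_var)
var_test <- bartlett.test(Sepal.Length ~ Species, data = iris)
print(var_test$p.value)

# Hypothetical helper, NOT the course's centeredPower(): center at the
# median, then raise the absolute deviations to a power, keeping their sign.
signed_power <- function(v, p) {
  d <- v - median(v)
  sign(d) * abs(d)^p
}

# Hypothetical helper: moment-based kurtosis (about 3 for a normal sample)
kurt <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2
}

# Simulated heavy-tailed sample standing in for leptoData (assumption)
set.seed(1)
heavy <- rt(500, df = 3)
print(c(before = kurt(heavy), after = kurt(signed_power(heavy, 0.5))))

# Quantile-Normal plot of the transformed sample
qqnorm(signed_power(heavy, 0.5))
qqline(signed_power(heavy, 0.5))
```

In this sketch an exponent below 1 pulls in heavy tails and an exponent above 1 stretches light tails, so trying several exponents and comparing the Quantile-Normal plots is one way to search for a good value.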

Programs.

  • VRA is a Windows program for Recurrence Analysis.

Handouts and Useful Websites.

  • Splus Manuals can be downloaded from the Insightful Corporation.

  • CRAN is the acronym for the Comprehensive R Archive Network and has free copies of the R software and extensive documentation.

  • The Trellis Graphics User's Manual and A Tour of Trellis Graphics can be found at the Bell Labs Trellis Graphics site.

  • A Gallery of Data Visualization on Michael Friendly's website gives thumbs up and thumbs down listings for best and worst examples of statistical graphics.

Data.

  • The iris sepal and petal dimensions data that R. A. Fisher used as an example are provided in Splus sdd format and R read.table format. These are the data that are shown in the matrix scatterplot at the top of this page.
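
A matrix scatterplot of the iris measurements can be sketched in R as below. The layout of the course's data file is not shown on this page, so this example makes an assumption: it writes R's built-in iris data frame to a temporary whitespace-separated file with a header row, then reads it back with read.table. The temporary file and the name iris_df are illustrative only.

```r
# Write the built-in iris data to a temporary text file; this file is a
# stand-in for the course's data file, whose exact layout is an assumption.
tmp <- tempfile(fileext = ".dat")
write.table(iris, tmp, row.names = FALSE, quote = FALSE)

# Read it back in read.table format, keeping Species as a factor
iris_df <- read.table(tmp, header = TRUE, stringsAsFactors = TRUE)
str(iris_df)

# Matrix scatterplot of the four measurements, colored by species
pairs(iris_df[, 1:4],
      col = as.integer(iris_df$Species),
      main = "Iris sepal and petal dimensions")

unlink(tmp)   # remove the temporary file
```

Each off-diagonal panel plots one pair of measurements, so all six pairwise relationships can be scanned at once for clusters and outliers.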

Steven M. Boker
Department of Psychology
University of Virginia
Gilmer Hall Room 102
Charlottesville, VA 22903
Office: 434-243-7275, FAX: 434-982-4766
e-mail: boker@virginia.edu

Handmade in cyberspace