/*
PR0139.SAS
COMPUTING NUMERATOR SUMS OF SQUARES
IN UNBALANCED ANALYSIS OF VARIANCE:
Two-Way Case
Donald B. Macnaughton
donmac@matstat.com
TABLE OF CONTENTS
Abstract
Introduction
- preliminary notes
- introduction to the Searle example
- a goal of experiments and analysis of variance
- a controversy about sums of squares
- model equations
- the residual sum of squares of a model equation
- an interpretation of analysis of variance sums of squares
in terms of model equations
- summing up
- program overview
Preliminary Steps
- start PROC IML
- define the Searle data
- generate the main effect submatrices of the design matrix
- generate the interaction submatrix of the design matrix
- obtain the SS subroutine and list it
- abstract
- uses
- main arguments: y, XE, and XR
- method
- details of the method
- the hypothesis matrix H as a function of XE and XR
- the projection matrix PM as a function of H
- yp, the projection of y by PM
- the sum of squares as the squared length of yp
- other methods of computing the sum of squares
- general comments
- secondary arguments
- notes
- executable statements
- references
- set the values of the three secondary arguments of the SS
subroutine
Compute the Seven Sums of Squares Using the SS Subroutine
- HTO A sum of squares
- specify the matrix XE for the effect being tested
- specify the matrix XR for the other effects in the
two models
- call SS to compute the sum of squares
- HTO B sum of squares
- HTI A sum of squares
- HTI B sum of squares
- sequential A sum of squares when A is entered first
- sequential B sum of squares when B is entered first
- interaction sum of squares
Save the Data in a SAS Dataset and Quit from IML
Compute the Sums of Squares using PROC GLM (for comparison with
the values generated above)
Summary
Notes
Appendix: Steps to Run the Program
References
Output from PROC GLM
ABSTRACT
This SAS program illustrates a conceptual point of view and the
matrix arithmetic for computing the following types of analysis
of variance numerator sums of squares:
- HTO (Higher-level Terms are Omitted)
= SPSS ANOVA Experimental
= SAS Type II in the two-way case
- HTI (Higher-level Terms are Included)
= SAS Type III
= SPSS ANOVA Unique
= the default approach in many analysis of variance pro-
grams
- sequential
= SAS Type I
= SPSS ANOVA Hierarchical in the two-way case.
The conceptual point of view is one of computing an analysis of
variance sum of squares by computing the difference between the
residual sums of squares of two model equations (Yates 1934).
The matrix arithmetic is simple and is specified directly in
terms of the conceptual point of view.
The program is heavily annotated. Computations are illustrated
using data from a 2 x 3 unbalanced experiment discussed by
Shayle Searle (1987, 79).
PRELIMINARY NOTES
If you are not familiar with SAS, you can tell the difference
between my comments and the SAS program statements as follows:
My comments begin with the two symbols /* and end with the two
symbols */ /* Anything outside these symbols (except
blanks) is a program statement, which SAS will try to execute.
To lay some groundwork, I begin the program not with SAS pro-
gram statements but instead with about 500 lines of my own com-
ments. These set the stage for the later program statements.
INTRODUCTION TO THE SEARLE EXAMPLE
Analysis of variance is a broadly used method for analyzing the
results of scientific experiments. The method was invented by
Sir Ronald Aylmer Fisher (1925, 1935) and is generally viewed
as the most powerful and versatile available method for scien-
tifically inferring causation.
A current controversy in analysis of variance pertains to ana-
lyzing the data from "unbalanced" experiments. The controversy
is important because (for various reasons) a large proportion
of real-world experiments end up being unbalanced. Shayle
Searle addresses the controversy in his book LINEAR MODELS FOR
UNBALANCED DATA (1987) in which he discusses both mathematical
and philosophical issues. Although I disagree with some of
Professor Searle's philosophical conclusions, I am in awe of
his mathematical work. It is with deep respect that I offer
the following analysis of an important example in his book.
Searle's example is of an experiment to test whether "type of
potting soil" influences "time to germination" in three varie-
ties of carrot seed (1987, 78-79).
(Searle's example is from the field of agriculture. However,
the discussion is not limited to experiments in the field of
agriculture. Instead, both Searle's discussion and my discus-
sion apply to experiments in all fields of empirical research.
In particular, the discussion applies to a majority of the ex-
periments in the physical, biological, and social sciences.)
Clearly, the response variable in Searle's experiment is "time
to germination", and the two predictor variables are "soil
type" and "seed variety". Searle presents the following data
as possible results of the experiment:
                 TABLE 1
     Time in Days to First Germination
     of Three Varieties of Carrot Seed
   Grown in Two Different Potting Soils
        -------------------
                  Seed
        Soil   Variety (B)
        Type   ------------
         (A)    1    2    3
        -------------------
          1     6   13   14
               10   15   22
               11
          2    12   31   18
               15         9
               19        12
               18
        -------------------
The first number in the body of the table (i.e., 6) indicates
that in one of the fifteen trials in the experiment it took six
days for seeds of variety 1 to germinate when they were planted
in soil of type 1.
Searle's experiment is unbalanced because, as the table shows,
the number of values of the response variable available in the
various "cells" in the experiment differs from cell to cell.
For example, three values of the response variable are avail-
able in the (1,1) cell, but only two values are available in
the (1,2) cell.
A GOAL OF EXPERIMENTS AND ANALYSIS OF VARIANCE
When discussing analysis of variance, it is important to be
aware of both the goal of scientific experiments and the role
of analysis of variance in achieving the goal. Otherwise, the
discussion may become an arbitrary mathematical exercise. One
useful way of characterizing the goal is
The goal of experiments and analysis of variance is to
obtain knowledge about relationships between variables.
In any scientific experiment, an important step in achieving
this goal is to determine, in as unequivocal a way as possible,
whether a relationship actually *exists* between the variables
under study. In particular, we wish to determine whether the
response variable in an experiment "depends" on one or more of
the predictor variables. If a relationship is found between
the variables, a second goal is to determine the nature of the
relationship.
Thus in the Searle data we wish to determine whether "time to
germination" depends on "soil type", or whether "time to germi-
nation" depends on "seed variety".
It is invariably the case in experimental research that we wish
to determine whether the dependence exists in the *population*
of entities under study, not just in the sample of entities
that participated in the experiment. (In Searle's example the
entities are trials, or "carrot seed plantings".)
In an experiment with two predictor variables, the nature of
the relationship between the response variable and the predic-
tor variables can be either
- no (detectable) relationship or
- one or two "simple" relationships ("main effects") or
- an "interaction".
Interactions were invented by Fisher (1935, ch. VI). Interac-
tions provide a comprehensive means for detecting any (detect-
able) form of relationship that might exist between the re-
sponse variable and the predictor variables in an experiment.
In particular, interactions help us to detect complicated rela-
tionships between the response variable and a predictor vari-
able in which the specific form of the relationship depends on
the level of one or more *other* predictor variables. I give
formal definitions of the concepts of 'relationship between
variables', 'interaction', and 'simple relationship' in a paper
(1997, sec. 6).
The use of analysis of variance to detect relationships between
variables is (at a high level) straightforward: We submit the
data summarizing the results of an experiment (e.g., the data
in table 1) to an analysis of variance program, and the program
computes a set of "p-values". Assuming the experiment was
properly designed, the program provides a p-value for each sim-
ple relationship (main effect) and a p-value for each interac-
tion. If the p-value for a main effect or interaction is low
enough (and if there is no reasonable alternative explanation),
we conclude that the particular relationship between variables
associated with the p-value is extant in the population of en-
tities under study.
I discuss the general scientific study of relationships between
variables (including the notion of a p-value) further in two
papers (1996a, 1996b).
A CONTROVERSY ABOUT SUMS OF SQUARES
To compute an analysis of variance p-value from the results of
an experiment, it is mathematically necessary to first compute
certain "sums of squares". All analysis of variance programs
compute these sums of squares as an intermediate step in com-
puting p-values. Currently controversy exists about how sums
of squares should be computed. Controversy exists about both
the numerator sum of squares and the denominator sum of squares
used in the "F-ratio" to compute a p-value. The present dis-
cussion focuses exclusively on computing numerator sums of
squares for unbalanced experiments.
(Since balanced experiments are an "internal" limiting case of
unbalanced experiments, the discussion below also applies to
balanced experiments.)
This program illustrates the computations of the three best-
known conceptual methods for computing numerator sums of
squares. The program also illustrates the mathematics of
computing numerator sums of squares by demonstrating a simple
matrix algorithm that can carry out all three conceptual
methods.
The purpose of this program is not to judge the various methods
of computing sums of squares, but rather to illustrate them.
Therefore, I make no judgments here about the merits of the
various methods. However, I do make judgments in the paper
(1997) and I shall extend these judgments in material I shall
publish later.
MODEL EQUATIONS
To understand sums of squares, it is useful to understand the
notion of a "model equation" of the relationship between the
response variable and the predictor variables in an experiment.
Two types of model equations (often called simply "models") are
in use today: "overparameterized" models and "cell-means" mod-
els. I discuss both types in the paper (1997, sec. 9). The
following discussion is in terms of overparameterized models.
Consider a two-way experiment (e.g., the Searle experiment)
that has predictor variables A and B. We can model the rela-
tionship between the response variable, called y, and the pre-
dictor variables with the following model equation:
y(i,j,k) = mu + alpha(i) + beta(j) + phi(i,j) + e(i,j,k). (1)
(Normally the i's, j's and k's in the parentheses would be sub-
scripts, and the five terms on the right side of the equation
would be Greek letters, but these features are not yet avail-
able for comments in computer programs.)
The terms in the equation have the following interpretations:
y(i,j,k) = value of the response variable for the kth entity in
the (i,j) treatment group in the experiment
mu = grand mean of the values of the response variable
for all the entities in the population
alpha(i) = simple ("main") effect of predictor variable A on y
when A is at level i
beta(j) = simple ("main") effect of predictor variable B on y
when B is at level j
phi(i,j) = the joint (interaction) effect of predictor vari-
ables A and B on y when they are at levels i and j
respectively
e(i,j,k) = the "error" term, which takes account of the vari-
ation in y that cannot be accounted for by the other
four terms on the right side of the equation.
Model (1) gives us a succinct picture of how the value of the
response variable "depends" on the values of the two predictor
variables, and how it also depends on other unknown factors,
which are taken account of by the error term. Model (1) is
called the "saturated" model for a two-way experiment because
it contains all the possible terms for such an experiment.
As we shall see, it often makes sense to use a reduced version
of (1) in which certain terms are omitted from the right side.
For example, if we omit the interaction term, phi, we get
y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2)
THE RESIDUAL SUM OF SQUARES OF A MODEL EQUATION
Once we have the data from an experiment (e.g., the data in ta-
ble 1), we can "fit" various models to the data. That is, we
can use a mathematical algorithm to compute values for the pa-
rameters associated with the terms in the model [i.e., values
for mu, the alpha(i)s, the beta(j)s, and the phi(i,j)s]. The
algorithm operates by choosing values for the parameters asso-
ciated with the four terms so that the resulting model gives
the "best" predictions of the values of the response variable y
for all the values of y obtained in the experiment.
The fitting of the terms is usually done by the method of least
squares or by the closely related method of maximum likelihood.
In the standard case, both methods yield "identical" estimates
for the values of the parameters associated with the terms
(excluding the error term e) on the right side of the equation
in the sense that
the sum of the squared differences between the values
of the response variable predicted by the equation and
the *actual* values of the response variable in the ex-
periment
is the lowest possible value.
After we have fitted a model to the results of an experiment
and obtained estimates for the values of the parameters, we can
then use the model to compute the predicted value for each of
the values of the response variable obtained in the experiment.
Then we can subtract each predicted value from the correspond-
ing actual value to get a number called the "residual". If we
square each of these residuals and add the squared residuals
together, we get the "residual sum of squares" for the model
for the experimental data at hand.
(Of course, the residual sum of squares is the number that was
itself minimized two paragraphs above in order to determine the
estimates of the values of the parameters -- a piece of mathe-
matical bootstrapping that still amazes me.)
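As a minimal illustration of these definitions (in Python rather than SAS, and separate from the program itself), consider fitting the simplest possible model, y(i,j,k) = mu + e(i,j,k), to the Searle data. Least squares then estimates mu by the grand mean, and the residual sum of squares is the sum of squared deviations from that mean:

```python
# Residual sum of squares for the model  y(i,j,k) = mu + e(i,j,k)
# fitted to the Searle response data.  Least squares estimates mu by
# the grand mean; the residual sum of squares is the sum of squared
# deviations of the observed values from that mean.
y = [6, 10, 11, 13, 15, 14, 22, 12, 15, 19, 18, 31, 18, 9, 12]

mu_hat = sum(y) / len(y)                 # least-squares estimate of mu
residuals = [yi - mu_hat for yi in y]    # actual minus predicted
rss = sum(r ** 2 for r in residuals)     # residual sum of squares

print(mu_hat, rss)   # 15.0 520.0
```

Any other choice of mu would make the sum of squared residuals larger, which is exactly the minimization described above.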
AN INTERPRETATION OF ANALYSIS OF VARIANCE SUMS OF SQUARES
IN TERMS OF MODEL EQUATIONS
We can view all standard analysis of variance numerator sums of
squares as being the value we obtain if we subtract the resid-
ual sum of squares for one model from the residual sum of
squares for another model (Yates 1934, 63). For example, con-
sider the following two models for the results of a two-way ex-
periment:
y(i,j,k) = mu + beta(j) + e(i,j,k) (3)
y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2)
I shall use the term Rn to denote the residual sum of squares
for model (n). Thus R3 denotes the residual sum of squares for
(3).
Suppose we perform a two-way experiment (such as the Searle ex-
periment), and suppose we (separately) fit models (2) and (3)
to the results of the experiment, yielding R2 and R3. If we
subtract R2 from R3, this difference is identical to the HTO (=
SAS Type II = SPSS ANOVA Experimental) sum of squares for the A
main effect.
It stands to reason that the numerical difference R3 - R2
should equal a sum of squares for the A main effect since the
only conceptual difference between the two models is that (2)
contains alpha, the term for predictor variable A, and (3) does
not.
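The claim that R3 - R2 is the HTO sum of squares for A can be checked numerically. The following pure-Python sketch (separate from the SAS program; the helper names designf, hdir, hcat, and rss are my own, chosen to mirror the IML operations used later) fits models (3), (2), and the saturated model (1) to the Searle data by solving the least-squares normal equations, then differences the residual sums of squares:

```python
# Pure-Python illustration (not part of the SAS program): fit models (3),
# (2), and the saturated model (1) to the Searle data and difference the
# residual sums of squares.  R3 - R2 is the HTO (SAS Type II) sum of
# squares for the A main effect.

a = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
b = [1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3, 3, 3]
y = [6, 10, 11, 13, 15, 14, 22, 12, 15, 19, 18, 31, 18, 9, 12]

def designf(v):
    # full-column-rank effect coding, mimicking IML's DESIGNF
    levels = sorted(set(v))
    return [[1 if x == lev else (-1 if x == levels[-1] else 0)
             for lev in levels[:-1]] for x in v]

def hdir(P, Q):
    # horizontal direct product, mimicking IML's HDIR
    return [[p * q for p in pr for q in qr] for pr, qr in zip(P, Q)]

def hcat(*mats):
    # horizontal concatenation (IML's || operator)
    return [sum(rows, []) for rows in zip(*mats)]

def rss(X, y):
    # residual sum of squares via the normal equations X'X beta = X'y,
    # solved by Gauss-Jordan elimination with partial pivoting
    n, p = len(X), len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [x - f * z for x, z in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2
               for i in range(n))

ones = [[1] for _ in range(15)]
Ad, Bd = designf(a), designf(b)
R3 = rss(hcat(ones, Bd), y)                    # model (3): mu + beta
R2 = rss(hcat(ones, Ad, Bd), y)                # model (2): mu + alpha + beta
R1 = rss(hcat(ones, Ad, Bd, hdir(Ad, Bd)), y)  # model (1): saturated
print(R3 - R2)                                 # HTO sum of squares for A
```

For the saturated model the residual sum of squares equals the pooled within-cell sum of squares (120 for these data), which gives an independent check on the fitting code.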
Consider the following two models:
y(i,j,k) = mu + beta(j) + phi(i,j) + e(i,j,k) (4)
y(i,j,k) = mu + alpha(i) + beta(j) + phi(i,j) + e(i,j,k). (1)
Note that (4) and (1) are the same as (3) and (2) respectively,
except that both (4) and (1) have an extra term, namely
phi(i,j). R4 - R1 is identical to the HTI (= SAS Type III =
SPSS ANOVA Unique) sum of squares for the A main effect.
We can see the source of the names HTI and HTO by studying the
two pairs of models above. For the HTI sum of squares the
Higher-level interaction Term [phi(i,j)] is Included (HTI) in
both models. For the HTO sum of squares the Higher-level in-
teraction Term is Omitted (HTO) from both models.
Finally, consider the following two models:
y(i,j,k) = mu + e(i,j,k) (5)
y(i,j,k) = mu + alpha(i) + e(i,j,k). (6)
Note that (5) and (6) are the same as (3) and (2) respectively,
except that both (5) and (6) lack the term beta(j). R5 - R6 is
identical to the sequential (= SAS Type I = SPSS ANOVA Hierar-
chical) sum of squares for the A main effect when A is entered
first (after the mean) into the model.
If we compute the difference between the residual sums of
squares of two models, the difference is called the "sum of
squares for the effect being tested". The "effect" is the term
that is present in one of the two models, but absent from the
other.
Each sum of squares has associated with it a number of "degrees
of freedom".
(For any main effect, the number of degrees of freedom is one
less than the number of values assumed by the associated pre-
dictor variable in the experiment. For example, in the Searle
data, predictor variable B assumes three different values in
the experiment so the number of degrees of freedom for B is
two. For an interaction, the number of degrees of freedom is
the product of the numbers of degrees of freedom for the main
effects for each of the predictor variables associated with the
interaction.)
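In Python (an illustration only, not part of the SAS program), these counting rules applied to the Searle layout are:

```python
# Degrees of freedom for the Searle layout, computed from the two
# predictor-value vectors defined in the program.
a = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]   # soil type
b = [1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3, 3, 3]   # seed variety

df_A = len(set(a)) - 1   # one less than the number of values of A
df_B = len(set(b)) - 1   # one less than the number of values of B
df_AB = df_A * df_B      # product rule for the interaction

print(df_A, df_B, df_AB)   # 1 2 2
```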
If we divide a sum of squares for a particular effect by its
degrees of freedom, we get the "mean square" for the effect.
Similarly, if we divide the residual sum of squares for the
saturated model by *its* degrees of freedom, we get the resid-
ual mean square. The reason we might *want* to compute any of
these mean squares rests on three facts
- If there is *no* relationship in the population between the
response variable and predictor variable A, clearly the cor-
rect value for alpha(i) is zero for all values of i in all
the equations above. In this case, it can be shown that the
three mean squares for the A effect can be expected (under
certain often-satisfiable assumptions) to equal the "residual
variance" in the experiment. (The residual variance is esti-
mated by the residual mean square.)
- If there *is* a relationship between A and the response vari-
able, the three mean squares for the A effect can usually be
expected to be *greater* than the residual variance.
- Thus to determine whether there is evidence of a relationship
between the response variable and predictor variable A, we
need only determine whether the appropriate effect mean
square is significantly greater than the residual mean
square. The p-value is simply an easy-to-interpret result of
this determination.
These facts, buttressed by mathematical arguments (Searle 1987,
sec. 8.6), imply that the three approaches to computing sums of
squares provide (with certain limitations) valid candidates for
the numerator sums of squares in F-ratios used to compute p-
values used to test for the existence of a relationship between
the response variable and predictor variable A.
(The technique I discuss above for detecting relationships be-
tween variables is undeniably complex. A questioning reader
might wonder whether a simpler technique [with reasonable power
and objectivity] might be found. So far, no such technique has
been found.)
The above discussion states that three popular types of numera-
tor sums of squares (HTO, HTI, and sequential) can be computed
in a two-way experiment by computing the difference between the
residual sums of squares of two model equations. This general-
izes:
All standard analysis of variance numerator sums of
squares for two-way, three-way, and higher experiments
can be viewed as reflecting the difference between the
residual sums of squares of two overparameterized model
equations.
SUMMING UP
- an important use of analysis of variance is to help research-
ers detect relationships between variables in populations of
entities
- we can detect relationships between variables by studying the
p-values obtained by applying analysis of variance to the re-
sults of an experiment
- analysis of variance computer programs compute numerator sums
of squares as an intermediate step in computing p-values
- controversy exists about which method of computing numerator
sums of squares is preferred in unbalanced experiments
- one easy-to-understand way of viewing all standard approaches
to computing numerator sums of squares is to view them as re-
flecting the difference between the residual sums of squares
of two different overparameterized model equations for the
relationship between the response variable and predictor
variables in the experiment.
PROGRAM OVERVIEW
The program statements below demonstrate a simple algorithm for
computing analysis of variance numerator sums of squares. The
algorithm takes three pieces of information as input
1. a specification of the two model equations whose residual
sums of squares we wish to "difference" in order to compute
an analysis of variance numerator sum of squares [e.g., (3)
and (2) above]
2. a specification of the layout of the experiment
3. the response variable data vector containing the values of
the response variable obtained in the experiment.
The algorithm uses the input to compute the specified sum of
squares.
To show that the algorithm works properly, the program computes
seven analysis of variance sums of squares for the Searle data
given above.
The program below is organized into five parts:
1. a set of six preliminary steps to get things ready for com-
puting the sums of squares
2. seven repetitions of three simple steps to compute the seven
sums of squares
3. a recomputation of the seven sums of squares with SAS PROC
GLM for comparison
4. a summary
5. notes, an appendix, and references.
PRELIMINARY STEP 1: START PROC IML AND RESET SOME OPTIONS
PROC IML (Interactive Matrix Language) is an easy-to-use com-
puter language for general matrix arithmetic. It is an add-on
component of the SAS system and has many built-in functions to
facilitate matrix operations particular to statistical analy-
sis.
The PROC IML statement below initiates the IML environment.
After that statement is executed, the statements that follow
are statements in the IML language until we reach the QUIT
statement, which takes us back into a standard SAS program.
*/
proc iml;
/*
The options in the following RESET statement control the desti-
nation and appearance of the IML output.
*/
reset log fw=3 spaces=3;
/*
PRELIMINARY STEP 2: DEFINE THE SEARLE DATA
To get the Searle data into IML, we define two column vectors
(a and b) of values for the two predictor variables, and we de-
fine a column vector y, containing the values of the response
variable. Following are the three IML statements that define
the vectors a, b, and y:
*/
a = { 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2 };
b = { 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3, 3, 3 };
y = { 6,10,11, 13,15, 14,22, 12,15,19,18, 31, 18, 9,12 };
/*
Note how the values in the three vectors "line up" correctly
with each other to reflect exactly the same information as is
given near the beginning of this program in table 1. For exam-
ple, the last three values in y (i.e., 18, 9, and 12) each line
up with a value of 2 in vector a and a value of 3 in vector b,
reflecting the fact that the last three values in y are associ-
ated with the (2,3) cell in table 1.
The set of fifteen numbers between the braces in each of the
three IML statements above is called a "matrix literal". Al-
though the above three matrix literals are laid out horizon-
tally, they do not specify row vectors but instead specify
*column* vectors. They specify column vectors because each of
the numbers (except the last) in each matrix literal is fol-
lowed by a comma, which in an IML matrix literal indicates the
end of a row. (Numbers *within* a row of an IML matrix literal
are not separated by commas, but by blanks.)
PRELIMINARY STEP 3: GENERATE THE MAIN EFFECT SUBMATRICES
Before we can begin computing sums of squares, we must first
generate the "full-column-rank" submatrices of the overall
"design matrix" for the experiment. These submatrices (which
are surprisingly simple) are used in the computation of the
sums of squares.
Any analysis of variance has a separate submatrix of the design
matrix for each of its main effects (i.e., for each predictor
variable) and a separate submatrix for each interaction in the
experiment. Each submatrix has N rows, where N equals the num-
ber of elements (entities, rows, cases, observations) in the
data. (In the Searle data N is fifteen since there were fif-
teen trials in the experiment; these trials are reflected in
the fifteen elements in each of vectors a, b, and y defined
above.) Each submatrix has DF columns, where DF is the number
of degrees of freedom associated with the main effect or inter-
action associated with the submatrix.
The full-column-rank submatrix for any main effect can be gen-
erated with a single statement in IML. For example, to gener-
ate the submatrix for the main effect of predictor variable B,
and to store that submatrix in the matrix called Bdesign, we
use the following statement:
Bdesign = designf(b)
where b is the column vector containing the raw values of pre-
dictor variable B used in the experiment.
DESIGNF is a built-in function in IML. The function returns a
matrix of zeros and (positive and negative) ones, with N rows
and DF columns, where N and DF are as described above.
Consider the main effect for predictor variable B (seed vari-
ety) in the Searle data. Here is a copy of the definition
(given above) of the (column) vector showing the different val-
ues of predictor variable B for the fifteen trials in the ex-
periment:
b = { 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3, 3, 3 }.
Following is the IML statement to compute the submatrix of the
design matrix for predictor variable B. This is followed by a
statement to print the newly computed matrix. (The matrix is
printed transposed for ease of study.)
*/
Bdesign = designf(b);
print (Bdesign`);
/*
Consider the first row of 1's, zeros, and -1's in the output
above. Compare the values in this row with the values in vec-
tor B given further above. Note how the 1's identify the cases
in which B has the value 1. Similarly, the zeros identify the
cases in which B has the value 2. Finally, the -1's identify
the cases in which B has the value 3.
Similarly, in the second row note how the 1's identify the
cases in which B has the value 2. And the -1's identify
(again) the cases in which B has the value 3. The zeros iden-
tify the cases in which B has the value 1.
This generalizes: Suppose some (discrete-valued) predictor
variable P has assumed m different values in an experiment.
(The value of m must be greater than or equal to 2 for P to be
a valid predictor variable.) Then the (transposed) submatrix
for P will have m-1 rows, each row being associated with one of
the m values, and with one of the m values being "left out"
with no row of its own. Within each row the elements are +1 if
this case received the associated value of predictor variable
P; -1 if the case received the "left out" value of predictor
variable P; and zero if m is greater than 2 and the case re-
ceived one of the other m-2 values of predictor variable P.
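The coding rule just described can be sketched in a few lines of Python (an illustration only, not SAS code; the function name designf simply mirrors IML's DESIGNF): one column per value of P except the "left out" last value, with +1 marking the column's own value, -1 the left-out value, and 0 otherwise.

```python
# Pure-Python sketch of the full-column-rank effect coding produced by
# IML's DESIGNF: m-1 columns for a predictor with m values; the last
# value is "left out" and coded -1 in every column.
def designf(v):
    levels = sorted(set(v))
    return [[1 if x == lev else (-1 if x == levels[-1] else 0)
             for lev in levels[:-1]] for x in v]

b = [1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3, 3, 3]
Bdesign = designf(b)

# First column (the first row of the transposed print in the program):
print([row[0] for row in Bdesign])
# [1, 1, 1, 0, 0, -1, -1, 1, 1, 1, 1, 0, -1, -1, -1]
```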
Design matrices are mysterious because, in the case of comput-
ing the standard analysis of variance sums of squares, they are
not unique. That is, we could get exactly the same results in
the output below (apart from slight differences due to roundoff
errors) if we specified a new submatrix of the design matrix
for predictor variable B provided that the two columns in the
new matrix were themselves any two independent "linear combina-
tions" of the two columns of the submatrix for B given above.
How is it possible to get the same results if a submatrix of
the design matrix is replaced with a transformed version of it-
self? This puzzling aspect of design matrices is explained by
the fact that the relevant information (for computing a sum of
squares) in a submatrix of a design matrix is not stored in the
*values* of the elements in the matrix -- it is stored in the
*relationships among* the values. Independent linear combina-
tions of the columns in a submatrix preserve this relevant in-
formation.
(Technically, the linear combinations preserve the information
in the sense that all transformed sets of columns of a sub-
matrix of the design matrix so-defined are generating vectors
of the same relevant subspace of the N-dimensional vector space
under study.)
Now let us compute the submatrix of the design matrix for pre-
dictor variable A (soil type). Here is a copy of the (column)
vector of values from the definition given above for this vari-
able:
a = { 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2 }
Following are the statements to compute and print the submatrix
of the design matrix for predictor variable A. (The matrix is
again printed transposed for ease of study.)
*/
Adesign = designf(a);
print (Adesign`);
/*
Since predictor variable A assumed only two values in the ex-
periment, the (transposed) submatrix of the design matrix has
only one row.
As was the case with predictor variable B, note how the 1's and
-1's above in Adesign line up respectively with the 1's and 2's
in the column vector of the values of predictor variable A.
PRELIMINARY STEP 4: GENERATE THE INTERACTION SUBMATRIX
Each interaction in an experiment has its own submatrix of the
design matrix. We can generate an interaction submatrix by
computing the "horizontal direct product" of the design matri-
ces for all the main effects for the predictor variables asso-
ciated with the interaction.
Consider two matrices P and Q with n rows and np and nq col-
umns, respectively. The horizontal direct product of P and Q
has n rows and np x nq columns. Each column in the horizontal
direct product is associated with a unique pair of columns, one
from P and one from Q. The elements in a given column of the
horizontal direct product are the products of the correspond-
ing elements in the two associated columns of P and Q.
For example, consider the following two matrices:
P = {1 2, and Q = {1 6,
3 4, 2 5,
5 6 } 3 4 }.
The horizontal direct product of P and Q is
{ 1 6 2 12,
6 15 8 20,
15 20 18 24 }.
The IML function HDIR(P,Q) computes the horizontal direct prod-
uct of two matrices P and Q.
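To make this concrete, here is a pure-Python sketch of the operation (the function name hdir mirrors IML's HDIR, but this is an illustration, not SAS code) that reproduces the worked example above:

```python
# Pure-Python sketch of IML's HDIR: each row of the result is the
# row-wise Kronecker product of the corresponding rows of P and Q.
def hdir(P, Q):
    return [[p * q for p in prow for q in qrow]
            for prow, qrow in zip(P, Q)]

P = [[1, 2], [3, 4], [5, 6]]
Q = [[1, 6], [2, 5], [3, 4]]

print(hdir(P, Q))
# [[1, 6, 2, 12], [6, 15, 8, 20], [15, 20, 18, 24]]
```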
Following is a statement to generate the submatrix of the de-
sign matrix for the A x B interaction in the Searle data. The
submatrix is generated by computing the horizontal direct prod-
uct of the two design submatrices (Adesign and Bdesign) gener-
ated above. This is followed by a statement to print the in-
teraction submatrix (again transposed for ease of study).
*/
ABdesign = hdir(Adesign, Bdesign);
print (ABdesign`);
/*
COMPUTATION STEP 1: FOR A GIVEN SUM OF SQUARES SPECIFY XE
(WE SHALL FIRST COMPUTE THE HTO SUM OF SQUARES FOR A)
Having computed the three submatrices of the design matrix, we
can begin computing sums of squares. Let us first compute the
HTO sum of squares for predictor variable A (soil type).
The actual computing of sums of squares in this program is done
by a subroutine called SS. Using a subroutine is efficient be-
cause it consolidates the code to compute sums of squares into
a reusable and portable package. I discuss the SS subroutine
in detail below.
In order to compute a sum of squares, the SS subroutine re-
quires three pieces of input:
1. the values of the response variable obtained in the experi-
ment
2. information about which sum of squares we wish to compute,
as specified by two overparameterized model equations, as
discussed above
3. information about the layout of the experiment.
The first piece of input, the values of the response variable,
was defined above in the definition of the vector y. Thus we
need only pass the vector y to the subroutine.
The second and third pieces of input are passed to the subrou-
tine through two matrices that are obtained from the submatri-
ces of the design matrix discussed above. The two matrices are
called XE and XR.
The first matrix, XE, is simply the submatrix of the design ma-
trix for the particular effect we wish to test. In the present
case we are testing the A main effect. Thus we specify XE as
follows:
*/
XE = Adesign;
/*
COMPUTATION STEP 2: FOR A GIVEN SUM OF SQUARES SPECIFY XR
In the present example we wish to compute the HTO numerator sum
of squares for the A main effect. Thus (as discussed above) we
wish to compute the difference between the residual sums of
squares of the following two model equations:
y(i,j,k) = mu + beta(j) + e(i,j,k) (3)
y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2)
We have already specified the conceptual difference between (3)
and (2) in the specification of XE above. Thus all that re-
mains in order to fully specify the two equations is to list
the "other" terms on the right side of (3) and (2).
The "other" terms on the right side of (3) and (2) (excluding
the error term) are mu and beta. We specify the other terms by
"horizontally concatenating" the submatrices for these terms to
form the matrix XR.
As suggested by the name, the horizontal concatenation of two
matrices P and Q is the matrix formed by appending the rows of
Q to the corresponding rows of P. (Horizontal concatenation of
two matrices is possible only if the matrices have the same
number of rows.) In SAS IML the expression P || Q is used to
specify the horizontal concatenation of two matrices P and Q.
Since mu is a constant in the two equations, the submatrix of
the design matrix for mu is a column vector of 1's. We can
specify a column vector of fifteen 1's with the IML function
j(15,1).
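Both building blocks have direct NumPy analogues. The sketch
below uses a hypothetical 4-row submatrix standing in for
Bdesign (the data here are invented, purely for illustration):

```python
import numpy as np

# hypothetical 4-row design submatrix standing in for Bdesign
B = np.array([[ 1,  0],
              [ 0,  1],
              [-1, -1],
              [ 1,  0]])

ones = np.ones((4, 1))     # analogue of IML's j(4,1): a column of 1's
XR = np.hstack([ones, B])  # analogue of IML's  j(4,1) || B
print(XR)                  # 4 rows, 1 + 2 = 3 columns
```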
Thus we specify the XR matrix for the HTO A (soil type) sum of
squares in the experiment with the following statement:
*/
XR = j(15,1) || Bdesign;
/*
Here is the (transposed) XR matrix:
*/
print (XR`);
/*
Note that XE and XR contain two types of information:
- information about which analysis of variance sum of squares
we wish to compute (as captured by the *choice* of the sub-
matrices we have used to define XE and XR)
- information about the layout of the experiment (as captured
in the *layout* of the submatrices of the design matrix used
to define XE and XR).
The E in XE stands for the Effect being tested. The R in XR
stands for otheR terms on the right side of the two model equa-
tions.
PRELIMINARY STEP 5:
MAKE THE SS SUBROUTINE AVAILABLE TO THE PROGRAM
(I have delayed this preliminary step until after defining XE
and XR, since this will help you to understand the subroutine.)
Since we have now defined y, XE, and XR for the HTO A (soil
type) effect, we are ready to call subroutine SS to compute the
sum of squares. In fact, we shall call the subroutine momen-
tarily. But before we can actually call the subroutine, we
must first make it known to IML. We do this with the %INCLUDE
statement that follows.
Because the SOURCE2 option is specified in the %INCLUDE state-
ment, SAS lists all of the statements in the subroutine in the
output below immediately after the %INCLUDE statement.
*/
%include 'D:\PROGS\SS.SAS' / source2;
/*
PRELIMINARY STEP 6:
SET THE VALUES OF THE THREE SECONDARY ARGUMENTS
OF THE SS SUBROUTINE
The values of the three secondary arguments of the SS subrou-
tine are set immediately below. These arguments control de-
tails of how the subroutine computes sums of squares and prints
results. The values set below are used on every call to the
subroutine in this program. The values instruct the subroutine
to
- use the first method of computing sums of squares
- print the value of the computed sum of squares and
- print intermediate results (showing the transposed versions
of the matrix H and the vector yp).
*/
level = 1;
printss = 1;
printh = 2;
/*
COMPUTATION STEP 3: CALL SS
Now that we have made the SS subroutine known to IML, we can
use it. Recall that we defined y, XE and XR above to allow us
to compute the HTO sum of squares for the A (soil type) main
effect in the Searle data. Thus we are now ready to actually
call the SS subroutine to do the computation. But just before
we do, note that Searle's exact answer for the HTO sum of
squares for the A main effect (1987, 113, 114, 122 [typo]) is
83 127/141 = 83.90070 92198 58156 0
Here is the statement to call the SS subroutine to compute and
print the HTO sum of squares for the A main effect:
*/
call SS(result, y, XE, XR, level,printss,printh);
/*
The SS subroutine generated the twenty-seven printed lines
above. First it printed the computed sum of squares, 83.9....
Note how this value is very close to Searle's exact answer: the
difference occurs in the fifteenth significant digit (which, if
it is rounded, should be 2 instead of 1).
Next, the subroutine printed the transposed version of the hy-
pothesis matrix H that was used to compute the sum of squares.
(This matrix was generated solely from XE and XR, which were,
in turn, generated solely from the values of the two predictor
variables. That is, the response variable, y, played no role
in generating H.)
Next, the subroutine printed the projection matrix PM. This
matrix was generated directly from the hypothesis matrix H and
is simply a different way of stating the hypothesis being
tested.
(The zeros shown above in PM are not true zeros; they appear
as zeros only because the SAS printing algorithm must round
the values to fit the narrow print fields.)
Note that PM is symmetric (i.e., each row equals the
corresponding column, so PM equals its transpose) and that the
values in the matrix change in step with the changes in the
treatment groups in the experiment, as defined by the
different values in the a and b vectors specified earlier.
Next, the subroutine printed the vector yp, which is the pro-
jection of the vector y obtained when y is multiplied by PM.
The computed sum of squares (i.e., 83.9...) was computed by
summing the squared elements of yp.
To help us judge the accuracy of the computed sum of squares,
the subroutine concluded by printing the three numbers
NDIGITSS, NDIGITSR, and NDIGITSI. These numbers are measures
of the integrity of the projection matrix -- the smallest of
these numbers (i.e., 15.3) gives a rough indication of the
ceiling of the number of digits of accuracy in the numbers in
the projection matrix. Since IML maintains a precision of
roughly sixteen floating point decimal digits on my (Windows)
computer, the maximum possible number of digits of accuracy on
my computer is 16. (For curious readers, I describe the three
numbers further in the comments near the end of the SS subrou-
tine.)
Since the three measures of the integrity of the projection ma-
trix suggest that the projection matrix is accurate to roughly
fifteen significant digits, and since an accuracy of a sum of
squares to four significant digits is fully adequate for com-
puting a p-value, we can be confident that the computed sum of
squares is sufficiently accurate.
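The logic the subroutine applies can be sketched in NumPy with
invented toy data (the names and data below are illustrative;
the authoritative code is in SS.SAS). The sketch computes the
same sum of squares two ways -- as the difference between the
residual sums of squares of the two models, and as the squared
length of the projection yp -- and confirms that the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy unbalanced layout: XE codes the effect being tested,
# XR holds the intercept (the "other" term in the two models)
XE = np.array([[1.0], [1.0], [1.0], [-1.0], [-1.0]])
XR = np.ones((5, 1))
y = rng.normal(size=5)

def rss(X, y):
    """Residual sum of squares after least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

# method 1: difference of the residual SS of the two models
ss_diff = rss(XR, y) - rss(np.hstack([XR, XE]), y)

# method 2: squared length of the projection of y onto the part
# of the column space of [XR XE] orthogonal to that of XR
X_full = np.hstack([XR, XE])
P_full = X_full @ np.linalg.pinv(X_full)  # projection, full model
P_red  = XR @ np.linalg.pinv(XR)          # projection, reduced model
PM = P_full - P_red                       # also a projection matrix
yp = PM @ y
ss_proj = yp @ yp

print(ss_diff, ss_proj)  # agree to rounding error
```

The agreement holds because PM is symmetric and idempotent, so
y'(P_full - P_red)y equals the squared length of PM times y.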
COMPUTE THE REMAINING SIX SUMS OF SQUARES
Following are steps to compute six other sums of squares for
the Searle data. Note that each case requires only the follow-
ing three lines of code:
- a line to specify the effect being tested (via XE)
- a line to specify the other terms in the two models (via XR)
- a call to the subroutine (with the CALL statement).
In each case I first state the two model equations whose resid-
ual sums of squares are being differenced. I state the models
in an abbreviated form, omitting the subscripts on the terms
and omitting the error term.
SUM OF SQUARES FOR THE HTO B (SEED VARIETY) MAIN EFFECT
The two model equations for computing the HTO B sum of squares
are
y = m + a
y = m + a + b.
Thus to compute the HTO B sum of squares we specify XE as
BDESIGN -- the submatrix of the design matrix for main effect
B.
*/
XE = Bdesign;
/*
We specify XR as the horizontal concatenation of J(15,1) and
ADESIGN, which are the submatrices for m and a, which are the
other terms (excluding the error term) on the right side of the
two model equations.
*/
XR = j(15,1) || Adesign;
/*
Searle's exact answer for the HTO B sum of squares (1987, 104,
113, 114, 122) is
124 69/94 = 124.73404 25531 91489 4
*/
call SS(result, y, XE, XR, level,printss,printh);
/*
SUM OF SQUARES FOR THE HTI (HIGHER-LEVEL TERMS INCLUDED)
A MAIN EFFECT
HTI sums of squares operate by including higher-level interac-
tion terms in the two model equations whose residual sums of
squares are differenced. Thus the two model equations for com-
puting the HTI A sum of squares are
y = m + b + p
y = m + a + b + p.
Note the appearance of the p (interaction) term in both equa-
tions. This term was omitted above in the computation of the
two HTO sums of squares. Recall that the submatrix of the de-
sign matrix for the interaction is ABdesign.
Searle's exact answer for this sum of squares (1987, 91) is
123 27/35 = 123.77142 85714 28571 4
*/
XE = Adesign;
XR = j(15,1) || Bdesign || ABdesign;
call SS(result, y, XE, XR, level,printss,printh);
/*
SUM OF SQUARES FOR THE HTI B MAIN EFFECT
The two models for computing the HTI B sum of squares are
y = m + a + p
y = m + a + b + p.
Searle does not give an exact answer for this sum of squares.
The SAS Type III sum of squares for the B main effect is
192.12765 957
*/
XE = Bdesign;
XR = j(15,1) || Adesign || ABdesign;
call SS(result, y, XE, XR, level,printss,printh);
/*
SUM OF SQUARES FOR THE SEQUENTIAL A MAIN EFFECT
WHEN A IS ENTERED FIRST
The two models for computing the sequential A sum of squares
when A is entered first are
y = m
y = m + a.
(When A is entered SECOND [i.e., after B], the sequential sum
of squares for A is the same as the HTO sum of squares for A,
which is computed above.)
Searle's exact answer for this sum of squares (1987, 93, 113,
114, 122) is
52 1/2
*/
XE = Adesign;
XR = j(15,1);
call SS(result, y, XE, XR, level,printss,printh);
/*
SUM OF SQUARES FOR THE SEQUENTIAL B MAIN EFFECT
WHEN B IS ENTERED FIRST
The two models for computing the sequential B sum of squares
when B is entered first are
y = m
y = m + b.
(When B is entered SECOND [i.e., after A], the sequential sum
of squares for B is the same as the HTO sum of squares for B,
which is computed above.)
Searle's exact answer for this sum of squares (1987, 95, 113,
114, 122 [typo]) is
93 1/3
*/
XE = Bdesign;
XR = j(15,1);
call SS(result, y, XE, XR, level,printss,printh);
/*
SUM OF SQUARES FOR THE A x B INTERACTION EFFECT
The two models for computing the A x B interaction sum of
squares are
y = m + a + b
y = m + a + b + p.
Note that the sum of squares for the highest-level interaction
in an experiment is the same under all the standard approaches
to computing analysis of variance sums of squares.
Searle's exact answer for this sum of squares (1987, 113, 114)
is
222 36/47 = 222.76595 74468 08510 6
*/
XE = ABdesign;
XR = j(15,1) || Adesign || Bdesign;
call SS(result, y, XE, XR, level,printss,printh);
/*
SAVE THE DATA IN A SAS DATASET
The next three IML statements create a SAS dataset (called
"Searle_1") and then transfer the values of vectors a, b, and y
to variables with the same names in the dataset. This enables
us to compute all the sums of squares for the Searle data with
SAS GLM.
*/
create Searle_1 var {a b y};
append;
close Searle_1;
/*
QUIT FROM IML
*/
quit;
/*
RUN PROC GLM TWICE
The following statements run SAS GLM on the data to enable com-
paring the above IML output with output from an analysis of the
data by GLM. (The output from GLM comes in a separate output
file.)
Examination of the GLM output reveals that all the GLM sums of
squares are identical in all available digits to the sums of
squares produced by the SS subroutine.
*/
title 'IML/GLM 2 x 3 Unbalanced ANOVA, Searle (1987, 79)';
options nodate linesize=80 probsig=2;
proc glm data=Searle_1;
class a b;
model y = a | b / ss1 ss2 ss3;
quit;
/*
Run GLM a second time with a and b reversed in the model state-
ment to get the sequential (Type I) sum of squares for B when B
is entered first into the model.
*/
proc glm data=Searle_1;
class a b;
model y = b | a / ss1 ss2 ss3;
quit;
options date linesize=80 probsig=2;
title;
/*
SUMMARY
This program illustrates three approaches to computing numera-
tor sums of squares in unbalanced analysis of variance: HTO,
HTI, and sequential. The three approaches are specified in
terms of computing the difference between the residual sums of
squares of two overparameterized model equations. The two mod-
els are specified in terms of two matrices, XE and XR, which
are made up of submatrices of the overall (full-column-rank)
design matrix for the experiment. XE is the submatrix for the
effect being tested. XR is the horizontal concatenation of the
submatrices for all the other terms (except the error term) on
the right side of the two model equations.
Note the conceptual economy of the approach: After the data
are known to IML, the submatrices of the design matrix can be
specified with one simple statement per submatrix (using the
DESIGNF and HDIR functions). It takes just two more statements
to specify (via XE and XR) a particular sum of squares to be
computed. And it takes only four more reusable statements (in
the SS subroutine) to compute any standard numerator sum of
squares.
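That economy can be roughly mirrored outside SAS as well. The
NumPy sketch below runs the whole pipeline on an invented
unbalanced 2 x 3 layout (NOT the Searle data); designf and hdir
are illustrative stand-ins for IML's DESIGNF and HDIR:

```python
import numpy as np

def designf(levels):
    """Full-column-rank deviation coding, analogous to IML's
    DESIGNF: with g levels (numbered 1..g), level k maps to row k
    of a g x (g-1) contrast matrix whose last row is all -1's."""
    g = max(levels)
    C = np.vstack([np.eye(g - 1), -np.ones(g - 1)])
    return C[np.array(levels) - 1]

def hdir(P, Q):
    """Row-wise Kronecker product, analogous to IML's HDIR."""
    return np.einsum('ij,ik->ijk', P, Q).reshape(P.shape[0], -1)

def rss(X, y):
    """Residual sum of squares after least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

# invented unbalanced 2 x 3 layout -- NOT the Searle data
a = [1, 1, 1, 1, 2, 2, 2]
b = [1, 2, 3, 3, 1, 2, 3]
y = np.array([3.0, 5.0, 2.0, 4.0, 7.0, 6.0, 8.0])

A, B = designf(a), designf(b)
AB = hdir(A, B)
ones = np.ones((len(y), 1))

# HTO A sum of squares:  models  y = m + b  vs  y = m + a + b
ss_hto_a = rss(np.hstack([ones, B]), y) - rss(np.hstack([ones, A, B]), y)

# interaction sum of squares:  y = m + a + b  vs  y = m + a + b + p
ss_int = (rss(np.hstack([ones, A, B]), y)
          - rss(np.hstack([ones, A, B, AB]), y))

print(ss_hto_a, ss_int)  # both nonnegative, since the models are nested
```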
NOTES
If you wish to run this program on your computer, see the
checklist in the appendix.
This program illustrates computation of numerator sums of
squares in the two-way case -- i.e., the case with two dis-
crete-valued predictor variables. I discuss five approaches to
computing analysis of variance numerator sums of squares in the
three-way case in another program (1998).
I discuss in the paper (1997, sec. 17) the fact that statisti-
cal tests provided by the HTO sums of squares are (in the rele-
vant cases) generally more powerful than the corresponding
tests provided by the HTI sums of squares. This implies that
if there is a relationship, an HTO sum of squares will usually
be greater than the corresponding HTI sum of squares. However,
Searle's data twice illustrate the fact that an HTO sum of
squares is not *always* greater than the corresponding HTI sum
of squares. Specifically, the HTO (SAS Type II) sum of squares
for the A main effect in Searle's data is less than the HTI
(SAS Type III) sum of squares for the same effect (83.9 versus
123.8). Similarly, the HTO sum of squares for the B main ef-
fect in Searle's data is also less than the HTI sum of squares
(124.7 versus 192.1).
APPENDIX: STEPS TO RUN THIS PROGRAM
1. Ensure that the STAT and IML components of the SAS system
are available on your computer. Information about the SAS
system is available at http://www.sas.com
2. Ensure that you have the source version of this program,
which is called PR0139.SAS (not the HTML version, which is
called PR0139.HTM). You can obtain a copy of the source
version in the "Computer Programs" section of the page at
http://www.matstat.com/ss/
3. Install a copy of the SS subroutine on your computer. This
subroutine does the actual computations of sums of squares
and is available from the above MatStat web page.
4. Edit the %INCLUDE statement in preliminary step 5 above to
correctly point to the location of the SS.SAS subroutine
file on your computer. That is, change the
D:\PROGS\SS.SAS
in the statement to the location where SS.SAS is stored on
your computer.
5. (Optional.) Modify the two OPTIONS statements in the pro-
gram that adjust the DATE, LINESIZE, and PROBSIG options.
6. Submit the program to SAS.
REFERENCES
Fisher, R. A. 1925. _Statistical Methods for Research Workers._
Edinburgh: Oliver and Boyd. The 14th edition of this semi-
nal work appears in Fisher (1990).
Fisher, R. A. 1935. _The Design of Experiments._ Edinburgh:
Oliver and Boyd. The 8th edition of this seminal work ap-
pears in Fisher (1990).
Fisher, R. A. 1990. _Statistical Methods, Experimental Design,
and Scientific Inference_ edited by J. H. Bennett. Oxford:
Oxford University Press.
Macnaughton, D. B. 1996a. The introductory statistics course:
A new approach. Available at http://www.matstat.com/teach/
Macnaughton, D. B. 1996b. The entity-property-relationship ap-
proach to statistics: An introduction for students. Avail-
able at http://www.matstat.com/teach/
Macnaughton, D. B. 1997. Which sums of squares are best in un-
balanced analysis of variance? Available at
http://www.matstat.com/ss/
Macnaughton, D. B. 1998. PR0165.HTM: Computing numerator sums
of squares in unbalanced analysis of variance: Three-way
case. Available in the "Computer Programs" section at
http://www.matstat.com/ss/
Searle, S. R. 1987. _Linear Models for Unbalanced Data._ New
York: Wiley.
Yates, F. 1934. The analysis of multiple classifications with
unequal numbers in the different classes. _Journal of the
American Statistical Association_ 29, 51-66.
version of June 19, 1998
(end of program pr0139.sas) */