40(1), 115–148
Copyright © 2005, Lawrence Erlbaum Associates, Inc.
Fit Indices Versus Test Statistics
Ke-Hai Yuan
University of Notre Dame
Model evaluation is one of the most important aspects of structural equation model-
ing (SEM). Many model fit indices have been developed. It is not an exaggeration to
say that nearly every publication using the SEM methodology has reported at least
one fit index. Most fit indices are defined through test statistics. Studies and interpre-
tation of fit indices commonly assume that the test statistics follow either a central
chi-square distribution or a noncentral chi-square distribution. Because few statistics
in practice follow a chi-square distribution, we study properties of the commonly
used fit indices when dropping the chi-square distribution assumptions. The study
identifies two sensible statistics for evaluating fit indices involving degrees of free-
dom. We also propose linearly approximating the distribution of a fit index/statistic
by a known distribution or the distribution of the same fit index/statistic under a set of
different conditions. The conditions include the sample size, the distribution of the
data as well as the base-statistic. Results indicate that, for commonly used fit indices
evaluated at sensible statistics, both the slope and the intercept in the linear relation-
ship change substantially when conditions change. A fit index that changes the least
might be due to an artificial factor. Thus, the value of a fit index is not just a measure
of model fit but also of other uncontrollable factors. A discussion with conclusions is
given on how to properly use fit indices.
In social and behavioral sciences, interesting attributes such as stress, social
support, socio-economic status cannot be observed directly. They are measured by
multiple indicators that are subject to measurement errors. By segregating
measurement errors from the true scores of attributes, structural equation modeling
(SEM), especially its special case of covariance structure analysis, provides a
methodology for modeling the latent variables directly. Although there are many
aspects to modeling, such as parameter estimation, model testing, and evaluating
the size and significance of specific parameters, overall model evaluation is the
most critical part in SEM. There is a huge body of literature on model evaluation
that can be roughly classified into two categories: (a) overall-model-test statistics
that judge whether a model fits the data exactly; (b) fit indices that evaluate the
achievement of a model relative to a base model.
The research was supported by Grant DA01070 from the National Institute on Drug Abuse (Peter
M. Bentler, Principal Investigator) and NSF Grant DMS-0437167. I am thankful to Peter M. Bentler
and Robert C. MacCallum for their comments that have led the article to a significant improvement over
the previous version.
Correspondence concerning this article should be addressed to Ke-Hai Yuan, Department of
Psychology, University of Notre Dame, Notre Dame, IN 46556. E-mail: kyuan@nd.edu
Fit indices and test statistics are often closely related. Actually, most interesting
fit indices $F$s are defined through the so-called chi-square statistics $T$s. The rationales
behind these fit indices are often based on the properties of $T$. For example, under
idealized conditions, $T$ may approximately follow a central chi-square distribution
under the null hypothesis and a noncentral chi-square distribution under an alternative
hypothesis. In practice, data and model may not satisfy the idealized conditions and
$T$ may not follow (noncentral) chi-square distributions. Then, the rationales motivating
these fit indices do not hold. There are a variety of studies on the performance of
statistics; there also exist many studies on the performance of fit indices. However,
these two classes of studies are not well connected. For example, most of the studies
on fit indices rely only on simulation with the normal theory based likelihood ratio
statistic. There are a few exceptions (e.g., Anderson, 1996; Hu & Bentler, 1998;
Marsh, Hau, & Wen, 2004; Wang, Fan, & Willson, 1996; Zhang, 2004), but no study has
focused on the relationship between fit indices and test statistics. This article will
formally explore the relationship of the two. We are especially interested in conditions
that affect the distributions of the commonly used fit indices. The purpose is to
identify statistics that are most appropriate for calculating fit indices, to use fit
indices more wisely, and to evaluate models more scientifically.
We will use both analytical and empirical approaches to study various proper-
ties of fit indices. Our study will try to answer the following questions.
1. As point estimators, what are the population counterparts of the commonly
used fit indices?
2. How are the population counterparts related to model misspecifications?
3. Do we ever know the distribution of a fit index with real or even simulated
data?
4. Are cutoff values such as 0.05 or 0.95 related to the distributions of the fit
indices?
5. Are measures of model fit/misfit defined properly when the base-statistic
does not follow a chi-square distribution? If not, can we have more sensible
measures?
6. Do confidence intervals for fit indices as printed in standard software
cover the model fit/misfit with the desired probability?
7. How can the power or sensitivity of fit indices be reliably evaluated?
8. Can we ever get an unbiased estimator of the population model fit/misfit as
commonly defined?
Some of the questions have positive answers, some have negative answers, and some may
need further study. We will provide insightful discussions when definite answers
are not available.
Although mean structure is an important part of SEM, in this article we will fo-
cus on covariance structure models due to their wide applications. In the next sec-
tion, in order to facilitate the understanding of the development in later sections,
we will give a brief review of the existing statistics and their properties, as well as
fit indices and their rationales. We will discuss properties of fit indices under ideal-
ized conditions in the section entitled “Mean Values of Fit Indices Under Idealized
Conditions.” Of course, idealized conditions do not hold in practice. In the section
entitled “Approximating the Distribution of $T$ Using a Linear Transformation,” we
will introduce a linear transformation on the distribution of $T$ to understand the
difference between idealization and realization. With the help of the linear
transformation, we will discuss the properties of fit indices in the section entitled
“Properties of Fit Indices When $T$ Does Not Follow a Chi-Square Distribution.” In the
section entitled “Matching Fit Indices with Statistics,” we will match fit indices
and statistics based on existing literature. An ad hoc correction to some existing
statistics will also be given. The corrected statistics are definitionally more appro-
priate to define most fit indices. In the section entitled “Stability of Fit Indices
When Conditions Change,” we discuss the sensitivity of fit indices to changes in
other conditions besides model misspecification. Power issues related to fit indices
will be discussed in the section entitled “The Power of a Fit Index.” In the Discus-
sion section, we will discuss several critical issues related to measures of model fit
and test statistics. We conclude the article by providing recommendations and
pointing out remaining issues for further research.
SOME PROPERTIES OF STATISTICS AND RATIONALES
FOR COMMONLY USED FIT INDICES
Let $x$ represent the underlying $p$-variate population from which a sample
$x_1, x_2, \ldots, x_N$ with $N = n + 1$ is drawn. We will first review properties of
three classes of statistics. Then we discuss the rationales behind several commonly
used fit indices.
This section will provide basic background information for later sections, where
we discuss connections between fit indices and the existing statistics.
Statistics
The first class of statistics includes the normal theory likelihood ratio statistic
and its rescaled version; the second one involves asymptotically distribution free
statistics. These two classes are based on modeling the sample covariance matrix
$S$ by a proposed model structure $\Sigma(\theta)$. The third class is based on robust
procedures which treat each observation $x_i$ individually instead of using the summary
statistic $S$.
The most widely utilized test statistic in SEM is the classical likelihood ratio
statistic $T_{ML}$, based on the normal distribution assumption of the data. When data
are truly normally distributed and the model structure is correctly specified, $T_{ML}$
approaches a chi-square distribution $\chi^2_{df}$ as the sample size $N$ increases. Under
certain conditions, this statistic asymptotically follows $\chi^2_{df}$ even when data are not
normally distributed (Amemiya & Anderson, 1990; Browne & Shapiro, 1988;
Kano, 1992; Mooijaart & Bentler, 1991; Satorra, 1992; Satorra & Bentler, 1990;
Yuan & Bentler, 1999b). Such a property is commonly called asymptotic robustness.
However, procedures do not exist for verifying the conditions for asymptotic
robustness. It seems foolish to blindly trust that $T_{ML}$ will asymptotically follow
$\chi^2_{df}$ when data exhibit nonnormality. When data possess heavier tails than that of a
multivariate normal distribution, the statistic $T_{ML}$ is typically stochastically greater
than the chi-square variate $\chi^2_{df}$. When the fourth-order moments of $x$ are all
finite, the statistic $T_{ML}$ can be decomposed into a linear combination of independent
$\chi^2_{j1}$. That is,
$$T_{ML} = \sum_{j=1}^{df} \kappa_j \chi^2_{j1} + o_p(1),$$
where the $\kappa_j$s depend on the fourth-order moments of $x$ as well as the model
structure, and $o_p(1)$ is a term that approaches zero in probability as sample size $N$
increases. When $x$ follows elliptical or pseudo elliptical distributions with a common
kurtosis $\kappa$, $\kappa_1 = \kappa_2 = \ldots = \kappa_{df} = \kappa$. Then (see Browne, 1984; Shapiro & Browne,
1987; Yuan & Bentler, 1999b)
$$T_{ML} = \kappa \chi^2_{df} + o_p(1). \tag{1}$$
When a consistent estimator $\hat{\kappa}$ of $\kappa$ is available, one can divide $T_{ML}$ by $\hat{\kappa}$ so that
the resulting statistic still asymptotically approaches $\chi^2_{df}$. Satorra and Bentler
(1988) proposed $\hat{\kappa} = (\hat{\kappa}_1 + \ldots + \hat{\kappa}_{df})/df$ and the resulting statistic
$$T_R = \hat{\kappa}^{-1} T_{ML},$$
is often referred to as the Satorra-Bentler rescaled statistic. Like $T_{ML}$, $T_R$ can also
follow a chi-square distribution when certain asymptotic robustness conditions are
satisfied (Kano, 1992; Yuan & Bentler, 1999b). Simulation studies indicate that $T_R$
performs quite robustly under a variety of conditions (Chou, Bentler, & Satorra,
1991; Curran, West, & Finch, 1996; Hu, Bentler, & Kano, 1992). However, data
generation in some of the studies is not clearly stated and may satisfy the
asymptotic robustness condition for $T_R$ (see Yuan & Bentler, 1999b). In general, $T_R$ does
not approach a chi-square distribution. Instead, it approaches a variate $\tau$ with
$E(\tau) = df$. It is likely that the distribution shape of $\tau$ is far from that of a
chi-square. In such cases, $T_R$ will not behave like a chi-square. It can also lead to
inappropriate conclusions when referring $T_R$ to a chi-square distribution.
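As a rough numerical illustration of the decomposition and rescaling described above (a sketch, not part of the original analysis), the following simulation draws weighted sums of independent one-degree-of-freedom chi-square variates and divides each sum by the average weight. The weights, the number of replications, and the function names are hypothetical choices made only for illustration; the $o_p(1)$ term is ignored and the weights are treated as known rather than estimated.

```python
import random
import statistics


def weighted_chi2_sum(weights, rng):
    """One draw of sum_j w_j * z_j**2 with independent standard normal z_j."""
    return sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)


def rescaled_draws(weights, reps=20000, seed=0):
    """Draws of the weighted sum divided by the average weight."""
    rng = random.Random(seed)
    scale = sum(weights) / len(weights)
    return [weighted_chi2_sum(weights, rng) / scale for _ in range(reps)]


df = 10
# Hypothetical weights: both sets have average 3.0, but only the first is constant.
weight_sets = {
    "equal": [3.0] * df,
    "unequal": [0.2] * 5 + [5.8] * 5,
}

results = {}
for label, weights in weight_sets.items():
    draws = rescaled_draws(weights)
    results[label] = draws
    # Report the simulated mean and variance of the rescaled sum.
    print(label, round(statistics.mean(draws), 2), round(statistics.variance(draws), 2))
```

Under these assumptions both rescaled sums have mean $df = 10$, but their variance is $2\sum_j w_j^2/\bar{w}^2$: 20 with equal weights, which matches a chi-square variate with 10 degrees of freedom, and about 37.4 with the unequal weights. In the second case the mean is right but the shape is not, so referring the rescaled sum to a chi-square distribution could be misleading.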
With typical nonnormal data in the social and behavioral sciences (Micceri,
1989), the ideal is to have a statistic that approximately follows a chi-square
distribution regardless of the underlying distribution of the data. One of the original
proposals in this direction was made by Browne (1984). His statistic is commonly
called the asymptotically distribution free (ADF) statistic $T_{ADF}$, due to its
asymptotically following $\chi^2_{df}$ as long as $x$ has finite fourth-order moments. The ADF
property is desirable. However, the distribution of $T_{ADF}$ can be far from that of
$\chi^2_{df}$ for typical sample sizes encountered in practice (Hu et al., 1992). Specifically,
the mean and variance of $T_{ADF}$ are much greater than those of $\chi^2_{df}$. Most correctly
specified models are rejected if using $T_{ADF} \sim \chi^2_{df}$. In an effort to find statistics that
perform better in rejection rate with smaller $N$s, Yuan and Bentler (1997b)
proposed a corrected statistic
$$T_{CADF} = T_{ADF}/(1 + T_{ADF}/n).$$
Like $T_{ADF}$, $T_{CADF}$ asymptotically follows $\chi^2_{df}$ as long as $x$ has finite fourth-order
moments, thus, it is asymptotically distribution free. The mean of $T_{CADF}$
approximately equals $df$ for all sample sizes across various distributions (Yuan &
Bentler, 1997b). However, at small sample sizes $T_{CADF}$ over-corrects the behavior
of $T_{ADF}$ due to its rejection rate with correct models being smaller than the
nominal level. Furthermore, $T_{CADF}$ also carries the drawback of the ADF estimation
method with nonconvergences at smaller sample sizes. In addition to $T_{ADF}$,
Browne (1984) also proposed a residual-based ADF statistic $T_{RADF}$ in which the
estimator just needs to be consistent. However, $T_{RADF}$ behaves almost the same
as $T_{ADF}$, rejecting most correct models at smaller sample sizes. Parallel to $T_{CADF}$,
Yuan and Bentler (1998b) proposed $T_{CRADF}$ whose performance is almost the
same as $T_{CADF}$, with its empirical mean approximately equal to $df$ and
under-rejecting the correct model for small sample sizes (Bentler & Yuan, 1999; Yuan &
Bentler, 1998b).
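The correction of the ADF statistic described above, $T_{ADF}/(1 + T_{ADF}/n)$, is simple enough to sketch directly. The function name and the input values below are hypothetical and serve only to show how the size of the correction depends on $n$; they are not taken from any study.

```python
def corrected_adf(t_adf, n):
    """Return T_ADF / (1 + T_ADF / n), with n = N - 1."""
    if n <= 0:
        raise ValueError("n must be positive")
    if t_adf < 0:
        raise ValueError("t_adf must be non-negative")
    return t_adf / (1.0 + t_adf / n)


# Hypothetical value of the uncorrected statistic, for illustration only.
t_adf = 60.0
for n in (99, 499, 4999):
    # The corrected value moves toward t_adf as n grows.
    print(n, round(corrected_adf(t_adf, n), 3))
```

Two properties follow from the formula alone: the corrected value is never larger than the uncorrected one and can never exceed $n$, and the two values agree in the limit as $n$ increases with the statistic held fixed. The first property is consistent with the reduced rejection rates at small sample sizes, and the second with the corrected statistic sharing the asymptotic distribution of the uncorrected one.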
The third class of statistics is obtained from robust procedures. It is well-known
that the sample covariance matrix $S$ is very sensitive to influential observations and
is biased for $\Sigma_0 = \mathrm{Cov}(x)$ when data contain outliers (see Yuan & Bentler, 1998c).
In such a situation, removing these outliers followed by modeling $S$ will lead to
proper analysis of the covariance structure model. However, in a given data set,
determining which cases are outliers may be difficult. The heavy tails, often
indicated by larger marginal kurtoses or Mardia’s (1970) multivariate kurtosis, might