A Note on the Interpretation and Analysis
of the Linear Discriminant Model for
Prediction and Classification
Scott M. Smith, Ph.D.
INTRODUCTION
While not on the "hot marketing topics" list, discriminant
analysis is a much-valued tool for market segmentation. Over the years, the
estimation of the linear discriminant function has received much theoretical
attention, both in the marketing literature (Dillon 1979; Dillon and Schiffman
1978; Crask and Perreault 1977; Morrison 1969; Frank, Massy, and Morrison
1965) and in mathematical statistics (Randles, Broffitt, Ramberg, and Hogg
1978; McLachlan 1977; Krzanowski 1975; Fisher and Van Ness 1973; Lachenbruch
and Mickey 1968).
This concern for estimation has most often focused on the precision with
which the discriminant function correctly classifies sets of observations,
rather than on methods to better optimize the function itself.
Specifically, methodological research has evaluated such areas as the
influence of variable selection (Goldstein and Rabinowitz 1975; Urbakh 1971),
bias in categorization (Krishnaswami and Nath 1968; Lachenbruch 1967; McLachlan
1974), and the validation of rules for classifying sets of observations
(Dillon and Goldstein 1978; Hills 1966).
This concern for the classification ability of the linear discriminant
function has obscured and even confused the fact that two very distinct
purposes and procedures for conducting discriminant analysis exist. The
first procedure, discriminant predictive analysis, is used to optimize
the predictive functions. The second procedure, discriminant classification
analysis, uses the predictive functions derived in the first procedure
to either classify fresh sets of data of known group membership, thereby
validating the predictive function; or if the function has previously been
validated, to classify new sets of observations of unknown group membership.
The prediction and classification procedures referenced in the preceding
paragraph need to be defined. In the prediction procedure, t linear discriminant
functions are derived from a set of weighted independent variables. These
t functions maximally discriminate the t levels of the dependent
variable, thus providing a predictive measure of the subject's group membership.
Discriminant analysis conducted for predictive purposes is based on an
initial set of observations, the group membership of which is known. This
discriminant procedure is commonly coupled with an analysis to classify
the initial data set. However, it is important to note that discriminant
analysis for predictive purposes (i.e., prediction of data having known
group membership) involves only the derivation of the linear discriminant
function, and not the classification of subjects. The purpose of a classification
of observations of known grouping is merely to see how well the derived
function predicts group membership using the subject data from which it
was derived. The classification procedure associated with the predictive
analysis may be thought of as a baseline analysis that establishes a standard
of comparison for future discriminant classification analysis. This baseline
classification analysis produces a t × t confusion matrix
that compares predicted versus actual group membership. This confusion
matrix is one measure of how well the derived functions predict group membership.
Again, the discriminant classification analysis is in sharp contrast to
the predictive analysis.
Discriminant analysis conducted for predictive purposes formulates a linear
discriminant function describing the importance of the independent variables
in differentiating observations of known group membership. Discriminant
analysis conducted for classification purposes validates the predictive
discriminant function as a means of classifying fresh observations of unknown
group membership sampled from the same populations. In the event of previous
validation of the predictive function, the classification analysis is purely
for classification purposes, and the discriminant function used for classification
is neither derived nor at issue.
The central research objectives by which discriminant analysis is most
often evaluated are to maximize either the discriminating power of the
predictive function or the overall correct classification within the confusion
matrix. Although these objectives may lead to maximum values, they are
often less than optimal when expressed in terms of specific research hypotheses.
The results of the classification analysis must be evaluated in light of
the specific research objectives if optimization rather than maximization
is to result. Specifically, we may ask if classification of the group of
interest is maximized. Overall classification may be maximized at the expense
of less than maximal classification of the group of interest, especially
if the group is a small proportion of the total number of observations
classified, as is often the case in classifying members of a market segment.
The ancillary questions that must be answered are: (1) Where do the expected
and observed classifications differ? and (2) What statistical significance
lies in the deviation of observed from expected classification?
Given that the distinction between discriminant analysis as used for prediction
and for classification has been made, the objectives of this paper are threefold:
first, to review the discriminant model as a predictive tool; second, to
expand on the interpretation of the predictive analysis; and third, to
consider a series of analyses that may be used to statistically test the
results of the classification analysis as presented in the confusion matrix.
DISCRIMINANT ANALYSIS FOR PREDICTION
Discriminant analysis is based on the linear model of the familiar matrix
notation form:
$$D = X\lambda$$
which may be expanded to:
$$D_t = \lambda_{t1}X_1 + \lambda_{t2}X_2 + \cdots + \lambda_{tp}X_p \qquad (1)$$
where,
$D_t$ = the predicted discriminant score for group t
t = the number of groups differentiated by the t discriminant functions
X = the measured values of the p independent variables used to predict
group membership
$\lambda_{tp}$ = the vector of weights associated with the p variables that
predict category t.
The discriminant analysis, when conducted for predictive purposes, maximizes
the amount of subject variance explained by the linear function. This maximization
procedure is the rule for all procedures that comprise the family of general
linear models (regression, principal components, and canonical analysis).
Discriminant analysis uses a set of p variables with associated weights
($\lambda_{tp}$) that are derived in a best-fit, linear unbiased fashion to predict
the score of the dependent variable, $D_t$. These discriminant scores are
predictors of group membership that can be used to classify groups of observations
that are of either known or unknown group membership.
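As a minimal sketch of how the scores in (1) are formed, the snippet below multiplies a small data matrix by a matrix of weights. The data values and weights are hypothetical and chosen only for illustration; they are not taken from the insurance example.

```python
import numpy as np

# Hypothetical data: 4 observations measured on p = 3 independent variables.
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [0.0, 3.0, 1.0],
    [1.5, 1.5, 2.0],
])

# Hypothetical weights: one column of p weights per discriminant function (t = 2).
weights = np.array([
    [0.8, 0.2],
    [0.1, 0.9],
    [0.5, 0.4],
])

# D[i, t] is the score of observation i on function t:
# the weighted sum of that observation's p measured values.
D = X @ weights
print(D)
```

Each row of D holds one observation's scores on the t functions, which is the quantity the later classification rules operate on.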
Discriminant analysis may be viewed as an eigenvalue problem, no
different from the eigenvalue problem encountered in solving for the characteristic
roots of any set of linear equations. To solve for the characteristic roots,
we successively maximize the ratio of the between sum of squares to the
within sum of squares for each $\lambda_t$:
$$\lambda_t = \frac{V_t' A V_t}{V_t' W V_t}$$
where,
A = SSa = the between (among) groups SSCrossProducts matrix
W = SSw = the pooled within groups SSCrossProducts matrix
$V_t$ = the eigenvector of weights associated with $\lambda_t$, the t-th
characteristic root
$W_t$ = the vector of discriminant scores on the eigenvector associated with $\lambda_t$
The vector of characteristic roots $\lambda$ is derived from the matrix
equation
$$(A - \lambda_t W)V_t = 0 \qquad (2)$$
Excluding the trivial solution $V_t = 0$ and premultiplying by $W^{-1}$
(the inverse of the within groups SSCP matrix) produces the characteristic
equation
$$|W^{-1}A - \lambda_t I| = 0$$
It is this characteristic equation that is solved for
$\lambda_t$ and $V_t$.
Once $\lambda_t$ and $V_t$ are determined, the prediction of $D_t$ is routine, since
all values in the predictive formula (1) are known.
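The eigenvalue formulation above can be sketched numerically as follows. This is an illustrative sketch only: the function name `discriminant_eigen` and the toy data are hypothetical, and the roots are obtained with NumPy's general eigen-solver applied to the product of the inverse within-groups matrix and the between-groups matrix.

```python
import numpy as np

def discriminant_eigen(X, groups):
    """Return roots and eigenvectors of W^-1 A (largest root first), plus A and W."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    A = np.zeros((p, p))  # between (among) groups SSCP matrix
    W = np.zeros((p, p))  # pooled within groups SSCP matrix
    for g in np.unique(groups):
        Xg = X[groups == g]
        mean_g = Xg.mean(axis=0)
        dev_between = (mean_g - grand_mean).reshape(-1, 1)
        A += len(Xg) * (dev_between @ dev_between.T)
        dev_within = Xg - mean_g
        W += dev_within.T @ dev_within
    # Solving W M = A gives M = W^-1 A without forming the inverse explicitly.
    roots, vectors = np.linalg.eig(np.linalg.solve(W, A))
    order = np.argsort(roots.real)[::-1]
    return roots.real[order], vectors.real[:, order], A, W

# Hypothetical toy data: two groups of three observations on two variables.
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 7]])
groups = np.array([0, 0, 0, 1, 1, 1])
roots, vectors, A, W = discriminant_eigen(X, groups)

# The leading pair should satisfy (A - lambda W) V = 0 up to rounding error.
residual = (A - roots[0] * W) @ vectors[:, 0]
print(roots, residual)
```

With only two groups, the between-groups matrix has rank one, so a single non-zero root is expected and the second root should be numerically zero.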
The basic question answered in the predictive analysis is one of: given
that groups of observations exist, can we develop t functions that
maximally discriminate or explain the difference between the groups? This
type of problem situation is common to most areas of marketing research.
However, the best examples occur where a clear distinction between the
t nominally scaled groups exists.
One such example is provided by Evans (1959), who used personality variables
as predictors of past brand purchases for Ford and Chevrolet car owners.
A Discriminant analysis, if conducted within this problem setting, would
attempt to differentiate on a post hoc basis the brand choice behavior
for Ford and Chevrolet owners. That is, given two previously identified
groups of car owners, can a predictive function be formulated from the
independent variables to explain this difference? Albaum and Hawkins (1979)
provide yet another example, where a predictive analysis was used to differentiate
a sample of fixed and variable rate mortgage holders. Again in this situation,
group membership was known prior to the analysis, the sole purpose of which
was to derive the predictive function. A predictive analysis is possible
in many situations where prior designation of groups exists (e.g., product
purchasers versus non-purchasers; heavy half versus light half market segments;
innovators versus non-innovators; successful versus non-successful new
product ideas, etc.). Again, the research objective is to predict using
the set of independent variables, and not to classify consumers of unknown
group membership.
Linear model users are often disappointed when the model that predicts
group membership well for the original set of objects becomes at best marginal
when applied to fresh data drawn from the same population. This is often
the case because the predictive models do capitalize on chance and therefore
lead to situations where the function may predict group membership of the
initial data set far better than for any other sample that could be drawn.
Clearly, "testing the procedure on the data that gave it birth is
almost certain to overestimate performance. For the optimizing process
that chose it from among many possible procedures will have made the greatest
use possible of any and all idiosyncrasies of those particular data. Sometimes
we say that optimization capitalizes on chance" (Mosteller and Tukey
1968). Optimization based on chance creates a degree of fit, but in the
case of the predictive analysis, this fit may be upward biased and not
representative of the real world (Morrison 1969). Thus we see that while
the predictive analysis explains differences between the t groups
described in the current data sample, it does not validate the model as
explaining differences in the population as a whole.
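The upward bias described above can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not the procedure used in this paper: it fits a two-group Fisher discriminant (midpoint cutoff) to pure-noise predictors, then compares the hit rate on the fitting sample with the hit rate on a fresh sample. The helper names, sample size, and number of predictors are hypothetical.

```python
import numpy as np

def fit_two_group(X, y):
    """Fisher's two-group linear discriminant: weights and midpoint cutoff."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-groups SSCP matrix.
    W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(W, m1 - m0)
    cutoff = w @ (m0 + m1) / 2.0
    return w, cutoff

def hit_rate(X, y, w, cutoff):
    """Proportion of observations assigned to their actual group."""
    predicted = (X @ w > cutoff).astype(int)
    return float((predicted == y).mean())

rng = np.random.default_rng(0)
n, p = 40, 10
# Pure-noise predictors: group labels are unrelated to X by construction.
X_fit, X_new = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_fit, y_new = np.repeat([0, 1], n // 2), np.repeat([0, 1], n // 2)

w, cutoff = fit_two_group(X_fit, y_fit)
resubstitution = hit_rate(X_fit, y_fit, w, cutoff)  # data that derived the function
holdout = hit_rate(X_new, y_new, w, cutoff)         # fresh data, same population
print(resubstitution, holdout)
```

Because the predictors carry no real information, any hit rate above one half on the fitting sample reflects capitalization on chance, while the fresh-sample hit rate would typically fall back toward one half.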
Consider one final example of the predictive analysis. Two groups of
customers are defined on an a priori basis, these being (1) purchasers,
and (2) non-purchasers of an accident insurance product.
The objective of the predictive analysis is to develop an equation that
maximally discriminates the two purchase groups using p independent
demographic and socioeconomic variables. If we restate this objective in
terms of prediction and validation, we desire to develop a set of discriminant
functions that both discriminate between the two sample groups (prediction)
and are generalizable as a valid tool for classifying potential customers
(classification) in the future.
For the purchaser and non-purchaser groups of the accident insurance
product, the discriminant functions are expressed:
Thus, given the equation and the observed values $X_p$, the value $D_t$ can
be derived.
The functions that discriminate between the purchasers and non-purchasers
of the accident insurance product (3) were derived in a stepwise analysis
that employed the Wilks' Lambda statistic to determine which independent
variables should be included in the discriminant function. The Wilks' Lambda
criterion maximally discriminates between the t groups by maximizing
the multivariate F ratio in the tests of differences between the t
group means.
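For reference, Wilks' Lambda can be computed from the between-groups and within-groups SSCP matrices as the ratio of determinants |W| / |A + W|. The sketch below uses hypothetical matrices for two candidate variables; it shows the statistic itself, not the stepwise selection routine of any particular statistical package.

```python
import numpy as np

def wilks_lambda(A, W):
    """Wilks' Lambda = |W| / |A + W|; smaller values mean greater group separation."""
    return np.linalg.det(W) / np.linalg.det(A + W)

# Hypothetical SSCP matrices for two candidate variables.
A = np.array([[37.5, 30.0], [30.0, 24.0]])       # between (among) groups
W = np.array([[4.0, 3.0], [3.0, 16.0 / 3.0]])    # pooled within groups
lam = wilks_lambda(A, W)
print(lam)
```

In a stepwise analysis, the candidate variable whose entry yields the smallest Lambda (equivalently, the largest multivariate F) would be the one added at each step.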
The derived discriminant coefficients may be interpreted as indicative
of the importance of the respective p independent variables entered into
the discriminant analysis. Although these coefficients indicate importance,
they are not appropriate for assessing the relative importance or discriminatory
power of the variables, i.e., the proportion of total discriminating
power attributable to a specific variable. Relative importance of the independent
variables entered in the predictive function is defined in part by:
$$I_p = |\lambda_p(\bar{X}_{p1} - \bar{X}_{p2})| \qquad (4)$$
where:
$I_p$ = the importance of the pth variable
$\lambda_p$ = the unstandardized discriminant coefficient for the pth variable
$\bar{X}_{pt}$ = the mean of the pth variable for the tth group (Mosteller and
Wallace 1963).
To convert $I_p$, the importance measure for the pth variable, into a relative
importance score, $I_p$ must be expressed in terms of the sum of the importance
values of all variables. The relative importance of the pth variable,
$R_p$, is expressed for the insurance purchasers as (Awh and Waters 1974):
$$R_p = \frac{I_p}{\sum_{p} I_p} \qquad (5)$$
These $R_p$ values computed for the purchasers of the accident insurance product
are:
| Function 1 | | | | Function 2 | | | |
| mean p1 | Lambda p | Ip | Rp | mean p2 | Lambda p | Ip | Rp |
| 1.79 | 10.26 | 4.53 | .53 | 1.35 | 7.87 | 3.47 | .53 |
| 1.76 | 8.84 | 2.49 | .29 | 1.48 | 7.33 | 2.07 | .31 |
| 1.97 | 2.69 | 1.53 | .18 | 1.40 | 1.87 | 1.07 | .16 |
| | | Sum=8.5 | | | | Sum=6.6 | |
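The importance and relative importance values in the table can be reproduced, to rounding, with a few lines of arithmetic. The sketch below assumes that "mean p1" and "mean p2" are the two group means of each variable and that importance is the absolute product of a coefficient and the difference in group means; that reading is an interpretation of the table, and the function name is hypothetical.

```python
# Values for the three variables, taken from the table above.
mean_group1 = [1.79, 1.76, 1.97]
mean_group2 = [1.35, 1.48, 1.40]
coef_function1 = [10.26, 8.84, 2.69]
coef_function2 = [7.87, 7.33, 1.87]

def relative_importance(coefs, means_a, means_b):
    """Return (I_p list, R_p list) for one discriminant function."""
    # Assumed form: importance = |coefficient * difference in group means|.
    importance = [abs(c * (a - b)) for c, a, b in zip(coefs, means_a, means_b)]
    total = sum(importance)
    return importance, [i / total for i in importance]

imp1, rel1 = relative_importance(coef_function1, mean_group1, mean_group2)
imp2, rel2 = relative_importance(coef_function2, mean_group1, mean_group2)
print([round(r, 2) for r in rel1])
print([round(r, 2) for r in rel2])
```

Small differences between the computed importance values and the tabled ones (for example 4.51 versus 4.53) are consistent with the group means having been rounded to two decimals before publication.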
Once the meaning of the prediction function is clear, the predictive
function can be used to classify observations of either known or unknown
group membership.
CLASSIFICATION OF OBSERVATIONS FROM INITIAL AND NEW DATA SETS
Discriminant analysis conducted for predictive purposes uses an initial
data set having known group membership to both derive the discriminant
function and predict group classification. This classification of observations
is but an extension of the predictive discriminant analysis in that the
predictive discriminant scores, $D_{it}$, form the basis of the decision rule
used to classify this same set of objects into the t groups.
In contrast to the classification of the initial data set, where group
membership is known, the same decision rule may be applied to other sets
of data. However, when we classify data sets other than the initial set
from which the predictive analysis was conducted, we are no longer engaged
in predictive discriminant analysis, but rather in discriminant classification
analysis. It is critical that this distinction be clear. Predictive discriminant
analysis requires no validation procedures, since derivation
of an optimal discriminant function is the only relevant issue. However,
if fresh sets of data with either known or unknown grouping are classified,
then the discriminant function must be validated to be generalizable to
these data sets. The following discussion of the methodology for classification
and for extending the classification analysis applies equally well to both
predictive and classification analyses in that classification methodology
is the same in both cases.
The predictive analysis explained above demonstrated the source of the
derived Discriminant scores, Dt, that are used to classify observations.
To demonstrate the classification procedure, we must first recognize that
the two p dimensional populations of our example are described by the discriminant
function, Dt , where
Values $D_{it}$ are computed for each of the i observations so as to form the
t distributions of values in a dimensional space that have sample
means or centroids designated as $\bar{x}_1$ and $\bar{x}_2$. For the example problem,
the classification analysis determines whether observation i belongs
to population one or two. Using the midpoint between the two groups, defined
by C, the correct classification for $D_{it}$ may be determined by selecting
the appropriate decision alternative:
Classify observation i as coming from population one if
Otherwise, classify i as population two. Alternatively, the classification
rule may be defined as:

where no correction for the midpoint is made. In this case, the decision
criterion is to compute the value $D_{it}$ for each of the t functions
and classify the observation into the group that has the largest discriminant
score D. Computational form (6) is the basis for most algorithms found
in the standard statistical packages.
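The two decision rules described in words above can be sketched as follows. This is a hedged illustration: it assumes C is the midpoint of the two group centroids on the discriminant axis, sends ties to population two, and uses hypothetical function names and scores rather than the exact computational forms of the original equations.

```python
import numpy as np

def classify_by_midpoint(scores, centroid_one, centroid_two):
    """Two-group rule: assign a score to the population on whose side of C it falls."""
    C = (centroid_one + centroid_two) / 2.0  # assumed midpoint of the two centroids
    scores = np.asarray(scores, dtype=float)
    # Same sign as (centroid_one - C) means the score lies on population one's side.
    on_side_of_one = (scores - C) * (centroid_one - C) > 0
    return np.where(on_side_of_one, 1, 2)

def classify_by_largest_score(score_matrix):
    """General rule: one column of scores per group; pick the largest score."""
    return np.argmax(np.asarray(score_matrix), axis=1) + 1

# Hypothetical scores and centroids for illustration.
print(classify_by_midpoint([0.2, 2.5, 1.9], centroid_one=3.0, centroid_two=1.0))
print(classify_by_largest_score([[1.0, 0.5], [0.2, 0.9]]))
```

The sign comparison in the midpoint rule avoids having to assume which population has the larger centroid.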
The classification rules described above are commonly used in both predictive
and classification analyses, when group membership is known, to develop a
t × t matrix designated a confusion matrix. Although this
confusion matrix shows the frequency of correct and incorrect classification
resulting from the decision rule, it has not been subject to the further
analysis necessary to test for the presence of specific relationships or
even overall significance. (Note that if a classification analysis with
unknown grouping of objects is run, then a confusion matrix cannot be constructed,
thus showing the critical nature of the validation analysis.)
Confusion Matrix Analysis
The computation of the confusion matrix has traditionally ended the
discriminant analysis procedure. However, the confusion matrix, when viewed
as a contingency table, is subject to a variety of analyses that may be
directed toward unanswered questions. Specifically, given the level of
observed correct classification:
1. What level of overall classification is expected from chance alone,
and is this classification significantly different from observed classification?
(An analysis of the aggregate confusion matrix)
2. Which groups are best classified by the Discriminant function, and
is each respective group classified significantly better than expected
by chance alone? (Analysis of individual rows of the confusion matrix)
3. Within each group, does the proportion of subjects correctly classified
or misclassified differ significantly from chance? (Analysis of individual
cells of the confusion matrix)
Analysis Level I: The Aggregate Confusion Matrix
The confusion matrix derived from the analysis of the accident insurance
purchasers was evaluated with respect to the above stated questions.
Figure 1
Confusion Matrix for Accident Insurance Purchasers
| Frequency, Row %, Chi-Square Contrib. | Predicted Purchase | Predicted Non-Purchase | Row Total, Row Percentage |
| Actual Purchase | n=22, 66.7, 15.41 | n=11, 33.3, 6.46 | n=33, 11.1 |
| Actual Non-Purchase | n=66, 24.9, 1.92 | n=199, 75.1, .80 | n=265, 88.9 |
| Column Totals, Column Percentage | 88, 29.5 | 210, 70.5 | 298, 100 |
Percent of Cases Correctly Classified = 221 / 298 = 74.16%
Chi-Square = 24.599, df = 1, Significance < .001
Overall correct classification was observed in 74.16% of all subjects
surveyed. This observed classification was found to be significant at the
.001 level ($\chi^2$ = 24.59, df = 1; Q = 69.58, df = 1). Thus, observed
classification is significantly different from expected chance classification.
In addition to testing for overall significance of a single confusion
matrix, tests may be used to differentiate alternative discriminant models
defining the same population. Operationally, this is done by selecting
the function with the largest Q statistic, since this identifies the function
with the greatest discriminating ability.
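The aggregate statistics can be checked directly from the cell frequencies in Figure 1. The sketch below computes the contingency-table chi-square and its cell contributions, and it assumes that Q is Press's Q statistic, (N - nK)^2 / (N(K - 1)), where n is the number correctly classified and K the number of groups; that assumption reproduces the reported value of 69.58.

```python
import numpy as np

# Observed frequencies from Figure 1: rows are actual groups, columns predicted groups.
observed = np.array([[22, 11],
                     [66, 199]], dtype=float)
n_total = observed.sum()

# Expected cell counts under independence of actual and predicted membership.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n_total
chi_contrib = (observed - expected) ** 2 / expected
chi_square = chi_contrib.sum()

# Assumption: Q is Press's Q, (N - n*K)^2 / (N*(K - 1)).
n_correct = np.trace(observed)
k_groups = observed.shape[0]
press_q = (n_total - n_correct * k_groups) ** 2 / (n_total * (k_groups - 1))

print(np.round(chi_contrib, 2), round(chi_square, 2), round(press_q, 2))
```

When comparing alternative discriminant models on the same sample, the same computation would be repeated for each model's confusion matrix and the Q values compared.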
Analysis Level II: Tests of Group Differences
Morrison (1969) considered the question of how well variables discriminate
by formulating a likelihood ratio to estimate chance classification. This
estimate of chance classification is the basis for further tests of specific
relations critical to a rigorous analysis. However, expected classification,
or tests involving expected classification of specific groups, are rarely
reported in the literature.
Morrison's likelihood analysis provides a criterion that may be used
to compare the proportion of correctly classified observations with the
proportion expected by chance. This proportion, designated the proportional
chance criterion, or $C_{pro}$ (Morrison 1969), is expressed as:
$$C_{pro} = p\alpha + (1 - p)(1 - \alpha) = (.111)(.295) + (.889)(.705) = .6594$$
where,
$\alpha$ = the proportion of customers in the sample categorized as purchasers
p = the true proportion of purchasers in the sample
$(1-\alpha)$ = the proportion of the sample classified as non-purchasers
$(1-p)$ = the true proportion of non-purchasers in the sample
This likelihood analysis states that 65.94% of the overall sample is
expected to receive correct classification by chance alone. The proportional
chance criterion, Cpro, has been used mainly as a point of reference for
subjective evaluation (Morrison 1969), rather than the basis of a statistical
test to determine if the expected proportion differs from the observed
proportion that is correctly classified. Notable exceptions are found in
Albaum, Best, and Hawkins (1975), and Smith (1979).
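As a quick check on the arithmetic, the proportional chance criterion can be recomputed from the marginal totals of the confusion matrix. The variable names below are illustrative; the counts are the row and column totals for purchasers in Figure 1.

```python
n_total = 298
true_purchasers = 33        # row total: actual purchasers
classified_purchasers = 88  # column total: observations classified as purchasers

p = true_purchasers / n_total            # true proportion of purchasers
alpha = classified_purchasers / n_total  # proportion classified as purchasers

# Proportional chance criterion: chance agreement on purchasers plus non-purchasers.
c_pro = p * alpha + (1 - p) * (1 - alpha)
print(round(c_pro, 4))
```

Using the unrounded proportions gives .6594; using the rounded values .111 and .295 gives a figure that differs only in the fourth decimal place.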
This relationship between chance and observed proportions can be tested
using a Z statistic of the form:
$$Z = \frac{P_{cc} - C_{pro}}{\sqrt{C_{pro}(1 - C_{pro})/N}}$$
where,
$P_{cc}$ is the proportion of observations correctly classified,
$C_{pro} = p\alpha + (1-p)(1-\alpha)$, and N is the number of observations classified.
Thus for the example problem, the difference between expected and actual
overall correct classification is significant at the .01 level.
This overall test of significance suggests that further analysis should
be conducted to determine the source of the divergence from chance expectations.
Divergence may be present in any of the confusion matrix cells (i.e.,
purchasers or non-purchasers, that are either correctly or incorrectly
categorized), and thus each may be tested to determine whether its proportion
differs from chance.
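A minimal sketch of the overall test follows, assuming the usual one-sample test of a proportion with the chance proportion in the standard error and the full sample of 298 as n. The function name is hypothetical, and the form of the statistic is an assumption rather than a quotation of the original computation.

```python
import math

def proportion_z(observed_prop, chance_prop, n):
    """One-sample z statistic for an observed proportion against a chance proportion."""
    standard_error = math.sqrt(chance_prop * (1 - chance_prop) / n)
    return (observed_prop - chance_prop) / standard_error

# 221 of 298 observations correctly classified, against Cpro = .6594.
z_overall = proportion_z(221 / 298, 0.6594, 298)
print(round(z_overall, 2))
```

Under these assumptions the statistic comes out near 3.0, which exceeds the two-tailed .01 critical value of 2.576 and so agrees with the significance level stated above.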
Analysis Level III: Classification and Misclassification Within Groups
The analysis to determine the source of deviation is conducted using the
maximum chance criterion, designated Cmax (Morrison 1969). Cmax is the
minimum expected correct classification for a selected group of interest.
The computation of $C_{max}$ is based on the assumption that all observations
are categorized as coming from that group; e.g., given that all 298 purchasers
and non-purchasers were classified as purchasers, then the maximum
correct classification, $C_{max}$, would be expressed:
$$C_{max} = \frac{\text{Total Purchasers}}{\text{Total Customers}} = \frac{33}{298} = .111$$
Because we are interested in the correct classification of insurance
purchasers, the test of classification involves asking if the 66.67% correct
insurance purchaser classification differs significantly from the 11.1%
maximum expected chance classification. A Z statistic is used to test this
relationship as shown for the example analysis.
* Significant at the .001 level.
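The group-level comparison can be sketched in the same way. The snippet below computes Cmax from the marginal totals and a Z statistic for the purchaser row, assuming a one-sample proportion test whose standard error uses the purchaser group size (33) as n. That choice of n is an assumption made for illustration and may differ from the computation originally reported.

```python
import math

n_purchasers = 33
n_total = 298
c_max = n_purchasers / n_total          # 33 / 298, about .111
observed_hit_rate = 22 / n_purchasers   # 66.67% of purchasers correctly classified

# Assumption: the standard error uses the purchaser group size (33) as n.
standard_error = math.sqrt(c_max * (1 - c_max) / n_purchasers)
z_purchasers = (observed_hit_rate - c_max) / standard_error
print(round(c_max, 3), round(z_purchasers, 2))
```

Under these assumptions the statistic is far beyond the .001 critical value, in line with the significance note above.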
This test may be conducted for the other cells in the confusion matrix:

Thus cell $Z_{11}$ shows that observed classification is significantly greater
than is expected to occur by chance classification alone. The analysis
of cells (1,2) and (2,1) shows that observed and expected misclassification
results differ in that purchasers are misclassified into cell (1,2) less
often than expected by chance, and non-purchasers are misclassified into
cell (2,1) more often than expected by chance. Thus the discriminant functions
appear to shift the classification of subjects toward the purchaser categories,
as demonstrated by significantly greater than expected classification in
the upper and left portions of the confusion matrix.
SUMMARY
Two objectives have been fulfilled by this paper. The first objective of
the paper was to show that differences in the application and the requirements
for discriminant analysis exist. These are often misinterpreted, especially
with respect to the validation of the predictive Discriminant analysis.
These differences are summarized as follows.
Stages of Analysis
| | Predictive Discriminant Analysis | Classification Analysis of Initial Data Set of Known Groupings | Classification Analysis of New Data Set of Known Groupings | Classification Analysis of New Data Set of Unknown Groupings |
| Purpose | Derive discriminant function using initial data set; no classification involved | Determine how well discriminant function classifies (biased) | 1) Classify data using classification rule derived from predictive function; 2) May be part of validation analysis of initial predictive function | 1) Classify data using classification rule derived from predictive function; 2) May be part of validation analysis of initial predictive function |
| Requirements | Assumptions of linear discriminant model; no validation required | No validation required | Validation required | Initial predictive function must have been previously validated |
The second objective of this paper has been to demonstrate the increased
rigor in discriminant analysis that can be achieved if classification
of data sets of known groupings is implemented. The use of these techniques
will enhance both the analysis and interpretation of the classification
analysis, particularly when the predictive function is being validated
as a tool for classification.
The clarification of the alternative uses of discriminant analysis,
along with the possibility of increased rigor, will greatly enhance both
the analysis and interpretation of empirical and managerial problems.
REFERENCES
Albaum, G., R. Best, and D. Hawkins (1975), "Applying Discriminant
Analysis to Unipolar Semantic Scaling Data," American Institute of
Decision Sciences Western Meetings.
Albaum, G., and D. Hawkins (1979), "Differences between Consumers
of Variable-Rate and Fixed-Rate Residential Mortgages," in Proceedings
of the Association for Consumer Research, J.C. Olson et al.,
eds., San Francisco, California.
Awh, R.Y., and D. Waters (1974), "A Discriminant Analysis of Economic,
Demographic, and Attitudinal Characteristics of Bank Charge-Card Holders:
A Case Study," The Journal of Finance 29, pp. 973-980.
Crask, M.R., and W.D. Perreault, Jr. (1977), "Validation of Discriminant
Analysis in Marketing Research," Journal of Marketing Research
14 (February), pp. 60-68.
Dillon, W.R. (1979), "The Performance of the Linear Discriminant Function
in Non-Optimal Situations and the Estimation of Classification Error Rates:
A Review of Recent Findings," Journal of Marketing Research 16
(August), pp. 370-391.
Dillon, W.R., and M. Goldstein (1978), "On the Performance of Some
Multinomial Classification Rules," Journal of the American Statistical
Association 73 (June), pp. 305-313.
Dillon, W.R., and L. Schiffman (1978), "Appropriateness of Linear Discriminant
and Multinomial Classification Analysis in Marketing Research," Journal
of Marketing Research 15 (February), pp. 103-112.
Evans, F.B. (1959), "Psychological and Objective Factors in the Prediction
of Brand Choice: Ford vs. Chevrolet," Journal of Business 32
(October), pp. 340-369.
Fisher, L., and J.W. Van Ness (1973), "Admissible Discriminant Analysis,"
Journal of the American Statistical Association 68, pp. 603-607.
Frank, R.E., W.F. Massy, and D.G. Morrison (1965), "Bias in Multiple
Discriminant Analysis," Journal of Marketing Research 2 (August),
pp. 250-258.
Goldstein, M., and T. Rabinowitz (1975), "Selection of Variates
for the Two Group Classification Problem," Journal of the American
Statistical Association 70, pp. 776-781.
Hills, M. (1966), "Allocation Rules and their Error Rates,"
Journal of the Royal Statistical Society B28, p. 1.
Krishnaswami, P., and R. Nath (1968), "Bias in Multinomial Classification,"
Journal of the American Statistical Association 63, pp. 298-303.
Krzanowski, W.J. (1975), "Discrimination and Classification Using
Both Binary and Continuous Variables," Journal of the American
Statistical Association 70, pp. 782-790.
Lachenbruch, P.A. (1967), "An Almost Unbiased Method of Obtaining
Confidence Intervals for the Probability of Misclassification in Discriminant
Analysis," Biometrics 23, pp. 639-645.
Lachenbruch, P.A., and M.R. Mickey (1968), "Estimation of Error Rates
in Discriminant Analysis," Technometrics 10, pp. 1-11.
McLachlan, G.J. (1974), "Estimation of the Errors of Misclassification
on the Criterion of Asymptotic Mean Square Error," Technometrics
16 (May), pp. 255-256.
McLachlan, G.J. (1977), "Estimating the Linear Discriminant Function
from Initial Samples Containing a Small Number of Unclassified Observations,"
Journal of the American Statistical Association 72, pp. 403-406.
Morrison, D.G. (1969), "On Interpretation in Discriminant Analysis,"
Journal of Marketing Research 6 (May), pp. 156-163.
Mosteller, F., and J.W. Tukey (1968), "Data Analysis, Including Statistics,"
in The Handbook of Social Psychology, Vol. 2, G. Lindzey and E.
Aronson, eds., Reading, MA: Addison-Wesley, pp. 80-203.
Mosteller, F., and D.L. Wallace (1963), "Inference in an Authorship
Problem," Journal of the American Statistical Association 58
(June), pp. 275-309.
Randles, R.H., J.D. Broffitt, J.S. Ramberg, and R.V. Hogg (1978), "Discriminant
Analysis Based on Ranks," Journal of the American Statistical
Association 73, pp. 379-384.
Smith, S.M. (1979), "Product Aggregation as a Mediating Variable
in the Segmentation of Consumer and Geographic Markets," Unpublished
Doctoral Dissertation, Pennsylvania State University, University Park,
PA.
Urbakh, V.U. (1971), "Linear Discriminant Analysis: Loss of Discriminating
Power when a Variate is Omitted," Biometrics 27, pp. 531-534.