psychological testing
Use of tests to measure skill, knowledge, intelligence, capacities, or aptitudes and to make predictions about performance. Best known is the IQ test; other tests include achievement tests (designed to evaluate a student's grade or performance level) and personality tests. The latter include both inventory-type (question-and-response) tests and projective tests such as the Rorschach (inkblot) and thematic apperception (picture-theme) tests, which are used by clinical psychologists and psychiatrists to help diagnose mental disorders and by psychotherapists and counselors to help assess their clients. Experimental psychologists routinely devise tests to obtain data on perception, learning, and motivation. Clinical neuropsychologists often use tests to assess the cognitive functioning of people with brain injuries. See also experimental psychology; psychometrics.
* * *

Introduction

Also called psychometrics, psychological testing is the systematic use of tests to quantify psychophysical behaviour, abilities, and problems and to make predictions about psychological performance.

The word “test” refers to any means (often formally contrived) used to elicit responses to which human behaviour in other contexts can be related. When intended to predict relatively distant future behaviour (e.g., success in school), such a device is called an aptitude test. When used to evaluate the individual's present academic or vocational skill, it may be called an achievement test. In such settings as guidance offices, mental-health clinics, and psychiatric hospitals, tests of ability and personality may be helpful in the diagnosis and detection of troublesome behaviour. Industry and government alike have been prodigious users of tests for selecting workers. Research workers often rely on tests to translate theoretical concepts (e.g., intelligence) into experimentally useful measures.

General problems of measurement in psychology

Physical things are perceived through their properties or attributes. A mother may directly sense the property called temperature by feeling her infant's forehead. Yet she cannot directly observe colicky feelings or share the infant's personal experience of hunger. She must infer such unobservable private sensations from hearing her baby cry or gurgle and from seeing him flail his arms, frown, or smile. In the same way, much of what is called measurement must be made by inference. Thus, a mother suspecting her child is feverish may use a thermometer, in which case she ascertains his temperature by reading the instrument rather than by directly touching his head.

Indeed, measurement by inference is particularly characteristic of psychology. Such abstract properties or attributes as intelligence or introversion never are directly measured but must be inferred from observable behaviour. The inference may be fairly direct or quite indirect.
If persons respond intelligently (e.g., by reasoning correctly) on an ability test, it can safely be inferred that they possess intelligence to some degree. In contrast, people's capacity to make associations or connections, especially unusual ones, between things or ideas presented in a test can be used as the basis for inferring creativity, although producing a creative product requires other attributes as well, including motivation, opportunity, and technical skill.

Types of measurement scales

To measure any property or activity is to assign it a unique position along a numerical scale. When numbers are used merely to identify individuals or classes (as on the backs of athletes on a football team), they constitute a nominal scale. When a set of numbers reflects only the relative order of things (e.g., the pleasantness-unpleasantness of odours), it constitutes an ordinal scale. An interval scale has equal units and an arbitrarily assigned zero point; one such scale, for example, is the Fahrenheit temperature scale. Ratio scales not only provide equal units but also have absolute zero points; examples include measures of weight and distance.

Although there have been ingenious attempts to establish psychological scales with absolute zero points, psychologists usually are content with approximations to interval scales; ordinal scales often are used as well.

Primary characteristics of methods or instruments

The primary requirement of a test is validity, traditionally defined as the degree to which a test actually measures whatever it purports to measure. A test is reliable to the extent that it measures consistently, but reliability is of no consequence if a test lacks validity. Since the person who draws inferences from a test must determine how well it serves his purposes, the estimation of validity inescapably requires judgment.
Depending on the criteria of judgment employed, tests exhibit a number of different kinds of validity.

Empirical validity (also called statistical or predictive validity) describes how closely scores on a test correspond (correlate) with behaviour as measured in other contexts. Students' scores on a test of academic aptitude, for example, may be compared with their school grades (a commonly used criterion). To the degree that the two measures statistically correspond, the test empirically predicts the criterion of performance in school. Predictive validity has its most important application in aptitude testing (e.g., in screening applicants for work, in academic placement, in assigning military personnel to different duties).

Alternatively, a test may be inspected simply to see whether its content seems appropriate to its intended purpose. Such content validation is widely employed in measuring academic achievement, but with recognition of the inevitable role of judgment. Thus, a geometry test exhibits content (or curricular) validity when experts (e.g., teachers) believe that it adequately samples the school curriculum for that topic. Interpreted broadly, content covers desired skills (such as computational ability) as well as points of information in the case of achievement tests. Face validity (a crude kind of content validity) reflects the acceptability of a test to such people as students, parents, employers, and government officials. A test that looks valid is desirable, but face validity without some more basic validity is nothing more than window dressing.

In personality assessment, judgments of test content tend to be especially untrustworthy, and dependable external criteria are rare. One may, for example, assume that a man who perspires excessively feels anxious. Yet his feelings of anxiety, if any, are not directly observable. Any assumed trait (anxiety, for example) that is held to underlie observable behaviour is called a construct.
Since the construct itself is not directly measurable, the adequacy of any test as a measure of anxiety can be gauged only indirectly, e.g., through evidence for its construct validity.

A test exhibits construct validity when low scorers and high scorers are found to respond differently to everyday experiences or to experimental procedures. A test presumed to measure anxiety, for example, would give evidence of construct validity if those with high scores (“high anxiety”) can be shown to learn less efficiently than do those with lower scores. The rationale is that several propositions are associated with the concept of anxiety: anxious people are likely to learn less efficiently, especially if uncertain about their capacity to learn; they are likely to overlook things they should attend to in carrying out a task; and they are apt to be under strain and hence feel fatigued. (But anxious people may be young or old, intelligent or unintelligent.) If people with high scores on a test of anxiety show such proposed signs of anxiety (that is, if the test has the expected relationships with other measurements as given in these propositions), the test is viewed as having construct validity.

Test reliability is affected by scoring accuracy, adequacy of content sampling, and the stability of the trait being measured. Scorer reliability refers to the consistency with which different people who score the same test agree. For a test with a definite answer key, scorer reliability is of negligible concern. When the subject responds in his own words, handwriting, and organization of subject matter, however, the preconceptions of different raters produce different scores for the same test from one rater to another; that is, the test shows scorer (or rater) unreliability. In the absence of an objective scoring key, a scorer's evaluation may differ from one time to another and from those of equally respected evaluators.
Other things being equal, tests that permit objective scoring are preferred.

Reliability also depends on the representativeness with which a test samples the content to be tested. If scores on items that sample a particular universe of content designed to be reasonably homogeneous (e.g., vocabulary) correlate highly with those on another set of items selected from the same universe, the test has high content reliability. But if the universe of content is highly diverse, in that it samples different factors (say, verbal reasoning and facility with numbers), the test may have high content reliability but low internal consistency.

For most purposes, the performance of a subject on the same test from day to day should be consistent. When such scores do tend to remain stable over time, the test exhibits temporal reliability. Fluctuations of scores may arise from instability of a trait; the test taker, for example, may be happier one day than the next. Or temporal unreliability may reflect injudicious test construction.

Included among the major methods through which test reliability is estimated is the comparable-forms technique, in which the scores of a group of people on one form of a test are compared with the scores they earn on another form. Theoretically, the comparable-forms approach may reflect scorer, content, and temporal reliability. This ideally demands that each form of the test be constructed by different but equally competent persons, that the forms be given at different times, and that they be evaluated by a second rater (unless an objective key is fixed).

In the test-retest method, scores of the same group of people from two administrations of the same test are correlated. If the time interval between administrations is too short, memory may unduly enhance the correlation. Some people, for example, may look up words they missed on the first administration of a vocabulary test and thus be able to raise their scores the second time around.
Too long an interval can produce different effects for different people because of differing rates of forgetting or learning. Except for very easy speed tests (e.g., those in which a person's score depends on how quickly he can do simple addition), this method may give misleading estimates of reliability.

Internal-consistency methods of estimating reliability require only one administration of a single form of a test. One method entails obtaining scores on separate halves of the test, usually the odd-numbered and the even-numbered items. The degree of correspondence between scores on these half-tests (expressed numerically as a correlation coefficient) permits estimation of the reliability of the test at full length by means of a statistical correction, computed with the Spearman-Brown prophecy formula (which estimates the increase in reliability expected to result from an increase in test length). More commonly used is a generalization of this stepped-up, split-half reliability estimate, one of the Kuder-Richardson formulas; it provides the average of the estimates that would result from all possible ways of dividing a test into halves.

Other characteristics

A test that takes too long to administer is useless for most routine applications. What constitutes a reasonable period of testing time, however, depends in part on the decisions to be made from the test. Each test should be accompanied by a practicable and economically feasible scoring scheme, one scorable by machine or by quickly trained personnel being preferred.

A large, controversial literature has developed around response sets, i.e., tendencies of subjects to respond systematically to items regardless of content.
Thus, a given test taker may tend to answer questions on a personality test only in socially desirable ways, to select the first alternative of each set of multiple-choice answers, or to malinger (i.e., to give wrong answers on purpose).

Response sets stem from the ways subjects perceive and cope with the testing situation. If they are tested unwillingly, they may respond carelessly and hastily to get through the test quickly. If they have trouble deciding how to answer an item, they may guess or, in a self-descriptive inventory, choose the “yes” alternative or the socially desirable one. They may even mentally reword the question to make it easier to answer. The quality of test scores is impaired when the purposes of the test administrator and the reactions of the subjects to being tested are not in harmony. Modern test construction seeks to reduce the undesired effects of such reactions.

Types of instruments and methods

Psychophysical scales and psychometric, or psychological, scales

The concept of an absolute threshold (the lowest intensity at which a sensory stimulus, such as a sound wave, is perceived) is traceable to the German philosopher Johann Friedrich Herbart. The German physiologist Ernst Heinrich Weber later observed that the smallest discernible difference in intensity is proportional to the initial stimulus intensity. Weber found, for example, that, while people could just notice the difference after a slight change in the weight of a 10-gram object, they needed a larger change before they could detect a difference from a 100-gram weight.
This finding, known as Weber's law, states that the just-noticeable difference in a stimulus is a constant proportion of its intensity; from it Gustav Fechner later derived the related law that perceived (subjective) intensity varies as the logarithm of the physical (objective) intensity of the stimulus.

In traditional psychophysical scaling methods, a set of standard stimuli (such as weights) that can be ordered according to some physical property is related to sensory judgments made by experimental subjects. By the method of average error, for example, subjects are given a standard stimulus and then asked to adjust a variable stimulus until they believe it is equal to the standard. The mean (average) of a number of such judgments is obtained. This method and its many variations have been used to study such experiences as visual illusions, tactual intensities, and auditory pitch.

Psychological (psychometric) scaling methods are an outgrowth of the psychophysical tradition just described. Although their purpose is to locate stimuli on a linear (straight-line) scale, no quantitative physical values (e.g., loudness or weight) for the stimuli are involved. The linear scale may represent an individual's attitude toward a social institution, his judgment of the quality of an artistic product, the degree to which he exhibits a personality characteristic, or his preference for different foods. Psychological scales thus are used for having a person rate his own characteristics, as well as those of other individuals, in terms of such attributes as leadership potential or initiative. In addition to locating individuals on a scale, psychological scaling can also be used to scale objects and various kinds of characteristics: finding where different foods fall on a group's preference scale, for example, or determining the relative positions of various job characteristics in the view of those holding the job. Reported degrees of similarity between pairs of objects are used to identify the scales or dimensions on which people perceive the objects.

The American psychologist L.L. Thurstone offered a number of theoretical-statistical contributions that are widely used as rationales for constructing psychometric scales. One scaling technique (comparative judgment) is based empirically on choices made by people between the members of any series of paired stimuli. Statistical treatment to provide numerical estimates of the subjective (perceived) distances between the members of every pair of stimuli yields a psychometric scale. Whether these computed scale values are consistent with the observed comparative judgments can be tested empirically.

Another of Thurstone's psychometric scaling techniques (equal-appearing intervals) has been widely used in attitude measurement. In this method judges sort statements (reflecting, for example, varying degrees of emotional intensity) into what they perceive to be equally spaced categories; the average (median) category assignments are used to define scale values numerically. Subsequent users of such a scale are scored according to the average scale values of the statements to which they subscribe. Another psychologist, Louis Guttman, developed a method that requires no prior group of judges, depends on intensive analysis of the scale items, and yields comparable results. Quite commonly used is the type of scale developed by Rensis Likert, in which perhaps five choices ranging from strongly in favour to strongly opposed are provided for each statement, the alternatives being scored from one to five. A more general technique (successive intervals) does not depend on the assumption that judges perceive interval sizes accurately. The widely used graphic rating scale presents an arbitrary continuum with preassigned guides for the rater (e.g., adjectives such as superior, average, and inferior).

Tests versus inventories

The term “test” most frequently refers to devices for measuring abilities or qualities for which there are authoritative right and wrong answers.
Such a test may be contrasted with a personality inventory, for which it is often claimed that there are no right or wrong answers. At any rate, in taking what often is called a test, subjects are instructed to do their best; in completing an inventory, they are instructed to represent their typical reactions. A distinction also has been made that in responding to an inventory the subjects control the appraisal, whereas in a test they do not. If a test is more broadly regarded as a set of stimulus situations that elicit responses from which inferences can be drawn, however, then an inventory is, by this definition, a variety of test.

Free-response versus limited-response tests

Free-response tests entail few restraints on the form or content of response, whereas limited-response tests restrict the response to one of a small number presented (e.g., true-false). An essay test tends toward one extreme (free response), while a so-called fully objective test lies at the other (limited response).

Response to an essay question is not completely unlimited, however, since the answer should bear on the question. The free-response test does give practice in writing, and, when an evaluator is proficient in judging written expression, his comments on the test may help the individual improve his writing style. All too often, however, writing ability affects the evaluator's judgment of how well the test taker understands the content, and this tends to reduce test reliability. Another source of unreliability in essay tests is their limited sampling of content, as contrasted with the broader coverage that is possible with objective tests.
Often both the scorer reliability and the content reliability of essay tests can be improved, but such attempts are costly.

The objective test, which minimizes scorer unreliability, is best typified by the multiple-choice form, in which the subject is required to select one of two or (preferably) more responses to a test item. Matching items that share a common set of alternatives are of this form. The true-false question is a special multiple-choice form that may tend to arouse antagonism because of variable standards of truth or falsity. The more general multiple-choice item is more acceptable when it is specified only that the best answer be selected; it is flexible, has high scorer reliability, and is not limited to simple factual knowledge. The ingenious test constructor can use multiple-choice items to test such functions as generalization, the application of principles, and the ability to educe unfamiliar relationships.

Some personality tests are presented in a forced-choice format. They may, for example, force the person to choose one of two favourable words or phrases (e.g., intelligent-handsome) as more descriptive of himself or one of two unfavourable terms (e.g., stupid-ugly) as less descriptive. Marking one choice yields a gain in score on some trait but may also preclude credit on another. This technique is intended to eliminate any effects of subjects' attempts to present themselves in a socially desirable light; it is not fully successful, however, because what is highly desirable for one person may be less desirable for another.

The forced-choice technique for self-appraisal is exemplified in a widely used interest inventory. Forced-choice ratings were introduced for the evaluation of one military officer by another during World War II, in an effort to avoid the preponderance of high ratings typically obtained with ordinary rating scales. Raters tend to give those being rated the benefit of any doubt, especially when they are fellow workers.
Also, supervisors or teachers may give unduly favourable ratings because they believe that the good performance of subordinates or students reflects well on themselves.

Falling between free- and limited-response tests is a type that requires a short answer, perhaps a single word or number, for each item. When the required response is to fit into a blank in a sentence, the test is called a completion test. This type of test is susceptible to scorer unreliability.

A personality test to which a subject responds by interpreting a picture or by telling a story it suggests resembles an essay test, except that the responses ordinarily are oral. A personality inventory that requires the subject to indicate whether or not a descriptive phrase applies to him is of the limited-response type. A sentence-completion personality test that asks the subject to complete statements such as “I worry because . . . ” is akin to the short-answer and completion types.

Verbal versus performance tests

A verbal (or symbol) test poses questions to which the subject supplies symbolic answers (in words or in other symbols, such as numbers). In a performance test, the subject actually executes some motor activity; for example, he assembles mechanical objects. Either the quality of the performance as it takes place or its results may be rated.

The verbal test, permitting group administration, requiring no special equipment, and often being scorable by relatively unskilled evaluators, tends to be more practical than the performance test. Both types of device have counterparts in personality measurement, in which verbal tests as well as behaviour ratings are used.

The oral test is administered to one person at a time, but written tests can be given simultaneously to a number of subjects.
Oral tests of achievement, being uneconomical and prone to content and scorer unreliability, have been supplanted by written tests; notable exceptions include the testing of illiterates and the anachronistic oral examinations to which candidates for graduate degrees are liable.

Proponents of individually administered intelligence tests (e.g., the Stanford-Binet) hold that such face-to-face testing optimizes rapport and motivation, even among literate adult subjects. Oral tests of general aptitude remain popular, though numerous written group tests have been designed for the same purpose.

The interview may provide a personality measurement and, especially when it is standardized as to the wording and order of questions and supplied with a key for coding answers, may amount to an individual oral test. Used in public-opinion surveys, such interviews are carefully designed to avoid the effects of interviewer bias and to be comprehensible to a highly heterogeneous sample of respondents.

Appraisal by others versus self-appraisal

In responding to personality inventories and rating scales, a person presumably reveals what he thinks he is like; that is, he appraises himself. Other instruments may reflect what one person thinks of another. Because self-appraisal often lacks objectivity, appraisal by another individual is common in such matters as ratings for promotion. Ordinary tests of ability clearly involve the evaluation of one person by another, although the subject's self-evaluation may intrude; he may, for example, lack confidence to the point where he does not try to do his best.

Projective tests

The stimuli (e.g., inkblots) in a projective test are intentionally made ambiguous and open to different interpretations, in the expectation that each subject will project his own unique (idiosyncratic) reactions into his answers.
Techniques for evaluating such responses range from the intuitive impressions of the rater to complex, coded schemes for scoring and interpretation that require extensive manuals; some projective tests are objectively scorable.

Speed tests versus power tests

A pure speed test is homogeneous in content (e.g., a simple clerical checking test), the tasks being so easy that, given unlimited time, all but the most incompetent subjects could deal with them successfully. The time allowed for testing is so short, however, that even the ablest subject is not expected to finish. A useful score is the number of correct answers made in a fixed time. In contrast, a power test (e.g., a general vocabulary test) contains items that vary in difficulty to the point that no subject is expected to get all of them right even with unlimited time. In practice, a definite but ample time is set for power tests.

Speed tests are suitable for testing visual perception, numerical facility, and other abilities related to vocational success. Tests of psychomotor abilities (e.g., eye–hand coordination) often involve speed. Power tests tend to be more relevant to such purposes as the evaluation of academic achievement, for which the highest level of difficulty at which a person can succeed is of greater interest than his speed on easy tasks. In general, tests reflect unknown combinations of the effects of speed and power; many consist of items that vary considerably in difficulty, with the time allowed too limited for a large proportion of subjects to attempt all the items.

Teacher-made versus standardized tests

A distinction between teacher-made tests and standardized tests is often drawn for tests used to assess academic achievement. Ordinarily, teachers do not attempt to construct tests of general or special aptitude or of personality traits. Teacher-made tests tend instead to be geared to narrow segments of curricular content (e.g., a sixth-grade geography test).
Standardized tests, with carefully defined procedures for administration and scoring to ensure uniformity, can achieve broader goals. General principles of test construction, and such considerations as reliability and validity, apply to both types of test.

Special measurement techniques

Sociodrama and psychodrama were originally developed as psychotherapeutic techniques. In sociodrama, group members participate in unrehearsed drama to illuminate a general problem. Psychodrama centres on one individual in the group, whose unique personal problem provides the theme. Related research techniques (e.g., the sociometric test) can offer insight into interpersonal relationships. Individuals may be asked to specify the members of a group whom they prefer as leader, playmate, or coworker. The choices made can then be charted in a sociogram, from which cliques or socially isolated individuals may be identified at a glance.

Research psychologists have seized upon the sociometric approach as a means of measuring group cohesiveness and studying individual reactions to groups. The degree to which any group member chooses or is chosen beyond chance expectation may be calculated, and mathematical techniques may be used to trace the complex links among group members. Sociogram-choice scores have been useful in predicting such criteria as individual productivity in factory work and combat effectiveness.

Development of standardized tests

Test content

Item development

Once the need for a test has been established, a plan to define its content may be prepared. For achievement tests, the plan may also indicate the thinking skills to be evaluated. Detailed content headings can be immediately suggestive of test items. It is helpful if the plan specifies the weights to be allotted to different topics, as well as the desired average score and the spread of item difficulties.
Whether or not such an outline is made, the test constructor clearly must understand the purpose of the test, the universe of content to be sampled, and the forms of the items to be used.

Tryouts and item analysis

A set of test questions is first administered to a small group of people deemed representative of the population for which the final test is intended. The trial run is planned to provide a check on the instructions for administering and taking the test and on the intended time allowances, and it can also reveal ambiguities in the test content. After adjustments, the surviving items are administered to a larger, ostensibly representative group. The resulting data permit computation of a difficulty index for each item (often taken as the percentage of subjects who respond correctly) and of an item-test or item-subtest discrimination index (e.g., a coefficient of correlation specifying the relationship of each item to the total test score or to a subtest score).

If it is feasible, measures of the relation of each item to independent criteria (e.g., grades earned in school) are obtained to provide item validation. Items that are too easy or too difficult are discarded; those within a desired range of difficulty are retained. If internal consistency is sought, items found to be unrelated to either the total score or an appropriate subtest score are ruled out, and items related to available external criterion measures are identified. Items that show the most efficiency in predicting an external criterion (highest validity) usually are preferred over those that contribute only to internal consistency (reliability).

Estimates of reliability for the entire set of items, as well as for those to be retained, commonly are calculated. If the reliability estimate is deemed too low, items may be added. Each alternative in a multiple-choice item may also be examined statistically.
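The item statistics described above, together with the split-half and Kuder-Richardson reliability estimates discussed earlier, can be sketched briefly in Python. The response matrix below is hypothetical, and the Kuder-Richardson formula shown is the variant usually labelled KR-20.

```python
# Item-analysis sketch with hypothetical data: rows are subjects,
# columns are items, entries are 1 (correct) or 0 (incorrect).
from statistics import mean, pstdev

responses = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
]

n_items = len(responses[0])
totals = [sum(row) for row in responses]

# Difficulty index: proportion of subjects answering each item correctly.
difficulty = [mean(row[i] for row in responses) for i in range(n_items)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Discrimination index: correlation of each item with the total score.
discrimination = [
    pearson([row[i] for row in responses], totals) for i in range(n_items)
]

def spearman_brown(r, k):
    """Spearman-Brown prophecy formula: reliability expected when a
    test with reliability r is lengthened by a factor of k (k = 2
    steps a split-half correlation up to full length)."""
    return k * r / (1 + (k - 1) * r)

# KR-20: an internal-consistency estimate for right/wrong items,
# equivalent to the average of all possible stepped-up split-half
# estimates. Population variance is used here for simplicity.
p = difficulty
q = [1 - pi for pi in p]
var_total = pstdev(totals) ** 2
kr20 = (n_items / (n_items - 1)) * (
    1 - sum(pi * qi for pi, qi in zip(p, q)) / var_total
)
```

In practice these statistics are computed on the larger tryout sample; items with extreme difficulty values or near-zero discrimination indices are the ones flagged for discard or revision.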
Weak incorrect alternatives can be replaced, and those that are unduly attractive to higher scoring subjects can be modified.

Cross validation

Item-selection procedures are subject to chance errors in the sampling of test subjects, so statistical values obtained in pretesting are usually checked (cross validated) with one or more additional samples of subjects. Typically, cross-validation values are found to shrink for many of the items that emerged as best in the original data, and further items may be found to warrant discarding. Measures of the correlation between the total test score and scores on other, better known tests are often sought by test users.

Differential weighting

Some test items may appear to deserve extra, positive weight; some answers in multiple-choice items, though keyed as wrong, seem better than others in that they attract people who earn high scores generally. The bulk of theoretical logic and empirical evidence nonetheless suggests that unit weights for selected items, zero weights for discarded items, and dichotomous (right versus wrong) scoring for multiple-choice items serve almost as effectively as more complicated schemes. Painstaking efforts to weight items generally are not worth the trouble.

Negative weight for wrong answers is usually avoided as an undue complication. In multiple-choice items, the number of answers a subject actually knows, in contrast to the number he gets right (which will include some lucky guesses), can be estimated by formula. But such an average correction overpenalizes the unlucky and underpenalizes the lucky. An instruction not to guess is variously interpreted by persons of different temperament; those who decide to guess despite the ban are often helped by partial knowledge and tend to do better. A responsible tactic is to try to reduce these differences by directing subjects to respond to every question, even if they must guess.
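The formula-based estimate mentioned above is conventionally the number right minus a fraction of the number wrong; a minimal sketch, assuming k-choice items and blind guessing on unknown items:

```python
def corrected_score(num_right, num_wrong, num_choices):
    """Classical correction for guessing: estimate the number of items
    actually known, assuming every wrong answer was a blind guess among
    num_choices equally likely alternatives, so that each wrong answer
    implies roughly 1/(num_choices - 1) lucky guesses among the rights."""
    return num_right - num_wrong / (num_choices - 1)

# A subject with 60 right and 20 wrong on four-choice items is credited
# with 60 - 20/3, or about 53.3 items known. Because this is only an
# average correction, it overpenalizes the unlucky guesser and
# underpenalizes the lucky one.
```

When all subjects are directed to answer every item, the correction becomes superfluous and raw scores are directly comparable.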
Such instructions, however, are inappropriate for some competitive speed tests, since candidates who mark items very rapidly and with no attention to accuracy excel if speed is the only basis for scoring; that is, if wrong answers are not penalized.

Test norms

Test norms consist of data that make it possible to determine the relative standing of an individual who has taken a test. By itself, a subject's raw score (e.g., the number of answers that agree with the scoring key) has little meaning. Almost always, a test score must be interpreted as indicating the subject's position relative to others in some group; norms provide a basis for that comparison.

Numerical values called centiles (or percentiles) serve as the basis for one widely applicable system of norms. From a distribution of a group's raw scores, the percentage of subjects falling below any given raw score can be found. Any raw score can then be interpreted relative to the performance of the reference (or normative) group: eighth-graders, five-year-olds, institutional inmates, job applicants. The centile rank corresponding to a raw score is the percentage of subjects who scored below that point; thus, 25 percent of the normative group earn scores lower than the 25th centile, and the average called the median corresponds to the 50th centile.

Another class of norm system (standard scores) is based on how far each raw score falls above or below an average score, the arithmetic mean. One resulting type of standard score, symbolized as z, is positive (e.g., +1.69 or +2.43) for a raw score above the mean and negative for a raw score below the mean. Negative and fractional values can, however, be avoided in practice by using other types of standard scores, obtained by multiplying z scores by an arbitrarily selected constant (say, 10) and adding another constant (say, 50, which changes the z-score mean of zero to a new mean of 50).
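Both norm systems can be sketched in a few lines. The reference group below is invented, and the constants 50 and 10 match the example in the text:

```python
def centile_rank(raw, group):
    """Percentage of the reference group scoring below the given raw score."""
    return 100 * sum(1 for s in group if s < raw) / len(group)

def z_score(raw, group):
    """Distance of a raw score from the group mean, in SD units;
    positive above the mean, negative below it."""
    n = len(group)
    m = sum(group) / n
    sd = (sum((s - m) ** 2 for s in group) / n) ** 0.5
    return (raw - m) / sd

def standard_score(raw, group, new_mean=50, new_sd=10):
    """Rescaled z score, avoiding negative and fractional values."""
    return new_mean + new_sd * z_score(raw, group)

group = [40, 45, 50, 55, 60]                # invented raw scores
print(centile_rank(50, group))              # 40.0
print(round(z_score(60, group), 3))         # 1.414
print(round(standard_score(60, group), 2))  # 64.14
```

Note how the rescaling leaves relative standing untouched: a subject one z unit above another is still exactly one new_sd unit above after the transformation.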
Such changes of constants do not alter the essential characteristics of the underlying set of z scores.

The French psychologist Alfred Binet, in pioneering the development of tests of intelligence, listed test items along a normative scale on the basis of the chronological age (actual age in years and months) of the groups of children that passed them. A mental-age score (e.g., seven) was assigned to each subject, indicating the chronological age (e.g., seven years old) in the reference sample for which his raw score was the mean. But mental age is not a direct index of brightness; a mental age of seven means one thing in a 10-year-old and quite another in a four-year-old. To correct for this, a later development was a form of IQ (intelligence quotient), computed as the ratio of the subject's mental age to his chronological age, multiplied by 100. The IQ thus made it easy to tell whether a child was bright or dull for his age.

Ratio IQs for younger age groups exhibit means close to 100 and spreads of roughly 45 points above and below 100. The classical ratio IQ has been largely supplanted by the deviation IQ, mainly because the spread around the average was not uniform across ages, owing to different ranges of item difficulty at different age levels. The deviation IQ, a type of standard score, has a mean of 100 and a standard deviation of 16 for each age level. Practice with the Stanford-Binet test reflects the finding that average performance on the test does not increase beyond age 18; the chronological age of any individual older than 18 is therefore taken as 18 for the purpose of determining IQ.

The Stanford-Binet has itself been largely supplanted by several tests developed by the American psychologist David Wechsler between the late 1930s and the early 1960s. These tests have subtests for several capacities, some verbal and some operational, each subtest with its own norms.
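The two IQ computations described above can be sketched directly; the age-level mean and SD supplied to the deviation version are hypothetical raw-score norms, not values from any real test:

```python
def ratio_iq(mental_age, chronological_age):
    """Classical ratio IQ: mental age over chronological age, times 100.
    Chronological age is capped at 18, reflecting the finding that
    average performance does not increase beyond that age."""
    return 100 * mental_age / min(chronological_age, 18)

def deviation_iq(raw, age_mean, age_sd):
    """Deviation IQ: a standard score with mean 100 and SD 16,
    computed within the subject's own age level."""
    return 100 + 16 * (raw - age_mean) / age_sd

print(ratio_iq(7, 10))             # 70.0  (mental age 7 in a 10-year-old)
print(ratio_iq(7, 4))              # 175.0 (the same mental age in a 4-year-old)
print(deviation_iq(115, 100, 15))  # 116.0 (one SD above the age-level mean)
```

The two ratio-IQ calls make the text's point concrete: the same mental age of seven yields very different IQs at different chronological ages.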
After constructing tests for adults, Wechsler developed tests for older and then for younger children.

Assessing test structure

Factor analysis

Factor analysis is a method frequently used for the systematic analysis of intellectual ability and of other test domains, such as personality measures. Just after the turn of the 20th century, the British psychologist Charles E. Spearman systematically explored positive intercorrelations between measures of apparently different abilities to provide evidence that much of the variability in the scores children earn on tests of intelligence depends on one general underlying factor, which he called g. In addition, he believed that each test contained an s factor specific to it alone. In the United States, L.L. Thurstone developed a statistical technique called multiple-factor analysis, with which he was able to demonstrate, in a set of tests of intelligence, a number of primary mental abilities, such as verbal comprehension, numerical computation, spatial orientation, and general reasoning. Although later work has supported the differentiation between these abilities, no definitive taxonomy of abilities has become established; one complication is the finding that each such ability can itself be shown to be composed of narrower factors.

The first computational methods in factor analysis have been supplanted by mathematically more elegant, computer-generated solutions. While the earlier techniques were primarily exploratory, the Swedish statistician Karl Gustav Jöreskog and others have developed procedures that permit the researcher to test hypotheses about the structure in a set of data.

Rooted in extensive applications of factor analysis, the structure-of-intellect model developed by the American psychologist Joy Paul Guilford posited a very large number of factors of intelligence.
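Spearman's one-general-factor idea can be illustrated by extracting the dominant eigenvector of a correlation matrix, one simple way of estimating loadings on a single common factor. The matrix below is invented, and the power-iteration routine is only a sketch, not how modern factor-analysis software proceeds:

```python
# Invented correlation matrix for four ability tests; all
# intercorrelations positive, as Spearman observed for ability measures.
R = [
    [1.00, 0.60, 0.50, 0.40],
    [0.60, 1.00, 0.55, 0.45],
    [0.50, 0.55, 1.00, 0.50],
    [0.40, 0.45, 0.50, 1.00],
]

def first_factor_loadings(R, iters=500):
    """Loadings on a single general factor, estimated as the dominant
    eigenvector of R scaled by the square root of its eigenvalue."""
    n = len(R)
    v = [1.0] * n
    for _ in range(iters):  # power iteration toward the dominant eigenvector
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
        top = max(w)
        v = [x / top for x in w]
    eigenvalue = sum(R[0][j] * v[j] for j in range(n)) / v[0]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm * eigenvalue ** 0.5 for x in v]

loadings = first_factor_loadings(R)
print([round(x, 2) for x in loadings])  # every test loads positively on g
```

The squared loadings sum to the eigenvalue, the share of total variance the general factor accounts for; the residual correlations are what Spearman attributed to the test-specific s factors.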
Guilford envisaged three intersecting dimensions corresponding respectively to four kinds of test content, five kinds of intellectual operation, and six kinds of product. Each of the 120 cells (4 × 5 × 6) in the cube thus generated was hypothesized to represent a separate ability, each constituting a distinct factor of intellect. Educational and vocational counselors usually prefer a substantially smaller number of scores than the 120 implied by this model.

Factor analysis has also been widely used outside the realm of intelligence, especially to seek the structure of personality as reflected in ratings by oneself and by others. Although there is even less consensus here than for intelligence, a number of studies suggest four prevalent factors that can be approximately labeled conformity, extroversion, anxiety, and dependability.

Profile analysis

With the fractionation of tests (e.g., to yield scores measuring separate factors or clusters), new concern has arisen for interpreting differences among scores measuring the underlying variables, however conceived. Scores of an individual on several such measures can be plotted graphically as a profile; for direct comparability, all raw scores may be expressed as standard scores with equal means and variabilities. The difference between any pair of scores that have less than perfect reliability tends to be less reliable than either, and fluctuations in the graph should be interpreted cautiously. Nevertheless, various features of an individual's profile may be examined, such as scatter (fluctuation from one measure to another) and relative level of performance on different measures. (The particular shape of the graph, it should be noted, depends partly on the arbitrary order in which the measures are listed.) One may also statistically express the degree of similarity between any two profiles.
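Two simple similarity indices often applied to profiles are sketched below: a distance measure, which reflects differences in both level and shape, and a correlation, which reflects shape alone. The profiles are invented standard scores:

```python
def profile_distance(p, q):
    """Euclidean distance between two profiles (smaller = more alike);
    sensitive to both overall level and shape."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def profile_correlation(p, q):
    """Pearson correlation of the two profiles; sensitive to shape
    (the up-and-down pattern) but not to overall level."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = sum((a - mp) ** 2 for a in p) ** 0.5
    sq = sum((b - mq) ** 2 for b in q) ** 0.5
    return cov / (sp * sq)

person = [55, 62, 48, 40, 58]   # standard scores on five measures
group  = [50, 57, 43, 35, 53]   # an occupational-group mean profile

print(round(profile_distance(person, group), 2))     # 11.18
print(round(profile_correlation(person, group), 3))  # 1.0
```

Here the individual's profile has exactly the same shape as the group's but sits five points higher throughout, so the correlation is perfect while the distance is nonzero; a full comparison needs both kinds of index.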
Such statistical measures of pattern similarity permit quantitative comparison of profiles for different persons, of profiles of the same individual's performance at different times, of an individual profile with a group profile, or of one group profile with another. Comparison of an individual's profile with similar graphs representing the means for various occupational groups, for example, is useful in vocational guidance or personnel selection.

Additional Reading

Dorothy C. Adkins, Test Construction, 2nd ed. (1974), a simplified treatment of measurement principles, rules for test construction, and statistical techniques; Anne Anastasi, Psychological Testing, 5th ed. (1982), an authoritative text and reference book, with emphasis on current psychological tests; Lee J. Cronbach, Essentials of Psychological Testing, 4th ed. (1984), a modern and insightful text and general reference; J.P. Guilford, Psychometric Methods, 2nd ed. (1954), a widely used book that attempts to integrate psychophysical scaling and psychological measurement methods; Harold Gulliksen, Theory of Mental Tests (1950), a basic theoretical reference; Harry H. Harman, Modern Factor Analysis, 3rd rev. ed. (1976), an eclectic treatment of factor-analytic theory and methods; Paul Horst, Psychological Measurement and Prediction (1966), a discussion of practical requirements of psychological measurement as well as of technical problems in prediction; Frederic M. Lord and Melvin R. Novick, Statistical Theories of Mental Test Scores (1968), a highly technical presentation; Georg Rasch, Probabilistic Models for Some Intelligence and Attainment Tests (1980), with a new model for tests; and Robert L. Thorndike (ed.), Educational Measurement, 2nd ed. (1971), with specially prepared chapters by authorities in particular fields of measurement.

Dorothy C. Adkins
Donald W. Fiske
* * *