A Brief History of IQ Testing: From Binet to ICAR

The story of intelligence testing is one of the more revealing case studies in how scientific tools evolve — sometimes for good reasons, sometimes for terrible ones, almost always faster than the science underneath them. The instruments your grandparents might have taken at school in 1950 share an ancestor with the ones available online today, but the lineage has had to wind around two world wars, several ideological controversies, and a series of methodological revolutions that fundamentally changed what the tests measure and how they measure it.

Understanding this history isn't just academic. It explains why modern IQ tests look the way they do, why some popular criticisms of them are dated, and why a relatively recent project called ICAR is reshaping how researchers think about cognitive measurement in the open-data era.

Alfred Binet and the original purpose (1905)

The story begins in Paris in 1904, when the French Ministry of Public Instruction commissioned the psychologist Alfred Binet to develop a way of identifying schoolchildren who would benefit from additional academic support. France had mandated universal education in the 1880s, and teachers needed a consistent method for deciding which struggling students needed specialized instruction, rather than relying on subjective judgment.

Binet, working with his collaborator Théodore Simon, published the first Binet-Simon scale in 1905. It was a structured series of tasks of increasing difficulty (vocabulary, memory, simple reasoning, basic arithmetic). The 1908 revision grouped the tasks by age level, which produced what came to be called a "mental age." A child whose performance matched that of a typical eight-year-old was assigned a mental age of eight, regardless of chronological age.

Two things about Binet's original framing are worth knowing. First, the test was meant as a clinical screening tool, not a measurement of fixed underlying ability. Binet explicitly warned against treating mental age as an immutable trait. Second, the "intelligence quotient" (mental age divided by chronological age, then multiplied by 100) wasn't Binet's idea. William Stern proposed the ratio in 1912, and Lewis Terman popularized it in its familiar multiplied-by-100 form.
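The ratio definition amounts to a one-line calculation. Here is a minimal sketch in Python; the function name `ratio_iq` and the example ages are illustrative assumptions, not taken from any published scoring manual.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Classic ratio IQ: mental age over chronological age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological_age must be positive")
    return 100 * mental_age / chronological_age


# A mental age of ten at eight years old scores above 100;
# a mental age of eight at ten years old scores below it.
print(ratio_iq(10, 8))   # 125.0
print(ratio_iq(8, 10))   # 80.0
```

The formula behaves sensibly only while mental age keeps rising with chronological age. In adulthood the numerator levels off while the denominator keeps growing, which is one reason the ratio method was eventually abandoned.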

The Stanford-Binet and the American adaptation (1916)

Lewis Terman, a Stanford psychologist, adapted the Binet-Simon Scale for American use in 1916. The Stanford-Binet became the first widely used intelligence test in the United States and established many of the conventions that survive in modern testing: standardized administration, age-norming, and percentile interpretation.

Terman's adaptation also embedded several assumptions that would later become controversial. He treated IQ as substantially heritable and relatively fixed, used the test to argue for a hereditary basis of social stratification, and was an active participant in the eugenics movement of the early 20th century. The intelligence-testing field would spend much of the next century distancing itself from these views, but the political baggage attached to early American IQ testing has remained part of the public conversation ever since.

The Stanford-Binet itself is now in its fifth edition (SB5), a substantially more sophisticated instrument than the original: multiple cognitive domains assessed separately, a factor structure based on Cattell-Horn-Carroll theory, and contemporary norming samples.

World War I and the rise of group testing (1917)

The American entry into World War I created an unprecedented demand for rapid cognitive screening — the Army needed to sort hundreds of thousands of recruits into appropriate assignments. The Army Alpha and Army Beta tests, developed by Robert Yerkes and a committee of psychologists, were the first large-scale group-administered intelligence tests.

Alpha was designed for literate recruits; Beta for recruits with limited English literacy. About 1.75 million American soldiers were tested using these instruments between 1917 and 1918. The data set was enormous, the methodology was uneven, and the results were used to draw conclusions about national, ethnic, and racial differences that wouldn't survive contemporary statistical scrutiny.

The post-war legacy was mixed. On one hand, the Army testing program established the practical feasibility of mass cognitive assessment and accelerated the development of test methodology. On the other, the conclusions drawn from the data were used to justify the restrictive Immigration Act of 1924 and contributed to popular eugenicist arguments through the interwar period. Modern psychometrics has largely repudiated this strand of its history, but the era's data still shows up occasionally in discussions of measurement bias.

David Wechsler and the modern clinical battery (1939)

The most important methodological development of the mid-century came from David Wechsler, a clinical psychologist at Bellevue Hospital in New York. Wechsler was dissatisfied with the Stanford-Binet for adult assessment — it was designed around childhood mental-age scaling and didn't capture the cognitive profile of adults well.

The Wechsler-Bellevue Intelligence Scale (1939) and its successors, the WAIS (1955) and its later revisions through the WAIS-IV (2008) and WAIS-5 (2024), introduced several innovations that became standard in modern testing:

Deviation IQ — scores based on a normal distribution with mean 100 and standard deviation 15, replacing the older mental-age-ratio method
Verbal and Performance scales measured separately, recognizing that cognitive ability isn't a single thing
Index scores — breakdown into Verbal Comprehension, Perceptual Reasoning, Working Memory, and Processing Speed
Standardized administration protocols with detailed scoring rubrics for clinician training
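The deviation-IQ idea in the list above can be sketched in a few lines of Python. This is a simplified illustration, not how any Wechsler scale is actually scored: real batteries use age-banded norm tables and subtest scaled scores. The function name `deviation_iq` and the norm-group numbers are made up for the example.

```python
def deviation_iq(raw_score: float, norm_mean: float, norm_sd: float) -> float:
    """Rescale a raw score to mean 100, SD 15 relative to a norm group."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    # Distance from the norm-group mean, in standard-deviation units.
    z = (raw_score - norm_mean) / norm_sd
    return 100 + 15 * z


# Hypothetical norm group: mean raw score 28, SD 6 (invented numbers).
print(deviation_iq(34, 28, 6))  # 115.0, one SD above the norm mean
print(deviation_iq(28, 28, 6))  # 100.0, exactly at the norm mean
```

Unlike the mental-age ratio, this score depends only on where a person falls relative to a comparison group of the same age, so it works equally well for adults and children.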

The Wechsler scales remain the clinical gold standard for individual intelligence assessment. If someone is tested by a licensed psychologist today, a recent edition of the WAIS (adults), WISC (children), or WPPSI (preschool), such as the WAIS-IV, WISC-V, or WPPSI-IV, is the most likely instrument used.

The factor-structure debates (1940s-1990s)

While clinical testing matured along the Wechsler lineage, theoretical research on the structure of intelligence went through several major shifts. The debates that played out across the second half of the 20th century shaped modern understanding of what cognitive ability actually consists of.

Raymond Cattell proposed the fluid/crystallized distinction in 1941: the observation that "intelligence" consists of at least two functionally separate abilities, with different developmental curves and different responsiveness to education and training. John Horn extended this work in the 1960s. John Carroll's 1993 reanalysis of more than 460 factor-analytic data sets synthesized the field into a three-stratum model, with narrow abilities at the bottom, broad abilities in the middle, and g at the top.

The merged Cattell-Horn-Carroll (CHC) model is now the dominant theoretical framework, and most modern intelligence tests are designed to measure multiple CHC abilities rather than producing a single number. The shift from "IQ" to "cognitive profile" as the unit of interest is one of the more significant changes of the last fifty years.

The ICAR project and the open-source turn (2010s-present)

The most recent development in this lineage is one most non-specialists haven't heard of: the International Cognitive Ability Resource, an open-source psychometric project initiated at Northwestern University and developed across an international consortium of researchers. ICAR was started largely in response to a practical problem: research-grade intelligence instruments (WAIS, Raven's, etc.) are proprietary, expensive to license, and not freely usable in studies — which had become a serious bottleneck for academic cognitive research.

ICAR built an open-access item bank — verbal reasoning, letter and number series, matrix reasoning, three-dimensional rotation — calibrated against established intelligence instruments and made freely available for research use. The items are documented in published methodology, the calibration data is open, and the item bank can be used to construct short cognitive assessments that produce results comparable to traditional instruments.

For a contemporary illustration of how these methods are applied in practice, a complete guide to IQ test methodology covers the connecting tissue between the historical instruments and the open-research approach in some detail. It includes how an ICAR-based assessment maps onto the broader history of psychometric measurement and what a per-domain breakdown across verbal, numerical, spatial, and matrix reasoning looks like in a modern instrument. The free implementation built on ICAR items at IQ-Test.us is one accessible example of an open, research-backed cognitive instrument outside the proprietary commercial ecosystem.

The trajectory in summary

Five thematic shifts define the 120-year arc:

From clinical screening (Binet) to general-purpose cognitive assessment
From single-number IQ (Stern, Terman) to multi-domain cognitive profiles (Wechsler, CHC)
From mental-age scaling to deviation-IQ statistical scaling
From classical test theory to Item Response Theory and adaptive testing
From proprietary clinical instruments to open-source research-grade item banks

Each shift represents either methodological improvement or expanded accessibility — usually both. The trajectory has been broadly positive, with the unfortunate exception of the early-20th-century period when the tools were applied in ways their original developers would have rejected.

The practical takeaway

Modern IQ testing isn't the same enterprise it was a century ago. The instruments measure more dimensions, with better methodology, and the resulting scores mean something more specific than the older versions allowed. Knowing the lineage helps you read current results sanely — what a percentile actually represents, why per-domain breakdowns matter more than headline scores, and which instruments have the research backing to be worth taking seriously.
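To make "what a percentile actually represents" concrete, the sketch below converts a deviation IQ into a percentile rank using Python's standard library. It assumes scores follow an exact normal distribution with mean 100 and standard deviation 15, which real norm samples only approximate. The names `IQ_SCALE` and `iq_to_percentile` are illustrative.

```python
from statistics import NormalDist

# Assumed idealized score distribution: mean 100, standard deviation 15.
IQ_SCALE = NormalDist(mu=100, sigma=15)


def iq_to_percentile(iq: float) -> float:
    """Percentage of the reference population expected to score at or below iq."""
    return 100 * IQ_SCALE.cdf(iq)


for iq in (85, 100, 115, 130):
    print(iq, round(iq_to_percentile(iq), 1))
# 85 15.9
# 100 50.0
# 115 84.1
# 130 97.7
```

The mapping is far from linear: the fifteen points from 100 to 115 cover about a third of the population, while the fifteen points from 115 to 130 cover under 14 percent.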

The field has done a lot of growing up. The popular conversation about IQ tends to lag behind the science by a few decades, which is part of why understanding the history is one of the more useful things you can do before forming a strong opinion about what these tests are and aren't measuring.