Photo by HikingArtist on Flickr.

DC school reform was a failure, claims a new report from the Economic Policy Institute (EPI). It’s a proven success, others insist. All sides of school reform debates are guilty of misinterpreting federal test data in ways that serve advocacy goals rather than finding truth.

The EPI report blasts DC’s sweeping 2007 school reforms and similar efforts in Chicago and New York City. One of the report’s most amazing claims is that school reform in DC actually lowered student test scores and increased achievement gaps. It reaches that conclusion through a flawed analysis of National Assessment of Educational Progress (NAEP) test scores.

The EPI authors are not the only offenders. In January, the Washington Post editorial board assured readers that despite alleged cheating on the DC CAS, NAEP data demonstrate that school reform has succeeded. A letter to the editor the next day from Alan Ginsburg, director of policy at the Department of Education from 2002 to 2010, argued that NAEP shows the exact opposite.

Beware of arguments that use NAEP to defend or attack policies like charter expansion or teacher layoffs. The reality is that NAEP is not meant for this purpose. You will not find typical peer-reviewed research drawing such conclusions from NAEP data, because it’s a well-known error and the practice has been widely discredited.

I have decided this needs its own term: “misnaepery.”

What is wrong with using NAEP data in this way?

NAEP is the test given to a random sample of students in grades 4, 8, and 12 across the country. It’s designed to gauge long-term trends in student academic proficiency. It doesn’t look at how a fixed group of students learns over time.

Each test looks at a different set of students from the one before. Those who take the test one year in grade 4 are usually in grade 5 the next year, where they won’t take the test. Those still in grade 4 wouldn’t necessarily be in the random sample again anyway.

A test that looks at different groups of students at different points in time (“trend” or “repeated cross-section” measures) doesn’t clearly tell you whether a school is doing better at educating those students, because they are different students. Maybe the demographics of the neighborhood or city changed. Maybe some students moved to or from charter schools.

The 8th grade NAEP is measuring not what that middle or junior high school has done since a previous group of students took the NAEP, but the effect of everything those students did up to grade 8. If something changed in the district’s kindergarten 9 years prior, that would affect the scores of 8th graders who entered kindergarten after the change but not those who entered before it, so the two groups’ scores could differ for reasons that have nothing to do with their middle schools.

These shifts are called “cohort changes.” In short, when you measure a group of students and then a different group of students at another time, the second group could be very different for many reasons. I wrote a more technical paper about this if you want to see a more mathematical analysis of the bias inherent in these types of measures.
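
To make the cohort problem concrete, here is a minimal sketch in Python. Every number in it is invented for illustration and none of it is NAEP data. It assumes two hypothetical student groups, "A" and "B", whose average scores never change, and a school that has no effect at all in either year. The only thing that changes is the mix of students who happen to be tested.

```python
# Hypothetical illustration of a cohort change. All numbers are invented.

# Assumed fixed average score for each hypothetical student group.
# Nothing about the school's effectiveness changes between years.
GROUP_SCORE = {"A": 220.0, "B": 260.0}


def cohort_mean(n_a, n_b):
    """Average score of a tested cohort with n_a group-A and n_b group-B students."""
    scores = [GROUP_SCORE["A"]] * n_a + [GROUP_SCORE["B"]] * n_b
    return sum(scores) / len(scores)


# Year 1: the tested 4th graders are 80% group A, 20% group B.
year1 = cohort_mean(n_a=80, n_b=20)

# Year 2: a different set of 4th graders, now 60% group A, 40% group B.
year2 = cohort_mean(n_a=60, n_b=40)

# The repeated cross-section shows a gain even though the school did nothing.
cross_section_change = year2 - year1

print(f"Year 1 average: {year1:.1f}")
print(f"Year 2 average: {year2:.1f}")
print(f"Apparent change: {cross_section_change:+.1f}")
```

In this made-up case the average rises by 8 points purely because the second cohort has a different composition. Reverse the two years and the same data would look like an 8-point decline.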

In the case of DC school reform, misnaepery is especially inexcusable because a panel of experts from the National Academy of Sciences specifically warned that the NAEP does not provide causal evidence on the DC reforms’ impact. The EPI report’s authors may be right that reform proponents made exaggerated claims that reform was successful when test scores rose. But making even more exaggerated claims in the other direction is the wrong response.

We need better data and more objective research

Instead, we must be humble about what can be learned from existing data. We must also invest in better data and more focused data-gathering efforts. Instead of repeated cross-sections, we need longitudinal “growth measures,” where you take a group of students who were exposed to a policy (and ideally others who were not) and follow those same students over time.
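
As a rough sketch of what a growth measure looks like, the Python below follows the same hypothetical students from one test to the next and compares average gains for those exposed to a policy with gains for those who were not. The student IDs, scores, and the `exposed` flag are all invented for illustration. The comparison is only as credible as the assumption that the two groups are otherwise similar.

```python
# Hypothetical longitudinal records. All values are invented.
# Each row: (student_id, exposed_to_policy, score_before, score_after)
records = [
    ("s1", True, 210.0, 226.0),
    ("s2", True, 240.0, 254.0),
    ("s3", True, 225.0, 240.0),
    ("s4", False, 215.0, 225.0),
    ("s5", False, 245.0, 256.0),
    ("s6", False, 230.0, 239.0),
]


def mean_growth(rows, exposed):
    """Average gain (after minus before) for the same students over time."""
    gains = [after - before for _, flag, before, after in rows if flag == exposed]
    return sum(gains) / len(gains)


growth_exposed = mean_growth(records, exposed=True)
growth_comparison = mean_growth(records, exposed=False)

# Difference in average growth between the two groups of students.
estimated_difference = growth_exposed - growth_comparison

print(f"Average growth, exposed students:    {growth_exposed:.1f}")
print(f"Average growth, comparison students: {growth_comparison:.1f}")
print(f"Difference in growth:                {estimated_difference:+.1f}")
```

Because each gain is computed for one student, a change in who is enrolled cannot masquerade as a change in achievement the way it can in a repeated cross-section.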

The NAS experts in 2011 recommended a set of metrics, mostly longitudinal, that DC could use to evaluate school reform policies. That would help, though it wouldn’t entirely prove reform worked or didn’t, unless there were another group of kids who didn’t benefit from reform at all to serve as a control group.

Better data would also help estimate the impacts of specific, replicable reforms, rather than trying to settle a pointless debate about whether the broad suite of DC education reforms was good or bad as a whole.

Some researchers do use data intelligently to answer focused questions about specific changes, such as this paper from last summer about school closures.

To improve DC education, we need purposeful experiments that try out promising practices and then collect the data to evaluate them. We need to collect more useful data and to recognize the limitations of the data we have. Researchers have a responsibility to their audiences to not oversell what existing data can tell us.


Steven Glazerman is an economist who studies education policy and specializes in teacher labor markets. He has lived in the DC area off and on since 1987 and settled in the U Street neighborhood in 2001. He is a Senior Fellow at Mathematica Policy Research, but any of his views expressed here are his own and do not represent Mathematica.