Academic Minute Podcast
Joseph Cimpian, New York University – Misleading Numbers: Examining Data Reliability in Public Health
On New York University Week: Data can be very important, but they need to be correct.
Joseph Cimpian, professor of economics and education policy, discusses how bad data can have big consequences.
Joseph Cimpian, Ph.D., is Professor of Economics and Education Policy at NYU Steinhardt School of Culture, Education, and Human Development. His research focuses on the use and development of novel and rigorous methods to study equity and policy, particularly concerning gender, data validity, and language policy. One line of his research examines individual and contextual factors related to gender gaps in STEM. Another line examines how invalid and mischievous survey responses skew estimates of majority-minority group disparities. Other research examines how education policies can be amended to provide access and supports to students learning English. His work has been funded by the Spencer Foundation, the National Science Foundation, and the Institute of Education Sciences. His research has been published in numerous journals, including Science, Pediatrics, Child Development, the Journal of Policy Analysis and Management, and Educational Researcher, and has been featured by the New York Times, the Washington Post, NPR, and Brookings, among other outlets. At NYU, he teaches intermediate and advanced graduate courses on data analysis and causal inference, and he co-directs the Master’s degree program in Education and Social Policy. He is an immediate past Editor of Educational Evaluation and Policy Analysis, and is on the editorial boards of several other education and psychology journals.
Misleading Numbers: Examining Data Reliability in Public Health
Data shape policy, perceptions, and resource allocation. But when data are tainted, skewed, or false, problems arise. We found this issue in the Centers for Disease Control and Prevention’s Youth Risk Behavior Survey.
At first glance, the CDC’s findings suggested alarming disparities between male teens who identified as lesbian, gay, bisexual, or questioning – or LGBQ – and their heterosexual peers. LGBQ males reported higher rates of steroid and intravenous drug use, weapon possession, and getting into fights. But here’s the twist: after accounting for potentially invalid data, the disparities vanished, revealing greater similarity between the groups.
However, not all disparities were affected by these inaccuracies. LGBQ youth remained at a higher risk for being bullied and contemplating suicide, a stark reminder of the challenges they face daily.
Given the survey’s impact on policies and resource allocation, data accuracy is vital. Our research team proposed methods to identify and mitigate flawed survey data. Using machine learning, we found discrepancies between LGBQ and heterosexual respondents in areas where the groups should not differ, like height and carrot consumption. Some respondents gave exaggerated, implausible answers, including unrealistic heights, excessive carrot consumption, and likely false claims of being gay. These same respondents also reported high drug use and risky behaviors. Our method also accounts for careless or random responses, reducing the influence of suspicious data.
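The screening idea in this segment can be illustrated with a minimal sketch. This is not the research team's machine-learning method. It is a much simpler rule-based stand-in that flags respondents who give implausible answers on items unrelated to identity, such as height and carrot consumption, and then compares a group disparity before and after setting those respondents aside. The field names, the cutoffs, and the ten toy responses are all hypothetical and were made up for illustration; they are not survey data.

```python
from dataclasses import dataclass


@dataclass
class Response:
    lgbq: bool            # self-reported identity
    height_cm: float      # screener item: should not differ by group
    carrots_per_day: int  # screener item: should not differ by group
    steroid_use: bool     # outcome item of interest


# Hypothetical plausibility cutoffs, chosen only for this illustration.
MAX_PLAUSIBLE_HEIGHT_CM = 210.0
MAX_PLAUSIBLE_CARROTS = 4


def suspicion_score(r: Response) -> int:
    """Count how many screener items received an implausible answer."""
    return int(r.height_cm > MAX_PLAUSIBLE_HEIGHT_CM) + int(
        r.carrots_per_day >= MAX_PLAUSIBLE_CARROTS
    )


def rate(rows: list[Response], lgbq: bool) -> float:
    """Share of one group reporting the outcome."""
    group = [r for r in rows if r.lgbq == lgbq]
    return sum(r.steroid_use for r in group) / len(group)


def disparity(rows: list[Response]) -> float:
    """LGBQ rate minus heterosexual rate for the outcome."""
    return rate(rows, True) - rate(rows, False)


def screen(rows: list[Response], max_score: int = 0) -> list[Response]:
    """Keep only respondents at or below the allowed suspicion score."""
    return [r for r in rows if suspicion_score(r) <= max_score]


# Toy responses, invented for illustration only.
rows = [
    Response(True, 170.0, 1, False),
    Response(True, 165.0, 0, False),
    Response(True, 175.0, 1, False),
    Response(True, 230.0, 6, True),   # implausible height and carrot intake
    Response(True, 225.0, 5, True),   # implausible height and carrot intake
    Response(False, 172.0, 1, False),
    Response(False, 168.0, 0, False),
    Response(False, 180.0, 2, False),
    Response(False, 177.0, 1, False),
    Response(False, 169.0, 0, False),
]

print(f"raw disparity:      {disparity(rows):.2f}")
print(f"screened disparity: {disparity(screen(rows)):.2f}")
```

In this toy data the gap shrinks from 0.40 to 0.00 purely by construction, because the only respondents reporting steroid use are the ones with implausible screener answers. Real survey data would rarely separate so cleanly, which is one reason to weight down suspicious responses instead of relying on a hard cutoff.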
These methods can be applied not only to surveys of youth, but to other areas where doubts about accuracy have emerged, including studies on public health behaviors during the COVID-19 pandemic, studies of adopted youth, and surveys about sexual orientation among adults.
Our understanding of the world is only as good as the data that inform it. We must ensure that the data we collect and interpret are as accurate as possible.