Fix It—Don't Throw It Out
Student evaluations of teaching are valuable, but the system needs reform
A few weeks after the end of a semester, I am always excited to receive my student evaluations of teaching (SETs). As a term faculty member in the Department of Economics at American University (AU), I find these evaluations to be useful. Until I got a continuing appointment, they were important for my reappointments. Beyond that, students’ feedback on my teaching helps shape my future work. It’s one way for me to know what’s working well and what I need to tweak or change going forward.
Similar evaluations are widely used at other colleges and universities, especially in the United States, to measure how effectively instructors teach. Since their introduction in the 1920s, the percentage of higher education institutions using SETs has gradually increased. Today, more than 90 percent of US colleges and universities use them, according to research by Terrance Jalbert, a professor of finance at the University of Hawai‘i at Hilo.
Students usually fill out evaluations at the end of each semester, ideally giving useful feedback on teaching practices. In addition, SETs provide institutions with information for use in reappointment, promotion, and tenure decisions. Evaluations also enable comparisons across courses and instructors. Despite these uses, SETs receive many criticisms, including that they do not correspond to learning outcomes, they encourage poor teaching and grade inflation, and they are subject to racial and gender bias. Institutions must address these problems. Students’ feedback on teaching can be helpful, but the system needs reform.
Most assessments of SETs have shown that they are flawed, with criticisms going back several decades. A recent and frequently cited study by social psychologist Wolfgang Stroebe concludes that these evaluations encourage poor teaching and contribute to grade inflation. Stroebe argues that because students tend to give higher scores to instructors who give higher grades, SETs reward lenient-grading instructors who require little work from students, while strict-grading instructors often receive low scores.
Low response rates are also a problem. They introduce significant bias into the data because a small, self-selected group of respondents is unlikely to reflect the overall population of students in a class. Many institutions have shifted from in-class evaluations to online ones to make the surveys easier to administer. This shift, however, further reduces response rates, perhaps because students feel disengaged once they are no longer in class. In addition, for reasons that remain unclear, students who disliked the professor or the course are more likely to complete online surveys than students with favorable views.
In my experience, scores are also lower in large classes because students are often less engaged and have fewer opportunities to personally connect with the instructor than in smaller classes. For example, I have taught the same macroeconomics course many times over the past fifteen years, with enrollments ranging from thirty to more than two hundred. The larger the class, the lower the evaluation scores have been. This phenomenon is especially problematic for term faculty like me, who often teach large introductory courses. Tenured faculty, by contrast, are more likely to teach advanced seminars and graduate-level courses, both of which typically have small class sizes. Students are more likely to get to know the professor well in a small class, which may lead them to give better evaluations.
Another problem is that institutions usually report the average (mean) score rather than the median, even though SET responses typically skew toward the high or low end of the scale. In a skewed distribution, a handful of extreme ratings can pull the mean well away from the typical response, whereas the median is far less sensitive to them. For example, among the 24,060 evaluations completed in the fall 2023 semester at AU, 83 percent of students rated the statement “I would recommend this professor to other students” a four or five on a scale from one (lowest) to five (highest). Furthermore, in a small class, a few students giving very low scores can cause an instructor’s average to plunge: in a ten-student class where eight students give fives and two give ones, the mean is 4.2 even though the median is a perfect 5.
Finally, student evaluations have complex gender and racial biases. A 2019 AU task force that reviewed the SET system found that a faculty member’s race, gender, age, accent, and other characteristics sometimes result in biased responses. However, these biases are often situational. For example, a female instructor teaching physics may get lower scores than a male physics instructor. On the other hand, a male instructor teaching gender studies may get lower evaluations than a female instructor teaching gender studies. Taking these complexities into account, the task force concluded that bias most affects instructors from underrepresented groups in a given field.
In light of these problems, institutions have a variety of options for changing how they approach student evaluations. Designing evaluations more thoughtfully, especially the wording of questions, would be a useful start. Questions like “How would you rate your overall experience in this course?” are problematic: because no objective criteria exist for answering them, students invent their own subjective criteria and respond accordingly. Instead, institutions should ask objective questions and reframe evaluations around tangible qualities and teaching practices. For example, questions could ask whether assignments were rooted in material covered in class or how long, on average, the instructor took to grade assignments.
In general, institutions need to create evaluations that measure the concrete practices of effective teaching rather than the more subjective emotional responses students have to a course and its instructor. Such an approach could also reduce gender and racial biases in student evaluations. Evaluation surveys should ask students to seriously consider the hallmarks of good teaching, including clarity, competency, and professionalism.
Institutions should also explore ways other than student evaluations to assess teaching. A 2018 study by Stephen Benton concluded that “effective instructor evaluation is complex and requires the use of multiple measures—formal and informal, traditional and authentic—as part of a balanced evaluation system.” The American Association of University Professors recommends gathering course syllabi, tests, and pedagogical materials for evaluation by an instructor’s colleagues. The organization also suggests implementing self-evaluations and class visitations by trained observers.
Some institutions are now considering scrapping their existing evaluation systems entirely and are seeking replacements grounded in research and thoughtful reflection. The University of Oregon (UO) has adopted a framework for the Continuous Improvement and Evaluation of Teaching. According to UO’s website, this initiative aims “to more transparently evaluate teaching with (1) a clear definition of teaching quality and (2) evidence from three sources: peers, students, and faculty themselves.” Student evaluations remain part of the system, but in a non-numerical form in which students write out responses to questions.
High-quality teaching is a critical component of education. Student evaluations are one useful way to measure teaching effectiveness and document instructional development for a teaching portfolio. However, academic institutions need to address the current system’s flaws. In the end, we must continue to provide a way for students to have a voice and for faculty to remain accountable to those they teach.