I asked for “two or three of your best experiments done by unbiased social scientists who wrote it up in unbiased peer-reviewed journals showing significant prejudice in the housing market or in the job market.” I didn't want to waste my time on low quality stuff. Yayfulness gave me the names of a couple of books and a link to an opinion piece written by Nicholas Kristof containing other links.
http://www.nytimes.com/2015/02/22/opini ... .html?_r=2
I was interested in pursuing three of the links that Kristof provided. For the first link I looked at, Kristof says,
Kristof wrote: Researchers discovered that candidates for medical school interviewed on sunny days received much higher ratings than those interviewed on rainy days. Being interviewed on a rainy day was a setback equivalent to having an MCAT score 10 percent lower, according to a new book called “Everyday Bias,” by Howard J. Ross.
Well, I know a lot of people who don’t perform as well when depressed and who get depressed on cloudy days, so I expected to see some difference. It’s not an unconscious thing, as Kristof suggests, because those interviewed usually know they don’t perform as well on cloudy days. But I thought I’d look into these “much higher ratings.”
Following the link, I found an abstract for a paper by Redelmeier and Baxter. The result was
Redelmeier and Baxter wrote: A total of 2926 candidates were interviewed over the 6-year period. As expected, their demographic characteristics were unrelated to the weather (Appendix 1, available online at
http://www.cmaj.ca/cgi/content/full/cmaj.091546/DC1). Overall, those interviewed on rainy days received about a 1% lower score than those interviewed on sunny days (average score 16.31 v. 16.49, p = 0.042). This pattern was consistent for both senior interviewers (16.39 v. 16.55, p = 0.08) and junior interviewers (16.23 v. 16.42, p = 0.041). We next used logistic regression to analyze subsequent admission decisions. The difference in scores was equivalent to about a 10% lower total mark on the Medical College Admission Test.
I'm sure that getting a score of 16.31 out of 20 instead of 16.49 just because it was a rainy day would not make you happy. The equivalence to something on the MCAT may be true, I don't know, but the fact remains that the score on a sunny day is only about 1% higher, and I don't call that “much higher.” As Yayfulness points out, often the real problem is in the interpretation of the experiment's results. I saw no evidence that the score differences were due to a bias of the interviewer rather than the effects of a cloudy day on those interviewed, and since the effect was smaller than I expected I wasn't interested in pursuing this further.
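For anyone who wants to check the arithmetic behind that “about 1%” claim, here is a minimal back-of-the-envelope sketch. I'm assuming, as the scores imply, that the rating is out of 20 points; nothing else is taken from the paper.

```python
# Back-of-the-envelope check of the rainy-day rating gap reported by
# Redelmeier and Baxter. Assumes a 20-point rating scale, as the
# reported means (16.31 vs. 16.49) suggest.
sunny, rainy, scale_max = 16.49, 16.31, 20.0

gap = sunny - rainy           # absolute difference in points
relative_gap = gap / sunny    # difference relative to the sunny-day mean
scale_gap = gap / scale_max   # difference relative to the full scale

print(f"absolute gap:      {gap:.2f} points")
print(f"relative to mean:  {relative_gap:.1%}")
print(f"relative to scale: {scale_gap:.1%}")
```

Either way you slice it, the gap is about 1% — roughly 1.1% of the sunny-day mean, or 0.9% of the full 20-point scale.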
Another link looked interesting.
Kristof wrote: Consider a huge interactive exploration of 14 million reviews on RateMyProfessors.com that recently suggested that male professors are disproportionately likely to be described as a “star” or “genius.” Female professors are disproportionately described as “nasty,” “ugly,” “bossy” or “disorganized.”
So I followed that link. The page linked refers you to another page where you can get background information on this study by Benjamin M. Schmidt. The author of the study says
Benjamin M. Schmidt wrote: This is no peer review, and I wouldn’t describe this as a “study” in anything other than the most colloquial sense of the word. (It won’t be going on my CV, for instance.)
Judging by the numerous limitations of the study that the author points out, I would agree with his self-assessment.
So then I came to the third study which seems to be what Zedability was talking about as well.
Kristof wrote: One reaction from men was: Well, maybe women professors are more disorganized!
But researchers at North Carolina State conducted an experiment in which they asked students to rate teachers of an online course (the students never saw the teachers). To some of the students, a male teacher claimed to be female and vice versa.
When students were taking the class from someone they believed to be male, they rated the teacher more highly. The very same teacher, when believed to be female, was rated significantly lower.
When I clicked on the link Kristof provides, I came to another press release with glowing praise for this report by “Lillian MacNell, lead author of a paper on the work and a Ph.D. student in sociology at NC State.” Wanting to read the report for myself, I clicked on the link provided. Unfortunately you have to pay $39.95 to see the report, and I wasn’t so sure I was that interested. I began to search around to see if I could learn more about the report in other online sources.
I found numerous articles, but mostly they just repeated the press release. In an article praising MacNell’s study (Kaitlin Mulhere, Inside Higher Ed, December 10, 2014), I found this interesting bit of information.
https://www.insidehighered.com/news/201 ... valuations
Kaitlin Mulhere wrote: The lead author, Lillian MacNell, said personal experience encouraged her to conduct the study. (The co-authors are Adam Driscoll, an assistant professor at the University of Wisconsin-La Crosse, and Andrea Hunt, an assistant professor at the University of North Alabama. Driscoll and Hunt earned their Ph.D.s from N.C. State.)
MacNell, a doctoral student at North Carolina State, was grading for an online course and often received emails from students challenging her decisions. They complained about her grading, and in some cases, went over her head and emailed the professor directly.
She vented to a male colleague who was also grading for the course, saying that it was frustrating how much students were protesting her decisions.
“He said, ‘What are you talking about?’ He hadn’t received anything.”
Both MacNell and the male colleague were using the same language and rubric to grade students, she said, so there was no reason why students should be accepting his decisions but not hers.
This raises red flags for me. The lead author wants to prove that it is the students’ sexism, and not her skills, that is the problem. While it is possible that she could conduct the study in an objective and fair manner, she would have to pay strict attention to the ordinary rules for research. Unfortunately, as I was to learn later, she did not.
In Inside Higher Ed’s “University of Venus” blog, I read an article by Jeanne Zaino.
https://www.insidehighered.com/blogs/un ... valuations
While giving praise for the MacNell study, she admitted that the study may be flawed.
Jeanne Zaino wrote: MacNell et al’s findings have come under much scrutiny. Perhaps the most cogent and well thought out critique is by Steve Benton and Dan Li who raise a number of questions about the study’s design and analysis.
Going to the critique, I finally found someone who had actually read the MacNell paper.
Steve Benton is an experienced researcher and Emeritus professor of Special Education, Counseling, and Student Affairs at Kansas State University. He served K-State for more than 25 years as both professor and department chair. He earned his Ph.D. in educational psychology and is a Fellow in the American Psychological Association and American Educational Research Association. Dan Li holds a B.A. in Online Journalism, an M.A. in Mass Communication, and she received her Ph.D. in Media, Technology, and Society from Northwestern University. I’d encourage you to read their full report here.
http://ideaedu.org/ideablog/2014/12/wha ... ender-bias
It is only about four pages long.
I’d like to give some of the highlights of the report.
Benton and Li wrote: Although MacNell et al. (2014) believe their research demonstrates gender bias, a closer look at the study’s design and analysis reveals much to refute about that assertion.
Here are our chief concerns about the study:
• Researcher expectancy effects. Researcher expectancy effects can occur when those carrying out a study know what is expected. MacNell et al. report that, “All instructors were aware of the study and cooperated fully.” So, in other words the instructors knew that in one section they were identified as a person of the opposite gender. The authors should have employed a double-blind procedure so that neither instructor would have known which section was the “perceived gender.”
They could have performed a content analysis of the discussion boards, but they did not.
Related to this issue is the fact that the participating students were enrolled in an anthropology/sociology course. Was gender bias a topic in the course? Did the instructors inadvertently express views about gender bias?
• Inappropriate design and analyses. The design and statistical analyses were flawed in several ways. First, the authors performed sophisticated analyses (i.e., principal components analysis, structural equation modeling, MANOVA) on a sample of only 72 participants who responded to 15 items. Such statistical procedures require a much larger sample size relative to the number of variables measured.
Although the authors viewed their study as a 2 x 2 factorial design, they failed to test the interaction effect of Actual Gender by Perceived Gender.
If the data were consistent with the authors’ hypothesis that “students would rate the instructors they believed to be male more highly than ones they believed to be female, regardless of the instructors’ actual gender” (p. 5), we would expect to see that students in Section C, who thought their female instructor to be male, would give a higher rating than their peers in Section A, who rated the same female instructor but knew she was female. In keeping with the authors’ hypothesis, students in Section D should in turn have rated their male instructor higher than those in Section B. We were perplexed as to why the authors did not conduct such comparisons. Moreover, they only reported descriptive statistics for a combination of two sections, which masked the actual distribution of ratings in the individual class sections.
• Differential loss of subjects between groups. With 72 students randomly assigned to six class sections, each section should have contained 12 students. However, this was not the case. There were only 8 students in the section where the female instructor was perceived to be female, while the other three sections contained 11 or 12 students. For such a small sample, such a difference in attrition may have had some noticeable influence on the ratings of instructors. Unfortunately, the authors did not provide any explanations for the variations in class size.
• Inappropriate Type I error rate. The authors did not report an a-priori Type I error rate (i.e., probability of rejecting a true null hypothesis—that is the hypothesis of no difference). Then, in the results section they decided to use an unconventional .10 level on the student ratings index when .05 is typically used. Although they provided a rationale for this decision, we side with Krathwohl (1993) who recommends that if researchers are depending on a single study they should reduce Type I error to .01 or .001 (the opposite of what MacNell et al. did). At the very least the authors should have suspended judgment about any conclusions (Keppel, 1991) rather than boldly stating, “This study demonstrates that gender bias is an important deficiency of student ratings of teaching.”
• Student gender. Readers are given no information about the breakdown of student gender within each of the class sections. Existing research has suggested that student gender may have a modest but significant effect on the ratings of male and female instructors (Centra & Gaubatz, 2000). While MacNell et al. claim to have collected information about student gender, they did not report it and thus the gender composition of the subjects remains unknown to readers. Therefore, the observed differences in ratings cannot be fully attributed to gender bias if the effects of student gender were not controlled.
• Questionable validity of instrument. The 15-item instrument, apparently designed for this study, was comprised of Likert-type items inviting students to respond from 1 = Strongly disagree to 5 = Strongly agree. Six items were intended to measure effectiveness (e.g., professionalism, knowledge, objectivity); six were intended for interpersonal traits (e.g., respect, enthusiasm, warmth), two were included for communication skills, and one was intended “to evaluate the instructor’s overall quality as a teacher.” No information about the exact wording of the items was provided. Moreover, the authors provided no theoretical explanation for item development or whether the student ratings index correlates with any other relevant measures.
As we noted previously, the only significant differences (p < .05) were found on three traits—fairness, praise, and promptness. We contend that those three characteristics are not necessarily an indication of overall teaching effectiveness. In fact, the one item that measured “overall quality” was noticeably left unanalyzed. Why did the authors choose not to report any analysis on this important variable—one that is typically reported in studies of student ratings?
In conclusion, the MacNell et al. study falls short of other studies investigating gender and student ratings. Centra and Gaubatz (2000), for example, analyzed student ratings of instruction from 741 classes in 2- and 4-year institutions across multiple disciplines. They found a significant but nonmeaningful student-gender by instructor-gender interaction: female students, and sometimes male students, gave slightly higher ratings to female instructors. Centra (2009) also found that female instructors received slightly higher average ratings.
In a review of 14 experimental studies, Feldman (1992) found few gender differences (in only two of the studies) in global ratings. In a follow-up study Feldman (1993) found a very weak average correlation between instructor gender and student ratings (r = .02). In reviewing the experimental studies he wrote, “Any predispositions of students in the social laboratory to view male and female college teachers in certain ways (or the lack of such predispositions) may be modified by students’ actual experiences with their teachers in the classroom or lecture hall” (Feldman, 1992, p. 152).
And, in point of fact, no differences in ratings were found in the MacNell et al. (2014) study between sections that were taught by the actual female instructor and the actual male instructor.
This is not to say that gender bias does not exist. We grant that it can be found in all walks of life and professions. But a single study fraught with confounding variables should not be cause for alarm. The gender differences in student ratings that have been reported previously (e.g., Centra & Gaubatz, 2000; Feldman, 1992, 1993) are not large and should not greatly affect teaching evaluations as long as ratings are not the only measure of teaching effectiveness.
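To make Benton and Li’s point about the untested 2 × 2 interaction concrete, here is a minimal sketch of the contrast they say MacNell et al. should have computed. The cell means below are made up for illustration; only the arithmetic of the contrast matters. In this hypothetical, “perceived male” is rated higher by the same amount for both actual instructors, so the perceived-gender effect is positive but the interaction contrast is zero, which is exactly the pattern the authors’ hypothesis predicts and the comparison the paper never reports.

```python
# Hypothetical 2 x 2 design: Actual Gender (rows) x Perceived Gender (columns).
# Cell means are invented for illustration, loosely mapped onto the
# sections (A-D) described in the Benton and Li critique.
cell_means = {
    ("female", "perceived_female"): 3.9,  # Section A (hypothetical)
    ("female", "perceived_male"):   4.2,  # Section C (hypothetical)
    ("male",   "perceived_female"): 3.8,  # Section B (hypothetical)
    ("male",   "perceived_male"):   4.1,  # Section D (hypothetical)
}

# Main effect of perceived gender: mean "perceived male" rating minus
# mean "perceived female" rating, averaged over actual gender.
perceived_effect = (
    (cell_means[("female", "perceived_male")] + cell_means[("male", "perceived_male")]) / 2
    - (cell_means[("female", "perceived_female")] + cell_means[("male", "perceived_female")]) / 2
)

# Interaction contrast: does the perceived-gender gap differ depending
# on the instructor's actual gender?
interaction = (
    (cell_means[("female", "perceived_male")] - cell_means[("female", "perceived_female")])
    - (cell_means[("male", "perceived_male")] - cell_means[("male", "perceived_female")])
)

print(f"perceived-gender effect: {perceived_effect:+.2f}")
print(f"interaction contrast:    {interaction:+.2f}")
```

A real analysis would of course test these contrasts against the within-cell variability (e.g., in a two-way ANOVA), which is precisely where the tiny, unequal cell sizes Benton and Li describe become a problem.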
Tl;dr The three “studies” that I looked at in Kristof’s article recommended by Yayfulness are either misrepresented, unscientific, poorly controlled, or performed by biased researchers, as shown by the authors themselves and by competent reviewers. Some of the “studies” show a combination of these factors. Well, I’m glad I asked for the best so I don’t have to waste my time on more junk science.