While lurking on a Twitter exchange about race, education, and schools, I saw a great reminder from Bill Fitzgerald scroll by. In effect [and apologies, Bill, if I've summarized incorrectly], it's worth engaging around important topics even if it's clear the discourse isn't going anywhere, because you never know who might be listening, seeking to learn. To revisit an earlier post, I decided not to worry about the manager of the nursery, and consider instead the walkers just out for a Sunday stroll who may overhear the discussion about the daisies.
I am going to make one claim here in this post and one claim only: when adults look at multiple-choice items, we see them differently than students do. Experience, background knowledge, expertise, confirmation bias, 20 years of living - a wide variety of things influence how we read an item. Any teacher who's seen students ace a "hard" item or tank on an "easy" one knows that it's not until students actually take the items that we get a real sense of an item's difficulty.
Item design is a science - and an art. Objectivity plays a large role. BUT:
One cannot determine which item is more difficult simply by reading the questions. One can recognize the name in the second question more readily than that in the first. But saying that the first question is more difficult than the second, simply because the name in the second question is easily recognized, would be to compute the difficulty of the item using an intrinsic characteristic. This method determines the difficulty of the item in a much more subjective manner than that of a p value. (Basic Concepts in Item Design)

This is why there's field testing - why we should field test classroom tests, and why states have to field test items from their large-scale tests. The test designers (teachers or Pearson writers) do their level best, but we need certain statistics (available only after students have actually taken and responded to an item) to reach conclusions about how high quality an item is. The most common statistic we can use is what's known as a p-value. This value is the percent correct: the higher the p-value, the more students who got an item correct, and the easier the item was for the group of students who took the test. There are guidelines around p-values, but generally speaking, "p-values should range between 0.30 and 0.90." There's a lot more to unpack around item difficulty, but we'll just leave this here.
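To make the p-value concrete, here's a minimal Python sketch of the calculation - proportion of students answering each item correctly - with the function name, item numbering, and scored data all invented for illustration:

```python
def p_values(responses):
    """Return the p-value (proportion correct) for each item.

    responses: list of lists; each inner list is one student's
    scored answers across all items (1 = correct, 0 = incorrect).
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        sum(student[i] for student in responses) / n_students
        for i in range(n_items)
    ]

# Hypothetical data: five students, three items.
scored = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
]

for item, p in enumerate(p_values(scored), start=1):
    # Flag items outside the commonly cited 0.30-0.90 range.
    flag = "" if 0.30 <= p <= 0.90 else "  <- outside typical 0.30-0.90 range"
    print(f"Item {item}: p = {p:.2f}{flag}")
```

Note that the same item could produce a very different p-value with a different group of students - the statistic describes the item's difficulty *for the group that took it*, which is exactly why field testing matters.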
In the absence of these p-values, our observations about the difficulty of an item are just that - observations. Hambleton and Jirka (2006), two psychometricians/unicorns, reviewed the literature around estimating item difficulty and found studies where highly qualified, experienced teachers were inconsistent when it came to accurately estimating how students would do on an item. "No one in any studies we have read has been successful at getting judges to estimate item difficulty well." Pretty compelling evidence that we need to temper our opinions with supporting evidence from students who, you know, actually took the assessments.
So now onto the game. Let's pick an item. Any item. How about Item 132040149_1 from the released Grade 4 Assessment items?
Now, in order for this game to work, you have to play along. Click the link above to read the Pecos Bill story and do your best to answer the question. You may look at it and conclude it requires "skills out of the reach for many young children," or that the number of steps needed to answer it is too many and too complicated for 4th graders. Now consider the question below, also from the fourth grade test:
Which one would you expect to be harder for the students? The top one or the bottom one? What's informing your decision making? What evidence are you using? What percent of students do you think got the top item correct? How about the bottom one?
Hit me up here or on Twitter and share your thinking. I'll follow up with the answer in an upcoming blog, provided I get through my rose-tending to do list.
The reveal is here.