Pick an item, any item - the reveal

Yesterday, I wrote a quick run-through of item p-values in which I ignored a bunch of stuff about item analysis in order to focus on the big idea of predicting item difficulty. 20 anonymous people - of unknown age, teaching experience, and background - answered a quick poll I put up about two items from the released NYS items. 20 adults (I assume - and based on their responses to the optional question, it's a reasonable assumption) picked which item they thought was harder. And 17 of them - 85%, or about the size of an elementary school PLC - picked Item 1. Reasons for picking 1 included:
  • To answer question 1 you need to go off and look up the contents of the 4 referenced paragraphs and then keep them in mind while evaluating the most plausible answer. This would put more strain on the working memory. The content needed to answer question 2 is right there in the question, easily visible.
  • Question 1 requires students to go back into the reading selection; that is not required for Question 2.
  • "Predicts the action" is awkward phrasing, question requires students to flip back into the story to reread. Also, question number 2 is a more familiar format
All reasonable. All made by (again, I assume) well-educated adults using the evidence in front of them to draw a conclusion. And yet, the reason we need student item data...

49% of NYS 4th graders who took the test got this question correct. 
25% of NYS 4th graders who took the test got this question correct. 
Does this mean that the 20 adults don't understand teaching, students, or education? Not by a mile. It's a reminder that when it comes to assessment - especially multiple choice item design - we adults see an item's difficulty differently than the test taker does. Think one of the released items is especially hard? Check the p-values by using the released charts. See the small blue number in the top left? That's the item's code. Look for that code on the p-value chart. If you're in a school in NY wondering how to make connections to your students, look at your students' p-values in the reports released by the RIC and start to have conversations about the implications. SED has released guidance on how to analyze the data and educators across the state are writing (written by my co-blogger, Theresa) about how to use state assessment data to inform conversations.

State ed testing takes 300 minutes, 1%, 2 weeks - however you choose to present the numbers - of a child's year. It is the LEAST important assessment children will engage in over the course of a year. What's to be gained - or lost - by framing the LEAST important thing students do as a way to advance our agendas? How does it help alleviate students' (and parents') stress when we give the LEAST important thing in the education landscape the most attention?

Disclaimer: And as you read these posts, please know, gentle reader, that I am an advocate of performance-based, portfolio, and authentic assessment. I love roses but have committed to the science of the teaching profession which means working to ensure we're describing the daisies correctly. So the usual disclaimer - I am not defending NCLB, VAM, APPR. I'm not even defending the NYS assessments. It's my hunch that we're making it harder to fix the big picture when we neglect to accurately define the parts of the whole.

Pick an item, any item

So we're going to play a little game in this post. But first, let me set the stage.

While lurking on a Twitter exchange about race, education, and schools, I saw a great reminder from Bill Fitzgerald scroll by. In effect [and apologies, Bill, if I've summarized incorrectly], it's worth engaging around important topics even if it's clear the discourse isn't going anywhere, because you never know who might be listening, seeking to learn. To revisit an earlier post, I decided not to worry about the manager of the nursery, and consider instead the walkers just out for a Sunday stroll who may overhear the discussion about the daisies.

I am going to make one claim here in this post and one claim only: when adults look at multiple choice items, we see them differently than students do. Experience, background knowledge, expertise, confirmation bias, 20 years of living - a wide variety of things influence how we read an item. Any teacher who's seen students ace a "hard" item or tank on an "easy" one will know that it's not until students actually take the items that we get a real sense of the item's difficulty.

Item design is a science - and an art. Objectivity plays a large role. BUT:
"One cannot determine which item is more difficult simply by reading the questions. One can recognize the name in the second question more readily than that in the first. But saying that the first question is more difficult than the second, simply because the name in the second question is easily recognized, would be to compute the difficulty of the item using an intrinsic characteristic. This method determines the difficulty of the item in a much more subjective manner than that of a p value." (Basic Concepts in Item Design)
This is why there's field testing. It's why we should field test classroom tests and why states have to field test items from their large scale tests. Test designers (teachers or Pearson writers) do their level best, but we need certain statistics (available only after students have actually taken and responded to an item) to reach conclusions about how high quality an item is. The most common statistic we can use is what's known as a p-value. This value is the proportion of students who answered the item correctly - the higher the p-value, the more students who got an item correct, and the easier the item was for the group of students who took the test. There are guidelines around p-values but generally speaking, "p-values should range between 0.30 and 0.90." There's a lot more to unpack around item difficulty but we'll just leave this here.
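For anyone who wants to see the arithmetic, here is a minimal Python sketch of how an item p-value could be computed from scored responses. The item codes and the 0/1 scores are made up for illustration (they are not NYS data), and the helper names `p_value` and `within_guideline` are my own. The 0.30 to 0.90 range is the guideline quoted above, not a hard rule.

```python
from typing import Dict, List

# Hypothetical scored responses: 1 = correct, 0 = incorrect.
# Each key is a made-up item code; each list has one entry per student.
responses: Dict[str, List[int]] = {
    "item_A": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "item_B": [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    "item_C": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
}


def p_value(scores: List[int]) -> float:
    """Proportion of students who answered the item correctly."""
    if not scores:
        raise ValueError("need at least one scored response")
    return sum(scores) / len(scores)


def within_guideline(p: float, low: float = 0.30, high: float = 0.90) -> bool:
    """True if p falls inside the quoted 0.30-0.90 guideline (inclusive)."""
    return low <= p <= high


p_values = {code: p_value(scores) for code, scores in responses.items()}

# List items from hardest (lowest p-value) to easiest (highest p-value).
for code, p in sorted(p_values.items(), key=lambda pair: pair[1]):
    note = "" if within_guideline(p) else " (outside the 0.30-0.90 guideline)"
    print(f"{code}: p = {p:.2f}{note}")
```

Note that the number only describes the group of students who actually took the item. A different group could produce a different p-value for the very same question.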

In the absence of these p-values, our observations about the difficulty of an item are just that - observations. Hambleton and Jirka (2006), two psychometricians/unicorns, reviewed the literature around estimating item difficulty and found studies where highly qualified, well-experienced teachers were inconsistent when it came to accurately estimating how students would do on an item. "No one in any studies we have read has been successful at getting judges to estimate item difficulty well." Pretty compelling evidence that we need to temper our opinions with supporting evidence from students who, you know, actually took the assessments.

So now onto the game. Let's pick an item, any item. How about Item 132040149_1 from the released Grade 4 Assessment items?

Now, in order for this game to work, you have to play along. Click the link above to read the Pecos Bill story and do your best to answer the question. You may look at it and conclude that it "required skills out of the reach for many young children" or that the steps needed to answer this question are too many and too complicated for 4th graders. Now consider the question below, also from the fourth grade test:

Which one would you expect to be harder for the students? The top one or the bottom one? What's informing your decision making? What evidence are you using? What percent of students do you think got the top item correct? How about the bottom one?

Hit me up here or on Twitter and share your thinking. I'll follow up with the answer in an upcoming blog, provided I get through my rose-tending to-do list.

Chasing Down Pineapple Chasers

Imagine you're strolling with a loved one in a local park one bright Sunday morning. You and your companion pass a cluster of flowers and you overhear another stroller say, "Look at those gross weeds. They should be pulled out, they ruined this entire garden." You look where he's pointing and see the happiest, cheeriest, sunniest, albeit ugliest, bunch of daisies you've ever seen. You look at his face and recognize the speaker as a highly regarded and respected nursery owner. What do you do? What would Emily Post say? What would Freud advise? You look over at your loved one, panic clear on your face. If your loved one is like mine, s/he smiles, squeezes your hand and asks, "is it worth it?"

You decide it's not. No damage done. Who cares that the nursery owner confused a rare variety of daisies with weeds? But then you look over and see a group you recognize from the local gardening club, nodding along. "Bad weeds," you hear one mutter. "Terrible things. They should be yanked." Another pulls out a garbage bag and covers up the bunch of daisies. "No one should have to see these weeds," she says, and you hear the passion and conviction in her voice. Her voice practically vibrates with anger.

The nursery owner is consulting a gardening book and reading aloud the problems he sees with the weeds, and your stomach drops as you recognize he's misreading some of the information. "They're going to strangle the whole garden," you hear him mutter, and you start to twitch, knowing that the daisies actually attract a particular strain of butterflies that help pollinate a different section of the park.

You know because you're a botanist. You spend your professional life studying plants. While your work actually involves roses, you had to study daisies in order to better understand the species you grow. There is the real possibility, you admit, that you're wrong. The longer you stand there, the louder the group gets, the more convinced you are that you must be off-base; daisies aren't THAT necessary and it would be great to use the space for more of your roses.

So, you say nothing. The moment passes. The group is unified by their hate for those damn not-really-weeds. Not much you can say.  So you walk on with your companion, working hard to not give your loved one yet another lecture on the importance of ugly daisies in a well-balanced ecosystem.  On your way out of the park, you hear members of the gardening group telling incoming strollers how the owner of the nursery had, just that morning, published a piece in a national gardening newsletter, "setting the record straight" on those nasty weeds.

What's the role of expertise in conversations like this? Do you, with expertise in flowers, though you way prefer roses to daisies, speak up? Does your obligation to speak up change based on the size of the crowd? Is it changed by knowing the nursery owner isn't fond of you, and has even publicly called you "uninformed?" In truth, last time you spoke up, you had a middle school flashback to being told "your opinion doesn't matter because you're not tall/ short/ athletic/ musical/ smart enough/ right-handed enough" to comment. Even worse, the last time someone spoke up, it seemed some members of the gardening group became even more insistent and vocal about calling the much maligned daisies "weeds."

Help me out here, gentle readers. If you were the botanist, what would you do? What if you were the nursery owner, would you want the botanist to speak up? Is there a right time and place to speak up? Is it worth it?