Interview with Aaron Thornton Ph.D.

Topic: Error Control in Word Recognition Testing

BECK: Hi Aaron. Nice to speak with you again.

THORNTON: Good Morning Doug. Nice to be here.

BECK: Aaron, one of the papers you published in the late 1970s was the Thornton-Raffin paper (Thornton AR, Raffin MJ. Speech discrimination scores modeled as a binomial variable. J Speech Hear Res, 1978, 21:507-518) addressing statistical significance and word recognition tests. That paper always fascinated me. I thought everyone in hearing healthcare really needed to read it and understand the implications. Can you please review the reasons you wrote the paper and your impressions 25 years later?

THORNTON: Sure, I'd be happy to. Ever since I started in audiology, there was a need to reduce test time. As you know, many people shorten the standard 50-word tests by dividing them into two 25-word half-lists, and most scoring sheets organized the standard 50-item word lists into two 25-word columns to accommodate either full- or half-list testing. Those among us who used the full 50-item lists would often tally the 25-word columns separately before adding them for the full-list score. Consequently, we noticed frequent large discrepancies between the two half lists for many patients, despite studies showing close equivalency in difficulty among the half lists. These observations led us to try using multivariate statistical analysis to find the optimal 25-word subsets that would best predict the full-list scores. This was a bad course of action because we misunderstood the cause of the differences in half-list scores.

BECK: Just to be clear, we're talking about the CID-W22 lists, and the NU-6 lists?

THORNTON: Yes, that's right.

BECK: And we can save this for another discussion, but really the list itself doesn't matter, does it?

THORNTON: You're correct. The key to the problem is in the scoring of the tests, not in the nature of the words themselves, but we will get to that later. Mike Raffin and I took scoring sheets for patients and entered the response, correct or incorrect, for every individual test item into a database. Our initial sample was 1030 patients for each of the four 50-item word lists, for a total of 4120 patients and 206,000 scored words. Eventually, we tripled the number of observations in the database.

BECK: Incredible.

THORNTON: That gave us a large sample, especially for that time. In my naiveté, we actually did a stepwise multiple regression and let the analysis specify which subset of 25 words had the highest correlation with the full 50-word list, for each of the four lists. And we used half the data for analysis and the other half for verification. We could've stopped there, published it and said, "Look -- we found the best 25 items." But I couldn't help looking at the data and doing things like reviewing the distribution of errors for all patients who had 78% on their full-list scores and asking, "What was the distribution of 25-word scores?" And I asked, "What if you randomly select 25 words and you do it over and over and over again, how much do these 25-word scores vary?"
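The resampling question posed here can be tried directly. The following is only a minimal sketch, not the analysis from the paper: it assumes a single hypothetical patient record with 39 of 50 words correct (78%), and the function name, number of draws, and seed are illustrative choices.

```python
import random
import statistics

def half_list_scores(n_correct=39, n_items=50, half=25, draws=10_000, seed=0):
    """Percent-correct scores for random `half`-item subsets of one scored list."""
    rng = random.Random(seed)
    # One hypothetical patient's item-by-item record: True = correct, False = error.
    record = [True] * n_correct + [False] * (n_items - n_correct)
    scores = []
    for _ in range(draws):
        subset = rng.sample(record, half)  # 25 words drawn without replacement
        scores.append(100 * sum(subset) / half)
    return scores

scores = half_list_scores()
print(min(scores), max(scores), round(statistics.mean(scores), 1))
```

The mean of the half-list scores stays near 78%, but individual 25-word subsets drawn from the same record spread widely around it.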

BECK: Once you start down that road, the analysis becomes essentially infinite. What did you learn?

THORNTON: Well, I learned that the question I posed wasn't any good! By observing the details I noticed there was tremendous skewing of data. The error variance wasn't symmetric and it wasn't constant. If you looked at patients with 50% speech discrimination ability, the variance of the 25-word subsets was huge compared to the variance of 25-word subsets for people who had 98% discrimination. And as I was trying to puzzle through it, a colleague from the statistics department said "Well gee, that's what you would expect when things are scored binomially." He was right! The results were characteristic of binomial variance. I realized the way we had thought about speech discrimination scores, and the way we tried to set up fixed criteria for significant differences (i.e. +/- 6%) had no relationship to the characteristics of the tests and the testing error! Our original quest to find 25 words that worked like 50 was just plain nonsense.
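The dependence of the variance on the score follows from the standard binomial formula. As a small illustration, assuming each word is an independent right/wrong trial with the same probability p of a correct response (the function name is ours):

```python
import math

def score_sd(p, n_items):
    """Standard deviation, in percentage points, of a percent-correct score
    when each of n_items words is scored right/wrong with probability p."""
    return 100 * math.sqrt(p * (1 - p) / n_items)

for p in (0.50, 0.80, 0.98):
    print(f"true score {p:.0%}: SD on 25 words = {score_sd(p, 25):.1f} points")
```

On a 25-word list the standard deviation is 10 points at a true score of 50% and under 3 points at 98%, which is why no single fixed criterion such as +/- 6% can fit all scores.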

BECK: So if someone scored very well on the first test, such as 96% correct, on subsequent tests they would be expected to also score very well. However, if their first score was 48%, or something similar, the variance in their second score could potentially be enormous?

THORNTON: Exactly. Therefore, when people compared hearing aids using word recognition tests, and somebody would score 50% with hearing aid A and 64% with hearing aid B, many people looked at that and said "Hearing aid B is better." Of course that is simply not the case.

BECK: That would be so nice and easy -- if only it were valid!

THORNTON: Ahhh - you did read the study! And you're correct again. The higher score on hearing aid B does not mean hearing aid B is better! The fact is that if you retested hearing aid A and hearing aid B, hearing aid A might get 64% and hearing aid B might get 50%. The differences would be more influenced by sampling variance and the differences between lists, rather than differences between hearing aids!

BECK: So phonetically balancing the lists is not an issue?

THORNTON: No. For the purposes that these tests serve, the phonetic balancing doesn't seem to matter very much. We are generally interested in measuring information loss in the transmission and reception of speech, and the information is pretty evenly distributed across phonemes. At one time there was considerable attention given to list equivalencies, but these are most meaningful for group data used in research on normally hearing subjects. When you throw in the variable of ear disease, the equivalencies based on a homogeneous normal subject pool have less validity. For a test such as the W-22 or NU-6 I prefer to pool the 200 words and randomly draw 50 for a test. It has fewer problems with regard to accidentally repeating specific lists with patients, which happens all too frequently.
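Pooling the words and drawing a fresh list for each test is simple to do in software. A minimal sketch, in which the pool entries are placeholders standing in for the 200 recorded words and the function name is hypothetical:

```python
import random

def draw_test_list(pool, n_items=50, seed=None):
    """Draw n_items distinct words at random from the pooled word lists."""
    rng = random.Random(seed)
    return rng.sample(pool, n_items)

# Placeholder pool: stand-ins for the 200 words (four lists of 50).
pool = [f"word_{i:03d}" for i in range(200)]
test_list = draw_test_list(pool, seed=1)
print(len(test_list), len(set(test_list)))
```

Because the draw is without replacement, no word repeats within a test, and a given patient is unlikely to meet the same 50-word combination twice.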

BECK: I did a lot of work on speech perception with cochlear implants in the early 1980s. We looked at balancing tests phonetically and I was never sure it mattered.

THORNTON: It probably doesn't matter, for the same reasons. Equivalent tests would correlate strongly and have the same mean for target subjects, regardless of whether or not they were phonetically balanced. That's not to say that paying attention to phonetic balancing wouldn't be useful in achieving the goal, just that phonetic balancing isn't the goal, in and of itself.

BECK: Excellent point! So when the paper first came out, how was it accepted among the clinicians?

THORNTON: What's very interesting is that the paper almost did not come out! Two reviewers agreed that our paper should not be published. One of them said, "There's nothing new here." He argued, correctly, that anybody who's studied test characteristics already knew this information, and he asked, "Why on earth should something like this be published again?" The second reviewer said the paper shouldn't be published because it's antithetical to everything that we know about speech testing. He said, "This can't possibly be right. It's just complete nonsense." Based on the certainty of the second reviewer, the editor decided that the paper should be accepted.

BECK: I think you're right, in the statistical literature and in behavioral sciences there probably wasn't much new there, but as far as applying it to clinical audiology there was quite a bit to think about! I can imagine it must have caused uneasiness among the clinicians who read it!

THORNTON: No, not really. It was largely ignored for quite a while. Eventually it became sort of a cult piece that some academics love to teach, but when people go out in the field they scarcely apply the knowledge.

BECK: What I remember from that paper, and tell me if I'm anywhere near correct because I don't have it in front of me, was you could look on the chart, and if you used a 25-word list, and if the first word recognition score was 80%, then there was a 95% probability that a repeated test would have a score of +/- 12%, more or less?

THORNTON: Yes. That is the general idea, but the variability is not symmetric. So, if the first score was 80%, there would be a 95 percent probability that the second score would fall in the range 56-96%. The "plus 16%, minus 24%" error range is not symmetrically distributed around the 80%.
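An asymmetric range of this kind can be illustrated with an arcsine (angular) transform, a common variance-stabilizing device for binomial scores. The sketch below is offered under stated assumptions and is not presented as the exact procedure of the 1978 paper: it assumes a Freeman-Tukey style transform, a transformed-score variance of 1/(n + 0.5), and two independent tests of equal length. The function names are ours.

```python
import math

def angular(x, n):
    """Freeman-Tukey style arcsine transform of x correct out of n items."""
    return (math.asin(math.sqrt(x / (n + 1)))
            + math.asin(math.sqrt((x + 1) / (n + 1))))

def retest_range(x, n, z=1.96):
    """Lowest and highest retest scores (percent) that do not differ
    significantly from x correct out of n, at the level implied by z."""
    # Assumed variance of one transformed score: 1 / (n + 0.5), so the
    # difference of two independent scores has variance 2 / (n + 0.5).
    limit = z * math.sqrt(2 / (n + 0.5))
    t0 = angular(x, n)
    ok = [k for k in range(n + 1) if abs(angular(k, n) - t0) <= limit]
    return round(100 * min(ok) / n), round(100 * max(ok) / n)

print(retest_range(20, 25))  # 80% on a 25-word list -> (56, 96)
```

Under these assumptions the sketch gives 56-96% for a first score of 80% on 25 words, the same range quoted above, and the asymmetry arises because the transform compresses scores near 100%.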

BECK: So basically, when you have someone in the office with a word recognition score of 88%, and they come back six months later and have 96%, you cannot say they had an improvement in hearing.

THORNTON: Exactly, not even if you had tested with 50-word lists both times. With computerized systems, such as the Tympany Otogram, we print out a "percent correct" score and an upper and lower confidence limit so we have the two bracketing numbers printed out alongside the patient's score, and the clinician doesn't have to compute it. With the computer-assisted audiometer developed at the Massachusetts Eye and Ear Infirmary, the error range is shown graphically on a PI-function. We were able to simplify the interpretation and make the error range appear symmetric by scaling the percent-correct axis for equal variance. This permitted an instant recognition of the significance of any score differences.

BECK: So, if we actually were to use the full 50 word list, as opposed to the 25, what happens to the results?

THORNTON: The more words you use, the lower the error variance and the more certainty you have that the test result represents the patient's discrimination ability for that test. Basically, the error decreases as a function of the square root of the number of items in the test. If you look at most psychometric tests with binomial scoring, hardly anything is done under 100 items. But even with 100 words, there is a +/- 13% error range around a score of 50% at the 95% confidence level.
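The square-root relationship means that halving the error takes four times as many words. A short sketch using the plain binomial standard deviation (a simplification; the function name and the target values are illustrative):

```python
import math

def items_needed(p, target_sd_points):
    """Smallest list length whose binomial SD (in percentage points)
    does not exceed target_sd_points for a true score p."""
    return math.ceil(10_000 * p * (1 - p) / target_sd_points ** 2)

print(items_needed(0.5, 10))  # 25
print(items_needed(0.5, 5))   # 100
```

At a true score of 50%, a 10-point standard deviation needs 25 words, while a 5-point standard deviation needs 100.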

BECK: But getting back to the issue you mentioned a while ago, the clinical time involved is enormous. I couldn't imagine sitting there with a patient and doing 50 words per ear per test, let alone 100 words.

THORNTON: Many people use full lists in the clinic, which has always been the practice at the Massachusetts Eye and Ear Infirmary. Brian Walden, at Walter Reed, actually implemented a practical 100-item test by using paired words. For example, the carrier phrase is spliced to two test words, "You will say shoe carve," and the patient repeats back "shoe and carve." It takes no more time than the one word item to be repeated, and it effectively doubles the number of items without sacrificing the accuracy that you need.

BECK: What about other methods of obtaining accurate scores without running out the clock?

THORNTON: One way of conserving time is to save it on a subgroup of your population. If a person is going to get a normal speech discrimination outcome, which we might define as 92-100%, then you don't need to pin it down any finer than that. Your treatment of the patient won't change.

BECK: So I guess we can talk about ten word screening tests and where they came from?

THORNTON: These have been around for quite some time, but I don't think that they are well understood. In fact, the "word difficulty" is included in the article Mike and I wrote. But the hardest words are not necessarily useful for screening. The data showed very poor sensitivity and specificity for a simple screening based on the 5 or 10 hardest words. That is, such a screening would pass too many people with very poor word recognition ability and it would fail too many with normal capability. Some of the harder words are poorly articulated and missed by people with normal hearing. So, we went back and looked at the difficulty of the words for each group of patients with differing discrimination abilities. For example, everyone who got 92% or better on the 50 words was one group. Everyone who got 90% or poorer was another group. Then we looked at how a single word performed against each of those groups. The ideal word is one that is heard correctly by everyone with 92% or better word recognition and heard incorrectly by everyone with 90% or poorer word recognition.
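The per-word comparison described here can be sketched as a simple tally. The records below are invented purely for illustration (only two words per patient are shown), and the function name and data layout are assumptions, not the structure of the original database.

```python
def item_screening_stats(records, item, cutoff=92):
    """For one word, return (fraction correct among patients at or above the
    cutoff, fraction missed among patients below the cutoff)."""
    normal = [words for score, words in records if score >= cutoff]
    poorer = [words for score, words in records if score < cutoff]
    hit = sum(words[item] for words in normal) / len(normal)
    miss = sum(not words[item] for words in poorer) / len(poorer)
    return hit, miss

# Toy records (invented): (full-list score in percent, results for two words)
records = [
    (96, [True, True]),
    (94, [True, False]),
    (84, [False, False]),
    (70, [False, True]),
]
print(item_screening_stats(records, 0))  # (1.0, 1.0): separates the groups
print(item_screening_stats(records, 1))  # (0.5, 0.5): no better than chance
```

A word approaching the ideal scores close to 1.0 on both measures; a word that is merely hard scores low on the first.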

BECK: Sure, that would be terrific.

THORNTON: We identified a subset of words that closely approximated that ideal and then put them together in four 10-item tests, one for each of the W-22 word lists. We included these on the first CD of American speech materials, which we produced under a contract with Qualitone. The QMASS recordings are still available through Qualitone and Starkey.

BECK: And if I recall, the error analysis of the 10 item screening tests was fairly amazing.

THORNTON: Yes. In an analysis of 14,754 patients having full list scores of less than 92%, only 101 (0.68%) passed the ten-item test. And when errors occurred, they weren't extreme. In other words, we predicted 92-100% but the occasional miss was rarely below 88%. We were happy with that and we used that test for my last 15 years at the Eye and Ear Infirmary. Basically, you give the first 10 items, and if the patient gets them all correct, you're done. You saved a huge amount of time. However, if the patient misses one item, you continue to finish all 50 words and now you have the precision of the 50-word list, but only on the people who need it. We also evaluated whether presenting the harder screening words first would affect the 50-item score. It didn't.
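The screen-then-continue rule can be written out in a few lines. This is a simulation sketch with hypothetical names: it takes a complete list of right/wrong results, with the ten screening words first, and reports how many items would actually have been presented.

```python
def word_recognition_test(responses, n_screen=10):
    """responses: right/wrong results in presentation order, screening
    words first. Returns (items_used, result)."""
    screen = responses[:n_screen]
    if all(screen):
        # Perfect screening score: report the normal range and stop.
        return n_screen, "92-100%"
    # Any miss on the screening words: finish the list and score all of it.
    score = 100 * sum(responses) / len(responses)
    return len(responses), f"{score:.0f}%"

print(word_recognition_test([True] * 50))                          # (10, '92-100%')
print(word_recognition_test([True] * 9 + [False] + [True] * 40))   # (50, '98%')
```

The second call shows that a single miss among the first ten words triggers the full 50-word test, even when the final score turns out to be in the normal range.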

BECK: Was there ever a time when you presented the first 10 words and if they missed a word, you finished with a 25-word list?

THORNTON: No, never. If they miss an item on the 10 word list, you need the accuracy of the 50 word list to really know what's going on.

BECK: And then you took it another step with the Tympany Otogram?

THORNTON: Yes, with the Otogram, because the computer continually calculates these probabilities while the testing is going on, we set up a fixed criterion for the amount of variability you're willing to tolerate.

For example, with a 50-word list the error range is +/- 18% for a patient scoring 50%. And for 25 words, the error is +/- 24%. So, if you want to set up a fixed error criterion of say +/- 15%, or whatever, then as the patient is taking the test and getting words right and wrong, you can take their score at any moment in time and calculate whether or not you have reduced the error variance down to your criterion. So what happens is that for people with good speech discrimination you can reach the criterion after giving only 12 words. And if their discrimination is a little worse, it may take 20 words. If it's a mid-range score, it'll probably take 50 words. So the test doesn't have a fixed number of items. It has a fixed error tolerance. It's a much better way of doing this, and we're making the precision of the test uniform irrespective of the patient's score. You can only do that using computer scoring.
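The idea of a fixed error tolerance rather than a fixed number of items can be sketched as a stopping rule. How the Otogram actually computes its criterion is not described here, so the following is an assumption throughout: it uses the half-width of a Wilson score interval as the error measure, and the function names and the 0.15 tolerance are illustrative. The stopping points it produces need not match the word counts quoted above.

```python
import math

def wilson_half_width(correct, n, z=1.96):
    """Half-width of the Wilson score interval for correct out of n."""
    p = correct / n
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return spread / (1 + z * z / n)

def adaptive_test(responses, tolerance=0.15):
    """Present words until the interval half-width falls to the tolerance
    or the list runs out. Returns (items_used, percent_correct)."""
    correct = 0
    for n, is_correct in enumerate(responses, start=1):
        correct += is_correct
        if wilson_half_width(correct, n) <= tolerance:
            break
    return n, 100 * correct / n

good = adaptive_test([True] * 50)        # patient who repeats every word
mid = adaptive_test([True, False] * 25)  # patient near 50% correct
print(good[0], mid[0])
```

The patient with perfect responses reaches the tolerance after far fewer words than the patient scoring near 50%, which is the behavior described: test length adapts so that precision is roughly uniform across scores.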

BECK: I know we're way past our scheduled time, but can you just give me a few random thoughts on live versus recorded, and perhaps male versus female presentations?

THORNTON: Well, we could spend hours on this, but briefly... Live tests have added variability regarding talker and inter-talker differences. Male and female voices are also a source of tremendous variability. For example, consider a patient with word recognition ability on Hirsh recordings of the W-22s of 64%. Suppose I test with live voice, 1000 words, and get the binomial variance down to essentially nothing, and the real score becomes 78%. But then we substitute a female speaker with a different voice, and the score is 58%. Even though we're using the same lexical items and the same words, the acoustic signals are different and the intelligibility is different. Has the patient's hearing changed or is the difference due to the two talkers? That's the problem. You've taken a horribly variable test and made it even more variable by introducing live versus recorded and gender-based voice issues. Even the same person giving live presentations day-after-day has variable presentations. Your day-to-day performance is not perfectly consistent even if you have the same voice. It's very hard to control vocal effort. This has been published repeatedly -- intelligibility varies dramatically with vocal effort, to say nothing of attention, time constraints, yawning, swallowing, distractions, etc. So, getting back to your question, a single talker who is very highly disciplined and always does it the same way every time might maintain a fairly reasonable consistency with themselves, but they will still be far more variable than a recording! Recorded presentations are perfectly replicable; live presentations are not!

BECK: So if you were seeing a patient today and you wanted to get a reasonably accurate score, would you even consider doing a live voice test?

THORNTON: No. We had some 30 audiologists working at the Infirmary and every time a patient came in they would see a different audiologist. There's no way you could have any consistency using live voice testing. We would almost always find the patient's responses to the recorded list produced markedly different results from the live voice testing in other clinics. I think the whole area of speech discrimination testing just has to be rethought and we need to change the practices, perhaps by first eliminating the practice of scoring tests as the percent of words correctly repeated. Transforming percent scores to equivalent AI would bridge many of the differences among current tests, but there are even better solutions for today's needs.

BECK: Aaron, the implications of this are simply amazing. It reminds me of the old Firesign Theatre album "Everything You Know Is Wrong." I'd like to continue to explore this with you sometime in the near future.

THORNTON: Sure Doug. That would be great. Thanks for your interest in these matters.
