Setting up a Speaker Shootout or Component Comparison the Right Way
Some of the most popular articles we ever publish on Audioholics are speaker and cable shootouts. A shootout is really just a comparison of two or more products. This sounds like a fairly straightforward process: you place two competing products in the same room and take a listen/look. But the reality is that it is much more complicated. There are many ways you can affect the outcome of a shootout through placement, accessories, even where you have your listeners sit. In this technical article, we'll explore many of these issues and give you some helpful hints on how to set up the most valid and fair comparison possible.
The first concept to address is the purpose of a comparison. Is the purpose of a comparison to determine absolute quality? No. A comparison is NOT about rendering judgment on absolute performance - it is about comparing two or more things. Through the comparison you ferret out differences. That is all. What changes a comparison into a shootout is that you ask one critical question at the end - "So, which did you like better?" That is an evaluation. That is a shootout.
When Audioholics does comparisons, we spend the vast majority of the write-up discussing differences. Basically, this is because value judgments are really only useful to a handful of people. Our readers know their preferences. They can read the description of a product (no matter how much we personally like a particular item) and decide if it is something that would interest them. Sure, at the end we give our evaluation, but that isn't the meat of the article. The meat is the comparison - as it should be.
Picking the Products
The first step, as you might imagine, is to pick the items you'd like to compare. Are you interested in speakers, CD players, interconnects? You probably already have something in mind. Many enthusiasts have a particular product they'd like to compare (usually something they own or something they want to own) and a limited selection of comparison models. Maybe they want a new pair of speakers and want to know if the ones at the store are better than the ones they currently own (and, more importantly, whether they are a large enough improvement to justify the price). In a more professional setting (like an Audioholics shootout), we decide what type of product to cover and start to collect samples from manufacturers.
Once you've decided what you want to compare, you've got two different scenarios. If you are looking to upgrade (or just change) from one product to the other, MSRP doesn't matter. What matters is the amount of money the upgrade will cost you. This might just be the MSRP, or it might be the MSRP minus the money you can get for your speakers on Craigslist or eBay. Sales, other discounts, taxes, and possibly shipping will also have to be considered in this calculation. Once you arrive at that number, you have a basis for determining if the performance increase is worth the expenditure.
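If you want to formalize that math, here is a minimal sketch of how the effective upgrade cost works out. All the prices, tax rate, and resale value below are hypothetical - plug in your own numbers.

```python
# Hypothetical numbers only -- substitute your own quotes and estimates.
new_pair_price = 1499.00   # sale price of the new speakers (not MSRP)
resale_value   = 450.00    # realistic Craigslist/eBay value of the old pair
sales_tax_rate = 0.07      # your local tax rate
shipping       = 60.00     # inbound shipping, if any

# Effective cost of the upgrade = what goes out minus what comes back in.
effective_cost = new_pair_price * (1 + sales_tax_rate) + shipping - resale_value
print(f"Effective upgrade cost: ${effective_cost:,.2f}")
```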
In a professional setting, MSRP is extremely important. Not street price, not b-stock price, not how much you can get them for off an auction or classified website. MSRP is the value the manufacturer places on their product. It tells the consumer what other products should perform similarly. Now, many manufacturers will set an MSRP on their product and immediately put it on "permanent sale." This is a marketing tactic designed to trick consumers into thinking they are getting a great deal on a high-performing piece of equipment. Generally, the street price reflects the performance much more closely than the MSRP does. In a comparison setting, you want to compare like MSRPs instead of street prices. Will such a product compare unfavorably to others in its category? Probably. But there is nothing you can do about that. The big problem with using street prices is that they are malleable. What costs $100 on Black Friday costs $250 on sale the next day (MSRP $399). Do you use the Black Friday price or the regular sale price?
Some people love to throw "ringers" into their comparison. These usually take the form of a product that is either way above or way below the MSRP range in question. I highly recommend that you don't do this. No one wins. If it is way above the MSRP and does well, no one is surprised and you've just wasted everyone's time. Not only that, but you've probably lowered the ratings of the other products because of the unfair comparison. On the flip side, if it is a lower MSRP and does badly, you've artificially inflated the ratings of the rest and perhaps kept your observers from being more critical of the items that were lower performing than the rest but not nearly as bad as the ringer. Again, this is a waste of time. Now, if the low MSRP one does well or the high MSRP one does poorly, you've just alienated either the manufacturers of the rest (the former case) or the one (the latter). No matter how you look at it, you lose. Don't do it.
Equalizing the products on price is only the first step. To reduce bias (for a more complete treatment, please read my extensive discussion on the matter), you'll need to make as many of the variables the same as possible. Type, size, configuration… anything and everything that can be equalized should be. If you are comparing amps, you may want to select only based on MSRP. Of course, you could whittle that down and choose only digital (or Class A/B… etc.). For speakers, you could compare bookshelf speakers only, you could limit the size, whether they are ported or not (or even the location of the port), the size of the woofer… whatever you want. The danger of over-limiting is that you may limit yourself down to a very small sample size (i.e. there may only be a few products available that fit your requirements). The key is to limit as much as you can without making your comparison too narrow. Also, make sure that your limiting factors are "real." Is the shade of the button backlighting really going to make a difference in an amp's performance? If not, don't limit your comparison group based on that factor.
Consumers doing a comparison naturally do this, and they do it in a way that would be invalid in a professional setting but is perfectly valid for them. Their limiting factors are often price (not MSRP but what they have access to within their budget), availability, and looks, with the famous WAF (wife acceptance factor) often playing an overly large role in the decision. In a professional comparison, you need to be more systematic than that. Since a consumer only needs to identify the components they want to buy, it doesn't matter that they haven't equalized on type or MSRP. In a professional comparison, you are doing so for an audience. A consumer has an audience of one.
Validity is a touchy subject when choosing which items to include or exclude from your sample group. For a professional comparison, there are always people that will criticize your choices based on over- or under-limiting. The idea is to get the "big" things that will make the most difference. With speakers, no one would disagree that it is unfair to compare a bookshelf to a tower speaker. On the other hand, some would think it is valid to limit on woofer size or orientation. As a general rule, that's a much more "controversial" criterion and can probably be dropped if you wish. Of course, if you happen to have a bunch of speakers all with 6.5" drivers, perhaps you could limit on it. It's really up to you.
The next, and probably most important, thing to equalize is everything else. EVERYTHING. Whereas before I was suggesting that you make judgment calls based on the things that will make the most difference, here you need to be very careful. What you want to do is make sure that every other component of the comparison is as identical as possible. Use the same components, cables, cable lengths, similar placement, identical connection method, etc. Any deviation from an identical setup will bring down the wrath of the dissenters and, I might add, rightfully so. When you are doing a comparison, you want to ensure, as much as possible, that the differences can be attributed to the two items in question. If you hook up one pair of speakers with a mid-level receiver and the other with an external dedicated amp, people will cry foul. No matter how many times you reassure them that the amps weren't clipped and the speakers were played within their tolerances, there will always be the question of whether the differences heard were a function of the different amplification methods. It is just safer all the way around if you equalize everything.
On the other hand, if you are testing a component (like a CD player, amp, or receiver), you'll want to make sure you are using the same speakers, components, connection methods, etc. The idea is to isolate the items under consideration. Everything behind and in front of them in the chain should be the same. Even things like cable length (which no self-respecting engineer would suggest makes an audible difference) should be equalized as much as possible. Why give people fuel for the fire when you don't have to?
Setting up the Comparison
The first step in doing a comparison is picking the room. For consumers looking to make a purchase, this should be YOUR room. Showrooms are NEVER, EVER like your room. They are either much better or, more likely, much worse (if you take even the smallest amount of our advice on this site). Try to do in-home auditions as much as possible. Remember, a store with a good return policy (check for those restocking fees) is just begging you to take items home for comparison. These days, even the internet-direct companies are starting to loosen up their shipping policies, offering free shipping (at least one way, sometimes both). Check the cost of the return shipping - it very well may be worth it for the peace of mind that you bought the right equipment.
In a professional setting, you'll want to pick a room where you can easily fit the gear, the participants, and everything else that is required. An acoustically treated room is best - preferably something that has a fairly flat frequency response. This ensures that changes will be more audible than in a non-treated room.
Yes, I said more audible and not just audible. Why? The room is affecting all the items equally - at least in theory. So if there is a 75Hz suckout, it is there for all the items. If the room has been measured, you probably already know where the problems are and can warn the participants or adjust the results afterwards. So any negative (or positive, for that matter) effects will be applied to all the components equally. This is also why it is important to equalize all the other components in the system. Any effect any one of them has on the sound will be the same for both items under comparison. As long as the effect is the same, it shouldn't stop the participants from determining the differences in the comparison units. So, getting the best room possible is definitely the goal, but having a less than perfect room is NOT a valid reason to discount a comparison's results.
Author's Note - This idea of a negative effect of a room or component not really mattering assumes a minor effect. Small lapses here or there will not overly taint the results of a comparison. Since all the comparison items are affected equally, the participants shouldn't even be aware of them. Larger effects (or, for that matter, small effects at critical points) can taint the entire thing and make the results suspect. For example, if you are comparing subwoofers and the room you are in has a suckout at 30Hz, subs that roll off around 28Hz will sound like they die well before that, while subs that extend deeper will seem to kick back in below the null. Generally though, a few smaller dips or spikes shouldn't make much of a difference.
The next step is to set up the components under comparison. For electronics, it is fairly straightforward in that you just have to have everything accessible. Displays might present a bit more of a problem, but as long as the lighting conditions are about the same and the displays are equidistant from the observers, you should be okay. Speakers, as you might imagine, present a special case.
Where a speaker is placed in a room can make a big difference sonically. Distance from walls, toe-in, distance from each of the listeners, acoustical treatments, wall materials and more can all make a fairly substantial difference in imaging, soundstage, perceptions of brightness, etc. The accepted method of combating this is to place the speakers so that the pairs are staggered. So if the right speaker is on the outside, the left speaker is on the inside (rather than having one pair on the outside and one on the inside). Personally, I'm not convinced this is the best solution, but it looks to be the best one available. If I had a research grant, that'd be one of the first things I'd look at. I'd suggest, if you have the time, using the staggered method but switching the speakers at least once to see if the listeners' perceptions are any different.
I'll talk more about blind/double blind comparisons later, but when you are setting up your room, you'll want to consider whether or not you are going to use some sort of screen to hide the components. For displays, at the very least you should hide the logo, though some might be able to tell which is which from the bezel. Amps, receivers, cables, other sources and electronics should all be screened off from the listeners. Speakers, again, present a bit of a problem.
Some people believe that using a screen, even one that is designed to be acoustically transparent, attenuates the high end. Again, personally, I'm not convinced. Sure, you may be able to measure a bit of a difference, but without credible proof I'm not buying that it is an audible difference. That being said, nearly every speaker manufacturer on the planet makes a grill for their speakers and most make them out of fabric. Remove the grill and put up a screen. While I, like many enthusiasts, do critical listening with the grills off, most of a speaker's duties will involve a grill. If nothing else, you haven't unfairly hamstrung anyone.
The last thing to remember (and this is a biggie) - level match. Nothing will unduly skew a comparison like having one component louder than the other. It is well documented that people associate loudness with quality (and, for that matter, brightness with quality). If one component is louder than the other, it will consistently be rated more favorably. As an aside, making sure that your listeners switch seats during each and every comparison is a way to offset any placement issues that may arise from being nearer to one speaker than another.
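As a rough illustration of level matching, here is a small sketch that computes the trim needed to bring every unit down to the level of the quietest one. The SPL readings are made-up; substitute your own meter measurements taken at the listening position with the same test signal.

```python
# Made-up SPL readings at the listening position, same test signal for each unit.
spl_readings = {"Speaker A": 84.6, "Speaker B": 82.1}  # dB SPL

target = min(spl_readings.values())  # match everything to the quietest unit
for name, spl in spl_readings.items():
    trim_db = target - spl           # negative value = cut this unit by that many dB
    print(f"{name}: measured {spl:.1f} dB, trim {trim_db:+.1f} dB")
```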
Doing the Comparison
The first decision you'll probably make after picking the components is who you want to do the comparison. If this is an end user, the only opinion that matters is your own (and maybe a few in your family). In a professional setting, you want people that you know have golden ears, right?
Wrong.
The difference between a person with "golden" ears and others is 90% experience (provided there is no hearing loss). There is nothing whatsoever wrong with having a "regular" person off the street as part of your participant group. In fact, I'd suggest that it would lend your comparison more validity. A reviewer is someone who has very definite tastes anyhow. That very well may make them a bit more biased towards speakers/components that sound like their own, whereas Joe Average goes in with no preconceptions.
That being said, it is probably easiest to include reviewers in a professional comparison. First of all, they already know the vocabulary. They will be able to describe what they are hearing in a way that is both easy to read and understand. Reviewers are used to doing these types of comparisons and can probably do them in half the time (or less) it would take a regular person. When you are trying to do a number of comparisons in a single day, this can make a world of difference.
Listener fatigue is the concept that you get desensitized to a listening test after an extended period of time. This is true, especially at louder volumes. The way to combat this is to limit the number of comparisons your group is doing and to TAKE BREAKS. Give your listeners a chance to relax between comparisons. When listening at higher volumes (these comparisons tend to be loud) it is especially important to take breaks to relieve fatigue. Also, vary the volume often. Not only will this reduce fatigue, but it gives your participants a chance to experience the components at a variety of volume levels.
I've mentioned the importance of switching seats before, but let me reiterate here. Even when you are not evaluating speakers, where one seat might be ever so slightly closer to one speaker than another, room effects are not as uniform as you might think. While you might measure a 10dB suckout at a particular frequency in one seat, the next seat over might have a much less dramatic suckout or even a bump! Switching seats during each comparison will help balance out some of those room effects.
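If you want to be systematic about the seat switching, a simple round-robin schedule works. Here is a quick sketch; the listener names and the number of comparisons are placeholders.

```python
# Rotate listeners one seat per comparison so everyone hears each pair from a different spot.
listeners = ["Listener 1", "Listener 2", "Listener 3", "Listener 4"]
num_comparisons = 4

for c in range(num_comparisons):
    shift = c % len(listeners)
    seating = listeners[shift:] + listeners[:shift]  # seats left to right
    print(f"Comparison {c + 1}: {seating}")
```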
I've addressed the idea of the sighted vs single blind vs double blind listening tests before, but let me sum up. Essentially, in a sighted test, everyone sees everything. The listeners know what they are listening to at all times. They might not know exactly which component is being used but they know which ones are being compared. The single blind test is where the listeners don't know which components are playing but the facilitator does. The double blind test is where no one (listener or facilitator) knows which are playing until after the comparison. Most comparisons in audio are either sighted or single blind. Double blind usually takes equipment that most people just don't have (including us here at Audioholics).
The problems with the sighted test are obvious. If you know that your favorite speakers are playing, you're obviously going to be biased toward them. With the single blind test, many problems are eliminated, but you have the problem of the facilitator affecting the results purposely or even unconsciously. There is a ton of research out there on experimenter expectancy bias and more, but I'll let those interested read about all of that on their own time. Personally, I believe that if you take a few precautions, you'll be just fine. Here is a short list of suggestions:
- Rules for the comparison should be set beforehand so that each participant will know what to expect.
- Comparison pairs should be randomly selected (see the sketch after this list).
- Comparison pairs should be masked/disguised as much as possible. Components not in the current comparison must also be out of view of the participants.
- The facilitator may not speak to anyone during each comparison (participants or anyone else).
- The facilitator must be out of sight of the participants during as much of the comparison as possible.
- The facilitator should allow the users to switch freely between the comparison pairs at will as many times as necessary.
- Once the comparison is over, notes should be collected from the participants, and any revisions should be made only through the facilitator.
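For the random selection mentioned above, even a trivial script beats letting the facilitator pick the order by habit. Here is a minimal sketch; the component names are placeholders.

```python
import random

# Shuffle the order of the comparisons and flip a coin for which unit plays first in each.
pairs = [("Speaker A", "Speaker B"), ("Speaker A", "Speaker C"), ("Speaker B", "Speaker C")]

random.shuffle(pairs)
for first, second in pairs:
    if random.random() < 0.5:        # coin flip so neither unit always leads
        first, second = second, first
    print(f"Compare {first} first, then {second}")
```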
The problem with the single blind test is that the facilitator can sometimes give off clues to their preferences. That is why I suggest no talking (even talking to others about unrelated things since it might have an effect) and being out of sight from the participants as much as possible. The facilitator should basically be invisible.
Also, I would highly recommend doing both sighted and single blind tests (the blind test first). If nothing else, it would be interesting to see how the participants' observations change when they know what they are listening to. If their observations don't change, that too is interesting. Regardless, I would NOT reveal which components were which in the blind test until well after all the comparisons (sighted and blind) are completed. I'm a bit of a sadist in this regard, and I often don't let the participants know until they read the final report.
Finally, while I've talked about the importance of the facilitator not interacting during the comparison, it is probably more important that the participants don't interact. They need to have a completely unique and authentic experience that they can report. Also, make sure after every comparison you collect the forms so that they aren't modified after the inevitable discussion during the breaks.
Collecting the Data, Writing the Results
I am a big believer in forms. For a comparison like this, they can really make a big difference. If you have the time, get your participants to agree on opposed pairs of descriptive terms. Bright/laid back, flabby/tight, red/blue… whatever they agree are good terms to describe the components they are comparing. Then, for each comparison, they will rate each of the components on each pair of terms. So if the pair of terms is bright/laid back, for each component they'd indicate whether it was more bright or more laid back:
Bright 1 2 3 4 5 6 7 8 9 10 Laid Back
If you think the numbers are too biasing, you could just use asterisks or something. It doesn't matter. Normally, though, this is a lot of work. Oftentimes the facilitator will have to come up with the paired terms themselves. Too often, I think, these comparisons are left too open-ended. Each participant is asked for their subjective opinion of each component without much, if any, direction. This is a mistake, as you'll have a hard time integrating the results at the end. At the very least, you should ask a few directed questions like, "Describe the top/mid/low end of the speaker," or, "Did you notice any coloration of the sound with the amp?" etc. At the end of each form, I'd have a simple "Which did you prefer?" question. I don't think a "Why?" is necessary, but many times participants want to explain themselves, so you may want to include it. Remember, you can always leave things out of your final report; it is much harder to gather information after the fact. It is a delicate balance of asking for enough information without overly taxing your participants.
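To make the collation step easier later, it helps to think of each completed form as structured data. Here is a minimal sketch of one participant's form for a single component; the term pairs, scores, and answers are invented examples.

```python
# One participant's form for one component in one comparison -- example values only.
form = {
    "participant": "Listener 1",
    "component": "Speaker A",
    "scales": {                        # 1 = first term, 10 = second term
        ("bright", "laid back"): 3,
        ("flabby", "tight"): 8,
    },
    "open_ended": {
        "Describe the top/mid/low end": "Forward top end, neutral mids, tight bass.",
    },
    "preferred": True,                 # answer to "Which did you prefer?"
}
```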
Writing up the Results
At the end, you should have a stack of paper from each of the participants with their observations of each of the component pairs. It is now time to collect them into a format that is easy to read and understand. If you have asked multiple choice questions, you can get the average of the responses and report that. Even if you have nothing but open-ended questions, you can summarize the results by saying, "Nine out of ten of the participants thought that amp A was forceful and eight thought amp B was more veiled." With a small number of comparison pairs and participants, it is fine to just reprint the subjective comments verbatim. With larger groups, you'll need to summarize. My favorite way is to put numbers to as many of the comments as possible (like above), with a few examples taken directly from the notes to back them up. Lastly, any comments that don't match other participants' can be included verbatim at the end. The larger the group of participants, the more you'll want to collate the data.
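Following on the form sketch above, collation can be as simple as averaging the scale scores and counting preferences per component. A rough sketch with made-up forms:

```python
from statistics import mean
from collections import Counter

# A few made-up forms; real data would come from every participant's sheet.
forms = [
    {"component": "Speaker A", "bright_laid_back": 3, "preferred": True},
    {"component": "Speaker A", "bright_laid_back": 5, "preferred": True},
    {"component": "Speaker B", "bright_laid_back": 7, "preferred": False},
]

for comp in sorted({f["component"] for f in forms}):
    subset = [f for f in forms if f["component"] == comp]
    avg = mean(f["bright_laid_back"] for f in subset)
    wins = Counter(f["preferred"] for f in subset)[True]
    print(f"{comp}: avg bright/laid-back {avg:.1f}, preferred by {wins} of {len(subset)}")
```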
Understand that this is the place where the facilitator's bias can also creep in. If it is possible to have an outside party write the summary, that would probably be best. Otherwise, it is imperative that the facilitator make sure they aren't letting in their own bias. This often won't take the form of adding or changing comments, but instead will involve omitting comments/observations that they don't agree with. The easiest way to avoid this is to allow the facilitator to have a section at the end where they discuss their own impressions of the comparison pairs. Also, participants should be able to read the final report and comment on it before publication. It is possible that through the editing process a sentence was modified for clarification and somehow lost the original intent. It is important that participants have a chance to catch those mistakes.
The end of the report will include any evaluation of the components - basically a summary of the last question on your form. This is probably going to be the most controversial part of the report, so it is very important to include the comments as close to verbatim as possible.
Conclusion
The thing to keep in mind (and I said it at the beginning of this article) is that a comparison is just that - a comparison. This is not a method of determining absolute quality. What you are uncovering is how one component differs from another. You'll find as you summarize the results of your comparison that people will feel differently about a component based on which component they are comparing it to. A component might be thought of as high quality and wonderful in one comparison and completely slammed in the next. This is normal and should be expected. The only true sign of quality is if a component fails or triumphs in ALL or most comparisons. This is an indication of a truly exceptional or inferior product.