Formal Experiment Design

This experiment is designed to assess various ways of presenting the rating scale used to rate the courses and instructors. The course and instructor rating information is a fundamental component of the SIMians Course Comment Forum. We therefore would like to design scales that are intuitive and that can be quickly understood by the user with a high degree of accuracy.

We have already informally tested two different versions of the scale. In our low-fi prototype and first interactive prototype, we provided a strictly graphical rating system. This system allowed the user to assign a rating consisting of "thumbs-up" images or "thumbs-down" images (with three "thumbs-down" representing the worst possible rating and three "thumbs-up" representing the best possible rating). The heuristic evaluation performed by the McInterface group reported that this rating scale was difficult to understand.

We subsequently changed the rating scale for the second interactive prototype. The current system features a strictly numeric scale where the user selects a number from 1 to 5 (with 1 representing not recommended or low difficulty, and 5 representing highly recommended or high difficulty). Although the results of the informal usability tests for the second prototype suggest that this is a better rating system, it would be useful to formally test various ways of presenting the rating scale.

Hypothesis 1A: A strictly numeric rating system will be understood by the user more quickly and more accurately than a strictly graphical rating system.

The informal usability test results suggest that the numeric system is superior to the purely graphical system. This may have been a function of the "thumbs-up/thumbs-down" metaphor, which may not be familiar to some users. However, even if a more familiar metaphor were used, such as a scale consisting of a varying number of stars (or some other icon such as pencils or books), we feel that users would more easily comprehend the difference in meaning between a higher number and a lower number than the difference between a larger set and a smaller set of images.

Hypothesis 1B: A combined graphical and numeric rating scale will be understood by the user more quickly and more accurately than either a strictly graphical or a strictly numeric rating system.

We feel that a rating scale combining numbers and images would be optimal: the images would visually represent the meaning of the numbers, while the numbers would provide greater clarity.

Hypothesis 2A: A scale of 1 to 3 will come closer to capturing the user's "true" rating than a scale of 1 to 10.

One tester of the second interactive prototype explicitly stated that a scale of 1 to 10 would be too large. We think that a user will have less difficulty choosing a rating that reflects his/her true opinion with a smaller rating scale than with a larger one.

Hypothesis 2B: A scale of 1 to 5 will come closer to capturing the user's "true" rating than either a scale of 1 to 3 or a scale of 1 to 10.

We feel that there is a lower limit to the useful size of the scale, and that a scale of 1 to 3 would not provide adequate granularity for the user to choose a rating reflecting his/her true opinion.

Factors (Independent Variables):

Factor 1- Presentation of the scale [Between-Subjects]:

Response Variables:

Response Variable 1- The amount of time it takes for the user to assign ratings for course difficulty and instructor.

Response Variable 2- Whether the user chooses not to rate the course difficulty and/or the instructor.

Response Variable 3- Whether the user asks for clarification regarding the rating system during the test.

Response Variable 4- The user's satisfaction with the scale, as measured post-test using Likert scales.

Response Variable 5- The user's opinion regarding whether the ratings he/she assigned accurately reflected his/her opinions of the course difficulty and instructor, as measured post-test using Likert scales.

Table 1: Blocking of Experiment by Factor Level

Presentation of the Scale	P-N	P-G	P-NG
Span of the Scale	S-3	S-5	S-10
	S-10	S-3	S-5
	S-5	S-10	S-3
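
The arrangement in Table 1 has the structure of a 3x3 Latin square: each span level appears exactly once under each presentation level and exactly once in each row. The following is a minimal sketch, not part of the original design, that rebuilds the table by rotating the span levels and checks that property. The function names are hypothetical, and what each row stands for (for example, a subject group or a trial order) is an assumption, since the text does not say.

```python
# Levels are taken from Table 1. The interpretation of the rows is an
# assumption; only the column headers are named in the document.
presentations = ["P-N", "P-G", "P-NG"]
spans = ["S-3", "S-5", "S-10"]


def cyclic_latin_square(levels):
    """Build a square in which row i is the level list rotated right by i."""
    n = len(levels)
    return [[levels[(col - row) % n] for col in range(n)] for row in range(n)]


def is_latin_square(square):
    """Return True if every row and every column holds each level exactly once."""
    levels = set(square[0])
    rows_ok = all(
        len(row) == len(levels) and set(row) == levels for row in square
    )
    cols_ok = all(set(col) == levels for col in zip(*square))
    return rows_ok and cols_ok


table_1 = cyclic_latin_square(spans)

for row in table_1:
    # Pair each span level with the presentation level of its column.
    print(dict(zip(presentations, row)))

print("Latin square:", is_latin_square(table_1))
```

A cyclic rotation is only one of several ways to construct such a square; it is used here because it happens to reproduce the rows of Table 1 exactly.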

Given the nature of the website, we would prefer to limit the testers to SIMS students. Optimistically, we could test approximately half of the student body, or about 36 subjects. This would come out to four repetitions of each combination of scale presentation and scale span. That is, there would be 12 repetitions for each scale presentation level and 12 repetitions for each scale span level. Realistically, we might only be able to recruit about a fourth of the SIMS student body, which would cut the number of repetitions per cell in half, to two.
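
The repetition counts above follow from dividing the subjects evenly across the nine presentation-by-span cells. The sketch below works through that arithmetic under two assumptions that are not stated in the text: each subject contributes to exactly one cell, and "about a fourth" of the student body is taken to be 18 subjects (half of the 36 quoted for "approximately half"). The function name is hypothetical.

```python
N_PRESENTATION_LEVELS = 3  # P-N, P-G, P-NG
N_SPAN_LEVELS = 3          # S-3, S-5, S-10
N_CELLS = N_PRESENTATION_LEVELS * N_SPAN_LEVELS  # 9 combinations


def repetitions(n_subjects):
    """Return (per cell, per presentation level, per span level).

    Assumes subjects are split evenly and each subject is assigned to
    exactly one presentation/span combination.
    """
    per_cell = n_subjects // N_CELLS
    per_presentation_level = per_cell * N_SPAN_LEVELS
    per_span_level = per_cell * N_PRESENTATION_LEVELS
    return per_cell, per_presentation_level, per_span_level


# 36 is the figure quoted in the text; 18 is the assumed "fourth".
for label, n in [("optimistic", 36), ("realistic", 18)]:
    cell, pres, span = repetitions(n)
    print(f"{label}: {n} subjects, {cell} per cell, "
          f"{pres} per presentation level, {span} per span level")
```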