Public Opinion Quarterly Advance Access originally published online on November 25, 2008
Public Opinion Quarterly 2008 72(5):985-1007; doi:10.1093/poq/nfn060
This article appears in the following Public Opinion Quarterly issue: Special Issue: Web Survey Methods
Effects of Design in Web Surveys
Comparing Trained and Fresh Respondents
Address correspondence to Vera Toepoel; e-mail: v.toepoel{at}uvt.nl.
Abstract
In this paper, we investigate whether there are differences in the effect of instrument design between trained and fresh respondents. In three experiments, we varied the number of items on a screen, the choice of response categories, and the layout of a five-point rating scale. In general, effects of design carry over between trained and fresh respondents. We found little evidence that survey experience influences the question-answering process. Trained respondents seem to be more sensitive to satisficing. The shorter completion time, higher interitem correlations for multiple-item-per-screen formats, and the fact that they select the first response options more often indicate that trained respondents tend to take shortcuts in the response process and study the questions less carefully.
Introduction
Panel surveys, where the same households or individuals are interviewed repeatedly at various points in time, have important advantages over independent cross-sections, such as efficiency gains in recruiting, reduced sampling variation in the measurement of change, and the possibility of analyzing behavior at the individual respondent level (see, e.g., Baltagi 2001
For several reasons, trained respondents may answer questions differently than those with little or no experience as panelists. If respondents have seen questions on the same topics before, they may have adjusted their behavior and, for example, may have acquired more knowledge on these topics or talked about them with others. In this paper, we do not look at this type of panel conditioning but at the effect panel participation in general can have on response behavior, even though the questions address different topics. Panel members may learn from taking surveys and become familiar with the question-answering process, learn how to interpret questions, and make fewer errors than new respondents. Or, conversely, experienced respondents may also learn to reduce the burden of their task and complete the survey more rapidly, at the cost of accurately reading and answering each question, thereby making more errors than fresh respondents.
Analyzing differences in the effect of design between trained and fresh respondents is important for two reasons. First, for survey methodology research it is important to know whether the existing evidence on design effects for fresh respondents also applies to trained respondents (and vice versa), or whether separate experiments are necessary for the two groups. It can also give more insight into what drives the effects of design, since some theories are more relevant for fresh respondents and others for trained respondents. Second, the evidence of design effects implies that the quality of substantive research outcomes can depend on the survey design and often has implications for choosing the best design. It is important to know whether such implications differ for fresh and trained respondents and whether tailoring the design to the experience of respondents is useful.
In particular, we analyze whether trained respondents react differently to web survey design choices than inexperienced respondents. First, they may be able to process more information on a screen and, for example, make fewer errors when multiple items are presented on a single screen. Second, they may have a different tendency to give socially desirable answers and their experience may make them, for example, less reluctant to select a response category that seems unusual in the range of responses. Third, they may react differently to (changes in) question layout. The goal of this study is to explore differences in web design effects between trained and fresh respondents in these three aspects.
Background
Survey experience may influence responses to survey questions. In ongoing household panels, one could in principle test whether the time since respondents entered the panel (the duration) or the number of surveys in which they have participated affects responses. However, in most panels almost none of the respondents are completely fresh, while the effect of panel experience may possibly be nonlinear, with a noticeable difference between no and some experience, but much less or no effect when going from some to more experience. Bartels (1999
Existing studies show that answers to questions in (web) surveys are affected by design choices, such as the ordering of questions (see, e.g., Krosnick and Alwin 1987; Couper, Traugott, and Lamias 2001), the categorical answers that the respondent can choose from (see, e.g., Schwarz et al. 1985; Rockwood, Sangster, and Dillman 1997), or the layout of the questions (see, e.g., Dillman and Christian 2002; Winter 2002a, 2002b; Christian 2003; Christian and Dillman 2004; Toepoel, Das, and Van Soest 2006). Some studies have also analyzed whether such design effects vary with respondent characteristics such as age, gender, or education level (see, e.g., Krosnick and Alwin 1987; Knauper, Schwarz, and Park 2004; Fuchs 2005; Stern, Dillman, and Smyth 2007; Tourangeau, Couper, and Conrad 2007), or attitudes such as a need for cognition or need to evaluate (see, e.g., Toepoel et al. 2009). Despite the growing empirical support for (web) design effects, there exists virtually no reference to respondents' experience in answering surveys. As a result, empirical tests have not taken into account how experience may affect the question-answering process in web surveys. In this study, we analyze the differences in web design effects between experienced and fresh panel respondents.
Experience and the response process
Van der Zouwen and Van Tilburg (2001) find that panel conditioning effects sometimes arise and sometimes not, without a clear indication of the situations in which they occur. Trivellato (1999) concludes that panel participation mainly affects the way in which behavior is reported (response process), while it does not have pervasive effects on behavior itself. Golob (1990) concludes that no panel conditioning effects exist in questions that require simple reporting tasks, suggesting instead that panel conditioning relates to the cognitive difficulty in answering questions. He finds no panel conditioning on car ownership variables that are measured using simple reporting requirements, but he does find panel conditioning effects for more cognitively demanding questions such as travel times for different modes of transport.
Trained respondents may "speed" through the survey to reduce the burden of their task. One way to do that is to answer strategically to avoid follow-up questions (cf. Meurs, Van Wissen, and Visser 1989; Mathiowetz and Lair 1994), but also in questionnaires without follow-up questions (such as those in our experiments), panel experience can make respondents go through the questions faster if they recognize the question structure and layout.
Trained respondents may also have a different social desirability bias than fresh respondents. Sharpe and Gilbert (1998) find that repeated testing decreases the scores on the Beck depression scale and attribute this to socially desirable response behavior, triggered by the first interview. Chan and McDermott (2007) and Wang, Cantor, and Safir (2000) find similar effects.
Coen, Lorch, and Piekarski (2005) compare frequent and infrequent respondents. They find evidence that responses of frequent responders are more in line with actual consumer behavior than those of less frequent responders. This finding is in contrast to the conventional view that past experience is not desirable with regard to measurement errors (Williams 1970; Williams and Mallows 1970; Meurs, Van Wissen, and Visser 1989; Golob 1990; Brannen 1993; Mathiowetz and Lair 1994; Sharpe and Gilbert 1998; Bartels 1999; Sturgis, Allum, and Brunton-Smith 2007). Coen, Lorch, and Piekarski (2005) find no evidence that frequent responders try to speed through the survey. In fact, they find a relatively high number of marks on check-all-that-apply questions, which goes against the idea of answering strategically to avoid follow-up questions.
Experience and web survey design
There is a growing literature that suggests that the design of a web survey has a significant impact on measurement error (see, e.g., Couper, Traugott, and Lamias 2001; Dillman and Christian 2002; Christian and Dillman 2004; Tourangeau, Couper, and Conrad 2004, 2007; Dillman 2007). Design may be more important in web surveys than in other modes of administration, because there are many tools available and because of potential variation in how the survey appears on a screen. Couper (2000) concludes that more work is needed to determine the optimal designs for different groups of people, emphasizing the need for research on panel conditioning and web page design. Despite the widespread use of online panels, there appears to be no empirical research to date on the difference in design effects between trained and fresh respondents.
Items per screen
For web questionnaires, interface design varies in terms of the distribution of questions on the screen and the navigation methods used. At one end of the design continuum are form-based designs that present questionnaires as one long form in a scrollable window, while at the other end are screen-by-screen questionnaires that present only a single item at a time (Norman et al. 2001). Presenting questions in a matrix is somewhere in between, reducing the number of screens without the need for scrolling.
The grouping of related items on a single screen is likely to lead respondents to view the items as related entities, thus increasing the correlation among them (Strack, Schwarz, and Wanke 1991; Schwarz and Sudman 1996; Sudman, Bradburn, and Schwarz 1996; Tourangeau, Couper, and Conrad 2004, 2007; Dillman 2007). Couper, Traugott, and Lamias (2001) conclude that correlations are consistently higher among items appearing together on a screen than among items separated across several screens. However, the overall effect is not large, and none of the differences between pairs of correlations reach statistical significance. Tourangeau, Couper, and Conrad (2004) replicate the above findings. Respondents seem to use the proximity of the items as a cue to their meaning, perhaps at the expense of reading each item carefully. Peytchev et al. (2006) find few differences between paging and scrolling designs.
Nonresponse and time to complete the interview can also be indicators for the optimal number of items per screen. Lozar Manfreda, Batagelj, and Vehovar (2002) find that a one-page design results in higher item nonresponse. Couper, Traugott, and Lamias (2001); Lozar Manfreda, Batagelj, and Vehovar (2002); and Tourangeau, Couper, and Conrad (2004) find that a multiple-item-per-screen design takes less time to complete than a one-item-per-screen design. Evaluation questions can show whether respondents are comfortable with a particular survey design. Toepoel, Das, and van Soest (2009) find that placing more items on a screen negatively influences the respondent's evaluation of the layout.
We are not aware of any studies on the optimal number of items on a screen in relation to the survey experience. Our conjecture is that trained respondents can process more information on a screen, thus showing less item nonresponse when more items are placed on a single screen than fresh respondents. We expect them to complete the survey faster than fresh respondents, especially if many items are placed on a single screen. We also expect them to better evaluate a large number of items on a screen than fresh respondents.
Response categories
Studies on the cognitive and communicative processes involved in answering survey questions suggest that the choice of response categories can have a significant effect on the answers (the "scale range effect," see Tourangeau, Rips, and Rasinski 2000: 249). Toepoel et al. (2009) and Winter (2002a, 2002b) find response category effects in web surveys, while Krosnick and Alwin (1987); Rockwood, Sangster, and Dillman (1997); Schwarz et al. (1985); Schwarz and Hippler (1987); and Strack and Martin (1987) find effects in other modes of administration. Schwarz and Hippler (1987) argue that respondents use the response alternatives to determine the meaning of the question and use the frequency range as a frame of reference, presuming the values stated in the scale to be commonly held values. In other words, a respondent may be reluctant to select a response category that seems unusual in the range of responses. This results in higher estimates along scales that present high rather than low ranges. The literature suggests that response categories have a significant effect on responses to questions for which estimation is likely to be used in recall, whereas in questions in which direct recall is used in response formatting the response categories do not have a significant effect.
The second experiment in this paper assesses the impact of a response scale on both trained and fresh respondents. Because of the existing studies, we expect that trained respondents are less reluctant to select a response category that seems unusual in the range of responses.
Layout
Differences in question layout can lead to detectable differences in survey responses (see, e.g., Schwarz and Hippler 1987; Dillman and Christian 2002; Christian 2003; Christian and Dillman 2004; Tourangeau, Couper, and Conrad 2004; Toepoel, Das, and Van Soest 2006). A question format contains verbal and nonverbal cues that influence respondent behavior. Nonverbal cues include graphical, numerical, and symbolic languages that convey meaning in addition to the verbal language (Dillman and Christian 2002). Jenkins and Dillman (1997) have developed a conceptual framework to explain how visual languages may influence respondent behavior.
Redline et al. (2003) show that the visual and verbal complexity of information in a questionnaire affects what respondents read, the order in which they read it, and, ultimately, their comprehension of the information. Friedman and Friedman (1994) demonstrate that equivalent horizontal and vertical rating scales (graphical manipulation) in paper questionnaires do not elicit the same responses. Schwarz et al. (1985) show that respondents gain information about the researcher's expectations using numerical labels as frames of reference. Schwarz et al. (1991) find that respondents hesitate to assign a negative score to themselves in a face-to-face interview: a scale with numbers 0–10 results in lower scores than a –5 to 5 format. Changing numerical values attached to scales therefore can change the answers.
We are not aware of studies that look at the differences in the effects of layout for trained and fresh respondents. We conjecture that trained panelists may react differently to layout choices than fresh panelists. They may be used to a particular question format so that changing that format (e.g., from disagree–agree to agree–disagree) may not be noticed.
Design and Implementation
To study design effects on trained and fresh respondents, we used two online household panels administered by CentERdata. Both panels are representative of the Dutch (speaking) population in the Netherlands, aged 16 and over. The first, the CentERpanel (see also http://www.centerdata.nl/en/CentERpanel and the online appendix, Section A), has existed for 17 years. Panel members fill out questionnaires every week. Panel duration of respondents ranges from 17 years to a few months. Although the CentERpanel is an Internet-based panel, there is no need to have a personal computer with an Internet connection. If necessary, equipment is provided by CentERdata. Initial recruitment was based on a random sample out of the population register. To correct for attrition, new potential panel members are randomly drawn from the national register of landline numbers. If a household is willing to participate, their contact information and basic demographics are stored in a database. If a household drops out of the panel, a new household is selected from the database of potential panel members on the basis of demographic characteristics.
The second panel is the LISS panel (see http://www.centerdata.nl/en/LISSpanel and the online appendix, Section A), which started in 2007. Panel members complete questionnaires on a monthly basis through the Internet. As with the CentERpanel, Internet access is not a prerequisite for participation. The recruitment of panel members is based on a random sample of addresses drawn from the community registers in cooperation with Statistics Netherlands. Our experiments were the very first questionnaire for this panel. Thus, the CentERpanel consists of trained respondents (varying in panel duration, with a mean duration of 6 years and 8 months, standard deviation of 4 years), while the LISS panel consists of completely fresh respondents.
To analyze compositional differences between the two panels, we compared the distributions of all personal and household characteristics for which we have comparable measurements in the two panels. (Detailed results are made available in the online appendix, Section B.) The oldest age group (65+) is overrepresented in the trained panel, but this is corrected for by weighting. In general, the weighted distributions of the variables which were not used to construct the weights are also similar (and more similar than the unweighted distributions). This gives some confidence that if we find differences in design effects, these are due to panel experience and not compositional differences. The CentERpanel has suffered from attrition over the years and all experiments suffer from nonresponse because panel members did not do the survey in the particular period the experiments were held. We have no reason to expect that this is selective in terms of the factors that lead to design effects, but we cannot completely exclude this possibility with the available data.
We fielded the questionnaire in June 2007. In the CentERpanel, 1,356 panel members were selected to fill out the questionnaire; 981 of them (72.3 percent) responded. In the LISS panel, 4,530 panel members were selected, and 2,809 respondents (62.0 percent) filled out the questionnaire. To correct for differences due to nonresponse, we used weights based on gender, age, and education.
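The text does not spell out how the weights were constructed. Purely as an illustration, the sketch below shows simple cell weighting, in which each respondent's weight is the population share of his or her cell divided by that cell's share in the sample. The cell labels, the population shares, and the function name `cell_weights` are hypothetical and are not taken from the study.

```python
from collections import Counter


def cell_weights(sample_cells, population_shares):
    """Return one weight per respondent: population share / sample share of the cell.

    sample_cells: list of cell labels, one per respondent (hypothetical labels).
    population_shares: dict mapping each cell label to its population share.
    """
    n = len(sample_cells)
    counts = Counter(sample_cells)
    weights = []
    for cell in sample_cells:
        sample_share = counts[cell] / n
        weights.append(population_shares[cell] / sample_share)
    return weights


# Invented toy data: the 65+ group is overrepresented in the sample.
sample = ["16-64"] * 6 + ["65+"] * 4
population = {"16-64": 0.8, "65+": 0.2}
weights = cell_weights(sample, population)
```

With weights of this kind, overrepresented cells count for less than one respondent each and underrepresented cells for more, while the weights still sum to the sample size. Weighting on gender, age, and education jointly would use cells defined by all three variables, or an iterative procedure; which approach was used here is not stated.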
The questionnaire consisted of three different experiments. In the first, we used the Marlowe–Crowne Social Desirability Scale and varied the number of items per screen. The Marlowe–Crowne Social Desirability Scale is the most commonly used tool designed to assess social desirability. We use the 10-item version developed by Strahan and Gerbasi (1972). For each item, the respondent can answer "True" or "False." Five of the items are "reverse worded." For these questions, agreement indicates less social desirability. These items (items 3, 4, 7, 9, and 10; see the online appendix, Section C) were reverse coded. The sum of the 10 items coded as 0 or 1 gives the total social desirability score. We used three groups, with 1, 5, and 10 items per screen. The online appendix (Section C) presents screenshots for the three formats. We added some evaluation questions to determine whether respondents react differently to the number of items displayed per screen.
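The scoring rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the dictionary input format are hypothetical, and rejecting incomplete answer sets is a simplification chosen here, not a description of how the study handled missing items.

```python
# Items that are reverse worded in the 10-item scale, as listed in the text.
REVERSE_WORDED = {3, 4, 7, 9, 10}


def social_desirability_score(answers):
    """Sum the 10 items after reverse coding.

    answers: dict mapping item number (1-10) to True ("True") or False ("False").
    A "True" answer scores 1 on regular items and 0 on reverse-worded items.
    """
    if set(answers) != set(range(1, 11)):
        raise ValueError("expected answers for items 1 through 10")
    score = 0
    for item, answered_true in answers.items():
        if item in REVERSE_WORDED:
            score += 0 if answered_true else 1
        else:
            score += 1 if answered_true else 0
    return score
```

A respondent who answers "True" to every item, or "False" to every item, ends up in the middle of the scale, which is the point of mixing regular and reverse-worded items.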
In the second experiment, we varied the answer scale in four questions. We used the same questions as Toepoel et al. (2009), varying in cognitive difficulty. We used a low response scale, a high response scale, and an open-ended format. The response scales can be found in Appendix A. Screenshots showing these response scales for the first question are included in the online appendix, Section C.
In the third experiment we varied the layout of a five-point rating scale. The first group was presented answer categories in a linear vertical format from positive to negative (excellent, very good, good, fair, and poor). Five other groups were presented with different manipulations. The second group answered from negative to positive, the third in a horizontal format, for the fourth group we added numbers 1 to 5 to the response categories, for the fifth group numbers 5 to 1, and for the sixth group numbers 2 to –2. All manipulations of the five-point rating scale are presented in the online appendix, Section C.
Our experiments formed a separate questionnaire, not preceded by other items. The design effects for the experienced CentERpanel may depend on what these respondents are used to, but the respondents in the trained panel who fill out questionnaires every week have seen a large variety of different formats. Items are sometimes grouped together based on construct, sometimes on several screens to reduce the use of scrolling, but also sometimes presented in a one-item-per-screen format if this is convenient in the questionnaire. Various numerical labels can be added, depending on previous related research. A horizontal and vertical layout are both used, but multiple-item-per-screen formats always use a horizontal layout. In general, in Dutch-speaking countries, incremental scales (e.g., disagree–agree) are used more often than decremental scales (Hofmans et al. 2007), so that trained respondents are more used to incremental scales. In our experiment on layout, however, we used a decremental scale as the reference to stay in line with the US literature (see, e.g., Christian 2003; Tourangeau, Couper, and Conrad 2004).
Results
In this section we discuss the results of the three experiments. For each experiment, we first discuss the response effect and then compare the answers of trained and fresh respondents.
Response effect: Items per screen
We found no evidence that correlations between the 10 items of the Social Desirability Scale were higher when the items were presented on a single screen (10-item-per-screen format) than when presented on 2 screens (five-item-per-screen format) or 10 separate screens (one-item-per-screen format).2
In principle, the web survey software can force the respondent to give a response. If respondents fail to give an answer, they would then be presented with an error message prompting them to go back and choose an answer. We deliberately did not program this feature, so that respondents could proceed without giving answers. We found no evidence that placing multiple items on a screen increased nonresponse.
If more items are placed together on one screen, fewer physical actions (keystrokes or mouse clicks) are required than when items are presented separately. This suggests that placing more items on a single screen would reduce the time needed to complete the questionnaire. However, we found no significant differences in the mean duration3 between formats (1, 5, and 10 items per screen).
Respondents answered some evaluation questions about the social desirability questions:
- How interesting did you find the questions?
- How would you evaluate the duration?
- How clear did you find the wording of the questions?
- How easy was it to answer the questions?
- What did you think of the layout?
- What is your overall opinion of these questions?
These questions were asked on a 10-point scale ranging from 1 ("very poor"/"not at all") to 10 ("very good"/"very much"). The layout of the questionnaire was evaluated best in the five-item-per-screen format (a mean of 8.0 in the five-item-per-screen format, compared to 7.8 in the other formats; F = 3.41, p = .033). We found no other significant differences in evaluation between the three formats.
Combining the 10 social desirability items resulted in an overall score of social desirability. We found no significant differences in social desirability scores between the 1-, 5-, and 10-item-per-screen formats.
Summarizing, we found hardly any evidence for differences between 1-, 5-, and 10-item-per-screen formats. This is in line with the results of Peytchev et al. (2006), but in contrast to the findings of Couper, Traugott, and Lamias (2001); Lozar Manfreda, Batagelj, and Vehovar (2002); and Tourangeau, Couper, and Conrad (2004).
Comparison of trained and fresh respondents: items per screen
There were differences in interitem correlations when the items were presented (1) one item per screen (Cronbach's alpha of .473 for the trained panel and .528 for the fresh panel), (2) 5 items per screen (alpha of .602 for the trained panel and .516 for the fresh panel), and (3) 10 items per screen (alpha of .515 for the trained panel and .498 for the fresh panel).4 Trained respondents had higher interitem correlations for multiple-item-per-screen formats, while fresh respondents showed the highest interitem correlation in the one-item-per-screen version. Moving from one to five items per screen increases interitem correlations among trained respondents but reduces them among fresh respondents. Moving from five to ten items per screen reduces interitem correlations for both trained and fresh respondents, but more so for trained respondents. This suggests that on multiple-item-per-screen formats trained respondents base their answer more readily on surrounding items. Fresh respondents were less affected by the number of items per screen.
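For readers who want to see how an alpha of this kind is obtained, the sketch below implements the standard formula for Cronbach's alpha, k/(k − 1) × (1 − sum of item variances / variance of the total score). It is an unweighted illustration; the function name and input layout are assumptions, and whether the reported alphas were computed with survey weights is not stated in the text.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondent rows (one score per item)."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

Alpha rises as the items covary more strongly, which is why it serves here as a summary of interitem correlation within each format and panel.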
Table 1 shows that the relation of the reverse worded items (items 3, 4, 7, 9, and 10) to overall scale scores (the part–whole correlations) was also weaker when the 10 items were presented on a single screen than when they were presented 5 items per screen or 1 item per screen for the trained panel. Apparently, trained respondents were less likely to notice the reverse wording when the items appeared on a single screen. Trained panelists seem to use the proximity of the items as a cue to their meaning, perhaps at the expense of reading each item carefully. Fresh panelists may be triggered by the new experience of participating in a survey and therefore read each item more carefully.
We found no significant difference in item nonresponse between trained and fresh respondents; 1.2 percent (12 out of 981 respondents) had one or more items missing in the trained panel, compared to 1.5 percent (42 out of 2,809 respondents) in the fresh panel. Linear regression of item nonresponse on the number of items per screen, a dummy for panel (trained versus fresh), and the interaction between these two showed no significant interaction effect.
There was a difference in the mean duration of the entire survey (consisting of all three experiments) between panels (t = –2.4, p = .016): 436 seconds for the trained panel and 576 seconds for the fresh panel. The mean duration to complete just the 10 social desirability items did not differ significantly between panels. Linear regression of the duration of the survey on the number of items per screen, a dummy for panel, and the interaction between these two showed no significant interaction effect either.
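To make the interaction term concrete, the sketch below reduces the design to two formats and two panels. In such a saturated two-by-two regression with dummy coding, the interaction coefficient equals the difference in differences of the four cell means. This is a deliberate simplification: the study used three formats, the exact regression specification is not given in the text, and the durations below are invented.

```python
from statistics import mean


def interaction_contrast(cells):
    """Difference in differences of cell means in a 2x2 design.

    cells: dict keyed by (panel, format) with lists of outcomes, where
    panel is "trained" or "fresh" and format is "single" or "multiple".
    In a saturated dummy-coded regression this equals the interaction coefficient.
    """
    m = {key: mean(values) for key, values in cells.items()}
    trained_effect = m[("trained", "multiple")] - m[("trained", "single")]
    fresh_effect = m[("fresh", "multiple")] - m[("fresh", "single")]
    return trained_effect - fresh_effect


# Invented durations in seconds, for illustration only.
toy = {
    ("trained", "single"): [60, 70],
    ("trained", "multiple"): [50, 60],
    ("fresh", "single"): [80, 90],
    ("fresh", "multiple"): [70, 80],
}
contrast = interaction_contrast(toy)
```

A contrast near zero means the format effect is about the same in both panels. Judging significance additionally requires a standard error, which this sketch does not compute.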
In the trained panel we found a significant effect of format in evaluation question 4 ("How easy was it to answer the questions?"). The five-item-per-screen format received the highest rating (mean of 7.8 in the five-item-per-screen format compared to 7.6 in the other formats; F = 3.32, p = .037). The fresh panel preferred the layout (evaluation question 5) of the five-item-per-screen format to other formats (mean 8.0 in the five-item-per-screen format compared to 7.8 in the other formats; F = 3.82, p = .022), indicating that both trained and fresh respondents prefer several items per screen but not too many.
Although this paper discusses design effects, we also looked at the mean score of the Social Desirability Scale used for the items-per-screen experiment. In contrast to Choquette and Hesselbrock (1987), we found no evidence of differences in social desirability bias between trained and fresh respondents: the mean scores on the Social Desirability Scale were not significantly different (F = 2.16, p = .642).
Response effect: Response categories
To assess the impact of a response scale on respondents' answers, we asked four questions on the frequency of various activities with a randomized answering format: a low response scale, a high response scale, and an open-ended format. See Appendix A for the questions and response scales used. We dichotomized answers to compare the results. A similar experiment on response category effects is described extensively in Toepoel et al. (2009).
We found strong scale range effects in both panels. Table 2 shows that the number of respondents reporting watching TV for more than two and a half hours more than doubled in the high response scale compared to the low response scale. In comparison, answers to the open-ended format were in between the high and low response scale. Similar results are found for the other three questions. Our comparison of a high and low response scale shows similar results as previous research (Schwarz et al. 1985; Krosnick and Alwin 1987; Schwarz and Hippler 1987; Strack and Martin 1987; Rockwood, Sangster, and Dillman 1997; Winter 2002a, 2002b; Toepoel et al. 2009).
Table 3 shows an overview of the correlations between the answer score (1 if more than the reference level, 0 otherwise) and the response format for the different question types. A higher correlation coefficient (eta) between the answer score and the scale used indicates a larger effect of the response scale. With the high versus low response scale, the largest correlation between the answer score and the scale is found in hours watching TV (difficult to process), the lowest for days on holiday (easy to process). In line with previous research (Rockwood, Sangster, and Dillman 1997
Comparison of trained and fresh respondents: response categories
The effect of response categories on answers is not significantly different for trained and fresh respondents. Table 3 shows that the correlations for both panels are similar, both in significance and magnitude. Also, the fact that the effect of response scales depends on how well a behavior is represented in memory carries over between panels. For none of the questions did we find a significant interaction effect between the response scale and the panel. This indicates that experience does not make a panelist less sensitive to response categories. The conjecture that survey experience may make the respondents less uncertain and thus less reluctant to select a response category that seems unusual in the range of responses was therefore not confirmed.
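The eta coefficients in Table 3 relate the dichotomized answer score to the response format. Assuming eta here is the usual correlation ratio (the square root of the between-group sum of squares divided by the total sum of squares), it can be sketched as below. The function name and the answer scores are invented for illustration and are unweighted.

```python
from math import sqrt
from statistics import mean


def eta(groups):
    """Correlation ratio: sqrt(between-group sum of squares / total sum of squares)."""
    values = [v for group in groups for v in group]
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        raise ValueError("answer scores do not vary")
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return sqrt(ss_between / ss_total)


# Invented answer scores (1 = above the reference level) for two scale versions.
low_scale = [0, 0, 0, 1]
high_scale = [0, 1, 1, 1]
effect = eta([low_scale, high_scale])
```

Eta is 0 when the share of respondents above the reference level is the same under every format and 1 when the format fully determines the answer score, so similar values across panels indicate similar sensitivity to the response scale.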
Response effect: Layout
In our third experiment, we manipulated the layout of a five-point rating scale using verbal and nonverbal manipulations. Table 4 presents the question that was asked and shows the answer distributions for all formats for both panels. Format 1 acts as reference layout. The rating scale used in this manipulation is linear vertical, ranging from positive ("excellent") to negative ("poor"). In the second format the orientation is still (linear) vertical but the order is reversed. Format 3 has a horizontal orientation. In formats 4, 5, and 6, numbers are added to the linear vertical orientation in the reference layout: 1 to 5, 5 to 1, and 2 to –2. Table 4 shows that the middle category ("good") has the highest frequency in both panels for all formats, except for the fresh panel in the second format. In that case, the category "fair" was chosen most often.
Table 5 shows that the answers in a negative–positive format differ significantly from those in a positive–negative format (verbal manipulation: 1 versus 2). Respondents selected the response option "very good" less often when it was presented as a fourth alternative. No significant differences were found for the graphical manipulation (1 versus 3). Adding numbers 1–5 to the scale did not lead to significant differences either (1 versus 4), suggesting that respondents take these numerical labels starting with 1 as a default so that adding them explicitly does not convey additional information on the meaning of the scale. When we compare the numerical manipulations, we see that formats 4 and 5 produce significantly different results, as do formats 5 and 6. The effect of adding numbers 5 to 1 is similar to the effect of the verbal manipulation: the response option "very good" is less often chosen. The strongest effect was found when numbers 2 to –2 were added. This manipulation showed significantly different answer scores compared to all other manipulations. Respondents are apparently reluctant to assign negative scores. Negative numbers might be interpreted as implying more extreme judgments than low positive numbers (scale label effect, see Tourangeau, Rips, and Rasinski 2000: 248).
A chi-square test and a difference of means test showed significant differences for all nonverbal manipulations (all formats except format 2), indicating that the layout of the answer categories influences the answers. Also, the overall test comparing all six formats showed significant differences between formats.
Comparison of trained and fresh respondents: layout
Table 5 shows that layout effects carry over between panels. We only found a small difference comparing the numerical formats 4 and 5. The fresh panel shows a significant difference while no significant difference is found for the trained panel. However, when looking at differences in means, both panels show significant results.
A linear regression predicting the answer to the question from dummies for the five format manipulations (with format 1 as reference level), a panel dummy, and interaction terms between the panel dummy and the five formats showed no significant interaction effect between panel experience and the five formats. However, the interaction effect between the panel dummy and the graphical manipulation (horizontal format) approached significance (t = 1.83, p = .07).
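To make the structure of this regression concrete, the sketch below builds one row of regressors for a respondent: an intercept, five format dummies with format 1 as reference, a panel dummy, and the five panel-by-format interactions. The function name `design_row` and the coding of the panel dummy (1 = trained) are assumptions made for illustration; the source does not state how the dummy was coded.

```python
FORMATS = [2, 3, 4, 5, 6]  # format 1 is the reference level


def design_row(fmt, trained):
    """Return [intercept, 5 format dummies, panel dummy, 5 interactions]."""
    format_dummies = [1 if fmt == f else 0 for f in FORMATS]
    panel = 1 if trained else 0  # assumed coding: 1 = trained panel
    # Each interaction is the product of a format dummy and the panel dummy.
    interactions = [d * panel for d in format_dummies]
    return [1] + format_dummies + [panel] + interactions


# A trained respondent who saw the horizontal layout (format 3).
row = design_row(fmt=3, trained=True)
```

A row like this would be passed, together with the respondent's answer, to any ordinary least squares routine; the coefficients on the last five columns are the interaction effects discussed above.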
Although not directly related to design effects, we did find a difference between the trained and fresh respondents. Combining all six formats and looking at the distribution of all answers, independent of the layout manipulations, we found that trained respondents more readily selected one of the first options, while fresh respondents more often selected one of the last options (χ² = 14.93, p = .01). This indicates that there is some difference between trained and fresh respondents, regardless of the layout of the scale. A possible interpretation of this difference is that trained respondents are more sensitive to satisficing and therefore select the first satisfying response category more often (cf. Krosnick and Alwin 1987; Tourangeau, Rips, and Rasinski 2000). These differences can be methodological measurement differences or real differences between the two groups.
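The comparison of answer distributions between panels can be sketched as a Pearson chi-square test on a panel-by-response-option table. The counts below are invented for illustration and are not the study's data; the function name `chi_square` is likewise hypothetical, and the sketch assumes every row and column total is non-zero.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of panel and answer.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counts: rows = panel (trained, fresh), columns = response options.
counts = [
    [30, 40, 50, 20, 10],
    [20, 35, 50, 30, 15],
]
stat, df = chi_square(counts)
```

The statistic is then compared with the chi-square distribution for the returned degrees of freedom to obtain a p-value.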
| Discussion and Conclusions |
|---|
|
|
|---|
Despite the growing empirical support for (web) design effects, there exists virtually no reference to respondents' experience in completing surveys. This means that empirical tests have not taken into account how experience may affect the question-answering process in web surveys. We have tried to gain more insight into the response processes of trained and fresh respondents. We did so by conducting three experiments on web survey design issues with two different panels: a new panel of fresh respondents and a panel that has been in place for seventeen years now, thus consisting of respondents who have extensive experience. The web survey design issues we considered were the effects of the number of items per screen, response category effects, and layout effects.
First of all, the social desirability scale used to assess the impact of a 1-, 5-, and 10-item-per-screen format showed no difference in social desirability scores between the trained and fresh panel. We found no interaction effect between the number of items per screen and panel experience on item nonresponse, time to complete the survey, or evaluation questions. The only small panel experience effect we found concerned interitem correlations for multiple-item-per-screen formats, indicating that trained panelists use the proximity of the items as a cue to their meaning more than fresh panelists do.
With regard to response category effects, we found no significant interaction effect between web survey design and panel experience either; our conjecture that survey experience may make the respondents less uncertain and thus less reluctant to select a response category unusual in the range of responses was not confirmed. The magnitude and significance of response category effects were similar for both panels, as was the relation between response category effects and the type of event in the survey question. In our third experiment, too, the design effects carried over between the trained and fresh panels.
The finding that design effects are largely similar for fresh and trained respondents is reassuring for survey methodology research, since it suggests that experiments on fresh panelists can be extrapolated to trained panels and vice versa. It would also suggest that, for the quality of substantive research outcomes, the best design for a fresh panel is also appropriate for a trained panel and vice versa. It is too early, however, to draw this conclusion without reservations. First, we only consider three specific experiments, and more empirical work is needed to analyze whether our findings are robust. Second, we found some differences between response behavior of the fresh and trained panel that are not design effects but do suggest differences in the question-answering process. Trained respondents seem to be more sensitive to satisficing. The shorter completion time, higher interitem correlations for multiple-item-per-screen formats, and the fact that they select the first response options more often indicate that trained respondents tend to take shortcuts and study the questions less carefully. Our comparisons of distributions of covariates suggest that the two panels are similar and that these differences are due to panel experience, although we cannot exclude the possibility that selective attrition and nonresponse play a role.
Further research into these differences between trained and fresh respondents seems useful, and additional research is needed to determine whether our conclusions hold in different settings and for different questions, as well as to disentangle why the effects occur. Finally, differences by age, gender, education, etc. within the trained and fresh respondent groups are worth additional investigation.
| Supplementary Data |
|---|
|
|
|---|
Supplementary data are available online at http://poq.oxfordjournals.org/.
| Appendix A |
|---|
|
|
|---|
Questions and answer categories in experiment 2: Response category effects
NOTE.—Answer categories one to five in Format A match answer category one in Format B. Answer category six in Format A matches answer categories two to six in Format B.
| Footnotes |
|---|
VERA TOEPOEL AND MARCEL DAS are with the CentERdata, Tilburg University, PO Box 90153, 5000 LE Tilburg, The Netherlands. ARTHUR VAN SOEST is with the Department of Econometrics & OR, Tilburg University and also with the Netspar; RAND; OSA; IZA; DIW, PO Box 90153, 5000 LE Tilburg, The Netherlands. We are grateful to Mick Couper, Joachim Winter, two anonymous referees, the (other) participants of the MESS workshop in Zeist, August 2008, and the (other) participants of the Panel Survey Methods Workshop in Essex, July 2008, for useful comments.
1 In this paper we speak of trained or experienced respondents rather than professional respondents. The term "professional" implies incentives (money) as a stimulus to participate, while in this paper we consider the effect of prior survey experience (training).
2 This is based upon comparing values of Cronbach's alpha, the commonly used association measure in the analysis of this type of social desirability scale. See, for example, Ray (1984); Fisher and Fick (1993); and Beretvas, Meyers, and Leite (2002).
3 Means were calculated after deleting outliers with more than two times the standard deviation (28 respondents in the fresh panel and 4 respondents in the trained panel).
4 Cronbach's alpha is based upon Pearson correlation coefficients which, for dichotomous variables, may underestimate true reliability since its upper bound is less than 1 unless the two variables have the same mean. This implies that its values cannot be compared with the usual reliability threshold of .7 for continuous variables. It does not, however, hamper the usefulness of Cronbach's alpha in comparing reliability in two surveys with the same dichotomous measurements.
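For readers who want to reproduce this kind of reliability comparison, the following is a minimal sketch of Cronbach's alpha for dichotomous (0/1) items, using the standard variance-based formula. The item data and function names are hypothetical; the same function would be applied to each panel's responses and the two values compared.

```python
def variance(xs):
    """Sample variance with an n - 1 denominator."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def cronbach_alpha(items):
    """items: list of item columns, each holding one score per respondent."""
    k = len(items)
    # Total scale score per respondent (sum across items).
    totals = [sum(responses) for responses in zip(*items)]
    item_variance = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


# Hypothetical dichotomous items answered by five respondents.
items = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]
alpha = cronbach_alpha(items)
```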
| References |
|---|
|
|
|---|
Baltagi Badi H. Econometric Analysis of Panel Data (2001) Chichester: Wiley.
Bartels Larry M. Panel Effects in the American National Election Studies. Political Analysis (1999) 8:1–20.
Beretvas S. Natasha, Meyers Jason L., Leite Walter L. A Reliability Generalization Study of the Marlowe–Crowne Social Desirability Scale. Educational and Psychological Measurement (2002) 62:570–89.
Brannen Julia. The Effects of Research on Participants: Findings from a Study of Mothers and Employment. Sociological Review (1993) 41:328–46.
Chan Jason C., McDermott Kathleen B. The Testing Effect in Recognition Memory: A Dual Process Account. Journal of Experimental Psychology: Learning, Memory, and Cognition (2007) 33:431–7.
Choquette Keith A., Hesselbrock Michie N. Effects of Retesting with the Beck and Zung Depression Scales in Alcoholics. Alcohol and Alcoholism (1987) 22:277–83.
Christian Leah M. The Influence of Visual Layout on Scalar Questions in Web Surveys. Unpublished Master's Thesis. (2003) Retrieved January 3, 2007 on http://survey.sesrc.wsu.edu/dillman/papers.htm.
Christian Leah M., Dillman Don A. The Influence of Graphical and Symbolic Language Manipulations to Self-Administered Questions. Public Opinion Quarterly (2004) 68:57–80.
Coen Terrence, Lorch Jacqueline, Piekarski Linda. The Effects of Survey Frequency on Panelists Responses. (2005) ESOMAR retrieved July 27, 2007 on www.websm.org.
Couper Mick P. Web Surveys. A Review of Issues and Approaches. Public Opinion Quarterly (2000) 64:464–94.
Couper Mick P., Traugott Michael W., Lamias Mark J. Web Survey Design and Administration. Public Opinion Quarterly (2001) 65:230–53.
Dillman Don A. Mail and Internet Surveys. The Tailored Design Method (2007) Hoboken, NJ: Wiley.
Dillman Don A., Christian Leah. The Influence of Words, Symbols, Numbers, and Graphics on Answers to Self-Administered Questionnaires: Results from 18 Experimental Comparisons. (2002) Retrieved March 3, 2007 on http://survey.sesrc.wsu.edu/dillman/papers.htm.
Fisher Donald G., Fick Carol. Measuring Social Desirability: Short Forms of the Marlowe–Crowne Social Desirability Scale. Educational and Psychological Measurement (1993) 53:417–24.
Friedman Linda W., Friedman Hershey H. A Comparison of Vertical and Horizontal Rating Scales. The Mid-Atlantic Journal of Business (1994) 30:107–202.
Fuchs Marek. Children and Adolescents as Respondents. Experiments on Question Order, Response Order, Scale Effects and the Effect of Numeric Values Associated with Response Options. Journal of Official Statistics (2005) 21:701–25.
Golob Thomas F. The Dynamics of Household Travel Time Expenditures and Car Ownership Decisions. Transportation Research (1990) 24A:443–63.
Hofmans Joeri, Theuns Peter, Baekelandt Sven, Mairesse Olivier, Schillewaert Niels, Cools Walentina. Bias and Changes in Perceived Intensity of Verbal Qualifiers Effected by Scale Orientation. Survey Research Methods (2007) 1:97–108.
Jenkins Cleo R., Dillman Don A. Towards a Theory of Self-Administered Questionnaire Design. In: Survey Measurement and Process Quality—Lyberg L., Biemer P., Collins M., de Leeuw E., Dippo C., Schwarz N., Trewin D., eds. (1997) New York: Wiley. 165–96.
Knauper Barbel, Schwarz Norbert, Park Denise. Frequency Reports across Age Groups. Journal of Official Statistics (2004) 20:91–6.
Krosnick Jon A., Alwin Duane F. An Evaluation of a Cognitive Theory of Response-Order Effects in Survey Measurement. Public Opinion Quarterly (1987) 51:201–19.
Lozar Manfreda Katja, Batagelj Zenel, Vehovar Vasja. Design of Web Survey Questionnaires: Three Basic Experiments. Journal of Computer-Mediated Communication (2002) 7:3. http://jcmc.indiana.edu/vol7/issue3/vehovar.html.
Mathiowetz Nancy A., Lair Tamra J. Getting Better? Changes or Errors in the Measurement of Functional Limitations. Journal of Economic & Social Measurement (1994) 20:237–62.
Meurs Henk, van Wissen Leo, Visser Jacqueline. Measurement Biases in Panel Data. Transportation (1989) 16:175–94.
Norman Kent L., Friedman Zachary, Norman Kirk, Stevenson Rod. Navigational Issues in the Design of On-Line Self-Administered Questionnaires. Behavior & Information Technology (2001) 20:37–45.
Peytchev Andy, Couper Mick P., Esteban McCabe Sean, Crawford Scott D. Web Survey Design. Paging versus Scrolling. Public Opinion Quarterly (2006) 70:596–607.
Ray John. The Reliability of Short Social Desirability Scales. The Journal of Social Psychology (1984) 123:133–4.
Redline Cleo D., Dillman Don A., Carley-Baxter Lisa, Creecy Robert. Factors That Influence Reading and Comprehension in Self-Administered Questionnaires. (2003) Paper presented at the Workshop on Item-Nonresponse and Data Quality, Basel, Switzerland, October 10, 2003. Retrieved March 1, 2007 on http://survey.sesrc.wsu.edu/dillman/papers.htm.
Rockwood Todd H., Sangster Roberta L., Dillman Don A. The Effect of Response Categories on Questionnaire Answers: Context and Mode Effects. Sociological Methods and Research (1997) 26:118–40.
Schwarz Norbert, Knauper Barbel, Hippler Hans-J., Noelle-Neumann Elisabeth, Clark Leslie. Rating Scales: Numeric Values May Change the Meaning of Scale Labels. Public Opinion Quarterly (1991) 55:570–82.
Schwarz Norbert, Hippler Hans-J. What Response Scales May Tell Your Respondents: Informative Functions of Response Alternatives. In: Social Information Processing and Survey Methodology—Hippler H.-J., Schwarz N., Sudman S., eds. (1987) New York: Springer. 163–78.
Schwarz Norbert, Hippler Hans-J., Deutsch Brigitte, Strack Fritz. Response Scales: Effects of Category Range on Reported Behavior and Comparative Judgments. Public Opinion Quarterly (1985) 49:388–95.
Schwarz Norbert, Sudman Seymour. Answering Questions (1996) San Francisco: Jossey–Bass Publishers.
Sharpe J. Patrick, Gilbert David G. Effects of Repeated Administration of the Beck Depression Inventory and Other Measures of Negative Mood States. Personality and Individual Differences (1998) 24:457–63.
Stern Michael J., Dillman Don A., Smyth Jolene D. Visual Design, Order Effects, and Respondent Characteristics in a Self-Administered Survey. Survey Research Methods (2007) 1:121–38.
Strack Fritz, Martin Leonard L. Thinking, Judging, and Communicating: A Process Account of Context Effects in Attitude Surveys. In: Social Information Processing and Survey Methodology—Hippler H.-J., Schwarz N., Sudman S., eds. (1987) New York: Springer. 123–48.
Strack Fritz, Schwarz Norbert, Wanke Michaela. Semantic and Pragmatic Aspects of Context Effects in Social and Psychological Research. Social Cognition (1991) 9:111–25.
Strahan Robert, Gerbasi Kathleen C. Short, Homogeneous Versions of the Marlowe–Crowne Social Desirability Scale. Journal of Clinical Psychology (1972) 28:191–3.
Sturgis Patrick, Allum Nick, Brunton-Smith Ian. Attitudes over Time: The Psychology of Panel Conditioning. In: Methodology in Longitudinal Surveys—Lynn P., ed. (2007) Chichester: Wiley. 1–13.
Sudman Seymour, Bradburn Norman M., Schwarz Norbert. Thinking About Answers (1996) San Francisco: Jossey–Bass Publishers.
Toepoel Vera, Vis Corrie, Das Marcel, van Soest Arthur. Design of Web Questionnaires: An Information-Processing Perspective for the Effect of Response Categories. (2009) Forthcoming in: Sociological Methods and Research Special Issue on Web Surveys.
Toepoel Vera, Das Marcel, van Soest Arthur. Design of Web Questionnaires: The Effect of Layout in Rating Scales. (2006) CentER Discussion Paper 2006-30, CentER: Tilburg.
Toepoel Vera, Das Marcel, van Soest Arthur. Design of Web Questionnaires: The Effects of the Number of Items per Screen. (2009) Forthcoming in: Field Methods 21(2).
Tourangeau Roger, Rips Lance J., Rasinski Kenneth. The Psychology of Survey Response. (2000) Cambridge: Cambridge University Press.
Tourangeau Roger, Couper Mick P., Conrad Frederick. Spacing, Position, and Order. Interpretive Heuristics for Visual Features of Survey Questions. Public Opinion Quarterly (2004) 68:368–93.
Tourangeau Roger, Couper Mick P., Conrad Frederick. Color, Labels, and Interpretive Heuristics for Response Scales. Public Opinion Quarterly (2007) 71:91–112.
Trivellato Ugo. Issues in the Design and Analysis of Panel Studies: A Cursory Review. Quality & Quantity (1999) 33:339–52.
Van Der Zouwen Johannes, van Tilburg Theo. Reactivity in Panel Studies and its Consequences for Testing Causal Hypotheses. Sociological Methods & Research (2001) 30:35–56.
Wang Kevin, Cantor David, Safir Adam. Panel Conditioning in a Random Digit Dial Survey. Proceedings of the Section on Survey Research Methods (2000) 822–7.
Williams William H. The Systematic Bias Effects of Incomplete Responses in Rotation Samples. Public Opinion Quarterly (1970) 33:593–602.
Williams William H., Mallows Colin L. Systematic Biases in Panel Surveys Due to Differential Nonresponse. Journal of the American Statistical Association (1970) 65:1338–49.
Winter Joachim K. Bracketing Effects in Categorized Survey Questions and the Measurement of Economic Quantities. (2002) Discussion Paper No. 02-35, Sonderforschungsbereich 504, University of Mannheim. Retrieved March 1, 2007 on http://www.sfb504.uni-mannheim.de/publications/dp02--35.pdf.
Winter Joachim K. Design Effects in Survey-Based Measures of Household Consumption. (2002) Discussion Paper No. 02-34, Sonderforschungsbereich 504, University of Mannheim. Retrieved March 1, 2007 on http://www.sfb504.uni-Mannheim.de/publications/dp02--34.pdf.