A strategy at Statistics Sweden to test data collection instruments

Andreas Persson, Cognitive methodologist, Process Department, Statistics Sweden, SE-701 89 Örebro, Sweden
Eva Elvers, Process owner Design and plan & Build and test, Process Department, Statistics Sweden, Box 24 300, SE-104 51 Stockholm, Sweden, eva.elvers@scb.se
There are many methods to test questionnaires and other data collection instruments (for example, expert reviews, cognitive interviews, and experiments). The methods have different strengths and weaknesses, and their costs also vary. Not all surveys can be tested with all methods; there has to be a balance with regard to survey importance, consequences of data errors, available resources, and costs. Statistics Sweden (SCB) has developed a strategy for how to test different surveys, taking the abovementioned factors into account. This strategy is based on survey characteristics, mainly taken from a database with a set of classifications for many surveys. Three characteristics have been chosen: official statistics (yes or no), importance for society (two categories), and importance of correctness (three categories). For other surveys, similar characteristics have been decided. Based on how a survey is classified on the chosen characteristics, the strategy proposes different levels of testing. These levels vary in both ambition and the methods included. SCB introduced this strategy in the autumn of 2010, and since then many surveys have successively been taken on.
1. Methods to test data collection instruments
The questionnaire plays a fundamental role in the production of survey statistics. It is the tool through which data are collected from the respondents. Flaws in the questionnaire can lead to negative consequences in many different areas. As such, it is important to test the questionnaire in advance.
An important question is, then, how the questionnaire should be evaluated. Survey methodology and
practice have established a number of different methods for this purpose [4]. At Statistics Sweden,
we regularly use expert reviews, quick reviews (an initial screening of potential problems in a
questionnaire), cognitive interviews with probing, vignettes and think-aloud protocols [5], and
debriefings with interviewers or data editors. In specific cases, we also use monitoring, behavior
coding [3], and quantitative data, with or without experimental design, to evaluate individual
questions or the questionnaire as a whole.
These methods differ in many ways. For example, some require collected quantitative data, substantial resources, and developed hypotheses, whereas others do not. An important question is, then, how these methods are best applied. That is, given a certain set of circumstances, which method or combination of methods is optimal? There are a few studies in the literature examining how different evaluation methods compare or overlap [2], [6]. In general, however, the literature lacks research comparing evaluation methods. Moreover, research also shows that different practices within methods can lead to different results [1]. Hence, the results from the few studies that do compare evaluation methods may be difficult to generalize, since many of the evaluation methods (for example, cognitive interviews or interviewer debriefings) lack standardized practices. Which method(s) should be applied in a given situation therefore remains a rather open question.
2. Questionnaire testing at a statistical agency
In practice, at a statistical agency, factors beyond the methodological usually play a role in the choice of evaluation methods. Time and financial resources often constrain the evaluation of the questionnaire. All surveys have fixed budgets, and even though questionnaire evaluation is important, it is still only one of many important factors in good survey design, all competing within the same survey budget. In practice, at a statistical agency, one has to consider costs, benefits, and the big picture: the total design of the survey. As such, evaluating the questionnaire is important (see above) but perhaps not always or indiscriminately. The total design, the expected resulting quality, and the budget of the survey have to be considered.
Different surveys have questionnaires that differ in status. One important difference, among many, is whether the questionnaire is new (i.e. in draft status) or established and already evaluated. Questionnaire evaluation obviously does not serve the same purpose in these two situations. Such factors should influence in what way, and perhaps even whether, the questionnaire should be evaluated at all, given the perspective of total design and set budgets. Statistics Sweden has developed a set of conditions that stipulate whether a survey's questionnaire should be evaluated before data collection or not. The situations where testing is required are the following (a minimal check of these conditions is sketched after the list):
– The questionnaire has been changed, for example by adding new questions or a new data collection mode.
– The context has changed in ways that could influence the measurement (for example, changes in society or in words' meaning).
– There are indications of problems with the questionnaire (based on process data, logs, or similar sources).
– The questionnaire has not been tested previously according to the routines of Statistics Sweden.
– None of the above conditions apply, but the client wants to evaluate the questionnaire, or testing is necessary for policy reasons (for example, if the questionnaire involves a sensitive topic).
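As a minimal sketch of how these triggering conditions combine (any single condition suffices), the following Python fragment may be illustrative. It is not part of Statistics Sweden's systems; the field names are hypothetical, invented for this example.

    from dataclasses import dataclass

    @dataclass
    class QuestionnaireStatus:
        # Hypothetical flags mirroring the conditions listed above.
        changed: bool                   # new questions or a new data collection mode
        context_changed: bool           # changes in society or in words' meaning
        problem_indications: bool       # e.g. from process data or logs
        previously_tested: bool         # tested according to SCB routines
        client_or_policy_request: bool  # client wish or policy reasons

    def testing_required(q: QuestionnaireStatus) -> bool:
        # Any one condition is enough to require testing before data collection.
        return (q.changed
                or q.context_changed
                or q.problem_indications
                or not q.previously_tested
                or q.client_or_policy_request)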
These conditions decide whether the questionnaire should be evaluated. Given that a questionnaire
should be tested, however, how extensive should the testing be and which method(s) should be
used? As shown above, the answer should be different for different surveys, depending on, among
other things, the status of the questionnaire and characteristics of the survey. In addition, there is no
clear answer from a methodological standpoint. Moreover, resources and total design have to be
considered. As such, how extensive the testing should be and which method(s) to apply appear to
require some individual investigation for each survey, based on the specific conditions at hand.
Unfortunately, such survey-specific investigations do not correspond very well to the production of statistics at a statistical agency, where hundreds of surveys are conducted every year in a steady stream, where in-house communication can be challenging, where there is competition for resources, where the measurement error has not always been of highest priority, and where both the survey manager and the cognitive lab have to plan and allocate resources for testing well in advance. Ideally, there would instead be an explicit strategy that proposes different evaluations for different surveys and, in that way, facilitates the testing process. To overcome the problems outlined above, such a strategy cannot be survey-specific but must operate on a more general level. That is, the strategy must discriminate between different surveys' needs concerning resources and methodological issues, but not to the extent that it becomes too complicated to communicate and apply in regular production. Such a strategy should help both the survey managers and the cognitive lab in planning for questionnaire testing. Although such a general strategy undeniably would mean standardization, with the accompanying disadvantage of not acknowledging uniqueness, it should just as well ensure that questionnaire testing becomes a part of each survey's plans and not a last-minute resort. Thus, an explicit strategy for questionnaire testing should facilitate questionnaire testing at a statistical agency and thereby improve the quality of the resulting statistics.
3. The development of the test strategy
The test strategy for data collection instruments should, hence, take survey characteristics into account to discriminate between surveys, but without being too complex. Concerning resources, the strategy has to be rational and not, for example, propose major testing for a survey of minor importance. A further question is, then, how to determine survey importance. One perspective is that of risk. A risk consists of two factors: likelihood and consequences. The likelihood is difficult to estimate in advance (especially for new questionnaires). The consequences, however, can be better forecasted. Flaws in a questionnaire influence, for example, the respondents and their responses, the data collection and the editing, and, in the end, the quality of the statistical output. The consequences of flaws in the questionnaire depend on the uses of the statistics. Are important decisions based on the statistics? Are the statistics widely used? Hence, flaws in the questionnaire have different consequences for different surveys. Greater consequences should therefore merit more extensive testing, to identify potential problems in the questionnaire and redesign it accordingly. Such reasoning is, of course, not unique to questionnaires but applies also to other tests, for example tests of IT systems. Here, such reasoning about consequences was used to determine survey importance and, thus, the level of testing for different surveys.
There is a database at Statistics Sweden describing surveys through many, mainly administrative, variables. Its information is used, for instance, in systems for publishing statistics, metadata, and economic administration. The database covers official statistics from Statistics Sweden and all other responsible agencies, and further regular statistics from Statistics Sweden. Its use has broadened over time and has lately increased considerably for various management and evaluation purposes. In this case, three characteristics were chosen to discriminate between surveys' needs for testing:
– Official statistics (yes or no). The Official Statistics Act states that official statistics are statistics for public information, planning and research purposes in specified areas, produced by appointed public authorities in accordance with the provisions issued by the Government. Official statistics shall be objective and made available to the public.
– Importance for society (two categories). The survey has this characteristic if its output is considered important for one or more government agencies in case of a critical situation for society or during times of alert.
– Importance of correctness (three categories). The assignment of category for the consequences of incorrect information shall consider errors in decisions, reduced confidence, and costs due to breaches of contract (and in general). The categories mean (1) no or little harm, (2) harm, and (3) serious harm due to the incorrectness.
These three characteristics are all relevant when considering the amount of testing for different surveys. Together they give twelve possible categories for surveys (2 × 2 × 3). A few more characteristics in the database were considered. One of them in particular was meaningful but was not added, since its classification was strongly correlated with those already obtained.
The strategy is based on these three characteristics. How a survey is classified on them determines the level of testing. The assigned level represents a minimum; the survey manager and the management team of the survey can decide to test on a higher level. The main goal of the strategy is not to capture survey uniqueness in terms of questionnaire testing (for example, that an interviewer debriefing would suit a specific survey particularly well) but to establish a baseline, based on the aforementioned characteristics. Three levels were considered appropriate. They are called B, C, and D, with increasing ambition. There is another categorization, with other characteristics (such as the size of the survey and the sensitivity of the survey topic) and with a further level A. It is used for surveys not featured in the database. Many of these surveys are financed by fees; the survey, or sometimes just the data collection, is then commissioned by a client.
The numbers of surveys at the different levels were studied in order to strike a balance between testing and resources. Only surveys with questionnaires were included (not surveys based on administrative data or secondary statistics). The numbers of surveys at the different levels are based on an investigation in 2010 and are shown in Table 1. There are nearly 90 surveys in all. There are relatively few surveys on the highest level (D). Two categories dominate in number, one in each of levels B and C.
Table 1. The twelve survey categories, the corresponding level of testing, and numbers of surveys in 2010.
Columns: Official statistics | Important for society | Importance of correctness | Level of testing | Number of surveys in 2010.
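The mechanism behind the table can be sketched as a simple decision rule, as in the following Python fragment. The rule itself is invented for illustration only; the actual assignment of the twelve categories to levels B, C, and D is the one decided at Statistics Sweden and shown in Table 1.

    def minimum_testing_level(official_statistics: bool,
                              important_for_society: bool,
                              correctness_importance: int) -> str:
        # correctness_importance: 1 = no or little harm, 2 = harm, 3 = serious harm.
        # The level assignments below are invented; only the structure
        # (three characteristics -> a minimum level B, C, or D) follows the text.
        if correctness_importance == 3 or (official_statistics and important_for_society):
            return "D"  # invented assignment
        if correctness_importance == 2 or official_statistics or important_for_society:
            return "C"  # invented assignment
        return "B"      # invented assignment

The returned level is a floor, in line with the strategy: the survey manager and the management team may always decide to test on a higher level.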
There is one more important feature of the strategy: whether the survey is new or not; in other words, whether there is prior information available that can be used for test purposes. This influences the possible set of testing methods (some methods require prior information, see above). There are four testing levels, A–D, and two possibilities depending on whether prior information is available or not. Hence, there are eight combinations in all, as shown in Figure 1.
Figure 1. Testing combinations and the testing methods used in each case. The upper part covers the situation with no previous collection to analyze (N; a new or changed questionnaire, or a changed context), giving combinations AN, BN, CN, and DN for testing levels A–D. The lower part covers the situation with a previous collection (T; indications of errors or an untested questionnaire), giving combinations AT, BT, CT, and DT. For each combination, the figure lists the testing methods used.
The upper part of the figure shows the situation when no prior information is available. This is primarily relevant for new surveys or new questionnaires. In such cases, methods that require already collected data (for example, monitoring or debriefings) cannot be applied unless a pilot study is conducted (see combination DN). The choice of method(s) is fairly straightforward. On the highest level, D, the survey manager and the management team have to make an appropriate choice together with the cognitive lab, based on the survey's needs and total design. When there are previous data, more methods are possible, as the lower part of the figure shows. On the higher levels, an appropriate choice is made from the list in the figure.
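To make the eight combinations concrete, a small sketch (continuing the hypothetical Python fragments above) can derive the combination code and filter the method set by data availability. The method lists below are drawn from the methods named in Section 1; their split into the two groups is an illustrative reading of the text, not a reproduction of the lists in Figure 1.

    def combination(level: str, has_prior_data: bool) -> str:
        # E.g. combination("D", False) -> "DN"; combination("B", True) -> "BT".
        return level + ("T" if has_prior_data else "N")

    # Methods named in Section 1. Methods needing already collected data
    # are available only in the T combinations (or in DN via a pilot study).
    METHODS_WITHOUT_PRIOR_DATA = ["expert review", "quick review",
                                  "cognitive interviews"]
    METHODS_NEEDING_PRIOR_DATA = ["interviewer or editor debriefing",
                                  "monitoring", "behavior coding",
                                  "quantitative analysis"]

    def available_methods(has_prior_data: bool) -> list:
        methods = list(METHODS_WITHOUT_PRIOR_DATA)
        if has_prior_data:
            methods += METHODS_NEEDING_PRIOR_DATA
        return methods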
Statistics Sweden aims to become certified according to the international standard ISO 20252:2007 for market, opinion and social research. The standard's requirement on pre-testing of questionnaires is fulfilled by all four testing levels.
4. Successive implementation of the test strategy
The test strategy has been developed over a long time period. The first phase was a joint
development project for IT, cognitive methods, and statistical methods on testing. It was then
decided to move forward for data collection instruments. A group of four persons worked in this
second phase towards a strategy. Many co-workers served as critical and constructive readers from
different points of view in both phases. When there was a preliminary version for the test strategy, a
more formal way was used to communicate within Statistics Sweden. A referral was sent to the
departments most affected, i.e. the data collection and the subject matter departments, and also to the
group working on implementation of the ISO 20252 standard. The feedback from the referral was
very good with many constructive comments. Two referral rounds were used. This procedure
improved the result, and it also meant that the future users both got an understanding of the strategy
A formal decision was taken by the Director General in October 2010. This decision stated the principles for the testing (i.e. the strategy) and also an implementation plan. Even though some surveys were already being tested, at least to some extent but perhaps not to their assigned level in the strategy, the new strategy implied increased testing of surveys overall. Everything could, of course, not be achieved at once. An order and a pace were decided for the next few years. The order of the surveys was again related to risks. Surveys on a high level were put early in the plan, but their testing may be implemented successively over a few years. A survey on level D, for example, can start on level B and then proceed successively to C and D. New surveys and redesigned surveys have all followed the strategy since the beginning of 2011.
Plans for the next calendar year are made in the autumn. This is a suitable time to plan for implementation in surveys that have not yet been tested on the appropriate level. Most repeated surveys have a management team with several different roles. The methodologist of the team has been given the task of pointing to the strategy, if need be. The testing is done by cognitive experts.
5. Conclusions
The strategy has several strengths. Since it is based on existing classifications, it makes it possible to plan for testing well in advance. Another strength is that the strategy is rational: limited resources are used where they are most needed. Moreover, the classifications are not new for this test strategy but taken from a database with which many are already familiar. The strategy therefore consists of relatively simple principles, which both cognitive and other staff can grasp and follow. In addition, the many co-workers and departments involved in the development work ensure that the strategy takes many perspectives into account and fits the big picture.
The strategy also has limitations. Even though it is based on ideas concerning how best to mix test methods (combining qualitative and quantitative data, or empirical methods with those relying primarily on individual judgment), the proposed combinations of methods might not be optimal for specific surveys. The strategy is a somewhat crude way of capturing a survey's need for particular test methods. However, the main goal was not to present methodologically optimal testing for every survey but to facilitate and establish a baseline for testing in general. There is some flexibility in the strategy (to adjust according to a specific survey's needs), but simplicity was regarded as highly important in the initial implementation.
Other factors that have contributed to the successful implementation are:
– standardization overall in statistics production;
– an increased understanding of the danger of measurement errors and the importance of well-designed questionnaires;
– the ISO 20252:2007 standard, with its requirement on pre-testing of questionnaires;
– an understanding and positive reception of the test strategy.
Eight months after the formal decision on the test strategy, around 80 surveys had been tested. About half of these tests were quick reviews (level A); the other tests were on higher levels. Some of the tested surveys had not yet reached their minimum level but had been tested on a lower level. There were sometimes reasons to wait, e.g. a redesign of the survey in the near future.
This strategy has received quite a bit of attention at Statistics Sweden, for two major reasons. Questionnaires are tested in a structured and well-motivated way that makes planning possible. And the approach has been appreciated as such and can be used also in other areas.
6. References
[1] Conrad, F.G. and Blair, J. (2009). Sources of Error in Cognitive Interviews. Public Opinion Quarterly, Vol. 73, 32–55.
[2] DeMaio, T.J. and Landreth, A. (2004). Do Different Cognitive Interview Techniques Produce Different Results? In Methods for Testing and Evaluating Survey Questionnaires. Presser, S., Couper, M., Lessler, J.T., Martin, E., Martin, J., Rothgeb, J.M., and Singer, E. (eds.). Wiley, Hoboken, NJ.
[3] Fowler, F.J. (2011). Coding the Behavior of Interviewers and Respondents to Evaluate Survey Questions. In Question Evaluation Methods. Madans, J., Miller, K., Maitland, A., and Willis, G. (eds.). Wiley, Hoboken, NJ.
[4] Presser, S., Couper, M., Lessler, J.T., Martin, E., Martin, J., Rothgeb, J.M., and Singer, E. (2004). Methods for Testing and Evaluating Survey Questions. Public Opinion Quarterly, Vol. 68, 109–130.
[5] Willis, G. (2005). Cognitive Interviewing: A Tool for Improving Questionnaire Design. Sage, Thousand Oaks, CA.
[6] Willis, G.B., Schechter, S., and Whitaker, K. (1999). A Comparison of Cognitive Interviewing, Expert Review, and Behavior Coding: What Do They Tell Us? Proceedings of the Section on Survey Research Methods, American Statistical Association, 28–37.