
Use, Design & Evaluation -

Steps towards an Integration

In D. Shapiro, M. Tauber & R. Traunmueller (eds.) The Design of Computer-Supported Cooperative Work and Groupware Systems (series 'Human Factors in Information Systems' volume 12) Amsterdam, The Netherlands: North-Holland. ISBN: 0 444 81998 3.

Abstract
This paper argues for a shift in perspective: away from thinking of design, use and evaluation as quite distinct activities, and towards seeing them as activities that are necessarily interleaved and mutually constitutive. Adopting this view has implications for the organisation of design teams, and for the role of "evaluation" in the larger design process. After some discussion of these points, the paper reviews some recent evaluation studies in the area of CSCW and notes some methodological issues that need to be addressed in evaluation work.
Document ID COMIC-Risø-2-8
Status Book Chapter
Type Working Paper
Version 3.0 Final (update of Risoe-2-3)
Date 19 August 1994
Task 2.2

 

 

Use, Design, and Evaluation: Steps towards an Integration

 

 

Liam J. Bannon

Dept. of Computer Science & Information Systems

University of Limerick, Ireland

 

 

 

 

 


 

Keywords

CSCW, evaluation, experiment, design, groupware, iteration, methodology, use.

 

 

 

The argument of this paper, expressed in a nutshell, is that too often the work done under the rubric of design and that done under the label of evaluation are carried out as two completely independent activities, if indeed any evaluation is done at all. While this problem is endemic to much of design in general, the focus in this paper is on design and evaluation in the context of software development, and more narrowly still on CSCW experiences in this area. The paper is organised as follows: Section 1 provides a perspective on design that emphasises the iterative nature of design and the role that evaluation through use can play in this iteration. Support for such a view is drawn from a variety of sources. Section 2 introduces and critiques the concept of evaluation, arguing for a demythologising of the topic, and for the development of simple but useful tools for evaluation. After this general discussion, we then document the situation with respect to certain CSCW applications in recent years, giving examples of evaluations and their strengths and weaknesses.

 

1. A perspective on the design process

 

While empirical studies of the design process - both design generally and software design in particular - are still relatively few and far between, the last few years have seen an increased interest in understanding the nature of this process more fully, in order that it may be supported better through a variety of means. The simplistic idea of designs materialising "out of the blue" more or less completely formed in the head of a lone designer has given way to an understanding of the work of a design team who engage in a fine-grained analysis of the work needs of specific people in a specific context, and of the slow and laborious process of developing a design from a tentative first sketch through to a fleshed-out design model that can be prototyped, tested and refined in an iterative fashion, before the process becomes fully industrialised. My intent in this paper is not to study the details of this whole process, but rather to highlight certain aspects of it which I believe have not been given the attention they are due. These aspects all concern use, testing and evaluation, seen not as terminal stages in some linear design model, but as necessary and interleaved aspects of iterative design. Since I wish to focus on these aspects of use and evaluation, I do not intend to spend much time arguing for my particular view of the design process over other frameworks. My view builds on a large body of work on information systems development, much of it Scandinavian in origin, which has evolved and been reported on extensively over the past several years, and to which readers are referred for further information, e.g. Ehn (1989), Floyd (1987), Greenbaum & Kyng (1991), Bødker & Grønbæk (1991), Grudin (1991), Bannon & Bødker (1991), Henderson (1991), Henderson & Kyng (1991).
Common to these accounts is a concern for the evolution of a design practice that encompasses an understanding of the use situation, of the needs and concerns of the actors involved, and of the importance of an ongoing dialogue with the involved parties through concrete instantiations of design ideas (prototypes) that can be worked with by "end-users". Furthermore, there is an acceptance of the fact that user needs are difficult to articulate as system requirements, and that such requirements are at best temporary and local, for as time unfolds changes occur in the use situation and surrounding context that will inevitably affect system requirements. Part of the inherent nature of the design process can indeed be viewed as managing such contingencies.

Thus we should see design, use and evaluation as interleaved and interpenetrating practices, not as distinct steps in a linear development process that moves from analysis through design to implementation, then use and, ultimately, evaluation. Indeed use can be seen as the basis for design. This need to reframe our accounts of the design process has become more accepted in recent times - Henderson & Kyng (1991) discuss "continuing design in use", and the notion of a cycle or wheel of design - moving from use -> observation -> analysis -> design -> implementation back to use - has been well articulated in Henderson (1991). This alternative conception of the design process allows one to see how design can emerge from an understanding of current (mis)-use of systems, as a spur to re-design. The iterative nature of the process thus focuses attention on the interleaving of design and use as fundamental to design. Observations of use can be conducted in a variety of ways, from the designer’s own experiences with a simple model, to experiences with a more fleshed-out prototype, through to more rigorous evaluation of the system prototype in a field study with potential users. The latter activity has often been labelled Evaluation in older flow models of design, where it occurs at the end of the development process. Changes made as a result of such studies are relatively minor, as the system has been more or less frozen at this stage. Changing perspective to the wheel of design does not imply that formal evaluation studies per se are required at the early stages of design, but that some form of use and evaluation, be it simply with mock-ups or simple prototypes, or even simple tests with storyboards, should be carried out from the very beginning of the design process, and should be allowed for, nay built into, the whole design process.
The depth and extent of these studies of use will of course vary as the design concept develops, but it is the shift in perspective that sees such studies as an integral part of design that I wish to stress, as a perspective for what follows in this paper.

Some of the points in this paper are especially relevant to the area of CSCW, for a number of reasons. For one thing, as noted by a number of people (e.g. Grudin, 1989), the intuition of designers about useful software for groups is likely to be poor, so an understanding of the use situation is even more important than usual in design, which in turn leads to the need for early prototyping and feedback from users. For another, the early days of CSCW were characterised by a large number of papers that described design models for ‘supporting’ various aspects of group activities, embedding a model of group communication or group co-ordination activity that seemed open to question (see Robinson & Bannon, 1991). There was little or no evidence that such models were of any practical relevance to the task at hand, and further, no discussion took place of such issues as possible testing or evaluation of the designed systems. Worse still, when refinements to the initial model were presented at a later date, often these refinements were also based on abstractions rather than on any clear empirical evidence for the relevance of these new features in actual work situations. There appeared to be a total neglect of the need to understand the use situation, and of the need for iterative design that takes the experiences of use of systems seriously.

The point is not simply to ask whether such abstractions were of use or not, but to actually test them and be able to say, based on some empirical evidence, no matter how meagre, that certain features were found to be of use and others not. Even more frustrating to the longer term development of the CSCW area, in certain cases we have seen the abandonment of one form of model for yet another model, without any clear rationale for the switch, other than the designer's whim or fancy. If such changes in direction have come about through some set of experiences in using a prototype, such information would be of great value to our community, as currently the number of reports of use of systems, be they successes or failures, from early mock-ups to full-blown systems, is quite small.

 

2. Another look at Evaluation

While much time and effort have been spent on formally developing evaluation concepts and methodologies, that is not our interest here. We rather focus more pragmatically on how we can improve the quality of our systems. Thus our focus is on evaluations in practice, rather than the nature of Evaluation as a concept - with a capital "E". On a simple level, there is an obvious need to know whether or not some designed system does actually perform its intended function. Evaluations are supposed to examine this aspect among others. There are a variety of forms of evaluation: expert investigation, questionnaire surveys, verbal reports, controlled experiments, design reviews, informal observations and formal analysis (see Karat, 1988 for an overview). Evaluations can be formative or summative (Scriven, 1967). Summative evaluations focus on the results that can be achieved with the designed artefact. Formative evaluations are concerned with improving aspects of the design during the design process itself.

Recently, there has been a surge of interest in the use of ethnographic methods as another form of data collection and analysis that might be useful in the systems design process (see Jordan, this volume, for some further information on ethnography). While the notion that ethnographic methods could be seen as evaluation tools might seem strange to some, given their qualitative, interpretative nature, there is no doubt that such accounts can be useful, both in helping to inform the requirements process, and in understanding how people work with or around designed systems. The term evaluation might seem out of place here because one normally assumes some metric or set of metrics against which a system is being "evaluated". While this is relatively easy in the assessment of hardware or software features of systems (speed and size of memory etc.), the whole issue of exactly what we are measuring against, what our criteria for evaluation are, is much more problematic in the case of measuring the usability or utility of systems from the point of view of end users at different levels in an organisation. Attempts have been made to firm up such metrics, under the rubric of "usability engineering", but such attempts engender their own problems, as noted by some of the pioneers in this particular field (Whiteside, Bennett & Holtzblatt, 1988).

Given this uncertainty, a careful systematic account of what happens in particular settings when a prototype or system is installed, and how the system is viewed by the people on the ground, can provide useful information for "evaluating" the system and its fitness for the purpose for which it was designed. The next issue is how such information can influence the design process, as traditionally the focus of ethnographers has been on understanding a particular setting, and not on re-designing artefacts for that setting. While many ethnographers eschew this issue, preferring to present their account and leave it to others to try to glean some design implications from the material collected, there are a number of attempts to make the results of ethnographic work more usable to the design community (see Blomberg, Giacomi, Mosher & Swenton-Wall, 1993).

Much useful insight can be gained about the success or failure of systems simply from the subjective assessments of designers and users, whether noted and collected informally, e.g., in diary reports, or from informal discussions and observations. User-based evaluations can take a number of forms, and can provide a variety of kinds of useful information, depending on who the users are. Often, users are randomly selected people, perhaps with a level of education expected of intended users of the new interface. Simple empirical studies such as "Talking-Aloud" can provide much useful information on problems with specific features of the interface. Indeed, we can even perform evaluations of hypothetical systems through "Wizard-of-Oz" techniques, where the potential functionality of a system is simulated by a person rather than actually being built into the software. Such studies can give insight into whether the intended functionality is actually deemed useful by users, in advance of any attempt to actually build software to produce such functionality. Evaluating whole systems is more difficult. First of all, it is important that the "users" (a term that is fraught with problems - see Bannon, 1991) tested with the system are indeed drawn from the community of people who will be operating the new system. Failure to test appropriate users can lead to systematic problems that may not be uncovered until the system is operational (see Whiteside and Wixon, 1987 for an example). Also, the time taken for empirical study is often far too brief for it to be possible to answer larger questions of how people may adapt and use the system over longer periods of time in real work situations (for more on the problems of evaluation in HCI, see Bannon, 1991, Bannon & Bødker, 1991, Thomas & Kellogg, 1988).

Given that our designs will inevitably be flawed, the important point is that the results of these designs are tested and the findings used to help in the re-design process. Indeed, in recent years, this realisation of the inevitable need for re-design has become a commonplace - as noted in the oft-quoted dictum: "Design to throw one away - you will anyway". Iterative design, moving from initial design to a prototype, followed by evaluation, and back to re-design, has become a buzzword in systems development. Also within HCI, we have seen an increasing concern with the need for appropriate evaluation of systems under the rubric of "usability". Despite the existence of the "human factors" area, which supposedly evaluated the usability and utility of designs, it appears that our methods and techniques were not able to prevent a large number of inappropriate designs coming to the market. This has also pushed the need for methods of early evaluation of prototype systems so that modifications can be fed back into the design process rapidly. Within the HCI community, Landauer (1991) stands out as someone who has consistently argued for the need to move away from narrow, detailed and laborious experimental studies, which are of little use to the hard-pressed design team keen to make improvements to the prototype design as rapidly as possible, towards more "lightweight" evaluation techniques, as has Nielsen (1989).

The key point here is to impress on designers the need for early evaluation of systems, whether under the guise of user involvement, rapid prototyping or user testing, or all three combined. Failure to do this can be attributed to a number of factors, some hinging on the personal history of design team members, others that may be structural. For example, it may be the case that the design team has not in the past worked much with users, and thus the necessary skills (often really quite simple) are not available, leaving no person who feels responsible for, or capable of assuming responsibility for, such activities. Such activities do of course cause some disturbance in the life of the design team (the real world is a bit "messy" after all), but the rewards can more than justify the disturbance. My view is that such problems should not necessarily be handled by adding another design team member who is solely responsible for "human factors" or for "doing ethnographies", as this often just increases variance in the design team, but rather by making all the team members aware of the need for studies of use, formal and informal.

So, having argued for the desirability of performing some form of evaluative activity at every step in the design process, I now wish to show, through discussion of a particular case, some of the problems that can occur in the conduct of an evaluative study.

 

3. Critiquing a CSCW Evaluation Study of THE COORDINATOR

While the argument above has been to the effect that ongoing evaluation can significantly improve the design of an application, it must also be noted that a number of the existing evaluations of prototypes or systems can themselves be critiqued from a variety of standpoints: conceptual, methodological and empirical. In such cases, the import of the published results must be carefully weighed. The old question of "who evaluates the evaluators" needs to be taken quite seriously. While addressing this thorny problem fully would require a separate paper, I base this section of the paper around a discussion of one particular study, by Carasik and Grantham (1988), which has been cited frequently in the CSCW evaluation literature. The rationale for this is that, in my opinion, this study exemplifies a number of important issues that need to be taken seriously in the interpretation of any evaluation study.

The Coordinator is a commercially available software system that has become one of the most talked about CSCW applications in recent years due to its articulation of a well-developed theory of "language as action" that has exerted considerable influence in the research community (Winograd, 1986) as well as having received enthusiastic commercial endorsement, at least within certain companies (Johnson, B., Weaver, G., Olson, M., & Dunham, R., 1988) who claim that use of the system has increased their productivity enormously. It has attracted both praise and criticism at both a conceptual and pragmatic level. The actual system can be simply described as a fancy electronic-mail-cum-project-management system. The system is built on the belief that human action is based on conversations, primarily conversations of a particular form, for action. Thus people using the system do not simply send mail, but make requests, or promises, or offer or decline to perform certain activities. (The system does allow for "free-form" responses, but this choice indicates abdication of the underlying framework on which it is built). Within this framework, the system then keeps track of the commitments made by individuals.
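To make the idea of tracking commitments more concrete, the following is a minimal illustrative sketch in Python. It should not be read as the Coordinator's actual design or vocabulary: the names Act, Message, Conversation and open_commitments are hypothetical, and the assumption that only an explicit promise creates a commitment is a simplification made purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Act(Enum):
    """Kinds of message a sender can choose (illustrative set only)."""
    REQUEST = "request"
    OFFER = "offer"
    PROMISE = "promise"
    DECLINE = "decline"
    FREE_FORM = "free-form"   # falls outside the action framework


@dataclass
class Message:
    sender: str
    recipient: str
    act: Act
    text: str


@dataclass
class Conversation:
    """A chain of linked messages about one piece of work."""
    topic: str
    messages: list = field(default_factory=list)

    def add(self, message: Message) -> None:
        self.messages.append(message)

    def commitments(self) -> list:
        """Return (person, topic) pairs for each promise in this conversation."""
        return [(m.sender, self.topic)
                for m in self.messages if m.act is Act.PROMISE]


def open_commitments(conversations) -> dict:
    """Group the topics each person has promised to act on."""
    by_person = {}
    for conversation in conversations:
        for person, topic in conversation.commitments():
            by_person.setdefault(person, []).append(topic)
    return by_person
```

In such a sketch, a request answered by a promise leaves a record against the promiser, while a request answered by a decline or a free-form reply leaves none. Whether a record of this kind captures what matters in actual work conversations is exactly the sort of question that evaluation in use has to answer.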

Support for the benefits that are provided by the system comes from a number of sources. In one survey and observation study on groupware systems in organisations, Bullen & Bennett (1990) note how the ability to link messages in electronic messaging systems (e.g. in Higgins, The Coordinator, All-in-One) was found to be a very useful facility across the board. Through its concept of a "conversation", The Coordinator supports such linking quite explicitly. At the same time, there is a dispute between different groups, at times quite bitter and almost ideological, over its explicit design goal, which is to change the way people in organisations think and act. Whether speech act theory itself is an adequate theoretical framework on which to erect any computer-mediated communication system is open to question (Bowers and Churcher, 1988, Levinson, 1983), but the main complaint against the system in use to date has been that it seems to exclude negotiation. Considering the volume and vehemence of rhetoric on the system, there is a relative paucity of case studies, and the evaluations that do exist are rather partial, and of varying quality. Existing evaluations of the system have generated conflicting results (Johnson et al., 1986, Bikson et al., 1988, Bair & Gale, 1988, Carasik & Grantham, 1988, Bullen & Bennett, 1990, Schäl, this volume). We will investigate one of these (Carasik and Grantham, 1988) in a little more detail here, in an effort to point out some of the problems that can occur in the design and execution of an evaluation study.

In any evaluation study, it is important to be aware of the original context into which one is "parachuting in" or introducing the new application, and how it will impact on the selected "users". Of course a crucial question is what we hope to learn from such an evaluation. Is it concerned with testing some highly specific feature of the interface, e.g. a set of menu commands, to see if they are usable, or is it attempting to test the utility of a whole new application for a particular group of people working in a specific environment? In the former case, the selection of "subjects" for the study may be a lot easier than in the latter, where it may be much more difficult to find a suitable test environment. Performing "evaluations" of systems in inappropriate environments means that we cannot put any weight on the obtained results, even if other more local aspects of the "evaluation" were conducted carefully. Due to the difficulty of finding appropriate test sites for field study evaluations of systems, a common problem with such studies is that other environments are found, less appropriate in terms of scientifically testing the case, but infinitely more tractable. If the evaluators are open about the situation, then such studies, although they must be interpreted with caution, can still provide information of use, but care must be taken in generalising any findings. The Carasik and Grantham (1988) case study is problematic on a number of counts in this regard, and since it has been cited quite widely as an example of how people "reject" the Coordinator, these problems need to be noted.

An initial problem with the study is the selection of the test group: colleagues of the authors. This seems somewhat unfortunate, even if convenient for the researchers. In the published case study, it appears that one of the authors took on the role of championing the product, suggesting to the "group" of which he was a member that they should all try out the system for a period. The group consisted of 15 professionals who provide consulting in information technology, are based at 2 different locations 35 miles apart, and are frequently travelling, or working at home. The authors openly admit that the people do not work extensively with each other, so to what extent they can be seen as a "group", other than that they are organised administratively into the same unit, is questionable. It is also noted that many of these people currently have access to other email systems, and use a variety of different hardware and software (PROFS, UNIX, PCs, Macs). For the purposes of the trial PCs were made available to all (in order for people to be able to interact with the Coordinator), but it is not noted how many already had them on their desks, and how many were encumbered with an additional machine simply for this software trial. The intent of the study was to see if people would find the Coordinator system useful for intragroup communication. Given that, as the authors admit, the organisational division where the trial was carried out does not really constitute a work group of any coherent form, the likelihood of being able to adequately test this particular hypothesis under such conditions does not seem very high.

The actual implementation of the study raises further serious issues. One of the experimenters apparently took on the role of arranging to provide people with the software, and the additional PC required by some people in order to be able to use it. Within the organisation, the manager of the group expressed support for the study, but did not use the system himself, rather having his secretary use it for him. The software vendor gave a 4-hour training class for the whole group on use of the software, but it appears that many users were unfamiliar with PCs, and the authors note that many users never really made an effort to use the system. As they note: "some group members used the software extensively, while others simply avoided it". Such a result is not surprising, given the conditions. However, there is no teasing out of the interdependencies between problems with PCs in general and problems with the particular software, so one has no clear way of separating out the relative contributions of each of these factors to (non-)use. The authors do however note that the new system had no interface to other mail systems, which was especially a problem where people already used PROFS or UNIX mail, as the Coordinator users’ conversation base was restricted to the other people with PCs. But amazingly, no figures are given as to how use or non-use related to previous and current use of other email systems, which would seem to be a crucial factor.

It might be argued that my critique is overly harsh on a study which did at least attempt some form of evaluation of a system in a use context, and this is an area where, as noted earlier, there is a paucity of studies in existence. However, part of my reason for critiquing it in such strong terms is that the paper does not read as an informal anecdotal study of an attempt to experiment with the use of the system in-house, which is what, in my view, it actually is, but purports to be more than that. So, for example, the authors discuss their evaluation in more formal terms, to the extent of discussing their methodology, and the use of 3 "instruments" in order to assess the effects of the "experiment". Let us look at the instruments used in the study and their appropriateness. The first was a semantic differential scale which measures changes in subjects’ cognition of test items, the second was a network analysis questionnaire, which noted the attitudes of people to other members of the work group, and the communication patterns observed, and the third was a matrix questionnaire looking at the potential substitutability of different forms of communication within the group. While each of these "instruments" may in certain settings, under certain conditions, produce useful information, their relevance in this situation, given the site set-up described earlier, is in my view deeply problematic. Given the informality of the "experiment" and the conflating of a number of issues outlined above, at best the study could report on people's anecdotal reports and subjective impressions, and even these must be treated with caution due to the extreme limitations of the study. But it is difficult to imagine how such instruments could provide much new insight, given the informal nature of the "experiment".

The authors usefully note the anecdotal reports of their users, which are very negative on the "new language" which the system introduces to label communicative acts, on the poor interface etc. It is then noted that only at management urging would people continue to use the system, but no detailed logging of use of the system is presented, so one has no idea of what level of use was ever achieved with the system, and by whom. It is noted that after 6 weeks the group decided to discontinue use of the Coordinator, and adopt PROFS instead, citing the following difficulties: lack of language clarity, difficulty in learning the tool, and cross-system compatibility. These appear to be quite sensible reasons for stopping use of the system in this particular setting. To put it from a "users" perspective, they were asked to use a new piece of software that was foisted on them at the behest of one member of staff who was curious about the potential of the new tool, and who was backed up, at least initially, by management. It is difficult to see how the system meets any real identified need of this group. The trial also involved, for some people, having to use a completely different computer system from the one they currently used, to communicate with co-workers that they did not need to communicate with extensively in the first place, so is it any wonder they were reluctant to use it?

In sum, what this study is able to say about the general utility or usability of the tested product is very debatable. The lack of quantitative data published from the study makes it difficult to check out the details of the case - in terms of total time for different users on the system, number of messages sent, correlations between use of the new software and availability of alternatives, etc. The one table produced, which shows the change in the semantic differential before and after the test, simply shows that they rated the concept "Collaborative Work" as more boring after their experiences. The likelihood of such a measure providing any useful information in this kind of informal trial is very low. Indeed the one significant before-after measure could simply be seen as a response by those involved to the hype surrounding the study itself. To be fair to the authors, in the final part of the paper, they do raise some important points about the Coordinator, but these points become attenuated due to the low quality of the empirical evidence that is adduced.

To conclude this Section, my purpose in discussing this study in some detail is twofold. On the one hand, some pitfalls in the conduct of a user software trial have been noted. On the other, I wish to bring to the reader’s attention the fact that this study has been cited numerous times in the CSCW literature as empirical evidence for the general "failure" of the Coordinator in actual work settings. While the paper does have some interesting commentary on the Coordinator, much of which I personally agree with, and provides some informal empirical information, it should be clear by now that it certainly should not be understood as in any way clearly demonstrating that the Coordinator is a failure. Evaluations are important, yes, but it is also important to be aware of the quality of the evaluation, and of what can legitimately be learned from any particular study. In the present case, I have tried to show that, based on the evidence presented in the published paper both about the set-up of the study and the results proffered, very little of substance can be concluded about the utility of the Coordinator, even for the particular setting studied.

In the next Section, I wish to briefly outline some other evaluation studies that have been done on certain CSCW systems and note some of their features, positive and negative.

 

4. A Further Sampling of CSCW Evaluations

The MIT Information Lens Project

The Information Lens system (Malone et al., 1987) has been the subject of a number of research reports and the ideas behind it have now been incorporated into several commercial products. The system is designed to support people in managing their electronic mail. It has at times been referred to as an "intelligent" information sharing system. The filtering available in Information Lens is designed to screen users from "junk" mail and cull other messages of interest from a larger set, even if not directly addressed to specific users, thus extending the information sources available to individuals. It provides capabilities for organising mail based on various aspects of the incoming message. It allows users to make message templates of various forms and have rules (of an IF-THEN-ELSE variety) that act selectively on these "semi-structured" messages. If the sender has selected a colloquium form for the mail message, and a message form of type: colloquium has been defined by the group, then the sender can be provided with support for composing the message through a partially filled ("semi-structured") message template, and the receivers can make rules that utilise the information that a message is a colloquium announcement to file it appropriately. One can see how this could be quite useful to help put some structure on the myriad of different forms of email communication which at present are insufficiently disambiguated. It helps the sender to structure messages appropriately, and can serve a reminder function for what information is necessary for certain announcements (e.g. to remember to specify the location of a meeting) as well as helping the receiver to sort incoming mail appropriately, rather than have all kinds of messages mixed together in the incoming mail file, as has been the case in most email systems up until the past few years.
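The mechanism just described can be made concrete with a minimal sketch. It is only an illustration under stated assumptions and not the actual Information Lens implementation: the names `Message`, `Rule`, `TEMPLATES`, `missing_fields` and `apply_rules`, and the particular fields given for a colloquium template, are all hypothetical. It shows a typed, semi-structured message, a sender-side reminder of unfilled template fields, and a receiver-side IF-THEN rule that files the message by type.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical template: the fields a "colloquium" message is expected to carry.
TEMPLATES: Dict[str, List[str]] = {
    "colloquium": ["speaker", "title", "date", "location"],
}


@dataclass
class Message:
    msg_type: str                                   # template type chosen by the sender
    fields: Dict[str, str] = field(default_factory=dict)
    folder: str = "inbox"                           # where the receiver currently keeps it


@dataclass
class Rule:
    condition: Callable[[Message], bool]            # the IF part
    target_folder: str                              # the THEN part: file the message here


def missing_fields(msg: Message) -> List[str]:
    """Sender side: list the template fields that are still empty."""
    required = TEMPLATES.get(msg.msg_type, [])
    return [name for name in required if not msg.fields.get(name)]


def apply_rules(rules: List[Rule], msg: Message) -> Message:
    """Receiver side: the first matching rule files the message; otherwise it stays put."""
    for rule in rules:
        if rule.condition(msg):
            msg.folder = rule.target_folder
            break
    return msg


# A colloquium announcement whose sender has forgotten the location.
announcement = Message(
    "colloquium",
    {"speaker": "A. Speaker", "title": "A talk", "date": "Friday"},
)
rules = [Rule(lambda m: m.msg_type == "colloquium", "colloquia")]
```

Here `missing_fields(announcement)` plays the reminder role for the sender, while `apply_rules(rules, announcement)` is the receiver's filing rule; a message of a type that no rule mentions simply remains in the incoming mail folder.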

In an empirical investigation of the use of the Information Lens system, Mackay et al. (1989) summarised their findings as follows: people without significant computer experience can create and use rules; useful rules can be created based on the fields present in all messages, without special message templates; people use rules both to prioritise messages before reading them and to sort messages into folders after reading; and people use delete rules primarily to filter out messages from low-priority distribution lists and not to delete personal messages to themselves. Mackay (1990) also shows the wide variability in patterns of use of the system, though overall there seems little doubt that prototype systems (after some iteration) are being used effectively in work situations.

Undoubtedly some of the ideas embodied (over time, evolving through use) in Information Lens have proved useful in practice. So it seems that people can make up rules that are useful, but this does not imply that the rules can be encapsulated into an "agent" and allowed to be triggered automatically. For instance, one key point noted by Mackay (1990) was how people tended to make up rulesets but then run them manually, i.e. the people themselves determined when to run the ruleset, on particular occasions of use, rather than have it done automatically, according to some pre-specified formula. This supports the notion that it is very difficult for people to specify clearly, ahead of time, the conditions under which certain rules should be run. Luckily, the technology allowed for the user to manually "trigger" the rulesets, although this was not part of the initial idea of how the system would be used.

In the field study, Mackay shows the importance of the social environment in affecting the use and development of Lens, with a local expert exerting considerable influence, as well as information sharing going on among the participants - sharing of rules developed by one person and then picked up by others, even without any explicit support for such sharing in the system itself. Mackay shows how quickly people settle into a routine, with changes to rules (addition, deletion, modification) often prompted by outside forces, such as a new version of the system, or a break from routine work, or a visit from the Lens researcher. For many (though not all) users, even in this fairly brief evaluation period, months would go by without further changes to the rulesets. Mackay shows how both individual differences and task differences can have a big effect on the use of the system. People in control of their mail do not see the need to invest in learning a new system, while people already swamped feel that taking time out to learn the new system will only make matters worse. An interesting fact is that the initial idea of use was that Lens rules would be run automatically on the incoming mail, yet it turned out that people found it useful to run rules on a particular folder, which was possible through a debugging feature. One user found that this debugging feature allowed him to run rules after he had read his mail. Others liked this idea and began to do the same. This finding was subsequently reflected in the later releases of the technology, where there were possibilities of creating multiple rulesets that could be triggered by different events.
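The idea of several rulesets, each tied to a different triggering event, can be sketched as follows. This is an assumption-laden illustration rather than a description of how later Lens releases were built: the `RulesetRegistry` class, the event names "arrival" and "manual", and the dictionary representation of messages are all invented for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
Rule = Tuple[Callable[[Message], bool], str]    # (condition, target folder)


class RulesetRegistry:
    """Keeps rulesets grouped by the (hypothetical) event that triggers them."""

    def __init__(self) -> None:
        self._by_event: Dict[str, List[List[Rule]]] = defaultdict(list)

    def register(self, event: str, ruleset: List[Rule]) -> None:
        self._by_event[event].append(ruleset)

    def fire(self, event: str, messages: List[Message]) -> int:
        """Run every ruleset registered for `event`; return the number of filing actions."""
        moved = 0
        for ruleset in self._by_event.get(event, []):
            for msg in messages:
                for condition, folder in ruleset:
                    if condition(msg):
                        msg["folder"] = folder
                        moved += 1
                        break               # first matching rule wins for this message
        return moved


registry = RulesetRegistry()
# A ruleset the user chooses to run by hand, after reading the mail.
registry.register("manual", [(lambda m: m["list"] == "low-priority", "later")])

read_mail = [
    {"list": "low-priority", "folder": "inbox"},
    {"list": "project", "folder": "inbox"},
]
```

In this sketch nothing happens to `read_mail` when mail merely arrives, since no ruleset is registered for that event; only when the user calls `registry.fire("manual", read_mail)` is the low-priority message refiled, which mirrors the manual pattern of use that the field study observed.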

As Mackay notes: "Software does not remain static when it is introduced into an organisation. People in the organisation evolve their individual patterns of use, share them with each other, react to external changes, both technical and non-technical, and sometimes pro-actively modify the system to produce significant innovations." In the current discussion, these observations are important, as they show how ongoing evaluation of prototype Lens systems led to new ideas about the very conception of Lens as a tool, which could be factored into the next iteration of the system. In other words, the very conception of what Lens was, and how it was to be used, changed for the design team as they witnessed the way people actually used their prototype system. This provides a very powerful example of how important ongoing evaluation studies are as the original design idea becomes articulated and reified in a particular piece of software. On another level, it changes how we think of the design process, and the actors which it encompasses, as we see original ideas about what the tool is and how it can be used coming from the "users" themselves. It is for this reason that the term "user testing" is not one I favour, as it places unnecessary limits on the conception of design and reinforces its separation from use, which I have decried earlier in the paper. Users as designers, or at least co-designers, becomes more than simple rhetoric if we shift our perspective in this way.

Xerox PARC CoLab

This project involved building a computerised meeting environment to support small (2 to 6 people) face-to-face meetings. A special room was constructed containing several workstations connected on a local area network. A number of software tools were developed to allow users to jointly work on documents and share the same views on these documents (WYSIWIS - What You See Is What I See). The project has now ended. Stefik et al. (1987) describe the design goals of the project and some of the software tools. This system has been the subject of an interesting evaluation project (Tatar, Foster, & Bobrow 1991), which presents a very thorough analysis of problems in use of the original system, and their possible cause, together with recommendations for improving the design of one of the tools.

Many people in CSCW are of the opinion that CoLab was used and tested extensively by a variety of people over its lifetime. However, while extensive use of the system was made by the designers during the design process, it was only quite late in the project that any more thorough evaluation of the system with outside groups was performed. It is important to note here that both of the groups that were studied were familiar with the underlying interface technology, and also had worked as groups previously - overcoming two major methodological problems encountered in several other attempted evaluations of systems. These two experiments done by Tatar et al. showed up some serious problems, in that neither of the groups ended up using the shared computational workspace that was the core idea of the project. In one case, the group stopped using the system altogether and resorted to working together with a pad of paper. In the other, they "managed to find a successful way of using the tool by using the video network to look at the screen of whoever was typing, thus employing the shared video workspace instead of the shared computational workspace" (Tatar, Foster, & Bobrow 1991).

What is significant in this evaluation study, however, is the persistence of the investigators in providing a well grounded explanation, based on a conversational model of interaction, for why the two groups had problems. The major problems of users had to do with the visibility of certain operations and with problems in reference. Studies showed that people at times had problems in interpreting others' comments when their views of the shared world did not match. So, for example, if people re-sized or moved the shared window on the system, people's references to spatial locations might not always be appropriate for the other participants. This caused considerable disruption to the work of the group. The focus of work on this project would appear to have been more on examining the technical issues involved in developing software for the real-time computer support of groups than on an understanding of how people could or would use such a system in everyday work activities. To their credit, after enumerating these difficulties the investigators show how they used these findings as a basis for redesign of a number of key aspects of the system. Results of studies on the new system are noted as positive, though there is not much detail on them. One large question that arises from this work is whether the original conception of providing separate screens was a good one, as separate screens allow for loss of gaze and gesture information that turned out to create problems. Perhaps, tongue-in-cheek, it should be "Back to the Chalkboard" - in the sense of a common reference - rather than "Beyond the Chalkboard"?

 

DOMINO Office Procedure System

This prototype procedure system is interesting as the underlying model behind it has been extensively described in the literature and more recently evaluated informally by some of the design team (Kreifelts et al., 1991). The system makes a number of assumptions about the nature of office work, and provides "support" for a number of work activities. A working prototype has been developed and is in use in a research organisation, where initial studies of its use have been performed. The initial system model had been the subject of some criticism concerning its view on work activities, but what is interesting is to see what actually happens when the system is in use. While in certain respects, having designers themselves perform the evaluation is open to critique, what we are arguing for here is exactly that members of the design team do try to understand the use of their system as early in the design process as possible, in order to evaluate the effectiveness or otherwise of their proposed model. From the viewpoint of a formal evaluation methodology, their procedure is indeed problematic, but it is precisely such informal studies that I would encourage design teams to engage in as early and as frequently as possible in the design cycle, as but one aspect of ongoing evaluation.

The small internal study of Kreifelts and colleagues shows that, indeed, the system was seen as problematic on two grounds: insufficient flexibility, for example in not allowing necessary informal communication, and lack of integration with other tools, for example electronic mail and spreadsheets. While some of these difficulties could have been predicted without having to empirically test the system, such kinds of informal evaluations of experiences of use can have a powerful effect for the design team itself, who see with their own eyes some of the difficulties experienced by their users. The point is not that such systems have no future, but that we must take seriously the findings that people do not simply "follow procedures" in an office (Suchman, 1983), and thus office support must be very tailorable and flexible if it is to be of practical use to the people doing the work. Subsequent work on the DOMINO system is explicitly taking account of the difficulties experienced by the users of the prototype, and has led the designers to understand the need for a richer conceptual framework for understanding office activities. Supporting cooperative work with technology requires that we understand a lot about the details of how people together achieve a shared understanding; it is not something that passively occurs. The technology must support, or at least not hinder, the subtle activities that people engage in to accomplish the apparent orderliness of their work.

In sum, what is of particular interest here is that, while conceptual arguments about the veracity of the initial underlying model of DOMINO had been in existence for some time, it was as a result of a small simple empirical "evaluation" of the prototype system in a work setting that the design team, on reflection, began to re-conceptualise the system and re-design it.

 

Conclusion

The intent of this essay was twofold: to re-think the role of observation / evaluation studies in system design, providing a perspective that integrates use, design, and evaluation in the design process; and to explore briefly one or two examples of evaluation studies to note some methodological issues that affect their interpretation. We have attempted to demystify the methods and purposes of evaluations, and have emphasised the need for quick and dirty methods for informal evaluation, quite distinct from more formal studies that may be conducted well after the design has been frozen. Conceiving of design as being part of a larger and inevitable cycle of observing use, developing requirements (formal or informal), designing, building and again observing, allows one to plan from the outset for various forms of evaluation in the design process, from very informal to more formal studies. After each kind of evaluation, care should be taken in interpreting the results of such studies, as we have seen how certain empirical studies have been conducted in a manner which makes them very difficult to interpret. Finally, just as in other aspects of the design process, methods alone will never suffice, as there is always a place for common-sense in their application, and in the interpretation of the results.

 

Acknowledgments

Thanks to a number of people for commenting on a draft of this paper: Marina Jirotka, Yvonne Rogers, Jonathan Grudin, David Jennings, and anonymous referees. Not all of their suggestions are reflected in the present text, for a variety of reasons, but their comments and critique are being taken on board for future work.

 

 

References

Bair, J. & Gale, S. (1988) An Investigation of the Coordinator as an example of computer supported cooperative work. (unpublished ms.)

Bannon, L. (1991) From Human Factors to Human Actors: The role of psychology and human-computer interaction studies in systems design. In Greenbaum, J. & Kyng, M. (Eds.) (1991) Design at Work: Cooperative Design of Computer Systems. Hillsdale: Lawrence Erlbaum Associates, pp. 25-44.

Bannon, L. & Bødker, S. (1991) Beyond the Interface: Encountering Artifacts in Use. In J. Carroll (ed.) Designing Interaction: Psychology at the human-computer interface. (Cambridge: Cambridge University Press). 227-253.

Bannon, L. and O'Malley, C. (1984) Problems in Evaluation of Human-Computer Interfaces: A Case Study. In B. Shackel (Ed.) INTERACT 84 - Proceedings of IFIP Conference on Human-Computer Interaction, London, United Kingdom, September, 1984.

Bikson, T. (1988) Panel Discussion: Evaluations of Coordinator. Proceedings of CSCW'88, Portland, Oregon, Sept. 1988.

Blomberg, J., Giacomi, J., Mosher, A. & Swenton-Wall, P. (1993) Ethnographic Field Methods and Their Relation to Design. In D. Schuler & A. Namioka (Eds.) (1993) Participatory Design: Principles and Practices. Hillsdale, New Jersey: Lawrence Erlbaum Associates, pp. 123-155.

Bowers, J. & Churcher, J. (1988) Local and global structuring of computer mediated communication: Developing linguistic perspectives on CSCW in COSMOS. In Proceedings CSCW '88, Portland, Oregon, pp. 125-139.

Bullen, C. & Bennett, J. (1990) Learning from user experience with groupware. In Proceedings, CSCW '90, October, Los Angeles, CA, pp. 291-302.

Bødker, S. & Grønbæk, K. (1991) Cooperative Prototyping: Users and designers in mutual activity. In S. Greenberg (ed.) Computer-supported Cooperative Work and Groupware. London: Academic Press, pp. 331-356.

Carasik, R. & Grantham, C. (1988) A case study of CSCW in a dispersed organisation. In Proc. ACM CHI '88, 61-66.

Ehn, P. (1988). Work-oriented design of computer artifacts. Falköping, Sweden: Arbetslivscentrum /Almqvist & Wiksell International.

Floyd, C. (1987) Outline of a paradigm change in software engineering. In Bjerknes, G., Ehn, P. & Kyng, M. (1987) Computers and Democracy: A Scandinavian Challenge. Aldershot, UK: Avebury.

Greenbaum, J. & Kyng, M. (Eds.) (1991) Design at Work: Cooperative Design of Computer Systems. (Hillsdale, NJ: Lawrence Erlbaum Associates).

Greenberg, S. (Ed.) (1991) Computer-supported Cooperative Work and Groupware. London: Academic Press.

Grudin, Jonathan (1989) "Why groupware applications fail: problems in design and evaluation," Office: Technology and People, vol. 4, no. 3, 1989, pp. 245-264.

Grudin, J. (1991) Obstacles to user involvement in software product development, with implications for CSCW. Int. Journal of Man-Machine Studies, 34, 3, pp.435-452.

Henderson, A. & Kyng, M. (1991) There's no place like home: Continuing design in use. In J. Greenbaum & M. Kyng (Eds.) Design at Work: Cooperative Design of Computer Systems (New Jersey: Lawrence Erlbaum Assoc.), 219-240.

Henderson, A. (1991) A development perspective on interface, design and theory. In J. Carroll (ed.) Designing Interaction: Psychology at the human-computer interface. (Cambridge: Cambridge University Press). 254-268.

Johnson, B., Weaver, G., Olson, M., & Dunham, R. (1988) Using a computer-based tool to support collaboration: A field experiment. In Proceedings CSCW '86, Austin, Texas, pp.343-352.

Jordan, B. (this volume) Ethnographic Workplace Studies and CSCW.

Karat, J. Software Evaluation Methodologies. In M. Helander (ed.) Handbook of Human-Computer Interaction. Amsterdam: North- Holland, pp.891-903.

Kreifelts, T., Hinrichs, E., Klein, K-h., Seuffert, P. & Woetzel, G. (1991) Experiences with the DOMINO Office Procedure System. In Bannon, L., Robinson, M. & Schmidt, K. (Eds.) Proceedings of the Second European Conference on CSCW - ECSCW'91 (Dordrecht: Kluwer). pp. 117-130.

Levinson, S. (1983). Speech Act. Ch. 5 in Pragmatics. Cambridge, UK: Cambridge University Press.

Landauer, T. (1991) Let's get real: A position paper on the role of cognitive psychology in the design of humanly useful and usable systems. Book Chapter in J.M. Carroll (Ed.) (1991) Designing Interaction: Psychology at the Human-Computer Interface, pp.60-73. (New York: Cambridge University Press)

Mackay, W. (1990) Users and Customizable Software: A Co-Adaptive Phenomenon. Doctoral dissertation, Sloan School of Management, MIT.

Mackay, W.E., Malone, T., Crowston, K., Rao, R., Rosenblitt, D. & Card, S. (1989) How do experienced Information Lens users use rules? ACM CHI '89 Proceedings, Austin, Texas, 211-216.

Malone, T. Grant, K., Turbak, F., Brobst, S., & Cohen, M. Intelligent information-sharing systems. Communications of the ACM, Vol.30, No. 5, May, 1987, pp. 390-402.

Nielsen, J. (1989) Usability engineering at a discount. In G. Salvendy & M. Smith (eds.) Designing and Using Human- Computer Interfaces and Knowledge-Based Systems. Amsterdam: North-Holland, pp. 394-401.

Robinson, M. & Bannon, L. (1991) Questioning Representations. In Bannon, L., Robinson, M. & Schmidt, K.(Eds.) Proceedings of the Second European Conference on CSCW - ECSCW'91 (Dordrecht: Kluwer). pp. 219-233.

Schäl, T. (this volume). System Design for Cooperative Work in the Language/Action Perspective: A Case Study of The Coordinator in an Italian Company

Scriven, M. (1967) The Methodology of Evaluation. In R. Tyler, R. Gagne, & M. Scriven. Perspectives of Curriculum Evaluation. Chicago: Rand McNally, 1967, pp.39-83.

Stefik, M., Foster, G., Bobrow, D., Kahn, K., Lanning, S. & Suchman, L. Beyond the Chalkboard: Computer Support for Collaboration and Problem Solving in Meetings. Communications of the ACM, 30, 1, 32-47, 1987.

Suchman, Lucy A.: "Office Procedures as Practical Action: Models of Work and System Design," ACM Transactions on Office Information Systems, vol. 1, no. 4, October 1983, pp. 320-328.

Tatar, D., Foster, G. & Bobrow, D. (1991) Design for Conversation: Lessons from Cognoter. In S. Greenberg (Ed.) Computer-supported Cooperative Work and Groupware. London: Academic Press, pp. 55-79.

Thomas, J. & Kellogg, W. (1989) Minimizing ecological gaps in user interface design. IEEE Software, Jan. 1989, 78-86.

Whiteside, J. Bennett, J., & Holtzblatt, K. (1988). Usability Engineering: Our experience and evolution. In M. Helander, (Ed.), Handbook of Human-Computer Interaction (pp. 791-818). Amsterdam: North-Holland.

Whiteside, J. & Wixon, D. (1987) Improving human-computer interaction - a quest for cognitive science (Discussion). In Carroll, J.M. (Ed.) Interfacing Thought: Cognitive Aspects of Human-Computer Interaction. Cambridge: Bradford Press, 1987, pp. 337-352.

Winograd, T. (1986) A Language/Action Approach to the Design of Cooperative Work. In Proceedings CSCW'86, Austin, Texas. Reprinted in Greif (Ed.) (1988)
