I recently came across a paper discussing the evaluation of user interface systems. In it, the author proposes that complex user interface systems and architectures do not readily yield to the research methods we currently use. It was at this point that I started to bristle with derision in a very defensive,
“I’m a research scientist and the scientific method says that we must have objective measures to express an accurate understanding!”
… kind of a way. Interestingly, this work is actually a placeholder for a panel session conducted at UIST 2007 and is therefore more discursive, intended to some extent to produce a reaction, such as mine, from CS-facing Human Factors researchers. But the thread is picked up again in a peer-reviewed submission to CHI 2008.
Briefly, Olsen suggests that complicated interactive systems are not amenable to standard evaluation methods, and describes this in terms of three problems: the usability trap, whereby evaluations give inconsistent results because they require the user to “walk up and use” the system (as opposed to learning it by experience), use standardised tasks, and demand a task completion time of between one and two hours; the fatal flaw fallacy, which suggests that evaluating for fatal flaws in systems research is not possible because it would take a small team a very large amount of time to check each possible flaw; and legacy code, which suggests that the requirement to support legacy code is often a barrier to progress. Further, Olsen goes on to suggest that instead of direct evaluation we should couch our systems development in terms of key factors such as: importance; whether the problem has not previously been solved; whether the solution is generalisable; whether the tool assists in solution development or empowers new design participants; or whether there is some enhanced combinatorial effect.
Olsen makes some quite interesting points here, even though in some cases the argumentation switches between developers and end users, and the aspects which Olsen considers key, such as importance, are subjective: what he considers important I may not. It seems to me that Olsen’s arguments suffer from a mismatch in granularity and a conflation of the engineering and scientific domains (a conflation which is common in computer science). At the high level Olsen is, in my opinion, absolutely correct – visionary interfaces and interactive systems cannot be tested without first being created; indeed, the Web is a prime example of this. The first papers on the World Wide Web were rejected from the Hypertext Conference, even though it was a working system, because they did not show a clear improvement over other closed-world systems such as Microcosm. However, its importance and validity were supported once the system was deployed. In reality this was a visionary idea made real by engineering, but in my opinion it was not scientific research, because Berners-Lee’s work was not focused on addressing our understanding of why or how the Web was ‘better’; these being two of the primary concerns of science.
This set me thinking about whether new and innovative interfaces such as “Sugar” – an interface I really like, and the one designed for the One Laptop per Child project – would fare well at a research conference or under scientific peer review. In reality, this comes down to the arguments made with regard to the system. In the case of Sugar, if the design is influenced by existing work in the field then this obviously needs to be discussed within any research paper produced; it seems to me that this would be enough to justify its inclusion in any research proceedings focused on interactive systems science and technology. Without this well-founded viewpoint, however, the justification for discussing a system within a research setting seems to be lacking. Next, the claims made for the system need to be discussed: with regard to Sugar, the creator suggests that it is designed to assist learning and collaboration, as well as communication between children. If this claim is made then there needs to be some way for it to be supported, regardless of whether the system is actually based on previous third-party work. This is the point at which an evaluation must be made, with specific reference to the granularity of the claims of the system’s creators.
It seems to me that Olsen’s panel paper is an interesting first step towards understanding how we can assess the contribution of large-scale interactive systems when our evaluation tools are necessarily based on often small-scale or incremental additions to knowledge. Some of the rationale proposed and the use cases discussed need further elaboration, and input from third parties interested in this research domain would only be beneficial. However, if the work is based on well-founded empirical third-party research then its expression as a system can only be applauded. If the system is based on tacit knowledge or its authors’ “belief”, then it has little to differentiate it from any other subjective opinion, and so its merit as a research artefact cannot be well supported. In this case there is a danger of stifling innovation, as was highlighted in the opening keynote of this year’s HT Conference [3], which resulted in some rapidly created panel sessions and BoFs. However, given the conflation of science and engineering in computer science, I’m not really sure how we accommodate both unless we have tracks specifically for each at the same venue, but with different review criteria.
I’ve been ‘banging around’ this post for the last couple of months on and off, and just today (28th June 2010) found this nice post on ‘Woo Fighters‘ via Research Blogging Psychology, based on [4]. In this light, both papers [1, 2] discussed above encourage some tendencies highlighted as possibly unscientific in [4]. However, I see this as the obvious divide we suffer in Human Factors work between creation (engineering) and discovery (science).
- [1] Olsen, Jr., Dan R. (2007). Evaluating user interface systems research. UIST ’07: Proceedings of the 20th Annual ACM Symposium on User Interface Software and Technology, 251–258. http://dx.doi.org/10.1145/1294211.1294256
- [2] Greenberg, Saul and Buxton, Bill (2008). Usability evaluation considered harmful (some of the time). CHI ’08: Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, 111–120. http://dx.doi.org/10.1145/1357054.1357074
- [3] Dillon, Andrew (2010). As we may have thought, and may (still) think. HT ’10: Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, 1–2. http://dx.doi.org/10.1145/1810617.1810619
- [4] Lilienfeld, Scott O., Lynn, Steven Jay, Lohr, Jeffrey M., and Tavris, Carol (2003). Science and pseudoscience in clinical psychology: Initial thoughts, reflections and considerations. In Science and Pseudoscience in Clinical Psychology. ISBN 1-57230-828-1.