“Simply, there is no way that any human factors work can maintain external validity with a single study; the participant numbers are just way too low – even when we have 50 or 100 users – meaning that the sample is just too heterogeneous.”
I was really pleased recently to receive this (partial) review from ‘Computers in Human Behavior‘. It really seems like this reviewer actually understands the practicalities of Human Factors work. Instead of clinging to old and tired statistical methods better suited to large-scale epidemiology or sociology studies, this reviewer simply gets it:
The question is whether the data and conclusions already warrant reporting, as clearly this study in many respects is a preliminary one (in spite of the fact that the tool is relatively mature). Numbers of participants are small (8 only), numbers of tasks given is small (4 or 6 depending on how you count), the group is very heterogeneous in their computer literacy, and results are still very sketchy (no firm conclusions, lots of considerations mentioned that could be relevant). This suggests that one had better wait for a more thorough and more extensive study, involving larger numbers of people and test tasks, with a more homogeneous group. I would suggest not to do so. It is hard to get access to test subjects, let alone to a homogeneous group of them. But more importantly, I am convinced that the present report holds numerous valuable lessons for those involved in assisting technologies, particularly those for the elderly. Even though few clear-cut, directly implementable conclusions have been drawn, the article contains a wealth of considerations that are useful to take into account. Doing so would not only result in better assistive technology designs, but also in more sophisticated follow-up experiments in the research community.
Thanks mystery reviewer!
But this review opens up a wider discussion. Simply, there is no way that any human factors work – ANY HUMAN FACTORS WORK – can maintain external validity with a single study; the participant numbers are just way too low and too heterogeneous, even when we have 50 or 100 users. To suggest anything different is both wrong and disingenuous; indeed, even for quota-based samples – just answering a single simple question – we need in the order of 1,000 participants. I would rather have two laboratory-based studies, using different methods, from different research groups, each with 10 participants, come to the same conclusions than have a single study of 100 people. In HCI/Human Factors we just end up working with too few people for concrete generalisations to be made – do you think a sample of 100 people is representative of 60 million? I’m thinking not. And, what’s more, the type of study will also fox you: perform a lab experiment which is tightly controlled for confounding factors and you only get internal validity; do a more naturalistic study which is ‘ecologically’ valid, and you have the possibility of so many confounding variables that you cannot get generalisability.
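(For a sense of where that ‘order of 1,000’ figure comes from, here is a back-of-the-envelope sketch in Python using the textbook sample-size formula for estimating a single proportion; the margins of error and the 95% confidence level are illustrative values of mine, not taken from any particular survey.)

```python
# Rough sketch: sample size needed to estimate a single yes/no proportion
# to a given margin of error. Worst case is p = 0.5; z = 1.96 is the usual
# 95% confidence multiplier. Simple random sampling, no finite-population
# correction. Values are illustrative only.
import math

def sample_size_for_proportion(margin_of_error, z=1.96, p=0.5):
    """n = z^2 * p * (1 - p) / e^2"""
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

print(sample_size_for_proportion(0.03))  # +/- 3% at 95% confidence -> about 1,068
print(sample_size_for_proportion(0.05))  # +/- 5% at 95% confidence -> about 385
```

And that is for one simple question with no covariates – nothing like the richness of a typical human factors study.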
‘But surely’, I hear you cry, ‘Power Analysis will save us!’ (statistical power). ‘We can use power analysis to work out the number of participants, then work to this number, giving us the external validity we need!’ – Oh, if only it were so easy [1]. In reality, statistical power is the probability of detecting a change given that a change has truly occurred; for a reasonable test of a hypothesis, power should be greater than 0.8. A power of 0.9 translates into a 10% chance that we will fail to detect a change when one has truly occurred. But power analysis (normally) assumes an alpha of 0.05; the larger the sample, the more accurate the estimate; and the bigger the effect size, the easier it is to find. So, again, a large sample looking for a very easily visible (large) effect, and without covariates, gives better results. But these results are all about accepting or rejecting the null hypothesis – which always states that the sample is no different from the general population, or that there is no difference between the results captured from ‘n’ samples – presuming that the base case is a good proxy of the population (randomly selected), which it may not be [2].
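To make those trade-offs concrete, here is a minimal a-priori power calculation sketch, assuming the statsmodels library and a plain two-sample t-test; the effect sizes and the n = 8 figure are purely illustrative and are not the design of the study reviewed above.

```python
# Minimal sketch of an a-priori power calculation for a two-sample t-test
# (alpha = 0.05, two-sided). All numbers here are illustrative.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many participants per group do we need to reach power = 0.8?
for d in (0.2, 0.5, 0.8):  # Cohen's 'small', 'medium', 'large' effect sizes
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8)
    print(f"effect size d = {d}: ~{n:.0f} participants per group")

# Conversely: with only 8 participants per group, what power do we have
# to detect even a 'large' effect?
p = analysis.solve_power(effect_size=0.8, nobs1=8, alpha=0.05)
print(f"n = 8 per group, d = 0.8: power is about {p:.2f}")
```

The small-effect case lands in the hundreds per group, and the tiny-sample case gives power well below the conventional 0.8 – which is exactly why a single small study cannot carry external validity on its own.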
So is there any point in empirical work? Yes: internal validity suggests applicability to the general population. What’s more, one internally valid study is a piece of the puzzle; bolstered by ‘n’ more small but internally valid studies, it allows us to make more concrete generalisations.
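As a toy illustration of that ‘pieces of the puzzle’ argument, Fisher’s method combines the p-values of several small, independent studies into a single test; the p-values below are invented for the example, not taken from any real studies.

```python
# Minimal sketch of Fisher's method for combining p-values from several
# small, independent studies of the same effect. The p-values are made up.
import math
from scipy.stats import chi2

p_values = [0.09, 0.11, 0.04]  # three small studies, none 'significant' alone
statistic = -2 * sum(math.log(p) for p in p_values)
combined_p = chi2.sf(statistic, df=2 * len(p_values))
print(f"combined p = {combined_p:.3f}")  # roughly 0.016 for these illustrative values
```

Three studies that are individually ‘non-significant’ but point the same way can, taken together, be quite persuasive – which is the sense in which several small, internally valid studies from different groups beat one large one.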
References
- Schulz, K., & Grimes, D. (2005). Sample size calculations in randomised trials: mandatory and mystical The Lancet, 365 (9467), 1348-1353 DOI: 10.1016/S0140-6736(05)61034-3
- Jocob Cohen (1977). Statistical power analysis for the behavioural sciences Academic Press, New York, USA Other: http://books.google.co.uk/books?hl=en&lr=&id=Tl0N2lRAO9oC&oi=fnd&pg=PR11&dq=Statistical+power+analysis+for+the+behavioural+sciences&ots=dpZFQhgXYw&sig=rep2pgNKtp_2aKozGHZNtIumngg#v=onepage&q&f=false