Web Accessibility Evaluation Tools Only Produce 60-70% Correctness

[Figure: Completeness]

As you may already know, last week a critique appeared of a paper authored by three friends of mine, Markel Vigo, Justin Brown, and Vivian Conway; indeed, Markel and I are colleagues in the same laboratory. The critique was written by Karl Groves and concerns the paper entitled “Benchmarking web accessibility evaluation tools: measuring the harm of sole reliance on automated tests”, which was published at the W4A2013 conference.

I read this paper at the time, and I re-read it over the weekend to re-familiarise myself with its contents. I have not read Karl’s critique before commenting, as my intention is not to respond to the critique but to restate what the paper says and why that is important. Indeed, the title pretty much says it all; however, the abstract and the rest of the paper support the statements expressed.

So what does the paper say?

  1. Web accessibility evaluation can be seen as a burden which can be alleviated by automated tools.
  2. In this case the use of automated web accessibility evaluation tools is becoming increasingly widespread.
  3. Many, who are not professional accessibility evaluators, are increasingly misunderstanding, or choosing to ignore, the advice of guidelines by missing out expert evaluation and/or user studies.
  4. This is because they mistakenly believe that web accessibility evaluation tools can automatically check conformance against all success criteria.
  5. This study shows that some of the most common tools might only achieve between 60 and 70% correctness in the results they return, and therefore makes the case that evaluation tools on their own are not enough.

The paper therefore hinges on the accuracy of evaluation tools, and so the experiments use four experts, one of whom is a blind assistive-technology user, to assess a set of pages and establish a ground truth. This ground truth serves as the baseline against which the automated evaluation of the success criteria is compared. A number of tools are compared; in all cases they are found to under-perform.
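
As a rough, hypothetical illustration of the shape of this comparison (the paper defines its actual measures formally; the violation identifiers, numbers, and TypeScript names below are all invented), assume that correctness is the share of tool-reported violations the experts also flagged, and completeness is the share of expert-identified violations the tool manages to find:

    // A "violation" is identified here by success criterion plus locator;
    // the real study's data model is richer than this.
    type Violation = string; // e.g. "1.1.1@img#logo"

    // Number of violations on which the tool and the experts agree.
    function intersectionSize(a: Set<Violation>, b: Set<Violation>): number {
      let n = 0;
      a.forEach((v) => { if (b.has(v)) n++; });
      return n;
    }

    // Share of tool-reported violations that the experts also flagged.
    function correctness(toolReported: Set<Violation>, groundTruth: Set<Violation>): number {
      return toolReported.size === 0 ? 1 : intersectionSize(toolReported, groundTruth) / toolReported.size;
    }

    // Share of expert-identified violations that the tool found.
    function completeness(toolReported: Set<Violation>, groundTruth: Set<Violation>): number {
      return groundTruth.size === 0 ? 1 : intersectionSize(toolReported, groundTruth) / groundTruth.size;
    }

    // Invented numbers, purely to show the shape of the comparison: the experts
    // found four violations on a page, the tool reported three, two of which
    // match the ground truth.
    const experts = new Set<Violation>(["1.1.1@img1", "1.3.1@table1", "2.4.4@link3", "1.4.3@p7"]);
    const tool = new Set<Violation>(["1.1.1@img1", "1.3.1@table1", "4.1.1@div2"]);

    console.log(correctness(tool, experts));  // 2/3 ≈ 0.67
    console.log(completeness(tool, experts)); // 2/4 = 0.5

Replicating the study, as suggested below, simply means re-running this kind of comparison with different pages, different evaluators, or different tools.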

The data for this study is open and available, so, as per the scientific method, I am able to check empirically any assumptions made by the authors:

  • if I do not like the webpages selected, I may pick different ones;
  • if I do not feel that the evaluators are truly experts, I may evaluate the webpages myself;
  • if I do not agree with the choice of evaluation tools, then I could select a new set.

None of these changes would affect your ability to test, and potentially refute, the contribution of this paper, which is that web accessibility evaluation tools in and of themselves will only produce 60 to 70% correctness. All other data supplied with this paper (and often not supplied at all by authors preferring anecdotal evidence) is there only to remove the need for interested parties to re-collect data over all stages of the experiment.

  • In the most expansive interpretation, the primary concern is the percentage correctness of the tools with regard to meeting success criteria when evaluating webpages. Indeed, we do not need to believe that the paper is generalisable because the method enables us to test it ourselves regardless of the pages selected, the evaluators performing the study, or the tools chosen.
  • In the most limited interpretation, the percentage correctness of the tools selected, on the pages selected, at the date and time of evaluation, was 60-70% when compared with the ratings of the evaluators; this constrained claim is all that the supplied data is required, or intended, to validate.

We might, of course, in the future evolve more intelligent tooling which may refute this work. THIS IS SCIENCE!

I suggest that anyone wishing to disagree with this work perform the same experiments; once that data is collected, we will have a basis for discussion and critique, as opposed to anecdotal thought experiments.

3 thoughts on “Web Accessibility Evaluation Tools Only Produce 60-70% Correctness”

  1. Simon,

    Good post, and thank you for the response. It is unfortunate, however, that you didn’t read or respond to what I wrote. It is also unfortunate that the paper’s authors have similarly chosen not to respond directly to my statements. The blanket response “well, just replicate it” is an attempt at dodging my response and my criticisms of the paper (which, again, you admittedly haven’t read). Furthermore, there’s little use in attempting to perform the same experiments when the conclusions presented have nothing to do with the data.

    You said:
    “Web accessibility evaluation can be seen as a burden which can be alleviated by automated tools.”
    Actually, they don’t say that.

    “In this case the use of automated web accessibility evaluation tools is becoming increasingly widespread.”
    No data is supplied for this at all.

    “Many, who are not professional accessibility evaluators, are increasingly misunderstanding, or choosing to ignore, the advice of guidelines by missing out expert evaluation and/or user studies.”
    No data is supplied for this at all.

    “This is because they mistakenly believe that web accessibility evaluation tools can automatically check conformance against all success criteria.”
    No data is supplied for this at all.

    “This study shows that some of the most common tools might only achieve between 60 and 70% correctness in the results they return, and therefore makes the case that evaluation tools on their own are not enough.”

    Of all the things you said, this is the only thing actually backed by the data from the paper. Literally everything else is a case of affirming the consequent.

    The data that they do present is very compelling and matches my own experience. The significant amount of variation between the tools tested was pretty shocking as well, and once you get past the unproven, hyperbolic claims, it is very interesting.

    If this paper’s authors were to gather and present actual data regarding usage patterns (re: the claim that “the use of automated web accessibility evaluation tools is becoming increasingly widespread”) then I wouldn’t be so critical. There is no question that the data needed to substantiate this and similar statements simply isn’t supplied.

    Finally, I’d like to address the statement “evaluation tools on their own are not enough”. As I say in my blog post, this is so obvious that it is hardly worth mentioning. No legitimate tool vendor claims otherwise. I’ve been working as an accessibility consultant for a decade. I’ve worked for, alongside, or in competition with all of the major tool vendors and have never heard any of them say that using their tool alone is enough. Whether end users think this or not is another matter. Again, it’d be great if the paper’s authors had data to show this happening, since they claim that it is.

    The implication from this paper is that because tools do not provide complete coverage, they should not be used. This is preposterous and, I believe, born of a lack of experience outside of accessibility and a lack of experience in a modern software development environment. Automated testing, ranging from basic static code linting, to unit testing, to automated penetration testing, is the norm, and for good reason: it helps increase quality. But ask any number of skilled developers whether “passing” a check by JSHint means their JavaScript is good and you’ll get a universal “No”. That doesn’t stop contrib-jshint from being the most downloaded Grunt plugin (http://gruntjs.com/plugins). Ask any security specialist whether using IBM’s Rational Security is enough to ensure a site is secure, and they’ll say “No”. That doesn’t diminish its usefulness as a tool in a mature security management program.
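
    To make that analogy concrete, here is a small hypothetical example (TypeScript rather than JavaScript, and not taken from any tool or paper): the function below is well-formed and would pass a linter and the type checker, yet its logic is wrong in a way that only a human reviewer, or a test written by one, would catch.

        // WCAG 2.0 defines the contrast ratio between two colours as
        // (L1 + 0.05) / (L2 + 0.05), where L1 is the LIGHTER relative luminance.
        // This version forgets to order its arguments, so it can return values
        // below 1 for perfectly valid input; being well-formed and fully typed
        // is all an automated check can see.
        function contrastRatio(luminanceA: number, luminanceB: number): number {
          return (luminanceA + 0.05) / (luminanceB + 0.05); // bug: should divide lighter by darker
        }

        console.log(contrastRatio(1.0, 0.0)); // 21: looks plausible
        console.log(contrastRatio(0.0, 1.0)); // ~0.05: nonsense, yet no linter objects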

    Perhaps what we need most in terms of avoiding an “over-reliance” on tools is for people to stop treating them like they’re all-or-nothing.

  2. Pingback: Affirming the Consequent – Karl Groves

  3. I think the actual paper says “Most tools exhibit high levels of correctness that range between 93-96% except for TAW and TotalValidator, which yield 71% and 66% respectively” (Section 4.3, just after Table 6).

    In the paper “Brajnik, G., Yesilada, Y., and Harper, S. 2012. Is accessibility conformance an elusive property? A study of validity and reliability of WCAG 2.0”, my reading is that the mean correctness for human expert evaluators is 75% (Section 4.3.2: “For experienced evaluators, C ranges from 0.48 to 1.00 with a mean of 0.75”).

    My take is that tools, human evaluation and user testing are all complementary – and none should be used in isolation:

    • most of the tools surveyed can evaluate a large set of pages quickly and produce results with few false positives – but they aren’t complete (tools give breadth; a minimal sketch of such a check follows this list).
    • human experts have better completeness, but the number of pages that can be evaluated is limited because humans are much slower at evaluating pages and in some cases are much less correct (human expert evaluation gives depth).
    • user testing should have perfect correctness (any problem found by a user is a problem, though it might not map to a WCAG success criterion), but user testing a site with critical barriers is a waste of time at best and dangerous at worst (e.g. user testing flashing content which might trigger photosensitive epilepsy).
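
    As a purely hypothetical illustration of that breadth-versus-depth trade-off (a sketch, not taken from the paper or from any particular tool): an automated check can reliably flag images with no alt attribute at all, but it cannot judge whether the alt text that is present actually describes the image.

        // Hypothetical minimal automated check: flag <img> elements that have no
        // alt attribute. It runs quickly over any number of pages and produces few
        // false positives, but it cannot tell that alt="chart.png" is useless;
        // only a human evaluator or a user can judge that.
        function findImagesMissingAlt(html: string): string[] {
          const images = html.match(/<img\b[^>]*>/gi) || [];
          return images.filter((tag) => !/\balt\s*=/i.test(tag));
        }

        const page = [
          '<img src="chart.png">',                 // flagged: no alt at all
          '<img src="chart.png" alt="chart.png">', // passes the check, alt is useless
        ].join("\n");

        console.log(findImagesMissingAlt(page)); // [ '<img src="chart.png">' ]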

    Best Regards
    Mark
