I discussed the first RDWG Symposium on Web Accessibility Metrics last week; here, as promised, is the full and un-edited transcript!
So without further ado, I’d like to give the floor to the symposium chairs, Giorgio and Markel, to maybe just give a few introductory words, just a minute or two, and then we’ll get started with moderating the individual sessions. And again, instructions on how the logistics of this Online Symposium will work will be provided as we go, step by step.
So Giorgio and Markel, please go ahead if you have any introductory welcome words.
>> GIORGIO BRAJNIK: Thank you, Shadi. I am Giorgio Brajnik. Together with the other editors of this meeting, Markel Vigo from Manchester University and Joshue O Connor, I have organized this Online Symposium on Web Accessibility Metrics.
Our work has been substantially supported by other members of the Research and Development Working Group of the W3C, a group that is led by Simon Harper and nicely nurtured by Shadi Abou-Zahra. They did a very good job setting up the logistics.
This is the first meeting on research and development aspects related to the Web in general, and so we hope that it is the initial event of a long stream of events that should become more and more successful and more and more useful.
We have chosen to have the first one deal with Web accessibility metrics because we think that it is a topic that’s becoming increasingly important from an application viewpoint, but it is also a topic that is ridden with many different traps and pitfalls. With this symposium, we would like to start drawing a roadmap of the different aspects and different questions that should and could be tackled by practitioners, by researchers, and by people in industry, so as to shed some light on this topic.
And we started with these 11 papers, 11 contributions that we’ll be listening to today, because we think that they help us highlight many of the hot topics in Web accessibility metrics.
So I’d like to give the floor now to Markel, if Markel — Markel Vigo, then Joshue O Connor, then we’ll start with the presentations.
>> MARKEL VIGO: It’s Markel Vigo from the University of Manchester. Very happy to be here today.
Just a few things to add to what Giorgio said. I would say that we started working on this group in August. We’ve been trying to start a list of procedures for these meetings, and it’s a way to gather practitioners, researchers, and all the stakeholders in a common topic that we all find interesting and timely.
We are happy because we thought it was a narrow research area, and we got more submissions than we expected, so that means that people are interested. This is also supported by the number of people that joined the call. We ran out of telephone lines in less than 24 hours. So we are very happy about that.
I mean, I know we are running out of time, so I will give the floor to Josh O Connor from Ireland if he wants to add something else.
>> Josh, you might be muted locally. We cannot hear you, Josh.
Josh, one more try.
Okay. Sorry about that.
>> Shall we start with the presentations, then? The first speaker comes from Buenos Aires and I think is Maia Naftali, and she will tell us about the integration of Web accessibility metrics into semi-automatic evaluation process dealing with noisy data, errors of tools, and so on. Maia, are you there?
>> SHADI ABOU-ZAHRA: So just to explain the logistics a little bit, Maia, your line is still muted. So unfortunately, we need to be very tight with time, especially now due to the delayed start. So every speaker has pretty much exactly five minutes. You will hear a beep when the five minutes run out. Please make sure to end before five minutes. We do expect that all participants have read all the extended abstracts. The slides are online on the agenda page with the extended abstracts at the end of that. You need to open the slides locally. Some slides were just uploaded this morning or just shortly before this session, and some presenters have chosen not to use slides. So for each presenter, you need to open your slides locally.
For the presenters, please make sure to announce the slide number that you’re on and when you’re switching slides so that people can follow you. Please make sure to speak slowly and clearly so that the captioners can make sure everything is recorded well.
So without further ado, Maia, you will be the first to go. Your line is now unmuted, and please go ahead, give us your presentation of five minutes. Thank you.
>> MAIA NAFTALI: Hi, everyone. Can you hear me?
>> SHADI ABOU-ZAHRA: Yes, you’re audible.
>> MAIA NAFTALI: Yes, I am starting. I will talk about experiments I have done about integrating metrics (audio breaking up.)
I guess you all scanned my words, but if you want — (Inaudible) — slides.
The first what for? Software engineering and assurance standpoint, metrics (Inaudible). It’s important to have a measure. I believe it’s important to calculate (audio breaking up)
I will talk about the difficulties found (Inaudible) — but not enough for large-scale evaluation. And then extra parameters. (Inaudible) — apart from the research.
Use of criteria and — (Inaudible) — AAA — (Inaudible) — some ideas to make this approach. First, the categorization. Here I have to develop a bit more. We need the scope of metrics. We can include them in automatic (Inaudible).
I have derived a basic metrics, only used — (Inaudible) — automatically. (Inaudible)
Then the semantic — (audio breaking up) — pragmatic measurements and tests.
Limit the scope of the metrics input will facilitate their inclusion, tools and evaluations, and therefore, any evaluation tools could calculate metrics at a known level of accuracy.
I think this is so. (Inaudible)
>> SHADI ABOU-ZAHRA: Maia? Hello, Maia?
>> MAIA NAFTALI: Yes, I’m here.
>> SHADI ABOU-ZAHRA: Yeah, your line has been very difficult to hear, also the captioner was not able to hear most of what you said. Did you finish? At the end now it was silent suddenly.
>> MAIA NAFTALI: Do you want me to repeat — (Inaudible)
>> SHADI ABOU-ZAHRA: No. So could you remind us on which slide you are on right now?
>> MAIA NAFTALI: (Inaudible).
>> SHADI ABOU-ZAHRA: The last slide. Could you maybe get closer to your microphone, speak a little bit louder, and just give us the final statement. Thank you.
>> MAIA NAFTALI: Sorry. I apologize for the inconveniences. To conclude, I believe that if we limit the scope of the checkpoint steps (Inaudible) assess, we could facilitate how to incorporate them. Therefore, any evaluation to calculate the metrics with known level of (Inaudible).
I’m sorry if you haven’t heard me.
>> SHADI ABOU-ZAHRA: Thank you very much, and apologies, everyone, for this. I guess the downside of having an Online Symposium is that phone lines are sometimes not at their best. On the other hand, we do save quite a bit of travel costs, I think.
So the next presentation is for paper 2, measuring accessibility barriers on large-scale sets of pages, going to be presented by Sylvia Mirri. So Sylvia, your line should now be unmuted. Could you try to test?
Hello, Sylvia, can you hear us?
Sylvia, there is a bit of crackling on your line, but we’re not able to hear you.
Okay. I suggest we maybe continue with the next speaker, and then come back to Sylvia a little bit later. Sylvia, maybe try to redial and just make sure that your microphone is connected.
Let me just try once more to unmute. Okay. Sylvia, can you try once more?
>> SYLVIA MIRRI: Can you hear me? Can you hear me?
>> SHADI ABOU-ZAHRA: Yes, now we can hear you, so apologies for this. Please go ahead. You have five minutes. Thank you.
>> SYLVIA MIRRI: Okay. You’re welcome. Can you hear me now?
>> SHADI ABOU-ZAHRA: Yes, please go ahead.
>> SYLVIA MIRRI: Okay. Perfect. Okay. I’m sorry for that. Okay. Well, I’m Sylvia Mirri, an assistant professor at the University of Bologna in Italy. In our department with my research group, we are working on the development, the design and the development of an Accessibility Monitoring Application, and our idea is to provide simply an automatic way to monitor accessibility level of large set of websites.
So we are working on slide 1, and we are working on different set of for accessibility tools, and we would like to provide tools to periodically and automatically evaluate and to monitor accessibility level of websites.
In order to summarize results and try to provide a means to quantify the level of accessibility barriers, we have defined a metric we called the BIF metric. I am on slide 2. The BIF acronym stands for barrier impact factor. The idea to summarize the results collected by the automatic monitoring application arises from the barrier walkthrough.
So our idea was to provide an application, full automatic application, and we tried to divide — to divide all the accessibility errors in terms of how they can affect users by means of assistive technologies.
So we have computed the BIF metric. You can take a look at the formula on slide 3. The BIF metric is computed by summing up, over each error i, the number of occurrences of error i multiplied by the weight of i. So our idea was to assign a weight to each error according to the importance of that error, and we have also grouped users according to the assistive technologies they use, according to their disabilities.
So I’m on slide 4. Our monitoring system shows evaluation results based on the BIF metric, but also simple results based on different sets of guidelines. And at this first level of the prototype, we have provided weights just to certain errors, not to errors which can be detected only by manual evaluation. This is one of the major difficulties, maybe: trying to provide a weight to each error, and also to provide a weight to manual evaluation errors.
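[Illustrative sketch, not part of the spoken presentation. The weighted sum described here could be written in Python roughly as follows. The error names, the weights, and the choice to treat an error without a weight as contributing zero are all assumptions made for the example; the actual formula is on the speaker's slide 3.]

```python
from typing import Mapping


def barrier_impact_factor(
    error_counts: Mapping[str, int],
    weights: Mapping[str, float],
) -> float:
    """Sum over errors i of (occurrences of i) * (weight of i).

    Errors that have no weight for this user group contribute 0
    (an assumption of this sketch, not stated in the talk).
    """
    return sum(
        count * weights.get(error, 0)
        for error, count in error_counts.items()
    )


# Hypothetical error counts found on one page by an automatic checker.
counts = {"missing_alt": 4, "unlabelled_field": 2, "low_contrast": 1}

# Hypothetical weights for one user group (screen reader users).
screen_reader_weights = {"missing_alt": 3, "unlabelled_field": 3, "low_contrast": 0}

print(barrier_impact_factor(counts, screen_reader_weights))
```

Keeping one weight table per assistive technology group would give one BIF value per group for the same page, which matches the grouping of users the speaker describes.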
So this is just the first issue of these metrics. The AMA system, it’s at a farther level. This is not just a prototype right now. You can find a prototype version on (Indiscernible) because it is an open-source system, an open-source application. So if you want, you can take a look.
Our main topic, our main goal, was to provide a way to measure the accessibility level and the impact of accessibility errors based on a tool according to user-assistive technologies. That’s all.
>> SHADI ABOU-ZAHRA: Thank you very much. Thank you, everybody, for keeping the time so far. We are still behind schedule due to the late start.
I just want to remind everybody who has not entered their personal ID code to actually do so. We need to identify your line. You need to type in the number 43 and then the 5-digit code provided to you in an email. This is very important so we can identify your line so we can unmute you when you need to speak up. Otherwise, this will not be possible.
So our next presentation is from Nadia Fernandes on a template-aware Web accessibility metric, and just a sec. Let me unmute you, Nadia.
Just one second. Okay.
>> NADIA FERNANDES: Hi.
>> SHADI ABOU-ZAHRA: Yes, welcome, Nadia. Your line is unmuted. You have five minutes. Please go ahead.
>> NADIA FERNANDES: Hi. (Inaudible) (loud background noise)
From the developer’s perspective —
>> SHADI ABOU-ZAHRA: Nadia, sorry to interrupt you. Nadia, sorry to interrupt you. Could you speak a little bit more slowly and maybe — and maybe speak up a little bit. Get closer to your microphone.
>> NADIA FERNANDES: Okay. From the developer’s perspective, there’s accessibility evaluation tools provide (Inaudible) results. The same error is constantly repeating and confuses developers.
(Audio breaking up)
— same issue, therefore, common accessibility methods may be misleading. So our research question. What is the impact of (Inaudible) — how can we (Indiscernible) — accessibility problems (Inaudible) developers need. Finally, how can we devise a metrics that answers the (Inaudible) for accessibility guidelines from the point of view of the effort to (Inaudible)
So our strategy. To set in place to fulfill the use of (Inaudible) needed five common elements amongst the (Inaudible) we used an automatic (Inaudible) developed by (Inaudible). Modified to consider the algorithms.
We performed a study comparing — (audio cutting out)
Indication of the percentage of accessibility issues (Inaudible) so that we can verify (Inaudible) play an important role in accessibility.
Be consequently, problems (Inaudible) —
So our metric considered the specific (Inaudible) — problems of the entire (Inaudible) it has considered the (Inaudible)
This way developer can know if the effort to correct disability problems of the webpage is working. (Inaudible) — quantitative assessment (Inaudible) metrics.
(Audio breaking up)
Considering all the facts.
The average (Inaudible) for accessibility outcomes is approximately 35% (Inaudible) — will be addressed by (Inaudible)
We could see a decrease of one-third (Inaudible) corrected on average.
It would be even high per you consider more than the homepage. (Inaudible) — slowly, clearly, developers (Inaudible)
So I’m finished.
>> SHADI ABOU-ZAHRA: Thank you very much, and again, your line was a little bit crackly, it was a bit loud, but thank you very much for your contribution. I apologize to everyone. The slides were wrong. Actually, we do not have slides from Nadia Fernandes’ contribution. The slides that were posted were from paper 5. We will be using them later on. So let’s move ahead to the next presentation called a metrics to make different DTDs documents evaluations comparable, and the presentation will be given by Sylvia Mirri, so Sylvia, I’ll unmute you again. Please make sure to speak slowly and clearly and to keep the time. Thank you very much.
>> SYLVIA MIRRI: Hello, everybody. Now I am going to show you some issues about another metrics we have defined, trying to address a problem related to the Italian regulation about Web accessibility.
In Italy, we have a law that requires that public administration websites have to be compliant to a strict DTD. So we have defined this metric to try to answer the question: Is it possible to compare a webpage compliant to a strict DTD, or maybe to a transitional DTD, to another one? Just to provide a way to compute the accessibility level according to the Italian regulation.
So I’m on slide 1. We have defined these metrics to try to compare these different webpages.
On slide 2, we are issuing — sorry — we issued experimental approach. We tried to define these metrics to quantify the expected errors on a target DTD from a different one. And tried to evaluate three different properties, which are the validity, the strictness, and the markup quality.
And then we have tried to generalize this approach just to compare documents with different DTDs.
I’m on slide 3 now. We have computed different formula just to evaluate the expected errors in a page, and a page without errors, just to try to quantify errors and manual evaluation related to DTD noncompliance.
I’m on slide 4 now. We have carried out some experimental evaluations on a sample of 1,000 websites, and we have found different problems in Italian public administration websites, and we have found major difficulties in defining these estimated errors. But we have worked on a large number of websites.
Actually, the experiment we have conducted is based on the — the monitor prototype, and so we have found a way just to compare different grammar documents and try to evaluate the accessibility according to the Italian regulation.
So that’s all.
>> SHADI ABOU-ZAHRA: Thank you very much.
>> SYLVIA MIRRI: You’re welcome.
>> SHADI ABOU-ZAHRA: Okay. We go to the next slide, for which you have already seen — some of you may have already seen the slides. If you actually refresh the agenda page, if you hit refresh in your browser, you should actually have the correct link in the correct paper now, so we are at paper 5, lexical quality as a measure for textual Web accessibility.
The speaker is Luz Rello. I will unmute your line now. Hello. Can you hear us?
>> LUZ RELLO: Yes, I can hear you.
>> SHADI ABOU-ZAHRA: Okay. Please speak loudly but slowly, and announce the slide that you’re on. You have five minutes. Go ahead, please.
>> LUZ RELLO: Thank you. Can you hear me? Yes? Okay. I am now on slide 2. Our objective is the measurement of lexical quality in the Web. Lexical quality, as you all know, is not a Web accessibility metric itself, but we thought that it has an impact or that it can be included in Web accessibility metrics because lexical quality has an impact on understanding the text. So lexical quality is the representational aspect of the textual content, the quality degree of words in a text: spelling errors or typos.
Since Web content has to be perceivable or understandable, this could be useful or of interest.
So I am now on slide 3. The strategy we approach to measure lexical quality, first, defining a metric for lexical quality and designing a sample, and with this sample, we could measure samples of Web and measure what kind of errors different webpages — I mean which kind of lexical quality each kind of webpages would have.
So in our sample are the kinds of errors that can be found in the Web: regular spelling errors, typographical errors, errors made by non-native speakers, dyslexic errors, and OCR errors.
Then we use — we use this sample to sample the Web. So first we measured a lower bound of the fraction of the webpages with lexical errors and compared this with each kind of error in our sample.
The corresponding fraction of webpages with the lexical errors, and using this, we estimate lexical quality by the major search engines. And we compare both data we have actually with the index of another search engine.
Now I’m on slide 4. So we calculate lexical quality for the major web domains, also for the domains of English-speaking countries, and also in media domains.
So we saw that our results were as expected. There were several that had lexical quality, universities (Inaudible) while most of the social media websites had poor lexical quality due to the (Inaudible) — by users.
We also wanted to find out whether our measure was — could provide independent information about the website itself. So that’s why we calculate a correlation using the top 20 sites in English from Alexa.com, and so this population was taken into account the unique — I mean, the number of unique visitors, the number of webpages, number of links, and ComScore unique visitors. And in the table, we see that this is correlated. That’s it.
>> SHADI ABOU-ZAHRA: Okay. Sorry, I thought you were finished.
>> LUZ RELLO: Yes.
>> SHADI ABOU-ZAHRA: Go ahead.
>> LUZ RELLO: Yes, so right now, we are doing user studies to find out how lexical quality impacts accessibility, and we are running (Inaudible) with users. Yes, I am finished. Thank you.
>> SHADI ABOU-ZAHRA: Okay. Thank you very much. Apologies for interrupting you.
>> LUZ RELLO: No problem.
>> SHADI ABOU-ZAHRA: Okay. So our next speaker is Markel Vigo, who is going to be talking about attaining metric validity and reliability with the Web Accessibility Quantitative Metric. Markel, please go ahead. You have five minutes. Please speak slowly and clearly and keep to the time. Thank you. Go ahead.
>> MARKEL VIGO: Okay. Thank you, Shadi. I am on slide number. I want to thank my ex-colleagues from the University of the Basque Country because they were my colleagues at the time we developed this research.
So I am moving fast to slide number 2. As mentioned, this is a work that dates from — we started with this in year 2004. But we were not — like we say, we were not the first people working on this. I mean, accessibility metrics have been around since early 2000, and there have been several approaches that tried to tackle the problem. There is the preliminary paper by Sullivan. There’s another paper by others. You can find references on them in the paper.
Of course, there were preliminary approaches to measuring accessibility, but we noticed that there was something missing there. I mean, that there was still room for improvement. For instance, we noticed that metrics at that time were not normalized; that is, they were unbounded, so you could not make comparisons among them. Even if WCAG 2.0 was still a draft, people were already adopting the terminology of principles, guidelines, and criteria. At that time, 2004, 2005, and 2006, no metrics were given for WCAG principles.
Most of the metrics were considered automatic checkpoint evaluation, so semi-automatic checkpoints were not taken into account.
Moving to the third slide now. So we proposed WAQM, which is a short name for Web Accessibility Quantitative Metric, in order to overcome these problems. And we drew from the strong points of the state-of-the-art metrics, like applying failure rates and considering the weights or severities of accessibility violations. But we also tried to improve on them by bounding the metric and giving normalized scores. That is, we were able to say how accessible a page was from 0 to 100. We were able to give scores for WCAG 2.0 principles. We were able to estimate the weight of each violation in terms of WCAG 1 priorities, and we were also able to estimate the failure rate of semi-automatic issues.
So you can see in page 3 the algorithm. I am not going into details. I mean, if you want to have a look, just go to the extended abstract. And I am moving to slide number 4, the last slide with content, and just like to talk about the validity and reliability issues of WAQM metric.
How we tackle the validity issues. So when we talk about validity, we want to check whether the metric WAQM, how close it was from the actual accessibility, so we took a panel of experts that evaluated the accessibility of 14 pages, we took 14 pages, evaluated the tool, applied WAQM metrics, and we know there was a moderate correlation, so that was our approach for validity.
As far as reliability is concerned, we tackle this from a double viewpoint. We tackled tool reliability to see how applicable were results when different tools were used. So we evaluated almost 1500 pages with two evaluation tools, which were EvalAccess, measured with WAQM, and — sorry. Shadi?
>> SHADI ABOU-ZAHRA: Go ahead.
>> MARKEL VIGO: I am finishing. I have 30 seconds to go. We found there is a high correlation between the values provided by different tools, and we also wanted to measure the reliability between guidelines. So we used WCAG 1 and WCAG 2 guidelines with EvalAccess tool, and we also found — we didn’t found good results. We didn’t find good results because the nature of guidelines is different, and this determined very strongly the results given by metrics. So this is it for now. I am finished. Thank you.
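[Illustrative sketch, not part of the spoken presentation. One common way to check whether two tools give consistent metric values for the same pages is a correlation coefficient. The speaker does not say which coefficient was used; Pearson's r is assumed here purely for illustration, and the scores below are made up.]

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / (spread_x * spread_y)


# Hypothetical 0-100 scores for the same five pages from two tools.
tool_a_scores = [82.0, 45.0, 67.0, 90.0, 30.0]
tool_b_scores = [78.0, 50.0, 70.0, 88.0, 35.0]

print(round(pearson(tool_a_scores, tool_b_scores), 3))
```

A value close to 1 would indicate that the two tools rank and space the pages similarly, which is the kind of agreement the speaker reports between tools.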
>> SHADI ABOU-ZAHRA: Okay. Thank you, Markel. Sorry we are a bit under time pressure, but thank you all for keeping your timelines so closely.
The next paper is paper number 7, the case for WCAG-based evaluation scheme with a graded rating scale from Detlev Fischer. Detlev, I’ll unmute you.
>> DETLEV FISCHER: My co-author is Tiffany Wyatt, who can’t be here today. I am on slide 1. Can you hear me?
>> SHADI ABOU-ZAHRA: Yes, please go ahead.
>> DETLEV FISCHER: Okay. Well, we took as a starting point the observation that real-life websites usually show less-than-perfect accessibility. Even those that want to be accessible often have minor problems, and problems on a range from not so serious to very serious.
On the other hand, WCAG techniques and failures have a binary test, so it’s either pass or fail, and that makes it quite difficult to deal with minor flaws. You either have to neglect them, saying it’s not so important, or you are quite strict and fail a page that may be quite good overall.
That’s why the German BITV-test uses a five-point grading scale to address this problem. BITV is basically a German implementation of WCAG.
I am on slide 2 now. The major difficulties with a normal rating — with a rating approach. When you rate individual instances on a page, results can be somewhere between pass and fail. For example, in alternative text for images, you know you may have a text which is quite good or not so good, and that’s still better than no text at all, for example.
Another problem is that some ratings apply not to instances or success criteria, which can be brought down to instances like images, but for patterns, for example, the overall structure of headlines. So it’s difficult to say what level of deficiency of, say, hierarchical headline structure, you will have a failure. It is a kind of a gray — it’s a scale rather than a pass or fail situation.
Another thing that needs to be reflected is that some instances of failure can be critical, while others are quite minor. And often, looking at the page, you have many instances that pass while others fail. So it’s a difficult situation. Should the page fail success criteria or not?
I am on slide 3 now. The graded rating approach that the BITV-Test has developed has been used for a number of years now. The new test, which is the WCAG 2 — kind of similar to WCAG 2 — has 50 checkpoints that map on WCAG level AA. And the checkpoints have a weight of 1, 2, or 3 points, a bit similar as in the other presentations we had. It’s not always mapping onto A, AA, and AAA. It depends on the criticality that those issues often have.
Full pass would be 100%, so you get the full weight attributed to the score. And then you go down to 75, 50, 25, or 0. In the case of a complete fail it would be 0%. And those ratings reflect both frequency and criticality of flaws.
The results per page are then aggregated to a site score, so you arrive at a result of, say, 80 points of 100 or 90 points of 100. At 90 points, the site would be considered to have good accessibility.
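[Illustrative sketch, not part of the spoken presentation. The graded scoring described here could be approximated in Python as follows. The normalization to a 0-100 page score and the use of a plain average to aggregate pages into a site score are assumptions of this sketch; the talk does not give the exact aggregation rule.]

```python
from typing import Iterable, List, Tuple

ALLOWED_WEIGHTS = {1, 2, 3}
ALLOWED_RATINGS = {0, 25, 50, 75, 100}  # percent of the checkpoint weight

# One checkpoint result: (weight, rating in percent).
Checkpoint = Tuple[int, int]


def page_score(checkpoints: Iterable[Checkpoint]) -> float:
    """Weighted share of points earned on one page, scaled to 0-100."""
    earned = 0.0
    available = 0
    for weight, rating in checkpoints:
        if weight not in ALLOWED_WEIGHTS or rating not in ALLOWED_RATINGS:
            raise ValueError(f"invalid checkpoint result: {(weight, rating)}")
        earned += weight * rating / 100
        available += weight
    if available == 0:
        raise ValueError("no checkpoints rated")
    return 100 * earned / available


def site_score(pages: List[List[Checkpoint]]) -> float:
    """Average of page scores (assumed aggregation rule)."""
    return sum(page_score(page) for page in pages) / len(pages)


# Hypothetical ratings for two pages with three checkpoints each.
page_a = [(3, 100), (2, 75), (1, 0)]
page_b = [(3, 100), (2, 100), (1, 50)]

print(page_score(page_a))
print(round(site_score([page_a, page_b]), 2))
```

The sketch shows why a graded scale softens the pass or fail problem: a checkpoint rated 75% still contributes most of its weight instead of dropping to zero.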
I’m moving to the last slide now, 4. One question we had is the reliability of graded ratings. You could express it as the degree of replicability, so if one tester and another tester test the same sample of pages, they should have the same results. That’s why our conformance tests are always run with two testers testing independently, followed by an arbitration phase. So in this arbitration phase, both testers compare results, and either they have overlooked something, so you can correct oversights, or you can also rectify ratings which have been too lenient or too strict, arriving at a kind of arbitrated consensus rating for the site.
Our experience has shown that the five-point graded rating scale is quite reliable. We tried to back that up by having introduced a statistics function that has now been added, so we can quantify interevaluator reliability over time and look forward to reporting on reliability maybe in a year’s time.
That’s it. Thank you very much.
>> SHADI ABOU-ZAHRA: Thank you very much, Detlev. That was exactly five minutes, just before the beep. Thank you, everybody, for keeping your time so closely. We just got confirmation that we can extend a little bit, maybe ten minutes or something. I’m not sure if people can stay on for so long, but we can go a little bit beyond in order to catch up for the delayed start. Again, I apologize for that.
Anyway, let’s go to the next speaker, paper number 8, a zero in eChecker equals a 10 in eXaminator, a comparison between two metrics by their scores, from Jorge Fernandes. Jorge, I will unmute you in just one minute.
Okay. Hi, Jorge.
>> JORGE FERNANDES: Okay, hello, Shadi. Are you hearing me?
>> SHADI ABOU-ZAHRA: Yes. So again, please speak loudly and clearly, but slowly, and announce the slides that you’re on. Please go ahead.
>> JORGE FERNANDES: Okay. Thank you. I would like to spend my five minutes discussing the following four points: First point and first slide, when in 2005 I saw eXaminator, again, eXaminator presented itself as a robot who was very critical in some aspects of webpage designs in an incomplete way. Sometimes it was in kind. Using a scale of 1 to 10, it failed to classify our Web constriction. Knowing (Inaudible) from foundation meetings and knowing that he had also developed error changed him to develop monitoring tool for (Inaudible) Web administration.
Sorry. We have two requirements. First one, reliable tests and the geographical expression of the results and, very importantly, that we will not lose sight of the score which we call web@x, web at eXaminator.
What about reliability of tests? Are they good? Feedback received from several hundred users of our tool, especially in Portugal and Brazil, is good.
Until now, we have reviewed about one million pages. The development and consulting work supported by eXaminator has also produced good results.
Third point, and third slide, but we want to test eXaminator with other tools to test the reliability of the methodology of UWEM.
For the first time, we had the opportunity to compare our text-building exercise with other things, especially as a great team was involved in the production of UWEM.
The appearance of eGovMon checker, which we call it eChecker — sorry if it is wrong — was great. And like us, they also had a sample based on municipalities.
In 2009, we tried to update another tool we made in 2007, which we call it Walidator. UWEM 1.2. Using all the available tests of eGovMon checker.
Preliminary data proved frustrating because we lacked the time (Inaudible) analyzes.
As you saw in our paper, the current analysis also shows a weak correlation between the scores obtained by the two tools.
Fourth slide. One of the reasons for the (Inaudible) of atomic-level test analysis is the fact that we are working on a metric for WCAG 2 and translating the related documentation that will serve as context-sensitive help for the new tool. The tool based on WCAG 2 is now online, and the new metric is quite different from eXaminator’s. Instead of a single type of test, we have four types: two binary tests and two with progressing results.
We are ready to review and clarify all the tests we perform, and we truly are willing to make a test-to-test analysis with other tools.
Thank you very much.
>> SHADI ABOU-ZAHRA: Okay. Thank you, Jorge. Next speaker is Markel Vigo, with his second paper, context-tailored Web accessibility metrics. Markel.
>> MARKEL VIGO: Thank you, Shadi, again. Okay. So I will go fast again to slide number 2, after presentation. So with this paper, I tried to tackle the problem of use context because there was some research that found that guidelines were not enough to assure an accessible experience.
With this we don’t mean that conformance to guidelines shouldn’t be pursued. On the contrary. I mean, we believe that conformance to guidelines establishes a minimum requirement for having an accessible experience for all users. But the problem is that some people still have to struggle on websites.
So one of the hypotheses for why this is not so good is that guidelines, and maybe evaluation tools and metrics, are not able to consider the interaction context, or they have a limited scope in trying to consider it.
So our definition of interaction context is that the evaluation should take into account the characteristics of the assistive technology used, and also the features of the accessing device, because it’s not the same using a mobile device, a laptop, or, you know, a tablet PC, and the particular abilities of the user should also be taken into account.
So as far as metrics are concerned, we hypothesize that if the evaluation reports are context tailored, so will be the accessibility metrics.
I am moving into slide number 3. The previous presentation was about the Web Accessibility Quantitative Metric. It was a WCAG-style metric. It was good for WCAG, but we wanted to embrace different guidelines. So we propose one metric based on Logic Scores Preferences.
I’m not going into detail to explain Logic Scoring of Preference, but the reason why we adopt LSP is that it gives us the flexibility to manipulate the metric depending on the features of the interaction context.
So if we want simultaneousness in establishing the criteria or success criteria, we can apply a logical function to lots of rights. If our objective is to penalize the main component of many components, we cannot qualify the main conjunction.
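For readers unfamiliar with LSP, the following is a minimal sketch of the general idea, not the speaker's actual configuration. In the LSP literature, elementary scores are usually combined with a weighted power mean whose exponent moves the operator between conjunction-like behavior (all criteria must be met together) and disjunction-like behavior (a good score can compensate for a bad one). The function name `lsp_aggregate`, the scores, the weights, and the exponent values below are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def lsp_aggregate(scores: Sequence[float], weights: Sequence[float], r: float) -> float:
    """Weighted power mean of scores in [0, 1] (illustrative LSP-style operator).

    r = -inf behaves like pure conjunction (minimum), r = 1 is the weighted
    arithmetic mean, r = +inf behaves like pure disjunction (maximum).
    """
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and of equal length")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    if r == -math.inf:
        return min(scores)
    if r == math.inf:
        return max(scores)
    # For r <= 0 a single zero score drives the whole aggregate to zero.
    if r <= 0 and any(s == 0 for s in scores):
        return 0.0
    if r == 0:
        # Limit case: weighted geometric mean.
        return math.exp(sum(w * math.log(s) for s, w in zip(scores, weights)))
    return sum(w * s ** r for s, w in zip(scores, weights)) ** (1.0 / r)


# Hypothetical scores for two criteria with equal weight.
scores, weights = [0.9, 0.3], [0.5, 0.5]
strict = lsp_aggregate(scores, weights, -1.0)   # conjunction-like: the weak score is penalized
neutral = lsp_aggregate(scores, weights, 1.0)   # plain weighted average
lenient = lsp_aggregate(scores, weights, 3.0)   # disjunction-like: the strong score compensates
```

Tailoring the metric to an interaction context would then amount to choosing different weights and exponents for different devices, assistive technologies, or user profiles.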
I am going to the last slide because I think this is important, how we try to evaluate our approach with users and with users interacting with webpages.
So we developed two tools, one tool producing device-tailored metrics and the other producing user-tailored metrics. The tool that evaluated the device-tailored interaction context produced metrics that were able to capture how pages were accessible for determined access devices. You know, devices are different among them; therefore, guidelines apply in a different way.
So we did the user test. We confirmed that device-tailored metrics were a more accurate approach to measure user experience.
The other approach was about user-tailored metrics, and specifically the context of use that we were targeting was blind users accessing the Web using the JAWS screen reader. We evaluated pages according to their specific profiles, and in order to validate the metric we deployed the accessibility scores as link annotations so the annotations were able to say to what extent the page that was linked was accessible.
So in our experiment, we found that annotations with accessibility scores were able to increase user orientation.
I’m finished with this slide. Thank you very much again.
>> SHADI ABOU-ZAHRA: Thank you, Markel. Thanks again to everybody for keeping their time. We go to the next paper, paper number 10, Web accessibility metrics for a post-digital world. There are no slides for this presentation, but David Sloan will give us the five-minute highlight. David?
>> DAVID SLOAN: Hi, Shadi. Thank you very much. I hope everyone can hear me. Our paper is adopting a more theoretical position than what you’ve heard already in this symposium, but I think it builds on some of the discussions on looking at context as an important aspect of Web accessibility metrics.
We look at metrics as being something important to website owners who are providing an online experience, to users of that experience, and to third parties who may be regulators or governments or lawyers, even.
And what we would like to do in this paper is present our view of why we need to think about extending technical accessibility metrics to consider two additional contributors to the online experience; firstly, the quality or inclusivity of the user online experience, so moving beyond goal completion to look at the user experience; and also the quality of the effort required or that has been undertaken in providing that online experience. And this is motivated, to some extent, by UK legislation which focuses on the provision of services, compared to the Act in Italy, which is much more directly focused on technical conformance. And we use the term post-digital world to describe a situation where increasingly the use of the Web in and amongst other ways of providing access to services, information, and experiences by organizations is something that has become normal and something that is standard and, therefore, addressing the Web in isolation, particularly in terms of measuring how accessible a service is, might, in some cases, have limitations.
Also, obviously, in current financial difficult climates, decisions are being made to get — do things in as cost-effective a way as possible, and in some cases, a situation that requires universal accessibility and — or policy that encourages something to be completely accessible, whatever that might mean, may have practical limitations given financial constraints and the resulting shortcomings in people and technical capability and knowledge to implement accessible solutions.
So we are pointing to, on the one hand, research that looks at measuring user experience. Marc Hassenzahl has provided some experience in this and provided ways in which user experience can be defined and measured.
We are also interested in looking at how we can bring that research into play with a couple of tools that exist in the UK that sort of extend our definition of accessibility. So one is British Standard 8878, which is a procurement and implementation standard launched last year to guide organizations in the whole process of defining, procuring, implementing, and maintaining accessible websites. So it’s very much about the process rather than just the practice. And it could be used to measure the quality of the process, the amount of effort that people have undertaken.
The second one I’ll mention briefly is the TechDis Accessibility Passport, which is a more specific container which can define the accessibility of a resource, not just in the technical conformance level, which obviously that’s an important aspect of it, but also in terms of the reaction to issues, the strategy undertaken to address those issues, and perhaps even user stories, you know, in terms of what users have experienced when using that resource.
And with this information available, it allows judgments to be made on a wider sense of the quality of process that’s been undertaken and the resulting quality and user experience.
I’ll stop there.
>> Shadi? I think Shadi has disappeared for a moment. I think we have one more paper, and it’s a paper by Annika Nietzio and several other people from Norway, and she will tell us about a presentation titled towards a score function for WCAG 2.0 benchmarking. Annika, I think now I need Shadi to unmute you.
Maybe Simon can help me. Annika? Annika, I think you are unmuted.
It looks like we have some problem. I’ll wait a couple of more seconds to see if something good happens.
>> Yeah. Annika?
>> ANNIKA NIETZIO: Hello, everyone. Can you hear me now?
>> ANNIKA NIETZIO: Okay.
>> Okay. Speak slowly and close to the mic, and it will be fine. The floor is yours.
>> ANNIKA NIETZIO: Thank you. So I’m going to present the recent work of the eGovMon project, which is — our goal is also to develop a score function for WCAG 2 benchmarking. So we start — or we move to the second slide now.
To go back a bit and look at previous experience that we have, we started out with the development of the Unified Web Evaluation Methodology called UWEM for WCAG 1, and from this development, we have created a UWEM indicator refinement process, which was a very detailed process followed to — (Inaudible) — starting out with the collection of requirements, grouped into different categories, so we have requirements for crawling and sampling, mathematical and statistical properties, and requirements for the influence of features of the Web content on the overall score.
Then there was a stage where we did the theoretical analysis, the dependencies and potential conflicts between the requirements. And finally, a very important stage was the experimental evaluation, where some potential metrics were selected and compared based on real and synthetic data. And finally, this led to the selection of the score function for UWEM.
So to summarize this part, the main lessons learned were that first we need the statistical background, but of course, experimental evaluation of these approaches is vital. And we also saw that the score function has to be tailored to the structure of the test set. So this means in the beginning we had the score function for WCAG 1, but now we need a score function for WCAG 2 tests, and basically we have to repeat the process because the structure of the test set is different.
Move to the next slide, where I would like to highlight some of the differences between these two sets of guidelines.
A major difference is in WCAG 1, we had tests that are more or less independent. They might cover the same HTML elements, but for counting and how they’re taken into account for the score, they can be seen as independent.
In WCAG 2, when you chose to implement the tests based on the techniques for WCAG 2, then we find that we have logical combinations of the test results, and these must be taken into account.
So to illustrate this, I’ve put in an example here. This is for success criterion 3.3.2, Labels or Instructions. If you want to identify the purpose of a form field, there are different techniques, so you can use a label element. In some situations, you can also use a title attribute or an adjacent button that is labeled with the correct text.
And so if you implement three tests for these three different situations, we have to also implement a logical combination between the different outcomes of the tests to see when we actually should report a failure.
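The logical combination described here can be sketched in a few lines. This is only an illustration under stated assumptions: the `Outcome` enum, the `combine_alternatives` function, and the rule "a form field fails only when none of the alternative techniques passes" are hypothetical names and a plausible reading of the talk, not the eGovMon implementation.

```python
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_APPLICABLE = "not applicable"


def combine_alternatives(outcomes: Iterable[Outcome]) -> Outcome:
    """Combine the outcomes of alternative techniques for one form field.

    The techniques are alternatives, so one passing technique is sufficient;
    a failure is reported only if at least one technique applied and none passed.
    """
    outcomes = list(outcomes)
    if any(o is Outcome.PASS for o in outcomes):
        return Outcome.PASS
    if all(o is Outcome.NOT_APPLICABLE for o in outcomes):
        return Outcome.NOT_APPLICABLE
    return Outcome.FAIL


# Hypothetical field: no label element, but a title attribute is present.
field_result = combine_alternatives([
    Outcome.FAIL,            # test for a label element
    Outcome.PASS,            # test for a title attribute
    Outcome.NOT_APPLICABLE,  # test for an adjacent labeled button
])
```

Counting the three tests independently, as was possible for WCAG 1, would report a failure for this field even though the criterion is met through the title attribute.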
Then we move on to the next slide. With this approach, we see that we have a different number of tests per success criterion, or per checkpoint if you look at WCAG 1. Of course, we would like to create an accessibility score that is independent of the number of tests or independent of the granularity of tests that are actually implemented. And therefore, our solution is to use the success criteria as intermediary aggregation level.
Before we started with single test results and aggregated it in a page score in one step, and now we take two steps, first aggregate it into a score for a success criterion, and then we calculate the page score.
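A small sketch may help show why the intermediate level matters. The talk does not give the actual score function, so the choices below are assumptions made purely for illustration: each success criterion gets a failure rate, and the page score is the unweighted mean of those rates. The function names and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Each result is (success criterion id, whether the test instance passed).
Result = Tuple[str, bool]


def criterion_scores(results: Iterable[Result]) -> Dict[str, float]:
    """Step 1: aggregate test results into a failure rate per success criterion."""
    grouped = defaultdict(list)
    for criterion, passed in results:
        grouped[criterion].append(passed)
    return {c: sum(not p for p in outcomes) / len(outcomes) for c, outcomes in grouped.items()}


def page_score(results: Iterable[Result]) -> float:
    """Step 2: aggregate the per-criterion scores into one page score."""
    return mean(criterion_scores(results).values())


def one_step_score(results: Iterable[Result]) -> float:
    """Earlier style: a single failure rate over all test results on the page."""
    results = list(results)
    return sum(not passed for _, passed in results) / len(results)


# Hypothetical page: ten fine-grained tests for 1.1.1, two tests for 3.3.2.
results = [("1.1.1", i >= 5) for i in range(10)] + [("3.3.2", True), ("3.3.2", True)]
two_step = page_score(results)      # each criterion counts once
single = one_step_score(results)    # dominated by the criterion with more tests
```

In the two-step version, splitting the checks for one criterion into more or fewer tests does not change that criterion's share of the page score, which is the independence from test granularity mentioned above.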
The next step would, of course, be to develop a website score, where we take all the page scores into account. This is a task that we haven’t tackled yet, but we see — in the paper we have described some of the challenges or the open issues that need to be resolved. So for example, you have to define a way how to accommodate results of tests that are applied on site level. There are a couple of checkpoints, for example, about navigation that are applied on site level, so there won’t be any results on individual pages.
And another question that’s also related to the sampling and the selection of webpages is how to deal with these conforming alternate versions.
Then we move to the final slide, which is — it might even be seen as a bit of a summary of the whole of the presentations that we have heard today. So what we found is that a generally accepted practice for reporting WCAG 2 results does not yet exist. There are different approaches, and there might be correlations between the results, but there’s not one generally accepted practice.
So this leads to a situation where we have a number of tools for WCAG 2 that are available, but the scores are not really comparable. The differences lie in the granularity of tests, in how the instances are counted, and there are different result categories, there are errors, potential errors, warnings, different types of pass and fail results, and different severities.
And finally, the actual reports are quite different. Some tools report an absolute number, some tools report percentages of failure rates, and there are a couple of more sophisticated scores as we’ve also seen in other presentations today.
So our goal for the future is to arrive at a really unified WCAG 2 score that meets the requirement of inter-tool reliability, so testing the same page with different tools leads to the same results. That would require some steps that are summarized in the final section of the paper, basically, a collaboration between tool developers and researchers to define comparable tests at an atomic level, and then also keep the test implementation very close to WCAG 2 and the instructions in these “how to meet WCAG 2” and these understanding documents. And then finally, agree on indicator requirements and possibly follow a similar process to the one that we described for the process of UWEM score. Thank you.
>> Okay. Thank you, Annika. So I think that Shadi — Shadi is not back yet.
>> SHADI ABOU-ZAHRA: Hi. I am sorry, folks. There is an entire power outage where I am, so no phone, no Internet. I am calling from my mobile phone. I’m not able to actually mute and unmute. Is anybody able to take that over?
Shawn, are you on the line? Are you able to maybe help with moderating a little bit?
>> She just whispered that she can.
>> SHADI ABOU-ZAHRA: Okay. Great. So I will be listening in, and Giorgio, I think we’re now going over to the panel session anyway, which is chaired by the panel chairs, by Giorgio, Markel, and Josh, who will ask the presenters specific questions. Giorgio, if you try to, as far as you can, direct your questions to specific presenters so that we can unmute the line more easily, and Shawn will help out with the muting and unmuting.
And I ask everybody to also think about their own questions. There will be an opportunity at the end to have a very brief Q&A session. So without further ado, it’s over to you, Giorgio, Markel, and Josh. Please go ahead.
>> GIORGIO BRAJNIK: Yes, thank you. So the original idea was to have a half an hour panel discussion followed by a half an hour of question and answers coming from the participants, from the floor. We need to shorten the times, and I suggest that we address a few questions to each of the presenters in the order in which they gave the presentation, so the first one would be Maia, followed by Sylvia, and so on. Of course, only one presented, even if the presenter gave two presentations.
So there are a couple of questions that might be — for which might be interesting to have the idea of what each presenter thinks is important and how to tackle these questions.
The first question that we, as chairs, came out with is that we would like to know who is the target user of the metrics that each of the representatives is talking about, and for what purposes? And why do they think that these people would be using the metrics?
So this would be the first question or set of questions.
And secondly but related to this is we would like to know something about how you estimate the validity of your approach, how much, what you think — what you say is an accessibility metrics. Do you think how much it really relates, the scores that it produces really relate to accessibility?
And what is the cost in your application area, if a wrong decision is taken based on the wrong values of the metrics?
And finally, the third point would be do you think that we need to go beyond the idea of metrics as being a reflection of conformance, of checkpoints, and perhaps think more about, like, user experience or usability for the disabled people or whatever?
Do we need to go beyond conformance?
So users of the metric, validity, and shall we go beyond the conformance? And I would like Maia to give us the first answers. Just one or two minutes, no more than that. Thanks.
>> MAIA NAFTALI: Sorry. Can you hear me now?
>> GIORGIO BRAJNIK: Yes, I can. Speak slowly and close to the mic, please.
>> MAIA NAFTALI: Yes, I’ll try. To answer the first question, the target user would be everyone who has a website or company and want to test the site, where am I, what is the level, how close I am to accessibility and conformance?
It’s really nice the scope of the metrics, and I think it will be valued because we will only be accessing (Inaudible) — can be tested with algorithms.
Regarding what is level of conformance, my answer is you have to go beyond, but going beyond conformance as far as developed metrics, (Inaudible) criteria, full set of (Inaudible) and have to work on that and develop a baseline (Inaudible)
>> GIORGIO BRAJNIK: Okay. Thank you, Maia. And how about Sylvia?
>> SYLVIA MIRRI: Can you hear me?
>> GIORGIO BRAJNIK: Yes.
>> SYLVIA MIRRI: Okay. Perfect. Well, actually, we have started to work on the monitor application just to provide a tool in Italy to monitor the level of their websites, the level of accessibility of their websites.
And so we think that the BIF metric can be used worldwide. But it is just a starting point because it is — actually, right now, it is a tool to measure just automatic evaluation errors and their impact. So this is just a starting point, but it can be used worldwide because the monitoring system can be tailored according to different accessibility guidelines.
There is a different issue related to the metrics to make different DTD document evaluations comparable because it is strictly based on a problem in Italy because the Italian regulation is so different from all the different accessibility guidelines. So it is pretty difficult to generalize this problem outside the Italian country.
But this metric can be useful in trying to evaluate the compliance and validity of DTD documents when they are compliant to different grammars.
>> GIORGIO BRAJNIK: Thank you, Sylvia. How about Nadia Fernandes? Just a minute, and you should be unmuted. Nadia, you are already unmuted.
Just a moment.
(No audible response)
Okay. I think we might postpone Nadia. Nadia’s opinion. And so let’s go to the next speaker, who is Luz, Luz Rello. Let’s see if Shawn can unmute, yes, Luz?
>> LUZ RELLO: Hello?
>> GIORGIO BRAJNIK: Hello. Great. So how about users and purposes of using and validity of your metrics?
>> LUZ RELLO: Of validity, of course, one way to validate it. Right now we are working on validation of the metrics using user studies, so we see how much the lexical quality impacts in the readability of the text. So we are working on that right now.
And I mean, the users of this metric would be people who would like to evaluate Web content. So Web content is not only related to semantics, but also to its representation. So it seems that — I mean, it has been able to use lexical quality to assess Web quality content, and yeah, anyone who would like to assess that could use this metric.
>> GIORGIO BRAJNIK: Thank you. You have been very quick. Thank you.
>> LUZ RELLO: Okay.
>> GIORGIO BRAJNIK: So shall we try again with Nadia? Let’s see if we can get her voice. I don’t think so.
>> There is some noise on the line, so it seems that it’s a local issue. Nadia, you need to try to unmute locally.
>> Actually, if we can go ahead with Detlev while we’re working on it.
>> GIORGIO BRAJNIK: Okay. Good. So the next one would be Detlev.
>> DETLEV FISCHER: Yeah, hi. Just to answer the question for target users, one user is, of course, the testers themselves because the graded rating scale we explained or I explained should give them a means for a fair judgment of not-so-good content, you know, all those gray areas I think are better served.
The purpose is to have a tangible and fully documented measure where you can distinguish the good from the not-so-good, so beyond just pass-and-fail judgment to give a really meaningful rate of accessibility.
The adoption is — has been quite well adopted already and is being used already, so the validity of it, we try to back it by having these double tests or tandem tests, and also the results are fully documented, so they are open to third-party inspection. That, we hope, will improve the validity.
Regarding the limitations of metrics, I think what’s important is to include in metrics all success criteria of WCAG, and we know that many cannot be measured automatically, so I think that’s crucial, and any value will have to include things that need human assessment. That is our point of view, at least. That’s it.
>> GIORGIO BRAJNIK: Thank you, Detlev. Shall we try with Markel?
>> MARKEL VIGO: Okay. Hi, Giorgio. So the first question was who will be the target user of your metrics. So I presented two extended abstracts today. And both — I mean, there are two different approaches to measuring accessibility. So I would refer to WAQM, a conformance metric, while the other one would be a more context tailored metric.
The target audience for conformance metric for me is for those scenarios that have to guarantee that webpages meet a certain level of accessibility. You know, those that have to guarantee or to satisfy that some regulations or policies are met.
While contextual metrics can be — the target can be end users because, from my point of view, I’ve been using these metrics to increase user experience or to at least — to provide predictors of the experience of the interaction with the Web.
Regarding the validity of our approach, we tried to involve users in our experiment. Most of the times we got good results with high correlations. We all know that correlations don’t guarantee (Indiscernible) but it’s a hint toward validity.
Regarding how cost or risky are decisions made based on evaluations. It’s always risky, but it’s more (Indiscernible) because these metrics are being used for that purpose. So if you make a mistake, the user can be misled or in the end can even end up frustrated. So in both cases, it’s crucial, but it’s worse if you make the mistakes for context-tailored metrics.
In the end, the last question is what is the (Inaudible).
My answer is yes and no. I would give conformance to metrics — for guidelines for those scenarios, the conformance, like the ones I mentioned, like meeting policies or regulations.
But for those scenarios that try to measure the experience of the user, the real experience, we go for context-tailored evaluations or metrics, which is something I try to convey in my second presentation and also David Sloan made emphasis. So this is my position on these questions. Thank you.
>> GIORGIO BRAJNIK: Thank you, Markel. I think we have the channel open for Annika. Annika? Hello?
(No audible response)
>> ANNIKA NIETZIO: Question. The target users of the eGovMon tool, so we are — the project we are working together with a group of Norwegian municipalities and our target users are the site owners, so the municipalities and the Web editors who work with the pages on a day-to-day basis. And of course, they need to communicate with the Web designers and Web developers if some issues are discovered. So our — the main purpose of using the tool would be to improve the accessibility of websites and, therefore, we are trying to give very detailed reporting, so it’s not only metrics, but we also provide additional info about the relation to the WCAG content and references to the check pages and in some cases even code extracts from the HTML code and a lot of hints on how to fix them. The focus is really on fixing accessibility issues and improving the quality of the websites.
Regarding the second question, the validity of the approach, of course, the — the first answer here is that we are trying to follow the WCAG guidelines quite closely, and we can somehow inherit some of the validity from the WCAG guidelines.
But I’d also like to put this question a bit differently and maybe highlight another impact. Of course, we can try to give scientific proof of the validity of our metrics, but from the outside, seen from the user’s point of view, what’s really important is the credibility. And the danger that I see here is that we have a lot of different tools, and they present different results for the same pages. And this is something that could easily alienate the users. They might stop using the checkers. They might stop to care about accessibility at all. Or it could even be an approach to select the tool according — yeah, select the tool which gives the best results for your page, and that, of course, is not the right way to improve accessibility.
And finally, about the question about the limitations of conformance-based metrics and about the idea to maybe introduce metrics based on the effectiveness, as I see it with conformance measurements, we have quite a good experiment, so it’s a technical measurement. We can clearly define the requirements and assess conformance.
With effectiveness, it’s a bit more difficult because it’s somehow related to the experience of an individual user. And to measure this in a way that provides valid and comparable results might be an idea to use — users with specific requirements and look at how they can — look at the level of effectiveness they could achieve when using a page. Thank you.
>> GIORGIO BRAJNIK: Thank you, Annika. Very interesting point.
Shall we try with David, David Sloan? Let’s see if we can unmute him.
>> DAVID SLOAN: Hello. Can you hear me?
>> GIORGIO BRAJNIK: Yes, yes, go ahead.
>> DAVID SLOAN: Okay. So I briefly outlined the target users I felt were particularly potentially beneficiaries of the metric we were talking about. I’ll just go over that again. Firstly, in terms of the organization providing the website that enables the experiences or goals to be reached. Policymakers and monitors would potentially have a broader and more contextual understanding of the implications of technical accessibility in terms of the extent to which target audience can complete tasks and complete those tasks with a degree of success or other positive measure beyond just basic completion.
And this is something in the UK again, particularly in the public sector, there is a move to gather evidence that is documenting efforts taken to improve inclusivity in terms of provision of services. And it might help decision making in terms of do we fix a barrier or do we provide an alternative way around this? Now, obviously, this is potentially dangerous if it’s interpreted as leaving a barrier behind, so this has to be carefully thought out in terms of the decision-making process.
Developers obviously benefit from evidence that helps them prioritize and target fixes above and beyond a basic guideline indication of priority. If a fix can be made to help a group of users successfully complete a task or complete it more effectively than previously, then that might focus on that barrier rather than others that have less impact.
People who experience accessibility barriers are obviously beneficiaries of a metric that allows them to share feedback in however narrative a form on the quality of user experience, the encounter. So again, it’s allowing people to report the quality of task completion and make that evidence available to others who can potentially benefit from knowing that it’s possible to do something within a site, even though certain accessibility barriers might exist.
And thirdly, standard-setting bodies, regulatory bodies, people from a regulatory perspective can access this information to help understand the extent and quality of effort organizations have undertaken to make their online services accessible within a context of providing services, information experiences to the public.
In terms of validity, obviously, this is early days in terms of defining how we might go about defining a metric and measuring it. What we’re really trying to do is define what evidence should be gathered on top of technical conformance on which the measurement would be provided in order to get a sense of the quality of the experience and the quality of the effort that has gone into providing that inclusive experience, so validity would be dependent firstly on the dependence on the information being measured and secondly on the accuracy and completeness of the information being provided and also by end users who may be reporting user experiences. Wrong values might lead to an overly optimistic or pessimistic indication of the level of inclusivity of the user experience. And requiring too much evidence or hard-to-find evidence might mean organizations either cannot gather that information or cannot justify the effort gathering that information.
So there is a balance in terms of the amount of information that we define in order to provide a meaningful metric.
Finally, in terms of sort of beyond conformance, I think what we are discussing here is very much beyond conformance and looking at measuring quality of inclusive experiences in the context of the constraints of an individual organization. So while technical conformance might allow comparison across different organizations, there is a degree of importance in measuring how well an organization has performed in terms of what they are capable of doing, so this, again, is inspired by UK law in terms of taking reasonable steps to do as well as you can rather than necessarily being focused in areas of detail that might have less impact but apparently, according to a technical definition, might — there might be less impact on priority.
>> GIORGIO BRAJNIK: Okay. Okay. Thank you, David.
At this point, I would like to thank you for your patience and give you, to the audience, the ability to ask questions.
I think that you need to dial 41 hash, the hash sign, the pound sign, so 41 to raise your hand. And if you wish (Inaudible) — then dial 40 hash. So we are looking if there are questions from the audience. And please direct your question to some other speaker.
>> There’s already one, Giorgio, by Jon Gunderson.
>> GIORGIO BRAJNIK: Go ahead, Jon.
>> One of the issues is I hear people say they want to see reliability between different tools, but when you’re conforming to WCAG or looking at WCAG, how you conform to WCAG will depend on your perspective. WCAG doesn’t have any standard techniques. I mean, there’s suggested techniques in the techniques document. But won’t the results overall depend on perspective, like whether you’re supporting Web standards and if you have any coding patterns that you’re trying to get developers to do for accessibility?
I mean, there’s a wide range of ways to conform to WCAG. I guess I just would like to hear the panel respond to that. You know, there’s more than one way to conform to WCAG. How do you get reliability when people have different ideas about how to do it? Thank you.
>> GIORGIO BRAJNIK: Okay. Is any participant on the panel willing to give an answer? Especially those dealing with conformance-based metrics.
>> (Inaudible) — 41 and the pound sign for the panelist to raise their hand.
>> We have Josh, and then Annika, or — here we go. Annika first.
Annika, you should be unmuted now.
>> ANNIKA NIETZIO: Okay. So the question about different ways to conform to WCAG, this is, of course, also partly addressed by the WCAG techniques because usually there’s not only one technique to conform to a success criterion, but there’s a selection. And this is also what I tried to present in my approach that you really have to take into account this logical combination. And this will cover the most commonly used techniques, and it will also, of course, catch the common failures because they are explicitly described in the techniques document.
There might be more obscure ways of providing accessible websites or sometimes it’s called the undocumented techniques or something, some things that will evolve and maybe become good practice in the future.
So this aspect has to be taken into account by the constant update of the techniques. And I think in that way, we can be pretty confident of capturing most of the approaches to making accessible websites. But of course, you cannot catch all possible techniques in that way.
>> GIORGIO BRAJNIK: Thank you, Annika. Joshue?
>> JOSHUE O CONNOR: Hi, everybody. Yeah. I think, you know, we can know in a general sense — can everyone hear me okay, first of all?
>> GIORGIO BRAJNIK: Yes, yes.
>> JOSHUE O CONNOR: Okay. Thanks, Giorgio. We can know in a general sense that there are some things relating to accessibility which are certainly best practice, and they’ve been very clearly defined because there are well-worn paths such as, you know, structural headings and good semantics and programmatic determination between related elements so that assistive technologies can relate to them, understand them, and provide a positive user experience.
The problem is that when the Web moves so fast, and Web development techniques are constantly evolving, changing, that I think the notion of conformance in itself is devalued a little bit because we need to be able to define ways of looking beyond their conformance but tapping more into the user experience, and that’s one of the reasons why I think David’s paper was very interesting because that’s something that isn’t necessarily defined by mere technical validation or document conformance.
And so this is what’s been very interesting about this research and the papers that we’ve been getting because you’ve been able to look at this from different angles.
So there’s no easy answer to that, but I think, from my experience of working in the domain for a few years now, that we need to try and find ways of quantifying the user experience in a way that we can give it an appropriate weighting within WCAG conformance.
>> GIORGIO BRAJNIK: Thank you, Josh. Are there any other questions?
>> A reminder: anyone who has a question should dial 41 on their phone keypad and the hash sign or the pound sign.
>> GIORGIO BRAJNIK: I think there is a question by Simon, Simon Harper. Simon, we cannot hear you. Still no sound from Simon.
>> Do you have a local mute? Maybe, Simon, you have a local mute? We have unmuted you, so maybe you have a local on your phone system there.
>> GIORGIO BRAJNIK: There’s one more question by —
>> Simon says he’s trying to fix it. Sure. Okay.
>> GIORGIO BRAJNIK: Very good question. Let’s see. Let’s see who, between the panelists, would like to answer?
>> Sorry. What was the question again? I didn’t really catch it.
>> JOSHUE O CONNOR: This is what I was trying to say, dealing with — that is pretty much a good example of what I was talking about. There’s a lot of different ways of skinning the cat when it comes to scripting behaviors within Web content. The thing about it is there’s some very, very simple stuff that developers can do in order to make their stuff more accessible. You know? It doesn’t really matter how (Inaudible) bubble way down low, as long as when it comes to the top there’s some program that assistive technologies and screen readers can tap into that.
I would treat it as any other kind of content. But there’s a lot of kind of crowd sourcing best practice I think that goes on in the Web development community which helps each other. The problem is when you need to codify things in terms of best practice. So in a way, we have to ask ourselves, okay, are we talking about mere conformance to WCAG or are we talking about building stuff that just works?
>> GIORGIO BRAJNIK: Thank you, Josh. I think there is a hand up by Detlev.
>> I think the question has already been answered.
>> GIORGIO BRAJNIK: Okay. Good.
Any other questions?
We have (Inaudible) on the queue.
>> Can you hear me now?
>> GIORGIO BRAJNIK: Yes, we can.
>> Great. I don’t know why this system seems to mute. So my question is what do people think about the need for a test corpus of pages or errors, these kind of things, that all of the different systems that we’ve had today could work from so that we can understand how they could interact with each other and how they could — and how they compare to each other? Similar to the way that we have corpuses of pages or at least phrases in text mining and I suppose actually some of the — maybe I could direct that question in some way to Luz Rello because I think that seems to be something that they have in their area.
If anybody could answer, it would be fine.
>> GIORGIO BRAJNIK: Thank you. Luz, do you want to say something? Go.
>> LUZ RELLO: Yes. So the question is how to use corpus of papers?
>> So my question is really do people — do we think it’s useful to have a corpus of pages that could be used to test the validity of these metrics and to test and compare them together so that we can understand how they relate to each other better?
>> LUZ RELLO: Ah, okay, that is — yes, I really think that that could be useful. Actually, we are — that’s always useful to compare and to see what is reliability.
Right now, we are building a corpus, actually, from errors that we have, and we are looking for documents that have two or more errors in the same document from our sample, so then we can build a larger corpus and find a statistical model that can tell us what the difference is between this corpus with errors and the rest of the Web, and in which kind of metric.
I hope I have answered your question.
>> GIORGIO BRAJNIK: Okay. Thank you.
>> Yeah, that sounds good. I wonder if anybody else has any ideas — any other accessibility people have ideas about whether they think it’s a good idea or not, really. That kind of answers my question from that end.
>> GIORGIO BRAJNIK: We have Annika. (Inaudible).
>> You want Markel first or Annika?
>> GIORGIO BRAJNIK: Okay, okay, Markel.
>> MARKEL VIGO: So I think that what Simon is suggesting — I think that’s the way to go because, for instance, today we have 11 papers dealing with accessibility metrics, and there’s no way to compare them. I mean, it seems that we need a common framework where we have — where we can test our metrics, a common set — I mean, I think we need a common set of pages that we all, as a community, could use to see how valid and how reliable our metrics are, how they compare to others, and how good they are for some scenarios and how bad they are for some other ones.
I mean, yeah, I fully agree with Simon’s idea, and I think that this should be the way to go to progress as a community working on accessibility and accessibility metrics. It’s a way to standardize and have the common framework for us.
I mean, in the last years, we’ve been seeing that the research community is proposing lots and lots of metrics without taking the previous work into account. And I always wonder why people are proposing new metrics. I mean, I don’t think it’s a bad thing to propose new metrics, but I always wonder what is the contribution, how they improve previous ones, and how do they compare? So that’s why I — I wish we had that framework that Simon is suggesting.
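[Editorial illustration, not part of the spoken record.] One simple way a shared corpus could be used to compare metrics is to score the same pages with two metrics and check how similarly they rank those pages, for example with a Spearman rank correlation. The sketch below assumes that approach; it is not something the speakers proposed in detail, and the page scores and metric names are invented for illustration only.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("a metric that gives every page the same score cannot be ranked")
    return cov / (var_x * var_y) ** 0.5


# Invented scores for five corpus pages under two hypothetical metrics.
metric_a = [0.10, 0.35, 0.50, 0.80, 0.95]
metric_b = [12.0, 30.0, 28.0, 70.0, 90.0]
print(round(spearman(metric_a, metric_b), 3))
```

A value near 1 would suggest the two metrics order the corpus pages similarly, while a value near 0 would suggest they measure different things. Agreement between metrics alone says nothing about validity, which would still need expert or user-based judgments for the same pages.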
>> GIORGIO BRAJNIK: Thank you, Markel. I think that Annika’s question has already been answered.
I would like to ask — to pose the same question to Jorge Fernandes, who by mistake I didn’t give the right of talking beforehand, so if Jorge is still online.
>> Jorge Fernandes.
>> GIORGIO BRAJNIK: Yes, do you want to say something?
>> JORGE FERNANDES: Yes, maybe I could make some input about the initial questions. May I?
>> GIORGIO BRAJNIK: Yes, go ahead. Yeah, go ahead.
>> JORGE FERNANDES: Well, about the target, because now in our paper, we said that our results working already since 2005, so we have already users using our tool, a lot of times. And the target users are Web designers, programmers, all types of content editors, but also managers responsible for Web content, and also users — mainly users with disabilities and their representative organizations.
About the purpose of our tool, the purposes of our tool is helping in the development phase validating templates, making an initial assessment of an entire website, monitoring websites, producing a dynamic label of certification that I think is something that I never saw in a tool before, and we are using the dynamic label of certification since 2005.
We are also producing a grading list, facilitating progress towards the learning curve of Web accessibility also, and producing a quantitative report by page, producing data for studies. Those are some of the purposes of our tool.
About the adoption of our approach, our approach has been extremely well received. Today our tool is being used not only by a biking blog, but also by the main national bank of Portugal. It is being used by tech professionals and also users and organizations that represent people with disabilities.
About the validity of our approach, in our unit here in UMIC, we produced a lot of Web accessibility reports based on manual evaluation made by experts for public organizations. To do this, we have also used other tools, not only our tool. We are one of the main users of (Indiscernible) for example, from Spain, (Indiscernible) and ASIS from Brazil and the tool validator also, and we confront our results with the results of those kind of tools.
I don’t know if I have time yet to —
>> GIORGIO BRAJNIK: No, really no, because we are close to closing the meeting.
>> GIORGIO BRAJNIK: So I think Annika is on the queue list, so if Annika is on the line. And then I suggest we close the line. I ask anybody who has additional questions to send the question to me, and I will make sure to relay the question to the participants of the panel, and then get the answers.
>> ANNIKA NIETZIO: No, I don’t have any further input.
>> GIORGIO BRAJNIK: Okay. I thought you were in the speaker queue. Okay. Good. I think I have to thank a lot the background voice that sometimes popped up, which is Shawn, Shawn’s voice. I was very grateful for your help. Thanks to everybody else that gave us help in organizing the meeting. Of course, most of all, to Shadi. And of course, thank you to all the participants, the panelists, the authors, and also the people on the floor, in the audience. And we look forward to meeting you again somewhere else in some other meeting, and please keep looking at the website of this working group because we will be publishing the research notes that come out of this meeting and the questions and the comments and, of course, the presentations.
So thank you very much.
>> SHADI ABOU-ZAHRA: Hi, everyone. This is Shadi Abou-Zahra again. So yeah, also big thanks to the symposium chairs. As Giorgio was just alluding to, the next step will actually be for the Research and Development Working Group to create the draft of the consolidated input that we received from the symposium. We only managed to scratch the surface today, but we do hope to be able to provide a consolidated input. Look out for that. We will be notifying the participants here of when this draft is available for you to look at.
Please do send your questions, as Giorgio said, either to him privately, if you don’t feel comfortable asking your question publicly. If you do, please send it to the Research and Development Working Group mailing list, which is public-WAI-RD@W3.org. You’ll find that email address also posted on the agenda website of this symposium. So please send us any further inputs you have, any further questions, and we will try to consider those in our consolidated output from this symposium.
The idea is really to have concrete outcomes from the — from this discussion. And we look forward to working with you on this consolidated output. And thank you very much, all, for your participation. I apologize for some of the logistical hiccups we’ve had, but I think overall we saw quite a number of different presentations from different aspects and different angles, and really good questions being raised at the end. So thank you very much, and stay tuned online. Bye-bye.