Commercial / Community Scraping! #hhhmcr #a11y #accessibility

Screen Scraping and Transcoding

I was recently contacted by ‘ScraperWiki’, who are running an event in Manchester called ‘Hacks and Hackers Hack Day’. They say:

We hope to attract ‘hacks’ and ‘hackers’ from all different types of backgrounds: people from big media organisations, as well as individual online publishers and freelancers… The aim is to show journalists how to use programming and design techniques to create online news stories and features; and vice versa, to show programmers how to find, develop, and polish stories and features. All sorts of data was scraped and played with at our past events: in Liverpool, projects included mashes of police, libraries and courts data. Birmingham saw lots of health-related projects, as well as scraping of political party donor and leisure centre information.

However, looking further into their aims, ScraperWiki describes itself as “a platform to scrape and store public data in a structured and usable format”. We have seen data scraping before, in PiggyBank and the BBC’s RDF triple-store, but this seems to be an engine for scraping many resources and making them available for some end purpose. In Web accessibility, scraping has been used for a long time. Initial attempts at screen-reading technology only read the screen as presented to a visual user, and were called ‘screen scraping’ because they produced only superficial information about the text being translated. As the visual complexity of Web pages increased, these screen readers became inadequate, because Web documents rely on context, linking, and the deeper document structure to convey information in a useful way. In response, Web browsers and page readers for visually impaired users were created to access this deeper structure by directly examining the XHTML or the Document Object Model (DOM). By examining the precise linguistic meaning of the text, it was hoped that more complex meanings (associated with style, colour, etc.) could be derived. However, when interacting with complex Web documents these readers, although better than screen scrapers, still do not convey an understanding of the meaning of the underlying structure, which is vital for the cognition of information.
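To make the contrast concrete, here is a minimal sketch of my own (not ScraperWiki code; the URL is a placeholder) showing the difference between flat screen-scraped text and querying the DOM for the structural information a reader actually needs:

# Flat, screen-scraping style: only superficial text survives,
# with no headings, links or document structure.
$html = file_get_contents('http://example.com/page.html');
print substr(strip_tags($html), 0, 200) . "\n";

# DOM style: query the same page for its structural elements
# using PHP's built-in DOMDocument and XPath.
$dom = new DOMDocument();
@$dom->loadHTML($html); # suppress warnings on messy markup
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//h1 | //h2 | //h3') as $heading) {
    print $heading->nodeName . ': ' . trim($heading->textContent) . "\n";
}
foreach ($xpath->query('//a[@href]') as $link) {
    print 'link: ' . trim($link->textContent) . ' -> ' . $link->getAttribute('href') . "\n";
}

The first approach gives roughly what an early screen scraper saw; the second exposes the heading hierarchy and link targets that structure-aware readers rely on.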

I wonder if there is some way we could now use their scraping technology in combination with our own (the community’s) to build accessible and semantically structured data and store it centrally. Their example code gives a sense of the approach:

######################################
# Basic PHP scraper
######################################

require 'scraperwiki/simple_html_dom.php';

$html = scraperwiki::scrape("http://scraperwiki.com/hello_world.html");
print $html;

# Use the PHP Simple HTML DOM Parser to extract <td> tags
$dom = new simple_html_dom();
$dom->load($html);

foreach ($dom->find('td') as $data)
{
    # Store data in the datastore
    print $data->plaintext . "\n";
    scraperwiki::save(array('data'), array('data' => $data->plaintext));
}

It seems they are using a simple template-based approach: not really as heterogeneous as our CSS-based work, and maybe not as rich as IBM TRL’s transcoding either. Nonetheless, this is an interesting development.
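As a rough illustration of what combining the two might look like, here is a hedged sketch in the same style as their example; the page URL, the div.listing and span.venue selectors and the field names are invented for illustration, and only scraperwiki::scrape and scraperwiki::save are taken from the example above:

require 'scraperwiki/simple_html_dom.php';

# Hypothetical page and CSS-style selectors, for illustration only.
$html = scraperwiki::scrape("http://example.org/listings.html");

$dom = new simple_html_dom();
$dom->load($html);

# Selectors follow the document structure rather than a fixed <td> template.
foreach ($dom->find('div.listing') as $listing) {
    $title = $listing->find('h2 a', 0);
    $venue = $listing->find('span.venue', 0);

    # Store labelled fields rather than one undifferentiated column,
    # so the semantics survive into the datastore.
    scraperwiki::save(
        array('title'),
        array(
            'title' => $title ? trim($title->plaintext) : '',
            'url'   => $title ? $title->href : '',
            'venue' => $venue ? trim($venue->plaintext) : ''
        )
    );
}

Keeping labelled fields rather than a single ‘data’ column is what would let a later transcoding step rebuild the structure for assistive technology, instead of handing it another flat stream of text.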
