Wednesday 8 January 2014

Querying Wikipedia data with SPARQL


Wikipedia is a well-known resource for all kinds of information. However, that information cannot easily be queried on the Wikipedia site itself, other than by search term. DBpedia (http://dbpedia.org) is a community effort to make a structured set of Wikipedia data available for querying and analysis.

The query form at http://dbpedia.org/sparql is a great way to test out the kinds of data extraction possible.



As a basic example, the following SPARQL query lists software products developed by companies, ordered by subject and organisation, and is limited to 500 results:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?subject ?organisation ?product
WHERE
{
    # Companies, the software they develop, and each product's subject
    ?organisation rdf:type <http://dbpedia.org/ontology/Company> .
    ?product <http://dbpedia.org/ontology/developer> ?organisation .
    ?product rdf:type <http://dbpedia.org/ontology/Software> .
    ?product <http://purl.org/dc/terms/subject> ?subject .
}
ORDER BY ?subject ?organisation
LIMIT 500
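
The same query can also be sent to the endpoint from a script instead of the web form. Here is a minimal Python sketch of one way to do that, using only the standard library. The function names (build_query_url, parse_bindings, run_query) are my own and purely illustrative. The "query" parameter and the "application/sparql-results+json" media type come from the SPARQL protocol and JSON results format; I am assuming the endpoint honours the Accept header.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # the endpoint mentioned in this post


def build_query_url(query, endpoint=ENDPOINT):
    """Return a GET URL carrying the query in the protocol's `query` parameter."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def parse_bindings(payload):
    """Flatten a SPARQL JSON results document into a list of {variable: value} dicts."""
    doc = json.loads(payload)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query over HTTP and return the parsed rows (needs network access)."""
    request = urllib.request.Request(
        build_query_url(query, endpoint),
        # Assumption: the endpoint returns JSON results when asked via Accept.
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

Calling run_query with the text of the query above should give one dict per result row, with the keys subject, organisation and product. Variables that are unbound in a row are simply absent from that row's dict.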


The following SPARQL query lists the subjects that have at least 50 software products, together with the number of products for each subject:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Count the distinct products recorded against each subject
SELECT ?subject (COUNT(DISTINCT ?product) AS ?productCount)
WHERE
{
    ?organisation rdf:type <http://dbpedia.org/ontology/Company> .
    ?product <http://dbpedia.org/ontology/developer> ?organisation .
    ?product rdf:type <http://dbpedia.org/ontology/Software> .
    ?product <http://purl.org/dc/terms/subject> ?subject .
}
GROUP BY ?subject
# Keep only subjects with at least 50 products
HAVING (COUNT(DISTINCT ?product) >= 50)
ORDER BY ?subject
LIMIT 500
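
To make the GROUP BY and HAVING steps more concrete, here is a rough Python sketch of the same calculation done by hand on (subject, product) pairs. The function name count_products_by_subject and the minimum parameter are hypothetical, chosen only for illustration; this is not how the endpoint evaluates the query, just the equivalent logic.

```python
from collections import defaultdict


def count_products_by_subject(rows, minimum=50):
    """Count distinct products per subject and keep subjects with at least `minimum`."""
    products = defaultdict(set)
    for subject, product in rows:
        # A set ignores repeated (subject, product) pairs, like COUNT(DISTINCT ...).
        products[subject].add(product)
    # Filtering on the count mirrors HAVING; sorting mirrors ORDER BY ?subject.
    return sorted(
        (subject, len(found))
        for subject, found in products.items()
        if len(found) >= minimum
    )
```

The filter is applied after the counting, which is why the query uses HAVING rather than a condition inside the WHERE block.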

The queries above are very simple examples; more complex queries can follow the links between the data available at http://dbpedia.org.

I believe this capability opens up many interesting possibilities, and I have been using data extracted in this way in a small project.