Modules


There are four main modules in the DCAT Harvester. One module handles user interaction through an API and a frontend. Another module caters to consumers of the data, offering RSS feeds and complete RDF export functionality. A shared module with the datastore abstractions connects Elasticsearch and Fuseki to the other modules. Finally, the main module, the actual harvester, downloads DCAT data and validates it before storing it in Fuseki and indexing it in Elasticsearch.

dcat-admin-webapp


Frontend admin interface built with JSP. There are three main views. The admin view is used to administer users: here you can create, update and delete users. The users view is for registering a DCAT source for harvesting and for monitoring harvesting status. The DCAT source view is for inspecting a specific DCAT source, with a history of its 100 previous harvests and any error messages. More error messages are available through Kibana.

The JSP pages are initiated with controllers found in the dcat and user packages. The controllers load the required data and trigger the matching JSP page.
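
To illustrate the pattern (the controller, method and view names here are hypothetical; the real controllers live in the dcat and user packages), a Spring MVC view controller that backs a JSP page looks roughly like this:

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Controller;
    import org.springframework.ui.Model;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestMethod;
    import org.springframework.web.bind.annotation.RequestParam;

    @Controller
    public class DcatSourceViewController { // hypothetical name

        @Autowired
        private AdminDataStore adminDataStore; // datastore abstraction from dcat-datastore

        // Load the requested DCAT source and hand it to the matching JSP page.
        @RequestMapping(value = "/dcatsource", method = RequestMethod.GET)
        public String showDcatSource(@RequestParam("id") String dcatSourceId, Model model) {
            model.addAttribute("dcatSource", adminDataStore.getDcatSourceById(dcatSourceId)); // hypothetical method
            return "dcat_source"; // resolved to a JSP view by the configured view resolver
        }
    }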

Requests initiated from the webpage are handled by an API. All requests use JSON and have matching DTOs in their package.
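
As a sketch of how such a request is handled (the class, path and DTO names below are assumptions), the JSON body is bound straight to a DTO by Spring:

    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestMethod;
    import org.springframework.web.bind.annotation.RestController;

    // Hypothetical DTO mirroring the JSON sent from the webapp.
    class UserDto {
        public String username;
        public String role;
    }

    @RestController
    class UserApiController { // hypothetical name

        // Spring deserializes the JSON request body into the DTO and serializes the return value back to JSON.
        @RequestMapping(value = "/api/users", method = RequestMethod.POST)
        public UserDto createUser(@RequestBody UserDto user) {
            // ... persist the user through the datastore module ...
            return user;
        }
    }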

Security

Spring Security is used to authenticate and authorize users. User information is stored in Fuseki. All requests must be authenticated. Users can manipulate their own data, e.g. their DCAT sources, while admin users can also add and remove other users and edit their DCAT sources.
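
A minimal sketch of how such rules could be expressed with Spring Security (the class name, URL patterns and role names are assumptions, as is the use of form login):

    import org.springframework.context.annotation.Configuration;
    import org.springframework.security.config.annotation.web.builders.HttpSecurity;
    import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
    import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;

    @Configuration
    @EnableWebSecurity
    public class SecurityConfig extends WebSecurityConfigurerAdapter { // hypothetical name

        @Override
        protected void configure(HttpSecurity http) throws Exception {
            http
                .authorizeRequests()
                    .antMatchers("/admin/**").hasRole("ADMIN") // only admin users may manage other users
                    .anyRequest().authenticated()              // everything else still requires a login
                .and()
                .formLogin();
        }
    }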

dcat-api-app


API for the frontend, and also for the RSS feeds and the RDF/JSON-LD data.

The API for retrieving DCAT data from the store is located in "DcatRestController.java" and provides the following functionality.

Retrieve all DCAT data
  Type: GET
  Path: /api/dcat
  Parameter: format
  Accepted values:
    • jsonld
    • rdf/xml
    • application/ld+json
    • application/rdf+xml
  Description: Get all DCAT data in a single request, either as JSON-LD or as RDF/XML.
  Example: /api/dcat?format=rdf%2Fxml

Invalidate cache
  Type: POST
  Path: /api/invalidateCache
  Description: Invalidate the cache for the /api/dcat call.

Refresh cache
  Type: POST
  Path: /api/refreshCache
  Description: Refresh the cache for the /api/dcat call. This call blocks until the cache has been refreshed!
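
As a hedged illustration of calling the export endpoint from Java (the host and port are assumptions; adjust them to your deployment), plain HttpURLConnection is enough:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class FetchAllDcat {
        public static void main(String[] args) throws Exception {
            // Assumes the API is reachable on localhost:8080.
            URL url = new URL("http://localhost:8080/api/dcat?format=jsonld");
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");

            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // the complete DCAT export as JSON-LD
                }
            }
        }
    }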

The API to retrieve all the DCAT data can be slow, so there is a simple cache solution that caches all the RDF data. The cache is set up in "Application.java" and uses Guava to cache the response from Fuseki. The response is cached as plain text, so there will normally be one cache entry per format (e.g. JSON-LD and RDF/XML). Cache entries automatically expire every 24 hours.
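
In outline, the cache amounts to a Guava LoadingCache keyed on the requested format (the class and method names below are assumptions, not the actual code in Application.java):

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;
    import java.util.concurrent.TimeUnit;

    public class DcatResponseCache { // hypothetical name

        // One entry per requested format; the value is the serialized RDF fetched from Fuseki.
        private final LoadingCache<String, String> cache = CacheBuilder.newBuilder()
                .expireAfterWrite(24, TimeUnit.HOURS) // entries expire automatically after 24 hours
                .build(new CacheLoader<String, String>() {
                    @Override
                    public String load(String format) {
                        return loadAllDcatFromFuseki(format);
                    }
                });

        public String get(String format) throws Exception {
            return cache.get(format);
        }

        public void invalidateAll() {
            cache.invalidateAll(); // the next request repopulates the cache lazily
        }

        private String loadAllDcatFromFuseki(String format) {
            // Hypothetical stand-in for the Fuseki query performed by the real cache loader.
            return "";
        }
    }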

As well as RDF, the API can also output an RSS feed. "FeedController.java" provides the following API:

RSS feed
  Type: GET
  Path: /api/rss/feed
  Description: Returns an RSS feed of the DCAT data.

dcat-datastore


Module with abstractions over Fuseki and Elasticsearch.

The AdminDataStore class provides methods for user administration (except deleting users) and DCAT source administration (except deleting DCAT sources). The method for adding a user can also be used to update users, and there is logic that prevents multiple users from having the same username. The same applies to adding DCAT sources; however, multiple users are allowed to register the same URL for harvesting. This does not mean they share the same source.

AdminDcatDataService provides delete methods for both DCAT sources and users. When deleting a user, you first have to clean up all of the user's DCAT sources.
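
The ordering constraint can be sketched like this (the method names are assumptions; only the order matters):

    // Remove every DCAT source belonging to the user before removing the user itself.
    public void deleteUserAndSources(String username) {
        for (DcatSource source : adminDataStore.getDcatSourcesForUser(username)) { // hypothetical lookup
            adminDcatDataService.deleteDcatSource(username, source.getId());       // hypothetical signature
        }
        adminDcatDataService.deleteUser(username);                                 // hypothetical signature
    }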

DcatDataStore is the abstraction used for storing the harvested DCAT data from a DCAT source, and also for retrieving said data.

The Fuseki and Elasticsearch classes are simple abstractions over Fuseki and Elasticsearch that simplify inserts, updates and deletes.
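
As a rough sketch of the kind of operation the Fuseki abstraction wraps (using Apache Jena's DatasetAccessor against a Fuseki data endpoint; the class name, endpoint URL and graph naming are assumptions):

    import org.apache.jena.query.DatasetAccessor;
    import org.apache.jena.query.DatasetAccessorFactory;
    import org.apache.jena.rdf.model.Model;

    public class FusekiWriteSketch {

        // Replace the named graph for a DCAT source with a freshly harvested model.
        public static void replaceGraph(String fusekiDataEndpoint, String graphName, Model model) {
            DatasetAccessor accessor = DatasetAccessorFactory.createHTTP(fusekiDataEndpoint);
            accessor.putModel(graphName, model); // putModel replaces the graph; add(...) would append instead
        }
    }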

dcat-generate-dummy-data


Small stand-alone program for generating test data in DCAT format.

dcat-harvester-app


Module for harvesting DCAT data from sources.

The main component of the system, designed to retrieve DCAT data from various sources and incorporate it into a single database.

A CrawlerJob is the main process of the harvester. It downloads DCAT data from a specified source and performs validation and any required transformations before sending the data to Fuseki and Elasticsearch.
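
In outline, a harvest run does something like the following (a sketch using Apache Jena to fetch the source; the private methods stand in for the real validation, transformation and datastore calls):

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;

    public class HarvestRunSketch {

        public void run(String dcatSourceUrl) {
            // 1. Download the DCAT data; Jena picks a parser based on the content type.
            Model model = RDFDataMgr.loadModel(dcatSourceUrl);

            // 2. Validate and apply any required transformations
            //    (e.g. enrichForEntryscape(model) or enrichForVegvesenet(model), see below).
            validate(model);

            // 3. Hand the result to the datastore module: persist in Fuseki, index in Elasticsearch.
            storeInFuseki(model);
            indexInElasticsearch(model);
        }

        private void validate(Model model) { /* run the SPARQL validation rules */ }
        private void storeInFuseki(Model model) { /* via DcatDataStore */ }
        private void indexInElasticsearch(Model model) { /* via the Elasticsearch abstraction */ }
    }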

The validation is a set of SPARQL queries that return any invalid data. The queries are available in the resources folder under validation-rules/from-eu. The rules are copied from https://github.com/EmidioStani/dcat-ap_validator
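
A validation rule is simply a SELECT query whose result rows describe invalid data. The rule below is made up for illustration (the real rules are in validation-rules/from-eu), but executing one against a harvested model with Jena looks like this:

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;

    public class ValidationSketch {

        // Example rule: every dcat:Dataset must have a dct:title.
        private static final String RULE =
                "PREFIX dcat: <http://www.w3.org/ns/dcat#> " +
                "PREFIX dct:  <http://purl.org/dc/terms/> " +
                "SELECT ?dataset WHERE { " +
                "  ?dataset a dcat:Dataset . " +
                "  FILTER NOT EXISTS { ?dataset dct:title ?title } " +
                "}";

        // Any rows returned by a rule describe invalid data.
        public static void report(Model model) {
            try (QueryExecution qe = QueryExecutionFactory.create(RULE, model)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println("Invalid: dataset without a title: " + row.get("dataset"));
                }
            }
        }
    }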

CrawlerJob information is stored in Fuseki, together with the user information, in the users database.

A CrawlerJob can be initiated with a request to /api/admin/harvest-all or /api/admin/harvest?id=CRAWLER_ID_HERE

To support data from Vegvesenet and from Entryscape, there are two transforms in CrawlerJob: enrichForEntryscape(model) transforms Entryscape data, and enrichForVegvesenet(model) transforms data from Vegvesenet.

XmlToRdf

DCAT Catalogs can contain identifiers for corporations and organisations. These identifiers can be looked up in the Brønnøysund registers to find the name and other information about the organisations.

This information is delivered as XML, so we use XmlToRdf to convert it. This is an early version of XmlToRdf; the library has since been developed further and released as an open-source project: https://github.com/AcandoNorway/XmlToRdf

Building and running


Prerequisites

  • docker
  • docker-compose
  • a unix operating system (for simplicity, not really a requirement)
  • Java 8
  • Maven > 3.3.x

The easiest way to build and run the system is the command-line script ./rundocker.sh, which builds the Java modules, builds all the Docker images and then starts the containers.

The admin interface should now be available at http://localhost:8080/dcat-admin-webapp/

The docker setup is not intended for production use.