We have discussed our technical approach and some of the components we used in a few previous posts. Here we summarise the major technical components of the architecture of the UCIAD platform (see figure below) as well as the tools we have reused. We also come back to the biggest technical issue we had to face: scalability, especially when applying Semantic Web technologies to very large log data.
As described before, one of the core principles of UCIAD is the use of Semantic Web technologies, especially ontologies and Semantic Web languages. The goal of the architecture is to extract homogeneous representations of the traces of activity data present in logs and to store them in a common semantic store so that they can be accessed and queried by the user.
The representation format we use for this activity data is RDF (the Resource Description Framework), which is the standard representation format for the Semantic Web and linked data. It is a graph-based data model where the nodes are either literal values or URIs: Web addresses of “data objects”. One advantage of RDF is that the graph data model provides a lot of flexibility in manipulating and extending the data. RDF is also supported by many different tools and systems. More importantly for us, the schema used for RDF is an ontology, represented in the OWL Web Ontology Language. This allows flexibility and granularity in representing the data and, because OWL is a logical formalism, also makes it possible to apply simple inference mechanisms (namely classification), so that “definitions” can be added to the ontology to cluster and group traces of activities automatically (see our post regarding the development of the UCIAD ontologies and the reasoning mechanisms applied with these ontologies).
The core component of our architecture is the semantic store (or triple store). Here we use OWLIM, which has three main advantages for us:
- (In principle) it scales to very large amounts of data (although this has proved to be an optimistic view; see below).
- It provides efficient inference mechanisms for a fragment of OWL (which mostly covers our needs).
- Together with the Sesame interface, it provides standard data access and querying mechanisms through the SPARQL protocol.
As mentioned several times in the past, however, although OWLIM provides a very good base, the scale of the data we have had to handle caused major issues, which introduced a lot of delays in the project. Activity data, in the form of traces from logs, are enormous. OWLIM claims to be able to handle tens to hundreds of billions of RDF triples (connections in the graph), but there are a number of circumstances that need to be considered.
To get an idea of the scale we are talking about, consider a reasonably sized web server at the Open University (e.g., the one used to serve data.open.ac.uk). This server would serve a few million requests per month. Each request (summarised in one line in the logs) is associated with a number of different pieces of information that we re-factor in terms of our ontologies, concerning the actor (IP, agent), the resource (URL, website it is attached to, server), the response (code, size) and other elements (time, referrer). One drawback of a graph representation is that it is not optimised for size. We can therefore obtain anything between 20 and 50 triples per request. That leads us to something in the order of 100 million triples per month per server (each server can host many websites).
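To make the per-request expansion concrete, here is a minimal sketch of how one log line could be turned into triples, together with the volume estimate. It assumes the Apache "combined" log format (the format actually used by the servers is not stated here), and the `http://example.org/trace/` namespace and predicate names such as `hasActor` are purely hypothetical stand-ins, not the UCIAD ontologies. It also produces fewer triples per request than the 20 to 50 we observe with the real ontologies.

```python
import re

# Assumption: Apache "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<code>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical namespace and predicates, NOT the UCIAD ontologies.
EX = "http://example.org/trace/"


def line_to_triples(line, request_id, server="data.open.ac.uk"):
    """Turn one log line into a list of (subject, predicate, object) triples."""
    m = LOG_PATTERN.match(line)
    if m is None:
        raise ValueError("unrecognised log line")
    req = f"{EX}request/{request_id}"
    actor = f"{EX}actor/{m['ip']}"
    resource = f"http://{server}{m['path']}"
    response = f"{req}/response"
    return [
        # the actor (IP, agent)
        (req, EX + "hasActor", actor),
        (actor, EX + "hasIP", m["ip"]),
        (actor, EX + "hasAgent", m["agent"]),
        # the resource (URL, server)
        (req, EX + "hasResource", resource),
        (resource, EX + "onServer", server),
        (req, EX + "hasMethod", m["method"]),
        # the response (code, size)
        (req, EX + "hasResponse", response),
        (response, EX + "hasCode", m["code"]),
        (response, EX + "hasSize", m["size"]),
        # other elements (time, referrer)
        (req, EX + "hasTime", m["time"]),
        (req, EX + "hasReferrer", m["referrer"]),
    ]


def monthly_triples(requests_per_month, triples_per_request):
    """Rough size of one month of logs once converted to triples."""
    return requests_per_month * triples_per_request
```

Taking "a few million" to mean 3 million requests per month (an assumed figure), `monthly_triples(3_000_000, 20)` and `monthly_triples(3_000_000, 50)` give 60 million and 150 million triples, which is consistent with the order of magnitude quoted above.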
In theory, OWLIM should handle this sort of scale easily, even if we consider several servers over several months. However, there are a number of things that make the practice different from the theory:
- OWLIM might be able to store many billions of triples, but not any kind of triples. The data we are uploading to OWLIM is complex and has a refined structure. Some objects (user settings, URLs) are highly connected, while others appear in only one request and share only a few connections. From our experience, it is not only the number of triples that should be considered, but also the number of objects. A graph where each object is only associated with 1 other object through 1 triple might be a lot more difficult to process than one with as many triples, but shared amongst significantly fewer nodes.
- Many triples, but not all at once. This is another very important element for us: OWLIM might be able to “hold” many triples, but that does not mean that they can all be uploaded and processed at the same time. Loading triples into the store takes a lot of resources, and too many triples at the same time might overwhelm it and make it crash. To deal with this, we had to change our process, which originally loaded the log files for an entire month at once, into one where we extract every day the log information for the previous day.
- The two previous issues are amplified when inference mechanisms are applied. OWLIM handles inference at loading time. This means not only that the number of triples uploaded onto the store is multiplied through inference, but also that immensely more resources are required at the time of loading these triples, depending not only on the size of what is uploaded, but also on its complexity (and, as mentioned above, our data is complex) and on the size of what is already stored. Originally, our approach was to have one store holding everything with inferences, and to extract from this store data for each user. We changed this approach to one where the store that keeps the entire dataset extracted from logs does not make use of inference mechanisms. Data extracted for each user (see our post on user management) is then transferred into another (necessarily smaller) store for which inferences apply.
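Two of the points above can be illustrated with a short sketch: counting distinct nodes as well as triples, and loading one day at a time rather than a whole month. This is only an illustration under stated assumptions: `load_batch` is a hypothetical callable standing in for whatever actually uploads triples to the store, and triples are plain Python tuples rather than objects from any RDF library.

```python
from collections import defaultdict
from datetime import datetime


def distinct_nodes(triples):
    """Number of distinct subjects and objects in a list of triples."""
    nodes = set()
    for s, _p, o in triples:
        nodes.add(s)
        nodes.add(o)
    return len(nodes)


def daily_batches(records):
    """Group (timestamp, triples) records by calendar day, oldest day first."""
    by_day = defaultdict(list)
    for timestamp, triples in records:
        by_day[timestamp.date()].extend(triples)
    return [(day, by_day[day]) for day in sorted(by_day)]


def load_incrementally(records, load_batch):
    """Call the (hypothetical) load_batch once per day, not once per month."""
    for day, triples in daily_batches(records):
        load_batch(day, triples)
```

For instance, `[("a", "p", "b"), ("c", "p", "d")]` and `[("a", "p", "b"), ("a", "q", "c")]` both contain two triples, but the first has four distinct nodes and the second only three. It is this second measure that the raw triple count hides.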
There are a number of other reasons why dealing with semantic data still requires a certain amount of “trial and error”. There is an element of magic in it, not only because, when it works, it allows more flexibility and automation than other types of data management approaches, but also because making it work often requires following processes (rituals) that are not completely understood.
Closing on a positive note, however: since we started the project, a new version of OWLIM has been released (4.1), which provides significant improvements over the previous versions. The system now seems better able to load large amounts of data in one go, and also to manage the available resources more cleverly. It also now supports the SPARQL 1.1 query language, which includes aggregation functions, making some of the analysis tasks we are applying easier and less resource-intensive.
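As an illustration of what aggregation support changes, the sketch below contrasts a SPARQL 1.1 query that counts requests per resource inside the store with the client-side counting that would otherwise be needed after fetching every row. The `ex:` prefix and the `ex:hasResource` predicate are hypothetical placeholders, not terms from the UCIAD ontologies, and the query is shown only as a string, without any particular client library.

```python
from collections import Counter

# SPARQL 1.1 aggregation: the store returns one row per resource.
# Assumption: ex:hasResource is a hypothetical predicate linking a
# request to the resource it asked for.
HITS_PER_RESOURCE = """
PREFIX ex: <http://example.org/trace/>
SELECT ?resource (COUNT(?request) AS ?hits)
WHERE { ?request ex:hasResource ?resource . }
GROUP BY ?resource
ORDER BY DESC(?hits)
"""


def hits_client_side(rows):
    """Without aggregation: fetch every (request, resource) row and count locally."""
    counts = Counter(resource for _request, resource in rows)
    return counts.most_common()
```

With the aggregated query, the number of rows returned grows with the number of resources rather than with the number of requests, which is where the saving in resources comes from.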