Final post – Putting things together (with a demo)

August 5th, 2011

Over the last 6 months we have been working on building the UCIAD platform, experimenting with large-scale activity data, reflecting on user-centric data and blogging our thoughts. While, as can be seen from the last few posts on this blog, there is quite a lot of work we think should follow from this, it is nice to see things finally coming together, and to be able to show a bit of the UCIAD platform we have been talking about for some time. What better way to do this than with a video of the running platform, showing the different components in action? (Note: it is better to watch it in 720p – HD).

This video shows a user (me) registering to the UCIAD platform with some setting details and browsing his activity data as they appear on several Open University websites (mostly, an internal wiki system and the Open University’s linked data platform – data.open.ac.uk). This video therefore integrates in a working demo the different components we have been talking about here:

  • User management: As we can see here, as the user registers into the UCIAD platform, his current setting is automatically detected, and other settings (other browsers) that are likely to be his are also included. As the user registers, the settings are associated to his account and the activity data realised through these settings are extracted.
  • Extracting user-centric activity data: As described in the first part of the blog post on reasoning (previous link), the settings associated with the user are used to extract the activity data around this particular user, creating a sub-graph corresponding to his activity.
  • Ontologies to make sense of activity data: The ontologies are used in structuring the data according to a common schema and to provide a base to homogeneously query data coming from different systems. As discussed below, they can also be extended (specified) so that different categories of activities and resources can be represented, and reasoned upon.
  • Ontological reasoning for analysis: What the demo video shows clearly is how the activity data is organised according to different categories (traces, webpages, websites, settings, etc.) coming from the base ontologies, but also according to classes of activities, resources, etc. that have been specially added to cover the websites and the particular user in this case. Here, we extended the ontology in order to include definitions of activities relevant to the use of a wiki and a data platform. The powerful aspect of using ontologies here is that such classes can be added to the ontology for the system to automatically process them and organise the data according to them. Here, for example, we define “Executing a SPARQL Query” as an activity that takes place on a SPARQL endpoint with a “query” parameter, or “Checking Wiki Updates” as an activity on a Wiki page that is realised through an RSS client.
  • Browsing data according to ontologies: We haven’t described this component yet, but we rely on a homemade “browser” that we use in a number of projects and that can inspect ontology classes and members of these classes, generating graphs and simple stats.

Next steps

There are a lot of things to mention here, some of which we have already mentioned several times. An obvious one is the finalisation, distribution and deployment of the UCIAD platform. A particular element we want to get done in the short term is to investigate the use of the UCIAD platform with various users, to see what kind of extensions of the ontologies would be commonly useful, and generally to get some insight into the reactions of users when exposed to their own data.

More generally, we think that there is a lot more work to do both on user-centric activity data and on the use of ontologies for the analysis of such data, as described in particular in our Wins and Fails post. This includes the licensing, distribution and general management of user-centric data (as mentioned in our post on licensing). Indeed, while “giving back data to the users” is already technically difficult, there is currently a lot of fuzziness around the ownership of activity data. This also forces us to look a lot more carefully at the privacy challenges that such data can generate, and that did not exist when these data were held, and stayed, on server logs.

Beyond UCIAD and the Open University

As discussed in our post on the benefits of UCIAD, the issues considered go largely beyond the Open University and even activity data. The issues around licensing in particular are to be considered more broadly, in the same way as the challenges around communicating on user-centric data.

We have been focusing mostly on the technical issues in UCIAD, providing in this way a base framework to start investigating these broader and more complex challenges.

Most significant lessons

To put it simply, the most significant lessons we learnt (as mentioned in the wins and fails post) are:

  • Both user-centric data and ontologies are complex notions, so don’t assume they are understood.
  • Activity data are massive and complex, beyond what can be handled by current semantic data infrastructures, without a bit of clever management.
  • There is a lot of potential in using ontologies and ontological engineering for the analysis and interpretation of raw data.

Wins and fails (lessons along the way)

August 3rd, 2011

If there is one thing I like about the JISC activity data programme in which UCIAD is involved, it is that the instructions were very clear: your project is a short experiment, to see what could/should be done in the area of activity data in the context of higher education organisations (or at least, this is what I heard). We have taken that to heart in UCIAD, starting from our two basic hypotheses: that a user-centric perspective on activity data is relevant, and that Semantic Web technologies, especially ontologies, provide the right technological basis to achieve such a perspective.

We have discussed in a number of previous posts what we got excited about, what showed us the feasibility, relevance and potential impact of our approach, as well as what unexpected issues we had to face and how some of our assumptions turned out to be wrong. Here, we wanted to give a quick summary of these “wins” and “fails”, starting of course from the wins, and looking at the two aspects corresponding to our two hypotheses: the user-centric view and the semantic technologies view.

Wins – What went right

  • On the user-centric view: Giving data back to the user, user-centric data and consumer data were already emerging trends when we started the project, but they have clearly exploded in the last few months as topics that organisations should take into account. The New York Times article “Show Us the Data. (It’s Ours, After All.)” has in particular generated a lot of discussion amongst consumer representatives and “data-managers” in various organisations. The mydata project launched by the UK government is also a clear sign that the push for more transparency has to extend to private and public organisations dealing with people’s data. There have already been strong reactions from large companies such as Google, launching its own Data Liberation Front. Generally, users want (and will want more and more) to access their data and to use them for their own benefit, and assume the right to do so. Even the single feature of exporting one’s own activity data is technically non-trivial, but it is of obvious relevance in the current climate, where a lot of emphasis is put on transparency while personal information can be distributed across many different and isolated systems. Beyond the general climate, we have also shown that activity data is not only relevant as aggregated at the level of an organisation, but can give a new perspective when individual users are kept visible in the data (see this post for an explanation of what we mean here). To put it simply, giving people a view on their activity data provides a way for them to reflect on it, and to become more efficient in these activities. It also gives them an opportunity to engage with the data and “customize” it, with added value for the organisation.
  • On Semantic Web technologies: We have a lot of experience working with ontologies and semantic data, and were therefore confident that there was great potential here. However, this is probably the point on which most people external to the project would think we had the best chance to fail: we believed that we could apply semantic technologies, linked data-based approaches and (most horribly) ontology-based reasoning to the manipulation, processing and analysis of activity data. Realising the experiments, setting up the UCIAD platform with real, large-scale data, applying ontologies on top of these data and evolving these ontologies to create different views for the analysis of these data are, from my very personal point of view, the most interesting part of the project. Ontologies have recently acquired a bad reputation, and mentioning them (especially in the context of activity data) now often leads to raised eyebrows and condescending looks. One thing that our experiments with UCIAD have shown is that working with ontologies not only has the advantages of introducing formality, shared vocabularies and semantics in our applications, but also represents a flexible and efficient way of introducing meaningful views into large amounts of raw, uninterpreted data. What ontologies bring to such an area is the ability to give definitions that will be the basis for clustering and organising the data automatically. I can tell what I mean by a “search activity” and magically see all the traces related to search activities being put together, to become explorable and queryable (see our post on reasoning). The nice thing about UCIAD is that this magic is actually implemented and working in the way we hypothesized it would. It is a fascinating thing to see raw data from log files being classified into meaningful categories of activities, resources and actors. It is even more fascinating knowing that we defined these groups, by encoding their definitions in an ontology, and can add others as we see fit. Due to time constraints, we could only experiment a tiny bit with this process, but we see a very promising approach in the incremental definition of the ontology as an analysis process: looking at the data, thinking that it would make sense to have an activity category such as, for example, “commenting on a blog”, and simply adding it to see the data being automatically reorganised with this new definition.

Fails – What went wrong

  • On the user-centric view: Our biggest failure, in my opinion, has been that we didn’t manage to communicate appropriately on the notions, approaches and change of perspective that the user-centric view on activity data represents. There are many reasons for this I believe, one being that we have been assuming that the benefits would be self-evident, while they clearly are not (see the post where we tried to get back to the basics of the issue). The notion of user-centric data or consumer data might be very trendy, but that does not mean that it is ready for wide adoption. There are many issues that need to be solved that go far beyond the purely technical aspects, and that simply come from the fact that activity data has never been looked at in this way before. We don’t really know what will happen in this space, what users would do with these data and how much interest this could generate for the organisation. There are many difficult questions that we could not really address in the scope of the project (including in particular the questions around data ownership and privacy). While this is enough to keep us excited, there is enormous work to be done before the approach we have been promoting in UCIAD can reach its potential and be widely adopted.
  • On Semantic Web technologies: While we are still excited about the added value that Semantic Web technologies can bring to the analysis of activity data, we have clearly been over-optimistic regarding the maturity of some components we have been relying on, and their ability to handle the scale and complexity of the kind of data we are working with. This issue is summarised in our post on the technical aspects of UCIAD. The good news, however, is that things are evolving very quickly. It would be a lot easier to implement the UCIAD platform now than it was 6 months ago, as the tools and platforms to deal with semantic data are getting more robust every day. Also, the evolution of the technology should be followed by an evolution in the skills and ability of the community to adopt such technologies. Realising UCIAD made us reach a better understanding of what was feasible and required to set up a semantic platform for activity data. There is still much to do for such an approach to become feasible in a broader set of situations.

Technical and Standards

August 2nd, 2011

We have been talking about our technical approach and some of the components we used in a few previous posts. Here we summarise the major technical components of the architecture of the UCIAD platform (see figure below) as well as the tools we have reused. We also come back to the biggest technical issue we had to face: scalability, especially when applying Semantic Web technologies to very large log data.

As described before, one of the core principles of UCIAD is the use of Semantic Web technologies, especially ontologies and Semantic Web languages. The goal of the architecture above is to extract from logs homogeneous representations of the traces of activity data present in logs and store them in a common semantic store so that they can be accessed and queried by the user.

The representation format we use for this activity data is RDF, the Resource Description Framework, which is the standard representation format for the Semantic Web and linked data. It is a graph-based data model where the nodes are either literal values or URIs: Web addresses of “data objects”. The advantage of RDF is that the graph data model provides a lot of flexibility in manipulating the data and extending it. RDF is also supported by many different tools and systems. More importantly for us, the schema used for RDF is an ontology, represented in OWL, the Web Ontology Language, which allows flexibility and granularity in representing the data, but also, as a logical formalism, makes it possible to apply simple inference mechanisms (namely classification) so that “definitions” can be added to the ontology to cluster and group traces of activities automatically (see our post regarding the development of the UCIAD ontologies and the reasoning mechanisms applied with these ontologies).

The core component of our architecture is the semantic store (or triple store). Here we use OWLIM, which has three main advantages for us:

  • (In principle) it scales to very large amounts of data (although, this has proved to be an optimistic view, see below).
  • It provides efficient inference mechanisms for a fragment of OWL (which mostly covers our needs)
  • Together with the Sesame interface, it provides standard data access and querying mechanisms through the SPARQL protocol.

As mentioned several times in the past, however, despite OWLIM providing a very good base, the scale of the data we have had to handle generated major issues, which introduced a lot of delays in the project. Activity data, in the form of traces from logs, are enormous. OWLIM claims to be able to handle tens to hundreds of billions of RDF triples (connections in the graph), but there are a number of circumstances that need to be considered.

To get an idea of the scale we are talking about, we consider a reasonable web server at the Open University (e.g., the one used to serve data.open.ac.uk). This server would serve a few million requests per month. Each request (summarised in one line in the logs) is associated with a number of different pieces of information that we will re-factor in terms of our ontologies, concerning the actor (IP, agent), the resource (URL, website it is attached to, server), the response (code, size) and other elements (time, referrer). One of the things with using a graph representation is that it is not optimised for size. We therefore can obtain anything between 20 and 50 triples per request. That leads us to something in the order of 100 million triples per month per server (each server can host many websites).
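The estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the figure of 3 million requests is an assumption standing in for "a few million", and the 20 to 50 triples per request is the range quoted above.

```python
# Rough estimate of the monthly triple volume for a single web server.
# ASSUMPTION: 3 million requests stands in for "a few million requests per month".
REQUESTS_PER_MONTH = 3_000_000
TRIPLES_PER_REQUEST = (20, 50)  # range quoted in the text


def monthly_triples(requests, triples_per_request):
    """Number of triples produced by one month of requests."""
    return requests * triples_per_request


low = monthly_triples(REQUESTS_PER_MONTH, TRIPLES_PER_REQUEST[0])
high = monthly_triples(REQUESTS_PER_MONTH, TRIPLES_PER_REQUEST[1])
print(f"{low:,} to {high:,} triples per month")
```

With these assumed numbers the range brackets the "order of 100 million triples per month per server" mentioned above.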

In theory, OWLIM should handle this sort of scale easily, even if we consider several servers over several months. However, there are a number of things that make the practice different from the theory:

  • OWLIM might be able to store many billions of triples, but not any kind of triples. The data we are uploading to OWLIM is complex, and has a refined structure. Some objects (user settings, URLs) would appear very connected, while others would only appear in one request, and share only a few connections. From our experience, it is not only the number of triples that should be considered, but also the number of objects. A graph where each object is only associated with 1 other object through 1 triple might be a lot more difficult to process than one with as many triples, but shared amongst significantly fewer nodes.
  • Many triples, but not all at once. This is another very important element for us: OWLIM might be able to “hold” many triples, but it does not mean that they can all be uploaded and processed at the same time. Loading triples into the store takes a lot of resources, and too many triples at the same time might overwhelm it and make it crash. To deal with this, we had to change our process, which originally loaded the log files for an entire month at once, into one where we extract every day the log information for the previous day.
  • The two previous issues are amplified when inference mechanisms are applied. OWLIM handles inferences at loading time. This means not only that the number of triples uploaded onto the store is multiplied through inference, but also that immensely more resources are required at the time of loading these triples, depending not only on the size of what is uploaded, but also on its complexity (and, as mentioned above, our data is complex) and on the size of what is already stored. Originally, our approach was to have one store holding everything with inferences, and to extract from this store data for each user. We changed this approach to one where the store that keeps the entire dataset extracted from logs does not make use of inference mechanisms. Data extracted for each user (see our post on user management) is then transferred into another (necessarily smaller) store for which inferences apply.
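The move from monthly to daily loading described above can be illustrated with a small sketch that splits log lines into one batch per day. It is not the UCIAD loader: it assumes Apache-style access logs with a bracketed timestamp, the sample lines are made up, and the step that would convert each batch to RDF and send it to the store is left out.

```python
import re
from collections import defaultdict
from datetime import datetime

# ASSUMPTION: Apache-style timestamps such as [01/Aug/2011:09:15:02 +0100].
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def group_by_day(log_lines):
    """Split log lines into batches keyed by calendar day."""
    batches = defaultdict(list)
    for line in log_lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines without a recognisable timestamp
        when = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        batches[when.date()].append(line)
    return dict(batches)


# Made-up sample lines, for illustration only.
sample_log = [
    '192.0.2.10 - - [01/Aug/2011:09:15:02 +0100] "GET /sparql HTTP/1.1" 200 512',
    '192.0.2.10 - - [01/Aug/2011:17:40:11 +0100] "GET /ub/ HTTP/1.1" 200 2048',
    '192.0.2.22 - - [02/Aug/2011:08:01:45 +0100] "GET /ub/feed/ HTTP/1.1" 200 1024',
]

daily_batches = group_by_day(sample_log)
for day in sorted(daily_batches):
    # Each batch would be converted and loaded separately, one day at a time.
    print(day, len(daily_batches[day]))
```

Keeping each load small in this way limits how many triples the store has to process at once.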

There are a number of other reasons why dealing with semantic data still requires a certain amount of “trial and error”. There is an element of magic in it, not only because when it works, it allows more flexibility and automation than other types of data management approaches, but also because making it work often requires following processes (rituals) that are not completely understood.

Closing on a positive note however, since we started the project, a new version of OWLIM has been released (4.1), which provides significant improvements over the previous versions. The system now seems better able to load large amounts of data in one go, and also to manage the resources available more cleverly. It also now supports the SPARQL 1.1 query language, which includes aggregation functions, making some of the analysis tasks we are applying easier and less resource-consuming.

Licensing & reuse of software and data

July 31st, 2011

Deciding on licensing and data distribution is always challenging when talking about data which are intrinsically personal: activity data. Privacy issues are of course relevant here. We cannot distribute openly, or even on a proprietary basis, data that relate to users’ actions and personal data on our systems. Anonymisation approaches exist that are supposed to make users unidentifiable in the data. Such approaches however cannot be applied in UCIAD for two main reasons:

  • Such anonymisation mechanisms are only guaranteed in very closed, controlled environments. In particular, they assume that it is possible to completely characterise the dataset, and that integration with other datasets will not happen. These are two assumptions that we can’t apply to our data, as they are always evolving (in ways that might make established parameters for anonymisation suddenly invalid) and are meant to be integrated with other data.
  • The whole principle of the project is to distribute the data to the user it concerns, which means that the user is at the centre of the data. Anonymising data related to one user, while giving it back to this user, of course makes no sense. More generally, anonymisation mechanisms are based on aggregating data into abstracted or averaged values so that individual users disappear. This is obviously in contradiction with the approach taken in UCIAD.

The issue with licensing data in UCIAD is actually even more complicated: what licence should apply to data exported for a particular user? The ownership of the data is not even clear in this case. It is data collected and delivered by our systems, but produced out of the activities of the user. We believe that in this case a particular type of licence is needed, one that gives users control over the distribution of their own data without opening it completely. This is an area that we will need to put additional work into, with possibly useful results coming out of the mydata project.

Of course, despite this very complicated issue, more generic components of UCIAD can be openly distributed. These include the UCIAD ontologies, as well as the source code of the UCIAD platform, manipulating data according to these ontologies.

Reasoning over user-centric activity data

June 16th, 2011

There are two reasons why we believe ontology technologies will benefit the analysis of activity data in general, and from a user-centric perspective in particular. First, ontology-related technologies (including OWL, RDF and SPARQL) provide the necessary flexibility to enable the “lightweight” integration of data from different systems. Not only can we use our ontologies as a “pivot” model for data coming from different systems, but this model is also easily extensible, both to take account of the particularities of the different systems around and to allow for custom extensions for particular users, making personalised analysis of personal data feasible.

The second advantage of ontologies is that they allow for some form of reasoning that makes it easier for us to just throw data into them and obtain meaningful results. I use reasoning in a broad sense here to show how, based on raw data extracted from the logs of Web servers, we can obtain a meaningful, integrated view of the activity of a user of the corresponding websites. This is based on current experiments realised with 2 servers hosting various websites, including blogs such as uciad.info, as well as the linked data platform of the Open University, data.open.ac.uk.

Traces of activities around a user

The first piece of inference that we need to realise is to be able to identify and extract, within our data, information related to the particular traces of activities realised by a user. To identify a user, we rely here on the settings used to realise the activity. A setting, in our ontology, corresponds to a computer (generally identified by its IP address) and an agent (generally a browser, identified by a generally complex string such as Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24). The first step is therefore to associate a user with the settings he/she usually uses. We are currently developing tools so that a user can register on the UCIAD platform and have his/her settings automatically detected. Here, I manually declared the settings I’m using by providing the triple store with the following piece of RDF:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:actor="http://uciad.info/ontology/actor/">
<rdf:Description rdf:about="http://uciad.info/actor/mathieu">
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/4eafb6e074f46857b1c0b4b2ad0aa8e4"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/c97fc7faeadaf5cac0a28e86f4d723c9"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/eec3eed71319f9d0480ff065334a5f3a"/>
</rdf:Description>
</rdf:RDF>

This indicates that the user http://uciad.info/actor/mathieu has three settings. These settings are all on the same computer and correspond to the Safari and Chrome browsers, as well as the Apple PubSub agent (used in retrieving RSS feeds amongst other things).

Each trace of activity is realised through a setting (linked to the trace by the hasSetting ontology property). Knowing the settings of a user therefore allows us to list the traces that correspond to this particular user through a simple query. Even better, we can create a model, i.e. an RDF graph, that contains all the information related to the user’s activity on the considered websites, using a SPARQL construct query:

PREFIX tr:<http://uciad.info/ontology/trace/>
PREFIX actor:<http://uciad.info/ontology/actor/>
construct {
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
} where{
  <http://uciad.info/actor/mathieu> actor:knownSetting ?set.
  ?trace tr:hasSetting ?set.
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
}

The results of this query correspond to all the traces of activities in our data that have been realised through known settings of the user http://uciad.info/actor/mathieu, as well as the surrounding information. Although this query is a bit rough at the moment (it might include irrelevant information, or miss relevant data that are connected to the traces through too many steps), what is really interesting here is that it provides a very simple and elegant mechanism to extract, from large amounts of raw log data, a subgraph that completely characterises the activities of one user on the considered websites. This data can therefore be considered on its own, as a user-centric view on activity data, rather than a server-centric or organisation-centric view. It can also be provided back to the user, exported in a machine-readable way, so that he/she can make use of it in other systems and for other purposes.
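To make the idea of a bounded subgraph more concrete, here is a plain-Python approximation of what the query does, working on triples held in a list rather than in a triple store. It is a sketch, not the UCIAD code: the sample data and the property name tr:hasPageInvolved are hypothetical, and unlike the query as written (which only returns complete four-step paths) it also keeps shorter paths around each trace.

```python
from collections import deque


def extract_user_subgraph(triples, user, max_depth=4):
    """Collect triples reachable from the user's traces in at most max_depth steps."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((p, o))

    # Settings declared for the user, then traces realised through them.
    settings = {o for s, p, o in triples if s == user and p == "actor:knownSetting"}
    traces = {s for s, p, o in triples if p == "tr:hasSetting" and o in settings}

    subgraph = set()
    seen = set(traces)
    queue = deque((trace, 0) for trace in traces)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the step limit
        for p, o in by_subject.get(node, []):
            subgraph.add((node, p, o))
            if o not in seen:
                seen.add(o)
                queue.append((o, depth + 1))
    return subgraph


# Hypothetical sample data, using short names instead of full URIs.
sample_triples = [
    ("actor:mathieu", "actor:knownSetting", "setting:chrome"),
    ("trace:1", "tr:hasSetting", "setting:chrome"),
    ("trace:1", "tr:hasPageInvolved", "page:home"),
    ("page:home", "isPartOf", "site:uciad"),
    ("trace:2", "tr:hasSetting", "setting:other"),
    ("trace:2", "tr:hasPageInvolved", "page:about"),
]

user_subgraph = extract_user_subgraph(sample_triples, "actor:mathieu")
for triple in sorted(user_subgraph):
    print(triple)
```

The step limit plays the same role as the fixed chain of patterns in the query: anything further away from a trace is left out of the user's graph.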

We are currently working on the mechanisms allowing users to register/login to the UCIAD platform, to identify their settings and to obtain their own “activity data repository”.

Reasoning about websites and activities

The second aspect of reasoning with user-centric activity data relates to inferring information from the data itself, to support its interpretation and analysis. What we want to achieve here is, through providing ontological definitions of different types of activities, to be able to characterise different types of traces and classify them as evidence of particular activities happening.

The first step in realising such inferences is to characterise the resources over which activities are realised: in our case, websites and webpages. Our ontologies define a webpage as a document that can be part of a webpage collection, and a website as a particular type of webpage collection. As part of setting up the UCIAD platform, we declare in the RDF model the different collections and websites that are present on the considered server, as well as the URL patterns that make it possible to recognise webpages as parts of these websites and collections. These URL patterns are expressed as regular expressions, and an automatic process is applied to declare triples of the form page1 isPartOf website1 or page2 isPartOf collection1 when the URLs of page1 and page2 match the patterns of website1 and collection1 respectively.
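A minimal sketch of this pattern-matching step could look as follows. The patterns, collection names and page URLs are hypothetical examples (only http://uciad.info/ub/ appears in this post), and the real process writes triples into the RDF model rather than returning tuples.

```python
import re

# HYPOTHETICAL patterns: one website and one webpage collection.
COLLECTION_PATTERNS = {
    "website:uciad-blog": re.compile(r"^http://uciad\.info/ub/"),
    "collection:rss-feeds": re.compile(r"/feed/?$"),
}


def declare_membership(page_urls, patterns):
    """Return (page, 'isPartOf', collection) for every URL matching a pattern."""
    triples = []
    for url in page_urls:
        for collection, pattern in patterns.items():
            if pattern.search(url):
                triples.append((url, "isPartOf", collection))
    return triples


# Made-up page URLs, for illustration only.
pages = [
    "http://uciad.info/ub/feed/",
    "http://uciad.info/ub/2011/08/final-post/",
    "http://data.open.ac.uk/sparql",
]

membership = declare_membership(pages, COLLECTION_PATTERNS)
for triple in membership:
    print(triple)
```

A page can match several patterns, which is what allows it to belong to a website and to a collection at the same time.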

Now, the interesting thing is that these websites, collections and webpages can be further specified into particular types and as having particular properties. We for example declare that http://uciad.info/ub/ is a Blog, which is a particular type of website. We can also declare a webpage collection that corresponds to RSS feeds, using a particular URL pattern, and use an ontology expression to declare the class of BlogFeed as the set of webpages which are both part of a Blog and part of the RSSFeed collection, i.e., in the OWL abstract syntax:

Class(BlogFeed complete
    intersectionOf(Webpage
      restriction(isPartOf someValuesFrom(RSSFeed))
      restriction(isPartOf someValuesFrom(Blog))
    )
)

What is interesting here is that such a definition can be added to the repository, which, using its inference capability, will derive that certain pages are BlogFeed, without this information being directly provided in the data, or the rule to derive it being hard-coded in the system. We can therefore engage in an incremental construction of an ontology characterising websites and activities generally, in the context of a particular system, or in the context of a particular user. Our user http://uciad.info/actor/mathieu might for example decide to add to his data repository the ontological definition allowing him to recognise traces over BlogFeed realised with the Apple PubSub agent as a particular category of activities (e.g., FeedSyndication), alongside others that characterise other kinds of activities: recovering data, reading news, commenting, editing, searching, …
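To give an intuition of what the reasoner derives from the BlogFeed definition, the following sketch applies the same condition by hand to a few asserted facts. It is only an illustration: an OWL reasoner works under different (open-world) semantics and derives far more than this, and the page and collection names are hypothetical.

```python
def classify_blog_feeds(is_part_of, types):
    """Pages typed Webpage that are part of some RSSFeed and of some Blog."""

    def part_of_some(page, cls):
        # True if the page is part of at least one collection of the given class.
        return any(cls in types.get(c, set()) for c in is_part_of.get(page, set()))

    return {
        page
        for page in is_part_of
        if "Webpage" in types.get(page, set())
        and part_of_some(page, "RSSFeed")
        and part_of_some(page, "Blog")
    }


# Hypothetical asserted facts.
types = {
    "page:feed": {"Webpage"},
    "page:post": {"Webpage"},
    "site:uciad-blog": {"Blog"},
    "collection:rss": {"RSSFeed"},
}
is_part_of = {
    "page:feed": {"site:uciad-blog", "collection:rss"},
    "page:post": {"site:uciad-blog"},
}

blog_feeds = classify_blog_feeds(is_part_of, types)
print(blog_feeds)
```

The difference with the ontology-based approach is that here the rule is hard-coded in a function, whereas in UCIAD the definition is data that can be added to the repository without changing the system.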

UCIAD ontologies: A starting point

March 23rd, 2011

UCIAD intends to use ontologies both as a way to achieve the integration of activity data across various, possibly heterogeneous systems, and to benefit from their inference capabilities to support the flexible, customisable and expressive analysis of such activity data. Building an ontology that could be used as a conceptual model for all sorts of activity data is quite obviously a difficult task, and the ontology is going to be refined and iterated on over the length of the project (and hopefully beyond the end of the project).

However, compared to other domains, the advantage of user activities is that there is a lot of data to look at. This might be seen as an issue (from a technical point of view, but also because it is quite overwhelming to get so much data), but in reality, this allows us to apply a bottom-up approach to building our ontologies: modelling through characterising the data, rather than through expertise in the domain. It also gives us an insight into the scale of the tasks, and the need for adapted tools to support both the ontological definition of specific situations, and the ontology-based analysis of large amounts of traces of activity data.

Identifying concepts and their relations

The first step in building our ontology is to identify the key concepts, i.e., the key notions, that we need to tackle, bearing in mind that our ultimate goal is to understand activities. The main concepts we are considering are therefore the ones that support the concept of activity. Activities relate to users, but not only to users. We rely extensively on website logs as sources of activity data. In these cases, we can investigate requests both from human users and from robots automatically retrieving and crawling information from the websites. The server logs in question can be seen as collections of traces of activities that these users/robots are realising on websites. We therefore need to model these other aspects, which correspond to actions that are realised by actors on particular resources. These are the three kinds of objects that, in the context of Web platforms, we want to model, so that they can be interpreted and classified in terms of activities. We therefore propose 4 ontologies to be used as the basis of the work in UCIAD:

  • The Actor Ontology is an ontology representing different types of actors (human users vs robots), as well as the technical setting through which they realise online activities (computer and user agent).
  • The Sitemap Ontology is an ontology to represent the organisation of webpages in collections and websites, and which is extensible to represent different types of webpages and websites.
  • The Trace Ontology is an ontology to represent traces of activities, realised by particular agents on particular webpages. As we currently focus on HTTP server logs, this ontology contains specific sections related to traces as HTTP requests (e.g., methods as actions and HTTP response codes). It is however extensible to other types of traces, such as specific logs for VLEs or search systems.
  • The Activity Ontology is intended to define particular classes of activities into which traces can be classified, depending on their particular parameters (including actors and webpages). The type of activities to consider highly depends on the systems considered and to a certain extent on the user. The idea here is that specific versions of the ontology will be built that fit the specific needs of particular systems. We will then extract the generic and globally reusable part of these ontologies to provide a base for an overarching activity ontology. Ultimately, the idea in UCIAD is that individual users will be able to manipulate this ontology to include their specific view on their own activities.
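To make the division of labour between these ontologies more concrete, here is a minimal sketch of how one line of an HTTP server log could be split into an actor setting, an action and a resource. It is only an illustration and not the UCIAD implementation: the class and field names (Setting, Trace, parse_trace) are assumptions made for this example, and the pattern assumes the common Apache "combined" log format.

```python
import re
from dataclasses import dataclass

# Assumes the Apache/NCSA "combined" log format; other layouts need another pattern.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


@dataclass(frozen=True)
class Setting:
    """Actor side (hypothetical names): the computer and user agent behind a request."""
    computer: str
    user_agent: str


@dataclass(frozen=True)
class Trace:
    """Trace side (hypothetical names): one HTTP request seen as a trace of activity."""
    setting: Setting
    action: str          # the HTTP method, treated as the action
    page: str            # the webpage (resource) the action was realised on
    response_code: int
    time: str            # kept as the raw log timestamp in this sketch


def parse_trace(line: str, website: str) -> Trace:
    """Turn one log line from the given website into a Trace."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("line is not in the assumed combined log format")
    return Trace(
        setting=Setting(match["ip"], match["agent"]),
        action=match["method"],
        page=website + match["path"],
        response_code=int(match["status"]),
        time=match["time"],
    )


example_line = (
    '192.0.2.1 - - [05/Aug/2011:10:00:00 +0100] '
    '"GET /wiki/Main HTTP/1.1" 200 512 "-" "Mozilla/5.0"'
)
example_trace = parse_trace(example_line, "http://wiki.example.org")
```

In this sketch the setting stands in for the actor: deciding whether a setting belongs to a human user or a robot, and which activity a trace corresponds to, is left to the ontologies.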

Reusing existing ontologies

When dealing with data and ontologies, reuse is generally seen as good practice. Apart from saving the time it would take to remodel things that have already been described elsewhere, it also helps anticipate future needs for interoperability, by choosing well-established ontologies that are likely to have been employed elsewhere. We therefore investigated existing ontologies that could help us define the notions mentioned above. Here are the ontologies we reused:

  • The FOAF ontology is commonly used to describe people, their connections with other people, but also their connections with documents. We use FOAF in the Actor Ontology for human users, and in the Sitemap Ontology for Webpages (as Documents).
  • The Time Ontology is a common ontology for representing time and temporal intervals. We use it in the Trace Ontology.
  • The Action ontology defines different types of actions in a broad sense, and can be used as a basis for representing elements of the requests in the Trace Ontology, but also as a base typology for the Activity ontology. It itself imports a number of other ontologies, including its own notion of actors.
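As a rough illustration of what this reuse looks like at the data level, the sketch below describes one request as a list of subject/predicate/object triples. The FOAF and Time namespaces and the terms foaf:Person, foaf:Document, time:Instant and time:inXSDDateTime come from those ontologies; everything under the example.org namespace (hasActor, onPage, atTime and the identifier scheme) is a placeholder invented for this example, not the actual UCIAD vocabulary.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
TIME = "http://www.w3.org/2006/time#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/uciad/"  # hypothetical namespace, for illustration only


def describe_request(user_id, page_url, trace_id, timestamp):
    """Return triples describing one request, reusing FOAF and Time terms."""
    user = EX + "user/" + user_id
    trace = EX + "trace/" + trace_id
    instant = trace + "/time"
    return [
        # Reused terms: the human user is a foaf:Person, the webpage a foaf:Document.
        (user, RDF_TYPE, FOAF + "Person"),
        (page_url, RDF_TYPE, FOAF + "Document"),
        # Reused terms: the moment of the request is a time:Instant.
        (instant, RDF_TYPE, TIME + "Instant"),
        (instant, TIME + "inXSDDateTime", timestamp),
        # Placeholder properties linking the trace to its actor, page and time.
        (trace, EX + "hasActor", user),
        (trace, EX + "onPage", page_url),
        (trace, EX + "atTime", instant),
    ]


triples = describe_request(
    "u1", "http://wiki.example.org/Main", "t42", "2011-08-05T10:00:00+01:00"
)
```

Because people, documents and instants are typed with widely used terms, a tool that already understands FOAF or the Time Ontology can make some sense of this data without knowing anything about the placeholder properties.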

The dependencies between our ontologies and the ones they reuse are shown in the graph below.
UCIAD ontologies dependencies

While not currently used in our base ontologies, other ontologies can be considered at a later stage, for example to model specific types of activities. These include the Online Presence Ontology (OPO), as well as the Semantically-Interlinked Online Communities ontology (SIOC).

Next: Using, refining, customizing

Ontology modelling is a never-ending task. Elements constantly need to be corrected and added to cover more and more cases in as generic a way as possible. This is even more the case in UCIAD, as the approach is to create the ontology depending on the data we need to treat. Therefore, as we progressively add more data from different sources, including server logs from different types of websites and activity logs from systems such as VLEs or video players, the ontologies will evolve to include these cases.

Going a step further, what we want to investigate is the user-centric analysis of activity data. The ontologies will be used to provide users with views and analysis mechanisms for the data that concern their own activities. It therefore seems a natural next step to make it possible for users to extend the ontologies and customize them, thereby creating their own view on their own data.
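The idea of a user extending a base set of activity classes can be sketched as follows. This is a plain Python stand-in, not ontology-based reasoning: in the actual approach, classes of activities would be defined in the Activity Ontology and traces classified by a reasoner. The activity names, the paths and the dictionary-based trace format are all assumptions made for the example.

```python
from typing import Callable, Dict, List

TraceDict = Dict[str, str]          # a trace reduced to its method and page
Rule = Callable[[TraceDict], bool]  # a rule decides whether a trace fits a class

# Hypothetical base classes of activities shared by all users.
BASE_ACTIVITIES: Dict[str, Rule] = {
    "QueryingData": lambda t: t["page"].startswith("/sparql"),
}


def classify(trace: TraceDict, activities: Dict[str, Rule]) -> List[str]:
    """Return the names of all activity classes that the trace falls into."""
    return sorted(name for name, rule in activities.items() if rule(trace))


# A user customises the base set with a class that matters to them.
my_activities = dict(BASE_ACTIVITIES)
my_activities["EditingWiki"] = (
    lambda t: t["method"] == "POST" and t["page"].startswith("/wiki/")
)

edit_trace = {"method": "POST", "page": "/wiki/Main"}
base_view = classify(edit_trace, BASE_ACTIVITIES)
user_view = classify(edit_trace, my_activities)
```

The same trace is left unclassified under the base definitions but recognised as an activity under the user's own view, which is the kind of customisation described above.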

Hypothesis

March 14th, 2011

UCIAD is a relatively small, experimental project looking at how semantic technologies can help the user-centric integration, analysis and interpretation of activity data in a large organisation. As such, and as was suggested to all the projects in the JISC Activity Data programme, it relies on a central hypothesis that will hopefully be verified through the realisation and application of our software platform. But before we can express this hypothesis, we need to introduce a bit of background. In particular, we need to get back to what we mean by “user-centric”.

To put it simply, a user-centric approach is considered here in opposition to an organisation-centric approach. The most common way of considering activity data in large organisations at the moment is through consolidating visits to websites in analytics, giving statistics about the number of visits on a given website or webpage, and where these visits were coming from. We qualify this as an organisation-centric view, as the central point of focus is the website managed by the organisation. By taking such a restricted perspective on the interpretation of activity data, a number of potentially interesting questions, which take the users concerned with the activity data as the focus point, cannot be answered. The analysis of the activity data can also only benefit the organisation, and not the user, as each user becomes aggregated in website-related statistics. We therefore express our main hypothesis as

Hypothesis 1: Taking a user-centric point of view can enable different types of analysis of activity data, which are valuable to the organisation and the user.

In order to test this hypothesis, one actually needs to achieve such user-centric analysis of activity data. This implies a number of technical and technological challenges, namely, the need to integrate activity data across a variety of websites managed by an organisation, to consolidate this data beyond the “number of visits”, and to interpret them in terms of user activities.
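The difference between the two perspectives can be illustrated on a handful of made-up request records. This is only a sketch under simplifying assumptions: the records, user names and website names are invented, and each request is assumed to be already attributed to a user, which is precisely one of the challenges mentioned above.

```python
from collections import Counter, defaultdict

# Invented records: (user, website, page), already attributed to users.
requests = [
    ("alice", "wiki.example.org", "/Main"),
    ("alice", "data.example.org", "/sparql"),
    ("bob", "wiki.example.org", "/Main"),
]


def visits_per_page(records):
    """Organisation-centric view: count visits per page, users are aggregated away."""
    return Counter((site, page) for _, site, page in records)


def activity_per_user(records):
    """User-centric view: each user's requests, in order, across all websites."""
    by_user = defaultdict(list)
    for user, site, page in records:
        by_user[user].append((site, page))
    return dict(by_user)


org_view = visits_per_page(requests)
user_view = activity_per_user(requests)
```

The first view can say that a wiki page was visited twice, while only the second can say that the same user moved from the wiki to the data platform.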

Ontologies are formal, machine-processable conceptual models of a domain. Ontology technologies, especially when associated with technologies from the semantic web, have proven useful in situations where a meaningful integration of large amounts of heterogeneous data needs to be achieved and, to a certain extent, reasoned upon in a qualitative way, for interpretation and analysis. Our goal here is to investigate how ontologies and semantic technologies can support the user-centric analysis of activity data. In other words, our second hypothesis is

Hypothesis 2: Ontologies and ontology-based reasoning can support the integration, consolidation and interpretation of activity data from multiple sources.

As described in our work plan (see previous blog post), our first task is therefore to build an ontology able to flexibly describe the traces of activities across multiple websites, the users of these websites and the connections between them. The idea is to use this ontology (or rather, this set of ontologies) as a basis for a pluggable software framework, capable of integrating data from heterogeneous logs, and to interpret such data as traces of high-level activities.

The ongoing definition of these ontologies can be followed on our code repository, and a presentation of UCIAD’s basic hypothesis at the JISC Activity Data Programme event is available on slideshare.