Reasoning over user-centric activity data

June 16th, 2011

There are two reasons why we believe ontology technologies will benefit the analysis of activity data in general, and from a user-centric perspective in particular. First, ontology-related technologies (including OWL, RDF and SPARQL) provide the necessary flexibility to enable the “lightweight” integration of data from different systems. Not only can we use our ontologies as a “pivot” model for data coming from different systems, but this model is also easily extensible, both to take account of the particularities of the different systems around and to allow for custom extensions for particular users, making personalised analysis of personal data feasible.

The second advantage of ontologies is that they allow for some form of reasoning that makes it easier for us to just throw data into them and obtain meaningful results. I use reasoning in a broad sense here to show how, based on raw data extracted from the logs of Web servers, we can obtain a meaningful, integrated view of the activity of a user of the corresponding websites. This is based on current experiments realised with 2 servers hosting various websites, including blogs such as uciad.info, as well as the linked data platform of the Open University, data.open.ac.uk.

Traces of activities around a user

The first piece of inference that we need to realise is to identify and extract, within our data, information related to the particular traces of activities realised by a user. To identify a user, we rely here on the settings used to realise the activity. A setting, in our ontology, corresponds to a computer (generally identified by its IP address) and an agent (generally a browser, identified by a generally complex string such as Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24). The first step is therefore to associate a user with the settings he/she usually uses. We are currently developing tools so that a user can register with the UCIAD platform and have his/her settings automatically detected. Here, I manually declared the settings I’m using by providing the triple store with the following piece of RDF:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:actor="http://uciad.info/ontology/actor/">
<rdf:Description rdf:about="http://uciad.info/actor/mathieu">
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/4eafb6e074f46857b1c0b4b2ad0aa8e4"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/c97fc7faeadaf5cac0a28e86f4d723c9"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/eec3eed71319f9d0480ff065334a5f3a"/>
</rdf:Description>
</rdf:RDF>

This indicates that the user http://uciad.info/actor/mathieu has three settings. These settings are all on the same computer and correspond to the Safari and Chrome browsers, as well as the Apple PubSub agent (used in retrieving RSS feeds amongst other things).
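As a rough illustration of what a setting is, here is a minimal Python sketch that builds knownSetting triples from (IP address, user agent) pairs. The post does not say how the setting identifiers are actually generated, so hashing the pair with MD5 is purely an assumption made for this example, as are the function names.

```python
import hashlib

SETTING_BASE = "http://uciad.info/actorsetting/"
KNOWN_SETTING = "http://uciad.info/ontology/actor/knownSetting"


def setting_uri(ip: str, user_agent: str) -> str:
    """Build a stable URI for a (computer, agent) pair.

    Assumption: the identifier is an MD5 hash of the pair. This is a
    hypothetical scheme, not necessarily the one used by UCIAD.
    """
    digest = hashlib.md5(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()
    return SETTING_BASE + digest


def known_setting_triples(user_uri, settings):
    """Return one (user, knownSetting, setting) triple per (ip, agent) pair."""
    return [(user_uri, KNOWN_SETTING, setting_uri(ip, agent))
            for ip, agent in settings]
```

The same (IP address, user agent) pair always yields the same URI, so settings detected at different times can be matched against those already declared for a user.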

Each trace of activity is realised through a setting (linked to the trace by the hasSetting ontology property). Knowing the settings of a user therefore allows us to list the traces that correspond to this particular user through a simple query. Even better, we can create a model, i.e. an RDF graph, that contains all the information related to the user’s activity on the considered websites, using a SPARQL construct query:

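PREFIX tr:<http://uciad.info/ontology/trace/>
PREFIX actor:<http://uciad.info/ontology/actor/>
# Build a graph containing each matching trace and its neighbourhood
construct {
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
} where{
  # settings declared for the user
  <http://uciad.info/actor/mathieu> actor:knownSetting ?set.
  # traces realised through one of these settings
  ?trace tr:hasSetting ?set.
  # chains of four outgoing properties starting from the trace
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
}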

The results of this query correspond to all the traces of activities in our data that have been realised through known settings of the user http://uciad.info/actor/mathieu, as well as the surrounding information. Although this query is a bit rough at the moment (it might include irrelevant information, or miss relevant data that are connected to the traces through too many steps), what is really interesting here is that it provides a very simple and elegant mechanism to extract, from large amounts of raw log data, a subgraph that completely characterises the activities of one user on the considered websites. This data can therefore be considered on its own, as a user-centric view on activity data, rather than a server-centric or organisation-centric view. It can also be provided back to the user, exported in a machine-readable way, so that he/she can make use of it in other systems and for other purposes.
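To make the extraction step more concrete, here is a minimal Python sketch of the same idea over an in-memory list of (subject, predicate, object) triples. It is not how the triple store evaluates the query: it is a hand-written, hop-limited traversal, and unlike the query above it also keeps paths shorter than four steps. The function name and the shortened property names are assumptions made for the example.

```python
from collections import defaultdict

KNOWN_SETTING = "actor:knownSetting"
HAS_SETTING = "tr:hasSetting"


def user_subgraph(triples, user, max_hops=4):
    """Collect the triples reachable from the user's traces in max_hops steps."""
    # Index triples by subject so that each node can be expanded quickly.
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((s, p, o))

    # Settings declared for the user, then traces realised through them.
    settings = {o for s, p, o in triples if s == user and p == KNOWN_SETTING}
    frontier = {s for s, p, o in triples if p == HAS_SETTING and o in settings}

    result, expanded = set(), set()
    for _ in range(max_hops):
        next_frontier = set()
        for node in frontier:
            if node in expanded:
                continue  # avoid expanding the same node twice
            expanded.add(node)
            for triple in by_subject.get(node, []):
                result.add(triple)
                next_frontier.add(triple[2])
        frontier = next_frontier
    return result
```

Traces realised through settings that are not declared for the user are never used as starting points, so their surrounding information only appears if it is shared with one of the user's own traces.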

We are currently working on the mechanisms allowing users to register/login to the UCIAD platform, to identify their settings and to obtain their own “activity data repository”.

Reasoning about websites and activities

The second aspect of reasoning with user-centric activity data relates to inferring information from the data itself, to support its interpretation and analysis. What we want to achieve here is, through providing ontological definitions of different types of activities, to be able to characterise different types of traces and classify them as evidence of particular activities happening.

The first step in realising such inferences is to characterise the resources over which activities are realised: in our case, websites and webpages. Our ontologies define a webpage as a document that can be part of a webpage collection, and a website as a particular type of webpage collection. As part of setting up the UCIAD platform, we declare in the RDF model the different collections and websites that are present on the considered server, as well as the URL patterns that make it possible to recognise webpages as parts of these websites and collections. These URL patterns are expressed as regular expressions, and an automatic process is applied to declare triples of the form page1 isPartOf website1 or page2 isPartOf collection1 when the URLs of page1 and page2 match the patterns of website1 and collection1 respectively.
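The pattern-matching step could look roughly like the following Python sketch. The two regular expressions are invented for illustration (the actual patterns declared in the platform are not given in this post), and the names website1 and collection1 simply follow the example above.

```python
import re

# Hypothetical URL patterns, for illustration only.
PATTERNS = {
    "website1": re.compile(r"^http://uciad\.info/ub/"),
    "collection1": re.compile(r"/feed/?$"),
}


def part_of_triples(urls, patterns):
    """Emit a (page, isPartOf, collection) triple for every pattern a URL matches."""
    return [(url, "isPartOf", name)
            for url in urls
            for name, regex in patterns.items()
            if regex.search(url)]
```

A page can match several patterns at once, which is what later allows a class to be defined as the intersection of several collections.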

Now, the interesting thing is that these websites, collections and webpages can be further specified into particular types and as having particular properties. We for example declare that http://uciad.info/ub/ is a Blog, which is a particular type of website. We can also declare a webpage collection that corresponds to RSS feeds, using a particular URL pattern, and use an ontology expression to declare the class of BlogFeed as the set of webpages which are both part of a Blog and part of the RSSFeed collection, i.e., in the OWL abstract syntax:

Class(BlogFeed complete
    intersectionOf(Webpage
      restriction(isPartOf someValuesFrom(RSSFeed))
      restriction(isPartOf someValuesFrom(Blog))
    )
)

What is interesting here is that such a definition can be added to the repository, which, using its inference capability, will derive that certain pages are BlogFeed, without this information being directly provided in the data, or the rule to derive it being hard-coded in the system. We can therefore engage in an incremental construction of an ontology characterising websites and activities generally, in the context of a particular system, or in the context of a particular user. Our user http://uciad.info/actor/mathieu might for example decide to add to his data repository the ontological definition allowing him to recognise traces over BlogFeed realised with the Apple PubSub agent as a particular category of activities (e.g., FeedSyndication), alongside others that characterise other kinds of activities: recovering data, reading news, commenting, editing, searching, …
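To show what the repository is expected to derive from the BlogFeed definition, here is a small Python sketch that applies the same intersection by hand. This is precisely the kind of hard-coded rule that the ontology-based approach avoids, and it treats RSSFeed and Blog as types attached to collections, which is a simplifying assumption; it is only meant to make the semantics of the definition concrete.

```python
def infer_blog_feeds(part_of, types):
    """Return pages that are Webpages, part of some Blog and part of some RSSFeed.

    part_of: set of (page, collection) pairs.
    types:   dict mapping a resource to the set of its class names.
    """
    feeds = set()
    for page in {p for p, _ in part_of}:
        if "Webpage" not in types.get(page, set()):
            continue
        # Collections the page belongs to.
        parents = {c for p, c in part_of if p == page}
        in_blog = any("Blog" in types.get(c, set()) for c in parents)
        in_rss = any("RSSFeed" in types.get(c, set()) for c in parents)
        if in_blog and in_rss:
            feeds.add(page)
    return feeds
```

With an OWL reasoner, the same result follows from the class definition alone, and adding a new class such as FeedSyndication only requires adding a new definition rather than new code.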

UCIAD ontologies: A starting point

March 23rd, 2011

UCIAD intends to use ontologies both as a way to achieve the integration of activity data across various, possibly heterogeneous systems, and to benefit from their inference capabilities to support the flexible, customisable and expressive analysis of such activity data. Building an ontology that could be used as a conceptual model for all sorts of activity data is quite obviously a difficult task, and the result is going to be refined and iterated over the length of the project (and hopefully beyond the end of the project).

However, compared to other domains, the advantage of user activities is that there is a lot of data to look at. This might be seen as an issue (from a technical point of view, but also because it is quite overwhelming to get so much data), but in reality, this allows us to apply a bottom-up approach to building our ontologies: modelling through characterising the data, rather than through expertise in the domain. It also gives us an insight into the scale of the tasks, and the need for adapted tools to support both the ontological definition of specific situations, and the ontology-based analysis of large amounts of traces of activity data.

Identifying concepts and their relations

The first step in building our ontology is to identify the key concepts, i.e., the key notions, that we need to tackle, bearing in mind that our ultimate goal is to understand activities. The main concepts we are considering are therefore the ones that support the concept of activity. Activities relate to users, but not only to users. We rely extensively on website logs as sources of activity data. In these cases, we can investigate requests both from human users and from robots automatically retrieving and crawling information from the websites. The server logs in question can be seen as collections of traces of activities that these users/robots are realising on websites. We therefore need to model these other aspects: traces correspond to actions that are realised by actors on particular resources. These are the three kinds of objects that, in the context of Web platforms, we want to model, so that they can be interpreted and classified in terms of activities. We therefore propose 4 ontologies to be used as the basis of the work in UCIAD:

  • The Actor Ontology is an ontology representing different types of actors (human users vs robots), as well as the technical setting through which they realise online activities (computer and user agent).
  • The Sitemap Ontology is an ontology to represent the organisation of webpages in collections and websites, and which is extensible to represent different types of webpages and websites.
  • The Trace Ontology is an ontology to represent traces of activities, realised by particular agents on particular webpages. As we currently focus on HTTP server logs, this ontology contains specific sections related to traces as HTTP requests (e.g., methods as actions and HTTP response code). It is however extensible to other types of traces, such as specific logs for VLEs or search systems.
  • The Activity Ontology is intended to define particular classes of activities into which traces can be classified, depending on their particular parameters (including actors and webpages). The type of activities to consider highly depends on the systems considered and to a certain extent on the user. The idea here is that specific versions of the ontology will be built that fit the specific needs of particular systems. We will then extract the generic and globally reusable part of these ontologies to provide a base for an overarching activity ontology. Ultimately, the idea in UCIAD is that individual users will be able to manipulate this ontology to include their specific view on their own activities.
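As an illustration of how a raw log entry maps onto these notions, here is a minimal Python sketch that turns one HTTP server log line into a trace record: the IP address and user agent describe the actor's setting, the method is the action, the path identifies the webpage, and the response code completes the trace. It assumes the common Apache/NCSA “combined” log format, which may differ from the format of the servers used in UCIAD, and the class and field names are hypothetical rather than taken from the ontologies.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: Apache/NCSA "combined" log format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


@dataclass
class Trace:
    ip: str       # computer part of the actor's setting
    agent: str    # user agent part of the actor's setting
    time: str     # timestamp, kept as the raw string from the log
    method: str   # HTTP method, i.e. the action
    path: str     # requested resource, i.e. the webpage
    status: int   # HTTP response code


def parse_trace(line: str) -> Optional[Trace]:
    """Parse one log line into a Trace, or return None if it does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    return Trace(m["ip"], m["agent"], m["time"],
                 m["method"], m["path"], int(m["status"]))
```

Each field of the resulting record would then be expressed with the corresponding ontology: the setting with the Actor Ontology, the page with the Sitemap Ontology, and the request itself with the Trace Ontology.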

Reusing existing ontologies

When dealing with data and ontologies, reuse is generally seen as a good practice. Apart from saving the time of remodelling things that have already been described elsewhere, it also helps anticipate future needs for interoperability by choosing well-established ontologies that are likely to have been employed elsewhere. We therefore investigated existing ontologies that could help us define the notions mentioned above. Here are the ontologies we reused:

  • The FOAF ontology is commonly used to describe people, their connections with other people, but also their connections with documents. We use FOAF in the Actor Ontology for human users, and in the Sitemap Ontology for Webpages (as Documents).
  • The Time Ontology is a common ontology for representing time and temporal intervals. We use it in the Trace Ontology.
  • The Action ontology defines different types of actions in a broad sense, and can be used as a basis for representing elements of the requests in the Trace Ontology, but also as a base typology for the Activity ontology. It itself imports a number of other ontologies, including its own notion of actors.

The dependencies between our ontologies and others are shown in the graph below.
[Figure: UCIAD ontologies dependencies]

While not currently used in our base ontologies, other ontologies can be considered at a later stage, for example to model specific types of activities. These include the Online Presence Ontology (OPO), as well as the Semantically-Interlinked Online Communities ontology (SIOC).

Next: Using, refining, customizing

Ontology modelling is a never-ending task. Elements constantly need to be corrected and added to cover more and more cases in a way that is as generic as possible. This is even more the case in UCIAD, as the approach is to create the ontology depending on the data we need to treat. Therefore, as we progressively add more data from different sources, including server logs from different types of websites and activity logs from systems such as VLEs or video players, the ontologies will evolve to include these cases.

Going a step further, what we want to investigate is the user-centric analysis of activity data. The ontologies will be used to provide users with views and analysis mechanisms for the data that concern their own activities. It therefore seems a natural next step to make it possible for the users to extend the ontologies, to customize them, therefore creating their own view on their own data.

Hypothesis

March 14th, 2011

UCIAD is a relatively small, experimental project looking at how semantic technologies can help the user-centric integration, analysis and interpretation of activity data in a large organisation. As such, and as was also suggested to all the other projects in the JISC Activity Data programme, it relies on a central hypothesis that will hopefully be verified through the realisation and application of our software platform. But before we can express this hypothesis, we need to introduce a bit of background. In particular, we need to get back to what we mean by “user-centric”.

To put it simply, a user-centric approach is considered here in opposition to an organisation-centric approach. The most common way of considering activity data in large organisations at the moment is through consolidating visits to websites in analytics, giving statistics about the number of visits on a given website or webpage, and where these visits were coming from. We qualify this as an organisation-centric view as the central point of focus is the website managed by the organisation. By taking such a restricted perspective on the interpretation of activity data, a number of potentially interesting questions, which take the users concerned with the activity data as the focus point, cannot be answered. The analysis of the activity data can also only benefit the organisation, and not the user, as each user becomes aggregated in website-related statistics. We therefore express our main hypothesis as:

Hypothesis 1: Taking a user-centric point of view can enable different types of analysis of activity data, which are valuable to the organisation and the user.

In order to test this hypothesis, one actually needs to achieve such user-centric analysis of activity data. This implies a number of technical and technological challenges, namely, the need to integrate activity data across a variety of websites managed by an organisation, to consolidate this data beyond the “number of visits”, and to interpret them in terms of user activities.

Ontologies are formal, machine-processable conceptual models of a domain. Ontology technologies, especially associated with technologies from the semantic web, have proven useful in situations where large amounts of heterogeneous data need to be meaningfully integrated and, to a certain extent, reasoned upon in a qualitative way for interpretation and analysis. Our goal here is to investigate how ontologies and semantic technologies can support the user-centric analysis of activity data. In other words, our second hypothesis is:

Hypothesis 2: Ontologies and ontology-based reasoning can support the integration, consolidation and interpretation of activity data from multiple sources.

As described in our work plan (see previous blog post), our first task is therefore to build an ontology able to flexibly describe the traces of activities across multiple websites, the users of these websites and the connections between them. The idea is to use this ontology (or rather, this set of ontologies) as a basis for a pluggable software framework, capable of integrating data from heterogeneous logs, and to interpret such data as traces of high-level activities.

The ongoing definition of these ontologies can be followed on our code repository, and a presentation of UCIAD’s basic hypothesis at the JISC Activity Data Programme event is available on slideshare.