Search:

Reasoning over user-centric activity data

June 16th, 2011

There are two reasons why we believe ontology technologies will benefit the analysis of activity data in general, and from a user centric perspective in particular. First, ontology related technologies (including OWL, RDF and SPARQL) provide the necessary flexibility to enable the “lightweight” integration of data from different systems. Not only we can use our ontologies as a “pivot” model for data coming from different systems, but this model is also easily extensible to take account of the particularities of the different systems around, but also to allow for custom extension fo particular users, making personalised analysis of personal data feasible.

The second advantage of ontologies is that they allow for some form of reasoning that make it easier for us to just through data into them and obtain meaningful results. I use reasoning in a broad sense here to show how, based on raw data extracted in the logs of Web servers, we can obtain a meaningful, integrated view of the activity of a user of the corresponding websites. This is based on a current experiments realised with 2 servers hosting various websites, including blogs such as uciad.info, as well as the linked data platform of the Open University — data.open.ac.uk.

Traces of activities around a user

The first piece of inference that we need to realise is to be able to identify and extract, within our data, information related to the particular traces of activities realised by a user. To identify a user, we rely here on the settings used to realise the activity. A setting, in our ontology, correspond to a computer (generally identified by its IP address) and an agent (generally a browser, identify by a generally complex string such as Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24). The first step is therefore to associated a user to the settings he/she usually uses. We are currently developing tools so that a user can register to the UCIAD platform and his/her setting be automatically detected. Here, I manually declared the settings I’m using by providing the triple store with the following piece of RDF:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:actor="http://uciad.info/ontology/actor/">
<rdf:Description rdf:about="http://uciad.info/actor/mathieu">
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/4eafb6e074f46857b1c0b4b2ad0aa8e4"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/c97fc7faeadaf5cac0a28e86f4d723c9"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/eec3eed71319f9d0480ff065334a5f3a"/>
</rdf:Description>
</rdf:RDF>

This indicates that the user http://uciad.info/actor/mathieu has three settings. This settings are all on the same computer and correspond to the Safari and Chrome browsers, as well as the Apple PubSub agent (used in retrieving RSS feeds amongst other things).

Each trace of activity is realised through a setting (linked to the trace by the hasSetting ontology property). Knowing the settings of a user therefore allows us to list the traces that correspond to this particular user through a simple query. Even better, we can create a model, i.e. an RDF graph, that contains all the information related to the user’s activity on the considered websites, using a SPARQL construct query:

PREFIX tr:<http://uciad.info/ontology/trace/>
PREFIX actor:<http://uciad.info/ontology/actor/>
construct {
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
} where{
  <http://uciad.info/actor/mathieu> actor:knownSetting ?set.
  ?trace tr:hasSetting ?set.
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
}

The results of this query correspond to all the traces of activities in our data that have been realised through known setting of the user http://uciad.info/actor/mathieu, as well as the surrounding information. Although this query is a bit rough at the moment (it might include irrelevant information, or miss relevant data that are connected to the traces through too many steps), what is really interesting here is that it provides a very simple and elegant mechanism to, from large amount of raw log data, extract a subgraph that characterise completely the activities of one user on the considered websites. This data can therefore be considered on its own, as a user-centric view on activity data, rather than a server-centric or organisation-centric view. It can as well be provided back to the user, exported in a machine readable way, so that he/she becomes can possibly make use of it in other systems and for other purposes.

We are currently working on the mechanisms allowing users to register/login to the UCIAD platform, to identify their settings and to obtain their own “activity data repository”.

Reasoning about websites and activities

The second aspect of reasoning with user-centric activity data relates to inferring information from the data itself, to support its interpretation and analysis. What we want to achieve here is, through providing ontological definitions of different types of activities, to be able to characterise different type of traces and classify them as evidence of particular activities happening.

The first step in realising such inferences is to characterise the resources over which activities are realised — in our case, websites and webpages. Our ontologies define a webpage as a document that can be part of a webpage collection, and a website as a particular type of webpage collection. As part of setting up the UCIAD platform, we declare in the RDF model the different collections and website that are present on the considered server, as well as the url patterns that makes it possible to recognise webpages as parts of these websites and collections. These URL patterns are expressed as regular expression and an automatic process is applied to declare triples of the form page1 isPartOf website1 or page2 isPartOf collection1 when the URLs of page1 and page2 match the patterns of website1 and collection1 respectively.

Now, the interesting thing is that these websites, collections and webpages can be further specified into particular types and as having particular properties. We for example declare that http://uciad.info/ub/ is a Blog, which is a particular type of website. We can all declare a webpage collection that corresponds to RSS feeds, using a particular URL pattern, and use an ontology expression to declare the class of BlogFeed as the set of webpages which are both part a Blog and part of the RSSFeed collection, i.e., in the OWL abstract syntax

Class(BlogFeed complete
    intersectionOf(Webpage
      restriction(isPartOf someValuesFrom(RSSFeed))
      restriction(isPartOf someValuesFrom(Blog))
    )
)

What is interesting here is that such a definition can be added to the repository, which, using its inference capability, will derive that certain pages are BlogFeed, without this information being directly provided in the data, or the rule to derive it being hard-coded in the system. We can therefore engage in an incremental construction of an ontology characterising websites and activities generally, in the context of a particular system, or in the context of a particular user. Our user http://uciad.info/user/mathieu might for example decide to add to his data repository the ontological definition allowing him to recognise traces over BlogFeed realised with the Apple PubSub agent as a particular category of activities (e.g., FeedSyndication), alongside others that characterise other kind of activities: recovering data, reading news, commenting, editing, searching, …

On Self-Tracking

May 18th, 2011

I have said it and repeated it numerous times, UCIAD is profoundly different from all the other JISC Activity data projects at many different levels. One of them, at the basis of our main hypothesis is that we consider activity data for the user’s own consumption, and to his/her own benefit. The team working in UCIAD has made this notion of user-centric personal information a guiding principle for research. With my colleague Matthew Rowe we recently described a major aspect of this research in a position paper for the W3C Workshop on Web Tracking and User Privacy: Self-tracking on the Web.

As described in the paper, entitled “Self-Tracking on the Web: Why and How“, self-tracking is “the activity of monitoring and analysing one’s own behaviour regarding personal information exchange and the consequences of such behaviour on their exposure, privacy and reputation“. We emphasize in this paper how existing tools and technologies to realise self-tracking on the Web are limited, especially in comparison with the tools and technologies used to track user activities and data to the benefit of organisations. The paper concluded that “achieving such a process of self-tracking can be very revealing to Web users, helping them reaching a better awareness of their own online behaviour, and a better understanding of the possible consequences of such behaviour on the exposure of their personal information. Such an approach appears to be crucially needed as the Web evolves to both a global information marketplace, and a major medium for all sorts of social interactions online. [...] We therefore argue that a more principled and comprehensive study of the activity of self-tracking on the Web and of the technological requirements for such an activity to take place should be conducted. This requires for both the social and conceptual models of the way personal information is exchanged on the Web to be related to the technological protocols that are used as mediums for instantiating these models. From a more concrete point of view, we believe that a new set of tools are to be created that will support users in monitoring their own activity on the Web

UCIAD can be seen as an experiment in this direction. Focusing on Web data related to the interaction between an user and an organisation, it is looking at the techniques, the models and the tools that are necessary to enable users to have a personalised view on their own data, i.e., the data generated by their own activity. More generally, it is also setting up generic models of activity online i.e., the ontologies and the associated technological components, that can be reused in broader environments.