Final post – Putting things together (with a demo)

August 5th, 2011

Over the last 6 months we have been working on building the UCIAD platform, experimenting with large-scale activity data, reflecting on user-centric data and blogging our thoughts. While, as can be seen from the last few posts on this blog, there is quite some work we think should follow from this, it is nice to see things finally coming together, and to be able to show a bit of the UCIAD platform we have been talking about for some time. What better way to do this than with a video of the running platform, showing the different components in action? (Note: it is better to watch it in 720p – HD).

This video shows a user (me) registering to the UCIAD platform with some setting details and browsing his activity data as they appear on several Open University websites (mostly, an internal wiki system and the Open University’s linked data platform – data.open.ac.uk). This video therefore integrates in a working demo the different components we have been talking about here:

  • User management: As we can see here, as the user registers into the UCIAD platform, his current setting is automatically detected, and other settings (other browsers) that are likely to be his are also included. As the user registers, the settings are associated with his account and the activity data realised through these settings are extracted.
  • Extracting user-centric activity data: As described in the first part of the blog post on reasoning (previous link), the settings associated with the user are used to extract the activity data around this particular user, creating a sub-graph corresponding to his activity.
  • Ontologies to make sense of activity data: The ontologies are used in structuring the data according to a common schema and to provide a base to homogeneously query data coming from different systems. As discussed below, they can also be extended (specified) so that different categories of activities and resources can be represented, and reasoned upon.
  • Ontological reasoning for analysis: What the demo video shows clearly is how the activity data is organised according to different categories (traces, webpages, websites, settings, etc.) coming from the base ontologies, but also according to classes of activities, resources, etc. that have been specially added to cover the websites and the particular user in this case. Here, we extended the ontology in order to include definitions of activities relevant to the use of a wiki and a data platform. The powerful aspect of using ontologies here is that such classes can be added to the ontology for the system to automatically process them and organise the data according to them. Here, for example, we define “Executing a SPARQL Query” as an activity that takes place on a SPARQL endpoint with a “query” parameter, or “Checking Wiki Updates” as an activity on a Wiki page that is realised through an RSS client.
  • Browsing data according to ontologies: We haven’t described this component yet, but we rely on a homemade “browser” that we use in a number of projects and that can inspect ontology classes and members of these classes, generating graphs and simple stats.
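
To make the two activity definitions quoted above more concrete, here is a minimal Python sketch that writes them as plain rules. In the platform these classes are ontology definitions handled by the reasoner, not hand-written code; the dictionary layout, the type labels (`SPARQLEndpoint`, `WikiPage`, `RSSClient`) and the example URLs are assumptions made purely for illustration.

```python
from urllib.parse import urlparse, parse_qs


def classify_trace(trace):
    """Return the activity labels that apply to one trace.

    `trace` is a hypothetical dict with the requested URL, the types of the
    page involved and the types of the agent that realised the request.
    """
    labels = set()
    params = parse_qs(urlparse(trace["url"]).query)

    # "Executing a SPARQL Query": activity on a SPARQL endpoint
    # with a "query" parameter.
    if "SPARQLEndpoint" in trace["page_types"] and "query" in params:
        labels.add("ExecutingSPARQLQuery")

    # "Checking Wiki Updates": activity on a wiki page realised
    # through an RSS client.
    if "WikiPage" in trace["page_types"] and "RSSClient" in trace["agent_types"]:
        labels.add("CheckingWikiUpdates")

    return labels
```

The point of the ontology-based approach is precisely that such rules do not need to be coded: adding a class definition is enough for the reasoner to organise the data accordingly.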

Next steps

There are a lot of things to mention here, some of which we have already mentioned several times. An obvious one is the finalisation, distribution and deployment of the UCIAD platform. A particular element we want to get done in the short term is to investigate the use of the UCIAD platform with various users, to see what kind of extensions of the ontologies would be commonly useful, and generally to get some insight into the reaction of users when being exposed to their own data.

More generally, we think that there is a lot more work to do both on user-centric activity data and on the use of ontologies for the analysis of such data, as described in particular in our Wins and Fails post. This includes aspects around the licensing, distribution and general management of user-centric data (as mentioned in our post on licensing). Indeed, while “giving back data to the users” is already technically difficult, there is a lot of fuzziness currently around the issues of ownership of activity data. This also forces us to look a lot more carefully at the privacy challenges that such data can generate, challenges that didn’t exist while these data stayed in server logs.

Beyond UCIAD and the Open University

As discussed in our post on the benefits of UCIAD, the issues considered go largely beyond the Open University and even activity data. The issues around licensing in particular are to be considered more broadly, in the same way as the challenges around communicating about user-centric data.

We have been focusing mostly on the technical issues in UCIAD, providing in this way a base framework to start investigating these broader and more complex challenges.

Most significant lessons

To put it simply, the most significant lessons we learnt (as mentioned in the wins and fails post) are:

  • Both user-centric data and ontologies are complex notions, so don’t assume they are understood.
  • Activity data are massive and complex, beyond what can be handled by current semantic data infrastructures, without a bit of clever management.
  • There is a lot of potential in using ontologies and ontological engineering for the analysis and interpretation of raw data.

User (Management)

July 27th, 2011

In the previous post, we explained to a certain extent what our motivations are for looking at a user-centric approach to activity data, and especially what we expect to be the benefits to the users. We also quickly sketched some specific aspects of identifying and processing user-specific information in our post regarding the reasoning processes employed in UCIAD. Here, we come back more generally to the aspects related to users and user management in the UCIAD platform, including the way to recognise a user, treat registrations and login, manage and present the information about the user activity and handle access rights over semantic data. The actual prototype of the UCIAD platform implementing all these elements is currently being finalised, and will be described more completely in our final post.

Identifying and managing users of UCIAD

The information the UCIAD platform has regarding users can be seen as similar to what basic analytics systems have. The user is rarely seen directly, as the interaction is mediated through a “user agent”: a software programme running on a particular computer. Each HTTP request is associated with the ID of the user agent realising it, and the IP address of the corresponding computer. Analytics systems have long realised that the combination of these two parameters is sufficient to recognise a user with a reasonable level of accuracy. The disadvantage however is that the same user can be using different agents (e.g., different browsers) and different computers (or even mobile phones) to access the Web.

In UCIAD, we have the advantage that it is very likely that the user will connect to the UCIAD platform using the same agents and computers they usually use to access the Web, and especially the considered websites. As shown in the mock-up screenshot above, the “settings” the user is using can be detected at the time of logging in, and be attached to the user account. These settings will then be used to aggregate all the activity data that have been realised using the same computer and user-agent, and be added to the set of activity data for the particular user.

In addition, this provides a convenient mechanism to aggregate information realised on different computers and different settings. The user can log in again to the UCIAD platform with a different browser, or a different device. When that happens, as described in the figure below, the current setting will simply be added to the list of known settings for this user, and contribute another set of activity data around this particular user.
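
As a rough illustration of this mechanism, the Python sketch below derives a key for a setting from the IP address and the user-agent string, attaches it to an account at login, and filters traces on the known settings. The hashing scheme, the `Account` class and the shape of the trace records are assumptions made for the example, not a description of how the platform actually stores settings.

```python
import hashlib
from dataclasses import dataclass, field


def setting_id(ip_address: str, user_agent: str) -> str:
    """Derive a stable key for a (computer, agent) pair.

    Assumption: hashing "ip|agent" with SHA-256 is only an illustrative choice.
    """
    raw = f"{ip_address}|{user_agent}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


@dataclass
class Account:
    name: str
    known_settings: set = field(default_factory=set)

    def login(self, ip_address: str, user_agent: str) -> str:
        """Record the setting seen at login; a repeated login adds nothing new."""
        sid = setting_id(ip_address, user_agent)
        self.known_settings.add(sid)
        return sid


def traces_for(account, traces):
    """Keep only the traces realised through one of the account's settings."""
    return [t for t in traces if t["setting"] in account.known_settings]
```

Logging in twice with the same browser on the same machine leaves the set of known settings unchanged, while logging in from a second browser adds one entry and brings in the corresponding traces.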

As explained in the post about reasoning on user-centric activity data, managing the activity data regarding a particular user corresponds to creating a sub-graph of the complete graph of raw activity data we collect from logs, based on the information about the known settings of the user. This graph is then registered in our repository, and the next step is to ensure that the information being provided is restricted to the graph of the logged-in user.

Managing access rights over semantic data

We store, manipulate and reason over activity data using Semantic Web technologies, namely RDF, a triple store with inference capabilities and SPARQL for querying. As part of the UCIAD platform, we needed a mechanism to restrict the queries being sent to only the part of the data that the current user has access to: his/her own subgraph of activity data.

Unfortunately, most current triple stores, and especially the one we are employing, do not provide sufficiently fine-grained access control mechanisms that would allow sub-graphs to be associated with particular users. We therefore implemented our own mechanism, which can be seen as a generic recipe for access control over activity data.

The whole idea is actually quite simple (as depicted on the diagram above): the actual SPARQL endpoint collecting all the data for all the users is hidden using standard security measures so that it can only be accessed by our own system. We then implement a “proxy SPARQL endpoint” that can handle basic HTTP authentication. When receiving a query, this proxy endpoint checks the credentials of the user and determines what sub-graphs the user has access to, so that it can modify the query to restrict it to these sub-graphs only (using the FROM clause in SPARQL). It can then send the query to the real, hidden SPARQL endpoint and forward the results back to the user.

While this mechanism is relatively simple, it offers an appropriate level of flexibility, allowing arbitrary subgraphs and user definitions to be used as a model for access control. It is actually nice to see how, based on basic authentication mechanisms, the same queries asking for activity data will return different results, depending on the user who is connected.
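
The query rewriting step of such a proxy can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code of the UCIAD proxy: the `USER_GRAPHS` table and the graph URI are hypothetical, authentication is assumed to have already happened, and the regular expressions only cope with simple queries that contain an explicit WHERE keyword.

```python
import re

# Hypothetical access-control table: user name -> graphs the user may read.
USER_GRAPHS = {
    "mathieu": ["http://uciad.info/graph/mathieu"],
}

# Dataset clauses supplied by the client (FROM <...> or FROM NAMED <...>).
CLIENT_FROM = re.compile(r"(?<![?$\w])FROM\s+(NAMED\s+)?<[^>]*>\s*", re.IGNORECASE)
WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def restrict_query(query: str, user: str) -> str:
    """Rewrite a query so that it only sees the sub-graphs of `user`."""
    graphs = USER_GRAPHS.get(user)
    if not graphs:
        raise PermissionError(f"no graph is accessible to {user!r}")

    # Drop any dataset clause sent by the client, so that it cannot
    # point the query at someone else's graph.
    query = CLIENT_FROM.sub("", query)

    match = WHERE.search(query)
    if match is None:
        raise ValueError("this sketch needs an explicit WHERE keyword")

    # Dataset clauses go between the query form and the WHERE clause.
    from_clauses = "".join(f"FROM <{g}>\n" for g in graphs)
    return query[:match.start()] + from_clauses + query[match.start():]
```

The rewritten query would then be forwarded to the hidden endpoint. A real proxy would parse the query properly instead of relying on regular expressions.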

What users anyway?

Of course, the mechanisms and techniques to manage, identify and process information about users do not answer the question of who they are and what benefits they can get from the system. Actually, as argued before, it is pretty hard to predict in advance what use will be made of giving users back their own activity data. General arguments can be given on the advantages of self-tracking, but in reality, the really important thing is that what is provided by the system has to stay open for any use. Working with the development version of the UCIAD platform, we find it quite fascinating that we, as individual users, can trace back our activities, drill down into specific categories (e.g., search, commenting on blogs, checking the price of a course), send queries which might only be relevant to us (e.g., “how much did I use data.open.ac.uk on Sundays?”), etc. It helps us understand our own use of the resources provided by the University, and so to become more efficient with them.

Reasoning over user-centric activity data

June 16th, 2011

There are two reasons why we believe ontology technologies will benefit the analysis of activity data in general, and from a user-centric perspective in particular. First, ontology-related technologies (including OWL, RDF and SPARQL) provide the necessary flexibility to enable the “lightweight” integration of data from different systems. Not only can we use our ontologies as a “pivot” model for data coming from different systems, but this model is also easily extensible, both to take account of the particularities of the different systems around and to allow for custom extensions for particular users, making personalised analysis of personal data feasible.

The second advantage of ontologies is that they allow for some form of reasoning that makes it easier for us to just throw data into them and obtain meaningful results. I use reasoning in a broad sense here to show how, based on raw data extracted from the logs of Web servers, we can obtain a meaningful, integrated view of the activity of a user of the corresponding websites. This is based on current experiments realised with 2 servers hosting various websites, including blogs such as uciad.info, as well as the linked data platform of the Open University, data.open.ac.uk.

Traces of activities around a user

The first piece of inference that we need to realise is to be able to identify and extract, within our data, information related to the particular traces of activities realised by a user. To identify a user, we rely here on the settings used to realise the activity. A setting, in our ontology, corresponds to a computer (generally identified by its IP address) and an agent (generally a browser, identified by a generally complex string such as Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24). The first step is therefore to associate a user with the settings he/she usually uses. We are currently developing tools so that a user can register to the UCIAD platform and have his/her settings automatically detected. Here, I manually declared the settings I’m using by providing the triple store with the following piece of RDF:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:actor="http://uciad.info/ontology/actor/">
<rdf:Description rdf:about="http://uciad.info/actor/mathieu">
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/4eafb6e074f46857b1c0b4b2ad0aa8e4"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/c97fc7faeadaf5cac0a28e86f4d723c9"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/eec3eed71319f9d0480ff065334a5f3a"/>
</rdf:Description>
</rdf:RDF>

This indicates that the user http://uciad.info/actor/mathieu has three settings. These settings are all on the same computer and correspond to the Safari and Chrome browsers, as well as the Apple PubSub agent (used in retrieving RSS feeds amongst other things).

Each trace of activity is realised through a setting (linked to the trace by the hasSetting ontology property). Knowing the settings of a user therefore allows us to list the traces that correspond to this particular user through a simple query. Even better, we can create a model, i.e. an RDF graph, that contains all the information related to the user’s activity on the considered websites, using a SPARQL construct query:

PREFIX tr:<http://uciad.info/ontology/trace/>
PREFIX actor:<http://uciad.info/ontology/actor/>
construct {
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
} where{
  <http://uciad.info/actor/mathieu> actor:knownSetting ?set.
  ?trace tr:hasSetting ?set.
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
}

The results of this query correspond to all the traces of activities in our data that have been realised through known settings of the user http://uciad.info/actor/mathieu, as well as the surrounding information. Although this query is a bit rough at the moment (it might include irrelevant information, or miss relevant data that are connected to the traces through too many steps), what is really interesting here is that it provides a very simple and elegant mechanism to extract, from large amounts of raw log data, a subgraph that completely characterises the activities of one user on the considered websites. This data can therefore be considered on its own, as a user-centric view on activity data, rather than a server-centric or organisation-centric view. It can also be provided back to the user, exported in a machine-readable way, so that he/she can possibly make use of it in other systems and for other purposes.
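
For readers less familiar with SPARQL, the same idea can be sketched in Python over an in-memory list of triples. This is an approximation and not a translation of the query: it follows links for up to `max_depth` hops from each trace instead of requiring a complete four-step chain. Apart from `knownSetting` and `hasSetting`, the prefixed names and identifiers used with it are hypothetical.

```python
from collections import defaultdict

KNOWN_SETTING = "actor:knownSetting"
HAS_SETTING = "tr:hasSetting"


def user_subgraph(triples, user, max_depth=4):
    """Collect the triples reachable from the user's traces within max_depth hops."""
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))

    # Settings declared for the user, then traces realised through them.
    settings = {o for p, o in outgoing[user] if p == KNOWN_SETTING}
    frontier = {s for s, p, o in triples if p == HAS_SETTING and o in settings}

    selected, visited = set(), set()
    for _ in range(max_depth):
        next_frontier = set()
        for node in frontier - visited:
            visited.add(node)
            for p, o in outgoing.get(node, []):
                selected.add((node, p, o))
                next_frontier.add(o)
        frontier = next_frontier
    return selected
```

Traces realised through settings that are not attached to the user never enter the frontier, so nothing about them ends up in the extracted subgraph.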

We are currently working on the mechanisms allowing users to register/login to the UCIAD platform, to identify their settings and to obtain their own “activity data repository”.

Reasoning about websites and activities

The second aspect of reasoning with user-centric activity data relates to inferring information from the data itself, to support its interpretation and analysis. What we want to achieve here is, through providing ontological definitions of different types of activities, to be able to characterise different types of traces and classify them as evidence of particular activities happening.

The first step in realising such inferences is to characterise the resources over which activities are realised: in our case, websites and webpages. Our ontologies define a webpage as a document that can be part of a webpage collection, and a website as a particular type of webpage collection. As part of setting up the UCIAD platform, we declare in the RDF model the different collections and websites that are present on the considered server, as well as the URL patterns that make it possible to recognise webpages as parts of these websites and collections. These URL patterns are expressed as regular expressions, and an automatic process is applied to declare triples of the form page1 isPartOf website1 or page2 isPartOf collection1 when the URLs of page1 and page2 match the patterns of website1 and collection1 respectively.
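
A minimal Python sketch of this automatic process could look as follows. The pattern table is hypothetical: the blog pattern is derived from the http://uciad.info/ub/ address, while the feed pattern and the `RSSFeedCollection` identifier are assumptions made for the example.

```python
import re

# Hypothetical declarations: collection or website identifier -> URL pattern.
COLLECTION_PATTERNS = {
    "http://uciad.info/ub/": re.compile(r"^http://uciad\.info/ub/"),
    "RSSFeedCollection": re.compile(r"/feed/?$"),
}


def is_part_of_triples(page_urls):
    """Declare (page, isPartOf, collection) for every pattern a URL matches."""
    triples = []
    for url in page_urls:
        for collection, pattern in COLLECTION_PATTERNS.items():
            if pattern.search(url):
                triples.append((url, "isPartOf", collection))
    return triples
```

A page can match several patterns at once, which is what later makes it possible to define classes as intersections of such memberships.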

Now, the interesting thing is that these websites, collections and webpages can be further specified into particular types and as having particular properties. We for example declare that http://uciad.info/ub/ is a Blog, which is a particular type of website. We can also declare a webpage collection that corresponds to RSS feeds, using a particular URL pattern, and use an ontology expression to declare the class of BlogFeed as the set of webpages which are both part of a Blog and part of the RSSFeed collection, i.e., in the OWL abstract syntax:

Class(BlogFeed complete
    intersectionOf(Webpage
      restriction(isPartOf someValuesFrom(RSSFeed))
      restriction(isPartOf someValuesFrom(Blog))
    )
)

What is interesting here is that such a definition can be added to the repository, which, using its inference capability, will derive that certain pages are BlogFeed, without this information being directly provided in the data, or the rule to derive it being hard-coded in the system. We can therefore engage in an incremental construction of an ontology characterising websites and activities generally, in the context of a particular system, or in the context of a particular user. Our user http://uciad.info/actor/mathieu might for example decide to add to his data repository the ontological definition allowing him to recognise traces over BlogFeed realised with the Apple PubSub agent as a particular category of activities (e.g., FeedSyndication), alongside others that characterise other kinds of activities: recovering data, reading news, commenting, editing, searching, …
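
To show what the reasoner derives from the BlogFeed definition, here is a small Python sketch that computes the same intersection by hand. It is only meant to illustrate the semantics of the class expression; in the platform this is left to the inference capability of the triple store. The dictionary-based data layout and the identifiers in the usage note are assumptions.

```python
def blog_feeds(part_of, types):
    """Return the webpages that are part of some Blog and of some RSSFeed.

    part_of: dict mapping a page to the set of collections it is part of.
    types:   dict mapping a resource to the set of classes it belongs to.
    """
    def part_of_some(page, cls):
        # Mirrors restriction(isPartOf someValuesFrom(cls)).
        return any(cls in types.get(c, set()) for c in part_of.get(page, set()))

    return {
        page
        for page in part_of
        if "Webpage" in types.get(page, set())
        and part_of_some(page, "RSSFeed")
        and part_of_some(page, "Blog")
    }
```

For instance, a page that is part of both a resource typed as Blog and a resource typed as RSSFeed is returned, whereas a page that belongs to only one of them is not.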