Search:

Personal Activity Data: Another Project

March 21st, 2012

UCIAD is about the integration and analysis of activity data originating from the logs of different websites of an organization, using the knowledge the organization has about these websites to provide users with ways to analyze their own online interactions with the organization. In another project called DATAMI (funded by the IKS Project), we are investigating how activity data from the whole Web traffic generated by a user can be semantically analyzed to extract ‘entities’ of interest to the user, in a “personal, semantic web history dashboard”.

A first, preliminary (video) demo of this application has been released today, that show the potential of the technologies developed in this project:

This result is of course not dissimilar to the video produced at the end of phase 1 of UCIAD, and clearly, the user-study we are currently setting-up for phase 2 will provide valuable results also for the DATAMI project.

Final post – Putting things together (with a demo)

August 5th, 2011

Over the last 6 months we have been working on building the UCIAD platform, experimenting with large-scale activity data, reflecting on user-centric data and blogging our thoughts. While, as can be seen from the last few posts on this blog, there is quite some work we think should follow from this, it is nice to see things finally coming together, and to be able to show a bit of the UCIAD platform with have been talking about for some times. What better way to do this than with a video of the running platform, showing the different components in action. (Note: it is better to watch it in 720p – HD).

This video shows a user (me) registering to the UCIAD platform with some setting details and browsing his activity data as they appear on several Open University websites (mostly, an internal wiki system and the Open University’s linked data platform – data.open.ac.uk). This video therefore integrates in a working demo the different components we have been talking about here:

  • User management: As we can see here, as the user registers into the UCIAD platform, his current setting is automatically detected, and other settings (other browsers) that are likely to be his are also included. As the user registers, the settings are associated to his account and the activity data realised through these settings are extracted.
  • Extracting user-centric activity data: As described in the first part of the blog post on reasoning (previous link), the settings associated with the user are used to extract the activity data around this particular user, creating a sub-graph corresponding to his activity.
  • Ontologies to make sense of activity data: The ontologies are used in structuring the data according to a common schema and to provide a base to homogeneously query data coming from different systems. As discussed below, they can also be extended (specified) so that different categories of activities and resources can be represented, and reasoned upon.
  • Ontological reasoning for analysis: What the demo video shows clearly is how the activity data is organised according to different categories (traces, webpages, websites, settings, etc.) coming from the base ontologies, but also according to classes of activities, resources, etc. that have been specially added to cover the websites and the particular user in this case. Here, we extended the ontology in order to include definitions of activities relevant to the use of a wiki and a data platform. The powerful aspect of using ontologies here is that such classes can be added to the ontology for the system to automatically process them and organise the data according to them. Here, for example, we define “Executing a SPARQL Query” as an activity that takes place on a SPARQL endpoint with a “query” parameter, or “Checking Wiki Updates” as an activity on a Wiki page that is realised through an RSS client.
  • Browsing data according to ontologies: We haven’t described this components yet, but we rely on an homemade “browser” that we use in a number of projects and that can inspect ontology classes and members of these classes, generating graphs and simple stats.

Next steps

There are a lot of things to mention here, some of them we have already mentioned several times. An obvious one is the finalisation, distribution and deployment of the UCIAD platform. A particular element we want to get done at a short term is to investigate the use of the UCIAD platform with various users, to see what kind of extensions of the ontologies would be commonly useful, and generally to get some insight into the reaction of users when being exposed to their own data.

More generally, we think that there is a lot more work to do on both the aspects of user-centric activity data and on the use of ontologies for the analysis of such data, as described in particular in our Wins and Fails post. These includes aspects around the licensing, distribution and generally management of user-centric data (as mentioned in our post on licensing). Indeed, while “giving back data to the users” is already technically difficult, there is a lot of fuzziness currently around the issues of ownership of activity data. This also forces us to look a lot more carefully at the privacy challenges that such data can generate, that didn’t exist when these data were held and stayed on server logs.

Beyond UCIAD and the Open University

As discussed in our post on the benefits of UCIAD, the issues considered go largely beyond the Open University and even activity data. The issues around licensing in particular are to be considered more broadly, in the same way as the challenges around communicating on user-centric data.

We have been focusing mostly on the technical issues in UCIAD, providing in this way a base framework to start investigating these broader and more complex challenges.

Most significant lessons

To put it simply, the most significant lessons we learnt (as mentioned in the wins and fails post) are:

  • Both user-centric data and ontologies are complex notions, so don’t assume they are understood.
  • Activity data are massive and complex, beyond what can be handled by current semantic data infrastructures, without a bit of clever management.
  • There is a lot of potential in using ontologies and ontological engineering for the analysis and interpretation of raw data.

Wins and fails (lessons along the way)

August 3rd, 2011

If there is one thing I like about the JISC activity data programme in which UCIAD is involved is that the instructions were very clear: your project is a short experiment, to see what could/should be done in the area of activity data in the context of higher education organisations (or at least, this is what I heard). We have integrated that a lot in UCIAD, starting from our two basic hypothesis that a user-centric perspective on activity data is relevant, and that Semantic Web technologies, especially ontologies, provided the right technological basis to achieve such a perspective.

We have discussed in a number of previous posts what we got excited about, what showed us the feasibility, relevance and potential impact of our approach, as well as what unexpected issues we had to face and how some of our assumptions turned out to be wrong. Here, we wanted to give a quick summary of these “wins” and “fails”, starting of course from the wins, and looking at the two aspects corresponding to our two hypothesis: the user-centric view and the semantic technologies view.

Wins – What went right

  • On the user-centric view: Giving data back to the user, user-centric data and consumer data were already emerging trends when we started the project, but clearly exploded as topics that organisations should take into account in the last few months. The New York Times article “Show Us the Data. (It’s Ours, After All.)” has in particular generated a lot of discussions amongst consumer representatives and “data-managers” in various organisations. The mydata project launched by the UK government is also a clear sign that the push for more transparency has to extend to private and public organisations dealing with people’s data. There have already been strong reactions from large companies such as Google, launching its own Data Liberation Front. Generally, users (will more and more) want, and assume the right to access their data and to use them to their own benefits. Only considering the feature of exporting one’s own activity data is technically non-trivial, but of obvious relevance in the current climate where a lot of emphasis is put on transparency, while personal information can be distributed in many different and isolated systems. Beyond the general climate, we have also shown that activity data is not only relevant as aggregated at the level of an organisation, but can give a new perspective when individual users are kept visible in the data (see this post for an explanation of what we mean here). To put it simply, giving people a view on their activity data provides a way for them to reflect on it, and to become more efficient in these activities. It also give them an opportunity to engage with the data, “customize” it, with added-value for the organisation.
  • On Semantic Technologues We have a lot of experience working with ontologies and semantic data, and were therefore confident that there was a great potential here. However, this is probably the point on which most people external to the project would think we had the best chance to fail: we believed that we could apply semantic technologies, linked data-based approaches and (most horribly) ontology-based reasoning to the manipulation, processing and analysis of activity data. Realising the experiments, setting up the UCIAD platform with real, large scale data, applying ontologies on top of these data and evolving these ontologies to create different views for the analysis of these data are, from my very personal point-of-view, the most interesting part of the project. Ontologies have acquired recently a bad reputation, and mentioning them (especially in the context activity data) now often leads to raised eyebrows and condescending looks. One thing that our experiments with UCIAD have shown is that working with ontologies not only has the advantages of introducing formality, shared vocabularies and semantics in our applications, but also represents a flexible and efficient way of introducing meaningful views into large amounts of raw, uninterpreted data. What ontologies bring into such an area is the ability to give definitions that will be at the basis of clustering and organising the data automatically. I can tell what I mean by a “search activity” and magically see all the traces related to search activities being put together, to become explorable and queryable (see our post on reasoning). The nice thing about UCIAD, is that this magic is actually implemented and working in the way we hypothesized it would. It is a fascinating thing to see raw data from log files being classified into meaningful categories of activities, resources and actors. It is even more fascinating knowing that we defined these groups, through encoding these definitions in an ontology, and can add others as we see fit. Due to time constraints, we could only experiment a tiny bit with this process, but we see a very promising approach in the incremental definition of the ontology as an analysis process: looking at the data, thinking that it would make sense to have an activity categorie such as for example “commenting on a blog”, and simply adding it to see the data being automatically reorganised with this new definition.

Fails – What went wrong

  • On the user-centric view: Our biggest failure in my opinion has been that we didn’t manage to communicate appropriately on the notions, approaches and change of perspective that the user-centric view on activity data represents. There are many reasons for this I believe, one being that we have been assuming that the benefits would be self-evident, while they clearly are not (see the post where we tried to get back the basis of the issue). The notion of user-centric data or consumer data might be very trendy, it does not mean that it is ready for wide adoption. There are many issues that need to be solved that go far beyond the purely technical aspects, and that simply come from the fact that activity data has never been looked at in this way before. We don’t really know what will happen in this space, what users would do with these data and how much interest this could generate for the organisation. There are many difficult questions that we could not really address in the scope of the project (including in particular the questions around data ownership, and privacy). While this is enough to keep us excited, there is enormous work to be done before the approach we have been promoting in UCIAD could reach its potential, and be widely adopted.
  • On Semantic Web technologies: While we are still excited about the added-value that semantic web technologies can bring to the analysis of activity data, we have been clearly over-optimistic regarding the maturity of some components we have been relying on, and their ability to handle the scale and complexity of the kind of data we are working with. This issue is clearly summarised in our post on the technical aspect of UCIAD. The good news is however that things are evolving very quickly. It would be a lot easier to implement the UCIAD platform now than it was 6 months ago, as the tools and platforms to deal with semantic data are getting more robust everyday. Also, the evolution of the technology should be followed by an evolution in the skills and ability of the community to adopt such technologies. Realising UCIAD made us reach a better understanding of what was feasible and required to set up a semantic platform for activity data. There is still much to do for such an approach to become feasible in a broader set of situations.

Technical and Standards

August 2nd, 2011

We have been talking about our technical approach and some of the components with used in a few other previous posts. Here we summarise the major technical components of the architecture of the UCIAD platform (see figure below) as well as the tools we have reused. We also come back to the biggest, and most major technical issue we had to face: scalability, especially when applying Semantic Web technologies to very large log data.

As described before, one of the core principles of UCIAD is the use of Semantic Web technologies, especially ontologies and Semantic Web languages. The goal of the architecture above is to extract from logs homogeneous representations of the traces of activity data present in logs and store them in a common semantic store so that they can be accessed and queried by the user.

The representation format we use for this activity data is RDF – the resource description framework, which is the standard representation format for the semantic Web and linked data. It is a graph-based data model where the nodes are either literal values or URIs: Web addresses of “data objects”. The advantages of RDF is that the graph data-model provides a lot of flexibility into manipulating the data and extending it. RDF is also supported by many different tools and systems. More importantly, for us, the schema used for RDF is an ontology, represented on the OWL Web Ontology Language, which allows flexibility and granularity in representing the data, but also, as a logical formalism, makes it possible to apply simple inference mechanisms (namely classification) so that “definitions” can be added to the ontology to cluster and group traces of activities automatically (see our post regarding the development of the UCIAD ontologies and the reasoning mechanisms applied with these ontologies).

The core component of our architecture is the semantic store (or triple store). Here we use OWLIM, which has three main advantages for us:

  • (In principle) it scales to very large amounts of data (although, this has proved to be an optimistic view, see below).
  • It provides efficient inference mechanisms for a fragment of OWL (which mostly covers our needs)
  • Together with the Sesame interface, it provides standard data access and querying mechanisms through the SPARQL protocol.

As mentioned several times in the past, however, despite OWLIM providing a very good base, the scale of the data we have had to handle generated major issues, which introduced a lot of delays in the project. Activity data, in the form of traces from logs, are enormous. OWLIM claims to be able to handle 10s to 100s of Billions of RDF triples (connections in the graph), but there are a number of circonstances that need to be considered.

To get an idea of the scale we are talking about, we consider a reasonable web server at the Open University (e.g., the one used to serve data.open.ac.uk). This server would serve a few million requests per month. Each request (summarised in one line in the logs) is associated with a number of different pieces of information that we will re-factor in terms of our ontologies, concerning the actor (IP, agent), the resource (URL, website it is attached to, server), the response (code, size) and other elements (time, referrer). One of the things with using a graph representation is that it is not optimised for size. We therefore can obtain anything between 20 and 50 triples per request. That leads us to something in the order of 100 million triples per month per server (each server can host many websites).

In theory, OWLIM should handle this sort of scale easily, even if we consider several servers over several months. However, there are a number of things that make the practice different from the theory:

  • OWLIM might be able to store many billions of triples, but not any kind triples. The data we are uploading to OWLIM is complex, and has a refined structure. Some objects (user settings, URLs) would appear very connected, while others would only appear in one request, and share only a few connections. From our experience, it is not only the number of triples that should be considered, but also the number of objects. A graph where each object is only associated with 1 other object through 1 triple might be a lot more difficult to process than one with as many triples, but shared amongst significantly less nodes.
  • Many triples, but not all at once. This is another very important element for us: OWLIM might be able to “hold” many triples, but it does not mean that they can all me uploaded and processed at the same time. Loading triples into the store takes a lot of resources, and too many triples at the same time might overwhelm it and make it crash. To deal with this, we had to change our process, which originally loaded the log files for an entire month at once, into one where we extracted everyday the log information for the previous day.
  • The two previous issues are amplified when inference mechanisms are applied. OWLIM handle inferences at loading times. This means that not only the number of triples uploaded onto the store are multiplied through inference, but also that immensely more resources are required at the time of loading these triples, depending not only on the size of what is uploaded, but also on its complexity (and, as mentioned above, our data is complex) and on the size of what is already stored. Originally, our approach was to have one store holding everything with inferences, and to extract from this store data for each user. We changed this approach to one were the store that keeps the entire dataset extracted from logs does not make use of inference mechanisms. Data extracted for each user (see our post on user management) is then transferred into another (necessarily smaller) store for which inferences apply.

There are a number of other reasons why dealing with semantic data still requires a certain amount of “trial and error”. There is an element of magic in it, not only because when it works, it allows more flexibility and automation than other types of data management approached, but also because making it work often requires following processes (rituals) that are not completely understood.

Closing on a positive note however, since we started the project, a new version of OWLIM has been released (4.1), which provides significant improvements over the previous versions. The system seems now better able to load large amounts of data in one go, and also to manage the resources available more cleverly. It also now supports the SPARQL 1.1 query language with includes aggregation functions, making some of the analysis tasks we are applying easier and less resource consuming.

Licensing & reuse of software and data

July 31st, 2011

Deciding on licensing and data distribution is always challenges where talking about data which are intrinsically personal: activity data. Privacy issues are of course relevant here. We cannot distribute openly, or even on proprietary basis, data that relate to users’ actions and personal data on our systems. Anonimisation approaches exist that are supposed to make users un-identifiable in the data. Such approaches however cannot be applied in UCIAD for two main reason:

  • Such anonimisation mechanisms are only garantied in very closed, controlled environment. In particular, they assume that it is possible to completely characterise the dataset, and that integration with other datasets will not happen. These are two assumption that we can’t apply on our data as it is always evolving (in ways that might make established parameters for anonimisation suddenly invalid) and they are meant to be integrated with other data.
  • The whole principle of the project is to distribute the data to the user it concerns, which means that the user is at the center of the data. Anonimising data related to one user, while giving it back to this user makes of course not sense. More generally, anonimisation mechanisms are based on aggregating data into abstracted or averaged values so that individual users disappear. This is obviously in contradiction with the approach taken in UCIAD.

The issue with licensing data in UCIAD is actually even more complicated: what licence to apply to data exported for a particular user? The ownership of the data is not even clear in this case. It is data collected and delivered by our systems, but that are produced out of the activities of the user. We believe that in this case, a particular type of license, that give control to the user on the distribution of their own data, but without opening it completely, is needed. This is an area that we will need to put additional work on, with possibly useful results coming out of the mydata project.

Of course, despite this very complicated issue, more generic components of UCIAD can be openly distributed. These include the UCIAD ontologies, as well as the source code of the UCIAD platform, manipulating data according to these ontologies.

Benefits

July 29th, 2011

One of the major issues (which is going to be discussed in longer terms in the “Wins and Fails” post in the next few days) of the approach taken in UCIAD is to communicate on its benefits. One reason is that, to be fully honest, the mechanisms and the whole perspective we are taking on activity data are still too ‘experimental’ for us to fully understand these benefits yet. The other aspect of this is that at the core of our approach is a focus on the benefits of activity data to the end-user and not, as it would usually be the case, to the organisation. We therefore here quickly come back to what we have learnt on the advantages of our approach, first to the end-users, and then deriving potential benefits to the organisation. We summarise our view on the achievements of UCIAD in terms of benefits through a discussion regarding the success of the project, as seen as an experiment towards ontology-based, user-centric activity data.

Benefits to the end-user

There have been a number of places where the potential benefits of user-centric data (or consumer data) have been discussed, as generally labeled as “giving back their data to the users”. These include in particular the popular article “Show Us the Data. (It’s Ours, After All.)” by Richard H. Thaler in the New York Times. As was argued in particular in one of our previous posts, being able to give a complete account of what end-users could do with such data is both unfeasible and undesirable. However, we can summarise the expected benefits, and their connections to the work done in UCIAD, in three different areas:

  • Known yourself… and be more efficient: As we briefly discussed in our post on self-tracking, there is a trend currently regarding people, individuals, monitoring their own activities, statuses, etc. While some would criticise such attitude as pure narcissism, the reality is that monitoring oneself has been realised as one effective way to improve. In sport for example, monitoring performance in relation with other variables (health status, equipment used, etc.) is necessary to improve and achieve the best conditions, for the best results. Besides sports however, there are many areas where monitoring and understanding one’s own behaviour can help being more efficient. There is a large gap between an athlete measuring his/her performance and a user monitoring his/her online activities. However, for a user to know how he/she searched websites, find and exploit resources on the Web or engage with online communities, can only have a positive effect on his/her effectiveness in realising these tasks in the future.
  • Exploit your own data yourself: Besides the passive monitoring of activities, consumer data has often be described as exploitable by individuals. Indeed, in the current situation, organisations collect large amounts of data about their users, that they exploit to their own benefits, often for commercial purposes. Such personal data and profiles are being used and accessed by a large variety of agents, from the search engine that will send personalised results to the advertiser that will target you with specific products, except the user him/herself. For the users to have access, control and possibly ownership of their own data means that they could also exploit them, use them to build their own profiles that can be employed in communicating with other entities on the Web, under their own terms. In a more directly pragmatic way, the users can analyse their own data and build on top of them to extract relevant information to their own benefit. In UCIAD, we not only allow users to export their own data, but we do it using Semantic Web standards to ensure maximum reusability and, through relying on a customisable ontology, the exported data can be flexibly adapted to any kind of uses that the user might come up with, not only the ones that we have thought of.
  • Combine and integrate your own data: While we are still far from such a situation at this stage, we can easily imagine that, with the explosion of the number of systems providing an “export your own data” feature, users will eventually be able to build their own personal knowledge base, feeding it with personal data collected from the many online systems they use. Again, such a scenario requires a certain level of standardisation in the data representation formats being used, for which Semantic Web technologies appear as perfect candidates. A possibly less distant scenario is the one were users interacting with several organisations would export their activity data from the corresponding instances of the UCIAD platform. These data would naturally integrate to provide the user with the ability to monitor, analyse and exploit their activity data across numerous, originally disconnected organisations and websites.

Benefits to the organisation

As explained earlier, one of the core aspects of UCIAD has been to focus on the benefits of collecting and flexibly interpreting activity data to the end-user. This does not mean that the organisation has no interest in considering the type of technology we have been developing, but simply that the benefits to the organisation mostly come as derived from providing benefits to the end-users of the organisation:

  • Transparency: In very simple terms, users are more and more pushing organisation towards more accountability with respect to the data they collect about them. Deploying the UCIAD platform can be seen as a way for an institution to tell users “here is what we have about you in terms of activity data”.
  • Trust: In relation with the point above on transparency, providing collected data back to the user is a way to establish a stronger relationship with them: i.e., one where they can trust the organisation regarding the fair and transparent use of their activity data.
  • Leave data management to the user: Leaving the user in control of their own data can bring valuable benefits to the organisation. In particular, it means that the user can allow, or actively enable, the use of more data than what can be done when he/she is left out of the loop. It makes it possible for example for them to bring and import data they have collected from other systems and organisations, so that the same data does not have to be collected again, and the new organisation does not have to start from scratch.

How do we measure success?

So, now that we have listed all the expected benefits of the approach taken in UCIAD, the natural next question is “have we managed to bring all these benefits to our institution?”. The plain and honest answer is: No.

From the start, we have considered UCIAD as being an experiment (and actually, a rather short one). What we wanted to demonstrate was that:

  1. These benefits are achievable
  2. Technology, such as linked data and ontologies, make the approach feasible

The UCIAD platform demo, collecting log data from several webservers concerning around a dozen websites, interpreting this data in terms of user-activity, extracting the traces of activities around a given user and exposing the user to these traces in a meaningful way, provides an undeniable demonstration that the technical and technological mechanisms to achieve the UCIAD approach are applicable and effective.

We are currently demonstrating this platform to users of the Open University websites, and observing them in engaging with it, and so with their own activity data. This activity will carry on for some times after the end of the project so that we can learn as much as possible from the current state of the platform. However, from these initial discussions, it appears clearly that users are interested, even sometimes fascinated, with the idea of obtaining and using their own activity data. They are, as it has been happening for many systems outside UCIAD (e.g., Google, Facebook), very positive about such features being added to the websites of an organisation they spend so much time interacting with: their University. In many cases now, they are demanding it.

User (Management)

July 27th, 2011

In the previous post, we explained to a certain extent what are our motivations for looking at a user-centric approach to activity data, and especially what we expect to be the benefits to the users. We also quickly sketched some specific aspects of identifying and processing user-specific information in our post regarding the reasoning processes employed in UCIAD. Here, we come back more generally on the aspects related to users and user management in the UCIAD platform, including the way to recognise a user, treat registrations and login, manage and present the information about the user activity and handle access rights over semantic data. The actual prototype of the UCIAD platform implementing all these elements is currently being finalised, and will be described more completely in our final post.

Identifying and managing users of UCIAD

The information the UCIAD platform has regarding users can be seen as similar to the ones basic analytics systems have. The user is rarely seen directly, as the interaction is mediated through a “user agent”: a software programme running on a particular computer. Each HTTP request is associated with the ID of the user agent realising it, and the IP address of the corresponding computer. Analytics system have for long realised that the combination of these two parameters was sufficient to recognise a user with a reasonable level of accuracy. The disadvantage however is that the same user can be using different agents (e.g., different browsers) and different computers (or even mobile phones) to access the Web.

In UCIAD, we have the advantage that it is very likely that the user will connect to the UCIAD platform using the same agents and computers they usually use to access the Web, and especially the considered websites. As shown in the mock-up screenshot above, the “settings” the user is using can be detected at the time of logging in, and be attached to the user account. These settings will then be used to aggregate all the activity data that have been realised using the same computer and user-agent, and be added to the set of activity data for the particular user.

In addition, this provides a convenient mechanism to aggregate information realised on different computers and different settings. The user can log again in the UCIAD platform with a different browser, or a different device. When that happens, as described in the figure below, the current setting will simply be added to the list of known settings for this user, and contribute another set of activity data around this particular user.

As explained in the post about reasoning on user centric activity data, managing the activity data regarding a particular user corresponds to creating a sub-graph of the complete graph of raw activity data we collect from logs, based on the information about the known settings of the user. This graph is then being registered in our repository, and the next step is to ensure that the information being provided is restricted to the graph of the logged-in user.

Managing access rights over semantic data

We store, manipulate and reason over activity data using Semantic Web technologies, namely RDF, a triple store with inference capabilities and SPARQL for querying. As part of the UCIAD platform, we needed a mechanism to restrict the queries being sent to only the part of the data that the current user has access to: his/her own subgraph of activity data.

Unfortunately, most current triple stores, and especially the one we are employing, do not provide sufficiently fine-grained access control mechanisms, allowing to associate sub-graphs to particular users. We therefore implemented our own mechanism, which can be seen as a generic recipe for access control over activity data.

The all idea is actually quite simple (as depicted on the diagram above): the actual SPARQL endpoint collecting all the data for all the users is being hidden using standard security measures so that it can only be accessed by our own system. We then implement a “proxy SPARQL endpoint” that can handle basic HTTP authentification. When receiving a query, this proxy endpoint will check the credential of the user and see what sub-graphs the user has access to, so that it can modify the query to restrict it to these sub-graphs only (using the FROM clause in SPARQL). It can then send the query to the real, hidden SPARQL endpoint and forward the results back to the user.

While this mechanism is relatively simple it offers an appropriate level of flexibility, allowing to define arbitrary subgraphs and user definitions as a model for access control. It is actually nice to see how, based on basic authentification mechanisms, the same queries asking for activity data will return different results, depending on the user who is connected.

What users anyway?

Of course, the mechanisms and techniques to manage, identify and process information about users does not answer the question of who they are and what are the benefits they can get from the system. Actually, as argued before, it is pretty hard to predict in advance what is going to be the use of providing back to the users their own activity data. General arguments can be given on the advantages of self-tracking, but in reality, the really important thing is that what is provided by the system has to stay open for any use. Working with the development version of the UCIAD platform, we find it quite fascinating that we, as individual users, can trace back our activities, drill down into specific categories (e.g., search, commenting on blogs, checking the price of a course), send queries which might only be relevant to us (e.g., “how much did I use data.open.ac.uk on sundays?”), etc. It helps us understand our own use of the resources provided by the University, and so to become more efficient with them.

Explaining user-centric activity data

July 5th, 2011

I was today at the meeting of the JISC activity data programme, where all the projects in the programme came to discuss what they were doing, and what should be the priorities for the coming year(s). As some might have realised, I am actually a bit critical of this sort of discussions. Not that I think that the projects are doing the wrong things, just that there is a lot of catching up to do, and I think we might end up missing the next train (which I believe to be consumer data) while trying to catch up with the previous one (activity data-based recommander systems).

Anyway, I was trying to come up with a reasonable explanation regarding user-centric activity data (mostly based on showing evidence of the current trends in the industry, from energy providers showing users historic information on their own consumption to the Google Data Liberation front and the mydata project) when the ongoing discussion derived on the definition of simple things such as the notion of event. Trying to define the concepts we are talking about is the major goal of our ontologies. However, the discussion made me realised that we also needed a simplified overview of the kind of data we are dealing with, and of what made the difference between the organisation-centric view and the user-centric view of activity data.

Indeed, looking at the figure above, we can summarise very simply what we are dealing with in terms of activity data. Activity data is set of events (or the traces of these events) where an action is realised on a resource (e.g., a webpage) by an actor (most often a user). That is a general view of what we mostly have to consider as raw activity data. However, in order to extract anything meaningful from this data, looking at the raw collection of individual events isn’t going to give us much: we need to abstract the data into sets of events that are meaningful, and which distributions of characteristics can be interpreted.

The figure above represents the most common way of abstracting activity data: what we call the organisation-centric view. The idea is that large sets of events are being analysed that are realised by aggregated sets of users. There can be one set of users, like in the case of analytics system that provide statistics regarding actions realised by all visitors of a website, or the organisation can define sub-groups such as Students/Staff/External that are meaningful to the particular types of activities and analyses being considered. In this case, users stop existing individually in the abstracted activity data, as they only manifest as part of the aggregated statistics for their groups.

User-centric activity data is basically making the abstraction the other way around (see above): aggregating traces of activities around a given user, interpreted according to meaningful sets of resources and events. The challenge in this case (appart from the scalability of the approach, which is going to be the topic of another blog post sometimes) is in the way to define meaningful sets of resources and events. In the data we have been looking at, activities such as “commenting on a blog”, “searching a blog”, “querying linked data” or “using a web application” are clearly emerging, but the number and nature of the types of resources and events that can appear in the data is largely dependent on the system and the user. This is why we believe that using ontologies as a model to drive such abstractions is a good solution: it provides us with a flexible way to define types of resources (e.g., BlogPage, RSS feed, Linked Data endpoint) and the corresponding activities (e.g., commenting, querying, searching), and to automatically classify individual traces and resources into these types. The end result is the ability for individual users to visualise and analyse the distribution of their own activity data in these types and categories. Pushing it a step further, users should even be able to personalise the views, giving their own ontological definitions and obtaining data abstractions that are therefore more meaningful to them.

A colleague forwarded me today this article in french, where the author says (my translation): “What could I accomplish if I had at my disposal, in an exploitable form, the information regarding my pathways and communications? [...] Not only to control what others are doing with it, but to use it to my own benefit? Today, we tend to scratch our head and ask: what would be the use of that?”, and indeed we don’t really know what this will allow in the future. However, as the author of the article suggests, that shouldn’t stop us from trying to find out, as long as we are convinced there is something there to explore.

The mydata project

June 21st, 2011

Announcements have come out recently regarding new projects from the government around the slogan “Better choices, better deals” to support better customer experience, through transparent customer information. This is exciting as it shows how the government, as well as businesses, are now realising that it is through giving control to information to the customers (i.e., the users) that we can build a better, more reliable and more transparent experience. At the core of the initiative is the mydata project which goal can be summarised by the sentence: “giving back customer data to customers”. To a large extent UCIAD can be seen as an experiment in this direction, proposing to deliver activity data to the users (i.e., customers) of large organisations. We certainly share the same hypothesis that, as expressed by Nigel Shadbolt (chair of the MyData project), customers/users getting back their information can help make organisations/businesses “more accountable”, “more efficient” and able to build “new kinds of services”.

Of course, it is still unclear at this stage what will be the concrete outcomes of the mydata project. Great challenges have to be tackled both from a technological point of view (in what format should data be provided to customers? How to ensure reusability? How to deal with heterogeneity?) and from the societal point of view (What are the privacy/security implications? How to enforce “user-centric data provision” policies in businesses? How to spread the benefit equally amongst users?). We hope that our experience with UCIAD (and beyond, with the work building on UCIAD we are planning to do) will contribute to such exciting new approaches to activity/customer data.

Reasoning over user-centric activity data

June 16th, 2011

There are two reasons why we believe ontology technologies will benefit the analysis of activity data in general, and from a user centric perspective in particular. First, ontology related technologies (including OWL, RDF and SPARQL) provide the necessary flexibility to enable the “lightweight” integration of data from different systems. Not only we can use our ontologies as a “pivot” model for data coming from different systems, but this model is also easily extensible to take account of the particularities of the different systems around, but also to allow for custom extension fo particular users, making personalised analysis of personal data feasible.

The second advantage of ontologies is that they allow for some form of reasoning that make it easier for us to just through data into them and obtain meaningful results. I use reasoning in a broad sense here to show how, based on raw data extracted in the logs of Web servers, we can obtain a meaningful, integrated view of the activity of a user of the corresponding websites. This is based on a current experiments realised with 2 servers hosting various websites, including blogs such as uciad.info, as well as the linked data platform of the Open University — data.open.ac.uk.

Traces of activities around a user

The first piece of inference that we need to realise is to be able to identify and extract, within our data, information related to the particular traces of activities realised by a user. To identify a user, we rely here on the settings used to realise the activity. A setting, in our ontology, correspond to a computer (generally identified by its IP address) and an agent (generally a browser, identify by a generally complex string such as Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24). The first step is therefore to associated a user to the settings he/she usually uses. We are currently developing tools so that a user can register to the UCIAD platform and his/her setting be automatically detected. Here, I manually declared the settings I’m using by providing the triple store with the following piece of RDF:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:actor="http://uciad.info/ontology/actor/">
<rdf:Description rdf:about="http://uciad.info/actor/mathieu">
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/4eafb6e074f46857b1c0b4b2ad0aa8e4"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/c97fc7faeadaf5cac0a28e86f4d723c9"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/eec3eed71319f9d0480ff065334a5f3a"/>
</rdf:Description>
</rdf:RDF>

This indicates that the user http://uciad.info/actor/mathieu has three settings. This settings are all on the same computer and correspond to the Safari and Chrome browsers, as well as the Apple PubSub agent (used in retrieving RSS feeds amongst other things).

Each trace of activity is realised through a setting (linked to the trace by the hasSetting ontology property). Knowing the settings of a user therefore allows us to list the traces that correspond to this particular user through a simple query. Even better, we can create a model, i.e. an RDF graph, that contains all the information related to the user’s activity on the considered websites, using a SPARQL construct query:

PREFIX tr:<http://uciad.info/ontology/trace/>
PREFIX actor:<http://uciad.info/ontology/actor/>
construct {
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
} where{
  <http://uciad.info/actor/mathieu> actor:knownSetting ?set.
  ?trace tr:hasSetting ?set.
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
}

The results of this query correspond to all the traces of activities in our data that have been realised through known setting of the user http://uciad.info/actor/mathieu, as well as the surrounding information. Although this query is a bit rough at the moment (it might include irrelevant information, or miss relevant data that are connected to the traces through too many steps), what is really interesting here is that it provides a very simple and elegant mechanism to, from large amount of raw log data, extract a subgraph that characterise completely the activities of one user on the considered websites. This data can therefore be considered on its own, as a user-centric view on activity data, rather than a server-centric or organisation-centric view. It can as well be provided back to the user, exported in a machine readable way, so that he/she becomes can possibly make use of it in other systems and for other purposes.

We are currently working on the mechanisms allowing users to register/login to the UCIAD platform, to identify their settings and to obtain their own “activity data repository”.

Reasoning about websites and activities

The second aspect of reasoning with user-centric activity data relates to inferring information from the data itself, to support its interpretation and analysis. What we want to achieve here is, through providing ontological definitions of different types of activities, to be able to characterise different type of traces and classify them as evidence of particular activities happening.

The first step in realising such inferences is to characterise the resources over which activities are realised — in our case, websites and webpages. Our ontologies define a webpage as a document that can be part of a webpage collection, and a website as a particular type of webpage collection. As part of setting up the UCIAD platform, we declare in the RDF model the different collections and website that are present on the considered server, as well as the url patterns that makes it possible to recognise webpages as parts of these websites and collections. These URL patterns are expressed as regular expression and an automatic process is applied to declare triples of the form page1 isPartOf website1 or page2 isPartOf collection1 when the URLs of page1 and page2 match the patterns of website1 and collection1 respectively.

Now, the interesting thing is that these websites, collections and webpages can be further specified into particular types and as having particular properties. We for example declare that http://uciad.info/ub/ is a Blog, which is a particular type of website. We can all declare a webpage collection that corresponds to RSS feeds, using a particular URL pattern, and use an ontology expression to declare the class of BlogFeed as the set of webpages which are both part a Blog and part of the RSSFeed collection, i.e., in the OWL abstract syntax

Class(BlogFeed complete
    intersectionOf(Webpage
      restriction(isPartOf someValuesFrom(RSSFeed))
      restriction(isPartOf someValuesFrom(Blog))
    )
)

What is interesting here is that such a definition can be added to the repository, which, using its inference capability, will derive that certain pages are BlogFeed, without this information being directly provided in the data, or the rule to derive it being hard-coded in the system. We can therefore engage in an incremental construction of an ontology characterising websites and activities generally, in the context of a particular system, or in the context of a particular user. Our user http://uciad.info/user/mathieu might for example decide to add to his data repository the ontological definition allowing him to recognise traces over BlogFeed realised with the Apple PubSub agent as a particular category of activities (e.g., FeedSyndication), alongside others that characterise other kind of activities: recovering data, reading news, commenting, editing, searching, …