Why I don’t believe in the personal data economy

May 1st, 2012

I care a lot about personal data (especially mine), and yep, my and everybody else’s personal data are all over the place. They are all over the place because they are valuable. They are all over the place because I (and you) don’t have much control over the way they are being collected, exchanged and exploited.

Now, what I conclude from that is that we need more control (and, of course, I would not pretend that I’m the only one thinking that, but that’s not the point). Apparently, what other people conclude from that is that you should be selling your personal data: that there is such a thing as a “personal data economy” where, instead of giving it away, you can decide on a price for your personal data. This is frequently promoted as the latest, new, brilliant idea on the Web, most recently by HP Labs: “A Stock Exchange for Your Personal Data”.

Why didn’t we think about that before?

Well, quite simply because we did. Numerous times. And it did not work.

Have you heard of things like the Attention Recorder, from the Attention Trust, that was going to make the attention economy the big thing of… hurmm… 2007? Well, me neither.

I do not believe in this thing quite simply because I don’t understand how it could work. It sounds really naive, mostly for three reasons:

  1. The fact that my personal data are valuable does not mean I want to sell them. Apparently, my organs are valuable. There is a market with people who would be more than happy to harvest them and get some benefits from that. To me, they are just essential. They don’t have value from an economic perspective. They have value because I can’t exist and function without them. In other terms, it is not because it has market value that it is a commercialisable personal asset. You would have to be seriously naive to think that it is the case, and come up with very convincing evidence to convince me that this is the way things work.
  2. Putting a price on it does not prevent others from getting it for free. That’s one part I have never really understood. Personal data is being collected from us, and this is actually necessary for quite a few things to function. If this is the case, why exactly would anybody want to pay for it? You can argue that this is exactly the point: they should not have access to it for free. But how is it that this is going to change? This is not a trick question: I truly don’t understand this!
  3. Wouldn’t it actually kill the personal data economy? At the risk of appearing to contradict myself, I would argue that this personal data economy exists already, but not in the form envisaged in this “personal data stock-market” trend. We register to online services and allow them somehow to collect data from us, in exchange for what they have to offer. Personal data are not the product, they are the currency! We buy stuff on the Web with our personal data, and actually most of us are reasonably happy about most of what we get out of the deal. We could stretch this metaphor further, and talk about “personal data banks” (some do) and “personal data currency exchange”. But the real point here is that changing the current “market” to a model where personal data is a product being sold would change the balance of this economy: personal data would stop having the quality of a currency, i.e. having the same value for everybody. Now, here is the trick: if personal data stop having the same value for everybody, they start being biased; and if they start being biased, they stop having value to the people who exploit them.

Now of course, I probably misunderstood something somehow, and maybe all these points have been addressed (in which case, when can I start making money by telling you how much I weigh? Which reminds me: I should stop telling things to my friends, just in case). Until that is confirmed though, I would rather work on being able to understand and control the way I spend my personal data online.

Personal Activity Data: Another Project

March 21st, 2012

UCIAD is about the integration and analysis of activity data originating from the logs of different websites of an organization, using the knowledge the organization has about these websites to provide users with ways to analyze their own online interactions with the organization. In another project called DATAMI (funded by the IKS Project), we are investigating how activity data from the whole Web traffic generated by a user can be semantically analyzed to extract ‘entities’ of interest to the user, in a “personal, semantic web history dashboard”.

A first, preliminary (video) demo of this application has been released today, showing the potential of the technologies developed in this project:

This result is of course not dissimilar to the video produced at the end of phase 1 of UCIAD, and clearly, the user study we are currently setting up for phase 2 will also provide valuable results for the DATAMI project.

UCIAD: The Second Phase

February 1st, 2012

Today is the start of the second phase of the UCIAD project, also known (not very excitingly) as UCIAD-II. This extension of the project will run until the end of June 2012, and will mostly involve me, as director and manager, and Keerthi Thomas, as our privacy/technology expert.

Why extend UCIAD?

Well, the answer to that would appear quite obvious to a number of people. Not carrying on the work and pushing it further would have actually been more surprising. Indeed, UCIAD investigated the challenges related to integrating traces of activities on an organisation’s websites, and presenting them in a user-centric way. It showed how such an approach creates a new set of technical issues, and how semantic technologies (including linked data and ontologies for data integration and clustering) can be used to tackle these issues.

Our goal here is to validate this initial insight, establish the scenarios in which the user-centric analysis of activity data will be employed in the future and devise recommendations for the types of policy that will be needed in relation to these scenarios. In other terms, now that we understand the technology, we need to understand the usage, and the implications. This is quite exciting considering that, even given the current trend in user-centric/consumer data, these aspects have not been studied before, while they will clearly become crucial in the next couple of years (if not months).

So, what are we going to do?

Generally, UCIAD II is looking at two complementary aspects of the general idea of user-centric activity data (i.e., giving back to users the data about their own activity):

  1. What are the concrete scenarios in which users can benefit from having access to, understanding of and control over their own activity data?
  2. What are the changes in terms of organisations’ policies on data access, data protection, data licensing and privacy that are made necessary by such approaches?

The goal of UCIAD 2 is therefore to provide early-stage answers to these questions through a study, realised with a group of users (students and staff) of the Open University websites. In a nutshell, we will give these users early dedicated access to the evolving set of tools prototyped in UCIAD, populated with their activity data on the Open University websites. We will record and reflect with them on their usage of these tools and their reactions to them, in order for them to act as a ‘focus group’ to establish what the access to their own activity data could enable, and what can be judged acceptable in terms of the organisation’s policies on managing these data.

What’s next?

As with UCIAD, we will use this blog to report on and discuss the progress of the project. We have a lot on our plate already, starting with enrolling participants and organising the first interview campaigns. More on that will appear soon!

Final post – Putting things together (with a demo)

August 5th, 2011

Over the last 6 months we have been working on building the UCIAD platform, experimenting with large-scale activity data, reflecting on user-centric data and blogging our thoughts. While, as can be seen from the last few posts on this blog, there is quite some work we think should follow from this, it is nice to see things finally coming together, and to be able to show a bit of the UCIAD platform we have been talking about for some time. What better way to do this than with a video of the running platform, showing the different components in action? (Note: it is better to watch it in 720p – HD).

This video shows a user (me) registering to the UCIAD platform with some setting details and browsing his activity data as they appear on several Open University websites (mostly, an internal wiki system and the Open University’s linked data platform – data.open.ac.uk). This video therefore integrates in a working demo the different components we have been talking about here:

  • User management: As we can see here, as the user registers into the UCIAD platform, his current setting is automatically detected, and other settings (other browsers) that are likely to be his are also included. As the user registers, the settings are associated to his account and the activity data realised through these settings are extracted.
  • Extracting user-centric activity data: As described in the first part of the blog post on reasoning (previous link), the settings associated with the user are used to extract the activity data around this particular user, creating a sub-graph corresponding to his activity.
  • Ontologies to make sense of activity data: The ontologies are used in structuring the data according to a common schema and to provide a base to homogeneously query data coming from different systems. As discussed below, they can also be extended (specified) so that different categories of activities and resources can be represented, and reasoned upon.
  • Ontological reasoning for analysis: What the demo video shows clearly is how the activity data is organised according to different categories (traces, webpages, websites, settings, etc.) coming from the base ontologies, but also according to classes of activities, resources, etc. that have been specially added to cover the websites and the particular user in this case. Here, we extended the ontology in order to include definitions of activities relevant to the use of a wiki and a data platform. The powerful aspect of using ontologies here is that such classes can be added to the ontology for the system to automatically process them and organise the data according to them. Here, for example, we define “Executing a SPARQL Query” as an activity that takes place on a SPARQL endpoint with a “query” parameter, or “Checking Wiki Updates” as an activity on a Wiki page that is realised through an RSS client.
  • Browsing data according to ontologies: We haven’t described this component yet, but we rely on a homemade “browser” that we use in a number of projects and that can inspect ontology classes and members of these classes, generating graphs and simple stats.
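
To make the idea of “definitions as classes” from the list above a little more concrete, here is a minimal sketch in plain Python. It is not the UCIAD implementation, which relies on OWL ontologies and reasoning in the triple store; the trace fields, the example.org addresses and the helper functions are all assumptions made for illustration. Each definition is a simple test, and a trace is placed in every class whose test it satisfies.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical addresses, standing in for a real SPARQL endpoint and wiki.
SPARQL_ENDPOINTS = {"http://data.example.org/sparql"}
WIKI_HOSTS = {"wiki.example.org"}

def on_sparql_endpoint(trace):
    u = urlparse(trace["url"])
    return f"{u.scheme}://{u.netloc}{u.path}" in SPARQL_ENDPOINTS

def has_parameter(trace, name):
    return name in parse_qs(urlparse(trace["url"]).query)

def on_wiki_page(trace):
    return urlparse(trace["url"]).netloc in WIKI_HOSTS

def via_rss_client(trace):
    # Crude assumption: an RSS client announces itself in its user agent.
    return "rss" in trace["agent"].lower()

# Each "class" is a named definition; adding an entry adds a new category.
DEFINITIONS = {
    "Executing a SPARQL Query": lambda t: on_sparql_endpoint(t) and has_parameter(t, "query"),
    "Checking Wiki Updates": lambda t: on_wiki_page(t) and via_rss_client(t),
}

def classify(trace):
    """Return the names of all definitions that the trace satisfies."""
    return [name for name, holds in DEFINITIONS.items() if holds(trace)]

traces = [
    {"url": "http://data.example.org/sparql?query=SELECT+%2A+WHERE+%7B%7D", "agent": "Mozilla/5.0"},
    {"url": "http://wiki.example.org/Main_Page", "agent": "ExampleReader/1.0 (RSS client)"},
    {"url": "http://wiki.example.org/Main_Page", "agent": "Mozilla/5.0"},
]

for t in traces:
    print(t["url"], "->", classify(t))
```

In this sketch, a new category of activity only requires one more entry in `DEFINITIONS`; the existing traces are then regrouped the next time `classify` runs, which loosely mirrors what the reasoner does with a new class definition.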

Next steps

There are a lot of things to mention here, some of which we have already mentioned several times. An obvious one is the finalisation, distribution and deployment of the UCIAD platform. A particular element we want to get done in the short term is to investigate the use of the UCIAD platform with various users, to see what kind of extensions of the ontologies would be commonly useful, and generally to get some insight into the reaction of users when being exposed to their own data.

More generally, we think that there is a lot more work to do both on user-centric activity data and on the use of ontologies for the analysis of such data, as described in particular in our Wins and Fails post. This includes aspects around the licensing, distribution and general management of user-centric data (as mentioned in our post on licensing). Indeed, while “giving back data to the users” is already technically difficult, there is a lot of fuzziness currently around the issues of ownership of activity data. This also forces us to look a lot more carefully at the privacy challenges that such data can generate, that didn’t exist when these data were held and stayed on server logs.

Beyond UCIAD and the Open University

As discussed in our post on the benefits of UCIAD, the issues considered go largely beyond the Open University and even activity data. The issues around licensing in particular are to be considered more broadly, in the same way as the challenges around communicating on user-centric data.

We have been focusing mostly on the technical issues in UCIAD, providing in this way a base framework to start investigating these broader and more complex challenges.

Most significant lessons

To put it simply, the most significant lessons we learnt (as mentioned in the wins and fails post) are:

  • Both user-centric data and ontologies are complex notions, so don’t assume they are understood.
  • Activity data are massive and complex, beyond what can be handled by current semantic data infrastructures, without a bit of clever management.
  • There is a lot of potential in using ontologies and ontological engineering for the analysis and interpretation of raw data.

Wins and fails (lessons along the way)

August 3rd, 2011

If there is one thing I like about the JISC activity data programme in which UCIAD is involved, it is that the instructions were very clear: your project is a short experiment, to see what could/should be done in the area of activity data in the context of higher education organisations (or at least, this is what I heard). We have integrated that a lot in UCIAD, starting from our two basic hypotheses: that a user-centric perspective on activity data is relevant, and that Semantic Web technologies, especially ontologies, provide the right technological basis to achieve such a perspective.

We have discussed in a number of previous posts what we got excited about, what showed us the feasibility, relevance and potential impact of our approach, as well as what unexpected issues we had to face and how some of our assumptions turned out to be wrong. Here, we wanted to give a quick summary of these “wins” and “fails”, starting of course from the wins, and looking at the two aspects corresponding to our two hypotheses: the user-centric view and the semantic technologies view.

Wins – What went right

  • On the user-centric view: Giving data back to the user, user-centric data and consumer data were already emerging trends when we started the project, but clearly exploded as topics that organisations should take into account in the last few months. The New York Times article “Show Us the Data. (It’s Ours, After All.)” has in particular generated a lot of discussions amongst consumer representatives and “data-managers” in various organisations. The mydata project launched by the UK government is also a clear sign that the push for more transparency has to extend to private and public organisations dealing with people’s data. There have already been strong reactions from large companies such as Google, launching its own Data Liberation Front. Generally, users (will more and more) want, and assume the right, to access their data and to use them to their own benefit. Just considering the feature of exporting one’s own activity data is technically non-trivial, but of obvious relevance in the current climate where a lot of emphasis is put on transparency, while personal information can be distributed in many different and isolated systems. Beyond the general climate, we have also shown that activity data is not only relevant as aggregated at the level of an organisation, but can give a new perspective when individual users are kept visible in the data (see this post for an explanation of what we mean here). To put it simply, giving people a view on their activity data provides a way for them to reflect on it, and to become more efficient in these activities. It also gives them an opportunity to engage with the data and “customize” it, with added value for the organisation.
  • On Semantic Technologies: We have a lot of experience working with ontologies and semantic data, and were therefore confident that there was a great potential here. However, this is probably the point on which most people external to the project would think we had the best chance to fail: we believed that we could apply semantic technologies, linked data-based approaches and (most horribly) ontology-based reasoning to the manipulation, processing and analysis of activity data. Realising the experiments, setting up the UCIAD platform with real, large-scale data, applying ontologies on top of these data and evolving these ontologies to create different views for the analysis of these data are, from my very personal point of view, the most interesting parts of the project. Ontologies have recently acquired a bad reputation, and mentioning them (especially in the context of activity data) now often leads to raised eyebrows and condescending looks. One thing that our experiments with UCIAD have shown is that working with ontologies not only has the advantages of introducing formality, shared vocabularies and semantics in our applications, but also represents a flexible and efficient way of introducing meaningful views into large amounts of raw, uninterpreted data. What ontologies bring into such an area is the ability to give definitions that will be at the basis of clustering and organising the data automatically. I can tell what I mean by a “search activity” and magically see all the traces related to search activities being put together, to become explorable and queryable (see our post on reasoning). The nice thing about UCIAD is that this magic is actually implemented and working in the way we hypothesized it would. It is a fascinating thing to see raw data from log files being classified into meaningful categories of activities, resources and actors. It is even more fascinating knowing that we defined these groups, through encoding these definitions in an ontology, and can add others as we see fit. Due to time constraints, we could only experiment a tiny bit with this process, but we see a very promising approach in the incremental definition of the ontology as an analysis process: looking at the data, thinking that it would make sense to have an activity category such as, for example, “commenting on a blog”, and simply adding it to see the data being automatically reorganised with this new definition.

Fails – What went wrong

  • On the user-centric view: Our biggest failure in my opinion has been that we didn’t manage to communicate appropriately on the notions, approaches and change of perspective that the user-centric view on activity data represents. There are many reasons for this I believe, one being that we have been assuming that the benefits would be self-evident, while they clearly are not (see the post where we tried to get back to the basics of the issue). The notion of user-centric data or consumer data might be very trendy, but that does not mean that it is ready for wide adoption. There are many issues that need to be solved that go far beyond the purely technical aspects, and that simply come from the fact that activity data has never been looked at in this way before. We don’t really know what will happen in this space, what users would do with these data and how much interest this could generate for the organisation. There are many difficult questions that we could not really address in the scope of the project (including in particular the questions around data ownership and privacy). While this is enough to keep us excited, there is enormous work to be done before the approach we have been promoting in UCIAD could reach its potential, and be widely adopted.
  • On Semantic Web technologies: While we are still excited about the added value that semantic web technologies can bring to the analysis of activity data, we have clearly been over-optimistic regarding the maturity of some components we have been relying on, and their ability to handle the scale and complexity of the kind of data we are working with. This issue is clearly summarised in our post on the technical aspect of UCIAD. The good news is however that things are evolving very quickly. It would be a lot easier to implement the UCIAD platform now than it was 6 months ago, as the tools and platforms to deal with semantic data are getting more robust every day. Also, the evolution of the technology should be followed by an evolution in the skills and ability of the community to adopt such technologies. Realising UCIAD made us reach a better understanding of what was feasible and required to set up a semantic platform for activity data. There is still much to do for such an approach to become feasible in a broader set of situations.

Technical and Standards

August 2nd, 2011

We have been talking about our technical approach and some of the components we used in a few previous posts. Here we summarise the major technical components of the architecture of the UCIAD platform (see figure below) as well as the tools we have reused. We also come back to the biggest technical issue we had to face: scalability, especially when applying Semantic Web technologies to very large log data.

As described before, one of the core principles of UCIAD is the use of Semantic Web technologies, especially ontologies and Semantic Web languages. The goal of the architecture above is to extract from logs homogeneous representations of the traces of activity data present in logs and store them in a common semantic store so that they can be accessed and queried by the user.

The representation format we use for this activity data is RDF – the Resource Description Framework, which is the standard representation format for the Semantic Web and linked data. It is a graph-based data model where the nodes are either literal values or URIs: Web addresses of “data objects”. One advantage of RDF is that the graph data model provides a lot of flexibility in manipulating the data and extending it. RDF is also supported by many different tools and systems. More importantly for us, the schema used for RDF is an ontology, represented in OWL, the Web Ontology Language, which allows flexibility and granularity in representing the data, but also, as a logical formalism, makes it possible to apply simple inference mechanisms (namely classification) so that “definitions” can be added to the ontology to cluster and group traces of activities automatically (see our post regarding the development of the UCIAD ontologies and the reasoning mechanisms applied with these ontologies).
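
As a rough illustration of this graph model, one log request could be broken down into subject-predicate-object triples along the following lines. This is only a sketch in plain Python: the `ex:` prefix, the property names and the sample values are placeholders invented here, not the actual terms of the UCIAD ontologies.

```python
# One hypothetical request, as it might be parsed from a single log line.
request = {
    "ip": "192.0.2.10",
    "agent": "Mozilla/5.0",
    "url": "http://data.example.org/page",
    "code": 200,
    "size": 5120,
    "time": "2011-08-01T10:15:00Z",
}

def request_to_triples(req, trace_id, setting_id):
    """Turn one request into (subject, predicate, object) triples."""
    trace = f"ex:trace/{trace_id}"
    setting = f"ex:setting/{setting_id}"  # a setting groups an IP and an agent
    return [
        (trace, "rdf:type", "ex:Trace"),
        (trace, "ex:hasSetting", setting),
        (setting, "ex:ip", req["ip"]),
        (setting, "ex:agent", req["agent"]),
        (trace, "ex:onResource", req["url"]),
        (trace, "ex:responseCode", req["code"]),
        (trace, "ex:responseSize", req["size"]),
        (trace, "ex:time", req["time"]),
    ]

triples = request_to_triples(request, trace_id=1, setting_id="s1")
for triple in triples:
    print(triple)
```

Nodes that recur across requests (here the setting, and in practice also URLs or websites) are what connect the individual traces into one graph.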

The core component of our architecture is the semantic store (or triple store). Here we use OWLIM, which has three main advantages for us:

  • (In principle) it scales to very large amounts of data (although, this has proved to be an optimistic view, see below).
  • It provides efficient inference mechanisms for a fragment of OWL (which mostly covers our needs)
  • Together with the Sesame interface, it provides standard data access and querying mechanisms through the SPARQL protocol.

As mentioned several times in the past, however, despite OWLIM providing a very good base, the scale of the data we have had to handle generated major issues, which introduced a lot of delays in the project. Activity data, in the form of traces from logs, are enormous. OWLIM claims to be able to handle tens to hundreds of billions of RDF triples (connections in the graph), but there are a number of circumstances that need to be considered.

To get an idea of the scale we are talking about, we consider a reasonable web server at the Open University (e.g., the one used to serve data.open.ac.uk). This server would serve a few million requests per month. Each request (summarised in one line in the logs) is associated with a number of different pieces of information that we will re-factor in terms of our ontologies, concerning the actor (IP, agent), the resource (URL, website it is attached to, server), the response (code, size) and other elements (time, referrer). One of the things with using a graph representation is that it is not optimised for size. We therefore can obtain anything between 20 and 50 triples per request. That leads us to something in the order of 100 million triples per month per server (each server can host many websites).
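
The order of magnitude above can be checked with a few lines of arithmetic. The figure of 3 million requests per month is an assumption standing in for “a few million”; the 20 to 50 triples per request come from the estimate above.

```python
def triples_per_month(requests_per_month, triples_per_request):
    """Number of triples generated by one server in one month."""
    return requests_per_month * triples_per_request

requests = 3_000_000  # assumed value for "a few million requests per month"
low = triples_per_month(requests, 20)
high = triples_per_month(requests, 50)

print(f"{low:,} to {high:,} triples per month per server")
# prints "60,000,000 to 150,000,000 triples per month per server"
```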

In theory, OWLIM should handle this sort of scale easily, even if we consider several servers over several months. However, there are a number of things that make the practice different from the theory:

  • OWLIM might be able to store many billions of triples, but not any kind of triples. The data we are uploading to OWLIM is complex, and has a refined structure. Some objects (user settings, URLs) would appear very connected, while others would only appear in one request, and share only a few connections. From our experience, it is not only the number of triples that should be considered, but also the number of objects. A graph where each object is only associated with 1 other object through 1 triple might be a lot more difficult to process than one with as many triples, but shared amongst significantly fewer nodes.
  • Many triples, but not all at once. This is another very important element for us: OWLIM might be able to “hold” many triples, but it does not mean that they can all be uploaded and processed at the same time. Loading triples into the store takes a lot of resources, and too many triples at the same time might overwhelm it and make it crash. To deal with this, we had to change our process, which originally loaded the log files for an entire month at once, into one where we extracted every day the log information for the previous day.
  • The two previous issues are amplified when inference mechanisms are applied. OWLIM handles inferences at loading time. This means not only that the number of triples uploaded onto the store is multiplied through inference, but also that immensely more resources are required at the time of loading these triples, depending not only on the size of what is uploaded, but also on its complexity (and, as mentioned above, our data is complex) and on the size of what is already stored. Originally, our approach was to have one store holding everything with inferences, and to extract from this store data for each user. We changed this approach to one where the store that keeps the entire dataset extracted from logs does not make use of inference mechanisms. Data extracted for each user (see our post on user management) is then transferred into another (necessarily smaller) store to which inferences apply.
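
The two process changes described in the list above (loading one day at a time, and keeping a large store without inference next to a small per-user store) can be sketched as follows. This is a schematic illustration in plain Python, with lists standing in for the stores; the record layout and the function names are assumptions, and no OWLIM or Sesame calls are shown.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical pre-parsed log records: (timestamp, ip, agent, url).
records = [
    (datetime(2011, 7, 1, 9, 0), "192.0.2.10", "Mozilla/5.0", "/a"),
    (datetime(2011, 7, 1, 18, 30), "198.51.100.7", "Bot/1.0", "/b"),
    (datetime(2011, 7, 2, 8, 15), "192.0.2.10", "Mozilla/5.0", "/c"),
]

def batches_by_day(recs):
    """Yield (day, records) pairs in date order, one batch per day."""
    by_day = defaultdict(list)
    for rec in recs:
        by_day[rec[0].date()].append(rec)
    for day in sorted(by_day):
        yield day, by_day[day]

def extract_for_user(store, settings):
    """Keep only records produced through one of the user's (ip, agent) settings."""
    return [rec for rec in store if (rec[1], rec[2]) in settings]

# Large store: everything from the logs, loaded day by day, no inference.
main_store = []
for day, batch in batches_by_day(records):
    main_store.extend(batch)  # one modest load per day instead of a month at once

# Small store: one user's data only; this is where inference would be applied.
user_store = extract_for_user(main_store, {("192.0.2.10", "Mozilla/5.0")})
print(len(main_store), "records overall,", len(user_store), "for this user")
```

The point of the split is that the expensive step (inference at loading time) only ever runs over the much smaller per-user extract.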

There are a number of other reasons why dealing with semantic data still requires a certain amount of “trial and error”. There is an element of magic in it, not only because when it works, it allows more flexibility and automation than other types of data management approaches, but also because making it work often requires following processes (rituals) that are not completely understood.

Closing on a positive note however, since we started the project, a new version of OWLIM has been released (4.1), which provides significant improvements over the previous versions. The system now seems better able to load large amounts of data in one go, and also to manage the resources available more cleverly. It also now supports the SPARQL 1.1 query language, which includes aggregation functions, making some of the analysis tasks we are applying easier and less resource-consuming.

Licensing & reuse of software and data

July 31st, 2011

Deciding on licensing and data distribution is always challenging when talking about data which are intrinsically personal: activity data. Privacy issues are of course relevant here. We cannot distribute openly, or even on a proprietary basis, data that relate to users’ actions and personal data on our systems. Anonymisation approaches exist that are supposed to make users unidentifiable in the data. Such approaches however cannot be applied in UCIAD for two main reasons:

  • Such anonymisation mechanisms are only guaranteed in very closed, controlled environments. In particular, they assume that it is possible to completely characterise the dataset, and that integration with other datasets will not happen. These are two assumptions that we can’t apply to our data, as they are always evolving (in ways that might make established parameters for anonymisation suddenly invalid) and they are meant to be integrated with other data.
  • The whole principle of the project is to distribute the data to the user it concerns, which means that the user is at the center of the data. Anonymising data related to one user while giving it back to this user of course makes no sense. More generally, anonymisation mechanisms are based on aggregating data into abstracted or averaged values so that individual users disappear. This is obviously in contradiction with the approach taken in UCIAD.

The issue with licensing data in UCIAD is actually even more complicated: what licence to apply to data exported for a particular user? The ownership of the data is not even clear in this case. It is data collected and delivered by our systems, but that are produced out of the activities of the user. We believe that in this case, a particular type of licence, one that gives control to the user over the distribution of their own data, but without opening it completely, is needed. This is an area that we will need to put additional work into, with possibly useful results coming out of the mydata project.

Of course, despite this very complicated issue, more generic components of UCIAD can be openly distributed. These include the UCIAD ontologies, as well as the source code of the UCIAD platform, manipulating data according to these ontologies.

Benefits

July 29th, 2011

One of the major issues (which is going to be discussed at greater length in the “Wins and Fails” post in the next few days) of the approach taken in UCIAD is communicating its benefits. One reason is that, to be fully honest, the mechanisms and the whole perspective we are taking on activity data are still too ‘experimental’ for us to fully understand these benefits yet. The other aspect of this is that at the core of our approach is a focus on the benefits of activity data to the end-user and not, as would usually be the case, to the organisation. We therefore here quickly come back to what we have learnt on the advantages of our approach, first to the end-users, and then deriving potential benefits to the organisation. We summarise our view on the achievements of UCIAD in terms of benefits through a discussion regarding the success of the project, seen as an experiment towards ontology-based, user-centric activity data.

Benefits to the end-user

There have been a number of places where the potential benefits of user-centric data (or consumer data) have been discussed, generally labelled as “giving back their data to the users”. These include in particular the popular article “Show Us the Data. (It’s Ours, After All.)” by Richard H. Thaler in the New York Times. As was argued in particular in one of our previous posts, being able to give a complete account of what end-users could do with such data is both unfeasible and undesirable. However, we can summarise the expected benefits, and their connections to the work done in UCIAD, in three different areas:

  • Know yourself… and be more efficient: As we briefly discussed in our post on self-tracking, there is currently a trend of individuals monitoring their own activities, statuses, etc. While some would criticise such an attitude as pure narcissism, the reality is that monitoring oneself has been recognised as an effective way to improve. In sport for example, monitoring performance in relation to other variables (health status, equipment used, etc.) is necessary to improve and to achieve the best conditions for the best results. Beyond sport, however, there are many areas where monitoring and understanding one’s own behaviour can help one be more efficient. There is a large gap between an athlete measuring his/her performance and a user monitoring his/her online activities. However, for a user to know how he/she searched websites, found and exploited resources on the Web or engaged with online communities can only have a positive effect on his/her effectiveness in carrying out these tasks in the future.
  • Exploit your own data yourself: Besides the passive monitoring of activities, consumer data has often been described as exploitable by individuals. Indeed, in the current situation, organisations collect large amounts of data about their users, which they exploit to their own benefit, often for commercial purposes. Such personal data and profiles are being used and accessed by a large variety of agents, from the search engine that sends personalised results to the advertiser that targets you with specific products: by everyone, that is, except the users themselves. For the users to have access to, control over and possibly ownership of their own data means that they could also exploit them, using them to build their own profiles that can be employed in communicating with other entities on the Web, on their own terms. In a more directly pragmatic way, the users can analyse their own data and build on top of them to extract relevant information to their own benefit. In UCIAD, we not only allow users to export their own data, but we do it using Semantic Web standards to ensure maximum reusability. Because we rely on a customisable ontology, the exported data can be flexibly adapted to any kind of use that the user might come up with, not only the ones that we have thought of.
  • Combine and integrate your own data: While we are still far from such a situation at this stage, we can easily imagine that, with the explosion of the number of systems providing an “export your own data” feature, users will eventually be able to build their own personal knowledge base, feeding it with personal data collected from the many online systems they use. Again, such a scenario requires a certain level of standardisation in the data representation formats being used, for which Semantic Web technologies appear to be perfect candidates. A possibly less distant scenario is the one where users interacting with several organisations would export their activity data from the corresponding instances of the UCIAD platform. These data would naturally integrate to provide the users with the ability to monitor, analyse and exploit their activity data across numerous, originally disconnected organisations and websites.

Benefits to the organisation

As explained earlier, one of the core aspects of UCIAD has been to focus on the benefits to the end-user of collecting and flexibly interpreting activity data. This does not mean that the organisation has no interest in considering the type of technology we have been developing, but simply that the benefits to the organisation mostly derive from providing benefits to its end-users:

  • Transparency: In very simple terms, users are increasingly pushing organisations towards more accountability with respect to the data they collect about them. Deploying the UCIAD platform can be seen as a way for an institution to tell users “here is what we have about you in terms of activity data”.
  • Trust: In relation to the point above on transparency, providing collected data back to the users is a way to establish a stronger relationship with them: i.e., one where they can trust the organisation regarding the fair and transparent use of their activity data.
  • Leave data management to the user: Leaving the user in control of their own data can bring valuable benefits to the organisation. In particular, it means that the user can allow, or actively enable, the use of more data than is possible when he/she is left out of the loop. It makes it possible, for example, for users to import data they have collected from other systems and organisations, so that the same data does not have to be collected again, and the new organisation does not have to start from scratch.

How do we measure success?

So, now that we have listed all the expected benefits of the approach taken in UCIAD, the natural next question is “have we managed to bring all these benefits to our institution?”. The plain and honest answer is: No.

From the start, we have considered UCIAD as being an experiment (and actually, a rather short one). What we wanted to demonstrate was that:

  1. These benefits are achievable
  2. Technologies such as linked data and ontologies make the approach feasible

The UCIAD platform demo, collecting log data from several webservers concerning around a dozen websites, interpreting this data in terms of user-activity, extracting the traces of activities around a given user and exposing the user to these traces in a meaningful way, provides an undeniable demonstration that the technical and technological mechanisms to achieve the UCIAD approach are applicable and effective.

We are currently demonstrating this platform to users of the Open University websites, and observing them engaging with it, and so with their own activity data. This activity will carry on for some time after the end of the project so that we can learn as much as possible from the current state of the platform. However, from these initial discussions, it clearly appears that users are interested in, and sometimes even fascinated by, the idea of obtaining and using their own activity data. As has been happening for many systems outside UCIAD (e.g., Google, Facebook), they are very positive about such features being added to the websites of an organisation they spend so much time interacting with: their University. In many cases now, they are demanding it.

User (Management)

July 27th, 2011

In the previous post, we explained to a certain extent what our motivations are for looking at a user-centric approach to activity data, and especially what we expect the benefits to the users to be. We also quickly sketched some specific aspects of identifying and processing user-specific information in our post regarding the reasoning processes employed in UCIAD. Here, we come back more generally to the aspects related to users and user management in the UCIAD platform, including the way to recognise a user, handle registrations and logins, manage and present the information about the user’s activity and handle access rights over semantic data. The actual prototype of the UCIAD platform implementing all these elements is currently being finalised, and will be described more completely in our final post.

Identifying and managing users of UCIAD

The information the UCIAD platform has regarding users can be seen as similar to that available to basic analytics systems. The user is rarely seen directly, as the interaction is mediated through a “user agent”: a software programme running on a particular computer. Each HTTP request is associated with the ID of the user agent making it, and the IP address of the corresponding computer. Analytics systems have long recognised that the combination of these two parameters is sufficient to recognise a user with a reasonable level of accuracy. The disadvantage however is that the same user can be using different agents (e.g., different browsers) and different computers (or even mobile phones) to access the Web.

In UCIAD, we have the advantage that it is very likely that users will connect to the UCIAD platform using the same agents and computers they usually use to access the Web, and especially the websites under consideration. As shown in the mock-up screenshot above, the “settings” the user is using can be detected at the time of logging in, and be attached to the user account. These settings are then used to aggregate all the activity data produced using the same computer and user agent, which are added to the set of activity data for that particular user.

In addition, this provides a convenient mechanism to aggregate activity carried out on different computers and with different settings. The user can log in to the UCIAD platform again with a different browser, or a different device. When that happens, as described in the figure below, the current setting is simply added to the list of known settings for this user, and contributes another set of activity data around this particular user.
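As an illustration only, the settings mechanism described above can be sketched in a few lines of Python. This is not the UCIAD implementation: the names `Setting`, `Account` and `traces_for`, and the dictionary layout of a log entry, are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Setting:
    """One (IP address, user agent) pair seen at login time (hypothetical model)."""
    ip: str
    user_agent: str


@dataclass
class Account:
    """A platform account together with the settings it has logged in from."""
    name: str
    settings: set = field(default_factory=set)

    def login(self, ip: str, user_agent: str) -> None:
        # Logging in from a new browser or device adds a new known setting;
        # logging in again from a known one changes nothing (it is a set).
        self.settings.add(Setting(ip, user_agent))


def traces_for(account: Account, log: list) -> list:
    """Return the log entries whose (ip, user_agent) is a known setting."""
    return [
        entry for entry in log
        if Setting(entry["ip"], entry["user_agent"]) in account.settings
    ]
```

Each new login widens the set of log entries attributed to the account, which is the aggregation effect described above. The sketch inherits the limitation of the approach: two people sharing a computer and a browser would be indistinguishable.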

As explained in the post about reasoning on user-centric activity data, managing the activity data regarding a particular user corresponds to creating a sub-graph of the complete graph of raw activity data we collect from logs, based on the information about the known settings of the user. This sub-graph is then registered in our repository, and the next step is to ensure that the information being provided is restricted to the graph of the logged-in user.

Managing access rights over semantic data

We store, manipulate and reason over activity data using Semantic Web technologies, namely RDF, a triple store with inference capabilities and SPARQL for querying. As part of the UCIAD platform, we needed a mechanism to restrict the queries being sent to only the part of the data that the current user has access to: his/her own subgraph of activity data.

Unfortunately, most current triple stores, and especially the one we are employing, do not provide sufficiently fine-grained access control mechanisms that would allow sub-graphs to be associated with particular users. We therefore implemented our own mechanism, which can be seen as a generic recipe for access control over activity data.

The whole idea is actually quite simple (as depicted in the diagram above): the actual SPARQL endpoint collecting all the data for all the users is hidden using standard security measures so that it can only be accessed by our own system. We then implement a “proxy SPARQL endpoint” that handles basic HTTP authentication. When receiving a query, this proxy endpoint checks the credentials of the user and determines which sub-graphs the user has access to, so that it can modify the query to restrict it to these sub-graphs only (using the FROM clause in SPARQL). It then sends the query to the real, hidden SPARQL endpoint and forwards the results back to the user.
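The query-rewriting step of such a proxy could look roughly like the following Python sketch. It is a deliberately naive illustration, not the code of the UCIAD proxy: the `ACL` table, the graph URIs and the function name `restrict_query` are hypothetical, and the rewriting assumes simple queries with an explicit WHERE keyword rather than parsing SPARQL properly.

```python
import re

# Hypothetical access-control table: user name -> graphs that user may read.
ACL = {
    "alice": ["http://example.org/graph/alice"],
    "bob": ["http://example.org/graph/bob"],
}

_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)
# Conservative check: any occurrence of FROM or GRAPH is refused, even inside
# a literal or a variable name, so a query can never pick its own graphs.
_DATASET = re.compile(r"\b(FROM|GRAPH)\b", re.IGNORECASE)


def restrict_query(query: str, user: str) -> str:
    """Insert FROM clauses limiting the query to the user's own sub-graphs."""
    graphs = ACL.get(user)
    if not graphs:
        raise PermissionError(f"no graph accessible to {user!r}")
    if _DATASET.search(query):
        raise ValueError("queries may not choose their own graphs")
    match = _WHERE.search(query)
    if match is None:
        raise ValueError("expected an explicit WHERE keyword")
    from_clauses = "".join(f"FROM <{g}> " for g in graphs)
    # The dataset clauses go between the SELECT part and the WHERE keyword.
    return query[:match.start()] + from_clauses + query[match.start():]
```

In a full proxy, the user name would come from the HTTP authentication step and the rewritten query would then be forwarded to the hidden endpoint. A production version would be better served by a real SPARQL parser than by regular expressions.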

While this mechanism is relatively simple, it offers an appropriate level of flexibility, making it possible to define arbitrary sub-graphs and users as a model for access control. It is actually nice to see how, based on basic authentication mechanisms, the same queries asking for activity data return different results, depending on the user who is connected.

What users anyway?

Of course, the mechanisms and techniques to manage, identify and process information about users do not answer the question of who they are and what benefits they can get from the system. Actually, as argued before, it is pretty hard to predict in advance what use users will make of their own activity data once it is given back to them. General arguments can be given on the advantages of self-tracking, but in reality, the really important thing is that what is provided by the system has to stay open for any use. Working with the development version of the UCIAD platform, we find it quite fascinating that we, as individual users, can trace back our activities, drill down into specific categories (e.g., search, commenting on blogs, checking the price of a course), send queries which might only be relevant to us (e.g., “how much did I use data.open.ac.uk on Sundays?”), etc. It helps us understand our own use of the resources provided by the University, and so become more efficient with them.

Explaining user-centric activity data

July 5th, 2011

I was today at the meeting of the JISC activity data programme, where all the projects in the programme came to discuss what they were doing, and what the priorities should be for the coming year(s). As some might have realised, I am actually a bit critical of this sort of discussion. Not that I think that the projects are doing the wrong things, just that there is a lot of catching up to do, and I think we might end up missing the next train (which I believe to be consumer data) while trying to catch up with the previous one (activity data-based recommender systems).

Anyway, I was trying to come up with a reasonable explanation of user-centric activity data (mostly based on showing evidence of the current trends in the industry, from energy providers showing users historic information on their own consumption to the Google Data Liberation Front and the mydata project) when the ongoing discussion drifted to the definition of simple things such as the notion of event. Trying to define the concepts we are talking about is the major goal of our ontologies. However, the discussion made me realise that we also needed a simplified overview of the kind of data we are dealing with, and of what makes the difference between the organisation-centric view and the user-centric view of activity data.

Indeed, looking at the figure above, we can summarise very simply what we are dealing with in terms of activity data. Activity data is a set of events (or the traces of these events) where an action is carried out on a resource (e.g., a webpage) by an actor (most often a user). That is a general view of what we mostly have to consider as raw activity data. However, looking at the raw collection of individual events isn’t going to give us much: in order to extract anything meaningful from this data, we need to abstract it into sets of events that are meaningful, and whose distributions of characteristics can be interpreted.

The figure above represents the most common way of abstracting activity data: what we call the organisation-centric view. The idea is to analyse large sets of events carried out by aggregated sets of users. There can be one set of users, as in the case of analytics systems that provide statistics regarding the actions of all visitors to a website, or the organisation can define sub-groups such as Students/Staff/External that are meaningful to the particular types of activities and analyses being considered. In this case, users stop existing individually in the abstracted activity data, as they only manifest as part of the aggregated statistics for their groups.
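To make the organisation-centric abstraction concrete, here is a minimal Python sketch. The `Event` record follows the actor/action/resource description given above, but the field names, the `organisation_view` function and the group table are assumptions made purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A raw trace: an action carried out on a resource by an actor."""
    actor: str
    action: str
    resource: str


def organisation_view(events, groups):
    """Count events per (group, action); individual actors disappear."""
    counts = Counter()
    for event in events:
        # Actors with no known group are counted as "External" (an assumption).
        group = groups.get(event.actor, "External")
        counts[(group, event.action)] += 1
    return counts
```

The result only contains group-level statistics: once the events have been aggregated, nothing in it refers to an individual user any more.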

User-centric activity data is basically making the abstraction the other way around (see above): aggregating traces of activities around a given user, interpreted according to meaningful sets of resources and events. The challenge in this case (apart from the scalability of the approach, which is going to be the topic of another blog post some time) lies in the way to define meaningful sets of resources and events. In the data we have been looking at, activities such as “commenting on a blog”, “searching a blog”, “querying linked data” or “using a web application” are clearly emerging, but the number and nature of the types of resources and events that can appear in the data are largely dependent on the system and the user. This is why we believe that using ontologies as a model to drive such abstractions is a good solution: it provides us with a flexible way to define types of resources (e.g., BlogPage, RSS feed, Linked Data endpoint) and the corresponding activities (e.g., commenting, querying, searching), and to automatically classify individual traces and resources into these types. The end result is the ability for individual users to visualise and analyse the distribution of their own activity data across these types and categories. Pushing it a step further, users should even be able to personalise the views, giving their own ontological definitions and obtaining data abstractions that are therefore more meaningful to them.
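The user-centric abstraction can be sketched in the same spirit. In the Python example below, plain predicates stand in for ontology class definitions; this is a simplification for illustration, not the reasoning mechanism used in UCIAD, and the type names, URL patterns and function names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A raw trace: an action carried out on a resource by an actor."""
    actor: str
    action: str
    resource: str


# Stand-in for ontology definitions: each activity type is a membership test.
# A user could personalise the view by supplying a different table.
ACTIVITY_TYPES = {
    "CommentingOnBlog": lambda e: e.action == "POST" and "/blog/" in e.resource,
    "SearchingBlog": lambda e: (
        e.action == "GET" and "/blog/" in e.resource and "?s=" in e.resource
    ),
    "QueryingLinkedData": lambda e: "/sparql" in e.resource,
}


def classify(event, types=ACTIVITY_TYPES):
    """Return every activity type the event belongs to, or ["Other"]."""
    matched = [name for name, test in types.items() if test(event)]
    return matched or ["Other"]


def user_view(events, actor, types=ACTIVITY_TYPES):
    """Distribution of one user's traces over the activity types."""
    counts = Counter()
    for event in events:
        if event.actor == actor:
            counts.update(classify(event, types))
    return counts
```

Passing a different `types` table to `user_view` corresponds, very loosely, to a user providing their own definitions in order to obtain a more personal view of the same traces.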

A colleague forwarded me today this article in French, where the author says (my translation): “What could I accomplish if I had at my disposal, in an exploitable form, the information regarding my pathways and communications? [...] Not only to control what others are doing with it, but to use it to my own benefit? Today, we tend to scratch our head and ask: what would be the use of that?”, and indeed we don’t really know what this will allow in the future. However, as the author of the article suggests, that shouldn’t stop us from trying to find out, as long as we are convinced there is something there to explore.