Personal Analytics

November 22nd, 2012

We had to spend a bit of time trying to find a better name for what we do than “User Centric Activity Data”. The UCIAD study report settled for “consumer activity data”, in relation to consumer data as considered, for example, in midata. However, for the tools that allow us to expose users to their own activity data (such as the UCIAD dashboard), we used the possibly more appropriate term “personal analytics”. And personal analytics has received quite a lot of attention lately.

It started with Stephen Wolfram posting an article with a lot of analytics about his online activities. This has made quite a buzz, showing how (rather simple) computational techniques and visualisations could be quite revealing to an individual. As a follow-up from that, a personal analytics feature was added to Wolfram|Alpha, using the Wolfram|Alpha engine to provide analyses and visualisation of your activity on Facebook.

Another such tool, currently very much at “prototype” level, is the MOLUTI Chrome extension (see it on the Chrome Web Store). It shows you some interactive visualisations of your web history in the Chrome browser. What is interesting here is the simple mechanism it provides to filter activities (e.g., on which website did I look up “child stair gates” last weekend?), and also that it makes it possible to share the results of these filters, in the form of what it calls “browse lists” (lists of links with a tag cloud).

We argued in the past about the use of such tools, and for the advantages users might have from having access to ways to understand and query their own activities. No doubt that this area has a very bright future, as the need for these personal analytics tools can only grow with the increase in online activities.

The UCIAD User Study: Report

August 14th, 2012

Our goal for the second phase of the UCIAD project was to investigate, through a user study, how people would actually use their own activity data if they were to get access to it, and how such access would impact the organisation that has been collecting the data. We achieved that by building a technical architecture, collecting data from Open University log systems and exposing them to the corresponding users through a compelling interface. We collected reactions and thoughts from the users through individual interviews, questionnaires and a “focus group” meeting confronting the varying ideas and opinions about the overall notion of user-centric activity data (or as we might call them, in a simpler way, consumer activity data).

The results of this user study, together with more details on the methodology we employed and the technical platform are now summarised in a complete, self-contained report:


Consumer Activity Data: Usages and Challenges

This report is licensed under Creative Commons Attribution and the source code of most of the technological platform is available as open source software.

Collecting and Processing Personal Activity Data

June 5th, 2012

I seem to be saying that all the time, but we are currently in one of the busiest periods of the project: processing data. An enormous and very tiring part of the beginning was spent on collecting data, which explains our relative silence lately. We discuss here some of the lessons we can already draw from what we have done.

What Data?

As explained earlier, the goal of the project is to investigate the idea of user-centric activity data, what users would do with it and what it would imply in terms of organisational policies. The way we will realise that is by collecting information about the use of the various websites of the Open University by a dozen users over a period of 4 weeks.

We selected users so that they can represent a wide variety of roles in the organisations: students (mostly post-graduate), associate lecturers, researchers, admin and support staff. With the help of the IT services of the OU, we then created a script that extracted the log entries for these users (based on their identifiers) for all the concerned websites (intranet, virtual learning environment, general websites, etc.)

The organisational process of collecting “personal” data

The most difficult part of collecting data such as the one we are considering is clearly not technical. We naturally had to ensure that the selected users enrolled voluntarily and understood what we were going to do with the data, and what was expected from them.

Prior to that, however, we had to spend a lot of time obtaining approval from various parts of the Open University: the ethics committee, the student research project committee, IT security, the data protection coordinator, etc.

In short, we ran into a cycle, where we were directed from one committee to the other, until the situation finally resolved. The lesson learned here is not to underestimate the inefficiency of organisational structures, and to plan for this very early and with the most pessimistic view on the time it will take. It is a part of the project that very much illustrates Hofstadter’s Law: “It always takes longer than you expect, even when you take into account Hofstadter’s Law.”

This is tricky, however, as it turns out that the best time to do these things, according to the official processes, is at proposal time. It is obvious, however, that if we were to do that, we would never submit any proposal… Generally, the whole thing is rather frustrating and gives the impression that the whole purpose of these committees is to prevent things from happening.

Another important aspect of getting approval from such committees is: they don’t understand anything about what we do! It did escape us a tiny bit, but it is obvious. These are groups of people looking at research across the whole of the university, with no technical background in what we do, and our topic is clearly confusing: it is doing research directly on personal data, not just having implications for personal data. The need to be very “pedagogical”, and to explain as clearly as possible that there is no way any evil can come out of our research, is clearly a challenge.

Visualising personal activity data

In order to explore the use of user-centric activity data, we are investigating an interface for personal web analytics, which is similar to a web analytics tool such as Google Analytics or Piwik, but where the relationship between the organisation, the user and the data is inverted.

In the first phase of the project we developed an initial version of such an interface, but it was not really intuitive or effective enough. We are now developing a new version which uses a more effective query engine and a lot of pre-processing of the data, so that it can actually be used by a variety of users.

UCIAD II interface mock up

What the data already told us

At the moment, we are processing the data. What we can already tell, however, is how the role a person takes clearly impacts the way they use university websites. The size of the logs collected for the different users alone tells us that: students don’t really do much, and only on very specific websites (virtual learning environment); researchers and lecturers do a bit more, on other specific websites (tutor system, expense claim system); while admin staff generate a lot more logs on a wider variety of systems. The remaining question is how these roles will also impact the way people would use their own activity data, and how the multiple roles people might have, their different personae, interact and should be supported.

The data will tell us more when we put it in front of the users within the next couple of weeks…

Why I don’t believe in the personal data economy

May 1st, 2012

I care a lot about personal data (especially mine), and yep, my and everybody else’s personal data are all over the place. They are all over the place because they are valuable. They are all over the place because I (and you) don’t have much control over the way they are being collected, exchanged and exploited.

Now, what I conclude from that is that we need more control (and, of course, I would not pretend that I’m the only one thinking that, but that’s not the point). Apparently, what other people conclude from that is that you should be selling your personal data: that there is such a thing as a “personal data economy” where, instead of giving it away, you can decide on a price for your personal data. This is frequently being promoted as the latest, new, brilliant idea on the Web, most recently from HP Labs: “A Stock Exchange for Your Personal Data”.

Why didn’t we think about that before?

Well, quite simply because we did. Numerous times. And it did not work.

Have you heard of things like the Attention Recorder, from the Attention Trust, that was going to make the attention economy the big thing of… hurmm… 2007? Well, me neither.

Why I do not believe in this thing is quite simply because I don’t understand how it could work. It sounds really naive, mostly for three reasons:

  1. The fact that my personal data are valuable does not mean I want to sell them. Apparently, my organs are valuable. There is a market with people who would be more than happy to harvest them and get some benefits from that. To me, they are just essential. They don’t have value from an economic perspective. They have value because I can’t exist and function without them. In other words, having market value does not make something a commercialisable personal asset. You would have to be seriously naive to think that it does, and you would need very strong evidence to convince me that this is the way things work.
  2. Putting a price on it does not prevent anyone from getting it for free. That’s one part I have never really understood. Personal data is being collected from us, and this is actually necessary for quite a few things to function. If this is the case, why exactly would anybody want to pay for it? You can argue that this is exactly the point: they should not have access to it for free. But how is it that this is going to change? This is not a trick question: I truly don’t understand this!
  3. Wouldn’t it actually kill the personal data economy? At the risk of appearing to contradict myself, I would argue that this personal data economy exists already, but not in the form it is envisaged in this “personal data stock-market” trend. We register for online services and allow them somehow to collect data from us, in exchange for what they have to offer. Personal data are not the product, they are the currency! We buy stuff on the Web with our personal data, and actually most of us are reasonably happy about most of what we get out of the deal. We could stretch this metaphor further, and talk about “personal data banks” (some do) and “personal data currency exchange”. But the real point here is that changing the current “market” to a model where personal data is a product being sold would change the balance of this economy: personal data stop having the quality of a currency, i.e. having the same value for everybody. Now, here is the trick: if personal data stop having the same value for everybody, they start being biased; and if they start being biased, they stop having value to the people who exploit them.

Now of course, I probably misunderstood something somehow, and maybe all these points have been addressed (in which case, when can I start making money by telling you how much I weigh? That reminds me: I should stop telling things to my friends, just in case). Until that is confirmed though, I would rather work on being able to understand and control the way I spend my personal data online.

Personal Activity Data: Another Project

March 21st, 2012

UCIAD is about the integration and analysis of activity data originating from the logs of different websites of an organization, using the knowledge the organization has about these websites to provide users with ways to analyze their own online interactions with the organization. In another project called DATAMI (funded by the IKS Project), we are investigating how activity data from the whole Web traffic generated by a user can be semantically analyzed to extract ‘entities’ of interest to the user, in a “personal, semantic web history dashboard”.

A first, preliminary (video) demo of this application has been released today, which shows the potential of the technologies developed in this project:

This result is of course not dissimilar to the video produced at the end of phase 1 of UCIAD, and clearly, the user-study we are currently setting-up for phase 2 will provide valuable results also for the DATAMI project.

UCIAD: The Second Phase

February 1st, 2012

Today is the start of the second phase of the UCIAD project, also known (not very excitingly) as UCIAD-II. This extension of the project will run until the end of June 2012, and will mostly involve me, as director and manager, and Keerthi Thomas, as our privacy/technology expert.

Why extend UCIAD?

Well, the answer to that would appear quite obvious to a number of people. Not carrying on the work and pushing it further would have actually been more surprising. Indeed, UCIAD investigated the challenges related to integrating traces of activities on an organisation’s websites, and presenting them in a user-centric way. It showed how such an approach creates a new set of technical issues, and how semantic technologies (including linked data and ontologies for data integration and clustering) can be used to tackle these issues.

Our goal here is to validate this initial insight, establish the scenarios in which the user-centric analysis of activity data will be employed in the future and devise recommendations for the types of policy that will be needed in relation to these scenarios. In other words, now that we understand the technology, we need to understand the usage, and the implications. This is quite exciting considering that, even given the current trend in user-centric/consumer data, these aspects have not been studied before, while they will clearly become crucial in the next couple of years (if not months).

So, what are we going to do?

Generally, UCIAD II is looking at two complementary aspects of the general idea of user-centric activity data (i.e., giving back to users the data about their own activity):

  1. What are the concrete scenarios in which users can benefit from having access to, understanding of and control over their own activity data?
  2. What are the changes in terms of organisations’ policies on data access, data protection, data licensing and privacy that are made necessary by such approaches?

The goal of UCIAD II is therefore to provide early-stage answers to these questions through a study, realised with a group of users (students and staff) of the Open University websites. In a nutshell, we will give these users early dedicated access to the evolving set of tools prototyped in UCIAD, populated with their activity data on the Open University’s websites. We will record and reflect with them on their usage of these tools and their reactions to them, in order for them to act as a ‘focus group’ to establish what the access to their own activity data could enable, and what can be judged acceptable in terms of the organisation’s policies on managing these data.

What’s next?

As with UCIAD, we will use this blog to report on and discuss the progress of the project. We have a lot on our plate already, starting with enrolling participants and organising the first interview campaigns. More on that will appear soon!

Final post – Putting things together (with a demo)

August 5th, 2011

Over the last 6 months we have been working on building the UCIAD platform, experimenting with large-scale activity data, reflecting on user-centric data and blogging our thoughts. While, as can be seen from the last few posts on this blog, there is quite some work we think should follow from this, it is nice to see things finally coming together, and to be able to show a bit of the UCIAD platform we have been talking about for some time. What better way to do this than with a video of the running platform, showing the different components in action? (Note: it is better to watch it in 720p – HD).

This video shows a user (me) registering to the UCIAD platform with some setting details and browsing his activity data as they appear on several Open University websites (mostly, an internal wiki system and the Open University’s linked data platform – data.open.ac.uk). This video therefore integrates in a working demo the different components we have been talking about here:

  • User management: As we can see here, as the user registers into the UCIAD platform, his current setting is automatically detected, and other settings (other browsers) that are likely to be his are also included. As the user registers, the settings are associated to his account and the activity data realised through these settings are extracted.
  • Extracting user-centric activity data: As described in the first part of the blog post on reasoning (previous link), the settings associated with the user are used to extract the activity data around this particular user, creating a sub-graph corresponding to his activity.
  • Ontologies to make sense of activity data: The ontologies are used in structuring the data according to a common schema and to provide a base to homogeneously query data coming from different systems. As discussed below, they can also be extended (specified) so that different categories of activities and resources can be represented, and reasoned upon.
  • Ontological reasoning for analysis: What the demo video shows clearly is how the activity data is organised according to different categories (traces, webpages, websites, settings, etc.) coming from the base ontologies, but also according to classes of activities, resources, etc. that have been specially added to cover the websites and the particular user in this case. Here, we extended the ontology in order to include definitions of activities relevant to the use of a wiki and a data platform. The powerful aspect of using ontologies here is that such classes can be added to the ontology for the system to automatically process them and organise the data according to them. Here, for example, we define “Executing a SPARQL Query” as an activity that takes place on a SPARQL endpoint with a “query” parameter, or “Checking Wiki Updates” as an activity on a Wiki page that is realised through an RSS client.
  • Browsing data according to ontologies: We haven’t described this component yet, but we rely on a homemade “browser” that we use in a number of projects and that can inspect ontology classes and members of these classes, generating graphs and simple stats.
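
The classification idea in the list above can be pictured with a small sketch. This is not the UCIAD implementation, which relies on OWL class definitions and a reasoner: here plain Python predicates stand in for the ontology definitions, and the `Trace` class, its field names and the `resource_type` labels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse, parse_qs


@dataclass
class Trace:
    """One request taken from a log (hypothetical, simplified structure)."""
    url: str
    agent: str
    resource_type: str  # e.g. "sparql_endpoint" or "wiki_page" (assumed labels)


def is_sparql_query(trace: Trace) -> bool:
    # Activity on a SPARQL endpoint whose URL carries a "query" parameter.
    params = parse_qs(urlparse(trace.url).query)
    return trace.resource_type == "sparql_endpoint" and "query" in params


def is_wiki_update_check(trace: Trace) -> bool:
    # Activity on a wiki page realised through an RSS client
    # (detected here, as an assumption, from the user agent string).
    return trace.resource_type == "wiki_page" and "rss" in trace.agent.lower()


# Each "definition" is a name plus a membership test. Adding a category
# means adding an entry here; traces are regrouped on the next run.
DEFINITIONS = {
    "Executing a SPARQL Query": is_sparql_query,
    "Checking Wiki Updates": is_wiki_update_check,
}


def classify(trace: Trace) -> list[str]:
    """Return the names of every activity class the trace belongs to."""
    return [name for name, test in DEFINITIONS.items() if test(trace)]
```

The point of the sketch is only the shape of the mechanism: definitions are data, so a new class of activity can be added without touching the code that groups the traces.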

Next steps

There are a lot of things to mention here, some of which we have already mentioned several times. An obvious one is the finalisation, distribution and deployment of the UCIAD platform. A particular element we want to get done in the short term is to investigate the use of the UCIAD platform with various users, to see what kind of extensions of the ontologies would be commonly useful, and generally to get some insight into the reaction of users when being exposed to their own data.

More generally, we think that there is a lot more work to do both on user-centric activity data and on the use of ontologies for the analysis of such data, as described in particular in our Wins and Fails post. This includes aspects around the licensing, distribution and general management of user-centric data (as mentioned in our post on licensing). Indeed, while “giving back data to the users” is already technically difficult, there is currently a lot of fuzziness around the issues of ownership of activity data. This also forces us to look a lot more carefully at the privacy challenges that such data can generate, which didn’t exist when these data were held and stayed on server logs.

Beyond UCIAD and the Open University

As discussed in our post on the benefits of UCIAD, the issues considered go largely beyond the Open University and even activity data. The issues around licensing in particular are to be considered more broadly, in the same way as the challenges around communicating on user-centric data.

We have been focusing mostly on the technical issues in UCIAD, providing in this way a base framework to start investigating these broader and more complex challenges.

Most significant lessons

To put it simply, the most significant lessons we learnt (as mentioned in the wins and fails post) are:

  • Both user-centric data and ontologies are complex notions, so don’t assume they are understood.
  • Activity data are massive and complex, beyond what can be handled by current semantic data infrastructures, without a bit of clever management.
  • There is a lot of potential in using ontologies and ontological engineering for the analysis and interpretation of raw data.

Wins and fails (lessons along the way)

August 3rd, 2011

One thing I like about the JISC activity data programme in which UCIAD is involved is that the instructions were very clear: your project is a short experiment, to see what could/should be done in the area of activity data in the context of higher education organisations (or at least, this is what I heard). We have integrated that a lot in UCIAD, starting from our two basic hypotheses: that a user-centric perspective on activity data is relevant, and that Semantic Web technologies, especially ontologies, provide the right technological basis to achieve such a perspective.

We have discussed in a number of previous posts what we got excited about, what showed us the feasibility, relevance and potential impact of our approach, as well as what unexpected issues we had to face and how some of our assumptions turned out to be wrong. Here, we wanted to give a quick summary of these “wins” and “fails”, starting of course from the wins, and looking at the two aspects corresponding to our two hypotheses: the user-centric view and the semantic technologies view.

Wins – What went right

  • On the user-centric view: Giving data back to the user, user-centric data and consumer data were already emerging trends when we started the project, but they clearly exploded in the last few months as topics that organisations should take into account. The New York Times article “Show Us the Data. (It’s Ours, After All.)” has in particular generated a lot of discussion amongst consumer representatives and “data-managers” in various organisations. The mydata project launched by the UK government is also a clear sign that the push for more transparency has to extend to private and public organisations dealing with people’s data. There have already been strong reactions from large companies such as Google, launching its own Data Liberation Front. Generally, users want (and will increasingly want), and assume the right, to access their data and to use them to their own benefit. Even the single feature of exporting one’s own activity data is technically non-trivial, but of obvious relevance in the current climate, where a lot of emphasis is put on transparency while personal information can be distributed across many different and isolated systems. Beyond the general climate, we have also shown that activity data is not only relevant as aggregated at the level of an organisation, but can give a new perspective when individual users are kept visible in the data (see this post for an explanation of what we mean here). To put it simply, giving people a view on their activity data provides a way for them to reflect on it, and to become more efficient in these activities. It also gives them an opportunity to engage with the data and “customize” it, with added value for the organisation.
  • On Semantic Web technologies: We have a lot of experience working with ontologies and semantic data, and were therefore confident that there was great potential here. However, this is probably the point on which most people external to the project would think we had the best chance to fail: we believed that we could apply semantic technologies, linked data-based approaches and (most horribly) ontology-based reasoning to the manipulation, processing and analysis of activity data. Carrying out the experiments, setting up the UCIAD platform with real, large-scale data, applying ontologies on top of these data and evolving these ontologies to create different views for the analysis of these data are, from my very personal point of view, the most interesting parts of the project. Ontologies have recently acquired a bad reputation, and mentioning them (especially in the context of activity data) now often leads to raised eyebrows and condescending looks. One thing that our experiments with UCIAD have shown is that working with ontologies not only has the advantages of introducing formality, shared vocabularies and semantics in our applications, but also represents a flexible and efficient way of introducing meaningful views into large amounts of raw, uninterpreted data. What ontologies bring to such an area is the ability to give definitions that form the basis for clustering and organising the data automatically. I can state what I mean by a “search activity” and magically see all the traces related to search activities being put together, to become explorable and queryable (see our post on reasoning). The nice thing about UCIAD is that this magic is actually implemented and working in the way we hypothesized it would. It is a fascinating thing to see raw data from log files being classified into meaningful categories of activities, resources and actors.
It is even more fascinating knowing that we defined these groups, by encoding their definitions in an ontology, and can add others as we see fit. Due to time constraints, we could only experiment a tiny bit with this process, but we see a very promising approach in the incremental definition of the ontology as an analysis process: looking at the data, thinking that it would make sense to have an activity category such as “commenting on a blog”, and simply adding it to see the data being automatically reorganised with this new definition.

Fails – What went wrong

  • On the user-centric view: Our biggest failure in my opinion has been that we didn’t manage to communicate appropriately on the notions, approaches and change of perspective that the user-centric view on activity data represents. There are many reasons for this I believe, one being that we have been assuming that the benefits would be self-evident, while they clearly are not (see the post where we tried to get back to the basics of the issue). The notion of user-centric data or consumer data might be very trendy, but that does not mean that it is ready for wide adoption. There are many issues that need to be solved that go far beyond the purely technical aspects, and that simply come from the fact that activity data has never been looked at in this way before. We don’t really know what will happen in this space, what users would do with these data and how much interest this could generate for the organisation. There are many difficult questions that we could not really address in the scope of the project (including in particular the questions around data ownership and privacy). While this is enough to keep us excited, there is enormous work to be done before the approach we have been promoting in UCIAD could reach its potential, and be widely adopted.
  • On Semantic Web technologies: While we are still excited about the added value that Semantic Web technologies can bring to the analysis of activity data, we have clearly been over-optimistic regarding the maturity of some components we have been relying on, and their ability to handle the scale and complexity of the kind of data we are working with. This issue is clearly summarised in our post on the technical aspect of UCIAD. The good news, however, is that things are evolving very quickly. It would be a lot easier to implement the UCIAD platform now than it was 6 months ago, as the tools and platforms to deal with semantic data are getting more robust every day. Also, the evolution of the technology should be followed by an evolution in the skills and ability of the community to adopt such technologies. Realising UCIAD gave us a better understanding of what was feasible and required to set up a semantic platform for activity data. There is still much to do for such an approach to become feasible in a broader set of situations.

Technical and Standards

August 2nd, 2011

We have been talking about our technical approach and some of the components we used in a few previous posts. Here we summarise the major technical components of the architecture of the UCIAD platform (see figure below) as well as the tools we have reused. We also come back to the biggest technical issue we had to face: scalability, especially when applying Semantic Web technologies to very large log data.

As described before, one of the core principles of UCIAD is the use of Semantic Web technologies, especially ontologies and Semantic Web languages. The goal of the architecture above is to extract from logs homogeneous representations of the traces of activity data present in logs and store them in a common semantic store so that they can be accessed and queried by the user.

The representation format we use for this activity data is RDF, the Resource Description Framework, which is the standard representation format for the Semantic Web and linked data. It is a graph-based data model where the nodes are either literal values or URIs: Web addresses of “data objects”. The advantage of RDF is that the graph data model provides a lot of flexibility in manipulating the data and extending it. RDF is also supported by many different tools and systems. More importantly for us, the schema used for RDF is an ontology, represented in OWL (the Web Ontology Language), which allows flexibility and granularity in representing the data, but also, as a logical formalism, makes it possible to apply simple inference mechanisms (namely classification) so that “definitions” can be added to the ontology to cluster and group traces of activities automatically (see our post regarding the development of the UCIAD ontologies and the reasoning mechanisms applied with these ontologies).

The core component of our architecture is the semantic store (or triple store). Here we use OWLIM, which has three main advantages for us:

  • (In principle) it scales to very large amounts of data (although, this has proved to be an optimistic view, see below).
  • It provides efficient inference mechanisms for a fragment of OWL (which mostly covers our needs)
  • Together with the Sesame interface, it provides standard data access and querying mechanisms through the SPARQL protocol.

As mentioned several times in the past, however, despite OWLIM providing a very good base, the scale of the data we have had to handle generated major issues, which introduced a lot of delays in the project. Activity data, in the form of traces from logs, are enormous. OWLIM claims to be able to handle tens to hundreds of billions of RDF triples (connections in the graph), but there are a number of circumstances that need to be considered.

To get an idea of the scale we are talking about, we consider a reasonable web server at the Open University (e.g., the one used to serve data.open.ac.uk). This server would serve a few million requests per month. Each request (summarised in one line in the logs) is associated with a number of different pieces of information that we will re-factor in terms of our ontologies, concerning the actor (IP, agent), the resource (URL, website it is attached to, server), the response (code, size) and other elements (time, referrer). One of the things with using a graph representation is that it is not optimised for size. We therefore can obtain anything between 20 and 50 triples per request. That leads us to something in the order of 100 million triples per month per server (each server can host many websites).
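
To make the arithmetic concrete, here is a rough sketch of how one log line becomes several triples, followed by the monthly estimate. It assumes the Apache “combined” log format (the post does not say which format the servers use), and the `ex:` predicate names are placeholders rather than terms from the UCIAD ontologies.

```python
import re

# Apache "combined" log format is assumed here.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<code>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def line_to_triples(line: str, request_id: str) -> list[tuple[str, str, str]]:
    """Turn one log line into flat (subject, predicate, object) triples.

    The predicates are illustrative placeholders, one per captured field.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return []  # line does not follow the assumed format
    return [(request_id, f"ex:{name}", value)
            for name, value in match.groupdict().items()]


def triples_per_month(requests: int, triples_per_request: int) -> int:
    """Monthly triple count for one server."""
    return requests * triples_per_request


# "A few million requests" is taken as 3 million purely for illustration.
low = triples_per_month(3_000_000, 20)   # 60 million
high = triples_per_month(3_000_000, 50)  # 150 million
```

This flat version produces only one triple per log field. The 20 to 50 triples per request mentioned above come from the richer structure of the real representation, where the actor, resource and response are objects in their own right with their own properties. With the assumed 3 million requests, the range of 60 to 150 million is consistent with the “order of 100 million triples per month per server” figure.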

In theory, OWLIM should handle this sort of scale easily, even if we consider several servers over several months. However, there are a number of things that make the practice different from the theory:

  • OWLIM might be able to store many billions of triples, but not any kind of triples. The data we are uploading to OWLIM is complex, and has a refined structure. Some objects (user settings, URLs) are highly connected, while others only appear in one request and share only a few connections. From our experience, it is not only the number of triples that should be considered, but also the number of objects. A graph where each object is only associated with one other object through one triple might be a lot more difficult to process than one with as many triples, but shared amongst significantly fewer nodes.
  • Many triples, but not all at once. This is another very important element for us: OWLIM might be able to “hold” many triples, but it does not mean that they can all be uploaded and processed at the same time. Loading triples into the store takes a lot of resources, and too many triples at the same time might overwhelm it and make it crash. To deal with this, we had to change our process, which originally loaded the log files for an entire month at once, into one where we extract every day the log information for the previous day.
  • The two previous issues are amplified when inference mechanisms are applied. OWLIM handles inferences at loading time. This means not only that the number of triples uploaded onto the store is multiplied through inference, but also that immensely more resources are required at the time of loading these triples, depending not only on the size of what is uploaded, but also on its complexity (and, as mentioned above, our data is complex) and on the size of what is already stored. Originally, our approach was to have one store holding everything with inferences, and to extract from this store the data for each user. We changed this approach to one where the store that keeps the entire dataset extracted from logs does not make use of inference mechanisms. Data extracted for each user (see our post on user management) is then transferred into another (necessarily smaller) store to which inferences apply.
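The two changes described above (loading in small batches, and keeping inference for a smaller per-user store) can be sketched as follows. This is only an illustration under strong assumptions: the stores are plain Python sets rather than OWLIM repositories, and the "hasActor" property and the URIs are hypothetical names, not our actual ontology.

```python
from itertools import islice

EX = "http://example.org/uciad/"  # hypothetical namespace
HAS_ACTOR = EX + "hasActor"       # hypothetical property


def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_in_batches(store, triples, batch_size):
    """Add triples to the store a batch at a time; return the number of loads."""
    loads = 0
    for batch in chunks(triples, batch_size):
        store.update(batch)  # stands in for one upload to the triple store
        loads += 1
    return loads


def extract_user(store, actor):
    """Collect the triples about one actor and the requests made by that actor."""
    requests = {s for (s, p, o) in store if p == HAS_ACTOR and o == actor}
    return {t for t in store if t[0] == actor or t[0] in requests}


# Main store: everything from the logs, no inference applied.
main_store = set()
day_of_triples = [
    (EX + "request/1", HAS_ACTOR, EX + "actor/alice"),
    (EX + "request/1", EX + "atTime", "2011-07-14T10:22:01"),
    (EX + "request/2", HAS_ACTOR, EX + "actor/bob"),
    (EX + "request/2", EX + "atTime", "2011-07-14T10:25:40"),
    (EX + "actor/alice", EX + "fromIP", "203.0.113.7"),
]
loads = load_in_batches(main_store, day_of_triples, batch_size=2)

# User store: a much smaller graph, on which inference can be afforded.
alice_store = extract_user(main_store, EX + "actor/alice")
```

The point of the design is that the expensive step (inference at loading time) only ever sees the small per-user graph, while the large store is loaded incrementally and without reasoning.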

There are a number of other reasons why dealing with semantic data still requires a certain amount of “trial and error”. There is an element of magic in it, not only because when it works, it allows more flexibility and automation than other types of data management approaches, but also because making it work often requires following processes (rituals) that are not completely understood.

Closing on a positive note, however: since we started the project, a new version of OWLIM has been released (4.1), which provides significant improvements over the previous versions. The system now seems better able to load large amounts of data in one go, and also to manage the available resources more cleverly. It also now supports the SPARQL 1.1 query language, which includes aggregation functions, making some of the analysis tasks we are applying easier and less resource-consuming.

Licensing & reuse of software and data

July 31st, 2011

Deciding on licensing and data distribution is always a challenge when talking about data which are intrinsically personal: activity data. Privacy issues are of course relevant here. We cannot distribute openly, or even on a proprietary basis, data that relate to users’ actions and personal data on our systems. Anonymisation approaches exist that are supposed to make users unidentifiable in the data. Such approaches, however, cannot be applied in UCIAD for two main reasons:

  • Such anonymisation mechanisms are only guaranteed in very closed, controlled environments. In particular, they assume that it is possible to completely characterise the dataset, and that integration with other datasets will not happen. These are two assumptions that we cannot make about our data, as they are always evolving (in ways that might make established parameters for anonymisation suddenly invalid) and they are meant to be integrated with other data.
  • The whole principle of the project is to distribute the data to the user they concern, which means that the user is at the centre of the data. Anonymising data related to one user while giving them back to this user of course makes no sense. More generally, anonymisation mechanisms are based on aggregating data into abstracted or averaged values so that individual users disappear. This is obviously in contradiction with the approach taken in UCIAD.

The issue with licensing data in UCIAD is actually even more complicated: what licence should apply to data exported for a particular user? The ownership of the data is not even clear in this case. The data are collected and delivered by our systems, but they are produced out of the activities of the user. We believe that in this case a particular type of licence is needed, one that gives users control over the distribution of their own data without opening the data completely. This is an area on which we will need to do additional work, with possibly useful results coming out of the mydata project.

Of course, despite this very complicated issue, more generic components of UCIAD can be openly distributed. These include the UCIAD ontologies, as well as the source code of the UCIAD platform, manipulating data according to these ontologies.