Collecting and Processing Personal Activity Data

June 5th, 2012

I seem to be saying this all the time, but we are currently in one of the busiest periods of the project: processing data. An enormous and very tiring part of the early work went into collecting data, which explains our relative silence lately. Here we discuss some of the lessons we can already draw from what we have done.

What Data?

As explained earlier, the goal of the project is to investigate the idea of user-centric activity data: what users would do with it and what it would imply in terms of organisational policies. We are doing this by collecting information about the use of the various websites of the Open University by a dozen users over a period of 4 weeks.

We selected users so that they represent a wide variety of roles in the organisation: students (mostly post-graduate), associate lecturers, researchers, admin and support staff. With the help of the IT services of the OU, we then created a script that extracted the log entries for these users (based on their identifiers) for all the concerned websites (intranet, virtual learning environment, general websites, etc.).
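To make the extraction step concrete, here is a minimal sketch of what such a script could look like. It is not the script we actually ran: the Apache-style log layout, the position of the user identifier in the line, and the identifiers in the sample are all assumptions made for the illustration.

```python
import re

# Assumed Apache-style layout: host ident user [time] "request" status size.
# The real OU logs may be laid out differently.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)'
)


def extract_user_entries(lines, user_ids):
    """Yield (user, line) for each log line whose user field is in user_ids."""
    wanted = set(user_ids)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group(3) in wanted:
            yield match.group(3), line


# Hypothetical sample entries and identifiers, for illustration only.
sample = [
    '10.0.0.1 - ab123 [05/Jun/2012:10:00:00 +0100] "GET /vle/course HTTP/1.1" 200 512',
    '10.0.0.2 - zz999 [05/Jun/2012:10:00:05 +0100] "GET /intranet HTTP/1.1" 200 128',
    '10.0.0.3 - - [05/Jun/2012:10:00:09 +0100] "GET / HTTP/1.1" 200 64',
]
selected = list(extract_user_entries(sample, {"ab123"}))
```

Only entries carrying one of the enrolled identifiers are kept; anonymous requests and other users' requests are dropped before anything is stored.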

The organisational process of collecting “personal” data

The most difficult part of collecting data such as the data we are considering is clearly not technical. We naturally had to ensure that the selected users enrolled voluntarily and understood what we were going to do with the data, and what was expected from them.

Prior to that, however, we had to spend a lot of time obtaining approval from various parts of the Open University: the ethics committee, the student research project committee, IT security, the data protection coordinator, etc.

In short, we ran into a cycle where we were directed from one committee to the other, until the situation finally resolved. The first lesson learned here is not to underestimate the inefficiency of organisational structures, and to plan for it very early, with the most pessimistic view of the time it will take. It is a part of the project that very much illustrates Hofstadter’s Law: “It always takes longer than you expect, even when you take into account Hofstadter’s Law.”

This is tricky, however, as it turns out that the best time to do these things, according to the official processes, is at proposal time. It is obvious, however, that if we were to do that, we would never submit any proposal… Generally, the whole thing is rather frustrating and gives the impression that the whole purpose of these committees is to prevent things from happening.

Another important aspect of getting approval from such committees is: they don’t understand anything about what we do! It did escape us a tiny bit, but it is obvious. These are groups of people looking at research across the whole of the university, with no technical background in what we do, and our topic is clearly confusing: it is doing research directly on personal data, not just having implications for personal data. The need to be very “pedagogical”, and to explain as clearly as possible that there is no way any evil can come out of our research, is clearly a challenge.

Visualising personal activity data

In order to explore the use of user-centric activity data, we are investigating an interface for personal web analytics, which is similar to web analytics tools such as Google Analytics or Piwik, but where the relationship between the organisation, the user and the data is inverted.

In the first phase of the project we developed an initial version of such an interface, but it was not really intuitive or effective enough. We are now developing a new version which uses a more effective query engine and a lot of pre-processing of the data, so that it can actually be used by a variety of users.

UCIAD II interface mock up

What the data already told us

At the moment, we are processing the data. What we can already tell, however, is how clearly the role a person takes affects the way they use university websites. The size of the logs collected for the different users alone tells us that: students don’t really do much, and only on very specific websites (virtual learning environment); researchers and lecturers do a bit more, on other specific websites (tutor system, expense claim system); while admin staff generate a lot more logs on a wider variety of systems. The remaining question is how these roles will also affect the way people would use their own activity data, and how the multiple roles people might have, their different personae, interact and should be supported.

The data will tell us more when we put it in front of the users within the next couple of weeks…

User (Management)

July 27th, 2011

In the previous post, we explained to a certain extent what our motivations are for looking at a user-centric approach to activity data, and especially what we expect the benefits to the users to be. We also quickly sketched some specific aspects of identifying and processing user-specific information in our post regarding the reasoning processes employed in UCIAD. Here, we come back more generally to the aspects related to users and user management in the UCIAD platform, including the way to recognise a user, treat registrations and logins, manage and present the information about the user’s activity, and handle access rights over semantic data. The actual prototype of the UCIAD platform implementing all these elements is currently being finalised, and will be described more completely in our final post.

Identifying and managing users of UCIAD

The information the UCIAD platform has regarding users can be seen as similar to what basic analytics systems have. The user is rarely seen directly, as the interaction is mediated through a “user agent”: a software programme running on a particular computer. Each HTTP request is associated with the ID of the user agent making it, and the IP address of the corresponding computer. Analytics systems have long recognised that the combination of these two parameters is sufficient to recognise a user with a reasonable level of accuracy. The disadvantage, however, is that the same user can be using different agents (e.g., different browsers) and different computers (or even mobile phones) to access the Web.
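As a rough illustration of this idea, the sketch below groups requests by the (IP address, user agent) pair, so that each pair stands in for one presumed user. The record layout and the sample values are hypothetical, not taken from our logs.

```python
from collections import defaultdict


def group_by_setting(requests):
    """Group requested URLs by the (ip, agent) pair that issued them."""
    groups = defaultdict(list)
    for req in requests:
        groups[(req["ip"], req["agent"])].append(req["url"])
    return dict(groups)


# Hypothetical request records, for illustration only.
requests = [
    {"ip": "10.0.0.1", "agent": "Firefox/5.0", "url": "/vle"},
    {"ip": "10.0.0.1", "agent": "Firefox/5.0", "url": "/intranet"},
    {"ip": "10.0.0.1", "agent": "Chrome/12.0", "url": "/vle"},
]
by_setting = group_by_setting(requests)
```

The third request shows the weakness mentioned above: the same person switching browsers on the same machine appears as a second, unrelated "user".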

In UCIAD, we have the advantage that it is very likely that the user will connect to the UCIAD platform using the same agents and computers they usually use to access the Web, and especially the considered websites. As shown in the mock-up screenshot above, the “settings” the user is using can be detected at the time of logging in, and be attached to the user account. These settings will then be used to aggregate all the activity data that has been generated using the same computer and user agent, and add it to the set of activity data for the particular user.

In addition, this provides a convenient mechanism to aggregate activity generated on different computers and with different settings. The user can log in again to the UCIAD platform with a different browser, or a different device. When that happens, as described in the figure below, the current setting will simply be added to the list of known settings for this user, and contribute another set of activity data around this particular user.

As explained in the post about reasoning on user-centric activity data, managing the activity data regarding a particular user corresponds to creating a sub-graph of the complete graph of raw activity data we collect from logs, based on the information about the known settings of the user. This graph is then registered in our repository, and the next step is to ensure that the information being provided is restricted to the graph of the logged-in user.
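The following sketch shows the logic of accumulating settings at login and selecting a user's share of the activity data. It is a deliberate simplification: plain Python records stand in for the RDF graph, and the class and function names are made up for the example rather than taken from the UCIAD code.

```python
class UserAccount:
    """A user account holding the settings seen at each login."""

    def __init__(self, name):
        self.name = name
        self.settings = set()  # known (ip, agent) pairs

    def login(self, ip, agent):
        # Each login from a new browser or device adds one more setting.
        self.settings.add((ip, agent))


def user_activity(account, records):
    """Return the records generated with any of the account's known settings."""
    return [r for r in records if (r["ip"], r["agent"]) in account.settings]


# Hypothetical raw activity data, for illustration only.
records = [
    {"ip": "10.0.0.1", "agent": "Firefox/5.0", "url": "/vle"},
    {"ip": "10.0.0.7", "agent": "Safari/5.0", "url": "/intranet"},
    {"ip": "10.0.0.9", "agent": "Chrome/12.0", "url": "/expenses"},
]

account = UserAccount("alice")
account.login("10.0.0.1", "Firefox/5.0")
first_view = user_activity(account, records)

# Logging in from a second device widens the user's own set of activity data.
account.login("10.0.0.7", "Safari/5.0")
second_view = user_activity(account, records)
```

In the platform itself the selected records would form the user's sub-graph in the triple store rather than a Python list.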

Managing access rights over semantic data

We store, manipulate and reason over activity data using Semantic Web technologies, namely RDF, a triple store with inference capabilities and SPARQL for querying. As part of the UCIAD platform, we needed a mechanism to restrict the queries being sent to only the part of the data that the current user has access to: his/her own subgraph of activity data.

Unfortunately, most current triple stores, and especially the one we are employing, do not provide sufficiently fine-grained access control mechanisms that would allow us to associate sub-graphs with particular users. We therefore implemented our own mechanism, which can be seen as a generic recipe for access control over activity data.

The whole idea is actually quite simple (as depicted on the diagram above): the actual SPARQL endpoint collecting all the data for all the users is hidden using standard security measures so that it can only be accessed by our own system. We then implement a “proxy SPARQL endpoint” that can handle basic HTTP authentication. When receiving a query, this proxy endpoint checks the credentials of the user and sees what sub-graphs the user has access to, so that it can modify the query to restrict it to these sub-graphs only (using the FROM clause in SPARQL). It can then send the query to the real, hidden SPARQL endpoint and forward the results back to the user.
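The query-rewriting step can be sketched as follows. This is a naive, string-based illustration rather than our actual proxy: the access table, graph URIs and function name are hypothetical, the query is assumed to contain an explicit WHERE keyword, and any query that already names its own dataset is simply rejected. A robust proxy would parse the query properly, and the HTTP authentication and forwarding parts are left out.

```python
import re

# Hypothetical access-control table: user name -> graphs that user may read.
USER_GRAPHS = {
    "alice": ["http://example.org/graph/alice"],
    "bob": ["http://example.org/graph/bob"],
}


def restrict_query(query, user):
    """Return the query with FROM clauses limiting it to the user's graphs."""
    graphs = USER_GRAPHS.get(user)
    if not graphs:
        raise PermissionError(f"no accessible graph for user {user!r}")
    # Conservative check: refuse queries that try to choose their own dataset.
    if re.search(r"\bFROM\b", query, flags=re.IGNORECASE):
        raise ValueError("queries must not specify their own dataset")
    match = re.search(r"\bWHERE\b", query, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("expected an explicit WHERE keyword")
    from_clauses = "".join(f"FROM <{graph}>\n" for graph in graphs)
    # Insert the FROM clauses just before the WHERE keyword.
    return query[:match.start()] + from_clauses + query[match.start():]


query = "SELECT ?page WHERE { ?visit <http://example.org/onPage> ?page }"
alice_query = restrict_query(query, "alice")
bob_query = restrict_query(query, "bob")
```

The same input query yields two different restricted queries, one per user, which is what makes the results depend on who is connected.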

While this mechanism is relatively simple, it offers an appropriate level of flexibility, allowing us to define arbitrary sub-graphs and user definitions as a model for access control. It is actually nice to see how, based on basic authentication mechanisms, the same queries asking for activity data will return different results, depending on the user who is connected.

What users anyway?

Of course, the mechanisms and techniques to manage, identify and process information about users do not answer the question of who they are and what benefits they can get from the system. Actually, as argued before, it is pretty hard to predict in advance what use will be made of giving users back their own activity data. General arguments can be given on the advantages of self-tracking, but in reality, the really important thing is that what is provided by the system has to stay open for any use. Working with the development version of the UCIAD platform, we find it quite fascinating that we, as individual users, can trace back our activities, drill down into specific categories (e.g., search, commenting on blogs, checking the price of a course), send queries which might only be relevant to us (e.g., “how much did I use on sundays?”), etc. It helps us understand our own use of the resources provided by the University, and so become more efficient with them.