Collecting and Processing Personal Activity Data

June 5th, 2012

I seem to say this all the time, but we are currently in one of the busiest periods of the project: processing data. An enormous and very tiring part of the early work went into collecting the data, which explains our relative silence lately. Here we discuss some of the lessons we can already draw from what we have done.

What Data?

As explained earlier, the goal of the project is to investigate the idea of user-centric activity data: what users would do with it, and what it would imply in terms of organisational policies. We will do this by collecting information about the use of the various websites of the Open University by a dozen users over a period of 4 weeks.

We selected users so that they represent a wide variety of roles in the organisation: students (mostly post-graduate), associate lecturers, researchers, admin and support staff. With the help of the IT services of the OU, we then created a script that extracted the log entries for these users (based on their identifiers) for all the concerned websites (intranet, virtual learning environment, general websites, etc.).
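To give an idea of what such an extraction involves, here is a minimal sketch. It is not the script we actually ran with IT services. It assumes the logs are in the Common Log Format, where the authenticated user is the third whitespace-separated field. The function name `extract_user_entries` and the identifiers in the usage example are purely hypothetical.

```python
from typing import Iterable, Iterator, Set


def extract_user_entries(
    lines: Iterable[str],
    user_ids: Set[str],
    user_field: int = 2,  # assumption: Common Log Format, user is the 3rd field
) -> Iterator[str]:
    """Yield only the log lines whose user field is one of the participants."""
    for line in lines:
        fields = line.split()
        # Skip malformed or short lines rather than failing on them.
        if len(fields) > user_field and fields[user_field] in user_ids:
            yield line


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        '10.0.0.1 - u123 [05/Jun/2012:10:00:00 +0100] "GET /vle HTTP/1.1" 200 512',
        '10.0.0.2 - u999 [05/Jun/2012:10:00:01 +0100] "GET /intranet HTTP/1.1" 200 128',
    ]
    for entry in extract_user_entries(sample, {"u123"}):
        print(entry)
```

Filtering on an explicit set of identifiers means that only the log entries of the volunteers are kept and nothing about other users leaves the servers.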

The organisational process of collecting “personal” data

The most difficult part of collecting data such as those we are considering is clearly not technical. We naturally had to ensure that the selected users enrolled voluntarily and understood what we were going to do with the data, and what was expected from them.

Prior to that, however, we had to spend a lot of time obtaining approval from various parts of the Open University: the ethics committee, the student research project committee, IT security, the data protection coordinator, etc.

In short, we ran into a cycle where we were directed from one committee to the other, until the situation finally resolved. The first lesson learned here is not to underestimate the inefficiency of organisational structures, and to plan for this very early and with the most pessimistic view of the time it will take. It is a part of the project that very much illustrates Hofstadter’s Law: “It always takes longer than you expect, even when you take into account Hofstadter’s Law.”

This is tricky, however, as it turns out that the best time to do these things, according to the official processes, is at proposal time. It is obvious, however, that if we were to do that, we would never submit any proposal… Generally, the whole thing is rather frustrating and gives the impression that the whole purpose of these committees is to prevent things from happening.

Another important aspect of getting approval from such committees is that they don’t understand anything about what we do! It did escape us a tiny bit, but it is obvious. These are groups of people looking at research across the whole of the university, with no technical background in what we do, and our topic is clearly confusing: it is doing research directly on personal data, not just having implications for personal data. The need to be very “pedagogical”, and to explain as clearly as possible that there is no way any evil can come out of our research, is clearly a challenge.

Visualising personal activity data

In order to explore the use of user-centric activity data, we are investigating an interface for personal web analytics, which is similar to a web analytics tool such as Google Analytics or Piwik, but where the relationship between the organisation, the user and the data is inverted.

In the first phase of the project we developed an initial version of such an interface, but it was not intuitive or effective enough. We are now developing a new version which uses a more effective query engine and a lot of pre-processing of the data, so that it can actually be used by a variety of users.

UCIAD II interface mock up

What the data already told us

At the moment, we are processing the data. What we can already tell, however, is how the role a person takes clearly impacts the way they use university websites. The size of the log entries collected for the different users alone tells us that: students don’t really do much, and only on very specific websites (virtual learning environment); researchers and lecturers do a bit more, on other specific websites (tutor system, expense claim system); while admin staff generate a lot more logs on a wider variety of systems. The remaining question is how these roles will also impact the way people would use their own activity data, and how the multiple roles people might have, their different personae, interact and should be supported.
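The kind of comparison described above can be sketched as a simple count per role. This is only an illustration under assumptions, not our actual processing pipeline. It supposes each log entry has already been reduced to a (user identifier, website) pair and that we have a mapping from identifiers to roles. The names `summarise_by_role`, `entries` and `roles` and the toy values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarise_by_role(
    entries: Iterable[Tuple[str, str]],  # assumed shape: (user_id, website)
    roles: Dict[str, str],               # assumed mapping: user_id -> role
) -> Dict[str, Tuple[int, int]]:
    """Return, per role, (number of log entries, number of distinct websites)."""
    counts = defaultdict(int)
    sites = defaultdict(set)
    for user_id, website in entries:
        role = roles.get(user_id, "unknown")
        counts[role] += 1
        sites[role].add(website)
    return {role: (counts[role], len(sites[role])) for role in counts}


if __name__ == "__main__":
    # Toy values for illustration only, not project data.
    toy_entries = [("s1", "vle"), ("s1", "vle"), ("a1", "intranet"), ("a1", "expenses")]
    toy_roles = {"s1": "student", "a1": "admin"}
    print(summarise_by_role(toy_entries, toy_roles))
```

Counting distinct websites alongside raw volume separates the two observations made above: how much a role uses the systems, and how many different systems it touches.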

The data will tell us more when we put it in front of the users within the next couple of weeks…