
Benefits

July 29th, 2011

One of the major issues of the approach taken in UCIAD (to be discussed at greater length in the “Wins and Fails” post in the next few days) is communicating its benefits. One reason is that, to be fully honest, the mechanisms and the whole perspective we are taking on activity data are still too ‘experimental’ for us to fully understand these benefits yet. The other is that the core of our approach is a focus on the benefits of activity data to the end-user and not, as would usually be the case, to the organisation. We therefore quickly come back here to what we have learnt about the advantages of our approach, first to the end-users, and then the potential benefits derived for the organisation. We summarise our view of the achievements of UCIAD in terms of benefits through a discussion of the success of the project, seen as an experiment towards ontology-based, user-centric activity data.

Benefits to the end-user

The potential benefits of user-centric data (or consumer data), generally labelled as “giving back their data to the users”, have been discussed in a number of places. These include in particular the popular article “Show Us the Data. (It’s Ours, After All.)” by Richard H. Thaler in the New York Times. As was argued in one of our previous posts, giving a complete account of what end-users could do with such data is both unfeasible and undesirable. However, we can summarise the expected benefits, and their connections to the work done in UCIAD, in three different areas:

  • Know yourself… and be more efficient: As we briefly discussed in our post on self-tracking, there is currently a trend of individuals monitoring their own activities, statuses, etc. While some would criticise such an attitude as pure narcissism, the reality is that monitoring oneself has been recognised as an effective way to improve. In sport for example, monitoring performance in relation to other variables (health status, equipment used, etc.) is necessary to improve and achieve the best conditions, for the best results. Beyond sport, however, there are many areas where monitoring and understanding one’s own behaviour can help one be more efficient. There is a large gap between an athlete measuring his/her performance and a user monitoring his/her online activities. However, for a user to know how he/she searched websites, found and exploited resources on the Web or engaged with online communities can only have a positive effect on his/her effectiveness in carrying out these tasks in the future.
  • Exploit your own data yourself: Besides the passive monitoring of activities, consumer data has often been described as exploitable by individuals. Indeed, in the current situation, organisations collect large amounts of data about their users, which they exploit to their own benefit, often for commercial purposes. Such personal data and profiles are being used and accessed by a large variety of agents, from the search engine that will send personalised results to the advertiser that will target you with specific products: everyone, that is, except the user him/herself. For the users to have access to, control over and possibly ownership of their own data means that they could also exploit them, using them to build their own profiles that can be employed in communicating with other entities on the Web, on their own terms. In a more directly pragmatic way, the users can analyse their own data and build on top of them to extract relevant information to their own benefit. In UCIAD, we not only allow users to export their own data, but we do it using Semantic Web standards to ensure maximum reusability. By relying on a customisable ontology, the exported data can be flexibly adapted to any kind of use that the user might come up with, not only the ones that we have thought of.
  • Combine and integrate your own data: While we are still far from such a situation at this stage, we can easily imagine that, with the explosion of the number of systems providing an “export your own data” feature, users will eventually be able to build their own personal knowledge base, feeding it with personal data collected from the many online systems they use. Again, such a scenario requires a certain level of standardisation in the data representation formats being used, for which Semantic Web technologies appear to be strong candidates. A possibly less distant scenario is the one where users interacting with several organisations would export their activity data from the corresponding instances of the UCIAD platform. These data would naturally integrate to provide the user with the ability to monitor, analyse and exploit their activity data across numerous, originally disconnected organisations and websites.

Benefits to the organisation

As explained earlier, one of the core aspects of UCIAD has been to focus on the benefits of collecting and flexibly interpreting activity data to the end-user. This does not mean that the organisation has no interest in considering the type of technology we have been developing, but simply that the benefits to the organisation mostly come as derived from providing benefits to the end-users of the organisation:

  • Transparency: In very simple terms, users are increasingly pushing organisations towards more accountability with respect to the data they collect about them. Deploying the UCIAD platform can be seen as a way for an institution to tell users “here is what we have about you in terms of activity data”.
  • Trust: In relation with the point above on transparency, providing collected data back to the user is a way to establish a stronger relationship with them: i.e., one where they can trust the organisation regarding the fair and transparent use of their activity data.
  • Leave data management to the user: Leaving the user in control of their own data can bring valuable benefits to the organisation. In particular, it means that the user can allow, or actively enable, the use of more data than what can be done when he/she is left out of the loop. It makes it possible for example for them to bring and import data they have collected from other systems and organisations, so that the same data does not have to be collected again, and the new organisation does not have to start from scratch.

How do we measure success?

So, now that we have listed all the expected benefits of the approach taken in UCIAD, the natural next question is “have we managed to bring all these benefits to our institution?”. The plain and honest answer is: No.

From the start, we have considered UCIAD as being an experiment (and actually, a rather short one). What we wanted to demonstrate was that:

  1. These benefits are achievable
  2. Technologies such as linked data and ontologies make the approach feasible

The UCIAD platform demo, collecting log data from several webservers concerning around a dozen websites, interpreting this data in terms of user-activity, extracting the traces of activities around a given user and exposing the user to these traces in a meaningful way, provides an undeniable demonstration that the technical and technological mechanisms to achieve the UCIAD approach are applicable and effective.

We are currently demonstrating this platform to users of the Open University websites, and observing them engaging with it, and so with their own activity data. This activity will carry on for some time after the end of the project so that we can learn as much as possible from the current state of the platform. However, from these initial discussions, it is clear that users are interested in, even sometimes fascinated by, the idea of obtaining and using their own activity data. They are, as has been happening for many systems outside UCIAD (e.g., Google, Facebook), very positive about such features being added to the websites of an organisation they spend so much time interacting with: their University. In many cases now, they are demanding it.

User (Management)

July 27th, 2011

In the previous post, we explained to a certain extent what our motivations are for looking at a user-centric approach to activity data, and especially what we expect the benefits to the users to be. We also quickly sketched some specific aspects of identifying and processing user-specific information in our post regarding the reasoning processes employed in UCIAD. Here, we come back more generally to the aspects related to users and user management in the UCIAD platform, including the way to recognise a user, treat registrations and logins, manage and present the information about the user's activity and handle access rights over semantic data. The actual prototype of the UCIAD platform implementing all these elements is currently being finalised, and will be described more completely in our final post.

Identifying and managing users of UCIAD

The information the UCIAD platform has regarding users can be seen as similar to what basic analytics systems have. The user is rarely seen directly, as the interaction is mediated through a “user agent”: a software programme running on a particular computer. Each HTTP request is associated with the ID of the user agent making it, and the IP address of the corresponding computer. Analytics systems have long recognised that the combination of these two parameters is sufficient to recognise a user with a reasonable level of accuracy. The disadvantage however is that the same user can be using different agents (e.g., different browsers) and different computers (or even mobile phones) to access the Web.

In UCIAD, we have the advantage that it is very likely that the user will connect to the UCIAD platform using the same agents and computers they usually use to access the Web, and especially the considered websites. As shown in the mock-up screenshot above, the “settings” the user is using can be detected at the time of logging in, and be attached to the user account. These settings will then be used to aggregate all the activity data that have been realised using the same computer and user-agent, and be added to the set of activity data for the particular user.

In addition, this provides a convenient mechanism to aggregate activity carried out on different computers and with different settings. The user can log in again to the UCIAD platform with a different browser, or a different device. When that happens, as described in the figure below, the current setting will simply be added to the list of known settings for this user, and contribute another set of activity data around this particular user.

As explained in the post about reasoning on user centric activity data, managing the activity data regarding a particular user corresponds to creating a sub-graph of the complete graph of raw activity data we collect from logs, based on the information about the known settings of the user. This graph is then being registered in our repository, and the next step is to ensure that the information being provided is restricted to the graph of the logged-in user.
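The settings mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hashing scheme used to name a setting, the `SettingRegistry` class and the sample IP addresses are all hypothetical, and the actual platform stores this information as RDF rather than in memory.

```python
import hashlib
from collections import defaultdict


def setting_id(ip: str, user_agent: str) -> str:
    """Derive a stable identifier for an (IP address, user agent) pair.

    Assumption: any deterministic hash would do; SHA-256 is used here
    purely for illustration.
    """
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()


class SettingRegistry:
    """Keeps the list of known settings for each user account."""

    def __init__(self):
        self.known = defaultdict(set)  # user -> set of setting ids

    def login(self, user: str, ip: str, user_agent: str) -> str:
        """Detect the current setting at login and attach it to the account."""
        sid = setting_id(ip, user_agent)
        self.known[user].add(sid)
        return sid

    def traces_for(self, user: str, traces):
        """Select the traces made through any known setting of the user.

        `traces` is an iterable of (trace_id, setting_id) pairs.
        """
        settings = self.known.get(user, set())
        return [trace for trace, sid in traces if sid in settings]
```

Logging in a second time from a different browser simply adds a second setting, and every trace recorded through either setting then counts towards the same user.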

Managing access rights over semantic data

We store, manipulate and reason over activity data using Semantic Web technologies, namely RDF, a triple store with inference capabilities and SPARQL for querying. As part of the UCIAD platform, we needed a mechanism to restrict the queries being sent to only the part of the data that the current user has access to: his/her own subgraph of activity data.

Unfortunately, most current triple stores, and especially the one we are employing, do not provide sufficiently fine-grained access control mechanisms that would allow sub-graphs to be associated with particular users. We therefore implemented our own mechanism, which can be seen as a generic recipe for access control over activity data.

The whole idea is actually quite simple (as depicted on the diagram above): the actual SPARQL endpoint collecting all the data for all the users is hidden using standard security measures so that it can only be accessed by our own system. We then implement a “proxy SPARQL endpoint” that can handle basic HTTP authentication. When receiving a query, this proxy endpoint will check the credentials of the user and see what sub-graphs the user has access to, so that it can modify the query to restrict it to these sub-graphs only (using the FROM clause in SPARQL). It can then send the query to the real, hidden SPARQL endpoint and forward the results back to the user.

While this mechanism is relatively simple, it offers an appropriate level of flexibility, allowing arbitrary subgraphs and user definitions to be used as a model for access control. It is actually nice to see how, based on basic authentication mechanisms, the same queries asking for activity data will return different results, depending on the user who is connected.
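To make the query-rewriting step of the proxy more concrete, here is a minimal Python sketch. It is not the UCIAD implementation: the `USER_GRAPHS` table, the graph URI and the function name are hypothetical, and the rewriting assumes that the incoming query uses an explicit WHERE keyword and does not use that word anywhere before it. Queries that already name their own dataset with FROM are rejected, since otherwise a user could point the query at someone else's graph.

```python
import re

# Hypothetical access-control table: user -> graphs that user may read.
USER_GRAPHS = {
    "mathieu": ["http://uciad.info/graph/mathieu"],
}

# Keywords are matched case-insensitively, and not when they are part of
# a variable name such as ?where or ?from.
_WHERE = re.compile(r"(?<![?$\w])where\b", re.IGNORECASE)
_FROM = re.compile(r"(?<![?$\w])from\b", re.IGNORECASE)


def restrict_query(query: str, user: str) -> str:
    """Insert one FROM clause per permitted graph just before WHERE."""
    graphs = USER_GRAPHS.get(user)
    if not graphs:
        raise PermissionError(f"no graphs registered for user {user!r}")
    if _FROM.search(query):
        raise ValueError("queries may not specify their own FROM clauses")
    match = _WHERE.search(query)
    if match is None:
        raise ValueError("expected an explicit WHERE keyword")
    from_clauses = "".join(f"FROM <{graph}>\n" for graph in graphs)
    return query[:match.start()] + from_clauses + query[match.start():]
```

The proxy would call something like `restrict_query` after checking the HTTP credentials, then forward the rewritten query to the hidden endpoint. A production version would parse the query properly rather than rely on regular expressions.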

What users anyway?

Of course, the mechanisms and techniques to manage, identify and process information about users do not answer the question of who they are and what benefits they can get from the system. Actually, as argued before, it is pretty hard to predict in advance what use users will make of their own activity data once it is given back to them. General arguments can be given on the advantages of self-tracking, but in reality, the really important thing is that what is provided by the system has to stay open for any use. Working with the development version of the UCIAD platform, we find it quite fascinating that we, as individual users, can trace back our activities, drill down into specific categories (e.g., search, commenting on blogs, checking the price of a course), send queries which might only be relevant to us (e.g., “how much did I use data.open.ac.uk on Sundays?”), etc. It helps us understand our own use of the resources provided by the University, and so become more efficient with them.

Explaining user-centric activity data

July 5th, 2011

I was today at the meeting of the JISC activity data programme, where all the projects in the programme came to discuss what they were doing, and what the priorities should be for the coming year(s). As some might have realised, I am actually a bit critical of this sort of discussion. Not that I think that the projects are doing the wrong things, just that there is a lot of catching up to do, and I think we might end up missing the next train (which I believe to be consumer data) while trying to catch up with the previous one (activity data-based recommender systems).

Anyway, I was trying to come up with a reasonable explanation of user-centric activity data (mostly based on showing evidence of the current trends in the industry, from energy providers showing users historic information on their own consumption to the Google Data Liberation Front and the mydata project) when the ongoing discussion drifted to the definition of simple things such as the notion of event. Trying to define the concepts we are talking about is the major goal of our ontologies. However, the discussion made me realise that we also needed a simplified overview of the kind of data we are dealing with, and of what makes the difference between the organisation-centric view and the user-centric view of activity data.

Indeed, looking at the figure above, we can summarise very simply what we are dealing with in terms of activity data. Activity data is a set of events (or the traces of these events) where an action is carried out on a resource (e.g., a webpage) by an actor (most often a user). That is a general view of what we mostly have to consider as raw activity data. However, looking at the raw collection of individual events isn’t going to give us much: in order to extract anything meaningful, we need to abstract the data into sets of events that are meaningful, and whose distributions of characteristics can be interpreted.

The figure above represents the most common way of abstracting activity data: what we call the organisation-centric view. The idea is to analyse large sets of events carried out by aggregated sets of users. There can be one set of users, as in the case of analytics systems that provide statistics regarding actions carried out by all visitors of a website, or the organisation can define sub-groups such as Students/Staff/External that are meaningful to the particular types of activities and analyses being considered. In this case, users stop existing individually in the abstracted activity data, as they only manifest as part of the aggregated statistics for their groups.

User-centric activity data basically makes the abstraction the other way around (see above): aggregating traces of activities around a given user, interpreted according to meaningful sets of resources and events. The challenge in this case (apart from the scalability of the approach, which is going to be the topic of another blog post sometime) is in the way to define meaningful sets of resources and events. In the data we have been looking at, activities such as “commenting on a blog”, “searching a blog”, “querying linked data” or “using a web application” are clearly emerging, but the number and nature of the types of resources and events that can appear in the data is largely dependent on the system and the user. This is why we believe that using ontologies as a model to drive such abstractions is a good solution: it provides us with a flexible way to define types of resources (e.g., BlogPage, RSS feed, Linked Data endpoint) and the corresponding activities (e.g., commenting, querying, searching), and to automatically classify individual traces and resources into these types. The end result is the ability for individual users to visualise and analyse the distribution of their own activity data across these types and categories. Pushing it a step further, users should even be able to personalise the views, giving their own ontological definitions and obtaining data abstractions that are therefore more meaningful to them.
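The difference between the two abstractions can be illustrated with a small Python sketch over the same set of raw events. The `Event` fields, the group labels and the function names are assumptions made for the example; they are not taken from the UCIAD ontologies.

```python
from collections import Counter
from typing import NamedTuple


class Event(NamedTuple):
    """A raw event: an actor carries out an action on a resource."""
    actor: str
    group: str          # e.g. "Student", "Staff", "External"
    action: str         # e.g. "commenting", "searching", "querying"
    resource_type: str  # e.g. "BlogPage", "LinkedDataEndpoint"


def organisation_view(events):
    """Aggregate by group: individual users disappear into the statistics."""
    return Counter((e.group, e.action) for e in events)


def user_view(events, actor):
    """Aggregate around one user, by type of activity and resource."""
    return Counter(
        (e.action, e.resource_type) for e in events if e.actor == actor
    )
```

Both functions read the same events; only the axis of aggregation changes. In the first, the actor is folded into a group; in the second, the actor is fixed and the resources and actions are what get grouped.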

A colleague forwarded me today this article in French, where the author says (my translation): “What could I accomplish if I had at my disposal, in an exploitable form, the information regarding my pathways and communications? [...] Not only to control what others are doing with it, but to use it to my own benefit? Today, we tend to scratch our head and ask: what would be the use of that?” Indeed, we don’t really know what this will allow in the future. However, as the author of the article suggests, that shouldn’t stop us from trying to find out, as long as we are convinced there is something there to explore.

The mydata project

June 21st, 2011

Announcements have come out recently regarding new projects from the government around the slogan “Better choices, better deals” to support better customer experience, through transparent customer information. This is exciting as it shows how the government, as well as businesses, are now realising that it is through giving customers (i.e., the users) control over information that we can build a better, more reliable and more transparent experience. At the core of the initiative is the mydata project, whose goal can be summarised by the sentence: “giving back customer data to customers”. To a large extent UCIAD can be seen as an experiment in this direction, proposing to deliver activity data to the users (i.e., customers) of large organisations. We certainly share the same hypothesis that, as expressed by Nigel Shadbolt (chair of the MyData project), customers/users getting back their information can help make organisations/businesses “more accountable”, “more efficient” and able to build “new kinds of services”.

Of course, it is still unclear at this stage what will be the concrete outcomes of the mydata project. Great challenges have to be tackled both from a technological point of view (in what format should data be provided to customers? How to ensure reusability? How to deal with heterogeneity?) and from the societal point of view (What are the privacy/security implications? How to enforce “user-centric data provision” policies in businesses? How to spread the benefit equally amongst users?). We hope that our experience with UCIAD (and beyond, with the work building on UCIAD we are planning to do) will contribute to such exciting new approaches to activity/customer data.

Reasoning over user-centric activity data

June 16th, 2011

There are two reasons why we believe ontology technologies will benefit the analysis of activity data in general, and from a user-centric perspective in particular. First, ontology-related technologies (including OWL, RDF and SPARQL) provide the necessary flexibility to enable the “lightweight” integration of data from different systems. Not only can we use our ontologies as a “pivot” model for data coming from different systems, but this model is also easily extensible, both to take account of the particularities of the different systems around and to allow for custom extensions for particular users, making personalised analysis of personal data feasible.

The second advantage of ontologies is that they allow for some form of reasoning that makes it easier for us to just throw data into them and obtain meaningful results. I use reasoning in a broad sense here to show how, based on raw data extracted from the logs of Web servers, we can obtain a meaningful, integrated view of the activity of a user of the corresponding websites. This is based on current experiments carried out with 2 servers hosting various websites, including blogs such as uciad.info, as well as the linked data platform of the Open University, data.open.ac.uk.

Traces of activities around a user

The first piece of inference that we need is to be able to identify and extract, within our data, information related to the particular traces of activities carried out by a user. To identify a user, we rely here on the settings used to carry out the activity. A setting, in our ontology, corresponds to a computer (generally identified by its IP address) and an agent (generally a browser, identified by a generally complex string such as Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24). The first step is therefore to associate a user with the settings he/she usually uses. We are currently developing tools so that a user can register with the UCIAD platform and have his/her settings automatically detected. Here, I manually declared the settings I’m using by providing the triple store with the following piece of RDF:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:actor="http://uciad.info/ontology/actor/">
<rdf:Description rdf:about="http://uciad.info/actor/mathieu">
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/4eafb6e074f46857b1c0b4b2ad0aa8e4"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/c97fc7faeadaf5cac0a28e86f4d723c9"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/eec3eed71319f9d0480ff065334a5f3a"/>
</rdf:Description>
</rdf:RDF>

This indicates that the user http://uciad.info/actor/mathieu has three settings. These settings are all on the same computer and correspond to the Safari and Chrome browsers, as well as the Apple PubSub agent (used in retrieving RSS feeds amongst other things).

Each trace of activity is realised through a setting (linked to the trace by the hasSetting ontology property). Knowing the settings of a user therefore allows us to list the traces that correspond to this particular user through a simple query. Even better, we can create a model, i.e. an RDF graph, that contains all the information related to the user’s activity on the considered websites, using a SPARQL construct query:

PREFIX tr:<http://uciad.info/ontology/trace/>
PREFIX actor:<http://uciad.info/ontology/actor/>
construct {
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
} where{
  <http://uciad.info/actor/mathieu> actor:knownSetting ?set.
  ?trace tr:hasSetting ?set.
  ?trace ?p ?x.
  ?x ?p2 ?x2.
  ?x2 ?p3 ?x3.
  ?x3 ?p4 ?x4
}

The results of this query correspond to all the traces of activities in our data that have been carried out through known settings of the user http://uciad.info/actor/mathieu, as well as the surrounding information. Although this query is a bit rough at the moment (it might include irrelevant information, or miss relevant data that are connected to the traces through too many steps), what is really interesting here is that it provides a very simple and elegant mechanism to extract, from a large amount of raw log data, a subgraph that completely characterises the activities of one user on the considered websites. This data can therefore be considered on its own, as a user-centric view on activity data, rather than a server-centric or organisation-centric view. It can also be provided back to the user, exported in a machine-readable way, so that he/she can make use of it in other systems and for other purposes.
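For readers less familiar with SPARQL, the extraction can be approximated in plain Python over an in-memory set of triples. This is a variant of the query above, not an exact translation: it walks outwards from each of the user's traces for a bounded number of steps and, unlike the basic graph pattern in the query, it also keeps chains that stop before the fourth step. The property names `knownSetting` and `hasSetting` come from the ontologies; the function name and the shortened identifiers in the usage are hypothetical.

```python
from collections import defaultdict


def user_subgraph(triples, user, max_depth=4):
    """Collect the triples reachable from the user's traces.

    `triples` is an iterable of (subject, predicate, object) tuples.
    """
    triples = list(triples)
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))

    # Settings declared for the user, then traces made through them.
    settings = {o for p, o in by_subject.get(user, []) if p == "knownSetting"}
    frontier = {s for s, p, o in triples if p == "hasSetting" and o in settings}

    result, visited = set(), set()
    for _ in range(max_depth):
        next_frontier = set()
        for node in frontier - visited:
            visited.add(node)
            for p, o in by_subject.get(node, []):
                result.add((node, p, o))
                next_frontier.add(o)
        frontier = next_frontier
    return result
```

The returned set of triples plays the role of the user's own graph: it can be stored separately, queried on its own or exported.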

We are currently working on the mechanisms allowing users to register/login to the UCIAD platform, to identify their settings and to obtain their own “activity data repository”.

Reasoning about websites and activities

The second aspect of reasoning with user-centric activity data relates to inferring information from the data itself, to support its interpretation and analysis. What we want to achieve here is, by providing ontological definitions of different types of activities, to be able to characterise different types of traces and classify them as evidence of particular activities happening.

The first step in making such inferences is to characterise the resources over which activities are carried out: in our case, websites and webpages. Our ontologies define a webpage as a document that can be part of a webpage collection, and a website as a particular type of webpage collection. As part of setting up the UCIAD platform, we declare in the RDF model the different collections and websites that are present on the considered server, as well as the URL patterns that make it possible to recognise webpages as parts of these websites and collections. These URL patterns are expressed as regular expressions, and an automatic process is applied to declare triples of the form page1 isPartOf website1 or page2 isPartOf collection1 when the URLs of page1 and page2 match the patterns of website1 and collection1 respectively.
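The automatic process that turns URL patterns into isPartOf triples can be sketched as follows. The two patterns and the sample URLs are invented for the example (the actual patterns used on our servers are not shown in this post), and the names `website1` and `collection1` simply reuse the placeholders from the paragraph above.

```python
import re

# Hypothetical URL patterns for one website and one webpage collection.
COLLECTION_PATTERNS = {
    "website1": re.compile(r"^http://uciad\.info/ub/"),
    "collection1": re.compile(r"/feed/?$"),
}


def part_of_triples(urls, patterns=COLLECTION_PATTERNS):
    """Declare (page, "isPartOf", collection) for every matching pattern."""
    return [
        (url, "isPartOf", name)
        for url in urls
        for name, pattern in patterns.items()
        if pattern.search(url)
    ]
```

A page can match several patterns, in which case it is declared as part of several collections. That overlap is what the class definitions discussed next build on.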

Now, the interesting thing is that these websites, collections and webpages can be further specified as being of particular types and as having particular properties. We for example declare that http://uciad.info/ub/ is a Blog, which is a particular type of website. We can also declare a webpage collection that corresponds to RSS feeds, using a particular URL pattern, and use an ontology expression to declare the class of BlogFeed as the set of webpages which are both part of a Blog and part of the RSSFeed collection, i.e., in the OWL abstract syntax:

Class(BlogFeed complete
    intersectionOf(Webpage
      restriction(isPartOf someValuesFrom(RSSFeed))
      restriction(isPartOf someValuesFrom(Blog))
    )
)

What is interesting here is that such a definition can be added to the repository, which, using its inference capability, will derive that certain pages are BlogFeed, without this information being directly provided in the data, or the rule to derive it being hard-coded in the system. We can therefore engage in an incremental construction of an ontology characterising websites and activities generally, in the context of a particular system, or in the context of a particular user. Our user http://uciad.info/actor/mathieu might for example decide to add to his data repository the ontological definition allowing him to recognise traces over BlogFeed made with the Apple PubSub agent as a particular category of activities (e.g., FeedSyndication), alongside others that characterise other kinds of activities: recovering data, reading news, commenting, editing, searching, …
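To show what the inference amounts to in this particular case, here is a small Python sketch that emulates it by hand. It is only an illustration: in UCIAD the derivation is done by the triple store's reasoner from the OWL definition, not by code like this. The sketch assumes every key of `part_of` is already known to be a Webpage, and the `DEFINITIONS` table, function name and identifiers in the usage are hypothetical.

```python
from collections import defaultdict

# Each defined class lists the classes that the page's containers must
# cover, mirroring the two someValuesFrom restrictions on isPartOf.
DEFINITIONS = {
    "BlogFeed": {"RSSFeed", "Blog"},
}


def infer_page_types(part_of, container_types, definitions=DEFINITIONS):
    """Return {page: set of inferred classes}.

    `part_of` maps a page to the collections it is part of;
    `container_types` maps a collection to its declared classes.
    """
    inferred = defaultdict(set)
    for page, containers in part_of.items():
        reachable = set()
        for container in containers:
            reachable |= container_types.get(container, set())
        for cls, required in definitions.items():
            if required <= reachable:
                inferred[page].add(cls)
    return dict(inferred)
```

Adding a new entry to the definitions table changes what is derived without touching the data. The ontology-based approach offers the same property, in a far more general and declarative form.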

On Self-Tracking

May 18th, 2011

I have said it and repeated it numerous times: UCIAD is profoundly different from all the other JISC Activity data projects at many different levels. One of these differences, at the basis of our main hypothesis, is that we consider activity data for the user’s own consumption, and to his/her own benefit. The team working in UCIAD has made this notion of user-centric personal information a guiding principle for research. With my colleague Matthew Rowe we recently described a major aspect of this research in a position paper for the W3C Workshop on Web Tracking and User Privacy: Self-tracking on the Web.

As described in the paper, entitled “Self-Tracking on the Web: Why and How”, self-tracking is “the activity of monitoring and analysing one’s own behaviour regarding personal information exchange and the consequences of such behaviour on their exposure, privacy and reputation”. We emphasise in this paper how existing tools and technologies to carry out self-tracking on the Web are limited, especially in comparison with the tools and technologies used to track user activities and data to the benefit of organisations. The paper concludes that “achieving such a process of self-tracking can be very revealing to Web users, helping them reaching a better awareness of their own online behaviour, and a better understanding of the possible consequences of such behaviour on the exposure of their personal information. Such an approach appears to be crucially needed as the Web evolves to both a global information marketplace, and a major medium for all sorts of social interactions online. [...] We therefore argue that a more principled and comprehensive study of the activity of self-tracking on the Web and of the technological requirements for such an activity to take place should be conducted. This requires for both the social and conceptual models of the way personal information is exchanged on the Web to be related to the technological protocols that are used as mediums for instantiating these models. From a more concrete point of view, we believe that a new set of tools are to be created that will support users in monitoring their own activity on the Web.”

UCIAD can be seen as an experiment in this direction. Focusing on Web data related to the interaction between a user and an organisation, it is looking at the techniques, the models and the tools that are necessary to enable users to have a personalised view on their own data, i.e., the data generated by their own activity. More generally, it is also setting up generic models of activity online, i.e., the ontologies and the associated technological components, that can be reused in broader environments.

UCIAD ontologies: A starting point

March 23rd, 2011

UCIAD intends to use ontologies both as a way to achieve the integration of activity across various, possibly heterogeneous systems, and to benefit from their inference capabilities to support the flexible, customisable and expressive analysis of such activity data. Building an ontology that could be used as a conceptual model for all sorts of activity data is quite obviously a difficult task, which is going to be refined and iterated over the length of the project (and hopefully beyond the end of the project).

However, compared to other domains, the advantage of user activities is that there is a lot of data to look at. This might be seen as an issue (from a technical point of view, but also because it is quite overwhelming to get so much data), but in reality, this allows us to apply a bottom-up approach to building our ontologies: modelling through characterising the data, rather than through expertise in the domain. It also gives us an insight into the scale of the tasks, and the need for adapted tools to support both the ontological definition of specific situations, and the ontology-based analysis of large amounts of traces of activity data.

Identifying concepts and their relations

The first step in building our ontology is to identify the key concepts, i.e., the key notions, that we need to tackle, bearing in mind that our ultimate goal is to understand activities. The main concepts we are considering are therefore the ones that support the concept of activity. Activities relate to users, but not only to users. We rely extensively on website logs as sources of activity data. In these cases, we can investigate requests both from human users and from robots automatically retrieving and crawling information from the websites. The server logs in question can be seen as collections of traces of the activities that these users/robots are realising on websites. We therefore need to model these other aspects, which correspond to actions that are realised by actors on particular resources. These are the three kinds of objects that, in the context of Web platforms, we want to model, so that they can be interpreted and classified in terms of activities. We therefore propose 4 ontologies to be used as the basis of the work in UCIAD:

  • The Actor Ontology is an ontology representing different types of actors (human users vs robots), as well as the technical setting through which they realise online activities (computer and user agent).
  • The Sitemap Ontology is an ontology to represent the organisation of webpages in collections and websites, and which is extensible to represent different types of webpages and websites.
  • The Trace Ontology is an ontology to represent traces of activities, realised by particular agents on particular webpages. As we currently focus on HTTP server logs, this ontology contains specific sections related to traces as HTTP requests (e.g., methods as actions and HTTP response codes). It is however extensible to other types of traces, such as specific logs for VLEs or search systems.
  • The Activity Ontology is intended to define particular classes of activities into which traces can be classified, depending on their particular parameters (including actors and webpages). The type of activities to consider highly depends on the systems considered and to a certain extent on the user. The idea here is that specific versions of the ontology will be built that fit the specific needs of particular systems. We will then extract the generic and globally reusable part of these ontologies to provide a base for an overarching activity ontology. Ultimately, the idea in UCIAD is that individual users will be able to manipulate this ontology to include their specific view on their own activities.
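To make the relation between the first three ontologies more concrete, here is a minimal sketch in Python of how an actor, a webpage and a trace could fit together. This is only an illustration and not the UCIAD ontologies themselves: the class and field names, as well as the naive user-agent heuristic in `make_actor`, are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class ActorType(Enum):
    HUMAN = "human"
    ROBOT = "robot"


@dataclass(frozen=True)
class Actor:
    """Actor Ontology: who acts, and through which technical setting."""
    actor_type: ActorType
    computer: str     # e.g. an IP address
    user_agent: str   # e.g. a browser or crawler identifier


@dataclass(frozen=True)
class WebPage:
    """Sitemap Ontology: a page that belongs to a website."""
    url: str
    website: str


@dataclass(frozen=True)
class Trace:
    """Trace Ontology: one action realised by an actor on a page."""
    actor: Actor
    page: WebPage
    method: str         # for HTTP logs, the request method is the action
    response_code: int
    time: datetime


def make_actor(computer: str, user_agent: str) -> Actor:
    """Guess the actor type from the user agent (a naive, assumed heuristic)."""
    markers = ("bot", "crawler", "spider")
    is_robot = any(marker in user_agent.lower() for marker in markers)
    actor_type = ActorType.ROBOT if is_robot else ActorType.HUMAN
    return Actor(actor_type, computer, user_agent)


trace = Trace(
    actor=make_actor("192.0.2.10", "Googlebot/2.1"),
    page=WebPage("http://kmi.open.ac.uk/people", "kmi.open.ac.uk"),
    method="GET",
    response_code=200,
    time=datetime(2011, 3, 23, 10, 0),
)
print(trace.actor.actor_type.name)  # ROBOT
```

The Activity Ontology would then sit on top of such traces, classifying them according to their actor, page and request parameters.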

Reusing existing ontologies

When dealing with data and ontologies, reuse is generally seen as good practice. Apart from saving time by not having to remodel things that have already been described elsewhere, it also helps anticipate future needs for interoperability, by choosing well-established ontologies that are likely to have been employed elsewhere. We therefore investigated existing ontologies that could help us define the notions mentioned above. Here are the ontologies we reused:

  • The FOAF ontology is commonly used to describe people, their connections with other people, but also their connections with documents. We use FOAF in the Actor Ontology for human users, and in the Sitemap Ontology for Webpages (as Documents).
  • The Time Ontology is a common ontology for representing time and temporal intervals. We use it in the Trace Ontology.
  • The Action ontology defines different types of actions in a broad sense, and can be used as a basis for representing elements of the requests in the Trace Ontology, but also as a base typology for the Activity ontology. It itself imports a number of other ontologies, including its own notion of actors.

The dependencies between our ontologies and the ones we reuse are shown in the graph below.
UCIAD ontologies dependencies

While not currently used in our base ontologies, other ontologies can be considered at a later stage, for example to model specific types of activities. These include the Online Presence Ontology (OPO), as well as the Semantically-Interlinked Online Communities ontology (SIOC).

Next: Using, refining, customizing

Ontology modelling is a never-ending task. Elements constantly need to be corrected and added to cover more and more cases in a way that is as generic as possible. This is even more the case in UCIAD, as the approach is to create the ontology depending on the data we need to treat. Therefore, as we progressively add more data from different sources, including server logs from different types of websites and activity logs from systems such as VLEs or video players, the ontologies will evolve to include these cases.

Going a step further, what we want to investigate is the user-centric analysis of activity data. The ontologies will be used to provide users with views and analysis mechanisms for the data that concern their own activities. It therefore seems a natural next step to make it possible for the users to extend the ontologies, to customize them, therefore creating their own view on their own data.

Hypothesis

March 14th, 2011

UCIAD is a relatively small, experimental project looking at how semantic technologies can help the user-centric integration, analysis and interpretation of activity data in a large organisation. As such, and as also suggested to all the other projects in the JISC Activity Data programme, it relies on a central hypothesis that will hopefully be verified through the realisation and application of our software platform. But before we can express this hypothesis, we need to introduce a bit of background. In particular, we need to get back to what we mean by “user-centric”.

To put it simply, a user-centric approach is considered here in opposition to an organisation-centric approach. The most common way of considering activity data in large organisations at the moment is through consolidating visits to websites in analytics, giving statistics about the number of visits on a given website or webpage, and where these visits were coming from. We qualify this as an organisation-centric view as the central point of focus is the website managed by the organisation. By taking such a restricted perspective on the interpretation of activity data, a number of potentially interesting questions, that take the users concerned with the activity data as the focus point, cannot be answered. The analysis of the activity data can also only be beneficial to the organisation, and not to the user, as each user becomes aggregated in website-related statistics. We therefore express our main hypothesis as:

Hypothesis 1: Taking a user-centric point of view can enable different types of analysis of activity data, which are valuable to the organisation and the user.

In order to test this hypothesis, one actually needs to achieve such user-centric analysis of activity data. This implies a number of technical and technological challenges, namely, the need to integrate activity data across a variety of websites managed by an organisation, to consolidate this data beyond the “number of visits”, and to interpret them in terms of user activities.
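The difference between the two perspectives can be illustrated with a small sketch in Python. The records below are invented for the example and assume that the integration across websites has already been done; the function names are ours. The first function gives the usual organisation-centric “number of visits” per website, while the second regroups the same data around each user.

```python
from collections import Counter, defaultdict

# Hypothetical, already-integrated records: (user, website, page)
records = [
    ("alice", "www.open.ac.uk", "/courses"),
    ("alice", "data.open.ac.uk", "/sparql"),
    ("bob", "www.open.ac.uk", "/courses"),
    ("alice", "www.open.ac.uk", "/search"),
]


def visits_per_website(records):
    """Organisation-centric view: users disappear into per-site counts."""
    return Counter(site for _, site, _ in records)


def activity_per_user(records):
    """User-centric view: each user's trail across all websites."""
    by_user = defaultdict(list)
    for user, site, page in records:
        by_user[user].append((site, page))
    return dict(by_user)


print(visits_per_website(records)["www.open.ac.uk"])  # 3
print(activity_per_user(records)["bob"])  # [('www.open.ac.uk', '/courses')]
```

Only the second view lets a user see their own activity across systems, which is the kind of analysis Hypothesis 1 is about.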

Ontologies are formal, machine-processable conceptual models of a domain. Ontology technologies, especially associated with technologies from the semantic web, have proven useful in situations where a meaningful integration of large amounts of heterogeneous data needs to be realised, and to a certain extent, reasoned upon in a qualitative way, for interpretation and analysis. Our goal here is to investigate how ontologies and semantic technologies can support the user-centric analysis of activity data. In other words, our second hypothesis is:

Hypothesis 2: Ontologies and ontology-based reasoning can support the integration, consolidation and interpretation of activity data from multiple sources.

As described in our work plan (see previous blog post), our first task is therefore to build an ontology able to flexibly describe the traces of activities across multiple websites, the users of these websites and the connections between them. The idea is to use this ontology (or rather, this set of ontologies) as a basis for a pluggable software framework, capable of integrating data from heterogeneous logs, and to interpret such data as traces of high-level activities.
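As a rough illustration of what “interpreting data as traces of high-level activities” could mean, the sketch below classifies traces using a few definitions expressed as conditions. It is written in plain Python rather than with an ontology language and a reasoner, and the activity names, trace fields and conditions are all hypothetical.

```python
# Hypothetical activity definitions: class name -> condition over a trace.
# In an ontology, these would be class definitions evaluated by a reasoner.
ACTIVITY_DEFINITIONS = {
    "SearchingCourses": lambda t: t["path"].startswith("/search") and "course" in t["query"],
    "ReadingNews": lambda t: t["path"].startswith("/news"),
    "UsingLinkedData": lambda t: t["site"] == "data.open.ac.uk",
}


def interpret(trace):
    """Return the names of all activity classes whose definition the trace satisfies."""
    return [name for name, condition in ACTIVITY_DEFINITIONS.items() if condition(trace)]


traces = [
    {"site": "www.open.ac.uk", "path": "/search", "query": "course=history"},
    {"site": "data.open.ac.uk", "path": "/sparql", "query": ""},
    {"site": "www.open.ac.uk", "path": "/contact", "query": ""},
]

for t in traces:
    print(t["path"], interpret(t))
# /search ['SearchingCourses']
# /sparql ['UsingLinkedData']
# /contact []
```

Adding or changing an entry in the definitions changes how existing traces are interpreted without touching the traces themselves, which is the kind of flexibility we expect from an ontology-based approach.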

The ongoing definition of these ontologies can be followed on our code repository, and a presentation of UCIAD’s basic hypothesis at the JISC Activity Data Programme event is available on slideshare.

Project Plan

February 17th, 2011

UCIAD intends to realise something relatively ambitious (setting up a software infrastructure for the user-centric integration of activity data) within a rather short period of time. This stresses the importance of setting up a suitable work plan from the start of the project, ensuring that outputs are delivered and can be taken up as early as possible.

Aims, Objectives and Final Output(s) of the project

The overall aim of UCIAD is to investigate the use of ontologies and semantic technologies for integrating the different data about the interaction of a user with different systems and websites in an organization. More specifically, to achieve this aim we plan:

  1. To investigate and develop the ontological models needed to integrate user activity data. The objective here is to develop a set of ontologies that can be used to integrate logs and traces of activities existing in a variety of formats, depending on the originating system. Such ontologies will provide a common, meaningful and reusable activity data model for capturing user-centric activity data.
  2. To prototype a reusable, pluggable framework to integrate user activity data across different user facing systems within a large organization, relying on the developed ontological models. Such a framework will be based on semantic data management components available in KMi or externally (as open source software) to aggregate data coming from various systems. In order to accommodate an extensible variety of log formats and activity databases, it will implement a pluggable architecture, where plug-ins implementing a mapping between a particular source/format and our ontological model can be easily added to the framework.
  3. To test and scope the applicability of such a framework within realistic scenarios at The Open University. A complete case study integrating logs from various systems at The Open University, especially access and search logs from The Open University’s main website, specific logs from The Open University’s virtual learning environment, the linked open data platform of The Open University, the seminar system of The Open University, websites and user facing systems from various research projects at the Knowledge Media Institute (e.g., kmi.open.ac.uk, lucero-project.info, neon-project.org, http://sssw.org, etc.) will be used to test the UCIAD framework.
  4. To demonstrate how the UCIAD activity data framework can benefit the users in their interaction with the organization. Initial requirements, components and guidelines on exploiting the framework to the benefit of the user, regarding in particular GUI issues, ownership and export of the data will be devised by the end of the project, ensuring short-term potential deployment of the results of the project.
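The pluggable architecture mentioned in the second objective can be sketched as a simple registry, where each plug-in maps one source format to a common record structure. This is a minimal Python sketch under our own assumptions: the format names, the log layouts and the record fields are invented for illustration and do not correspond to actual UCIAD code.

```python
from typing import Callable, Dict, List

# A plug-in maps one raw log line to a record in the common model.
Plugin = Callable[[str], dict]

PLUGINS: Dict[str, Plugin] = {}


def register(source_format: str):
    """Decorator adding a plug-in to the registry under a format name."""
    def decorator(func: Plugin) -> Plugin:
        PLUGINS[source_format] = func
        return func
    return decorator


@register("csv-search-log")
def parse_search_log(line: str) -> dict:
    # Assumed layout: user,timestamp,query
    user, timestamp, query = line.split(",", 2)
    return {"actor": user, "time": timestamp, "action": "search", "resource": query}


@register("tsv-video-log")
def parse_video_log(line: str) -> dict:
    # Assumed layout: timestamp<TAB>user<TAB>video
    timestamp, user, video = line.split("\t")
    return {"actor": user, "time": timestamp, "action": "play", "resource": video}


def integrate(sources: Dict[str, List[str]]) -> List[dict]:
    """Run each source through its plug-in and merge the records by time."""
    records = []
    for source_format, lines in sources.items():
        plugin = PLUGINS[source_format]
        records.extend(plugin(line) for line in lines)
    return sorted(records, key=lambda record: record["time"])


merged = integrate({
    "csv-search-log": ["alice,2011-02-17T10:05:00,semantic web"],
    "tsv-video-log": ["2011-02-17T10:01:00\talice\tintro.mp4"],
})
print([record["action"] for record in merged])  # ['play', 'search']
```

Supporting a new system then only requires writing and registering one more parsing function, without changing the integration code.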

Risk Analysis and Success Plan

Considering the ambitious goals of the project, the major risks relate to the maturity and robustness of semantic technologies, related to their ability to handle very large amounts of user activity data across multiple websites, and to support the user-centric interpretation of this data. The team involved in the project has extensive experience in working with such technologies, in large scale projects.

The primary goal of UCIAD being the realisation of an open software platform relying on ontologies to integrate and interpret user activity data, the main success criteria include the successful, documented application of this platform on a large variety of websites at the Open University, and possibly outside. The outputs of the project will be released as open source, and we expect uptake from external organisations to take place towards the end, or after the project.

IPR

In order not to infringe the privacy-related expectations of users of the considered websites, the activity data considered as part of the project will be kept private. The ontologies to model and integrate such data will be made available under an open license (CC0), for reuse and extension by the community. Some technologies employed in the project have been developed by external organizations and are available as open source software. Code realized as part of UCIAD will also be released under an open source license (LGPL). The code will be made available through UCIAD’s repositories on GitHub. All documentation produced, including reports, blogs and system documentation, will be made available under a Creative Commons license (CC-BY).

Project Team Relationships and End User Engagement

UCIAD is realised and managed at the Knowledge Media Institute (KMi) of the Open University, which is an 84-strong interdisciplinary research laboratory founded at The Open University in 1995. KMi has established itself as a world-class R&D centre at the leading edge of the Web, semantic, learning, and new media technologies. The research areas in KMi include cognitive sciences, new media technologies for learners, human computer interaction, Semantic Web and Web services, multimedia analysis and information retrieval.

The project team includes:

  • Dr. Mathieu d’Aquin is a Research Fellow working in the Semantic Web area at the Knowledge Media Institute. Dr. d’Aquin is leading the research and development around approaches to exploit semantic technologies and semantic data. Dr. d’Aquin has in particular been working on concrete solutions for the realization of applications producing and consuming linked data (see for example the JISC-funded LUCERO project which he is directing), and is currently leading the realization of the Open University’s linked data Web – data.open.ac.uk. Dr. d’Aquin is also involved in a research direction concerning the use of Semantic Web technologies for the purpose of personal information management.
  • Prof. Enrico Motta is Professor of Knowledge Technologies at KMi and a leading international scientist in the area of Semantic Technologies, with extensive experience of both fundamental and applied research. Professor Motta will act in the project as the chair of the steering group.
  • Salman Elahi is a research assistant at KMi, and a part time PhD Student working on aspects of user-centric identity and personal information management.
  • Stuart Brown is Web Developments and Online Communities manager at The Open University. He is in particular involved in the overall management of the Open University’s content management systems. Stuart Brown will act as a member of the UCIAD steering group, in charge of the liaison between the project team and the Open University’s online services.

Dissemination will be realised through a variety of channels (blog, twitter, etc.) as well as through direct engagement with the community (users and website developers at The Open University, other researchers and developers through seminars, conferences and dedicated workshops). Several aspects of evaluation will be considered. The ontologies and software framework developed as part of the project will be evaluated both formally (using ontology evaluation frameworks and software validation methods) and through usage in our case study. The overall outcome of the project will be evaluated based on adoption at The Open University and by external parties.

Projected Timeline, Workplan & Overall Project Methodology

Based on the aim and objectives described above, we divide the workplan of UCIAD into 5 workpackages:

WP1 – Ontologies as Semantic Models for Integrating User Activity Data: The goal of this workpackage is to produce the foundational data models for the project, by developing the ontologies to be used to integrate activity data from various sources. Here, we will employ ontology design methodologies developed in KMi, combining reuse of existing ontologies, data-driven modelling and knowledge engineering techniques.

Deliverables: A set of documented and reusable user activity data ontologies.

WP2 – Prototype Ontology Based Architecture for Cross-Organization User Activity Data: The goal of this workpackage is to prototype the architecture for aggregating user activity data based on the ontologies developed in WP1. This architecture will mostly consist of a semantic data management system (triple store, reasoner and query engine), and a plug-in based framework to realise the mapping between logs and activity databases and user activity ontologies.

Deliverables: An open-source, pluggable user activity data framework and documentation.
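To give an idea of the role played by the semantic data management system in WP2, here is a toy in-memory sketch in Python of the three parts named above: a triple store, a (very limited) reasoner and a pattern-based query. It is an illustration under our own assumptions and not the system used in the project; the identifiers such as `trace1`, `hasActor` and `HTTPRequest` are hypothetical.

```python
class TripleStore:
    """A toy store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Query: return sorted triples matching the pattern; None is a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )

    def infer_types(self, subclass_of):
        """Reasoner stand-in: propagate rdf:type along subclass links until stable."""
        changed = True
        while changed:
            changed = False
            for s, p, o in list(self.triples):
                if p == "rdf:type" and o in subclass_of:
                    inferred = (s, "rdf:type", subclass_of[o])
                    if inferred not in self.triples:
                        self.triples.add(inferred)
                        changed = True


store = TripleStore()
store.add("trace1", "rdf:type", "HTTPRequest")
store.add("trace1", "hasActor", "alice")
store.infer_types({"HTTPRequest": "Trace"})
print(store.match(predicate="rdf:type"))
# [('trace1', 'rdf:type', 'HTTPRequest'), ('trace1', 'rdf:type', 'Trace')]
```

In the actual architecture, the mapping plug-ins would be responsible for producing such triples from logs and activity databases, using the vocabulary of the user activity ontologies.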

WP3 – Case Study using Multiple Sources of Activity Data: The goal of this workpackage is to deploy the architecture developed in WP2 in a concrete, realistic scenario. We will in particular set up the architecture with a set of plugins to aggregate data from several websites of The Open University and the Knowledge Media Institute (see list of considered systems and websites in Paragraph 14). Initial agreements with the administrators of the considered systems and websites at The Open University’s online services and the Knowledge Media Institute have already been obtained.

Deliverables: A set of plugins for the relevant websites/systems (including for example a plugin for access logs of Apache Web servers), with documentation regarding the development of these plugins and the deployment of the UCIAD framework.
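As an example of what the core of a plugin for Apache access logs might look like, the sketch below parses one line in the Apache “combined” log format into a dictionary of named fields. It is a minimal Python illustration: the field names and the sample log line are ours, and a real plugin would go on to map these fields to the ontologies.

```python
import re
from datetime import datetime

# Apache "combined" log format:
# host ident user [time] "method path protocol" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields


line = (
    '192.0.2.10 - - [17/Feb/2011:10:05:00 +0000] '
    '"GET /people HTTP/1.1" 200 5120 '
    '"http://kmi.open.ac.uk/" "Mozilla/5.0"'
)
trace = parse_access_log_line(line)
print(trace["method"], trace["path"], trace["status"])  # GET /people 200
```

The host and user agent would feed the description of the actor, the path the description of the webpage, and the method, status and time the trace itself.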

WP4 – User Centric Interfaces to Activity Data: The goal of this workpackage is to analyse the requirements and implement initial components for user interfaces to the UCIAD framework. In order to reduce development cost, we plan to reuse components of the open source Piwik web analytics engine, to provide user-centric, ontology-based analytics across organizational websites, instead of website-centric analytics.

Deliverables: An initial set of components (widgets) for a prototype graphical interface to the UCIAD framework.

WP5 – Dissemination and Project Evaluation: The goal of the project is to investigate and prototype a pluggable framework for user activity data. It is therefore essential for the project to engage with potential users and developers of this framework, to ensure adoption and further extension. We will realise this through extensive and frequent communication across a variety of channels (project website, blog, twitter, seminars and conferences). The evaluation of the results of the project will be realised by demonstrating, in a realistic case study, the benefit and quality of the developed components (ontologies, architecture, plugins, interface).

Deliverables: Documented dissemination activities and user-based tests.

UCIAD project plan

Budget


Item                          Amount     Notes
Directly incurred Staff       £28,569    Includes research assistant and director of the project
Directly incurred non-staff   £4,000     Includes travel and equipment
Directly Allocated            £6,994     Includes staff and estates
Indirect Cost                 £31,614
Total                         £71,178
JISC contribution             £49,824
OU contribution               £21,353

What is UCIAD

January 21st, 2011

UCIAD (User Centric Integration of Activity Data) is a new 6-month JISC-funded project addressing the integration of activity data spread across various systems in an organization, and exploring how this integration can both benefit users and improve transparency in an organization.

Both research and commercial developments in the area of user activity data analysis have until now mostly focused on logging user visits to specific websites and systems, primarily in order to support recommendation, or to gather feedback data from users. However, data concerning a single user are generally fragmented across many different systems and logs, from website access logs to search data in different departments. As a result, organizations are typically unable to maintain an integrated overview of the various activities of a given user, which affects their ability to provide optimal service to their users. Hence, a key tenet of the UCIAD project is that developing a coherent picture of the interactions between the user and the organization would be beneficial both to an organization and to its users.

Specifically, the objective of UCIAD is to provide the conceptual and computational foundations to support user-centric analyses of activity data, with the aim of producing results which can be customized for and deployed in different organizations. Ontologies represent semantic models of a particular domain, and can be used to annotate and integrate data from heterogeneous sources. The project will therefore investigate ontological models for the integration of user activity data, how such models can be used as a basis for a pluggable data framework aggregating user activity data, and how such an infrastructure can be used for the benefit of the users, providing meaningful (and exportable) overviews of their interaction with the organization.

The project, led and managed by Mathieu d’Aquin, will contribute to the current strand of research in KMi, which focuses on the use of semantic technologies to support online personal information management and is carried out by Dr d’Aquin and Salman Elahi. The solutions developed in the project will be tested on a number of Open University systems, in collaboration with the Open University’s communication services.