CKI-002: DataWarehouse Authorization System

Architecture of the DataWarehouse data access authorization system
Iñaki Malerba – cki-project/documentation!50

Abstract

The public DataWarehouse instance will allow public and internal data to exist on the same platform. To achieve this, we need an authorization framework that lets us control data access and keep the internal information safe.

Furthermore, tree-level authorization checks will let us split the data into more than two (public/internal) groups. This makes it possible to give certain maintainers write permissions over their own trees, and to keep those trees publicly readable, without granting access to all the internal trees.

Motivation

Goals

Limit data access

The goal of this system is to limit the data that users, both logged-in users and anonymous visitors, can see and modify.

User Stories

Anonymous user can read public trees

Public trees need to be readable by all users, logged in or not.

User with authorization on public trees

Certain users will need authorization to modify publicly readable trees, including the submission of data and triaging failures.

User with read authorization on non-public trees

Some trees should only be accessible by logged-in users with a specific authorization.

User with write authorization on non-public trees

Certain users will need authorization to modify non-public trees, including the submission of data and triaging failures.

Listing Issues

Users should only be able to see issues related to the revisions they have access to.

New git trees are not readable/writeable by default

Trees without a policy attached need to be private by default. This means that manual action is required to assign the correct policy to new trees.

Approach

Policies

The AuthorizationBackend system includes a new Policy class that can be attached to any other class and defines the authorization needed to access it and its related objects.

Policy contains the following fields: name, read_group and write_group. The name field holds a human-friendly description of the policy, while read_group and write_group indicate the group a user needs to belong to in order to perform read and write actions, respectively, on the objects tagged with this policy.

To meet a Policy requirement, the user needs to be part of a group. For this, the built-in Django Authorization Framework is used.

For instance, a group named auth-redhat-internal could give access to the Red Hat internal data, or auth-arm could be the group maintainers need to belong to in order to triage arm tree objects.

read_group or write_group pointing to a group

The user making a request needs to belong to the group referred to in this field in order to read or write, respectively, the object related to the policy.

read_group or write_group null

If the read_group or write_group value is null, the policy allows anyone to read or write, respectively, the object related to that policy.

When an object is ‘writeable by anyone’, it means that any logged-in user with sufficient table-level permissions can write to it.

No policy attached

When an object has no policy attached, it is neither readable nor writeable by anyone. This guarantees that objects are private by default.
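
The three cases above can be summarized in a small sketch. It uses plain Python dataclasses rather than Django models so that it runs on its own. The User class and the can_read and can_write helpers are hypothetical names chosen for illustration, not part of DataWarehouse, and table-level permissions are assumed to be checked elsewhere.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Policy:
    """Plain-Python stand-in for the Policy class described above."""
    name: str
    read_group: Optional[str] = None   # None: no group is required
    write_group: Optional[str] = None  # None: no group is required


@dataclass(frozen=True)
class User:
    """Hypothetical user: a logged-in flag plus the names of its groups."""
    is_authenticated: bool = False
    groups: FrozenSet[str] = frozenset()


def can_read(user: User, policy: Optional[Policy]) -> bool:
    if policy is None:
        return False  # no policy attached: private by default
    if policy.read_group is None:
        return True   # readable by anyone, including anonymous visitors
    return policy.read_group in user.groups


def can_write(user: User, policy: Optional[Policy]) -> bool:
    if policy is None:
        return False  # no policy attached: private by default
    if not user.is_authenticated:
        return False  # 'writeable by anyone' still requires a logged-in user
    if policy.write_group is None:
        return True   # table-level permissions are assumed to be checked elsewhere
    return policy.write_group in user.groups
```

Under these assumptions, an anonymous visitor can read an object whose policy has a null read_group, but cannot write to it even when write_group is null.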

Git Tree Policies

GitTree objects get a Policy object attached, allowing us to filter any of the related objects (KCIDBRevision, KCIDBBuild, KCIDBTest, Pipeline, etc.) according to this policy.

This covers all the results endpoints, giving us control over the KCIDB and pipeline data available on the DataWarehouse.

For example, endpoints querying KCIDB Revisions will only list the revisions the user is allowed to read, while requesting a particular unauthorized Revision via GET returns a Not Found error.
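
A minimal sketch of this list and detail behaviour follows. It models revisions in memory instead of querying the database. The Revision fields, the sample data and the NotFound exception are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import AbstractSet, List, Optional


class NotFound(Exception):
    """Stands in for an HTTP 404 Not Found response."""


@dataclass(frozen=True)
class Revision:
    id: int
    tree_read_group: Optional[str]  # read_group of the tree's policy; None: anyone
    tree_has_policy: bool = True    # False models a tree with no policy attached


# Hypothetical sample data.
REVISIONS = [
    Revision(1, None),                         # revision on a public tree
    Revision(2, "auth-redhat-internal"),       # revision on an internal tree
    Revision(3, None, tree_has_policy=False),  # new tree without a policy yet
]


def is_readable(revision: Revision, groups: AbstractSet[str]) -> bool:
    if not revision.tree_has_policy:
        return False  # private by default
    return revision.tree_read_group is None or revision.tree_read_group in groups


def list_revisions(groups: AbstractSet[str]) -> List[Revision]:
    """List endpoint: only revisions the user may read are returned."""
    return [r for r in REVISIONS if is_readable(r, groups)]


def get_revision(revision_id: int, groups: AbstractSet[str]) -> Revision:
    """Detail endpoint: unauthorized and missing revisions look the same."""
    for revision in list_revisions(groups):
        if revision.id == revision_id:
            return revision
    raise NotFound(revision_id)
```

Raising the same error for a missing revision and for an unauthorized one means the response does not reveal whether the unauthorized revision exists.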

Issues Policies

Issue objects get a Policy attached, allowing us to limit the issues a user can see while querying them, or while querying or triaging a certain KCIDB object.

For instance, an auth-issue-redhat-internal policy could limit an Issue to a certain group of users, while auth-issue-public could make an Issue public.

Use case example

To illustrate the use of this design, let’s assume the simplest case: there is a single instance containing two sets of data, internal and public.

In this case, we want the internal data to be readable only by a specific group of users, while the public data should be readable by anyone. Both sets can only be written by a certain group.

The policies would be the following:

  • internal (auth-internal): write_group: redhat-cki, read_group: redhat-all
  • public (auth-public): write_group: redhat-cki, read_group: None

redhat-cki would contain all the members of the CKI Team, giving them write permissions on both the internal and the public group of trees. redhat-all grants read access to the internal trees and should include everyone allowed to read that data.
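
The two policies above can be written down and checked with a short sketch. The is_allowed helper and the three sample users are hypothetical. The helper is simplified and does not model the logged-in requirement for a null write_group, which does not occur in this example.

```python
from typing import AbstractSet, Dict, Optional

# The two policies from the example above.
POLICIES: Dict[str, Dict[str, Optional[str]]] = {
    "auth-internal": {"read_group": "redhat-all", "write_group": "redhat-cki"},
    "auth-public": {"read_group": None, "write_group": "redhat-cki"},
}


def is_allowed(policy_name: str, action: str, groups: AbstractSet[str]) -> bool:
    """action is 'read' or 'write'; a null group means no group is required."""
    required = POLICIES[policy_name][f"{action}_group"]
    return required is None or required in groups


# Hypothetical users, described only by the groups they belong to.
USERS = {
    "anonymous": frozenset(),
    "reader": frozenset({"redhat-all"}),
    "cki-member": frozenset({"redhat-cki", "redhat-all"}),
}

# Print which actions each user may perform under each policy.
for user, groups in USERS.items():
    for policy in POLICIES:
        rights = [a for a in ("read", "write") if is_allowed(policy, a, groups)]
        print(f"{user:<11} {policy:<14} {','.join(rights) or '-'}")
```

Read literally, the policies mean that a CKI Team member also needs to belong to redhat-all to read the internal trees. The design does not state whether write access should imply read access, so the sketch keeps the two checks independent.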

Benefits

Flexibility

Having the possibility to create various groups with differentiated read and write checks should allow us to define rules that cover all the use cases.

Reusability

A generic system attachable to any class provides flexibility to add new classes with access authorization rules without the need to design new checks.

Secure by default

By making sure that an object without a policy attached is private by default, we ensure that anything left unconfigured is kept internal.

Drawbacks

Visibility definition becomes complex

The question of whether an object is ‘private’ or ‘public’ changes from evaluating a boolean flag to a query where the user accessing the data is involved.

While this approach is much more flexible, it increases complexity and makes the result of the evaluation harder to see.

To help visualize the policies and authorizations, adding details about which users and groups are able to read and write each object to the user interface would be a good feature. An audit page should list the policies attached to each object, as well as the details of the policy and the users fulfilling it.
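
As a rough illustration of what one row of such an audit page could contain, the sketch below resolves a policy into the users that fulfil it. The audit_row helper, the dictionary layout and the sample group membership are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Union

ANYONE = "anyone"  # null group: no group membership is required
NOBODY = "nobody"  # no policy attached: private by default


def who(group: Optional[str], members: Dict[str, Set[str]]) -> Union[str, List[str]]:
    """Resolve a group name into a sorted list of user names."""
    if group is None:
        return ANYONE
    return sorted(members.get(group, set()))


def audit_row(object_name: str, policy: Optional[dict],
              members: Dict[str, Set[str]]) -> dict:
    """Describe who can read and write one object."""
    if policy is None:
        return {"object": object_name, "policy": None,
                "read": NOBODY, "write": NOBODY}
    return {
        "object": object_name,
        "policy": policy["name"],
        "read": who(policy["read_group"], members),
        # For writes, 'anyone' still means logged-in users with
        # sufficient table-level permissions.
        "write": who(policy["write_group"], members),
    }


# Hypothetical group membership and policy.
MEMBERS = {"redhat-cki": {"alice", "bob"}}
PUBLIC = {"name": "public", "read_group": None, "write_group": "redhat-cki"}

print(audit_row("some-public-tree", PUBLIC, MEMBERS))
print(audit_row("new-tree", None, MEMBERS))
```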

Mixed policies

Using the same Policy model for both the GitTree and Issue classes means that both classes draw from the same set of Policy objects. For GitTree, several different group policies might be desired, while for Issue only ‘public’ and ‘internal’ categories may be useful.

This means that naming conventions for policies will need to be defined and filters put in place for the UI menus, such as:

  • auth-issue- prefix for Issue policies.
  • auth-gittree- prefix for GitTree policies.
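
A filter based on these prefixes could look like the sketch below. The policy_choices helper and the sample policy names are hypothetical. Only the two prefixes come from the list above.

```python
from typing import Dict, Iterable, List

# Prefixes from the naming convention above, keyed by class name.
PREFIXES: Dict[str, str] = {
    "Issue": "auth-issue-",
    "GitTree": "auth-gittree-",
}


def policy_choices(class_name: str, policy_names: Iterable[str]) -> List[str]:
    """Return the policy names a UI menu should offer for the given class."""
    prefix = PREFIXES[class_name]
    return sorted(name for name in policy_names if name.startswith(prefix))


# Hypothetical policy names.
ALL_POLICIES = [
    "auth-issue-public",
    "auth-issue-redhat-internal",
    "auth-gittree-arm",
    "auth-gittree-public",
]

print(policy_choices("Issue", ALL_POLICIES))
print(policy_choices("GitTree", ALL_POLICIES))
```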

Session caching

To improve performance and reduce the overhead of these authorization queries, the authorized GitTrees and Issues will be cached in the user session. In other words, the authorizations are calculated and stored when the user logs in, and a log-out/log-in cycle is needed to refresh them.

It might be necessary to implement a cache invalidation system to update a user's authorizations when the user is added to or removed from a group.
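
The caching and invalidation idea can be sketched with a plain dictionary standing in for the session. The session key and the helper names are hypothetical. As a simplification, the sketch computes the authorizations lazily on first access rather than at login.

```python
from typing import Callable, MutableMapping, Set

SESSION_KEY = "authorized_gittree_ids"  # hypothetical session key


def authorized_ids(session: MutableMapping,
                   compute: Callable[[], Set[int]]) -> Set[int]:
    """Return the cached ids, running the expensive query only once."""
    if SESSION_KEY not in session:
        # Stored as a sorted list so the value can be serialized as JSON.
        session[SESSION_KEY] = sorted(compute())
    return set(session[SESSION_KEY])


def invalidate(session: MutableMapping) -> None:
    """Drop the cached ids, for example after a group membership change."""
    session.pop(SESSION_KEY, None)


# Usage with a dictionary as the session and a fake authorization query.
session: dict = {}
queries = []


def expensive_query() -> Set[int]:
    queries.append("run")  # record each time the query is evaluated
    return {3, 1}


authorized_ids(session, expensive_query)  # runs the query
authorized_ids(session, expensive_query)  # served from the session
invalidate(session)
authorized_ids(session, expensive_query)  # runs the query again
print(len(queries))
```

Without the invalidate step, a user removed from a group would keep the cached authorizations until the session ends, which is the risk the paragraph above describes.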

Message queues are privileged

DataWarehouse uses AMQP queues to communicate data to the microservices that run off it.

With the current architecture, where all the available data is dumped into exchanges and queues and the decision to ignore a message is left to the consumer, this communication channel and all its listeners need to be considered privileged.

To control which policies apply to a consumer, the message architecture needs to change.

One possible approach would be to remove all the data from the messages and keep only the id of the object. The consumer would then request additional data through the API with its own credentials, which would go through the authorization checks, at the cost of losing the performance benefits of rendering and broadcasting a single message.
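
The id-only approach can be sketched as follows. Nothing here reflects the actual DataWarehouse message format or API: the message fields, the api_get function and the in-memory records are assumptions made for illustration, and the broker itself is left out.

```python
import json
from typing import AbstractSet, Optional

# Hypothetical in-memory stand-in for the data behind the API.
RECORDS = {
    ("revision", 7): {"read_group": "auth-redhat-internal",
                      "payload": {"id": 7, "tree": "internal-tree"}},
    ("revision", 8): {"read_group": None,
                      "payload": {"id": 8, "tree": "public-tree"}},
}


def build_message(object_type: str, object_id: int) -> str:
    """Producer side: the message carries a reference, not the data."""
    return json.dumps({"object_type": object_type, "id": object_id})


def api_get(object_type: str, object_id: int, groups: AbstractSet[str]) -> dict:
    """Stand-in for an API call that applies the authorization checks."""
    record = RECORDS.get((object_type, object_id))
    if record is None:
        raise LookupError("Not Found")
    required = record["read_group"]
    if required is not None and required not in groups:
        raise LookupError("Not Found")
    return record["payload"]


def consume(message: str, groups: AbstractSet[str]) -> Optional[dict]:
    """Consumer side: fetch the data with the consumer's own credentials."""
    reference = json.loads(message)
    try:
        return api_get(reference["object_type"], reference["id"], groups)
    except LookupError:
        return None  # not authorized for this object: skip the message


message = build_message("revision", 7)
print(consume(message, frozenset({"auth-redhat-internal"})))
print(consume(message, frozenset()))
```

Each consumer now makes one API request per message, which is the performance cost mentioned above.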