What is interoperability?


Here, I give a short introduction to interoperability and why I think we should make an effort for it.

TL;DR: Many functions have very context-specific side-effects, making code harder to understand and less exchangeable. Try using interoperable functions wherever you can!

Steffen Ehrmann (iDiv / Macroecology and Society, https://www.idiv.de/en/groups_and_people/core_groups/macroecosocial.html)
2024-04-17

Rationale

Over the last months and years, it became increasingly evident to me that I have an obsession with interoperability. At first, I didn’t know that it is called that, or even what it means, but the idea was already there.

So what is interoperability then?

“Interoperability is a characteristic of a product or system, whose interfaces are completely understood, to work with other products or systems, present or future, in either implementation or access, without any restrictions.” 1

Interoperable software, for instance, is designed to efficiently exchange information with other software by providing the output of functionally similar (or the same) operations in a common arrangement or format, standardising access to the resulting data. This principle is not only true for software written in different programming languages, but can also apply to several packages, or workflows, within the R ecosystem.

What was visible to me back then was a wave of standardisation of workflows and software, for example, the R package dplyr. dplyr provides a relatively small set of "verbs" as the basic building blocks (I would call them operators) for building various data-management algorithms. I would regard these operators as interoperable software because they are the result of identifying those aspects many workflows have in common. Ultimately, they are a harmonised and easy-to-understand interface for accessing and processing data. A typical example would be calculating the average of a set of values for the levels of a factor. This task is composed of two subtasks: grouping the values by the levels of the factor, and summarising (here, averaging) the values within each group.

While the base R function tapply() carries out the task in one go, the dplyr functions group_by() and summarise() do the same thing, but more explicitly 2.
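
As a minimal sketch of this comparison (with a small, made-up data frame), both routes compute the same grouped average:

# a small, hypothetical example data set
scores <- data.frame(group = c("a", "a", "b", "b"),
                     value = c(1, 2, 3, 4))

# base R: grouping and averaging happen in one call
tapply(scores$value, scores$group, mean)   # a: 1.5, b: 3.5

# dplyr: the two subtasks are spelled out as separate operators
library(dplyr)
summarise(group_by(scores, group), avg = mean(value))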

Exceptions to an otherwise well-defined operator are not available as standalone functions. Instead, the correct building blocks have to be combined to arrive at the desired result, and this is what ultimately results in more interoperability. It is crystal clear what any workflow means, in terms of the combination of well-understood operators. Moreover, the "algorithms" resulting from this are understandable at a more human-readable level. They could be implemented in a very similar way in other programming languages, provided the same operators (with the same functional meaning) are available in those languages.
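
For instance, there is no dedicated verb for a composed task such as "the average of only a subset of the values"; it is built by chaining the existing operators (a sketch, reusing the made-up scores data from above):

# no standalone function for this exception is needed; we compose it instead
scores %>%
  filter(value > 1) %>%
  group_by(group) %>%
  summarise(avg = mean(value))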

While dplyr may be consistent in itself about the individual task each function solves, things start to look different when taking the whole R ecosystem into account. Here, many functions carry out the same or a similar task, and many tasks can be carried out with several functions. However, some functions have been well designed (with a clear and unique use-case in mind), while others can be used in many different ways, depending on the context.

Select a column in a data frame

To exemplify where non-interoperable software might become problematic, I want to select a column from a data frame (simple, right?).

Anybody who started using R before dplyr or the tidyverse would suggest to "simply use that bracket function". Some witty people would even suggest using $, because "it’s cool", so let’s also think about that. Spoiler: [ and [[ are actually different functions, and whether or not a ',' is used also results in functionally different behaviour.

What is the difference between [ and [[? Well, their documentation states that [ can select more than one element, whereas [[ selects only a single element. Also $ selects only a single element, but due to the way it is invoked, this seems rather obvious. Let’s first look at [.

df <- data.frame(num = c(1, 2, 3, 4),
                 char = LETTERS[1:4],
                 bool = c(F, T, T, F), 
                 stringsAsFactors = FALSE)

# select columns by index; the first and second column
df[, c(1, 2)]                   
  num char
1   1    A
2   2    B
3   3    C
4   4    D
# select columns by name
df[, c("num", "char")]          
  num char
1   1    A
2   2    B
3   3    C
4   4    D
# selecting only one column with the same notation results in a different output format ...
df[, 1]
[1] 1 2 3 4
# ... except when using additional arguments (wait what... that's possible?)
df[, 1, drop = FALSE]
  num
1   1
2   2
3   3
4   4
# using the [[ function, which is intended for selecting a single element, we
# also only get the values, and not the whole column
df[[1]]
[1] 1 2 3 4

With this simple test, I gained the insight that [ is, by default, not a function that has been designed for an interoperable workflow without any restrictions. When providing only a single value to [, I don’t receive a column (which would imply that the output format is a data frame), but a vector of the contents of that column. I can’t use this function in any context where it is not known in advance how many columns I will have to select, without first running additional tests to inquire about the context (as in the following hypothetical function).

mySelect <- function(df, row, cols){
  if(length(cols) > 1){
    df[row, cols]               # here the default is sufficient
  } else {
    df[row, cols, drop = FALSE] # but in this case I require the additional argument 
  }
}
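
As a quick check (with the df defined above), both branches of this hypothetical helper now return a data frame, regardless of how many columns are selected:

# both calls yield a data frame, independent of the number of selected columns
class(mySelect(df, 1:4, c(1, 2)))   # "data.frame"
class(mySelect(df, 1:4, 1))         # "data.frame"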

As I could simply use the drop = FALSE argument all the time, this lack of interoperability may not look like a big deal right now. The example nevertheless underpins the point that, with the default of this function, the format of the output I get from this particular operation depends on my input (whether I select one or more columns). I would argue that it is a big deal. We always have to bear in mind that a program only knows what we tell it to know. When we tell our program, or expect by implication 3, that a downstream function requires a data frame with column names, that function will not be able to proceed when it gets a numeric vector instead.
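
To make this concrete, here is a minimal illustration (with the df from above) of how such an implicit expectation breaks: a downstream step that relies on column names works with the data frame output, but not with the dropped vector.

# the dropped vector has silently lost the column name
colnames(df[, 1, drop = FALSE])   # "num"
colnames(df[, 1])                 # NULL - any code relying on the name fails here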

"Like a matrix"

Moreover, to understand both [ and [[, we also need to know that we can index a data frame "like a matrix" using a ','. We thereby signal whether rows (anything before the comma), columns (anything after the comma), or a single cell (both positions) are selected. This "indexing like a matrix" presumably only makes sense once it’s clear that a data frame is, in fact, a special case of a list, so that indexing like a matrix is not the first thing to think about when indexing a data frame. That special case is characterised by only one level of nesting, the same number of elements per list item, and a specific way of being printed in the console. The other way of indexing a data frame (next to "like a matrix", i.e., with ',') is therefore indexing it like a list (i.e., without ',').
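
That a data frame really is such a special case of a list can be verified directly (a small check with the df from above):

# a data frame is internally a list of equal-length columns
is.list(df)    # TRUE
length(df)     # 3, the number of list items (columns)
lengths(df)    # 4 4 4, all list items contain the same number of elements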

When indexing like a list, it makes sense that no comma is involved, because this is how we treat lists.

lst <- list(num = c(1, 2, 3, 4),
            char = LETTERS[1:4], 
            bool = c(F, T, T, F))

# get element with its name
lst[1]
$num
[1] 1 2 3 4
# the same notation works for the data frame
df[1]
  num
1   1
2   2
3   3
4   4
# get only the values
lst[[1]]
[1] 1 2 3 4
df[[1]]
[1] 1 2 3 4

In the documentation, there is no clear statement that indexing like a matrix and indexing like a list are different functions. Thus I find it most intuitive to assume that "omitting the ','" is like providing only the first argument of the function call, as this is the same behaviour as in any other function in R. However, you have guessed it, this is not how [ works with the comma-notation, because columns are specified after the comma.

# indexing like a list
df[1]
  num
1   1
2   2
3   3
4   4
# indexing naively (like a matrix); now the same value selects the first row
df[1, ]
  num char  bool
1   1    A FALSE
# actually indexing like a matrix and keeping the name
df[, 1, drop = FALSE]
  num
1   1
2   2
3   3
4   4

One could say [ is not even interoperable with itself (depending on how the arguments are specified)!? I am aware that this example is very artificial, and anybody who has understood how "the comma-notation" works should be able to avoid that pitfall. Still, the fact that it even exists is further testimony to a lack of interoperability. As I touched on above, assuming it’s the same function might be the problem here. Perhaps we should simply not regard [ as the same function as [,, or not even as a special case of it – because their arguments contradict one another.
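
A quick check (again with the df from above) makes this contradiction explicit:

# the same value selects different things, depending on the presence of the ','
identical(df[1], df[1, ])   # FALSE
dim(df[1])                  # 4 1 - one index selects the first column
dim(df[1, ])                # 1 3 - with the comma, the same index selects the first row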

Select columns by name

The previous two issues are not even where the lack of interoperability ends or could conclusively be dealt with. When we want to select a column by its name, we can do that with both [ and [[. However, in case we were that witty person using $, we might get into trouble here at the latest. Some things are unknown at the time a function is built, such as the column names of the data frames a user provides. Thus, a function that is supposed to work with such dynamic data has to be compatible with unknown column names.

# select any column that is specified or determined elsewhere in a script or function
myColumn <- "num"

df[myColumn]
  num
1   1
2   2
3   3
4   4
df[[myColumn]]
[1] 1 2 3 4
df$myColumn                   # bummer!
NULL

With this additional test, I learned that not all of the "traditional" functions for the select operation can even handle the same input in the first place. This also hampers interoperability. In case I decide to use the $ function, I can only do this where the column name is already known while writing the code and can be typed out literally; $ does not evaluate its argument, so in the example above it looked for a column literally called "myColumn" instead of using the value of the variable.

Granted, in (some/many/most) cases this may be sufficient, depending on your use-case. However, I would argue, here again, that the code must be provided with routines that ensure all requirements are met (i.e., that make it interoperable) before it can successfully proceed.
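
One way to provide such a routine is a small helper that channels all dynamic selections through [[ and checks its requirements up front (a hypothetical sketch; getColumn is not an existing function):

# a hypothetical helper that accepts column names stored in variables
getColumn <- function(df, name){
  stopifnot(is.character(name), name %in% names(df))  # ensure requirements are met
  df[[name]]                                          # [[ evaluates its argument
}

getColumn(df, myColumn)
[1] 1 2 3 4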

Summing up

There are several ways of selecting a column from a data frame, some of which are more or less context-specific and thus not interoperable (in the sense of interchangeable).

# get the values as a column in a data frame
df["num"]
df[, "num", drop = FALSE]
df[myColumn]

# get the values as vector
df[, "num"]
df[["num"]]
df$num

Suppose you choose to go for the "traditional way" of coding with those functions. In that case, the result is almost certainly rather complex code that potentially requires a lot of time allocated to debugging. When you instead use modern, more interoperable functions, you can allocate more time to building an algorithm instead of managing implicit assumptions about data formats.

library(dplyr)

# select the column ...
select(df, num)
  num
1   1
2   2
3   3
4   4
# ... or pull its values
pull(df, num)
[1] 1 2 3 4

Interoperability can have many faces and can be interpreted from different perspectives; the one I complained about here is only one of them. While [ might have some problematic aspects (which seem to be mostly due to its implementation of indexing like a matrix), it is itself an attempt at making code more interoperable. The bracket function exists in many programming languages and typically serves as a tool to index arrays. As many objects are stored in arrays, it is a frequently used tool to access data or subsets thereof. Moreover, within R it allows accessing objects of different classes with the same notation.

vector <- c(1, 2, 3, 4)

df[1]
  num
1   1
2   2
3   3
4   4
lst[1]
$num
[1] 1 2 3 4
vector[1]
[1] 1
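
The same notation also covers matrices, i.e., actual arrays (a short sketch; mat is another made-up object):

mat <- matrix(1:4, nrow = 2)

mat[1]      # the first element, treating the matrix like a vector
[1] 1
mat[1, ]    # the first row
[1] 1 3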

In future blog posts, I will write about other aspects of interoperability, such as semantic interoperability, ontologies, more examples of (non-)interoperable code (such as the geometr package) and other related aspects.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".