March 31, 2017 by scott
Operating over many columns with dplyr
I had this problem recently: I had a wide data table with one ID (representing individual people) and many columns (representing properties of each person). It made sense to compare these properties: they were things like, “Does this person have disease X?”, “Disease Y?”, etc. So the records look like:
person 1; person 1 doesn’t have disease X; person 1 does have disease 2; etc.
The tidy data philosophy encourages “gathering” these disparate columns into multiple rows:
person 1; disease X; doesn’t have
person1; disease Y; does have
I wanted to ask, how many diseases does each person have? dplyr provides a nice way to do this with the tidied data via the group_by and summarize functions. Unfortunately, my table was too big to comfortably tidy it: I had millions of rows and around a dozen columns. I wanted a way to more efficiently do this summation without having to leave dplyr.
I found only one StackOverflow question that was closely related. After some work, I was able to generalize the first answer to solve my problem. The results are in my response to that question and in this gist. (The question is couched in frames of taking the maximum over multiple columns, but the logic is the same.) This requires some fancy footwork with lazy evaluation, formulas, and the interp function.