Introduction

In DataFrames.jl a data frame can be grouped by columns. In this way its view is created. The type of this view is GroupedDataFrame and it allows for performing groupwise operations on the data stored in the original data frame.

If you use SQL then you can think of GroupedDataFrame as an object with is obtained by using GROUP BY command but without performing an aggregation step. This is often useful as after grouping data you might want to perform several different aggregation operations on it without having to group it each time.

In this post I want to discuss how filtering of GroupedDataFrame works in DataFrames.jl. In general there are two ways how you might want to filter it:

  • by dropping whole groups;
  • by dropping rows within group.

I will discuss both today.

The post was written under Julia 1.9.0 and DataFrames.jl 1.5.0.

Preparing the data

Start by creating a sample GroupedDataFrame:

julia> using DataFrames

julia> df = DataFrame(x = repeat(1:4, 2), id=1:8)
8×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4
   5 │     1      5
   6 │     2      6
   7 │     3      7
   8 │     4      8

julia> gdf = groupby(df, :x, sort=true);

julia> show(gdf, allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
Group 2 (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
Group 3 (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7
Group 4 (2 rows): x = 4
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     4      4
   2 │     4      8

Dropping whole groups

The first type of operation is when you want to drop whole groups from your grouped data frame.

First, assume that you want to drop the first and last group from gdf. You can achieve this using indexing:

julia> gdf[[2, 3]]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

julia> gdf[Not(begin, end)]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

Now assume that you want to drop groups in which the value of the grouping variable :x is even. You could still use indexing like this:

julia> gdf[isodd.(getproperty.(keys(gdf), :x))]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

However, a more general approach is to use the filter function:

julia> filter(sdf -> isodd(first(sdf.x)), gdf)
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

If you wanted to drop grouping of the result you can pass ungroup=true in filter:

julia> filter(sdf -> isodd(first(sdf.x)), gdf, ungroup=true)
4×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
   3 │     3      3
   4 │     3      7

Dropping rows within group

Now that you know how to drop whole groups you might wonder how to drop individual rows within group. Assume that we want to drop from each group all rows except the row with minimal value of :id within group. You can achieve this using the subset function:

julia> subset(gdf, :id => x -> x .== minimum(x))
4×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4

Note that by default subset ungroups data. If you want to keep it grouped pass ungroup=false:

julia> show(subset(gdf, :id => x -> x .== minimum(x), ungroup=false),
            allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (1 row): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
Group 2 (1 row): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
Group 3 (1 row): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
Group 4 (1 row): x = 4
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     4      4

Conclusions

I hope you found the examples using indexing, filter, and subset useful and they improved your understanding of row filtering options available in DataFrames.jl.

In particular remember that filter and subset have a different default value of the ungoup keyword argument. This difference was made to reflect the fact that when doing filter typically we do not want to ungroup data, while when doing subset typically we want to drop grouping.