Today I want to comment on a recurring topic that DataFrames.jl users raise. The question is how one should transform multiple columns of a data frame using operation specification syntax.
The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.
In DataFrames.jl the combine, select, and transform functions allow users to pass requests for data transformation using operation specification syntax. This syntax is feature-rich, and you can find its description, for example, here. Today I want to focus on its principal concept.
In its general form, each request for an operation on data follows the (E)xtract-(T)ransform-(L)oad pattern. That means that we need to specify: (1) the source columns to extract, (2) the transformation function to apply, and (3) the target columns in which the result is stored.
These three parts are syntactically expressed using the following form:
[source columns specification] => [transformation function] => [target columns specification]
Let me give an example. Assume you have the following data:
julia> using DataFrames
julia> df = DataFrame(reshape(1:15, 5, 3), :auto)
5×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
We want to compute the sum of column "x1" and store it in a column named "x1_sum". Since the sum function performs the required aggregation, the operation specification should be:
"x1" => sum => "x1_sum"
Let us check it with the combine function:
julia> combine(df, "x1" => sum => "x1_sum")
1×1 DataFrame
Row │ x1_sum
│ Int64
─────┼────────
1 │ 15
In this syntax it is important to note two things: first, the "x1" column as a whole was passed to the sum function (as we want to compute the sum of its elements); second, the "x1" column was passed as a single positional argument to the sum function.

Two natural questions arise: (1) what if we want the function to be applied to each element of a column separately, and (2) what if we want to pass multiple columns to the transformation function? We will now investigate these two dimensions.
Vectorization in DataFrames.jl is easy. Just wrap the function you use in the ByRow object. Here is an example:
julia> combine(df, "x1" => string => "x1_str")
1×1 DataFrame
Row │ x1_str
│ String
─────┼─────────────────
1 │ [1, 2, 3, 4, 5]
julia> combine(df, "x1" => ByRow(string) => "x1_strs")
5×1 DataFrame
Row │ x1_strs
│ String
─────┼─────────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
5 │ 5
Note that "x1" => string => "x1_str" passed the whole "x1" column to the string function, so we got a single "[1, 2, 3, 4, 5]" string in the output. In contrast, "x1" => ByRow(string) => "x1_strs" passed each element of the "x1" column to the string function individually, so in the result we got a vector of five string representations of the numbers from the source.
Now let us have a look at passing multiple columns. There are two ways you can do it.
The first is when your function accepts multiple positional arguments. An example of such a function is string; see:
julia> string(df.x1, df.x2)
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"
If we pass a collection of columns as a source in operation specification syntax we get this behavior:
julia> combine(df, ["x1", "x2"] => string => "x1_x2_str")
1×1 DataFrame
Row │ x1_x2_str
│ String
─────┼─────────────────────────────────
1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]
Naturally, the above combines with vectorization. Therefore since:
julia> string.(df.x1, df.x2)
5-element Vector{String}:
"16"
"27"
"38"
"49"
"510"
we also have:
julia> combine(df, ["x1", "x2"] => ByRow(string) => "x1_x2_strs")
5×1 DataFrame
Row │ x1_x2_strs
│ String
─────┼────────────
1 │ 16
2 │ 27
3 │ 38
4 │ 49
5 │ 510
However, there are cases when we have a function that expects multiple columns to be passed as a single positional argument. This is handled in DataFrames.jl with the AsTable wrapper, which you can apply to the source columns. If you use it, then instead of getting multiple positional arguments the function will get a single positional argument that is a NamedTuple holding the source columns.
To convince ourselves that this is indeed what happens let us create a helper function:
julia> function helper(x)
@show x
return string(x.x1, x.x2)
end
helper (generic function with 1 method)
This helper function first prints its only argument x, then assumes that x has x1 and x2 fields and applies the string function to them. Let us first check it in practice:
julia> helper((x1=[1, 2, 3, 4, 5], x2=[6, 7, 8, 9, 10]))
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"
Now let us use the helper function with combine:
julia> combine(df, AsTable(["x1", "x2"]) => helper => "x1_x2_str")
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
1×1 DataFrame
Row │ x1_x2_str
│ String
─────┼─────────────────────────────────
1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]
Indeed, we see that helper got a named tuple holding two columns of the source data frame. Again, this syntax plays well with ByRow:
julia> combine(df, AsTable(["x1", "x2"]) => ByRow(helper) => "x1_x2_strs")
x = (x1 = 1, x2 = 6)
x = (x1 = 2, x2 = 7)
x = (x1 = 3, x2 = 8)
x = (x1 = 4, x2 = 9)
x = (x1 = 5, x2 = 10)
5×1 DataFrame
Row │ x1_x2_strs
│ String
─────┼────────────
1 │ 16
2 │ 27
3 │ 38
4 │ 49
5 │ 510
We see that this time helper got a separate named tuple for each row of the source data frame.
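A common practical use of combining AsTable with ByRow is computing row-wise aggregates across many columns. Here is a minimal sketch using the df defined above (the "row_sum" target column name is just an example choice):

```julia
using DataFrames

df = DataFrame(reshape(1:15, 5, 3), :auto)

# pass all columns as a named tuple to `sum`, row by row
combine(df, AsTable(:) => ByRow(sum) => "row_sum")
```

The result is a 5×1 data frame with row sums 18, 21, 24, 27, and 30.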
In summary, today we discussed two special operations in DataFrames.jl operation specification syntax: ByRow, which vectorizes the function passed to it, and AsTable, which allows us to pass source columns as a single named tuple to the transformation function (instead of passing them as consecutive positional arguments, which is the default).

I hope these examples were useful in helping you understand the design of operation specification syntax.
This is a follow-up to last week's post. We will continue discussing how one can work with GroupedDataFrame objects in DataFrames.jl. Today we focus on indexing of grouped data frames.
The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.
First create some grouped data frame:
julia> using DataFrames
julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> gdf = groupby(df, :str, sort=true)
GroupedDataFrame with 3 groups based on key: str
First Group (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
⋮
Last Group (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
It is sometimes useful to learn the group number, within a grouped data frame gdf, of each row of the source data frame df. You can easily get this information with groupindices:
julia> groupindices(gdf)
6-element Vector{Union{Missing, Int64}}:
1
1
3
3
2
2
A basic operation when indexing a GroupedDataFrame is to pick a group by its number. Here is an example:
julia> gdf[1]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
julia> gdf[2]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> gdf[3]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
Note that gdf behaves similarly to a vector. You can even use begin and end in indexing:
julia> gdf[begin]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
julia> gdf[end]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
Often you might want to extract a group not by its position in gdf, but by the value of the grouping variable or variables. In this case you can use a GroupKey, a dictionary, a tuple, or a named tuple to achieve this. Let us check how it works. Start with a dictionary, a tuple, and a named tuple:
julia> gdf[Dict("str" => "b")] # dictionary
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> gdf[("b",)] # tuple
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> gdf[(; str="b")] # named tuple
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
With GroupKey we first need to get it from keys, but everything else works the same:
julia> key = keys(gdf)[1]
GroupKey: (str = "a",)
julia> gdf[key]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
You might ask why we require passing the grouping value in a container (a dictionary, tuple, named tuple, or GroupKey) instead of passing the required value directly when indexing. The reason is that if you grouped your data by an integer column, the result would be ambiguous. Here is an example showing that under these rules there is no such ambiguity:
julia> gdf2 = groupby(df, :int, sort=false)
GroupedDataFrame with 3 groups based on key: int
First Group (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
⋮
Last Group (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> gdf2[3] # third group
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> gdf2[(3, )] # group with value of the grouping variable equal to 3
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
You now know how to pick a single group, so selecting multiple groups is a natural next step. You can use a collection of any of the selectors we have already discussed. Here are some examples:
julia> gdf[[3, 1]] # selection by group number
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
⋮
Last Group (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
julia> gdf[[("c",), ("a",)]] # selection by grouping variable value
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
⋮
Last Group (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Note that indexing allows both reordering and dropping groups, which often comes in handy when analyzing data. Also note that groupindices is aware of such changes:
julia> groupindices(gdf[[3, 1]])
6-element Vector{Union{Missing, Int64}}:
2
2
1
1
missing
missing
Here the group with "c" is first, the group with "a" is second, and the group with "b" is dropped, so missing is returned for its rows in the produced vector.
It is also worth remembering that subset and filter can be used with GroupedDataFrames. This topic is discussed in this post.
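As a quick reminder of that functionality, here is a minimal sketch (reusing the df and gdf defined above) that keeps only the groups containing the value 1 in the int column; the predicate is an example of my own choosing:

```julia
using DataFrames

df = DataFrame(int=[1, 3, 2, 1, 3, 2],
               str=["a", "a", "c", "c", "b", "b"])
gdf = groupby(df, :str, sort=true)

# `filter` with a `cols => predicate` pair keeps the whole groups
# for which the predicate returns `true`
filter(:int => v -> 1 in v, gdf)
```

Here the groups for "a" and "c" are kept, while the group for "b" is dropped; the result is again a GroupedDataFrame.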
Sometimes we do not want to index into a grouped data frame, but just to check if it contains some key. This is easily achievable with the haskey function:
julia> haskey(gdf, ("a",))
true
julia> haskey(gdf, ("z",))
false
In this post we discussed indexing of GroupedDataFrames. This concludes the basic tutorial on working with these data structures. I hope you will find the functionalities I have covered useful in your work.
One of the features of DataFrames.jl that I often find useful is that when you group a data frame by some of its columns, the resulting GroupedDataFrame is an object that gains new and useful functionalities. Some time ago I discussed how a GroupedDataFrame can be filtered; you can find that post here. In this post and the following one, which I plan to write next week, I review other key functionalities of a GroupedDataFrame.
The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.
You can create a GroupedDataFrame using the groupby function. Here are some examples:
julia> using DataFrames
julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> show(groupby(df, :int), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
Group 2 (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
Group 3 (2 rows): int = 3
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
julia> show(groupby(df, :int; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
Group 2 (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
Group 3 (2 rows): int = 3
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
julia> show(groupby(df, :int; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
Group 2 (2 rows): int = 3
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
Group 3 (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> show(groupby(df, :str), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Group 2 (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
Group 3 (2 rows): str = "b"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> show(groupby(df, :str; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Group 2 (2 rows): str = "b"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
Group 3 (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
julia> show(groupby(df, :str; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Group 2 (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
Group 3 (2 rows): str = "b"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
What this example shows is that the key thing you need to decide about a grouped data frame is the order of its groups. There are two options here: the groups can be sorted by the value of the grouping columns if you pass sort=true, or they can appear in their order of first appearance in the source data frame if you pass sort=false.
You might ask what happens if you do not pass the sort keyword argument. In this case either of the options may be used, depending on which one is faster. Therefore, omitting sort can be thought of as a signal that the user does not care about the order of groups but wants the grouping operation to be as fast as possible.
In some cases the order of groups is irrelevant (so you can safely skip passing it). The most important scenario of this kind is when you use the select or transform function with a GroupedDataFrame. The reason is that these functions always keep the order of rows from the source data frame (no matter how the groups are arranged in the GroupedDataFrame). However, this is not the case with combine, which respects the order of groups in the GroupedDataFrame.
Let us see an example highlighting the difference between these cases:
julia> select(groupby(df, :int, sort=true), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> combine(groupby(df, :int, sort=true), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
3 │ 2 c
4 │ 2 b
5 │ 3 a
6 │ 3 b
julia> select(groupby(df, :int, sort=false), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> combine(groupby(df, :int, sort=false), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
3 │ 3 a
4 │ 3 b
5 │ 2 c
6 │ 2 b
As you can see, select kept the rows in the order in which they are present in df, no matter whether we passed sort=true or sort=false. On the other hand, combine returns rows grouped by the groups, and the order of groups corresponds to their order in the GroupedDataFrame, so passing sort=true or sort=false in general changes the result.
When discussing select or combine in conjunction with a GroupedDataFrame, it is important to mention that there are four special cases of operation specification syntax designed specifically for working with them. They are: nrow, which computes the number of rows in each group; proprow, which computes the proportion of rows in each group; eachindex, which returns a vector holding the number of each row within its group; and groupindices, which returns the group number. Each of them optionally allows you to specify the name of the target column using the => syntax.
Here are some examples:
julia> combine(groupby(df, :int, sort=false), nrow)
3×2 DataFrame
Row │ int nrow
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 3 2
3 │ 2 2
julia> combine(groupby(df, :int, sort=false), proprow => "row %")
3×2 DataFrame
Row │ int row %
│ Int64 Float64
─────┼─────────────────
1 │ 1 0.333333
2 │ 3 0.333333
3 │ 2 0.333333
julia> combine(groupby(df, :int, sort=false), eachindex)
6×2 DataFrame
Row │ int eachindex
│ Int64 Int64
─────┼──────────────────
1 │ 1 1
2 │ 1 2
3 │ 3 1
4 │ 3 2
5 │ 2 1
6 │ 2 2
julia> combine(groupby(df, :int, sort=false), groupindices => "group #")
3×2 DataFrame
Row │ int group #
│ Int64 Int64
─────┼────────────────
1 │ 1 1
2 │ 3 2
3 │ 2 3
Apart from using functions such as select or combine on a GroupedDataFrame, it is useful to know that it supports iteration. Therefore you can use a GroupedDataFrame in a loop or in a comprehension. When iterated, a GroupedDataFrame returns data frames corresponding to the groups. Let us see:
julia> for v in groupby(df, :int, sort=false)
println(v)
end
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> [v for v in groupby(df, :int, sort=false)]
3-element Vector{SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}:
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> collect(groupby(df, :int, sort=false))
3-element Vector{Any}:
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
The last example has shown that you can pass a GroupedDataFrame to a function expecting an iterable, in this case the collect function. The one exception to this rule is that you cannot use a GroupedDataFrame with the map function directly:
julia> map(identity, groupby(df, :int, sort=false))
ERROR: ArgumentError: using map over `GroupedDataFrame`s is reserved
The reason is that it is not clear whether such an operation should produce a vector or a data frame, and it is easy enough to achieve both results by other means. If you want a vector, use e.g. a comprehension. If you want a data frame, use e.g. combine or select.
Sometimes, when iterating a GroupedDataFrame, we might be interested not only in a data frame per group, but also in the value of the grouping variable. This is easily achieved with the keys and pairs functions (depending on whether you want only the grouping values or both the grouping values and the data frames):
julia> map(identity, keys(groupby(df, :int, sort=false)))
3-element Vector{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}}:
GroupKey: (int = 1,)
GroupKey: (int = 3,)
GroupKey: (int = 2,)
julia> map(identity, pairs(groupby(df, :int, sort=false)))
3-element Vector{Pair{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}, SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}}:
GroupKey: (int = 1,) => 2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
GroupKey: (int = 3,) => 2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
GroupKey: (int = 2,) => 2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
I used the map function here to show that only its use with a plain GroupedDataFrame is reserved; it works with keys and pairs.
As you can see in this example, each group in a GroupedDataFrame is associated with a GroupKey. To get all keys use the keys function:
julia> keys(groupby(df, :int, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (int = 1,)
GroupKey: (int = 3,)
GroupKey: (int = 2,)
Let us, as an example, extract the last key to see how one can work with it:
julia> key = last(keys(groupby(df, :int, sort=false)))
GroupKey: (int = 2,)
You can get a value of the key by property access or indexing:
julia> key.int
2
julia> key[1]
2
julia> key["int"]
2
julia> key[:int]
2
It is also easy to convert a GroupKey to a dictionary, vector, Tuple, or NamedTuple if you need it:
julia> Dict(key)
Dict{Symbol, Int64} with 1 entry:
:int => 2
julia> collect(key)
1-element Vector{Int64}:
2
julia> Tuple(key)
(2,)
julia> NamedTuple(key)
(int = 2,)
Note that, in general, you can group a data frame by multiple columns, so you could query the value of any grouping column in the examples above. If you need to get a list of the grouping columns, use the groupcols function:
julia> groupcols(groupby(df, :int, sort=false))
1-element Vector{Symbol}:
:int
In this post we have learned how one can create a grouped data frame and how to choose the order of groups in it. As a follow-up we have shown how a GroupedDataFrame interacts with functions like select or combine. Next we discussed the iterator interface supported by GroupedDataFrame and how to get and use information about the values of grouping columns for each group. I hope you found these examples useful. In next week's post we will discuss how GroupedDataFrame supports the indexing interface.
Some functions provided in Base Julia support partial application. I often find this functionality useful, so in this post I want to explain it and summarize which functions have this property.
The post was tested with Julia Version 1.12.0-DEV.53.
We will focus on partial application of functions that take two positional arguments. Let us work by example. Consider the in function. You can call it to check if some item is in a collection. Here is an example:
julia> in('a', "Abracadabra")
true
julia> in('x', "Abracadabra")
false
A common pattern you might need is a repeated check of whether various items are contained in the same collection. For example, assume you have a vector of characters and you want to filter it to keep only the elements contained in a reference collection. You can do it like this:
julia> v = 'a':'z'
'a':1:'z'
julia> filter(x -> in(x, "Abracadabra"), v)
5-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
This pattern is so commonly needed that there is a shorthand for x -> in(x, "Abracadabra"). Instead of creating this anonymous function you can just write in("Abracadabra"). The value returned by this function call behaves in the same way as x -> in(x, "Abracadabra"). Let us check:
julia> filter(in("Abracadabra"), v)
5-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
You can think of this operation as partially applying the in function: we fix its second argument (the collection) and leave the first (the item we check) to be specified later. In other words, the following two operations are equivalent:
julia> in('a', "Abracadabra")
true
julia> in("Abracadabra")('a')
true
Fixing the second argument is most common. However, sometimes it is useful to fix the first argument. This is exactly the case for the filter function we have just used. What if you wanted to perform the filter(in("Abracadabra"), v) operation for multiple different values of v but with a fixed predicate function? Here is an example:
julia> vv = ['a'+i:'z' for i in 0:4]
5-element Vector{StepRange{Char, Int64}}:
'a':1:'z'
'b':1:'z'
'c':1:'z'
'd':1:'z'
'e':1:'z'
julia> map(v -> filter(in("Abracadabra"), v), vv)
5-element Vector{Vector{Char}}:
['a', 'b', 'c', 'd', 'r']
['b', 'c', 'd', 'r']
['c', 'd', 'r']
['d', 'r']
['r']
You probably see where this is going. Instead of v -> filter(in("Abracadabra"), v) we can write filter(in("Abracadabra")), fixing the first positional argument of filter and leaving the second to be specified later. Let us check that this works:
julia> map(filter(in("Abracadabra")), vv)
5-element Vector{Vector{Char}}:
['a', 'b', 'c', 'd', 'r']
['b', 'c', 'd', 'r']
['c', 'd', 'r']
['d', 'r']
['r']
Indeed, we get what we expected. Again, for a reference note that the following two operations are equivalent:
julia> filter(in("Abracadabra"), v)
5-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
julia> filter(in("Abracadabra"))(v)
5-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
Before I finish this section, let me note that if you do not like writing that many parentheses, you could use the |> operator. In our example we could write:
julia> map("Abracadabra" |> in |> filter, vv)
5-element Vector{Vector{Char}}:
['a', 'b', 'c', 'd', 'r']
['b', 'c', 'd', 'r']
['c', 'd', 'r']
['d', 'r']
['r']
Which style you use is a matter of preference.
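It is also worth knowing that this mechanism is exposed through the Base.Fix1 and Base.Fix2 wrapper types (for example, in(x) is defined as Base.Fix2(in, x)), so you can build a partially applied form of any two-argument function yourself. A small sketch:

```julia
# Base.Fix1 fixes the first positional argument, Base.Fix2 the second.
# Subtraction is used so that the argument order is visible.
f1 = Base.Fix1(-, 10)  # behaves like y -> 10 - y
f2 = Base.Fix2(-, 10)  # behaves like x -> x - 10

f1(3)  # 7
f2(3)  # -7
```

This is handy when the function you need is not among the ones with a built-in partial-application method.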
We saw that some functions taking two arguments support partial application. Below I give a list of all of them that are currently supported (this is the reason why the post was written under a Julia nightly build, as there were recent changes to this list).
There is only one function in Base Julia that supports fixing its first argument, and this function is filter. However, there are many functions supporting fixing of their second argument. Here is their list: the comparisons isequal, ==, !=, >=, <=, >, and <; the membership tests in, ∈, ∋, ∉, and ∌; the string predicates contains, occursin, endswith, and startswith; and the set predicates issubset, ⊆, ⊇, ⊈, ⊉, ⊊, ⊋, isdisjoint, and issetequal.

After reading this post you know how to use partial function application in Julia and which functions from Base support it. I hope you will find this functionality useful in your code.
Today I want to present a small benchmark of random number generation performance improvements between the current Julia release 1.10.1 and the current LTS version 1.6.7. The idea for the benchmark follows a discussion with a friend who needed to run some compute-intensive Julia code on the LTS version.
The post was written under Julia 1.10.1 and Julia 1.6.7.
Let us start by presenting the benchmark functions:
function test_rand1()
s = 0
for i in 1:10^9
s += rand(1:1_000_000)
end
return s
end
function test_rand2()
s = 0.0
for i in 1:10^9
s += rand()
end
return s
end
They are relatively simple. I wanted to compare the performance of: (1) integer generation from a range and (2) generation of floating point numbers from the [0, 1) interval, as these are the two most common scenarios in practice.
Let us see the results. First comes Julia 1.6.7:
julia> @time test_rand1()
4.949335 seconds (13 allocations: 35.406 KiB)
499993991047124
julia> @time test_rand1()
4.663646 seconds
499998112691460
julia> @time test_rand2()
2.175424 seconds
5.000141761909688e8
julia> @time test_rand2()
2.238839 seconds
4.9999424544883996e8
And now we have Julia 1.10.1:
julia> @time test_rand1()
2.355028 seconds
500001818410630
julia> @time test_rand1()
2.287840 seconds
499998082399284
julia> @time test_rand2()
1.123886 seconds
5.000026226340503e8
julia> @time test_rand2()
1.117811 seconds
4.9999201274214923e8
So we see that things run roughly two times faster. What is the reason for this difference? The major point is that between Julia 1.6.7 and Julia 1.10.1 the default random number generator was changed. Let us see (below I use copy to ensure explicit instantiation of the random number generator object under Julia 1.10.1). Again, first we test Julia 1.6.7:
julia> using Random
julia> copy(Random.default_rng())
MersenneTwister(0x2fe644ceb724000ca5e5b4409dc3c6ea, (0, 4502994048, 4502993046, 986, 2502992778, 986))
and next we check Julia 1.10.1:
julia> using Random
julia> copy(Random.default_rng())
Xoshiro(0x1273707731737276, 0x187b3d2e82fb1d48, 0x13f9fd1a82642acb, 0xa7dcba727da742e6, 0x3ed2b4d410aa4b31)
So indeed, we see that the MersenneTwister generator was replaced by the Xoshiro generator (to be exact, Xoshiro256++). This has one important consequence, apart from random number generation speed, related to seeding of the generator. Let us check. First Julia 1.6.7:
julia> Random.seed!(1)
MersenneTwister(1)
julia> rand()
0.23603334566204692
vs Julia 1.10.1:
julia> Random.seed!(1)
TaskLocalRNG()
julia> rand()
0.07336635446929285
This means that when you use the default random number generator you should not expect reproducibility of results between these two Julia versions. This is documented behavior: stream reproducibility is not ensured across Julia versions. If you need such reproducibility you can use e.g. the StableRNGs.jl package.
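A minimal sketch of how this could look (it assumes the StableRNGs.jl package is installed; StableRNG is the generator type it exports):

```julia
using StableRNGs

# a generator whose stream is intended to be stable
# across Julia and package versions
rng = StableRNG(1)

# pass the rng explicitly to rand to draw from it
x = rand(rng)
```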
The topic of changes in random number generation in Julia is probably well known to people doing compute-intensive simulations. However, I thought it was worth presenting these results for new users, who might be using different versions of Julia to execute the same code and wonder why the performance, or the results themselves, differ across them.
Today I wanted to switch back to a lighter subject, so I decided to have a look at my favorite Project Euler website. I picked problem 116, as I had not tried to solve it yet. Interestingly, it turned out that there are two ways to approach this puzzle, so I thought I would share them here.
The post was written under Julia 1.10.0.
Project Euler puzzle 116 can be briefly stated as follows: a row of 50 grey squares is to have a number of its tiles replaced with coloured oblong tiles chosen from red (length two), green (length three), or blue (length four). How many different ways can the grey tiles be replaced if colours cannot be mixed and at least one coloured tile must be used? (If you want to see some visual examples of valid tilings, I encourage you to visit the puzzle 116 page.)
When we think of this problem, it is natural to generalize it. By C(n, d) we denote the number of ways that n grey squares can be replaced with tiles of length d. Then the solution to our problem is C(n, 2) + C(n, 3) + C(n, 4) with n = 50. So let us focus on computing C(n, d) (assuming d is positive).
The first approach is to ask how many tiles of length d can be placed. There must be at least 1, and we cannot place more than n ÷ d (here I use the ÷ notation from Julia, which denotes integer division; in other words, the integer part of n / d).
So now assume that we want to place i blocks of length d (assuming i is valid). In how many ways can we do it? Well, we place i long blocks and we are left with n - d*i grey blocks. In total we have i + (n - d*i) blocks. You can think of it as having that many slots, from which you need to pick i slots for the long blocks. The number of ways you can do this is given by the binomial coefficient. In Julia notation it is: binomial(BigInt(i + (n - d*i)), BigInt(i)).
Now you might ask why I put the BigInt wrapper around the passed numbers. The reason is that the binomial coefficient gets large pretty quickly, so I want to make sure I will not have issues with integer overflow.
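To see the overflow risk concretely, here is a small sketch (not from the original post): with plain Int64 arguments Julia's binomial throws an OverflowError once the result no longer fits, while BigInt arguments are safe.

```julia
# C(100, 50) ≈ 1.0e29 vastly exceeds typemax(Int64) ≈ 9.2e18.
@assert binomial(big(100), big(50)) > typemax(Int64)
# The Int64 version detects this and throws instead of silently wrapping.
@assert try
    binomial(100, 50)
    false
catch e
    e isa OverflowError
end
```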
Given these considerations, the first function producing C(n, d) can be defined as:
function C1(n::Integer, d::Integer)
@assert d > 0 && n >= 0
return sum(i -> binomial(BigInt(i + (n - d*i)), BigInt(i)), 1:n ÷ d; init=big"0")
end
Note that I use the init=big"0" initialization statement in sum to ensure correct handling of the case n < d, in which we are given an empty collection to sum over.
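A minimal sketch of why init is needed (assuming n < d, so the iteration range is empty):

```julia
# Without `init`, `sum` of a mapped function over an empty collection
# cannot infer a neutral element for an arbitrary function and errors.
# With `init=big"0"` we get a well-typed zero instead.
s = sum(i -> binomial(big(5), big(i)), 1:0; init=big"0")
@assert s == 0
@assert s isa BigInt
```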
However, there is a different way to think about computing C(n, d).
Assume we know the values of C(n, d) for values of n smaller than the requested one. We look at the last tile in our row. If it is empty, then we are down to n-1 tiles to be filled. This can be done in C(n-1, d) ways (remember that this value takes care of the fact that at least one block of length d has to be used).
But what if the last tile in our row is covered by a block of length d? Then we have two options: either all other tiles are left grey (which gives us 1 combination), or the remaining n-d tiles are filled with at least one block of length d. The second value is exactly C(n-d, d). In summary, we get that C(n, d) = C(n-1, d) + C(n-d, d) + 1.
This formula assumes n is at least d. Clearly, for n < d we have 0 ways to arrange the blocks.
Let us write down the code that performs the required computation:
function C2(n::Integer, d::Integer)
@assert d > 0 && n >= 0
npos = Dict{Int,BigInt}(i => 0 for i in 0:d-1)
for j in d:n
npos[j] = npos[j-1] + npos[j-d] + 1
end
return npos[n]
end
Note that in the code I used the npos dictionary to flexibly allow for any potential integer values of n. The dictionary has type Dict{Int,BigInt}, again to ensure that the results of the computations are stored correctly even if they are large.
Now we have two functions, C1 and C2, that look completely different. Do they produce the same results? Let us check:
julia> using Test
julia> @testset "test C1 and C2 equality" begin
for n in 0:200, d in 1:20
@test C1(n, d) == C2(n, d)
end
end;
Test Summary: | Pass Total Time
test C1 and C2 equality | 4020 4020 0.9s
Indeed, we see that both the C1 and C2 functions produce the same results.
To convince ourselves that using arbitrary precision integers was indeed needed let us check some example values of the functions:
julia> C1(200, 2)
453973694165307953197296969697410619233825
julia> C2(200, 2)
453973694165307953197296969697410619233825
julia> typemax(Int)
9223372036854775807
Indeed, if we were not careful, we would have an integer overflow issue.
As usual, I will not show the value of the solution to the problem, to encourage you to run the code yourself. You can get it by executing either sum(d -> C1(50, d), 2:4) or sum(d -> C2(50, d), 2:4). (We have just checked that the value produced in both cases is the same.)
I have written in the past about DataFrames.jl operation specification syntax (also called minilanguage), see for example this post or this post.
Today I want to discuss one design decision made in this minilanguage and its consequences. It is related to how vectors are handled when they are returned by a transformation function.
The post was written under Julia 1.10.0 and DataFrames.jl 1.6.1.
Consider the following example, where we want to compute a profit from some sales data:
julia> using DataFrames
julia> df = DataFrame(name=["A", "B", "C"],
revenue=[10, 20, 30],
cost=[5, 12, 18])
3×3 DataFrame
Row │ name revenue cost
│ String Int64 Int64
─────┼────────────────────────
1 │ A 10 5
2 │ B 20 12
3 │ C 30 18
julia> combine(df, All(), ["revenue", "cost"] => (-) => "profit")
3×4 DataFrame
Row │ name revenue cost profit
│ String Int64 Int64 Int64
─────┼────────────────────────────────
1 │ A 10 5 5
2 │ B 20 12 8
3 │ C 30 18 12
The crucial point to understand here is that the - function takes two columns, "revenue" and "cost", and returns a vector. Users typically expect, as in this example, that this vector will be spread across several rows.
However, there are cases when we might not want to spread a vector into multiple rows. Consider, for example, a transformation in which we want to put the "revenue" and "cost" values in a 2-element vector per product. Intuitively we could write something like:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
ERROR: ArgumentError: all functions must return vectors of the same length
Unfortunately, we get an error. We will soon understand why, but before I proceed let me comment on the [only(x), only(y)] part of the definition. The only function makes sure that we have exactly one row per product.
To diagnose the issue, let us drop the All() part of our call:
julia> combine(groupby(df, :name),
["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
6×2 DataFrame
Row │ name vec
│ String Int64
─────┼───────────────
1 │ A 10
2 │ A 5
3 │ B 20
4 │ B 12
5 │ C 30
6 │ C 18
Now we understand the problem. Because our function returns a vector, it gets spread over several rows (which leads to an error, as the other columns of df have a different length).
As I have said above, most of the time vector spreading is a desired feature, but in the example we have just studied it is not wanted. For such cases DataFrames.jl allows you to protect vectors from being spread. What you need to do is call the Ref function on the returned value. This will protect the result from being spread:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> Ref([only(x), only(y)])) => "vec")
3×4 DataFrame
Row │ name revenue cost vec
│ String Int64 Int64 Array…
─────┼──────────────────────────────────
1 │ A 10 5 [10, 5]
2 │ B 20 12 [20, 12]
3 │ C 30 18 [30, 18]
Now, as we wanted, the entries of the "vec" column are vectors. Wrapping the return value of our function with Ref protected the vectors from being spread. An alternative function that you could use to get the same effect is fill:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> fill([only(x), only(y)])) => "vec")
3×4 DataFrame
Row │ name revenue cost vec
│ String Int64 Int64 Array…
─────┼──────────────────────────────────
1 │ A 10 5 [10, 5]
2 │ B 20 12 [20, 12]
3 │ C 30 18 [30, 18]
or you could wrap the return value in another pair of brackets [...]:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> [[only(x), only(y)]]) => "vec")
3×4 DataFrame
Row │ name revenue cost vec
│ String Int64 Int64 Array…
─────┼──────────────────────────────────
1 │ A 10 5 [10, 5]
2 │ B 20 12 [20, 12]
3 │ C 30 18 [30, 18]
What is going on here? In all three cases (Ref, fill, and [...]) we are wrapping a vector in another object that works like an outer vector. In the case of [...] it is just a vector, fill produces a 0-dimensional array, and Ref creates a wrapper that behaves like a 0-dimensional array. In all cases DataFrames.jl treats this outer wrapper as a 1-element collection and just stores its contents in a single row (because there is one element to store).
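To make the wrapper behavior concrete, here is a small REPL sketch (not part of the original post):

```julia
v = [10, 5]
@assert Ref(v)[] === v        # Ref unwraps with [] back to the same vector
z = fill(v)                   # fill with no dimensions: 0-dimensional Array
@assert ndims(z) == 0 && z[] === v
@assert length(z) == 1        # hence treated as a 1-element collection
@assert [v][1] === v          # extra brackets give a 1-element Vector
```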
I hope that you will find the example I gave today useful when transforming vectors using DataFrames.jl.
Today I wanted to discuss a conceptual aspect of Julia programming: how you should query an object for its properties. The topic is especially relevant if you want to write code that is expected to be stable in the long term, that is, easy to maintain as the versions of its dependencies change.
The post was written under Julia 1.10.0 and DataFrames.jl 1.6.1.
A fundamental element of Julia's design is composite types. An object of such a type is a collection of named fields, each of which can hold some value.
To make things non-abstract, let us have a look at the SubDataFrame type from DataFrames.jl. First create an instance of such an object:
julia> using DataFrames
julia> df = DataFrame(x=1:3, y=11:13, z=111:113)
3×3 DataFrame
Row │ x y z
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 11 111
2 │ 2 12 112
3 │ 3 13 113
julia> sdf = @view df[1:2, 1:2]
2×2 SubDataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
To check what fields SubDataFrame contains you can use the fieldnames function:
julia> fieldnames(SubDataFrame)
(:parent, :colindex, :rows)
Note that we pass a type to fieldnames. This is important: the list of fields is fixed for every instance of an object of a given type.
In this case we learned that SubDataFrame has three fields. Three functions associated with fieldnames are: fieldcount, returning the number of fields of a type; fieldtypes, returning their declared types; and hasfield, allowing you to query whether a specific field is present. Here is an example:
julia> fieldcount(SubDataFrame)
3
julia> fieldtypes(SubDataFrame)
(AbstractDataFrame, DataFrames.AbstractIndex, AbstractVector{Int64})
julia> hasfield(SubDataFrame, :parent)
true
julia> hasfield(SubDataFrame, :parentx)
false
For a given instance of a type you can query a field with getfield and set it with setfield!. For example, let us get the :parent field of our sdf object (the source data frame in this case):
julia> getfield(sdf, :parent)
3×3 DataFrame
Row │ x y z
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 11 111
2 │ 2 12 112
3 │ 3 13 113
Having learned all these methods, you might ask yourself when to use them. The short answer is:
Never directly access fields of a type. They might be changed between versions of the code you use without warning.
The longer answer is that direct field access is typically considered internal. The list of fields and their types is an implementation detail, and as a user of a type you should not rely on them. Field access is reserved for the designers of a type, to allow them to manipulate its inner physical representation.
So how should we work with composite types then?
Julia introduces the concept of a property, which is a logical representation of the data stored in a given object. You can query the properties of an object with the propertynames function. You also have the hasproperty, getproperty, and setproperty! functions, similar to those for fields.
In the case of our sdf SubDataFrame we have the following logical representation:
julia> propertynames(sdf)
2-element Vector{Symbol}:
:x
:y
julia> hasproperty(sdf, :x)
true
julia> getproperty(sdf, :x)
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
1
2
julia> setproperty!(sdf, :x, [1001, 1002])
2-element Vector{Int64}:
1001
1002
julia> sdf
2×2 SubDataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1001 11
2 │ 1002 12
We immediately see a significant difference. The properties of sdf are in this case the columns of our data frame. We do not care how they are mapped to the physical representation of SubDataFrame; this is taken care of by the designers of the DataFrames.jl package.
There are the following important aspects of properties.
The first is that property access is typically considered a public API. Designers of a type should make sure that the way you access properties of an object remains stable, and a change in this area is considered breaking. So:
You should access properties of objects in your code (not fields).
The second is that properties are bound to an object, not to a type. This means that different objects of the same type may have different sets of properties. This is quite useful; e.g., each data frame can have a different set of columns.
The third, practical, information is that by default properties fall back to fields, as you can read here in the Julia Manual.
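A minimal sketch of this default fallback for a plain struct (the type name is mine, used only for illustration):

```julia
struct PointDemo  # hypothetical type, not from the post
    x::Int
    y::Int
end

p = PointDemo(1, 2)
# By default `getproperty` forwards to `getfield`, so for plain structs
# fields and properties coincide.
@assert p.x == getfield(p, :x) == 1
@assert propertynames(p) == fieldnames(PointDemo) == (:x, :y)
```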
The next aspect is convenient syntax. You do not need to call the getproperty and setproperty! functions explicitly. getproperty(a, :b) is equivalent to a.b, and setproperty!(a, :b, v) is the same as a.b = v.
Finally, note that the propertynames function optionally takes a second positional argument of type Bool. If it is passed and set to true, you get a list of all properties of an object. By default the second argument is false and you get a list of only the public properties (and in practice you should use the default mode).
Today I have a short conclusion.
Fields represent physical layout of a type. Properties represent a logical view of an object.
In your code, use object properties and not their fields. Field access is considered internal and should typically be done only by the developers of the package providing a given object.
This week I have discussed with my colleague the Lichess puzzle dataset that I use in my Julia for Data Analysis book.
The dataset contains a list of puzzles along with information about them, such as puzzle difficulty, puzzle solution, and tags describing puzzle type.
We were discussing whether the tags assigned to puzzles in this dataset are accurate. In this post I give you an example of how one can check this (and practice a bit of CSV.jl and DataFrames.jl).
The post was written under Julia 1.10.0, CSV.jl 0.10.12, and DataFrames.jl 1.6.1.
In this post I show you relatively brief code. Therefore I assume that you first download the file with the puzzle dataset and unpack it manually. (In the book I show how to do it using Julia. You can find the source code in the GitHub repository of the book.)
Assuming you downloaded and unpacked the dataset into the puzzles.csv file, we read it in. We are interested only in columns 3 and 8 of this file, so I use the following commands:
julia> using CSV
julia> using DataFrames
julia> df = CSV.read("puzzles.csv", DataFrame; select=[3, 8], header=false)
2132989×2 DataFrame
Row │ Column3 Column8
│ String String
─────────┼──────────────────────────────────────────────────────────────────────
1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1 crushing hangingPiece long middl…
2 │ d3d6 f8d8 d6d8 f6d8 advantage endgame short
3 │ b6c5 e2g4 h3g4 d1g4 advantage middlegame short
4 │ g5e7 a5c3 b2c3 c6e7 advantage master middlegame short
5 │ e8f7 e2e6 f7f8 e6f7 mate mateIn2 middlegame short
6 │ a6a5 e5c7 a5b4 c7d8 crushing endgame fork short
7 │ d4b6 f6e4 h1g1 e4f2 crushing endgame short trappedPi…
8 │ d8f6 d1h5 h7h6 h5c5 advantage middlegame short
⋮ │ ⋮ ⋮
2132982 │ d2c2 c5d3 c2d3 c4d3 crushing fork middlegame short
2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7 crushing long middlegame quietMo…
2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6 crushing defensiveMove endgame l…
2132985 │ g1h1 e3e1 f7f1 e1f1 endgame mate mateIn2 short
2132986 │ g5c1 d5d6 d7f6 h7h8 advantage middlegame short
2132987 │ d2f3 d8a5 c1d2 a5b5 advantage fork opening short
2132988 │ f7f2 b2c2 c1b1 e2d1 endgame mate mateIn2 queensideAt…
2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3 advantage long opening
2132973 rows omitted
julia> rename!(df, ["moves", "tags"])
2132989×2 DataFrame
Row │ moves tags
│ String String
─────────┼──────────────────────────────────────────────────────────────────────
1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1 crushing hangingPiece long middl…
2 │ d3d6 f8d8 d6d8 f6d8 advantage endgame short
3 │ b6c5 e2g4 h3g4 d1g4 advantage middlegame short
4 │ g5e7 a5c3 b2c3 c6e7 advantage master middlegame short
5 │ e8f7 e2e6 f7f8 e6f7 mate mateIn2 middlegame short
6 │ a6a5 e5c7 a5b4 c7d8 crushing endgame fork short
7 │ d4b6 f6e4 h1g1 e4f2 crushing endgame short trappedPi…
8 │ d8f6 d1h5 h7h6 h5c5 advantage middlegame short
⋮ │ ⋮ ⋮
2132982 │ d2c2 c5d3 c2d3 c4d3 crushing fork middlegame short
2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7 crushing long middlegame quietMo…
2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6 crushing defensiveMove endgame l…
2132985 │ g1h1 e3e1 f7f1 e1f1 endgame mate mateIn2 short
2132986 │ g5c1 d5d6 d7f6 h7h8 advantage middlegame short
2132987 │ d2f3 d8a5 c1d2 a5b5 advantage fork opening short
2132988 │ f7f2 b2c2 c1b1 e2d1 endgame mate mateIn2 queensideAt…
2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3 advantage long opening
2132973 rows omitted
Note that the file does not have a header, so when reading it we passed header=false and then manually named the columns using rename!.
I wanted only these two columns since today I want to check whether the tags related to mating are accurate. You can notice in the printout above that in the "tags" column we have a tag "mateIn2". It indicates that the puzzle is a mate in two moves. This is the case, for example, for rows 5, 2132985, and 2132988. In the matching "moves" column we see that we have 4 corresponding moves. The reason is that both players make moves (and 2 + 2 = 4).
What we want to check is whether these "mateInX" tags are correct. I will check the values of X from 1 to 5 (as only these five options are present in the tags; I leave verifying this to you as an exercise).
When should we call the tags correct? There are two conditions:
each puzzle should have at most one "mateInX" tag (e.g. it cannot be tagged "mateIn1" and "mateIn2" at the same time);
a puzzle tagged "mateInX" should have 2X moves recorded (as both players move).
Let us check it.
As a first step we transform (in place, i.e. modifying our df data frame) the original columns into a more convenient form. Instead of the raw "moves" I want an "nmoves" column that gives me the number of moves in the puzzle. Similarly, instead of "tags" I want indicator columns "mateInX", for X ranging from 1 to 5, showing the puzzle type. Here is how you can achieve this:
julia> select!(df,
"moves" => ByRow(length∘split) => "nmoves",
["tags" => ByRow(contains("mateIn$i")) => "mateIn$i" for i in 1:5])
2132989×6 DataFrame
Row │ nmoves mateIn1 mateIn2 mateIn3 mateIn4 mateIn5
│ Int64 Bool Bool Bool Bool Bool
─────────┼─────────────────────────────────────────────────────
1 │ 6 false false false false false
2 │ 4 false false false false false
3 │ 4 false false false false false
4 │ 4 false false false false false
5 │ 4 false true false false false
6 │ 4 false false false false false
7 │ 4 false false false false false
8 │ 4 false false false false false
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
2132982 │ 4 false false false false false
2132983 │ 6 false false false false false
2132984 │ 6 false false false false false
2132985 │ 4 false true false false false
2132986 │ 4 false false false false false
2132987 │ 4 false false false false false
2132988 │ 4 false true false false false
2132989 │ 6 false false false false false
2132973 rows omitted
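As a side note, the two building blocks used in this select! call can be checked in isolation (a small sketch with made-up inputs):

```julia
# `contains(needle)` is curried: it returns a predicate function,
# which ByRow then applies to each entry of the "tags" column.
is_mate_in_2 = contains("mateIn2")
@assert is_mate_in_2("mate mateIn2 middlegame short")
@assert !is_mate_in_2("advantage endgame short")

# `length∘split` counts the whitespace-separated moves in a "moves" entry.
@assert (length∘split)("e8f7 e2e6 f7f8 e6f7") == 4
```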
Now we see that some of the rows are not tagged as "mateInX". Let us filter them out, to have only the tagged rows left (again, we do the operation in place):
julia> filter!(row -> any(row[Not("nmoves")]), df)
491743×6 DataFrame
Row │ nmoves mateIn1 mateIn2 mateIn3 mateIn4 mateIn5
│ Int64 Bool Bool Bool Bool Bool
────────┼─────────────────────────────────────────────────────
1 │ 4 false true false false false
2 │ 4 false true false false false
3 │ 2 true false false false false
4 │ 4 false true false false false
5 │ 2 true false false false false
6 │ 4 false true false false false
7 │ 4 false true false false false
8 │ 2 true false false false false
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
491736 │ 6 false false true false false
491737 │ 4 false true false false false
491738 │ 2 true false false false false
491739 │ 4 false true false false false
491740 │ 2 true false false false false
491741 │ 2 true false false false false
491742 │ 4 false true false false false
491743 │ 4 false true false false false
491727 rows omitted
Note that in the condition I used the row[Not("nmoves")] selector, as I wanted to check all columns except "nmoves".
Now we are ready to check the correctness of tags:
julia> combine(groupby(df, "nmoves"), Not("nmoves") .=> sum)
10×6 DataFrame
Row │ nmoves mateIn1_sum mateIn2_sum mateIn3_sum mateIn4_sum mateIn5_sum
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 2 136843 0 0 0 0
2 │ 4 0 274135 0 0 0
3 │ 6 0 0 68623 0 0
4 │ 8 0 0 0 9924 0
5 │ 10 0 0 0 0 1691
6 │ 12 0 0 0 0 367
7 │ 14 0 0 0 0 127
8 │ 16 0 0 0 0 25
9 │ 18 0 0 0 0 7
10 │ 20 0 0 0 0 1
The table reads as follows:
the tags "mateInX" for X in the 1 to 4 range are correct (a puzzle tagged "mateInX" always has exactly 2X moves);
the "mateIn5" tag actually means a situation where there are five or more moves.
So the verdict is that the tagging is correct, but we need to know the interpretation of the "mateIn5" column, as it actually means five or more moves. We could rename the column to e.g. "mateIn5+" to reflect that, or add metadata to our df table where we would store this information (I leave this to you as an exercise).
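For the metadata variant, DataFrames.jl supports table-level metadata; a possible sketch (the key name and note text are my invention):

```julia
using DataFrames

df = DataFrame(mateIn5=[true, false])
# Attach a note explaining the column's actual meaning; :note-style
# metadata is propagated through transformations of the data frame.
metadata!(df, "mateIn5_meaning", "mate in five or more moves"; style=:note)
@assert metadata(df, "mateIn5_meaning") == "mate in five or more moves"
```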
I hope that CSV.jl and DataFrames.jl users found the examples that I gave today useful and interesting. Enjoy!
Today I decided to write about code refactoring in Julia. This is a topic where, in my experience, Julia has quite a big advantage.
A common situation you face when writing code is the following. You need some functionality in your program and it is available in some library. However, what the library provides does not meet your expectations. Since in Julia most packages are written in Julia under the MIT license, it is easy to solve this issue: you just take the source code and modify it.
Today I want to show you a practical example of such a situation that I had this week when working with the Graphs.jl package.
The post was written using Julia 1.10.0, BenchmarkTools.jl 1.4.0, and Graphs.jl 1.9.0.
In my work I needed to generate random geometric graphs. This is a simple random graph model that works as follows (here I describe the general idea; for details please check the Wikipedia entry on random geometric graphs). To generate a graph on N vertices you first drop N random points in some metric space. Next, you connect two points with an edge if their distance is less than some pre-specified cutoff.
The Graphs.jl library provides the euclidean_graph function that generates such graphs. Here is a summary of its docstring:
euclidean_graph(N, d; rng=nothing, seed=nothing, L=1., p=2., cutoff=-1., bc=:open)
Generate N uniformly distributed points in the box [0,L]^{d}
and return a Euclidean graph, a map containing the distance on each
edge and a matrix with the points' positions.
An edge between vertices x[i] and x[j] is inserted if norm(x[i]-x[j], p) < cutoff.
In case of negative cutoff instead every edge is inserted.
Set bc=:periodic to impose periodic boundary conditions in the box [0,L]^d.
So what is the problem with this function? Unfortunately, it is slow. Let us, for example, check how long it takes to compute the average degree of a node in such a graph with n nodes and cutoff=sqrt(10/n), when setting bc=:periodic (periodic boundary conditions, i.e. distance measured on a torus) in two-dimensional space.
julia> using Graphs
julia> for n in 1_000:1_000:10_000
println(@time ne(euclidean_graph(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n)
end
0.091657 seconds (2.50 M allocations: 193.170 MiB, 14.06% gc time)
15.604
0.300285 seconds (10.00 M allocations: 765.801 MiB, 12.59% gc time)
15.661
0.686230 seconds (22.50 M allocations: 1.686 GiB, 12.47% gc time)
15.744
1.175881 seconds (40.00 M allocations: 2.990 GiB, 10.97% gc time)
15.7065
1.800561 seconds (62.50 M allocations: 4.666 GiB, 10.76% gc time)
15.6568
2.697535 seconds (90.00 M allocations: 6.716 GiB, 14.76% gc time)
15.641333333333334
3.690599 seconds (122.50 M allocations: 9.138 GiB, 13.28% gc time)
15.743571428571428
4.745701 seconds (160.00 M allocations: 11.932 GiB, 13.53% gc time)
15.714
5.962431 seconds (202.51 M allocations: 15.099 GiB, 12.70% gc time)
15.722222222222221
7.195257 seconds (250.01 M allocations: 18.638 GiB, 11.42% gc time)
15.7086
We can see that the euclidean_graph function scales badly with n. Note that by choosing cutoff=sqrt(10/n) we get a roughly constant average degree, so the number of edges generated scales linearly, but the generation time seems to grow much faster.
To find the source of the problem we can investigate the source of euclidean_graph, which consists of two methods:
function euclidean_graph(
N::Int,
d::Int;
L=1.0,
rng::Union{Nothing,AbstractRNG}=nothing,
seed::Union{Nothing,Integer}=nothing,
kws...,
)
rng = rng_from_rng_or_seed(rng, seed)
points = rmul!(rand(rng, d, N), L)
return (euclidean_graph(points; L=L, kws...)..., points)
end
function euclidean_graph(points::Matrix; L=1.0, p=2.0, cutoff=-1.0, bc=:open)
d, N = size(points)
weights = Dict{SimpleEdge{Int},Float64}()
cutoff < 0.0 && (cutoff = typemax(Float64))
if bc == :periodic
maximum(points) > L && throw(
DomainError(maximum(points), "Some points are outside the box of size $L")
)
end
for i in 1:N
for j in (i + 1):N
if bc == :open
Δ = points[:, i] - points[:, j]
elseif bc == :periodic
Δ = abs.(points[:, i] - points[:, j])
Δ = min.(L .- Δ, Δ)
else
throw(ArgumentError("$bc is not a valid boundary condition"))
end
dist = norm(Δ, p)
if dist < cutoff
e = SimpleEdge(i, j)
weights[e] = dist
end
end
end
g = Graphs.SimpleGraphs._SimpleGraphFromIterator(keys(weights), Int)
if nv(g) < N
add_vertices!(g, N - nv(g))
end
return g, weights
end
The beauty of Julia is that this source is written in Julia and is pretty short. It immediately allows us to pinpoint the source of our problems. The core of the work is done in a double loop iterating over the i and j indices, so the complexity of this algorithm is quadratic in the number of vertices.
The second beauty of Julia is that we can easily fix this. The idea can be found in the Wikipedia entry on random geometric graphs in the algorithms section here.
A simple way to improve the performance of the algorithm is to notice that if you know L and cutoff you can partition the space into a grid of equal-sized cells, with floor(Int, L / cutoff) cells along each dimension (so each cell has side length at least cutoff). Now you see that if you have a vertex in some cell, then it can be connected only to nodes in the same cell or in cells directly adjacent to it (cells farther away contain only points that must be farther than cutoff from our point). This means that we will have a much lower number of points to consider. Below I show code that is a modification of the original source adding this feature. The key added function is to_buckets, which computes the bucket identifier for each vertex and creates a dictionary mapping bucket identifiers to vectors of the node numbers that fall into them:
using LinearAlgebra
using Random
function euclidean_graph2(
N::Int,
d::Int;
L=1.0,
rng::Union{Nothing,AbstractRNG}=nothing,
seed::Union{Nothing,Integer}=nothing,
kws...,
)
rng = Graphs.rng_from_rng_or_seed(rng, seed)
points = rmul!(rand(rng, d, N), L)
return (euclidean_graph2(points; L=L, kws...)..., points)
end
function to_buckets(points::Matrix, L, cutoff)
d, N = size(points)
dimlen = max(floor(Int, L / max(cutoff, eps())), 1)
buckets = Dict{Vector{Int}, Vector{Int}}()
for (i, point) in enumerate(eachcol(points))
bucket = floor.(Int, point .* dimlen ./ L)
push!(get!(() -> Int[], buckets, bucket), i)
end
return buckets, dimlen
end
function euclidean_graph2(points::Matrix; L=1.0, p=2.0, cutoff=-1.0, bc=:open)
d, N = size(points)
weights = Dict{Graphs.SimpleEdge{Int},Float64}()
cutoff < 0.0 && (cutoff = typemax(Float64))
if bc == :periodic
maximum(points) > L && throw(
DomainError(maximum(points), "Some points are outside the box of size $L")
)
end
buckets, dimlen = to_buckets(points, L, cutoff)
deltas = collect(Iterators.product((-1:1 for _ in 1:size(points, 1))...))
void = Int[]
for (k1, v1) in pairs(buckets)
for i in v1
for d in deltas
k2 = bc == :periodic ? mod.(k1 .+ d, dimlen) : k1 .+ d
v2 = get(buckets, k2, void)
for j in v2
i < j || continue
if bc == :open
Δ = points[:, i] - points[:, j]
elseif bc == :periodic
Δ = abs.(points[:, i] - points[:, j])
Δ = min.(L .- Δ, Δ)
else
throw(ArgumentError("$bc is not a valid boundary condition"))
end
dist = norm(Δ, p)
if dist < cutoff
e = Graphs.SimpleEdge(i, j)
weights[e] = dist
end
end
end
end
end
g = Graphs.SimpleGraphs._SimpleGraphFromIterator(keys(weights), Int)
if nv(g) < N
add_vertices!(g, N - nv(g))
end
return g, weights
end
Note that it took fewer than 30 additional lines of code to add the requested feature. Let us test the new code:
julia> for n in 1_000:1_000:10_000
println(@time ne(euclidean_graph2(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n)
end
0.017221 seconds (274.10 k allocations: 21.751 MiB, 22.15% gc time)
15.604
0.022855 seconds (558.52 k allocations: 42.289 MiB, 10.43% gc time)
15.661
0.032693 seconds (852.46 k allocations: 69.684 MiB, 8.52% gc time)
15.744
0.043141 seconds (1.10 M allocations: 87.196 MiB, 14.73% gc time)
15.7065
0.071273 seconds (1.41 M allocations: 109.725 MiB, 7.67% gc time)
15.6568
0.068194 seconds (1.70 M allocations: 130.828 MiB, 12.54% gc time)
15.641333333333334
0.071277 seconds (1.98 M allocations: 150.712 MiB, 11.85% gc time)
15.743571428571428
0.081463 seconds (2.24 M allocations: 169.153 MiB, 10.67% gc time)
15.714
0.099957 seconds (2.48 M allocations: 186.492 MiB, 8.08% gc time)
15.722222222222221
0.148573 seconds (2.84 M allocations: 213.214 MiB, 18.37% gc time)
15.7086
We seem to get what we wanted. The computation time now scales quite well with the graph size. Also, the obtained average degree numbers are identical to the original ones.
Let us compare the performance on an even larger graph:
julia> n = 100_000;
julia> @time ne(euclidean_graph(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n
908.976252 seconds (25.00 G allocations: 1.819 TiB, 11.64% gc time, 0.00% compilation time)
15.70797
julia> @time ne(euclidean_graph2(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n
1.640495 seconds (27.53 M allocations: 2.121 GiB, 19.83% gc time)
15.70797
Indeed we see that the timing of the original implementation becomes prohibitive for larger graphs.
Before we finish, there is one important task left. We should check that the euclidean_graph2 function indeed produces the same results as euclidean_graph. This is easy to do with the following randomized test:
julia> using Test
julia> Random.seed!(1234);
julia> @time for i in 1:1000
N = rand(10:500)
d = rand(1:5)
L = rand()
p = 3*rand()
cutoff = rand() * L / 4
bc = rand([:open, :periodic])
seed = rand(UInt32)
@test euclidean_graph(N, d; L, p, cutoff, bc, seed) ==
euclidean_graph2(N, d; L, p, cutoff, bc, seed)
end
16.955773 seconds (275.09 M allocations: 20.342 GiB, 12.27% gc time)
We have tested 1000 random setups of the experiments. In each of them both functions returned the same results.
In this post I have shown you an example of how one can easily tweak package code to your needs. In this case the motivation was performance, but it could equally well be functionality.
I did not comment much on the code itself, as it was a bit longer than usual, but let me discuss one performance aspect of the code as a closing remark. In my to_buckets function I used the get! function to populate the dictionary with a mutable default value (Int[] in this case). You might wonder why I preferred to use an anonymous function instead of passing the default as a third argument. The reason is the number of allocations. Check this code:
julia> using BenchmarkTools
julia> function f1()
d = Dict(1 => Int[])
for i in 1:10^6
get!(d, 1, Int[])
end
return d
end
f1 (generic function with 1 method)
julia> function f2()
d = Dict(1 => Int[])
for i in 1:10^6
get!(() -> Int[], d, 1)
end
return d
end
f2 (generic function with 1 method)
julia> @benchmark f1()
BenchmarkTools.Trial: 195 samples with 1 evaluation.
Range (min … max): 19.961 ms … 45.328 ms ┊ GC (min … max): 10.26% … 13.19%
Time (median): 23.962 ms ┊ GC (median): 10.45%
Time (mean ± σ): 25.660 ms ± 5.050 ms ┊ GC (mean ± σ): 12.27% ± 3.76%
▃▂▂█▃▃ ▃▂▃
▆▄██████▇███▇▆▄▄▇▄▄▁▃▆▄▃▅▄▅▄▄▄▃▄▃▁▃▃▁▁▁▃▃▃▃▃▃▁▃▁▁▃▁▁▁▁▁▃▁▁▃ ▃
20 ms Histogram: frequency by time 44.3 ms <
Memory estimate: 61.04 MiB, allocs estimate: 1000005.
julia> @benchmark f2()
BenchmarkTools.Trial: 902 samples with 1 evaluation.
Range (min … max): 4.564 ms … 11.178 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.149 ms ┊ GC (median): 0.00%
Time (mean ± σ): 5.526 ms ± 1.396 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅▄▄▃▃▄▅▆▄▁▁ ▃
██████████████▆▆▆▆▆▅▇▆▄▆▆▅▅▁▄▅▁▅▁▇▅▆▄▄▇▁▅▅▅▄▁▄▄▆▅▁▅▁▁▅▆██▅ █
4.56 ms Histogram: log(frequency) by time 10 ms <
Memory estimate: 592 bytes, allocs estimate: 5.
As you can see, f2 is much faster than f1 and does far fewer allocations. The issue is that f1 allocates a fresh Int[] object in every iteration of the loop, while f2 allocates one only if get! does not hit a key that already exists in d (and in our experiment I always queried for 1, which was present in d).