Julia for Data Analysis Strikes Back
Introduction
This week I got a nice little surprise in my office. A year after my Julia for Data Analysis book has been published I got a package with a set of printed versions of its Korean translation 데이터 분석을 위한 줄리아. It was really a nice experience and I hope that Julia users from Korea will like it.
Therefore, for today, I decided to discuss a functionality that is little known, but often quite useful. It is related to adding conditional columns to a data frame.
The post was written under Julia 1.10.1, DataFrames.jl 1.6.1, and DataFramesMeta.jl 0.15.2.
The problem
Assume you have the following data frame:
julia> using DataFrames
julia> df = DataFrame(x=-2.0:0.5:2.0)
9×1 DataFrame
Row │ x
│ Float64
─────┼─────────
1 │ -2.0
2 │ -1.5
3 │ -1.0
4 │ -0.5
5 │ 0.0
6 │ 0.5
7 │ 1.0
8 │ 1.5
9 │ 2.0
Now we want to add a second column to this data frame that contains a square root of column "x"
.
A basic approach fails:
julia> df.sqrtx = sqrt.(df.x)
ERROR: DomainError with -2.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
The reason is that we cannot normally take a square root of a negative number.
We can perform a conditional processing for example like this:
julia> df.sqrtx = (x -> x < 0.0 ? missing : sqrt(x)).(df.x)
9-element Vector{Union{Missing, Float64}}:
missing
missing
missing
missing
0.0
0.7071067811865476
1.0
1.224744871391589
1.4142135623730951
julia> df
9×2 DataFrame
Row │ x sqrtx
│ Float64 Float64?
─────┼─────────────────────────
1 │ -2.0 missing
2 │ -1.5 missing
3 │ -1.0 missing
4 │ -0.5 missing
5 │ 0.0 0.0
6 │ 0.5 0.707107
7 │ 1.0 1.0
8 │ 1.5 1.22474
9 │ 2.0 1.41421
but I do not find this approach very readable (especially from the perspective of a beginner).
The alternative that I prefer is to work with a view of the source data frame. Let us first create such a view that contains all columns of the original data frame, but only rows in which column "x"
is non-negative:
julia> dfv = filter(:x => >=(0.0), df, view=true)
5×2 SubDataFrame
Row │ x sqrtx
│ Float64 Float64?
─────┼───────────────────
1 │ 0.0 0.0
2 │ 0.5 0.707107
3 │ 1.0 1.0
4 │ 1.5 1.22474
5 │ 2.0 1.41421
Now, we can add a column to such a view by using a plain sqrt
function without any decorations:
julia> dfv.sqrtx2 = sqrt.(dfv.x)
5-element Vector{Float64}:
0.0
0.7071067811865476
1.0
1.224744871391589
1.4142135623730951
julia> dfv
5×3 SubDataFrame
Row │ x sqrtx sqrtx2
│ Float64 Float64? Float64?
─────┼─────────────────────────────
1 │ 0.0 0.0 0.0
2 │ 0.5 0.707107 0.707107
3 │ 1.0 1.0 1.0
4 │ 1.5 1.22474 1.22474
5 │ 2.0 1.41421 1.41421
julia> df
9×3 DataFrame
Row │ x sqrtx sqrtx2
│ Float64 Float64? Float64?
─────┼─────────────────────────────────────────
1 │ -2.0 missing missing
2 │ -1.5 missing missing
3 │ -1.0 missing missing
4 │ -0.5 missing missing
5 │ 0.0 0.0 0.0
6 │ 0.5 0.707107 0.707107
7 │ 1.0 1.0 1.0
8 │ 1.5 1.22474 1.22474
9 │ 2.0 1.41421 1.41421
Note that both dfv
and df
are updated as expected. The filtered-out rows get missing
values.
It is important to highlight that this functionality works if the view (SubDataFrame
) was created using all columns of the source data frame (like is done in the case of our filter
call above).
The reason for this restriction is that if view contained some subset of columns the operation of adding a column would be unsafe (there would be a risk of accidental and unwanted overwrite of a column present in the source data frame that was not included in the view).
This functionality is especially nice in combination with DataFramesMeta.jl, just have a look:
julia> @chain df begin
@rsubset(:x >= 0; view=true)
@rtransform!(:sqrtx3 = sqrt(:x))
parent
end
9×4 DataFrame
Row │ x sqrtx sqrtx2 sqrtx3
│ Float64 Float64? Float64? Float64?
─────┼─────────────────────────────────────────────────────────
1 │ -2.0 missing missing missing
2 │ -1.5 missing missing missing
3 │ -1.0 missing missing missing
4 │ -0.5 missing missing missing
5 │ 0.0 0.0 0.0 0.0
6 │ 0.5 0.707107 0.707107 0.707107
7 │ 1.0 1.0 1.0 1.0
8 │ 1.5 1.22474 1.22474 1.22474
9 │ 2.0 1.41421 1.41421 1.41421
In the code above I used parent
in the last step to recover the source df
.
As a final comment note that an alternative in DataFramesMeta.jl is to just use a plain @rtransform!
macro:
julia> @rtransform!(df, :sqrtx4 = :x < 0 ? missing : sqrt(:x))
9×5 DataFrame
Row │ x sqrtx sqrtx2 sqrtx3 sqrtx4
│ Float64 Float64? Float64? Float64? Float64?
─────┼─────────────────────────────────────────────────────────────────────────
1 │ -2.0 missing missing missing missing
2 │ -1.5 missing missing missing missing
3 │ -1.0 missing missing missing missing
4 │ -0.5 missing missing missing missing
5 │ 0.0 0.0 0.0 0.0 0.0
6 │ 0.5 0.707107 0.707107 0.707107 0.707107
7 │ 1.0 1.0 1.0 1.0 1.0
8 │ 1.5 1.22474 1.22474 1.22474 1.22474
9 │ 2.0 1.41421 1.41421 1.41421 1.41421
In this case it also quite clean.
Conclusions
I am really happy that we have a Korean version of Julia for Data Analysis.
I hope that the example transformations I have shown today were useful and improved your knowledge of DataFrames.jl and DataFramesMeta.jl packages.