Separating a column into multiple columns in DataFrames.jl
Introduction
Recently several users have asked how one can split one column into several columns in DataFrames.jl.
A functionality of this kind is provided e.g. in dplyr by the separate
function, but we are currently missing it in DataFrames.jl.
In this post I comment how to obtain the expected behavior in the current version of DataFrames.jl (i.e. v21) with a little help of the SplitApplyCombine.jl pakage.
The codes were tested under Julia 1.5, DataFrames.jl 0.21, and SplitApplyCombine.jl 1.1.
Update: The functionality is available in DataFrames.jl 0.22. See a new section at the end of this post.
Separating a column
For our example I use the data I have found in this post about most popular first and last name combinations in the United States:
julia> using DataFrames
julia> df = DataFrame(name=["James Smith", "Michael Smith", "Robert Smith",
                            "Maria Garcia", "David Smith", "Maria Rodriguez",
                            "Mary Smith", "Maria Hernandez", "Maria Martinez",
                            "James Johnson"],
                      freq=[38_313, 34_810, 34_269, 32_092, 31_294,
                            30_507, 28_692, 27_836, 26_956, 26_850])
10×2 DataFrame
│ Row │ name            │ freq  │
│     │ String          │ Int64 │
├─────┼─────────────────┼───────┤
│ 1   │ James Smith     │ 38313 │
│ 2   │ Michael Smith   │ 34810 │
│ 3   │ Robert Smith    │ 34269 │
│ 4   │ Maria Garcia    │ 32092 │
│ 5   │ David Smith     │ 31294 │
│ 6   │ Maria Rodriguez │ 30507 │
│ 7   │ Mary Smith      │ 28692 │
│ 8   │ Maria Hernandez │ 27836 │
│ 9   │ Maria Martinez  │ 26956 │
│ 10  │ James Johnson   │ 26850 │
We want to separate :name column into :first and :last columns holding first
and last name respectively.
Here is how one can do it using just Julia Base:
julia> tmp = split.(df.name)
10-element Array{Array{SubString{String},1},1}:
 ["James", "Smith"]
 ["Michael", "Smith"]
 ["Robert", "Smith"]
 ["Maria", "Garcia"]
 ["David", "Smith"]
 ["Maria", "Rodriguez"]
 ["Mary", "Smith"]
 ["Maria", "Hernandez"]
 ["Maria", "Martinez"]
 ["James", "Johnson"]
julia> insertcols!(df, [n => getindex.(tmp, i) for (i, n) in
                        enumerate([:first, :last])]...)
10×4 DataFrame
│ Row │ name            │ freq  │ first    │ last      │
│     │ String          │ Int64 │ SubStri… │ SubStrin… │
├─────┼─────────────────┼───────┼──────────┼───────────┤
│ 1   │ James Smith     │ 38313 │ James    │ Smith     │
│ 2   │ Michael Smith   │ 34810 │ Michael  │ Smith     │
│ 3   │ Robert Smith    │ 34269 │ Robert   │ Smith     │
│ 4   │ Maria Garcia    │ 32092 │ Maria    │ Garcia    │
│ 5   │ David Smith     │ 31294 │ David    │ Smith     │
│ 6   │ Maria Rodriguez │ 30507 │ Maria    │ Rodriguez │
│ 7   │ Mary Smith      │ 28692 │ Mary     │ Smith     │
│ 8   │ Maria Hernandez │ 27836 │ Maria    │ Hernandez │
│ 9   │ Maria Martinez  │ 26956 │ Maria    │ Martinez  │
│ 10  │ James Johnson   │ 26850 │ James    │ Johnson   │
The code is a bit verbose and uses a temporaty variable. We could have written the second step also e.g. like this:
for (i, n) in enumerate([:first, :last])
    df[!, n] = getindex.(tmp, i)
end
but it is still quite verbose.
We can have a shorter code and avoid a temporary variable using the invert
function from SplitApplyCombine.jl:
julia> using SplitApplyCombine
julia> insertcols!(df, ([:first, :last] .=> invert(split.(df.name)))...,
                   makeunique=true)
10×6 DataFrame
│ Row │ name            │ freq  │ first    │ last      │ first_1  │ last_1    │
│     │ String          │ Int64 │ SubStri… │ SubStrin… │ SubStri… │ SubStrin… │
├─────┼─────────────────┼───────┼──────────┼───────────┼──────────┼───────────┤
│ 1   │ James Smith     │ 38313 │ James    │ Smith     │ James    │ Smith     │
│ 2   │ Michael Smith   │ 34810 │ Michael  │ Smith     │ Michael  │ Smith     │
│ 3   │ Robert Smith    │ 34269 │ Robert   │ Smith     │ Robert   │ Smith     │
│ 4   │ Maria Garcia    │ 32092 │ Maria    │ Garcia    │ Maria    │ Garcia    │
│ 5   │ David Smith     │ 31294 │ David    │ Smith     │ David    │ Smith     │
│ 6   │ Maria Rodriguez │ 30507 │ Maria    │ Rodriguez │ Maria    │ Rodriguez │
│ 7   │ Mary Smith      │ 28692 │ Mary     │ Smith     │ Mary     │ Smith     │
│ 8   │ Maria Hernandez │ 27836 │ Maria    │ Hernandez │ Maria    │ Hernandez │
│ 9   │ Maria Martinez  │ 26956 │ Maria    │ Martinez  │ Maria    │ Martinez  │
│ 10  │ James Johnson   │ 26850 │ James    │ Johnson   │ James    │ Johnson   │
In this call I have used makeunique=true as we update the df data frame in
place and it already contains :first and :last columns.
So the code is not that long, but admittedly separate in dplyr is shorter.
Before we finish let us see what the invert function produces when applied
to the tmp variable we have created above:
julia> invert(tmp)
2-element Array{Array{SubString{String},1},1}:
 ["James", "Michael", "Robert", "Maria", "David", "Maria", "Mary", "Maria",
 "Maria", "James"]
 ["Smith", "Smith", "Smith", "Garcia", "Smith", "Rodriguez", "Smith",
 "Hernandez", "Martinez", "Johnson"]
As you can see it takes a container of containers and reverses the order of nesting.
Conclusion
If you feel that it would be good to have an in-built function in DataFrames.jl that would do splitting of columns in a data frame please leave a comment in this issue.
DataFrames.jl 0.22 functionality
You can do column splitting easily in DataFrames.jl 0.22 like this:
julia> using DataFrames
julia> df = DataFrame(name=["James Smith", "Michael Smith", "Robert Smith",
                                   "Maria Garcia", "David Smith", "Maria Rodriguez",
                                   "Mary Smith", "Maria Hernandez", "Maria Martinez",
                                   "James Johnson"],
                             freq=[38_313, 34_810, 34_269, 32_092, 31_294,
                                   30_507, 28_692, 27_836, 26_956, 26_850])
10×2 DataFrame
 Row │ name             freq
     │ String           Int64
─────┼────────────────────────
   1 │ James Smith      38313
   2 │ Michael Smith    34810
   3 │ Robert Smith     34269
   4 │ Maria Garcia     32092
   5 │ David Smith      31294
   6 │ Maria Rodriguez  30507
   7 │ Mary Smith       28692
   8 │ Maria Hernandez  27836
   9 │ Maria Martinez   26956
  10 │ James Johnson    26850
julia> transform(df, :name => ByRow(split) => [:first, :last])
10×4 DataFrame
 Row │ name             freq   first      last
     │ String           Int64  SubStrin…  SubStrin…
─────┼──────────────────────────────────────────────
   1 │ James Smith      38313  James      Smith
   2 │ Michael Smith    34810  Michael    Smith
   3 │ Robert Smith     34269  Robert     Smith
   4 │ Maria Garcia     32092  Maria      Garcia
   5 │ David Smith      31294  David      Smith
   6 │ Maria Rodriguez  30507  Maria      Rodriguez
   7 │ Mary Smith       28692  Mary       Smith
   8 │ Maria Hernandez  27836  Maria      Hernandez
   9 │ Maria Martinez   26956  Maria      Martinez
  10 │ James Johnson    26850  James      Johnson
(also note that the text/plain display has changed to save more horizontal space (in the output)