Is DataFrames.jl Hamiltonian?
Introduction
Hamilton is an interesting general purpose micro-framework for creating dataflows from Python functions. It has been developed by Stitch Fix to help manage complex DAGs, where the resulting data frames are wide (1000+ columns).
While DataFrames.jl is not a framework for DAG execution, it is a natural tool for performing single steps in such processes. When I was reading the explanation of the motivation and design of Hamilton, I was struck by the fact that it shares similarities with some concepts in DataFrames.jl.
In this post I want to discuss some of these principles. Additionally, I want to highlight how I believe that having metadata support in DataFrames.jl nicely combines with them.
In the post I use Julia 1.8.2, and DataFrames.jl 1.4.1.
The principles
I do not list here all design choices that Hamilton makes. Instead I want to discuss ideas that are similar between this framework and DataFrames.jl.
These concepts are simple:
- if you use some column name it should have a single meaning in your pipeline;
- every transformation should be a function with a name.
Such an approach helps with understanding of the code, its maintenance, and documentation of the pipeline.
Let me discuss these two concepts one by one in the context of DataFrames.jl design.
Single column name has a single meaning
This rule seems pretty intuitive but is often violated in practice. For example, users often transform a column (e.g. by taking log) and still keep its name.
The general recommendation is that such practices should be avoided. If you significantly transform a column, the recommended thing to do is to give it a new name. In this way you can be sure that you will not be confused later about what the column holds.
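As a minimal sketch of this recommendation (the output name log_sales is a hypothetical choice made only for illustration), compare overwriting a column with storing the transformed values under a new name:

```julia
using DataFrames

df = DataFrame(sales=[1.2, 3.4, 2.1], year=[2001, 2002, 2003])

# Discouraged: :sales is overwritten, so the same name now means log-sales.
overwritten = transform(df, :sales => ByRow(log) => :sales)

# Preferred: the transformed values get their own name (log_sales is a
# hypothetical name) and the original :sales column keeps its meaning.
renamed = transform(df, :sales => ByRow(log) => :log_sales)

names(overwritten)  # two columns: sales, year
names(renamed)      # three columns: sales, year, log_sales
```

In the second data frame both the raw and the transformed values stay available, and each name keeps a single meaning.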
As an additional benefit of following this rule, you can safely annotate your columns with metadata to keep important information about their meaning. In DataFrames.jl, column-level metadata works as follows:
- you can add a key-value pair as metadata to any column of a data frame;
- you can choose one of two styles of metadata propagation: either :note (signaling that it is safe to retain the metadata under certain transformations) or :default (signaling that the metadata should be stripped after any transformation).
Let us see it in action. You can use colmetadata! to set column-level metadata and colmetadata to retrieve it:
julia> using DataFrames
julia> df = DataFrame(sales=[1.2, 3.4, 2.1], year=[2001, 2002, 2003])
3×2 DataFrame
Row │ sales year
│ Float64 Int64
─────┼────────────────
1 │ 1.2 2001
2 │ 3.4 2002
3 │ 2.1 2003
julia> colmetadata!(df, :sales, "label", "Yearly sales in USD", style=:note);
julia> colmetadata!(df, :year, "label", "Fiscal year (ending on June 30)", style=:note);
julia> colmetadata!(df, :sales, "sum", sum(df.sales), style=:default);
julia> colmetadata(df, :sales, "label")
"Yearly sales in USD"
julia> colmetadata(df, :year, "label")
"Fiscal year (ending on June 30)"
julia> colmetadata(df, :sales, "sum")
6.699999999999999
You might ask what the use of the distinction between :note and :default metadata is.
Most of the time, metadata of :default style describes things that are specific to only a given instance of a data frame. An example of such metadata is column creation time. In some scenarios such information might be useful, e.g. for auditing purposes. However, it should not be propagated under transformations.
On the other hand, :note metadata is kept when a data frame is transformed.
The rules, in short, are:
- if the column is not changed under a transformation (the data it contains remains unchanged), :note-style metadata is kept;
- if you change the column data, but decide to keep the original column name, the metadata is also kept.
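The two propagation rules can be checked with a short sketch like the one below. It assumes the propagation behavior described in the Metadata section of the DataFrames.jl documentation; the column names revenue and log_sales are hypothetical names introduced only for this illustration.

```julia
using DataFrames

df = DataFrame(sales=[1.2, 3.4, 2.1], year=[2001, 2002, 2003])
colmetadata!(df, :sales, "label", "Yearly sales in USD", style=:note)
colmetadata!(df, :sales, "sum", sum(df.sales), style=:default)

# Rule 1: the data is unchanged (the column is only renamed),
# so the :note-style label follows the column.
renamed = rename(df, :sales => :revenue)
label_after_rename = colmetadata(renamed, :revenue, "label")

# Rule 2: the data changes but the column name is kept,
# so the :note-style label is kept as well.
scaled = transform(df, :sales => (x -> 100 .* x) => :sales)
label_after_scaling = colmetadata(scaled, :sales, "label")

# Changed data stored under a new name starts without metadata.
logged = transform(df, :sales => ByRow(log) => :log_sales)
log_has_no_metadata = isempty(colmetadatakeys(logged, :log_sales))

# The :default-style "sum" entry is not carried over to the new data frame.
sum_survives = "sum" in colmetadatakeys(renamed, :revenue)
```

Note that in the scaled data frame the label still says "Yearly sales in USD" although the values are now in cents. This is exactly the confusion that the single-meaning principle is meant to prevent.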
What is the rationale behind these rules? We could be super strict and say that metadata is propagated only if the column data does not change and its name does not change. However, such a rule seems to be overly strict in practice. When you get some source data, you usually want to e.g. rename a column to a more descriptive name or perform data cleaning (e.g. dropping rows with missing values). In such cases users usually feel that the contents of the column conceptually remain the same. However, some information about the data in the column might change, for example its length. Therefore :note-style metadata should not be used to store information that is strictly attached to a concrete instance of a column (:default-style metadata should be used instead).
Let us now look back at our example code.
A typical example of :note-style metadata is a descriptive label of a column. We used the "label" key for it. In contrast, we gave the "sum" metadata the :default style.
Someone might have wanted to store the sum for easy access later. However, such metadata should not be propagated under any transformation of a data frame, as it might be invalidated, so we made its style :default.
If you wonder why such metadata might be useful, have a look at this example:
julia> show(df, header=colmetadata.(Ref(df), names(df), "label"))
3×2 DataFrame
Row │ Yearly sales in USD Fiscal year (ending on June 30)
─────┼──────────────────────────────────────────────────────
1 │ 1.2 2001
2 │ 3.4 2002
3 │ 2.1 2003
Note that we have substituted column names (which might not be clear for end users) with descriptive labels.
The metadata propagation rules work nicely and will help you document your data frame, assuming that you follow the first principle: do not use a single column name to store different information.
Every transformation should be a function with a name
The principle of defining named functions for transformations, combined with the principle of a single name for a single meaning, is at the core of the operation specification syntax in DataFrames.jl.
Consider the following example code:
julia> usd2¢(x) = 100x
usd2¢ (generic function with 1 method)
julia> transform(df, :sales => usd2¢)
3×3 DataFrame
Row │ sales year sales_usd2¢
│ Float64 Int64 Float64
─────┼─────────────────────────────
1 │ 1.2 2001 120.0
2 │ 3.4 2002 340.0
3 │ 2.1 2003 210.0
Note that the auto-generated column name sales_usd2¢ gives us information about the source column and the transformation function applied to it. Of course, you could use a custom output name. However, using an automatically generated column name, assuming that you pass a named function as the transformation, has a big benefit: it automatically and visibly documents the transformation applied to the data (and makes it easy to look up the code of the transformation function used). This is a big feature if you work with hundreds or thousands of columns.
As with the previous principle, this works if you follow a simple rule: define the transformations you want to apply as functions with a name.
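To see why the name matters, here is a small sketch contrasting a named function with an anonymous one. It reuses the usd2¢ function and the data from the example above. With an anonymous function, DataFrames.jl has no function name to use, so the generated column name falls back to the generic suffix function.

```julia
using DataFrames

df = DataFrame(sales=[1.2, 3.4, 2.1], year=[2001, 2002, 2003])

# Named transformation: its name becomes part of the output column name.
usd2¢(x) = 100x

named = transform(df, :sales => usd2¢)
anonymous = transform(df, :sales => (x -> 100x))

names(named)      # the last column is sales_usd2¢
names(anonymous)  # the last column is sales_function
```

Both data frames hold the same values, but only the first one tells the reader which transformation produced the new column.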
Conclusions
In this post I have highlighted two principles that are useful when implementing complex data transformation jobs:
- if you use some column name it should have a single meaning;
- every transformation should be a function with a name.
DataFrames.jl was designed to work nicely if you want to follow these rules, as I have tried to explain in this post. However, nothing is carved in stone in DataFrames.jl. You can write your code whatever way you want, and nothing is forced on the user. This is especially important in quick-and-dirty, one-time interactive sessions, where users want flexibility.
If you would like to learn more about the design of the operation specification syntax, I recommend you start with the JuliaCon 2022 tutorial.
Similarly, we tried to design the handling of metadata in a way that is easy to understand and convenient to use. You can find more details about metadata support in DataFrames.jl in the Metadata section of the DataFrames.jl documentation. I also plan to write a separate post in which I will give a more in-depth example of how metadata can be used in DataFrames.jl. There is also a plan to release the TableMetadataTools.jl package, which will add even more convenience to the most common operations.