# How does Tables.jl handle schema-less tables?

# Introduction

Tables.jl is a fundamental package in the JuliaData ecosystem.

One of the key concepts used in Tables.jl is a *schema* of a table.
Schema is information about the column names and types of a table.
Having access to schema is useful as being able to query these properties
is constantly needed in practice.

In this post I want to discuss in what cases a table might not have a schema information associated with it and how Tables.jl handles this case.

The post was written using Julia 1.9.2, Tables.jl 1.10.1, and DataFrames.jl 1.6.1.

# What is a schema of a table?

Let us create some example tables and investigate their schema:

```
julia> df = DataFrame(a=[1, 2], b=[3.0, 4.0], c=[1, "a"])
2×3 DataFrame
Row │ a b c
│ Int64 Float64 Any
─────┼─────────────────────
1 │ 1 3.0 1
2 │ 2 4.0 a
julia> Tables.schema(df)
Tables.Schema:
:a Int64
:b Float64
:c Any
julia> ntv = Tables.columntable(df)
(a = [1, 2], b = [3.0, 4.0], c = Any[1, "a"])
julia> Tables.schema(ntv)
Tables.Schema:
:a Int64
:b Float64
:c Any
julia> vnt = Tables.rowtable(df)
2-element Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}}:
NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}((1, 3.0, 1))
NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}((2, 4.0, "a"))
julia> Tables.schema(vnt)
Tables.Schema:
:a Int64
:b Float64
:c Any
```

I have created a data frame `df`

with three columns and then converted it into
a named tuple of vectors `ntv`

and a vector of named tuples `vnt`

.
All these three objects are Tables.jl tables.

In all cases we have checked that the `Tables.schema`

properly identifies the schema
of the se tables. They have three columns `:a`

, `:b`

, and `:c`

, and these columns
have types `:Int64`

, `Float64`

, and `Any`

.

So in what case a table might not have a schema? To understand this consider the following transformation:

```
julia> vnt2 = [(; collect(pairs(r))...) for r in vnt]
2-element Vector{NamedTuple{(:a, :b, :c)}}:
(a = 1, b = 3.0, c = 1)
(a = 2, b = 4.0, c = "a")
julia> Tables.schema(vnt2)
julia> isnothing(Tables.schema(vnt2))
true
```

Let us understand what happens in this code. With
`(; collect(pairs(r))...)`

operation I make the type of the fields
of the named tuples stored in the `vnt`

vector *concrete*.
To understand this better note:

```
julia> typeof(vnt[1])
NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}
julia> typeof((; collect(pairs(vnt[1]))...))
NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Int64}}
```

Note that the type of the `:c`

field is `Any`

and `Int64`

respectively.

This transformation means that the `vnt2`

vector itself does not, in consequence,
have a concrete element type:

```
julia> eltype(vnt2)
NamedTuple{(:a, :b, :c)}
```

This type only specifies column names, but not their types. In such a situation
the `Tables.schema`

function returns `nothing`

signaling to the user that the schema
of a table is unknown.

You might think that this is an artificial example but consider the following common case:

```
julia> vnt3 = [(a=1, b=2), (a=missing, b=3)]
2-element Vector{NamedTuple{(:a, :b)}}:
(a = 1, b = 2)
(a = missing, b = 3)
julia> Tables.schema(vnt3)
julia> isnothing(Tables.schema(vnt3))
true
```

As you can see when we have a vector of named tuples, a mix in of `missing`

values
makes it schema-less.

# How does Tables.jl handle schema-less tables? Part 1

If Tables.jl detects a schema-less table it tries to dynamically detect its schema when functions defined in Tables.jl are called. The three key functions here are:

`Tables.columns`

: returns an object that can be queried column-wise;`Tables.rows`

: returns an object that can be queried row-wise;`Tables.dictrowtable`

: returns a row-wise vector that performs “column unioning”.

You might wonder what this “column unioning” part means. We will get to it in a second.

First let us investigate how these functions work on our `vnt2`

object that is schema-less:

```
julia> Tables.columns(vnt2)
Tables.CopiedColumns{NamedTuple{(:a, :b, :c), Tuple{Vector{Int64}, Vector{Float64}, Vector{Any}}}} with 2 rows, 3 columns, and schema:
:a Int64
:b Float64
:c Any
julia> Tables.schema(Tables.columns(vnt2))
Tables.Schema:
:a Int64
:b Float64
:c Any
julia> Tables.rows(vnt2)
2-element Vector{NamedTuple{(:a, :b, :c)}}:
(a = 1, b = 3.0, c = 1)
(a = 2, b = 4.0, c = "a")
julia> Tables.schema(Tables.rows(vnt2))
julia> isnothing(Tables.schema(Tables.rows(vnt2)))
true
julia> Tables.dictrowtable(vnt2)
Tables.DictRowTable([:a, :b, :c], Dict{Symbol, Type}(:a => Int64, :b => Float64, :c => Union{Int64, String}), Dict{Symbol, Any}[Dict(:a => 1, :b => 3.0, :c => 1), Dict(:a => 2, :b => 4.0, :c => "a")])
julia> Tables.schema(Tables.dictrowtable(vnt2))
Tables.Schema:
:a Int64
:b Float64
:c Union{Int64, String}
```

As you can see `Tables.columns`

performed eltype detection and determined the type of column `:c`

as `Any`

.
The `Tables.rows`

produces a schema-less object as for `vnt2`

object calling `Tables.rows`

on it just returns the input.
Finally `Tables.dictrowtable`

performs a narrower eltype for column `:c`

which is `Union{Int64, String}`

.
All these results are technically correct, but it is important to note that they start to matter when we create
a data frame from the result:

```
julia> DataFrame(Tables.columns(vnt2))
2×3 DataFrame
Row │ a b c
│ Int64 Float64 Any
─────┼─────────────────────
1 │ 1 3.0 1
2 │ 2 4.0 a
julia> DataFrame(vnt2)
2×3 DataFrame
Row │ a b c
│ Int64 Float64 Any
─────┼─────────────────────
1 │ 1 3.0 1
2 │ 2 4.0 a
julia> DataFrame(Tables.dictrowtable(vnt2))
2×3 DataFrame
Row │ a b c
│ Int64 Float64 Union…
─────┼────────────────────────
1 │ 1 3.0 1
2 │ 2 4.0 a
```

In this case for `Tables.columns`

and `Tables.rows`

we have `Any`

as element type for `:c`

.
While for `Tables.dictrowtable`

we get the `Union`

. This difference might matter to you
if you wanted to later change data stored in `:c`

column.

However, the key difference between `Tables.dictrowtable`

and other methods is visible
in case when we have heterogeneous column list data.

# How does Tables.jl handle schema-less tables? Part 2

What is the data that has a heterogeneous column list? It is relatively common in practice. Let me create two examples:

```
julia> h1 = [(; a=1), (; a=2, b=3)]
2-element Vector{NamedTuple}:
(a = 1,)
(a = 2, b = 3)
julia> h2 = [(; a=1), (; b=3)]
2-element Vector{NamedTuple{names, Tuple{Int64}} where names}:
(a = 1,)
(b = 3,)
```

Both `h1`

and `h2`

have a different set of columns. How does the
Tables.jl handle them? Let us check by creating a data frame from them:

```
julia> DataFrame(Tables.columns(h1))
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
julia> DataFrame(Tables.rows(h1))
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
julia> DataFrame(Tables.dictrowtable(h1))
2×2 DataFrame
Row │ a b
│ Int64 Int64?
─────┼────────────────
1 │ 1 missing
2 │ 2 3
```

For the `h1`

input we see that using `Tables.columns`

and `Tables.rows`

drops the `:b`

column while `Tables.dictrowtable`

keeps it and fills missing entries with `missing`

.
This behavior is called “column unioning”.

Now let us check the `h2`

case:

```
julia> DataFrame(Tables.columns(h2))
ERROR: type NamedTuple has no field a
julia> DataFrame(Tables.rows(h2))
ERROR: type NamedTuple has no field a
julia> DataFrame(Tables.dictrowtable(h2))
2×2 DataFrame
Row │ a b
│ Int64? Int64?
─────┼──────────────────
1 │ 1 missing
2 │ missing 3
```

Now we have an even stranger behavior. The first two calls error, while the
`Tables.dictrowtable`

works by “column unioning”. Why do the first two cases error?
The reason is that `DataFrame`

constructor internally always calls `Tables.columns`

on a passed table. And `Tables.columns`

when doing column detection assumes that
the list of columns for row-wise table is given by its first row. So even
without calling `DataFrame`

we get an error when we do:

```
julia> Tables.columns(h2)
ERROR: type NamedTuple has no field a
```

In summary we have two cases of behaviors for row-wise tables:

- assume the columns are given in the first row of a table;
this is what
`Tables.columns`

does, and, in consequence`DataFrame`

constructor; - perform “column unioning”; this is what
`Tables.dictrowtable`

does.

# Conclusions

The take-away from the examples we have given today are as follows:

- by default
`DataFrame(row_wise_table)`

inherits the`Tables.columns`

behavior and assumes that the set of columns of the passed row-wise table is given by its first row; - if you want to override the default behavior, and want to perform “column unioning” instead
call
`DataFrame(Tables.dictrowtable(row_wise_table))`

; this option is useful if you have data that has heterogeneous column lists in different rows.

The examples given today were a bit low-level, but I hope they improved your understanding how the Tables.jl functions handle different tabular data sources.