Introduction

Before I begin let me make a small announcement. In my last post I have advertised an introductory Julia course for data science that will start next week at MIT. If someone is interested in the material we are going to cover you can find it here. There is also an option to attend the course remotely. The content will be streamed via MS Teams. If you would like to join please register yourself here.

Now, back to the usual business. This week I have registered the TableMetadataTools.jl package. It aims to make common operations on table metadata more convenient.

Recently I have written about how one can work with metadata in this post. Today, I want to give another example of working with table metadata and discuss the major functionalities that TableMetadataTools.jl offers. The key difference is that this time we will use JSON source.

The post was written under Julia 1.8.5, DataFrames.jl 1.4.4, JSON3.jl 1.12.0, TableMetadataTools.jl 0.1.0, and Parquet2.jl 0.2.5.

The data

This time I chose the Electric Vehicle Population Data data to analyze. This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL).

We will want with the JSON file rows.json that can be downloaded from that website, so before proceeding please download it to your working folder. The analysis I present is done for the file that I downloaded on Jan 12, 2023.

In the post we will:

  1. Read the JSON file and load both data and metadata contained in it into a data frame.
  2. Perform some simple analysis of the data.
  3. Save the data as Parquet file (since it supports metadata).

(note - some of the output in this post will be wide, so in some displays it might not look very nice; however, it looks OK if you try it in the terminal or Jupyter Notebook)

Data ingest

First load the required packages:

julia> using DataFrames

julia> using JSON3

julia> using Parquet2

julia> using Statistics

julia> using TableMetadataTools

Next we read the JSON file to a dictionary-like structure:

julia> data_json = open(JSON3.read, "rows.json")
JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}} with 2 entries:  :meta => {…
  :data => JSON3.Array[Any["row-briu~kmbv-8jap", "00000000-0000-0000-6480-…

It is instructive to investigate the structure of this object, as it is complex:

julia> function traverse(obj; level=0)
           if level > 4 || !(obj isa AbstractDict)
               print(" ...")
               return
           end
           for key in keys(obj)
               print("\n", " " ^ level, key, ":")
               traverse(obj[key], level=level+2)
           end
       end
traverse (generic function with 2 methods)

julia> traverse(data_json)

meta:
  view:
    id: ...
    name: ...
    assetType: ...
    attribution: ...
    averageRating: ...
    category: ...
    createdAt: ...
    description: ...
    displayType: ...
    downloadCount: ...
    hideFromCatalog: ...
    hideFromDataJson: ...
    newBackend: ...
    numberOfComments: ...
    oid: ...
    provenance: ...
    publicationAppendEnabled: ...
    publicationDate: ...
    publicationGroup: ...
    publicationStage: ...
    rowsUpdatedAt: ...
    rowsUpdatedBy: ...
    tableId: ...
    totalTimesRated: ...
    viewCount: ...
    viewLastModified: ...
    viewType: ...
    approvals: ...
    clientContext: ...
    columns: ...
    grants: ...
    metadata: ...
    owner: ...
    query: ...
    rights: ...
    tableAuthor: ...
    tags: ...
    flags: ...
data: ...

As you can see we have descended only two levels in the hierarchy and already there is a lot of information.

Actual data is stored in :data field. The rest is metadata, of which interesting for us will be :name, :description, and :columns.

Before move forward, let us peek at data_json.meta.view.columns vector:

julia> data_json.meta.view.columns[[1; end]]
2-element Vector{JSON3.Object}:
 {
               "id": -1,
             "name": "sid",
     "dataTypeName": "meta_data",
        "fieldName": ":sid",
         "position": 0,
   "renderTypeName": "meta_data",
           "format": {},
            "flags": [
                       "hidden"
                     ]
}
 {
                    "id": 561974359,
                  "name": "WAOFM - GIS - Legislative District Boundary",
          "dataTypeName": "number",
           "description": "",
             "fieldName": ":@computed_region_8ddd_yn5v",
              "position": 21,
        "renderTypeName": "number",
         "tableColumnId": 87270518,
   "computationStrategy": {
                             "source_columns": [
                                                 "geocoded_column"
                                               ],
                                       "type": "georegion_match_on_point",
                                 "parameters": {
                                                       "region": "_8ddd-yn5v",
                                                  "primary_key": "_feature_id"
                                               }
                          },
                "format": {}
}

(I have displayed only first and last element of the vector contained in this metadata entry as the rest are similar). This vector stores column-level metadata. The interesting fields are :name, :fieldName, and :description (the last is not always present).

Finally, let us investigate data_json.data object. Again, we pick only the first and last element:

julia> data_json.data[[1; end]]
2-element Vector{JSON3.Array}:
 Any["row-briu~kmbv-8jap", "00000000-0000-0000-6480-436D376BD73A", 0,
 1670541195, nothing, 1670541238, nothing, "{ }", "JTMEB3FV6N",
 "Monroe"  …  "42", "0", nothing, "198968248",
 "POINT (-81.80023 24.5545)", nothing, "12087972100", "51", nothing, nothing]
 Any["row-acyi~zx2n-aq3x", "00000000-0000-0000-BA64-2573B5CAA8F7", 0,
 1670541195, nothing, 1670541258, nothing, "{ }", "YV4BR0CL8N",
 "King"  …  "18", "0", "47", "194673692",
 "POINT (-122.09124 47.33778)", "PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA)",
 "53033032005", "3009", "8", "46"]

We see that the data is stored in rows.

Loading data to a data frame

Having this knowledge let us load the data into a data frame.

First we extract the column-level metadata into a data frame for convenience:

julia> data_colmeta = DataFrame(name=getproperty.(cols, :fieldName),
                                label=getproperty.(cols, :name),
                                note=get.(cols, :description, nothing))
28×3 DataFrame
 Row │ name                         label                              note
     │ String                       String                             Union…
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ :sid                         sid
   2 │ :id                          id
   3 │ :position                    position
   4 │ :created_at                  created_at
   5 │ :created_meta                created_meta
   6 │ :updated_at                  updated_at
   7 │ :updated_meta                updated_meta
   8 │ :meta                        meta
   9 │ vin_1_10                     VIN (1-10)                         The 1st 10 characters of each ve…
  10 │ county                       County                             The county in which the register…
  11 │ city                         City                               The city in which the registered…
  12 │ state                        State                              The state in which the registere…
  13 │ zip_code                     Postal Code                        The 5 digit zip code in which th…
  14 │ model_year                   Model Year                         The model year of the vehicle, d…
  15 │ make                         Make                               The manufacturer of the vehicle,…
  16 │ model                        Model                              The model of the vehicle, determ…
  17 │ ev_type                      Electric Vehicle Type              This distinguishes the vehicle a…
  18 │ cafv_type                    Clean Alternative Fuel Vehicle (…  This categorizes vehicle as Clea…
  19 │ electric_range               Electric Range                     Describes how far a vehicle can …
  20 │ base_msrp                    Base MSRP                          This is the lowest Manufacturer'…
  21 │ legislative_district         Legislative District               The specific section of Washingt…
  22 │ dol_vehicle_id               DOL Vehicle ID                     Unique number assigned to each v…
  23 │ geocoded_column              Vehicle Location                   The center of the ZIP Code for t…
  24 │ electric_utility             Electric Utility                   This is the electric power retai…
  25 │ _2020_census_tract           2020 Census Tract                  The census tract identifier is a…
  26 │ :@computed_region_x4ys_rtnd  Counties
  27 │ :@computed_region_fny7_vc3j  Congressional Districts
  28 │ :@computed_region_8ddd_yn5v  WAOFM - GIS - Legislative Distri…

We now, in a structured way, see that the metadata contains: column name, column descriptive label, and, optionally, note about column contents.

Assume that from all the available columns we want to keep only few most interesting. The DATA_COLSKEEP vector keeps column numbers we want to store in a data frame.

julia> const DATA_COLSKEEP = [9, 12, 14, 15, 17, 18, 19];

julia> keepat!(data_colmeta, DATA_COLSKEEP)
7×3 DataFrame
 Row │ name            label                              note
     │ String          String                             Union…
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │ vin_1_10        VIN (1-10)                         The 1st 10 characters of each ve…
   2 │ state           State                              The state in which the registere…
   3 │ model_year      Model Year                         The model year of the vehicle, d…
   4 │ make            Make                               The manufacturer of the vehicle,…
   5 │ ev_type         Electric Vehicle Type              This distinguishes the vehicle a…
   6 │ cafv_type       Clean Alternative Fuel Vehicle (…  This categorizes vehicle as Clea…
   7 │ electric_range  Electric Range                     Describes how far a vehicle can …

Now we are ready to create a data frame with our data. First create its schema and populate the metadata:

julia> data_df = DataFrame(data_colmeta.name .=> Ref(String[]))
0×7 DataFrame
 Row │ vin_1_10  state   model_year  make    ev_type  cafv_type  electric_range
     │ String    String  String      String  String   String     String
─────┴──────────────────────────────────────────────────────────────────────────

julia> caption!(data_df, data_json.meta.view.name);

julia> note!(data_df, data_json.meta.view.description);

julia> for row in eachrow(data_colmeta)
           label!(data_df, row.name, row.label)
           note!(data_df, row.name, row.note)
       end

Let us check if indeed metadata got properly stored. We can either extract dictionaries:

julia> metadata(data_df)
Dict{String, String} with 2 entries:
  "note"    => "This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL)."
  "caption" => "Electric Vehicle Population Data"

julia> colmetadata(data_df)
Dict{Symbol, Dict{String, String}} with 7 entries:
  :ev_type        => Dict("label"=>"Electric Vehicle Type", "note"=>"This distinguishes the vehicle as all electric or a plug-in hybrid.")
  :make           => Dict("label"=>"Make", "note"=>"The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN).")
  :cafv_type      => Dict("label"=>"Clean Alternative Fuel Vehicle (CAFV) Eligibility", "note"=>"This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-o…
  :state          => Dict("label"=>"State", "note"=>"The state in which the registered owner resides.")
  :vin_1_10       => Dict("label"=>"VIN (1-10)", "note"=>"The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN).")
  :electric_range => Dict("label"=>"Electric Range", "note"=>"Describes how far a vehicle can travel purely on its electric charge.")
  :model_year     => Dict("label"=>"Model Year", "note"=>"The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).")

Or get a TOML representation of metadata (the first entry of TOML output means that we did not store metadata style information):

julia> println(meta2toml(data_df, style=false))
style = false

[colmetadata.cafv_type]
label = "Clean Alternative Fuel Vehicle (CAFV) Eligibility"
note = "This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only range requirement in House Bill 2042 as passed in the 2019 legislative session."

[colmetadata.electric_range]
label = "Electric Range"
note = "Describes how far a vehicle can travel purely on its electric charge."

[colmetadata.ev_type]
label = "Electric Vehicle Type"
note = "This distinguishes the vehicle as all electric or a plug-in hybrid."

[colmetadata.make]
label = "Make"
note = "The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN)."

[colmetadata.model_year]
label = "Model Year"
note = "The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN)."

[colmetadata.state]
label = "State"
note = "The state in which the registered owner resides."

[colmetadata.vin_1_10]
label = "VIN (1-10)"
note = "The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN)."

[metadata]
caption = "Electric Vehicle Population Data"
note = "This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL)."

All looks good so far.

Additionally you can easily extract all column labels using the labels function:

julia> labels(data_df)
7-element Vector{String}:
 "VIN (1-10)"
 "State"
 "Model Year"
 "Make"
 "Electric Vehicle Type"
 "Clean Alternative Fuel Vehicle (CAFV) Eligibility"
 "Electric Range"

Now we store the data in our data frame:

julia> foreach(row -> push!(data_df, row[DATA_COLSKEEP]), data_json.data)

julia> show(data_df, truncate=10)
112634×7 DataFrame
    Row │ vin_1_10    state   model_year  make         ev_type      cafv_type    electric_range
        │ String      String  String      String       String       String       String
────────┼───────────────────────────────────────────────────────────────────────────────────────
      1 │ JTMEB3FV6N  FL      2022        TOYOTA       Plug-in Hy…  Clean Alte…  42
      2 │ 1G1RD6E45D  NV      2013        CHEVROLET    Plug-in Hy…  Clean Alte…  38
      3 │ JN1AZ0CP8B  WA      2011        NISSAN       Battery El…  Clean Alte…  73
      4 │ 1G1FW6S08H  WA      2017        CHEVROLET    Battery El…  Clean Alte…  238
      5 │ 3FA6P0SU1K  WA      2019        FORD         Plug-in Hy…  Not eligib…  26
      6 │ 5YJ3E1EB5J  WA      2018        TESLA        Battery El…  Clean Alte…  215
   ⋮    │     ⋮         ⋮         ⋮            ⋮            ⋮            ⋮             ⋮
 112630 │ 7SAYGDEF2N  WA      2022        TESLA        Battery El…  Eligibilit…  0
 112631 │ 1N4BZ1CP7K  WA      2019        NISSAN       Battery El…  Clean Alte…  150
 112632 │ 1FMCU0KZ4N  WA      2022        FORD         Plug-in Hy…  Clean Alte…  38
 112633 │ KNDCD3LD4J  WA      2018        KIA          Plug-in Hy…  Not eligib…  26
 112634 │ YV4BR0CL8N  WA      2022        VOLVO        Plug-in Hy…  Not eligib…  18
                                                                             112623 rows omitted

julia> describe(data_df, :min, :max, :nmissing)
7×4 DataFrame
 Row │ variable        min                                max                                nmissing
     │ Symbol          String                             String                             Int64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ vin_1_10        1C4JJXP60M                         YV4H60DZ9N                                0
   2 │ state           AK                                 WY                                        0
   3 │ model_year      1997                               2023                                      0
   4 │ make            AUDI                               VOLVO                                     0
   5 │ ev_type         Battery Electric Vehicle (BEV)     Plug-in Hybrid Electric Vehicle …         0
   6 │ cafv_type       Clean Alternative Fuel Vehicle E…  Not eligible due to low battery …         0
   7 │ electric_range  0                                  97                                        0

We are almost done. It looks like :model_year and :electric_range are integer, so we perform an appropriate conversion:

julia> transform!(data_df,
                  [:model_year, :electric_range] .=> ByRow(x -> parse(Int, x)),
                  renamecols=false);

Here one important comment has to be made. Since we keep the old column names for the transformed columns their "label" and "note" metadata is kept. The reason is that this metadata originally had :note style. Let us check that indeed metadata is kept:

julia> colmetadata(data_df, :model_year)
Dict{String, String} with 2 entries:
  "label" => "Model Year"
  "note"  => "The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN)."

julia> colmetadata(data_df, :electric_range)
Dict{String, String} with 2 entries:
  "label" => "Electric Range"
  "note"  => "Describes how far a vehicle can travel purely on its electric charge."

It is an important rule when working with :note-style metadata to keep old column name only if we do not change the meaning of the column (like in our case, where we just parsed it).

Doing some analysis of the data

Let us check how many observations we have for combination of :ev_type and :cafv_type columns:

julia> data_gdf = groupby(data_df, [:ev_type, :cafv_type], sort=true)

julia> data_agg1 = combine(data_gdf, nrow, :electric_range .=> extrema)
5×4 DataFrame
 Row │ ev_type                            cafv_type                          nrow   electric_range_extrema
     │ String                             String                             Int64  Tuple{Int64, Int64}
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Battery Electric Vehicle (BEV)     Clean Alternative Fuel Vehicle E…  46799  (39, 337)
   2 │ Battery Electric Vehicle (BEV)     Eligibility unknown as battery r…  39236  (0, 0)
   3 │ Battery Electric Vehicle (BEV)     Not eligible due to low battery …      9  (29, 29)
   4 │ Plug-in Hybrid Electric Vehicle …  Clean Alternative Fuel Vehicle E…  11840  (30, 153)
   5 │ Plug-in Hybrid Electric Vehicle …  Not eligible due to low battery …  14750  (6, 29)

The results seem interesting, but we most likely cannot immediately understand them. Let us check the column metadata:

julia> colmetadata(data_agg1)
Dict{Symbol, Dict{String, String}} with 2 entries:
  :ev_type   => Dict("label"=>"Electric Vehicle Type", "note"=>"This distinguishes the vehicle as all electric or a plug-in hybrid.")
  :cafv_type => Dict("label"=>"Clean Alternative Fuel Vehicle (CAFV) Eligibility", "note"=>"This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only r…

We can see that only :ev_type and :cafv_type columns have metadata. This is intended. These columns were not transformed so their metadata is kept because it had :note style. On the other hand the :electric_range column was transformed, by taking its extrema so its metadata was not propagated.

Now we can more easily understand the results. The :ev_type column tells us if the vehicle is electric only or hybrid. On the other hand :cafv_type column tells us if the vehicle meets the requirements of Clean Alternative Fuel Vehicles.

Interestingly, for some BEV cars it is not known if they meet CAFV condition, and this is indicated by 0 in their :electric_range. For other cases we see that 30 is a minimal range that allows to classify vehicle as meeting CAFV. Surprisingly six BEV do not meet it, as their battery range is too low.

Let us check how does the frequency of electric range equal to 0 behaves:

julia> data_agg2 = sort(unstack(data_df,
                                :model_year,
                                :ev_type,
                                :electric_range,
                                combine = x -> mean(x .> 0)))
20×3 DataFrame
 Row │ model_year  Plug-in Hybrid Electric Vehicle (PHEV)  Battery Electric Vehicle (BEV)
     │ Int64       Union{Missing, Float64}                 Union{Missing, Float64}
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │       1997                               missing                         1.0
   2 │       1998                               missing                         1.0
   3 │       1999                               missing                         1.0
   4 │       2000                               missing                         1.0
   5 │       2002                               missing                         1.0
   6 │       2008                               missing                         1.0
   7 │       2010                               missing                         1.0
   8 │       2011                                     1.0                       1.0
   9 │       2012                                     1.0                       1.0
  10 │       2013                                     1.0                       1.0
  11 │       2014                                     1.0                       1.0
  12 │       2015                                     1.0                       1.0
  13 │       2016                                     1.0                       1.0
  14 │       2017                                     1.0                       1.0
  15 │       2018                                     1.0                       1.0
  16 │       2019                                     1.0                       0.999763
  17 │       2020                                     1.0                       0.994388
  18 │       2021                                     1.0                       0.0327439
  19 │       2022                                     1.0                       0.0
  20 │       2023                                     1.0                       0.0

We see that indeed the newest BEV do not have the range data (because it was not collected yet). Also we learn that hybrids were introduced only in year 2011.

As above we can check that column metadata is only kept for :model_year column since we did not transform it:

julia> colmetadata(data_agg2)
Dict{Symbol, Dict{String, String}} with 1 entry:
  :model_year => Dict("label"=>"Model Year", "note"=>"The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).")

Saving the data frame while keeping metadata

The meta2toml function was designed to allow for easy saving of metadata to an independent file. However, I find it more convenient to save metadata along with data. Here Parquet2.jl comes handy.

Let us work with data_agg2 data frame:

julia> Parquet2.writefile("data_agg2.parquet", data_agg2)
✏ Parquet2.FileWriter{IOStream}(data_agg2.parquet)

julia> data_pq = Parquet2.Dataset("data_agg2.parquet")
≔ Parquet2.Dataset (928 bytes)
        1. "model_year": Int64
        2. "Plug-in Hybrid Electric Vehicle (PHEV)": Union{Missing, Float64}
        3. "Battery Electric Vehicle (BEV)": Union{Missing, Float64}


julia> data_pq_df = DataFrame(data_pq);

We can check that both data_pq and data_pq_df have metadata information:

julia> println(meta2toml(data_pq))
style = true

[colmetadata."Battery Electric Vehicle (BEV)"]

[colmetadata."Plug-in Hybrid Electric Vehicle (PHEV)"]

[colmetadata.model_year]
label = ["Model Year", "default"]
note = ["The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).", "default"]

[metadata]
caption = ["Electric Vehicle Population Data", "default"]
note = ["This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL).", "default"]

julia> println(meta2toml(data_pq_df))
style = true

[colmetadata.model_year]
label = ["Model Year", "default"]
note = ["The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).", "default"]

[metadata]
caption = ["Electric Vehicle Population Data", "default"]
note = ["This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL).", "default"]

We note that the metadata has :default style, because Parquet does not support metadata styles. Therefore, if we wanted to turn all metadata to :note style, so that it would be propagated we can write e.g.:

julia> setallmetadatastyle!(data_pq_df);

julia> print(meta2toml(data_pq_df))
style = true

[colmetadata.model_year]
label = ["Model Year", "note"]
note = ["The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).", "note"]

[metadata]
caption = ["Electric Vehicle Population Data", "note"]
note = ["This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL).", "note"]

What features does TableMetadataTools.jl provide?

Here is the list of functionalities that you can find in TableMetadataTools.jl that are added on top of standard metadata management functions defined in DataAPI.jl:

  • the label, label!, labels, and findlabels functions for convenient work with column label metadata;
  • the caption and caption! functions for convenient work with table caption metadata;
  • note and note! functions for convenient work with note metadata both on table and column level;
  • unit (re-exported from Unitful.jl), unit!, and units functions for convenient work with column unit metadata;
  • setmetadatastyle!, setcolmetadatastyle!, setallmetadatastyle! for group setting style for keys matching a passed pattern; usually needed when working with metadata that initially has :default style set and one wants it to have :note style (common when reading metadata from storage formats that do not support metadata style information);
  • meta2toml and toml2meta! for storing and loading metadata in TOML format;
  • dict2metadata!, dict2colmetadata! for setting table and column level metadata stored in a dictionary (e.g. earlier retrieved from some storage format or by using metadata or colmetadata functions);
  • the @track macro for tracking operations applied to table and the tracklog function for convenient printing of tracking metadata.

Conclusions

I hope you will find this post useful as a tutorial to both working with metadata and ingesting data from a complex JSON file into a data frame.

My general feeling is that, while metadata is commonly used by Stata or SAS users, it had less adoption in other ecosystems. The reason is likely that there was no decent support for storing and working with tabular metadata. Now, I hope the combination of DataFrames.jl + Parquet2.jl helps to solve this issue.

Admittedly in many cases you do not need table metadata. However, I am convinced that if you work with tables that have hundreds or thousands of columns or that you want to share your tables between teams having proper metadata attached to data is essential.