DataFrames.jl vs Pandas, dplyr, and Stata
New content in DataFrames.jl documentation
Many people moving to DataFrames.jl from other data-management ecosystems are interested in learning how to map their favorite code patterns to Julia.
This has been a long-standing issue. Fortunately, thanks to the recent efforts of Matthieu Gomez and Tom Kwong (with the usual major support from Peter Deffebach and Milan Bouchet-Valat, and a few other contributors), we finally have a section in the manual comparing DataFrames.jl with Pandas, dplyr, and Stata.
In parallel, Tom Kwong also prepared a DataFrames.jl cheat sheet, which excellently shows the key functionalities that we currently provide.
We all hope that these materials will help people get started with DataFrames.jl. If you would like to see additional content in the comparisons section of the DataFrames.jl manual, please do not hesitate to open an issue or pull request.
Lessons learned
As an afterword, let me comment that preparing the dplyr and Stata material went much more smoothly than preparing the Pandas material. This is also reflected in the volume of material covered (though the dplyr and Stata coverage could probably be improved). The main reason is that Pandas differs from DataFrames.jl in many more ways than dplyr or Stata do. A few of the notable differences are:
- the type of the value returned by `loc` in Pandas depends on the value (not only the type) of its arguments;
- 0-based indexing (Pandas) vs 1-based indexing (DataFrames.jl);
- `NaN` in Pandas plays the role of `missing` in Julia, but Pandas skips it by default, whereas in Julia you have to be explicit;
- Pandas has an `inplace` argument to functions, while in Julia we have functions with and without `!` to distinguish between mutating and non-mutating operations;
- Pandas provides a row index, while in DataFrames.jl you need a separate column (or columns) in a `DataFrame` to hold it and later run the `groupby` function on them to get efficient row lookup through a `GroupedDataFrame` object (note, in particular, that in this way you can have many different sets of row-indexing columns for the same data frame).
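To make the DataFrames.jl side of these points concrete, here is a minimal sketch. The data frame and its column names (`id`, `x`) are made up for illustration, and the snippet assumes a reasonably recent DataFrames.jl release; treat it as one possible way to express these patterns, not as the recommended one.

```julia
using DataFrames
using Statistics  # for mean

# Hypothetical example data; the column names are invented for this sketch.
df = DataFrame(id = ["a", "b", "c"], x = [1.0, missing, 3.0])

# 1-based indexing: the first row is row 1, not row 0.
first_id = df[1, :id]

# The return type depends on the types of the indices, not on their values:
# an integer row index gives a DataFrameRow, a vector of indices gives a DataFrame.
one_row = df[1, :]
one_row_df = df[[1], :]

# `missing` propagates unless you skip it explicitly.
m_all = mean(df.x)
m_skip = mean(skipmissing(df.x))

# `sort` returns a new data frame, while `sort!` changes `df` in place.
sorted = sort(df, :id, rev = true)
ids_before = copy(df.id)
sort!(df, :id, rev = true)

# Row lookup by key: group by the key column(s), then index the
# GroupedDataFrame with a tuple of key values.
gdf = groupby(df, :id)
row_b = gdf[("b",)]
```

Because the lookup structure is just a `GroupedDataFrame`, you can call `groupby` again with another set of columns to get a second, independent lookup for the same data frame.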