New content in DataFrames.jl documentation
Many people moving to DataFrames.jl from other data-management ecosystems are interested in learning how to map their favorite code patterns to Julia.
This was a long-standing issue. Fortunately, thanks to the recent efforts of Matthieu Gomez and Tom Kwong (with the usual major support from Peter Deffebach and Milan Bouchet-Valat, and a few other contributors), we finally have a section in the manual on comparisons against Pandas, dplyr, and Stata.
We all hope that these materials will help people get started with DataFrames.jl. If you would like to see additional content in the comparisons section of the DataFrames.jl manual, please do not hesitate to open an issue or pull request.
As an afterword, let me comment that getting the dplyr and Stata material in place was much smoother than for Pandas. This is also reflected in the volume of material covered (though the dplyr and Stata coverage could probably be improved). The main reason is that Pandas differs from DataFrames.jl in many more ways than dplyr or Stata do. A few of the notable differences are:
- the type of the return value of the `loc` function in Pandas depends on the value (not only the type) of its arguments;
- 0-based indexing (Pandas) vs 1-based indexing (DataFrames.jl);
- `NaN` in Pandas plays the role of `missing` in Julia, but it is skipped by default, whereas in Julia you have to skip missing values explicitly;
- Pandas has an `inplace` argument to functions, while in Julia we have functions with and without `!` to distinguish between mutating and non-mutating operations;
- Pandas provides a row index, while in DataFrames.jl you need a separate column (or columns) in a `DataFrame` to hold it, and later run the `groupby` function on them to get efficient row lookup through a `GroupedDataFrame` object (note, in particular, that this way you can have many different sets of row-indexing columns for the same data frame).
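
Several of the differences listed above can be illustrated on the Julia side with a short example. This is only a minimal sketch, assuming a reasonably recent DataFrames.jl release; the data frame and its column names (`id`, `x`) are made up for illustration and do not come from the manual.

```julia
using DataFrames
using Statistics

# Hypothetical example data; column names are assumptions for illustration.
df = DataFrame(id = ["a", "b", "c"], x = [3.0, missing, 1.0])

# 1-based indexing: the first row is row 1.
first_id = df[1, :id]

# Missing values propagate unless you skip them explicitly.
m_propagated = mean(df.x)               # missing
m_skipped = mean(skipmissing(df.x))     # 2.0

# Non-mutating vs mutating: `sort` returns a new data frame,
# while `sort!` reorders `df` in place.
sorted_copy = sort(df, :id, rev = true) # `df` is unchanged at this point
sort!(df, :id, rev = true)              # now `df` itself is reordered

# Row lookup by key: group on the key column(s), then index the
# GroupedDataFrame with a tuple of key values.
gdf = groupby(df, :id)
row_b = gdf[("b",)]                     # the group of rows where id == "b"
```

Because the lookup structure is a separate `GroupedDataFrame`, the same `df` could be grouped on another set of columns to get a second, independent way of looking rows up.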