15 Basics
We start by loading both the DataFrames and CategoricalArrays packages.
CategoricalArrays adds support for categorical variables (similar to factors in R).
using DataFrames, CategoricalArrays
15.1 Create a DataFrame
Some synthetic arrays:
a = [1, 2, 3, 4, 5];
b = collect(25:-1:21);
c = categorical(["a", "b", "c", "c", "a"]);
Combine into a DataFrame. Note that the second argument, :auto, is required to automatically generate column names:
dat = DataFrame([a, b, c], :auto)
5×3 DataFrame
| Row | x1 | x2 | x3 |
|---|---|---|---|
| | Int64 | Int64 | Cat… |
| 1 | 1 | 25 | a |
| 2 | 2 | 24 | b |
| 3 | 3 | 23 | c |
| 4 | 4 | 22 | c |
| 5 | 5 | 21 | a |
15.2 Get column names: names()
names(dat)
3-element Vector{String}:
"x1"
"x2"
"x3"
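names() returns the column names as strings. If Symbols are more convenient, propertynames() returns them in that form. A minimal sketch, using a small hypothetical table `df` rather than the `dat` built above:

```julia
using DataFrames

# Hypothetical toy table, used only for illustration
df = DataFrame(x1 = 1:3, x2 = 4:6)

name_strings = names(df)          # column names as a Vector{String}
name_symbols = propertynames(df)  # column names as a Vector{Symbol}
```

Either form can be used to index columns, so the choice is mostly a matter of what the surrounding code expects.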
15.3 Set column names
To specify column names when creating a new DataFrame you can pass a vector of Symbols:
dat = DataFrame([a, b, c], [:alpha, :beta, :gamma])
5×3 DataFrame
| Row | alpha | beta | gamma |
|---|---|---|---|
| | Int64 | Int64 | Cat… |
| 1 | 1 | 25 | a |
| 2 | 2 | 24 | b |
| 3 | 3 | 23 | c |
| 4 | 4 | 22 | c |
| 5 | 5 | 21 | a |
Alternatively, pass the columns as keyword arguments:
dat = DataFrame(ey = a, bee = b, cee = c)
5×3 DataFrame
| Row | ey | bee | cee |
|---|---|---|---|
| | Int64 | Int64 | Cat… |
| 1 | 1 | 25 | a |
| 2 | 2 | 24 | b |
| 3 | 3 | 23 | c |
| 4 | 4 | 22 | c |
| 5 | 5 | 21 | a |
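A third option, sketched here as an aside, is to pass `name => column` pairs to the constructor. This can be handy when the column names are not valid Julia identifiers. The variable names below (`a_demo`, `b_demo`, `df_pairs`) are hypothetical and chosen to avoid clobbering the arrays defined earlier:

```julia
using DataFrames

# Hypothetical example vectors
a_demo = [1, 2, 3]
b_demo = [25, 24, 23]

# Each pair maps a column name to its data
df_pairs = DataFrame(:alpha => a_demo, :beta => b_demo)
```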
15.4 Rename columns of a DataFrame: rename!()
By convention, the trailing ! signifies that the function modifies its argument in place.
rename!(dat, [:alpha, :beta, :gamma])
5×3 DataFrame
| Row | alpha | beta | gamma |
|---|---|---|---|
| | Int64 | Int64 | Cat… |
| 1 | 1 | 25 | a |
| 2 | 2 | 24 | b |
| 3 | 3 | 23 | c |
| 4 | 4 | 22 | c |
| 5 | 5 | 21 | a |
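To rename only some columns, `old => new` pairs can be passed instead of a full vector of names. The sketch below also contrasts rename() (returns a new DataFrame) with rename!() (modifies in place). The table `df` and its column names are hypothetical:

```julia
using DataFrames

# Hypothetical toy table
df = DataFrame(ey = 1:3, bee = 4:6)

# rename() returns a new DataFrame; df keeps its original names
df2 = rename(df, :ey => :alpha)

# rename!() changes the names of df itself
rename!(df, :bee => :beta)
```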
15.5 Get dimensions: size(), nrow(), ncol()
size(dat)
(5, 3)
nrow(dat)
5
ncol(dat)
3
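As a small illustration of how these functions relate, size() also accepts a dimension argument, and its result can be destructured. The table `df` below is a hypothetical stand-in:

```julia
using DataFrames

# Hypothetical toy table with 5 rows and 3 columns
df = DataFrame(a = 1:5, b = 6:10, c = 11:15)

n_rows, n_cols = size(df)   # destructure the (rows, columns) tuple
rows_only = size(df, 1)     # same value as nrow(df)
cols_only = size(df, 2)     # same value as ncol(df)
```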
15.6 DataFrame summary: describe()
describe(dat)
3×7 DataFrame
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | alpha | 3.0 | 1 | 3.0 | 5 | 0 | Int64 |
| 2 | beta | 23.0 | 21 | 23.0 | 25 | 0 | Int64 |
| 3 | gamma | | a | | c | 0 | CategoricalValue{String, UInt32} |
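describe() can also be limited to selected statistics by passing their names as Symbols after the DataFrame. A minimal sketch on a hypothetical numeric table `df`:

```julia
using DataFrames

# Hypothetical toy table
df = DataFrame(alpha = 1:5, beta = 25:-1:21)

# Request only the mean and the number of missing values
stats = describe(df, :mean, :nmissing)
```

The result is itself a DataFrame with one row per column of `df`, so it can be indexed and sorted like any other table.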
15.7 Sort: sort() & sort!()
sort() returns a sorted copy of the DataFrame without altering the original:
sort(dat, :beta)
5×3 DataFrame
| Row | alpha | beta | gamma |
|---|---|---|---|
| | Int64 | Int64 | Cat… |
| 1 | 5 | 21 | a |
| 2 | 4 | 22 | c |
| 3 | 3 | 23 | c |
| 4 | 2 | 24 | b |
| 5 | 1 | 25 | a |
dat has not changed:
dat
5×3 DataFrame
| Row | alpha | beta | gamma |
|---|---|---|---|
| | Int64 | Int64 | Cat… |
| 1 | 1 | 25 | a |
| 2 | 2 | 24 | b |
| 3 | 3 | 23 | c |
| 4 | 4 | 22 | c |
| 5 | 5 | 21 | a |
Change the order of DataFrame rows in place with sort!():
sort!(dat, :beta)
5×3 DataFrame
| Row | alpha | beta | gamma |
|---|---|---|---|
| | Int64 | Int64 | Cat… |
| 1 | 5 | 21 | a |
| 2 | 4 | 22 | c |
| 3 | 3 | 23 | c |
| 4 | 2 | 24 | b |
| 5 | 1 | 25 | a |
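Sorting can also be done in descending order or on several columns. The sketch below uses the `rev` keyword and the `order()` helper exported by DataFrames. The table `df` and its columns `g` and `v` are hypothetical:

```julia
using DataFrames

# Hypothetical toy table
df = DataFrame(g = ["b", "a", "b", "a"], v = [1, 2, 3, 4])

# Descending sort on one column
desc = sort(df, :v, rev = true)

# Sort by g first, then by v within each value of g
multi = sort(df, [:g, :v])

# Ascending on g, descending on v, using order() for per-column settings
mixed = sort(df, [:g, order(:v, rev = true)])
```

All three calls use sort(), so `df` itself is left in its original order.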
15.8 Indexing
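Return a vector (indexing rows with : returns a copy of the column, while ! returns the stored column without copying):
dat[:, 2]
5-element Vector{Int64}: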
21
22
23
24
25
dat[:, :alpha]
5-element Vector{Int64}:
5
4
3
2
1
dat[!, :alpha]
5-element Vector{Int64}:
5
4
3
2
1
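The two row selectors print the same values but behave differently when the result is modified. A minimal sketch of the difference, using a hypothetical table `df`:

```julia
using DataFrames

# Hypothetical toy table
df = DataFrame(alpha = [1, 2, 3])

col_copy = df[:, :alpha]   # `:` returns a fresh copy of the column
col_ref  = df[!, :alpha]   # `!` returns the vector stored in df

col_copy[1] = 100          # does not affect df
col_ref[1]  = -1           # changes df, because no copy was made
```

In this sketch `df.alpha` ends up as `[-1, 2, 3]`: only the write through `col_ref` reaches the table.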
Return a DataFrame:
dat[:, [2]]
5×1 DataFrame
| Row | beta |
|---|---|
| | Int64 |
| 1 | 21 |
| 2 | 22 |
| 3 | 23 |
| 4 | 24 |
| 5 | 25 |
dat[:, [:alpha]]
5×1 DataFrame
| Row | alpha |
|---|---|
| | Int64 |
| 1 | 5 |
| 2 | 4 |
| 3 | 3 |
| 4 | 2 |
| 5 | 1 |
15.9 Access columns by name saved in variable
var = "gamma"
dat[!, Symbol(var)]
5-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"c"
"c"
"b"
"a"
dat[!, [Symbol(var)]]
5×1 DataFrame
| Row | gamma |
|---|---|
| | Cat… |
| 1 | a |
| 2 | c |
| 3 | c |
| 4 | b |
| 5 | a |
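Recent versions of DataFrames also accept strings as column selectors, so the conversion to a Symbol is optional. The sketch below assumes such a version is installed; the table `df` and the variable `colname` are hypothetical:

```julia
using DataFrames

# Hypothetical toy table
df = DataFrame(alpha = 1:3, gamma = ["a", "b", "c"])

colname = "gamma"

as_vector = df[!, colname]      # a single name returns the column vector
as_table  = df[!, [colname]]    # a vector of names returns a DataFrame
via_symbol = df[!, Symbol(colname)]  # equivalent to the Symbol-based form
```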
15.10 Add column to DataFrame
dat.asq = dat.alpha .^ 2
5-element Vector{Int64}:
25
16
9
4
1
dat
5×4 DataFrame
| Row | alpha | beta | gamma | asq |
|---|---|---|---|---|
| | Int64 | Int64 | Cat… | Int64 |
| 1 | 5 | 21 | a | 25 |
| 2 | 4 | 22 | c | 16 |
| 3 | 3 | 23 | c | 9 |
| 4 | 2 | 24 | b | 4 |
| 5 | 1 | 25 | a | 1 |
You can also do the same with transform!(), which is useful when the operation is built programmatically:
transform!(dat, :alpha => (x -> x .^ 2) => :asqtoo)
5×5 DataFrame
| Row | alpha | beta | gamma | asq | asqtoo |
|---|---|---|---|---|---|
| | Int64 | Int64 | Cat… | Int64 | Int64 |
| 1 | 5 | 21 | a | 25 | 25 |
| 2 | 4 | 22 | c | 16 | 16 |
| 3 | 3 | 23 | c | 9 | 9 |
| 4 | 2 | 24 | b | 4 | 4 |
| 5 | 1 | 25 | a | 1 | 1 |
dat
5×5 DataFrame
| Row | alpha | beta | gamma | asq | asqtoo |
|---|---|---|---|---|---|
| | Int64 | Int64 | Cat… | Int64 | Int64 |
| 1 | 5 | 21 | a | 25 | 25 |
| 2 | 4 | 22 | c | 16 | 16 |
| 3 | 3 | 23 | c | 9 | 9 |
| 4 | 2 | 24 | b | 4 | 4 |
| 5 | 1 | 25 | a | 1 | 1 |
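The `source => function => target` form used with transform!() passes the whole column to the function, which is why the example broadcasts with `.^`. As a sketch of an alternative, `ByRow` wraps a scalar function so it is applied element by element, and the non-mutating transform() returns a new DataFrame. The table `df` and the column name `asq_demo` are hypothetical:

```julia
using DataFrames

# Hypothetical toy table
df = DataFrame(alpha = [5, 4, 3])

# ByRow applies the scalar function x -> x^2 to each element of :alpha;
# transform() leaves df untouched and returns a new DataFrame
out = transform(df, :alpha => ByRow(x -> x^2) => :asq_demo)
```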
15.10.1 Insert column(s) at location
x = DataFrame(v = 1:5, w = 16:20)
5×2 DataFrame
| Row | v | w |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 16 |
| 2 | 2 | 17 |
| 3 | 3 | 18 |
| 4 | 4 | 19 |
| 5 | 5 | 20 |
insertcols!(x, 2, :v² => x.v .^ 2)
5×3 DataFrame
| Row | v | v² | w |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 16 |
| 2 | 2 | 4 | 17 |
| 3 | 3 | 9 | 18 |
| 4 | 4 | 16 | 19 |
| 5 | 5 | 25 | 20 |
wt = DataFrame(w² = x.w .^ 2, w³ = x.w .^ 3)
5×2 DataFrame
| Row | w² | w³ |
|---|---|---|
| | Int64 | Int64 |
| 1 | 256 | 4096 |
| 2 | 289 | 4913 |
| 3 | 324 | 5832 |
| 4 | 361 | 6859 |
| 5 | 400 | 8000 |
insertcols!(x, ([:w², :w³] .=> eachcol(wt))...)
5×5 DataFrame
| Row | v | v² | w | w² | w³ |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 16 | 256 | 4096 |
| 2 | 2 | 4 | 17 | 289 | 4913 |
| 3 | 3 | 9 | 18 | 324 | 5832 |
| 4 | 4 | 16 | 19 | 361 | 6859 |
| 5 | 5 | 25 | 20 | 400 | 8000 |
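Two related variations, sketched under the assumption of a DataFrames 1.x installation: the insertion position can be given as a column name instead of an integer, and hcat() appends all columns of another DataFrame without mutating either input. The tables `y` and `extra` and their column names are hypothetical:

```julia
using DataFrames

# Hypothetical toy tables
y = DataFrame(v = 1:3, w = 4:6)
extra = DataFrame(w2 = y.w .^ 2, w3 = y.w .^ 3)

# Insert a new column before the existing column :w (position given by name)
insertcols!(y, :w, :v2 => y.v .^ 2)

# Append every column of `extra`; hcat returns a new DataFrame
combined = hcat(y, extra)
```

When no position is given, as in the last insertcols!() call above, the new columns are placed at the end of the table.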