using DataFrames17 Joins
a = DataFrame(PID = collect(1:9),
Hospital = ["UCSF", "HUP", "Stanford",
"Stanford", "UCSF", "HUP",
"HUP", "Stanford", "UCSF"],
Age = [22, 34, 41, 19, 53, 21, 63, 22, 19],
Sex = [1, 1, 0, 1, 0, 0, 1, 0, 0])9×4 DataFrame
| Row | PID | Hospital | Age | Sex |
|---|---|---|---|---|
| Int64 | String | Int64 | Int64 | |
| 1 | 1 | UCSF | 22 | 1 |
| 2 | 2 | HUP | 34 | 1 |
| 3 | 3 | Stanford | 41 | 0 |
| 4 | 4 | Stanford | 19 | 1 |
| 5 | 5 | UCSF | 53 | 0 |
| 6 | 6 | HUP | 21 | 0 |
| 7 | 7 | HUP | 63 | 1 |
| 8 | 8 | Stanford | 22 | 0 |
| 9 | 9 | UCSF | 19 | 0 |
b = DataFrame(PID = collect(6:12),
V1 = [153, 89, 112, 228, 91, 190, 101],
Department = ["Neurology", "Radiology",
"Emergency", "Cardiology",
"Surgery", "Neurology", "Psychiatry"])7×3 DataFrame
| Row | PID | V1 | Department |
|---|---|---|---|
| Int64 | Int64 | String | |
| 1 | 6 | 153 | Neurology |
| 2 | 7 | 89 | Radiology |
| 3 | 8 | 112 | Emergency |
| 4 | 9 | 228 | Cardiology |
| 5 | 10 | 91 | Surgery |
| 6 | 11 | 190 | Neurology |
| 7 | 12 | 101 | Psychiatry |
17.1 Inner join
ab_inner = innerjoin(a, b, on = :PID)4×6 DataFrame
| Row | PID | Hospital | Age | Sex | V1 | Department |
|---|---|---|---|---|---|---|
| Int64 | String | Int64 | Int64 | Int64 | String | |
| 1 | 6 | HUP | 21 | 0 | 153 | Neurology |
| 2 | 7 | HUP | 63 | 1 | 89 | Radiology |
| 3 | 8 | Stanford | 22 | 0 | 112 | Emergency |
| 4 | 9 | UCSF | 19 | 0 | 228 | Cardiology |
Note that the resulting table only contains cases found in both data frames (i.e. IDs 6 through 9)
17.2 Outer join
ab_outer = outerjoin(a, b, on = :PID)12×6 DataFrame
| Row | PID | Hospital | Age | Sex | V1 | Department |
|---|---|---|---|---|---|---|
| Int64 | String? | Int64? | Int64? | Int64? | String? | |
| 1 | 6 | HUP | 21 | 0 | 153 | Neurology |
| 2 | 7 | HUP | 63 | 1 | 89 | Radiology |
| 3 | 8 | Stanford | 22 | 0 | 112 | Emergency |
| 4 | 9 | UCSF | 19 | 0 | 228 | Cardiology |
| 5 | 1 | UCSF | 22 | 1 | missing | missing |
| 6 | 2 | HUP | 34 | 1 | missing | missing |
| 7 | 3 | Stanford | 41 | 0 | missing | missing |
| 8 | 4 | Stanford | 19 | 1 | missing | missing |
| 9 | 5 | UCSF | 53 | 0 | missing | missing |
| 10 | 10 | missing | missing | missing | 91 | Surgery |
| 11 | 11 | missing | missing | missing | 190 | Neurology |
| 12 | 12 | missing | missing | missing | 101 | Psychiatry |
Note that the resulting data frame contains all IDs found in either input data frame and missing values are represented with missing
17.3 Left outer join
ab_leftouter = leftjoin(a, b, on = :PID)9×6 DataFrame
| Row | PID | Hospital | Age | Sex | V1 | Department |
|---|---|---|---|---|---|---|
| Int64 | String | Int64 | Int64 | Int64? | String? | |
| 1 | 6 | HUP | 21 | 0 | 153 | Neurology |
| 2 | 7 | HUP | 63 | 1 | 89 | Radiology |
| 3 | 8 | Stanford | 22 | 0 | 112 | Emergency |
| 4 | 9 | UCSF | 19 | 0 | 228 | Cardiology |
| 5 | 1 | UCSF | 22 | 1 | missing | missing |
| 6 | 2 | HUP | 34 | 1 | missing | missing |
| 7 | 3 | Stanford | 41 | 0 | missing | missing |
| 8 | 4 | Stanford | 19 | 1 | missing | missing |
| 9 | 5 | UCSF | 53 | 0 | missing | missing |
Note how the resulting data frame contains all IDs present in the left input DataFrame only.
17.4 Right outer join
ab_rightouter = rightjoin(a, b, on = :PID)7×6 DataFrame
| Row | PID | Hospital | Age | Sex | V1 | Department |
|---|---|---|---|---|---|---|
| Int64 | String? | Int64? | Int64? | Int64 | String | |
| 1 | 6 | HUP | 21 | 0 | 153 | Neurology |
| 2 | 7 | HUP | 63 | 1 | 89 | Radiology |
| 3 | 8 | Stanford | 22 | 0 | 112 | Emergency |
| 4 | 9 | UCSF | 19 | 0 | 228 | Cardiology |
| 5 | 10 | missing | missing | missing | 91 | Surgery |
| 6 | 11 | missing | missing | missing | 190 | Neurology |
| 7 | 12 | missing | missing | missing | 101 | Psychiatry |
Note how the resulting data frame contains all IDs present in the right input DataFrame only.