W#04: Relational Data, Math: Sets and Functions, Programming Functions
Jan Lorenz
Preliminaries
In this lectures we will use these packages and datasets. You need to do this code in the Console to download data and play with some of the code in this lecture.
We modify strings with the stringr package from the tidyverse core. All functions from stringr start with str_.
Very few examples:
c("x","y")
[1] "x" "y"
str_c("x","y")
[1] "xy"
str_c("x","y","z", sep=",")
[1] "x,y,z"
length(c("x","y","z"))
[1] 3
str_length(c("x","y","z"))
[1] 1 1 1
str_length(c("This is a string.","z"))
[1] 17 1
String wrangling with variable names
data <-tibble(Name =c("A","B","C"), Age_2020 =c(20,30,40), Age_2021 =c(21,31,41), Age_2022 =c(22,32,42))data
# A tibble: 3 × 4
Name Age_2020 Age_2021 Age_2022
<chr> <dbl> <dbl> <dbl>
1 A 20 21 22
2 B 30 31 32
3 C 40 41 42
We tidy that data set by creating a year variable.
data |>pivot_longer(c("Age_2020", "Age_2021", "Age_2022"), names_to ="Year", values_to="Age")
# A tibble: 9 × 3
Name Year Age
<chr> <chr> <dbl>
1 A Age_2020 20
2 A Age_2021 21
3 A Age_2022 22
4 B Age_2020 30
5 B Age_2021 31
6 B Age_2022 32
7 C Age_2020 40
8 C Age_2021 41
9 C Age_2022 42
OK, but the year variable is a string but we want numbers.
Use word
word extracts words from a sentence. However, the separator need not be " " but can be any character.
word("This is a string.", start=2, end=-2)
[1] "is a"
#Selects from the second to the second last word.word("Age_2022", start=2, sep ="_")
# A tibble: 9 × 3
Name Year Age
<chr> <dbl> <dbl>
1 A 2020 20
2 A 2021 21
3 A 2022 22
4 B 2020 30
5 B 2021 31
6 B 2022 32
7 C 2020 40
8 C 2021 41
9 C 2022 42
String Detection / Regular Expressions
Many stringr function follow the structure (string, pattern). string is in our data, pattern is a string interpreted as a Regular Expressions
Regular expressions (short regexp or regex): Character-Sequences specifying match patterns.
Useful when strings contain unstructured or semi-structured data.
When you first look at a regexp, you’ll think a cat walked across your keyboard, but as your understanding improves they will start to make sense. (Or you ask an AI chatbot to give you what you need 🙂)
You should know the differences of special values.
NA: Not available (string, number, or whatever vector entry)
NULL: Null object, the undefined object, e.g. an empty list or an empty list element
NaN: For numbers: Meaning “Not a Number” (when math cannot solve what you want)
Inf: For numbers: Positive infinity
-Inf: For numbers: Negative infinity
Check some math:
1/0
[1] Inf
-1/0
[1] -Inf
0/0
[1] NaN
1/0+1/0
[1] Inf
1/0-1/0
[1] NaN
NAs
Instead of NaN, NA stands for genuinely unknown values.
It can also be in a character of logical vector.
x =c(1, 2, 3, 4, NA)mean(x)
[1] NA
mean(x, na.rm =TRUE)
[1] 2.5
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.00 1.75 2.50 2.50 3.25 4.00 1
The type of NA is logical.
typeof(NA)
[1] "logical"
typeof(NaN)
[1] "double"
Does it make sense? Let us look at NA’s in logical operations.
NAs in logical operations
NA can be TRUE or FALSE.
Usually operations including NA results again in NA, but some not!
NA&TRUE
[1] NA
NA|TRUE
[1] TRUE
NA&FALSE
[1] FALSE
NA|FALSE
[1] NA
Understanding logical operations is important!
NULL is the null object
used to represent lists with zero length
x <-1:10attributes(x)
NULL
used as a placeholder for missing values in lists and dataframes
L <-list(a =1)L[[3]] <-5L
$a
[1] 1
[[2]]
NULL
[[3]]
[1] 5
Relational Data - Joining
Working with more dataframes
Data can be distributed in several dataframes which have relations which each other.
For example, they share variables as the five dataframes in nycflights13.
Data: Women in science
10 women in science who changed the world: Ada Lovelace, Marie Curie, Janaki Ammal, Chien-Shiung Wu, Katherine Johnson, Rosalind Franklin, Vera Rubin, Gladys West, Flossie Wong-Staal, Jennifer Doudna
Rows: 10 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): name, profession
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
professions
# A tibble: 10 × 2
name profession
<chr> <chr>
1 Ada Lovelace Mathematician
2 Marie Curie Physicist and Chemist
3 Janaki Ammal Botanist
4 Chien-Shiung Wu Physicist
5 Katherine Johnson Mathematician
6 Rosalind Franklin Chemist
7 Vera Rubin Astronomer
8 Gladys West Mathematician
9 Flossie Wong-Staal Virologist and Molecular Biologist
10 Jennifer Doudna Biochemist
dates <-read_csv("data/dates.csv")
Rows: 8 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (2): birth_year, death_year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dates
# A tibble: 8 × 3
name birth_year death_year
<chr> <dbl> <dbl>
1 Janaki Ammal 1897 1984
2 Chien-Shiung Wu 1912 1997
3 Katherine Johnson 1918 2020
4 Rosalind Franklin 1920 1958
5 Vera Rubin 1928 2016
6 Gladys West 1930 NA
7 Flossie Wong-Staal 1947 NA
8 Jennifer Doudna 1964 NA
works <-read_csv("data/works.csv")
Rows: 9 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): name, known_for
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
works
# A tibble: 9 × 2
name known_for
<chr> <chr>
1 Ada Lovelace first computer algorithm
2 Marie Curie theory of radioactivity, discovery of elements polonium a…
3 Janaki Ammal hybrid species, biodiversity protection
4 Chien-Shiung Wu confim and refine theory of radioactive beta decy, Wu expe…
5 Katherine Johnson calculations of orbital mechanics critical to sending the …
6 Vera Rubin existence of dark matter
7 Gladys West mathematical modeling of the shape of the Earth which serv…
8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes…
9 Jennifer Doudna one of the primary developers of CRISPR, a ground-breaking…
We want this dataframe
professions |>left_join(dates) |>left_join(works)
Joining with `by = join_by(name)`
Joining with `by = join_by(name)`
# A tibble: 10 × 5
name profession birth_year death_year known_for
<chr> <chr> <dbl> <dbl> <chr>
1 Ada Lovelace Mathematician NA NA first co…
2 Marie Curie Physicist and Chemist NA NA theory o…
3 Janaki Ammal Botanist 1897 1984 hybrid s…
4 Chien-Shiung Wu Physicist 1912 1997 confim a…
5 Katherine Johnson Mathematician 1918 2020 calculat…
6 Rosalind Franklin Chemist 1920 1958 <NA>
7 Vera Rubin Astronomer 1928 2016 existenc…
8 Gladys West Mathematician 1930 NA mathemat…
9 Flossie Wong-Staal Virologist and Molecular … 1947 NA first sc…
10 Jennifer Doudna Biochemist 1964 NA one of t…
Joining dataframes
something_join(x, y)1 for dataframes x and y which have a relation
left_join(): all rows from x
right_join(): all rows from y
full_join(): all rows from both x and y
inner_join(): all rows from x where there are matching values in y, return all combination of multiple matches in the case of multiple matches
# A tibble: 10 × 3
name profession known_for
<chr> <chr> <chr>
1 Ada Lovelace Mathematician first computer algorit…
2 Marie Curie Physicist and Chemist theory of radioactivit…
3 Janaki Ammal Botanist hybrid species, biodiv…
4 Chien-Shiung Wu Physicist confim and refine theo…
5 Katherine Johnson Mathematician calculations of orbita…
6 Rosalind Franklin Chemist <NA>
7 Vera Rubin Astronomer existence of dark matt…
8 Gladys West Mathematician mathematical modeling …
9 Flossie Wong-Staal Virologist and Molecular Biologist first scientist to clo…
10 Jennifer Doudna Biochemist one of the primary dev…
professions |>right_join(works)
Joining with `by = join_by(name)`
# A tibble: 9 × 3
name profession known_for
<chr> <chr> <chr>
1 Ada Lovelace Mathematician first computer algorithm
2 Marie Curie Physicist and Chemist theory of radioactivity…
3 Janaki Ammal Botanist hybrid species, biodive…
4 Chien-Shiung Wu Physicist confim and refine theor…
5 Katherine Johnson Mathematician calculations of orbital…
6 Vera Rubin Astronomer existence of dark matter
7 Gladys West Mathematician mathematical modeling o…
8 Flossie Wong-Staal Virologist and Molecular Biologist first scientist to clon…
9 Jennifer Doudna Biochemist one of the primary deve…
dates |>full_join(works)
Joining with `by = join_by(name)`
# A tibble: 10 × 4
name birth_year death_year known_for
<chr> <dbl> <dbl> <chr>
1 Janaki Ammal 1897 1984 hybrid species, biodiversity protec…
2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioac…
3 Katherine Johnson 1918 2020 calculations of orbital mechanics c…
4 Rosalind Franklin 1920 1958 <NA>
5 Vera Rubin 1928 2016 existence of dark matter
6 Gladys West 1930 NA mathematical modeling of the shape …
7 Flossie Wong-Staal 1947 NA first scientist to clone HIV and cr…
8 Jennifer Doudna 1964 NA one of the primary developers of CR…
9 Ada Lovelace NA NA first computer algorithm
10 Marie Curie NA NA theory of radioactivity, discovery…
dates |>inner_join(works)
Joining with `by = join_by(name)`
# A tibble: 7 × 4
name birth_year death_year known_for
<chr> <dbl> <dbl> <chr>
1 Janaki Ammal 1897 1984 hybrid species, biodiversity protect…
2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioact…
3 Katherine Johnson 1918 2020 calculations of orbital mechanics cr…
4 Vera Rubin 1928 2016 existence of dark matter
5 Gladys West 1930 NA mathematical modeling of the shape o…
6 Flossie Wong-Staal 1947 NA first scientist to clone HIV and cre…
7 Jennifer Doudna 1964 NA one of the primary developers of CRI…
professions |>left_join(dates) |>left_join(works)
Joining with `by = join_by(name)`
Joining with `by = join_by(name)`
# A tibble: 10 × 5
name profession birth_year death_year known_for
<chr> <chr> <dbl> <dbl> <chr>
1 Ada Lovelace Mathematician NA NA first co…
2 Marie Curie Physicist and Chemist NA NA theory o…
3 Janaki Ammal Botanist 1897 1984 hybrid s…
4 Chien-Shiung Wu Physicist 1912 1997 confim a…
5 Katherine Johnson Mathematician 1918 2020 calculat…
6 Rosalind Franklin Chemist 1920 1958 <NA>
7 Vera Rubin Astronomer 1928 2016 existenc…
8 Gladys West Mathematician 1930 NA mathemat…
9 Flossie Wong-Staal Virologist and Molecular … 1947 NA first sc…
10 Jennifer Doudna Biochemist 1964 NA one of t…
Keys
A key is a variable or a set of variables which uniquely identifies observations
What was the key in the dataframe of women in science? name
Switching back to nycflights13 as example
In simple cases, a single variable is sufficient to identify an observation, e.g. each plane in planes is identified by tailnum.
Sometimes, multiple variables are needed; e.g. to identify an observation in weather you need five variables: year, month, day, hour, and origin
How can we check?
Counting observation and filter those more than one
# A tibble: 3 × 6
year month day hour origin n
<int> <int> <int> <int> <chr> <int>
1 2013 11 3 1 EWR 2
2 2013 11 3 1 JFK 2
3 2013 11 3 1 LGA 2
OK, here 3 observations are twice, one for each airport. Probably this is related to Daylight saving time in the US. In Fall clocks are turned back 1 hour. So this hour appears twice.
Why do we have only 22 and 23 on day 1 of month 1? No idea. Dig deeper yourself!
Terminology: Primary and foreign keys
A primary key uniquely identifies an observation in its own table. E.g, planes$tailnum in planes.
A foreign key uniquely identifies an observation in another dataframe E.g. flights$tailnum is a foreign key in flights because it matches each flight to a unique plane in planes.
A primary key and a foreign key form a relation.
Relations are typically 1-to-many. Each plane has many flights
Relations can also be many-to-many. Airlines can fly to many airports; airport can host many airplanes.
Joining when key names differ?
We have to specify the key relation with a named vector in the by argument.
dim(flights)
[1] 336776 19
flights |>left_join(airports, by =c("dest"="faa"))
In a join where keys have different column names the name of the first dataframe survives (unless you use keep = TRUE). Here, faa instead of dest
The columns from the first dataframe come first
The order of rows is taken from the first dataframe, while duplication and dropping of variables is determined by the second dataframe (because it is a right_join)
Using the fact that flights seem to be ordered by year, month, day, dep_time we can re-arrange:
airports_right_flights |>rename(dest = faa) |>select(names(flights)) |># Use order of flightsarrange(year, month, day, dep_time)
Note of caution: A deeper analysis shows that the order is still not exactly the same.
left_join with reversed dataframes
dim(airports)
[1] 1458 8
dim(flights)
[1] 336776 19
airports |>left_join(flights, by =c("faa"="dest"))
# A tibble: 330,531 × 26
faa name lat lon alt tz dst tzone year month day dep_time
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <int> <int> <int> <int>
1 04G Lansdo… 41.1 -80.6 1044 -5 A Amer… NA NA NA NA
2 06A Moton … 32.5 -85.7 264 -6 A Amer… NA NA NA NA
3 06C Schaum… 42.0 -88.1 801 -6 A Amer… NA NA NA NA
4 06N Randal… 41.4 -74.4 523 -5 A Amer… NA NA NA NA
5 09J Jekyll… 31.1 -81.4 11 -5 A Amer… NA NA NA NA
6 0A9 Elizab… 36.4 -82.2 1593 -5 A Amer… NA NA NA NA
7 0G6 Willia… 41.5 -84.5 730 -5 A Amer… NA NA NA NA
8 0G7 Finger… 42.9 -76.8 492 -5 A Amer… NA NA NA NA
9 0P2 Shoest… 39.8 -76.6 1000 -5 U Amer… NA NA NA NA
10 0S9 Jeffer… 48.1 -123. 108 -8 A Amer… NA NA NA NA
# ℹ 330,521 more rows
# ℹ 14 more variables: sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Why does the number of rows changes after joining?
dest is not a primary key in flights. There are more flights with the same destination so rows of airports get duplicated.
Why is the number of rows then less than the number of rows in flights?
336776 flights and 330531 airports left joined by flights.
Let us do some checks:
length(unique(airports$faa)) # Unique turns out to be redundant because faa is a primary key
[1] 1458
length(unique(flights$dest))
[1] 105
There are much more airports then destinations in flights!
… but the rows of airports prevail when it is the first in a left_join.
So, the dataframe should even increase because we get several rows of airports without flights.
Let us dig deeper.
7,602 flights go to destinations not listed as airport
nrow(airports |>left_join(flights, by =c("faa"="dest"))) ==nrow(flights) +1357-7602
[1] TRUE
OK, now we have a clear picture:
airport with left_joined flights duplicates the rows of airports for each flight flying to it. So the total number of rows is
the number of flights
plus the number of airport which do not appear as a destination
minus the flights which go to destinations which are not listed in airports
Learning: The new number of observation after a join can be a complex combination of duplication and dropping.
It is your responsibility to understand what is happening!
Math: Sets and vectors
Definition: Sets and vectors
A set is mathematical model for the unordered collection of different things (elements).
Examples
\(\{3, \text{Hi}, 😀, 🖖 \}\)
\(\{1,3,5\}\)
The natural numbers \(\mathbb{N} = \{1, 2, 3, \dots\}\) (infinite!)
\(\{\mathtt{"EWR"}, \mathtt{"LGA"}, \mathtt{"JFK"}\}\)
these are origin airports in flights
Math: Sets and vectors
A vector is an ordered collection of things (components) of the same type.
In a set, each thing can only be once and the order does not matter!
\(\{1,3,5\} = \{3,5,1\} = \{1,1,1,3,5,5\}\)
For vectors:
\([1\ 3\ 5] \neq [3\ 5\ 1]\) because we compare component-wise, so we cannot even compare with those with the vector \([1\ 1\ 1\ 3\ 5\ 5]\)
a =c(1, 0.5, 2, -0.5, -2)ggplot() +geom_function(fun =function(x) x^3- x^2, size =2, alpha =0.5) +geom_function(fun =function(x) x^3- x^2+a[2], color ="blue4", size =2) +geom_function(fun =function(x) x^3- x^2+a[3], color ="blue", size =2) +geom_function(fun =function(x) x^3- x^2+a[4], color ="red4") +geom_function(fun =function(x) x^3- x^2+a[5], color ="red") +coord_fixed() +xlim(c(-3,3)) +ylim(c(-3,3)) +xlab("x") +theme_minimal(base_size =24)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 51 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 53 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 53 rows containing missing values or values outside the scale range
(`geom_function()`).
Subtract a constant from all \(x\) within the function definition.
\(f(x) = x^3 - x^2 \leadsto\)
\(\quad f(x) = (x - a)^3 - (x - a)^2\)
For \(a =\)-2, -0.5, 0.5, 2
Attention:
Shifting \(a\) units to the right needs subtracting \(a\)!
You can think of the coordinate system being shifted in direction \(a\) while the graph stays.
a =c(1, 0.5, 2, -0.5, -2)ggplot() +geom_function(fun =function(x) x^3- x^2, size =2, alpha =0.5) +geom_function(fun =function(x) (x-a[2])^3- (x-a[2])^2, color ="blue4", size =2) +geom_function(fun =function(x) (x-a[3])^3- (x-a[3])^2, color ="blue", size =2) +geom_function(fun =function(x) (x-a[4])^3- (x-a[4])^2, color ="red4") +geom_function(fun =function(x) (x-a[5])^3- (x-a[5])^2, color ="red") +coord_fixed() +xlim(c(-3,3)) +ylim(c(-3,3)) +xlab("x") +theme_minimal(base_size =24)
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_function()`).
Removed 50 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 64 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 51 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 53 rows containing missing values or values outside the scale range
(`geom_function()`).
Multiply a constant to all \(x\) within the function definition.
\(f(x) = x^3 - x^2 \leadsto\)
\(\quad f(x) = a(x^3 - x^2)\)
For \(a =\)-2, -0.5, 0.5, 2
Negative numbers flip the graph around the \(x\)-axis.
a =c(1, 0.5, 2, -0.5, -2)ggplot() +geom_function(fun =function(x) x^3- x^2, size =2, alpha =0.5) +geom_function(fun =function(x) a[2]*((x)^3- (x)^2), color ="blue4", size =2) +geom_function(fun =function(x) a[3]*((x)^3- (x)^2), color ="blue", size =2) +geom_function(fun =function(x) a[4]*((x)^3- (x)^2), color ="red4") +geom_function(fun =function(x) a[5]*((x)^3- (x)^2), color ="red") +coord_fixed() +xlim(c(-3,3)) +ylim(c(-3,3)) +xlab("x") +theme_minimal(base_size =24)
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 39 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 60 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 39 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 60 rows containing missing values or values outside the scale range
(`geom_function()`).
Divide all \(x\) within the function definition by a constant.
\(f(x) = x^3 - x^2 \leadsto\)
\(\quad f(x) = (x/a)^3 - (x/a)^2\)
For \(a =\)-2, -0.5, 0.5, 2
Negative numbers flip the graph around the \(y\)-axis.
Attention: Stretching needs a division by \(a\)!
You can think of the coordinate system being stretched multiplicatively by \(a\) while the graph stays.
a =c(1, 0.5, 2, -0.5, -2)ggplot() +geom_function(fun =function(x) x^3- x^2, size =2, alpha =0.5) +geom_function(fun =function(x) (x/a[2])^3- (x/a[2])^2, color ="blue4", size =2) +geom_function(fun =function(x) (x/a[3])^3- (x/a[3])^2, color ="blue", size =2) +geom_function(fun =function(x) (x/a[4])^3- (x/a[4])^2, color ="red4") +geom_function(fun =function(x) (x/a[5])^3- (x/a[5])^2, color ="red") +coord_fixed() +xlim(c(-3,3)) +ylim(c(-3,3)) +xlab("x") +theme_minimal(base_size =24)
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 76 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 11 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 76 rows containing missing values or values outside the scale range
(`geom_function()`).
Warning: Removed 11 rows containing missing values or values outside the scale range
(`geom_function()`).
Math: Polynomials and exponentials
A polynomial is a function which is composed of (many) addends of the form \(ax^n\) for different values of \(a\) and \(n\).
In an exponential the \(x\) appears in the exponent.
Instead of a list or a vector reduce returns a single value.
To that end it needs a function with two arguments. It applies it to the first two elements of the vector, then to the result and the third element, then the result and the fourth element, and so on.
1:10|>reduce(\(x,y) x + y)
[1] 55
Note: \(x) is a short way to write an anonymous function as function(x).
Example: Reading multiple files
Instead of
a <-read_csv("a.csv")b <-read_csv("b.csv")c <-read_csv("c.csv")d <-read_csv("d.csv")e <-read_csv("e.csv")f <-read_csv("f.csv")g <-read_csv("g.csv")bind_rows(a,b,c,d,e,f,g)
map applies a function to each element of a vector (or list) and returns a list (or vector if map_dbl or other vaeriants are used)
The output has the same length as the input
reduce applies a function taking two arguments and then summarizes the list by applying them two element 1 and 2 and then again the result with element 3 and then the result with element 4, …
The output is one object
Remark: The problem of vectorizing a function which has an if and else statement (see the example piecewise_vectorized) can also be solved using the vectorized version of it: if_else(condition = *condition*, true = *value-if_TRUE*, false = *value-if_FALSE*)
Here every argument can be a vector!
Function programming: Take away
Functions are the most important building blocks of programming.
Functions can and often should be vectorized.
Vectorized functions are the basis for plotting and transformation.
map functions are powerful tools for iterative tasks! Expect to not get the idea first but to love them later.