library(lubridate)
library(tidyr)
library(stringi)
library(stringr)
library(stringdist)
library(editrules)
library(deducorrect)
library(rex)
Statistical Programming with R
By Edwin de Jonge & Mark van der Loo:
Comprehensive book
Shorter working paper, available online
R
and make it technically correct
There is a manual way to find and correct inconsistencies, and there are packages that make the process easier and more reproducible.
numeric
integer
logical: TRUE or FALSE
character: character data
Also:
factor: categorical data
ordered: ordinal data
date
datetime
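A minimal sketch of how these types show up in base R, and of what "technically correct" coercion can look like. The example values are made up for illustration; only base R functions are used.

```r
# One illustrative value per type listed above (values are hypothetical)
values <- list(
  numeric   = 1.5,
  integer   = 1L,
  logical   = TRUE,
  character = "a",
  factor    = factor("b"),
  ordered   = factor("low", levels = c("low", "high"), ordered = TRUE),
  date      = as.Date("2015-12-17"),
  datetime  = as.POSIXct("2015-12-17 10:30:00", tz = "UTC")
)

# First class attribute of each value
classes <- sapply(values, function(v) class(v)[1])
print(classes)

# Coercing text to numbers: entries that cannot be parsed become NA
# (R also emits a warning, suppressed here)
parsed <- suppressWarnings(as.numeric(c("7", "7.5", "seven")))
print(parsed)
```

Note that date and datetime values carry the classes `Date` and `POSIXct` in base R.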
NA: Not available, a placeholder for a missing value. Computations that involve NA generally return NA.
NULL: The null object. It has length zero and cannot be used in arithmetic in a meaningful way.
Inf: Infinity. Arithmetic works on it!
NaN: Not a Number. Generally the result of an undefined calculation (e.g. Inf - Inf). It is numeric, and further computation with it generally returns NaN.
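The behaviour of these special values can be checked directly in base R. This is a small sketch with made-up numbers, not part of the original slides.

```r
x <- c(1, NA, 3)

na_sum    <- sum(x)                # NA propagates through the sum
clean_sum <- sum(x, na.rm = TRUE)  # drop missing values before summing

null_len  <- length(NULL)          # NULL has length zero
inf_plus  <- Inf + 1               # still Inf
undefined <- Inf - Inf             # NaN

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN
na_flags  <- is.na(c(NA, NaN, 1))
nan_flags <- is.nan(c(NA, NaN, 1))

print(list(na_sum, clean_sum, null_len, inf_plus, undefined, na_flags, nan_flags))
```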
R
read.table: R's Swiss army knife
readr::read_csv
read_excel from the readxl package
read.xlsx from the xlsx package
Express a pattern of text, e.g.
\[ \texttt{"(a|b)c*"} = \{\texttt{"a"},\texttt{"ac"},\texttt{"acc"},\ldots,\texttt{"b"},\texttt{"bc"},\texttt{"bcc"},\ldots\} \]
Task | stringr | Base R |
---|---|---|
string detection | str_detect(string, pattern) | grep, grepl |
string extraction | str_extract(string, pattern) | regexpr, regmatches |
string replacement | str_replace(string, pattern, replacement) | sub, gsub |
string splitting | str_split(string, pattern) | strsplit |
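The four tasks in the table can be sketched side by side as follows. The input vector is hypothetical, and the pattern is the one from the formula above, anchored with `^` and `$` so that it must match the whole string.

```r
library(stringr)

x <- c("a", "acc", "bc", "cab", "d")   # hypothetical input
pattern <- "^(a|b)c*$"

# Detection: which strings match the pattern?
det_stringr <- str_detect(x, pattern)
det_base    <- grepl(pattern, x)

# Extraction: first run of "c" in each string (NA when there is none)
ext_stringr <- str_extract(x, "c+")
# regmatches() returns only the matches, so non-matching strings are dropped
ext_base    <- regmatches(x, regexpr("c+", x))

# Replacement: replace the first run of "c" with "C"
rep_stringr <- str_replace(x, "c+", "C")
rep_base    <- sub("c+", "C", x)

# Splitting: both return a list with one element per input string
spl_stringr <- str_split("a,b,c", ",")
spl_base    <- strsplit("a,b,c", ",")
```

Note the argument order: stringr functions take the string first and the pattern second, while `grepl`, `regexpr`, `sub` and `gsub` take the pattern first.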
Bring a text string into a standard format, e.g.
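As one possible illustration, assuming that "standard format" here means trimmed, single-spaced, lower-case text (the slides do not say which normalisation was intended), a sketch with hypothetical input could look like this:

```r
library(stringr)

# Hypothetical raw entries that all refer to the same value
raw <- c("  Amsterdam ", "AMSTERDAM", "amsterdam")

# stringr: trim and collapse whitespace, then fold to lower case
clean_stringr <- str_to_lower(str_squish(raw))

# Base R equivalent for this input: trim surrounding whitespace, then lower case
clean_base <- tolower(trimws(raw))

print(unique(clean_stringr))
```

After normalisation the three spellings collapse to a single value, which makes later comparison or matching (for example with stringdist) more reliable.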
pivot_longer and pivot_wider in the tidyr package
billboard[1:3,]
## # A tibble: 3 x 5
##   artist       track                     wk1   wk2   wk3
##   <chr>        <chr>                   <dbl> <dbl> <dbl>
## 1 2 Pac        Baby Don't Cry (Keep...    87    82    72
## 2 2Ge+her      The Hardest Part Of ...    91    87    92
## 3 3 Doors Down Kryptonite                 81    70    68
billboard %>% pivot_longer(cols = starts_with("wk")) %>% print(n = 6)
## # A tibble: 951 x 4
##   artist  track                   name  value
##   <chr>   <chr>                   <chr> <dbl>
## 1 2 Pac   Baby Don't Cry (Keep... wk1      87
## 2 2 Pac   Baby Don't Cry (Keep... wk2      82
## 3 2 Pac   Baby Don't Cry (Keep... wk3      72
## 4 2Ge+her The Hardest Part Of ... wk1      91
## 5 2Ge+her The Hardest Part Of ... wk2      87
## 6 2Ge+her The Hardest Part Of ... wk3      92
## # ... with 945 more rows
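The reverse operation, `pivot_wider`, is not shown in the output above. The following sketch uses a small made-up data frame in the same shape as the chart data (the values and the column names `week` and `rank` are assumptions for illustration) to show the round trip.

```r
library(tidyr)

# Hypothetical miniature of the wide chart data
wide <- data.frame(
  artist = c("2 Pac", "3 Doors Down"),
  wk1    = c(87, 81),
  wk2    = c(82, 70)
)

# Wide to long: one row per artist and week
long <- pivot_longer(wide, cols = starts_with("wk"),
                     names_to = "week", values_to = "rank")

# Long to wide: spread the weeks back into separate columns
back <- pivot_wider(long, names_from = week, values_from = rank)

print(long)
print(back)
```

Naming the new columns with `names_to` and `values_to` avoids the generic `name` and `value` defaults seen in the output above.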
lubridate: extract dates from strings
lubridate::dmy("17 December 2015")
## [1] "2015-12-17"
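The parser name states the order of the date components, so differently formatted strings can be brought to the same `Date` value. A short sketch with hypothetical inputs (numeric formats are used to avoid depending on the locale's month names):

```r
library(lubridate)

d1 <- dmy("17-12-2015")   # day, month, year
d2 <- mdy("12/17/2015")   # month, day, year
d3 <- ymd("20151217")     # year, month, day, no separators

# All three represent the same calendar date
same_date <- d1 == d2 && d2 == d3
print(c(d1, d2, d3))
```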
tidyr: many data cleaning operations to make your life easier
readr: parse numbers from text strings
readr::parse_number(c("2%","6%","0.3%"))
## [1] 2.0 6.0 0.3
Data validation is checking data against (multivariate) expectations about a data set.
Often these expectations can be expressed as a set of simple validation rules.
The validate package allows us to define a set of validation rules which can then be applied to data.
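A minimal sketch of that workflow, assuming a toy data frame with hypothetical columns `age` and `income` (neither the data nor the rules come from the slides): rules are defined with `validator()`, applied with `confront()`, and inspected with `summary()`.

```r
library(validate)

# Hypothetical toy data with one impossible value and one missing value
df <- data.frame(
  age    = c(25, -3, 40),
  income = c(3000, 1200, NA)
)

# A rule set: each argument is one validation rule
rules <- validator(
  age >= 0,
  income >= 0
)

# Apply the rules to the data and tabulate passes, fails and NAs per rule
out <- confront(df, rules)
res <- summary(out)
print(res)
```

The summary has one row per rule, so it is easy to see which expectations are violated most often before deciding how to correct the data.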
validate package, in summary
Rules restrict the data. Sometimes this is enough to derive a correct value uniquely.
Both can be generalized to systems \(\mathbf{Ax}\leq\mathbf{b}\).
The deducorrect package allows for deterministic imputation in these cases.
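As a small worked illustration of a value that is determined uniquely by the rules (the variables and numbers are hypothetical, not taken from the slides), consider a record with a balance rule, non-negativity rules, and one missing field:

\[ x_1 + x_2 = x_3, \qquad x_1, x_2, x_3 \geq 0, \qquad (x_1, x_2, x_3) = (10, \texttt{NA}, 25) \;\Longrightarrow\; x_2 = x_3 - x_1 = 15 \]

The balance rule fits the form \(\mathbf{Ax}\leq\mathbf{b}\) when it is written as the pair of inequalities \(x_1 + x_2 - x_3 \leq 0\) and \(-x_1 - x_2 + x_3 \leq 0\).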
Find the least (weighted) number of fields that can be imputed such that all rules can be satisfied.
The errorlocate package
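The principle above can be sketched with errorlocate as follows. The record and the rules are hypothetical, and the sketch assumes that `replace_errors()` is called with a data frame and a `validator` rule set and sets the fields it identifies as erroneous to `NA`.

```r
library(validate)
library(errorlocate)

# Hypothetical rules: an accounting balance and a cost ratio
rules <- validator(
  profit + cost == turnover,
  cost >= 0.6 * turnover
)

# Hypothetical record that violates the balance rule
record <- data.frame(profit = 755, cost = 125, turnover = 200)

# Find the smallest set of fields whose change can satisfy all rules,
# and replace those fields with NA so they can be imputed afterwards
fixed <- replace_errors(record, rules)
print(fixed)
```

In this record, changing only `cost` or only `turnover` cannot satisfy both rules, while changing `profit` alone can, so `profit` is the single field that needs to be imputed.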