library(lubridate)
library(tidyr)
library(stringi)
library(stringr)
library(stringdist)
library(editrules)
library(deducorrect)
library(rex)
Statistical Programming with R
By Edwin de Jonge & Mark van der Loo:
Comprehensive book
Shorter working paper, available online
Read the data into R and make it technically correct.
Inconsistencies can be inspected and corrected by hand, but dedicated packages make the process easier and more reproducible.
numeric
integer
logical - TRUE or FALSE
character - character data
Also:
factor - categorical data
ordered - ordinal data
date
datetime
NA: Not available - placeholder for a missing value. Computations involving NA generally return NA.
NULL: The null object. It has length zero, so arithmetic on it yields an empty result rather than a value.
Inf: Infinity. Arithmetic with Inf is well defined.
NaN: Not a Number. Usually the result of an undefined calculation (e.g. Inf - Inf). It is numeric, and further computation with it almost always returns NaN.
read.table: R’s swiss army knife
readr::read_csv
read_excel from the readxl package
read.xlsx from the xlsx package
Express a pattern of text, e.g.
\[ \texttt{"(a|b)c*"} = \{\texttt{"a"},\texttt{"ac"},\texttt{"acc"},\ldots,\texttt{"b"},\texttt{"bc"},\texttt{"bcc"},\ldots\} \]
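The pattern above can be tried out directly in base R. This is a minimal sketch: the anchors `^` and `$` are an addition here, assumed so that the whole string must match (`grepl` otherwise looks for a match anywhere in the string), and the candidate strings are made up for illustration.

```r
# Regular expression from the formula, anchored to match whole strings only
pattern <- "^(a|b)c*$"

# Made-up candidate strings: some belong to the pattern's language, some do not
candidates <- c("a", "acc", "b", "bcc", "c", "abc", "ca")

# TRUE where the whole string is an "a" or "b" followed by zero or more "c"
matches <- grepl(pattern, candidates)

candidates[matches]
```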
| Task | stringr | Base R |
|---|---|---|
| string detection | str_detect(string, pattern) | grep, grepl |
| string extraction | str_extract(string, pattern) | regexpr, regmatches |
| string replacement | str_replace(string, pattern, replacement) | sub, gsub |
| string splitting | str_split(string, pattern) | strsplit |
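The four tasks in the table might look as follows with the base R functions. This is only a sketch on made-up strings; the stringr counterparts are named in the comments and take the string first and the pattern second.

```r
# Made-up example strings
x <- c("apple pie", "banana split", "cherry tart")

# Detection: which strings contain "an"? (stringr: str_detect)
has_an <- grepl("an", x)

# Extraction: first run of vowels in each string (stringr: str_extract)
m <- regexpr("[aeiou]+", x)
first_vowels <- regmatches(x, m)

# Replacement: sub() changes the first match, gsub() every match
# (stringr: str_replace and str_replace_all)
first_replaced <- sub("a", "A", x)
all_replaced <- gsub("a", "A", x)

# Splitting: returns a list with one character vector per input string
# (stringr: str_split)
parts <- strsplit(x, " ")
```

Note that `regmatches` silently drops strings without a match, whereas `str_extract` returns `NA` for them, so the two are not interchangeable when some strings do not match.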
Bring a text string into a standard format, e.g.
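One possible normalization, sketched in base R on made-up input: collapse repeated whitespace, trim the ends, and convert to lower case. Which steps count as "standard" is an assumption that depends on the data at hand.

```r
# Made-up raw values that should all denote the same city
raw <- c("  New  York", "new york ", "NEW YORK")

# Collapse any run of whitespace into a single space
clean <- gsub("[[:space:]]+", " ", raw)

# Remove leading and trailing whitespace
clean <- trimws(clean)

# Use one letter case throughout
clean <- tolower(clean)

unique(clean)
```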
pivot_longer and pivot_wider in the tidyr package
billboard[1:3,]
## # A tibble: 3 x 5
##   artist       track                     wk1   wk2   wk3
##   <chr>        <chr>                   <dbl> <dbl> <dbl>
## 1 2 Pac        Baby Don't Cry (Keep...    87    82    72
## 2 2Ge+her      The Hardest Part Of ...    91    87    92
## 3 3 Doors Down Kryptonite                 81    70    68
billboard %>%
  pivot_longer(cols = starts_with("wk")) %>%
  print(n = 6)
## # A tibble: 951 x 4
##   artist  track                   name  value
##   <chr>   <chr>                   <chr> <dbl>
## 1 2 Pac   Baby Don't Cry (Keep... wk1      87
## 2 2 Pac   Baby Don't Cry (Keep... wk2      82
## 3 2 Pac   Baby Don't Cry (Keep... wk3      72
## 4 2Ge+her The Hardest Part Of ... wk1      91
## 5 2Ge+her The Hardest Part Of ... wk2      87
## 6 2Ge+her The Hardest Part Of ... wk3      92
## # ... with 945 more rows
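pivot_wider goes the other way. The sketch below uses a small made-up data frame (`charts`, with hypothetical artists and tracks) instead of the billboard data, and the column names `week` and `rank` are choices made here for illustration.

```r
library(tidyr)

# Made-up chart positions in wide format: one column per week
charts <- data.frame(
  artist = c("Artist A", "Artist B"),
  track  = c("Song one", "Song two"),
  wk1    = c(87, 91),
  wk2    = c(82, 87),
  wk3    = c(72, 92)
)

# Wide to long: one row per artist, track and week
long <- pivot_longer(charts, cols = starts_with("wk"),
                     names_to = "week", values_to = "rank")

# Long back to wide: one column per distinct value of `week`
wide <- pivot_wider(long, names_from = week, values_from = rank)
```

Naming the new columns with `names_to` and `values_to` avoids the generic `name` and `value` columns that pivot_longer creates by default.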
lubridate: extract dates from strings
lubridate::dmy("17 December 2015")
## [1] "2015-12-17"
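The parser name encodes the expected order of day, month and year. A short sketch of the sibling functions, assuming an English locale for the month names; the unparseable input is made up to show the failure mode.

```r
library(lubridate)

# The same date written in three different orders
d1 <- dmy("17 December 2015")
d2 <- mdy("December 17, 2015")
d3 <- ymd("2015-12-17")

# Input that cannot be parsed becomes NA (with a warning, silenced here)
bad <- suppressWarnings(dmy("not a date"))
```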
tidyr: many data cleaning operations to make your life easier
readr: Parse numbers from text strings
readr::parse_number(c("2%","6%","0.3%"))
## [1] 2.0 6.0 0.3
Data Validation is checking data against (multivariate) expectations about a data set
Often these expectations can be expressed as a set of simple validation rules.
The validate package allows us to define a set of validation rules which can then be applied to data
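As an illustration, a rule set can be defined with `validator()` and applied with `confront()`. The data frame `people` and the two rules below are made up for this sketch.

```r
library(validate)

# Made-up data with one negative age and one missing height
people <- data.frame(
  age    = c(25, -3, 40),
  height = c(1.80, 1.65, NA)
)

# Expectations about the data, written as ordinary R expressions
rules <- validator(
  age >= 0,
  height > 0
)

# Apply the rules to the data
checked <- confront(people, rules)

# One row per rule, with counts of passes, fails and missing results
result <- summary(checked)
result
```

A rule evaluated on a missing value is counted as neither passed nor failed but as NA, which is why the summary reports that count separately.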
validate package, in summary
Rules restrict the data. Sometimes this is enough to derive a correct value uniquely.
Both can be generalized to systems \(\mathbf{Ax}\leq\mathbf{b}\).
The deducorrect package allows for deterministic imputation in these cases
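The idea can be sketched in plain base R, without using the deducorrect API. The balance rule `staff_cost + other_cost == total_cost` and the data are assumptions made for this example: when exactly one of the three fields is missing, the rule determines its value uniquely.

```r
# Made-up records; assumed rule: staff_cost + other_cost == total_cost
records <- data.frame(
  staff_cost = c(100, NA, 50),
  other_cost = c(20, 30, NA),
  total_cost = c(120, 80, NA)
)

# A value can only be deduced when exactly one field in the record is missing
one_missing <- rowSums(is.na(records)) == 1

imputed <- records

# Solve the balance rule for whichever field is missing
fill <- one_missing & is.na(records$staff_cost)
imputed$staff_cost[fill] <- records$total_cost[fill] - records$other_cost[fill]

fill <- one_missing & is.na(records$other_cost)
imputed$other_cost[fill] <- records$total_cost[fill] - records$staff_cost[fill]

fill <- one_missing & is.na(records$total_cost)
imputed$total_cost[fill] <- records$staff_cost[fill] + records$other_cost[fill]

imputed
```

The third record has two missing fields, so the rule alone does not pin down either value and the record is left unchanged.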
Find the least (weighted) number of fields that can be imputed such that all rules can be satisfied.
The errorlocate package
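A small sketch of how this might look, assuming the errorlocate functions `locate_errors()` and `replace_errors()` together with rules defined through validate. The rules and the single record are made up; the record violates the balance rule, and changing `profit` alone is enough to satisfy all rules.

```r
library(validate)
library(errorlocate)

# Made-up rules for a simple profit and loss account
rules <- validator(
  profit + cost == turnover,
  cost >= 0.6 * turnover,
  turnover >= 0,
  cost >= 0
)

# Made-up record: 755 + 125 is not 200, so the first rule fails
record <- data.frame(profit = 755, cost = 125, turnover = 200)

# Find the smallest set of fields that must change to satisfy all rules
located <- locate_errors(record, rules)
flags <- values(located)

# Replace the located fields with NA so they can be imputed afterwards
cleaned <- replace_errors(record, rules)
cleaned
```

Changing only `cost` or only `turnover` cannot satisfy both the balance rule and the cost share rule, so `profit` is the single field that has to be flagged.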