Statistical Programming with R

Packages and functions used

library(lubridate)
library(tidyr)
library(stringi)
library(stringr)
library(stringdist)
library(editrules)
library(deducorrect)
library(validate)
library(errorlocate)
library(rex)

Resources:

By Edwin de Jonge & Mark van der Loo:

Statistical Data Cleaning with Applications in R:

Comprehensive book

An introduction to data cleaning with R:

Shorter working paper, available online

Statistical value chain

(Figure: the statistical value chain)
  • A way to think about the process of producing statistics and separating production systems
  • By this logic, you should store the data for each state separately for reuse, and probably also use separate R scripts for each step
  • For reproducibility!

Concepts

Technically correct data

  • Well-defined format (data structure)
  • Well-defined types (numbers, dates, string, categorical, … )
  • Statistical units can be identified (persons, transactions, phone calls, …)
  • Variables can be identified as properties of statistical units.

Consistent data

  • Data satisfies demands from domain knowledge
  • Ready for analysis

The process

  • Get data into R and make it technically correct
  • Find inconsistencies by stating the rules the data should follow and determining where data does not follow these
  • Replace inconsistencies
    • Sometimes by deduction
    • Sometimes by guessing (imputation)

There is a by-hand way to find and correct inconsistencies, and there are packages that make the process easier and more reproducible

Some basics

Variable types in R:

  • numeric
  • integer
  • logical - TRUE or FALSE
  • character - character data

Also:
  • factor - categorical data
  • ordered - ordinal data
  • date
  • datetime
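
For example, each of these types can be constructed directly:

x <- c(1.5, 2.3)                      # numeric
n <- 3L                               # integer
f <- factor(c("a", "b", "a"))         # categorical
o <- factor(c("low", "high"), levels = c("low", "high"),
            ordered = TRUE)           # ordinal
d <- as.Date("2015-12-17")            # date
dt <- as.POSIXct("2015-12-17 12:30")  # datetime
class(o)
## [1] "ordered" "factor"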

Special values

  • NA: Not Available - placeholder for a missing value. Computations involving NA return NA
  • NULL: the null object. You cannot do maths with it
  • Inf: infinity - you can do maths with it!
  • NaN: Not a Number. Generally the result of a calculation (e.g. Inf - Inf). It is numeric; further computation always returns NaN
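
A quick illustration of how the special values behave:

NA + 1
## [1] NA
NULL + 1
## numeric(0)
Inf - Inf
## [1] NaN
NaN + 1
## [1] NaN
is.na(NaN)
## [1] TRUE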

Raw data to technically correct data


  • Get data into R
  • Clean up data, so each variable can be made the correct type
    • Particularly dates
  • Clean up messy text variables
    • String normalization and approximate text matching

Tabular data: long story short

  • read.table: R’s Swiss army knife
    • fairly strict (no sniffing)
    • Very flexible
    • Interface could be cleaner (see this talk)
  • readr::read_csv
    • Easy to switch between strict/lenient parsing
    • Compact control over column types
    • Fast
    • Clear reports of parsing failure
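
For instance, a minimal read_csv call with explicit column types (the file name and columns here are made up for illustration):

library(readr)
dat <- read_csv("survey.csv",
                col_types = cols(
                  id  = col_integer(),
                  age = col_double(),
                  dob = col_date(format = "%d-%m-%Y")
                ))
problems(dat)  # tibble describing any parsing failures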

Data from Excel

  • read_excel from the readxl package
  • read.xlsx from the xlsx package
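
A minimal sketch (the file name and sheet are hypothetical):

library(readxl)
dat <- read_excel("survey.xlsx", sheet = 1, na = c("", "NA"))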

For even dirtier data

  • (base) R has great text processing tools.
  • Need to work with regular expressions
  • Write many small functions extracting single data elements.
  • Don’t overgeneralize: adapt functions as you meet new input.
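
For instance, a small single-purpose extractor built on stringr (the input format is made up):

library(stringr)
# Pull the first run of digits out of a messy free-text age field
extract_age <- function(x) {
  as.integer(str_extract(x, "[0-9]+"))  # NA when no digits are found
}
extract_age(c("age: 42 yrs", "unknown", "27"))
## [1] 42 NA 27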

Regular expressions

Express a pattern of text, e.g.

\[ \texttt{"(a|b)c*"} = \{\texttt{"a"},\texttt{"ac"},\texttt{"acc"},\ldots,\texttt{"b"},\texttt{"bc"},\texttt{"bcc"},\ldots\} \]

Task                 stringr                                     Base R
string detection     str_detect(string, pattern)                 grep, grepl
string extraction    str_extract(string, pattern)                regexpr, regmatches
string replacement   str_replace(string, pattern, replacement)   sub, gsub
string splitting     str_split(string, pattern)                  strsplit
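
A few quick illustrations of the stringr functions, using the pattern from above:

library(stringr)
str_detect(c("a", "acc", "bc", "d"), "^(a|b)c*$")
## [1]  TRUE  TRUE  TRUE FALSE
str_extract("x123y", "[0-9]+")
## [1] "123"
str_replace("x123y", "[0-9]+", "#")
## [1] "x#y"
str_split("a,b,c", ",")
## [[1]]
## [1] "a" "b" "c"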

String normalization

Bring a text string in a standard format, e.g.

  • Standardize upper/lower case (casefolding)
    • stringr: str_to_lower, str_to_upper, str_to_title
    • base R: tolower, toupper
  • Remove accents (transliteration)
    • stringi: stri_trans_general
    • base R: iconv
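
For example:

library(stringr)
library(stringi)
str_to_lower("Ångström")
## [1] "ångström"
stri_trans_general("Ångström", "Latin-ASCII")
## [1] "Angstrom"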

Converting between wide and long data

  • pivot_longer and pivot_wider in the tidyr package
billboard[1:3,]
## # A tibble: 3 x 5
##   artist       track                     wk1   wk2   wk3
##   <chr>        <chr>                   <dbl> <dbl> <dbl>
## 1 2 Pac        Baby Don't Cry (Keep...    87    82    72
## 2 2Ge+her      The Hardest Part Of ...    91    87    92
## 3 3 Doors Down Kryptonite                 81    70    68

billboard %>%
  pivot_longer(cols = starts_with("wk")) %>%
  print(n = 6)
## # A tibble: 951 x 4
##   artist  track                   name  value
##   <chr>   <chr>                   <chr> <dbl>
## 1 2 Pac   Baby Don't Cry (Keep... wk1      87
## 2 2 Pac   Baby Don't Cry (Keep... wk2      82
## 3 2 Pac   Baby Don't Cry (Keep... wk3      72
## 4 2Ge+her The Hardest Part Of ... wk1      91
## 5 2Ge+her The Hardest Part Of ... wk2      87
## 6 2Ge+her The Hardest Part Of ... wk3      92
## # ... with 945 more rows
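
Going the other way, pivot_wider undoes this (a sketch, assuming the long table above has been stored as long):

long %>%
  pivot_wider(names_from = name, values_from = value)
# back to one row per track, with columns wk1, wk2, wk3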

Further steps in making data technically correct

  • lubridate: extract dates from strings

    lubridate::dmy("17 December 2015") 
    ## [1] "2015-12-17"
  • tidyr: many data cleaning operations to make your life easier

  • readr: Parse numbers from text strings

    readr::parse_number(c("2%","6%","0.3%"))
    ## [1] 2.0 6.0 0.3

From technically correct to consistent data

The mantra of data cleaning

  • Detection (data conflicts with domain knowledge)
  • Selection (find the value(s) that cause the violation)
  • Correction (replace them with better values)

Detection, AKA data validation

Informally:

Data validation is checking data against (multivariate) expectations about a data set.

For example

  • Ages cannot be negative and should be less than 150 years
  • The number of years married must be smaller than the age minus 17 years
  • A person labeled as “single” should have the number of years married recorded as 0

Validation rules

Often these expectations can be expressed as a set of simple validation rules.

The validate package allows us to define a set of validation rules which can then be applied to data.

The validate package, in summary

  • Make data validation rules explicit
  • Treat them as objects of computation
    • store to / read from file
    • manipulate
    • annotate
  • Confront data with rules
  • Analyze/visualize the results
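
A minimal sketch of this workflow (the data frame dat and its columns are hypothetical):

library(validate)
rules <- validator(
  age >= 0,
  age < 150,
  years_married <= age - 17,
  if (status == "single") years_married == 0
)
cf <- confront(dat, rules)  # check dat against the rules
summary(cf)                 # per-rule counts of passes, fails and NAs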

Use rules to correct data

Main idea

Rules restrict the data. Sometimes this is enough to derive a correct value uniquely.

Examples

  • Correct typos in values under linear restrictions
    • \(123 + 45 \not= 177\), but \(123 + \underline{54} = 177\).
  • Derive imputations from values under linear restrictions
    • \(123 + \texttt{NA} = 177\), compute \(177-123=54\).

Both can be generalized to systems \(\mathbf{Ax}\leq\mathbf{b}\).

The deducorrect package allows for deterministic imputation in these cases.
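
A small sketch with editrules and deducorrect (the variable names are made up):

library(editrules)
library(deducorrect)
E <- editmatrix("x + y == z")             # linear restriction
d <- data.frame(x = 123, y = NA, z = 177)
deduImpute(E, d)$corrected                # y is derived uniquely from the rule
##     x  y   z
## 1 123 54 177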

Selection, or: error localization

Fellegi and Holt (1976)

Find the smallest (weighted) number of fields that must be changed (imputed) so that all rules can be satisfied.

Note

  • Solutions need not be unique.
  • A random one is chosen in case of degeneracy (ties).
  • The lowest total weight need not correspond to the smallest number of altered variables.

Error localization

The errorlocate package

  • For in-record rules
  • Support for
    • linear (in)equality rules
    • Conditionals on categorical variables (if male then not pregnant)
    • Mixed conditionals (if has job then age \(\geq 15\))
    • Conditionals with linear predicates (if staff \(> 0\) then staff cost \(> 0\))
  • The optimization is mapped to a MIP (mixed integer programming) problem.
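
A minimal sketch combining validate and errorlocate (the data and rules are made up):

library(validate)
library(errorlocate)
rules <- validator(age >= 0, if (has_job == TRUE) age >= 15)
d <- data.frame(age = -4, has_job = TRUE)
# replace_errors runs Fellegi-Holt error localization and sets the
# located fields to NA; here, changing age alone fixes both rules
d_clean <- replace_errors(d, rules)
d_clean
##   age has_job
## 1  NA    TRUE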

Practical