Statistical Programming with R

Packages and functions used

library(lubridate)
library(tidyr)
library(stringi)
library(stringr)
library(stringdist)
library(editrules)
library(deducorrect)
library(validate)
library(errorlocate)
library(rex)

Resources:

By Edwin de Jonge & Mark van der Loo:

Statistical Data Cleaning with Applications in R:

Comprehensive book

An introduction to data cleaning with R:

Shorter working paper, available online

Statistical value chain

(Figure: the statistical value chain)
  • A way to think about the process of producing statistics and separating production systems
  • By this logic, you should store the data for each state separately for reuse, and probably also use separate R scripts for each step
  • For reproducibility!

Concepts

Technically correct data

  • Well-defined format (data structure)
  • Well-defined types (numbers, dates, string, categorical, … )
  • Statistical units can be identified (persons, transactions, phone calls, …)
  • Variables can be identified as properties of statistical units.

Consistent data

  • Data satisfies demands from domain knowledge
  • Ready for analysis

The process

  • Get data into R and make it technically correct
  • Find inconsistencies by stating the rules the data should follow and determining where data does not follow these
  • Replace inconsistencies
    • Sometimes by deduction
    • Sometimes by guessing (imputation)

There is a by-hand way to find and correct inconsistencies, and there are packages that make the process easier and more reproducible

Some basics

Variable types in R:

  • numeric
  • integer
  • logical - TRUE or FALSE
  • character - character data

Also:
  • factor - categorical data
  • ordered - ordinal data
  • date
  • datetime
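
For example, each of these types can be constructed directly:

x <- c(1.5, 2.3)                      # numeric
n <- 3L                               # integer
f <- factor(c("a", "b", "a"))         # categorical
o <- factor(c("low", "high"), levels = c("low", "high"),
            ordered = TRUE)           # ordinal
d <- as.Date("2015-12-17")            # date
dt <- as.POSIXct("2015-12-17 12:30")  # datetime
class(o)
## [1] "ordered" "factor"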

Special values

  • NA: Not Available - placeholder for a missing value. Computations involving NA return NA
  • NULL: the null object. You cannot do maths with it
  • Inf: infinity - you can do maths with it!
  • NaN: Not a Number. Generally the result of a calculation (e.g. Inf - Inf). It is numeric; further computation always returns NaN
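
A quick illustration of how the special values behave:

NA + 1
## [1] NA
NULL + 1
## numeric(0)
Inf - Inf
## [1] NaN
NaN + 1
## [1] NaN
is.na(NaN)
## [1] TRUE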

Raw data to technically correct data


  • Get data into R
  • Clean up data, so each variable can be made the correct type
    • Particularly dates
  • Clean up messy text variables
    • String normalization and approximate text matching

Tabular data: long story short

  • read.table: R’s Swiss army knife
    • fairly strict (no sniffing)
    • Very flexible
    • Interface could be cleaner (see this talk)
  • readr::read_csv
    • Easy to switch between strict/lenient parsing
    • Compact control over column types
    • Fast
    • Clear reports of parsing failure
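
For instance, a minimal read_csv call with explicit column types (the file name and columns here are made up for illustration):

library(readr)
dat <- read_csv("survey.csv",
                col_types = cols(
                  id  = col_integer(),
                  age = col_double(),
                  dob = col_date(format = "%d-%m-%Y")
                ))
problems(dat)  # tibble describing any parsing failures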

Data from Excel

  • read_excel from the readxl package
  • read.xlsx from the xlsx package
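
A minimal sketch (the file name and sheet are hypothetical):

library(readxl)
dat <- read_excel("survey.xlsx", sheet = 1, na = c("", "NA"))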

For even dirtier data

  • (base) R has great text processing tools.
  • Need to work with regular expressions
  • Write many small functions extracting single data elements.
  • Don’t overgeneralize: adapt functions as you meet new input.
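
For instance, a small single-purpose extractor built on stringr (the input format is made up):

library(stringr)
# Pull the first run of digits out of a messy free-text age field
extract_age <- function(x) {
  as.integer(str_extract(x, "[0-9]+"))  # NA when no digits are found
}
extract_age(c("age: 42 yrs", "unknown", "27"))
## [1] 42 NA 27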

Regular expressions

Express a pattern of text, e.g.

\[ \texttt{"(a|b)c*"} = \{\texttt{"a"},\texttt{"ac"},\texttt{"acc"},\ldots,\texttt{"b"},\texttt{"bc"},\texttt{"bcc"},\ldots\} \]

Task                 stringr                                     Base R
string detection     str_detect(string, pattern)                 grep, grepl
string extraction    str_extract(string, pattern)                regexpr, regmatches
string replacement   str_replace(string, pattern, replacement)   sub, gsub
string splitting     str_split(string, pattern)                  strsplit
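
A few quick illustrations of the stringr functions, using the pattern from above:

library(stringr)
str_detect(c("a", "acc", "bc", "d"), "^(a|b)c*$")
## [1]  TRUE  TRUE  TRUE FALSE
str_extract("x123y", "[0-9]+")
## [1] "123"
str_replace("x123y", "[0-9]+", "#")
## [1] "x#y"
str_split("a,b,c", ",")
## [[1]]
## [1] "a" "b" "c"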

String normalization

Bring a text string in a standard format, e.g.

  • Standardize upper/lower case (casefolding)
    • stringr: str_to_lower, str_to_upper, str_to_title
    • base R: tolower, toupper
  • Remove accents (transliteration)
    • stringi: stri_trans_general
    • base R: iconv
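
For example:

library(stringr)
library(stringi)
str_to_lower("Ångström")
## [1] "ångström"
stri_trans_general("Ångström", "Latin-ASCII")
## [1] "Angstrom"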

Converting between wide and long data

  • pivot_longer and pivot_wider in the tidyr package
billboard[1:3,]
## # A tibble: 3 x 5
##   artist       track                     wk1   wk2   wk3
##   <chr>        <chr>                   <dbl> <dbl> <dbl>
## 1 2 Pac        Baby Don't Cry (Keep...    87    82    72
## 2 2Ge+her      The Hardest Part Of ...    91    87    92
## 3 3 Doors Down Kryptonite                 81    70    68

billboard %>%
  pivot_longer(cols = starts_with("wk")) %>%
  print(n = 6)
## # A tibble: 951 x 4
##   artist  track                   name  value
##   <chr>   <chr>                   <chr> <dbl>
## 1 2 Pac   Baby Don't Cry (Keep... wk1      87
## 2 2 Pac   Baby Don't Cry (Keep... wk2      82
## 3 2 Pac   Baby Don't Cry (Keep... wk3      72
## 4 2Ge+her The Hardest Part Of ... wk1      91
## 5 2Ge+her The Hardest Part Of ... wk2      87
## 6 2Ge+her The Hardest Part Of ... wk3      92
## # ... with 945 more rows
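
Going the other way, pivot_wider undoes this (a sketch, assuming the long table above has been stored as long):

long %>%
  pivot_wider(names_from = name, values_from = value)
# back to one row per track, with columns wk1, wk2, wk3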

Further steps in making data technically correct

  • lubridate: extract dates from strings

    lubridate::dmy("17 December 2015") 
    ## [1] "2015-12-17"
  • tidyr: many data cleaning operations to make your life easier

  • readr: Parse numbers from text strings

    readr::parse_number(c("2%","6%","0.3%"))
    ## [1] 2.0 6.0 0.3

From technically correct to consistent data

The mantra of data cleaning

  • Detection (data conflicts with domain knowledge)
  • Selection (find the value(s) that cause the violation)
  • Correction (replace them with better values)

Detection, AKA data validation

Informally:

Data validation is checking data against (multivariate) expectations about a data set.

For example

  • Ages cannot be negative and should be less than 150 years
  • The number of years married must be smaller than the age minus 17 years
  • A person labeled as “single” should have the number of years married recorded as 0

Validation rules

Often these expectations can be expressed as a set of simple validation rules.

The validate package allows us to define a set of validation rules which can then be applied to data.

The validate package, in summary

  • Make data validation rules explicit
  • Treat them as objects of computation
    • store to / read from file
    • manipulate
    • annotate
  • Confront data with rules
  • Analyze/visualize the results
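
A minimal sketch of this workflow (the data frame dat and its columns are hypothetical):

library(validate)
rules <- validator(
  age >= 0,
  age < 150,
  years_married <= age - 17,
  if (status == "single") years_married == 0
)
cf <- confront(dat, rules)  # check dat against the rules
summary(cf)                 # per-rule counts of passes, fails and NAs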

Use rules to correct data

Main idea

Rules restrict the data. Sometimes this is enough to derive a correct value uniquely.

Examples

  • Correct typos in values under linear restrictions
    • \(123 + 45 \not= 177\), but \(123 + \underline{54} = 177\).
  • Derive imputations from values under linear restrictions
    • \(123 + \texttt{NA} = 177\), compute \(177-123=54\).

Both can be generalized to systems \(\mathbf{Ax}\leq\mathbf{b}\).

The deducorrect package allows for deterministic imputation in these cases.
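
A small sketch with editrules and deducorrect (the variable names are made up):

library(editrules)
library(deducorrect)
E <- editmatrix("x + y == z")             # linear restriction
d <- data.frame(x = 123, y = NA, z = 177)
deduImpute(E, d)$corrected                # y is derived uniquely from the rule
##     x  y   z
## 1 123 54 177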

Selection, or: error localization

Fellegi and Holt (1976)

Find the smallest (weighted) number of fields that must be changed (imputed) so that all rules can be satisfied.

Note

  • Solutions need not be unique.
  • A random one is chosen in case of degeneracy (ties).
  • The lowest total weight need not correspond to the smallest number of altered variables.

Error localization

The errorlocate package

  • For in-record rules
  • Support for
    • linear (in)equality rules
    • Conditionals on categorical variables (if male then not pregnant)
    • Mixed conditionals (if has job then age \(\geq 15\))
    • Conditionals with linear predicates (if staff \(> 0\) then staff cost \(> 0\))
  • The optimization is mapped to a MIP (mixed integer programming) problem.
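
A minimal sketch combining validate and errorlocate (the data and rules are made up):

library(validate)
library(errorlocate)
rules <- validator(age >= 0, if (has_job == TRUE) age >= 15)
d <- data.frame(age = -4, has_job = TRUE)
# replace_errors runs Fellegi-Holt error localization and sets the
# located fields to NA; here, changing age alone fixes both rules
d_clean <- replace_errors(d, rules)
d_clean
##   age has_job
## 1  NA    TRUE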

Practical