Practical E

1 - a bit on factors

Define the vector

v <- factor(c(1,8.2,3,11))

What happens if you apply the function as.numeric to v? How can you make v into a numeric vector?

as.numeric(v)

## [1] 1 3 2 4

Since v is a factor, the visible values in it are actually labels for the underlying factor values - which are nicely ordered consequtive integers. In order to make v numeric with the right values, use

as.numeric(as.character(v))

## [1]  1.0  8.2  3.0 11.0

2 - reading in data with base R and `readxl`

The file unnamed.xls contains comma-separated values of ages and heights for a set of imaginary persons. Read in the data. Make sure to name both variables, and make both numeric. Hint: The parse_number function in the readr package is nice

unnamed <- readxl::read_excel("data/unnamed.xls", 
                   col_names = c("age", "height"))
unnamed

## # A tibble: 4 x 2
##     age height
##   <dbl> <chr> 
## 1    21 6.0   
## 2    42 5.9   
## 3    18 5.7*  
## 4    21 NA

unnamed$height <- readr::parse_number(unnamed$height)
unnamed

## # A tibble: 4 x 2
##     age height
##   <dbl>  <dbl>
## 1    21    6  
## 2    42    5.9
## 3    18    5.7
## 4    21   NA

3a - data cleaning

Exercise 3 is about removing inconsistencies from data. The exercise comes in two versions: One using base procedures by “hand”, and one using the editrules package. This first version is less abstract and does not require looking into a new framework. Feel free to skip it and do 3b instead, if it is too simple - or feel free to skip 3b if it’s too involved

We will work with the following data - you can find it on the course homepage or paste it into a text file from here:

age, agegroup, height, status, yearsmarried
21, adult, 6.0, single, -1
2, child, 3, married, 0
18, adult, 5.7, married, 20
221, elderly, 5, widowed, 2
34, child, -7, married, 3

people <- read.csv( file = "data/people.txt", stringsAsFactors = F, strip.white = T)

We are going to replace impossible values with NA: * Age should not be negative and should be less than 150 * Height should be greater than zero * Age should be greater than the number of years married * Children cannot be married or widowed * People under 18 of age should be coded as children, those between 18 and 64 should be coded as adults, and those over age 65 should be coded as elderly * Age should be larger than or equal to yearsmarried + 18

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Age should not be negative and should be less than 150
people <- people %>% mutate( age = ifelse( age < 0, NA, age))
people <- people %>% mutate( age = ifelse( age>=150, NA, age))

##Height should be greater than zero
people <- people %>% mutate( height = ifelse( height <=0, NA, height))

So far so good. But the rest of the rules involve some matter of choice: If age and years married are inconsistent, which do we correct, for instance? For a rule, we could say that between age, age group and status, if two of them agree and the third doesn’t, we fix the third.

people <- people %>% mutate( agegroup = ifelse(  agegroup ==  "child" & 
                                                 status %in% c("married", "widowed") &
                                                 age >= yearsmarried + 18,
                                                 ifelse(age<=65, "adult", "elderly"),
                                                 agegroup
                                  )) # Here we can actually deduce the value from "age"

people <- people %>% mutate( status = ifelse( agegroup == "child" & 
                                              status %in%  c("married", "widowed") &
                                              age < 18,
                                              "single",
                                              status
                                              )) # Here we can also deduce the value

people <- people %>% mutate(age = ifelse( age<18 & agegroup != "child" & status != "single",
                                          NA,
                                          age
                                          )) # Here we set to missing

# When age and yearsmarried do not agree otherwise, we shall believe in age
people <- people %>% mutate( yearsmarried = ifelse( age >= yearsmarried + 18 | 
                                                    is.na(age) | is.na(yearsmarried) | 
                                                    status == "single"
                                                    , yearsmarried, NA ))

people

##   age agegroup height  status yearsmarried
## 1  21    adult    6.0  single           -1
## 2   2    child    3.0  single            0
## 3  18    adult    5.7 married           NA
## 4  NA  elderly    5.0 widowed            2
## 5  34    adult     NA married            3

3b - a walkthrough for the `editrules` package

This exercise is adapted from an example in De Jonge and van der Loo and walks you through applying functions in the editrules package to make a dataset ready for imputation. The editrules package allows allows us to define rules which each record of a data set must obey. Editrules can check which rules fail for each record, and find the minimal set of variables to adapt so that all rules can be folowed.

We will work with the following data - you can find it on the course homepage or paste it into a text file from here:

age, agegroup, height, status, yearsmarried
21, adult, 6.0, single, -1
2, child, 3, married, 0
18, adult, 5.7, married, 20
221, elderly, 5, widowed, 2
34, child, -7, married, 3

people <- read.csv( file = "data/people.txt", stringsAsFactors = F, strip.white = T)

For a cleaner workflow, the rules that these data must obey are treated as a separate object, that can then be stored and reapplied (instead, say, of treating these rules and checks as lines of code in a syntax file). We define an edit set and confront the data with these rules:

library(editrules)
E <- editset( c( "age >=0" , "age <= 150" ) )
E

## 
## Edit set:
## num1 : 0 <= age
## num2 : age <= 150

violatedEdits( E, people)

##       edit
## record  num1  num2
##      1 FALSE FALSE
##      2 FALSE FALSE
##      3 FALSE FALSE
##      4 FALSE  TRUE
##      5 FALSE FALSE

A realistic set of edit rules will be rather larger. It may be defined in a text file and read in with editset. Try for the rule set in the peoplerules.txt file. Also, with a larger rule set, the summary and plot functions for the violatedEdits object become useful.

E <- editfile("data/peoplerules.txt")
ve <- violatedEdits(E, people)
summary(ve)

## Edit violations, 5 observations, 0 completely missing (0%):
## 
##  editname freq rel
##      cat5    2 40%
##      mix6    2 40%
##      num2    1 20%
##      num3    1 20%
##      num4    1 20%
##      mix8    1 20%
## 
## Edit violations per record:
## 
##  errors freq rel
##       0    1 20%
##       1    1 20%
##       2    2 40%
##       3    1 20%

plot(ve)

Obviously, fixing errors requires subject matter expertise, but the editrules package does have facilities for finding potential fixes. The logic goes as follows: Fixing a variable so that an error is eliminated may cause new errors. Also, it makes sense to chose the edits necessary to make a record consistent in a way so as to minimise the total number of edits. This is a minimisation problem and can be automated. Further add-ons for the methods allow finer control by adding confidence weights. A branch-and-bound approach instead of the MIP solver used here is more slow but also allows for more control.

le <- localizeErrors(E, people, method = "mip")
le$adapt

##     age agegroup height status yearsmarried
## 1 FALSE    FALSE  FALSE  FALSE        FALSE
## 2 FALSE    FALSE  FALSE   TRUE        FALSE
## 3  TRUE    FALSE  FALSE  FALSE        FALSE
## 4  TRUE    FALSE  FALSE  FALSE        FALSE
## 5 FALSE     TRUE   TRUE  FALSE        FALSE

The le object contains metadata and a logical array which indicates the minimal set of edits needed to make data follow the rules in E (note that E is missing at least one rule). The corrections are not part of the editrules package.

4 - more on the `editrules` package

This exercise, again borrowed from de Jonge and van der Loo, follows on the previous but has less hand-holding. If you want, you can do the exercise manually, without using the editrules package - as in exercise 3a

4.1 Reading in and manual checks

The dirty_iris dataset can be found on the course webpage. Read in the data, and make sure that strings are not converted to factors. Replace special values with NA and compute the proportion of complete observations.

dirty.iris <- read.csv( file = "data/dirty_iris.txt", 
                        stringsAsFactors = F)
summary(dirty.iris)

##   Sepal.Length     Sepal.Width      Petal.Length    Petal.Width 
##  Min.   : 0.000   Min.   :-3.000   Min.   : 0.00   Min.   :0.1  
##  1st Qu.: 5.100   1st Qu.: 2.800   1st Qu.: 1.60   1st Qu.:0.3  
##  Median : 5.750   Median : 3.000   Median : 4.50   Median :1.3  
##  Mean   : 6.559   Mean   : 3.391   Mean   : 4.45   Mean   :Inf  
##  3rd Qu.: 6.400   3rd Qu.: 3.300   3rd Qu.: 5.10   3rd Qu.:1.8  
##  Max.   :73.000   Max.   :30.000   Max.   :63.00   Max.   :Inf  
##  NA's   :10       NA's   :17       NA's   :19      NA's   :12   
##    Species         
##  Length:150        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

## There are Inf values in Petal.Width
dirty.iris$Petal.Width[ !is.finite(dirty.iris$Petal.Width) ] <- NA

## The complete.cases function checks each record for missing values
prop.table( table( complete.cases(dirty.iris) ))

## 
##     FALSE      TRUE 
## 0.3666667 0.6333333

4.2 Validating with rules

Codify the following rules into an editfile object:

Species should be one of the following values: setosa, versicolor or virginica.
All measurements should be positive.
Petal length is at least 2 times petal width.
Sepal length cannot exceed 30 cm.
The sepals of an iris are longer than its petals.

iris.rules <- editset(
  c(
    "Species %in% c('setosa', 'versicolor', 'virginica')",
    "Sepal.Length > 0",
    "Sepal.Width  > 0",
    "Petal.Length > 0",
    "Petal.Width  > 0",
    "Petal.Length >= 2*Petal.Width",
    "Sepal.Length <=30",
    "Sepal.Length >= Petal.Length"
  )
)

Find out how often each rule is broken. Summarise and plot the object.

violated.rules.iris <- violatedEdits( iris.rules, dirty.iris)
summary(violated.rules.iris)

## Edit violations, 150 observations, 0 completely missing (0%):
## 
##  editname freq  rel
##      num5    3   2%
##      num2    2 1.3%
##      num6    2 1.3%
##      num7    2 1.3%
##      num1    1 0.7%
##      num3    1 0.7%
## 
## Edit violations per record:
## 
##  errors freq   rel
##       0   90   60%
##       1   17 11.3%
##       2   13  8.7%
##       3   25 16.7%
##       4    4  2.7%
##       5    1  0.7%

plot(violated.rules.iris)

Use violatedEdits to find the observations with too long petals Which column of the matrix returned by the function corresponds to that rule? We first look at the rules in the rules object

iris.rules

## 
## Data model:
## dat1 : Species %in% c('setosa', 'versicolor', 'virginica') 
## 
## Edit set:
## num1 : 0 < Sepal.Length
## num2 : 0 < Sepal.Width
## num3 : 0 < Petal.Length
## num4 : 0 < Petal.Width
## num5 : 2*Petal.Width <= Petal.Length
## num6 : Sepal.Length <= 30
## num7 : Petal.Length <= Sepal.Length

The rule is “num7”. It’s easier to pick out this rule from violated.rules.iris if we make it into a tibble

library(dplyr)
violated.rules.iris %>% as_tibble() %>%
  filter(num7 == T)

## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `tibble::enframe(name = NULL)` instead.
## This warning is displayed once per session.

## # A tibble: 2 x 8
##   num1  num2  num3  num4  num5  num6  num7  dat1 
##   <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 FALSE FALSE FALSE FALSE FALSE FALSE TRUE  FALSE
## 2 TRUE  NA    FALSE FALSE FALSE FALSE TRUE  FALSE

## Find the observations corresponding to this:

## First we find the indices of these observations

indices <- violated.rules.iris %>% 
    as_tibble() %>%
    mutate( id = row_number()) %>%           
    filter(num7 == T)

dirty.iris[indices$id,]

##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 35          6.6         2.9         23.0         1.3 versicolor
## 43          0.0          NA          1.3         0.4     setosa

Use boxplot and boxplot.stats to find observations with outlier values of sepal length. Set outliers to NA

## boxplot does a box plot
boxplot(dirty.iris$Sepal.Length)

## boxplot.stats does statistics for the boxplot
bps <- boxplot.stats(dirty.iris$Sepal.Length)
bps

## $stats
## [1] 4.30 5.10 5.75 6.40 7.90
## 
## $n
## [1] 140
## 
## $conf
## [1] 5.576405 5.923595
## 
## $out
## [1] 73  0 49

dirty.iris <- dirty.iris %>% 
  mutate( Sepal.Length = if_else(Sepal.Length %in% bps$out, NA_real_, Sepal.Length))

4.3 Correcting data

Use the correctWithRules function from the deducorrect library to replace non-positive values of of Petal.Width with NA.

library(deducorrect)

## Warning: package 'deducorrect' was built under R version 3.6.1

# We make a set of correction rules
iris.cr <- correctionRules(expression(
   if ( is.na(Petal.Length) | Petal.Length <= 0 ) Petal.Length <- NA
))

# We use the correction rules on the data
iris.corrections <- correctWithRules( iris.cr , dirty.iris)

iris.corrections$corrections

##   row     variable old new
## 1  79 Petal.Length   0  NA
##                                                               how
## 1 if (is.na(Petal.Length) | Petal.Length <= 0) Petal.Length <- NA

## It's not very good practise, but we'll replace the data with the corrected data
dirty.iris <- iris.corrections$corrected

Use the (output of the) LocalizeErrors function and the editset defined in exercise 5.2 to replace all faulty values with NA

iris.loc.err <- localizeErrors( iris.rules , dirty.iris)
plot(iris.loc.err)

# DEGENERATE solutions are those with more than one possible edit. We look at them
table(iris.loc.err$status$degeneracy)

## 
##   1   2 
## 147   3

dirty.iris[iris.loc.err$status$degeneracy>1,]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 30            NA         2.8        0.820         1.3 versicolor
## 35           6.6         2.9       23.000         1.3 versicolor
## 106          5.5          NA        0.925         1.0 versicolor

iris.loc.err$adapt[iris.loc.err$status$degeneracy>1,]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 30          TRUE       FALSE         TRUE       FALSE   FALSE
## 35         FALSE       FALSE         TRUE       FALSE   FALSE
## 106        FALSE        TRUE        FALSE        TRUE   FALSE

# We set all values deemed faulty to NA
dirty.iris[iris.loc.err$adapt] <- NA

summary(violatedEdits( iris.rules, dirty.iris))

## No violations detected, 150 checks evaluated to NA

## NULL

Practical E

Anne Vinkel, adapted from ‘An Introduction to data cleaning with R’ by Edwin de Jonge and Mark van der Loo

Statistical Programming with R

1 - a bit on factors

2 - reading in data with base R and `readxl`

3a - data cleaning

3b - a walkthrough for the `editrules` package

4 - more on the `editrules` package

4.1 Reading in and manual checks

4.2 Validating with rules

4.3 Correcting data

Practical E

Anne Vinkel, adapted from ‘An Introduction to data cleaning with R’ by Edwin de Jonge and Mark van der Loo

Statistical Programming with R

1 - a bit on factors

2 - reading in data with base R and readxl

3a - data cleaning

3b - a walkthrough for the editrules package

4 - more on the editrules package

4.1 Reading in and manual checks

4.2 Validating with rules

4.3 Correcting data

2 - reading in data with base R and `readxl`

3b - a walkthrough for the `editrules` package

4 - more on the `editrules` package