1 - a bit on factors

Define the vector

v <- factor(c(1,8.2,3,11))

What happens if you apply the function as.numeric to v? How can you make v into a numeric vector?


2 - reading in data with base R and readr

The file unnamed.csv contains comma-separated values of ages and heights for a set of imaginary persons. Read in the data. Make sure to name both variables, and make both numeric. The obvious candidate function is read.table, but the read_csv function in the readr package makes it a lot easier


3a - data cleaning

Exercise 3 is about removing inconsistencies from data. The exercise comes in two versions: One using base procedures by “hand”, and one using the editrules package. This first version is less abstract and does not require looking into a new framework. Feel free to skip it and do 3b instead, if it is too simple - or feel free to skip 3b if it’s too involved

We will work with the following data - you can find it on the course homepage or paste it into a text file from here:

age, agegroup, height, status, yearsmarried
21, adult, 6.0, single, -1
2, child, 3, married, 0
18, adult, 5.7, married, 20
221, elderly, 5, widowed, 2
34, child, -7, married, 3
people <- read.csv( file = "data/people.txt", stringsAsFactors = F, strip.white = T)

Correct the data, replacing impossible values with correct ones or with NA where the correct value cannot be deduced:


3b - a walkthrough for the editrules package

This exercise is adapted from an example in De Jonge and van der Loo and walks you through applying functions in the editrules package. The editrules package allows allows us to define rules which each record of a data set must obey. Editrules can check which rules fail for each record, and find the minimal set of variables to adapt so that all rules can be folowed.

We will work with the following data - you can find it on the course homepage or paste it into a text file from here:

age, agegroup, height, status, yearsmarried
21, adult, 6.0, single, -1
2, child, 3, married, 0
18, adult, 5.7, married, 20
221, elderly, 5, widowed, 2
34, child, -7, married, 3
people <- read.csv( file = "data/people.txt", stringsAsFactors = F, strip.white = T)

For a cleaner workflow, the rules that these data must obey are treated as a separate object, that can then be stored and reapplied (instead, say, of treating these rules and checks as lines of code in a syntax file). We define an edit set and confront the data with these rules:

library(editrules)
E <- editset( c( "age >=0" , "age <= 150" ) )
E
## 
## Edit set:
## num1 : 0 <= age
## num2 : age <= 150
violatedEdits( E, people)
##       edit
## record  num1  num2
##      1 FALSE FALSE
##      2 FALSE FALSE
##      3 FALSE FALSE
##      4 FALSE  TRUE
##      5 FALSE FALSE

A realistic set of edit rules will be rather larger. It may be defined in a text file and read in with editset. Try for the rule set in the peoplerules.txt file. Also, with a larger rule set, the summary and plot functions for the violatedEdits object become useful.

E <- editfile("data/peoplerules.txt")
ve <- violatedEdits(E, people)
summary(ve)
## Edit violations, 5 observations, 0 completely missing (0%):
## 
##  editname freq rel
##      cat5    2 40%
##      mix6    2 40%
##      num2    1 20%
##      num3    1 20%
##      num4    1 20%
##      mix8    1 20%
## 
## Edit violations per record:
## 
##  errors freq rel
##       0    1 20%
##       1    1 20%
##       2    2 40%
##       3    1 20%
plot(ve)

Obviously, fixing errors requires subject matter expertise, but the editrules package does have facilities for finding potential fixes. The logic goes as follows: Fixing a variable so that an error is eliminated may cause new errors. Also, it makes sense to chose the edits necessary to make a record consistent in a way so as to minimise the total number of edits. This is a minimisation problem and can be automated. Further add-ons for the methods allow finer control by adding confidence weights. A branch-and-bound approach instead of the MIP solver used here is more slow but also allows for more control.

le <- localizeErrors(E, people, method = "mip")
le$adapt
##     age agegroup height status yearsmarried
## 1 FALSE    FALSE  FALSE  FALSE        FALSE
## 2 FALSE    FALSE  FALSE   TRUE        FALSE
## 3  TRUE    FALSE  FALSE  FALSE        FALSE
## 4  TRUE    FALSE  FALSE  FALSE        FALSE
## 5 FALSE     TRUE   TRUE  FALSE        FALSE

The le object contains metadata and a logical array which indicates the minimal set of edits needed to make data follow the rules in E (note that E is missing at least one rule). The corrections are not part of the editrules package.


4 - more on the editrules package

This exercise, again borrowed from de Jonge and van der Loo, follows on the previous but has less hand-holding.

4.1 Reading in and manual checks

The dirty_iris dataset can be found on the course webpage. Read in the data, and make sure that strings are not converted to factors. Replace special values with NA and compute the proportion of complete observations.

4.2 Validating with rules

Codify the following rules into an editfile object:

  • Species should be one of the following values: setosa, versicolor or virginica.
  • All measurements should be positive.
  • Petal length is at least 2 times petal width.
  • Sepal length cannot exceed 30 cm.
  • The sepals of an iris are longer than its petals.

Find out how often each rule is broken. Summarise and plot the object.

Use violatedEdits to find the observations with too long petals

Use boxplot and boxplot.stats to find observations with outlier values of sepal length. Set outliers to NA

4.3 Correcting data

Use the correctWithRules function from the deducorrect library to replace non-positive values of of Petal.Width with NA.

Use the (output of the) LocalizeErrors function and the editset defined in exercise 5.2 to replace all faulty values with NA