Define the vector
v <- factor(c(1,8.2,3,11))
What happens if you apply the function as.numeric to v? How can you make v into a numeric vector?
as.numeric(v)
## [1] 1 3 2 4
Since v is a factor, the visible values in it are actually labels for the underlying factor values - which are nicely ordered consequtive integers. In order to make v numeric with the right values, use
as.numeric(as.character(v))
## [1] 1.0 8.2 3.0 11.0
readxlThe file unnamed.xls contains comma-separated values of ages and heights for a set of imaginary persons. Read in the data. Make sure to name both variables, and make both numeric. Hint: The parse_number function in the readr package is nice
unnamed <- readxl::read_excel("data/unnamed.xls",
col_names = c("age", "height"))
unnamed
## # A tibble: 4 x 2
## age height
## <dbl> <chr>
## 1 21 6.0
## 2 42 5.9
## 3 18 5.7*
## 4 21 NA
unnamed$height <- readr::parse_number(unnamed$height)
unnamed
## # A tibble: 4 x 2
## age height
## <dbl> <dbl>
## 1 21 6
## 2 42 5.9
## 3 18 5.7
## 4 21 NA
Exercise 3 is about removing inconsistencies from data. The exercise comes in two versions: One using base procedures by “hand”, and one using the editrules package. This first version is less abstract and does not require looking into a new framework. Feel free to skip it and do 3b instead, if it is too simple - or feel free to skip 3b if it’s too involved
We will work with the following data - you can find it on the course homepage or paste it into a text file from here:
age, agegroup, height, status, yearsmarried
21, adult, 6.0, single, -1
2, child, 3, married, 0
18, adult, 5.7, married, 20
221, elderly, 5, widowed, 2
34, child, -7, married, 3
people <- read.csv( file = "data/people.txt", stringsAsFactors = F, strip.white = T)
We are going to replace impossible values with NA: * Age should not be negative and should be less than 150 * Height should be greater than zero * Age should be greater than the number of years married * Children cannot be married or widowed * People under 18 of age should be coded as children, those between 18 and 64 should be coded as adults, and those over age 65 should be coded as elderly * Age should be larger than or equal to yearsmarried + 18
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Age should not be negative and should be less than 150
people <- people %>% mutate( age = ifelse( age < 0, NA, age))
people <- people %>% mutate( age = ifelse( age>=150, NA, age))
##Height should be greater than zero
people <- people %>% mutate( height = ifelse( height <=0, NA, height))
So far so good. But the rest of the rules involve some matter of choice: If age and years married are inconsistent, which do we correct, for instance? For a rule, we could say that between age, age group and status, if two of them agree and the third doesn’t, we fix the third.
people <- people %>% mutate( agegroup = ifelse( agegroup == "child" &
status %in% c("married", "widowed") &
age >= yearsmarried + 18,
ifelse(age<=65, "adult", "elderly"),
agegroup
)) # Here we can actually deduce the value from "age"
people <- people %>% mutate( status = ifelse( agegroup == "child" &
status %in% c("married", "widowed") &
age < 18,
"single",
status
)) # Here we can also deduce the value
people <- people %>% mutate(age = ifelse( age<18 & agegroup != "child" & status != "single",
NA,
age
)) # Here we set to missing
# When age and yearsmarried do not agree otherwise, we shall believe in age
people <- people %>% mutate( yearsmarried = ifelse( age >= yearsmarried + 18 |
is.na(age) | is.na(yearsmarried) |
status == "single"
, yearsmarried, NA ))
people
## age agegroup height status yearsmarried
## 1 21 adult 6.0 single -1
## 2 2 child 3.0 single 0
## 3 18 adult 5.7 married NA
## 4 NA elderly 5.0 widowed 2
## 5 34 adult NA married 3
editrules packageThis exercise is adapted from an example in De Jonge and van der Loo and walks you through applying functions in the editrules package to make a dataset ready for imputation. The editrules package allows allows us to define rules which each record of a data set must obey. Editrules can check which rules fail for each record, and find the minimal set of variables to adapt so that all rules can be folowed.
We will work with the following data - you can find it on the course homepage or paste it into a text file from here:
age, agegroup, height, status, yearsmarried
21, adult, 6.0, single, -1
2, child, 3, married, 0
18, adult, 5.7, married, 20
221, elderly, 5, widowed, 2
34, child, -7, married, 3
people <- read.csv( file = "data/people.txt", stringsAsFactors = F, strip.white = T)
For a cleaner workflow, the rules that these data must obey are treated as a separate object, that can then be stored and reapplied (instead, say, of treating these rules and checks as lines of code in a syntax file). We define an edit set and confront the data with these rules:
library(editrules)
E <- editset( c( "age >=0" , "age <= 150" ) )
E
##
## Edit set:
## num1 : 0 <= age
## num2 : age <= 150
violatedEdits( E, people)
## edit
## record num1 num2
## 1 FALSE FALSE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE TRUE
## 5 FALSE FALSE
A realistic set of edit rules will be rather larger. It may be defined in a text file and read in with editset. Try for the rule set in the peoplerules.txt file. Also, with a larger rule set, the summary and plot functions for the violatedEdits object become useful.
E <- editfile("data/peoplerules.txt")
ve <- violatedEdits(E, people)
summary(ve)
## Edit violations, 5 observations, 0 completely missing (0%):
##
## editname freq rel
## cat5 2 40%
## mix6 2 40%
## num2 1 20%
## num3 1 20%
## num4 1 20%
## mix8 1 20%
##
## Edit violations per record:
##
## errors freq rel
## 0 1 20%
## 1 1 20%
## 2 2 40%
## 3 1 20%
plot(ve)
Obviously, fixing errors requires subject matter expertise, but the editrules package does have facilities for finding potential fixes. The logic goes as follows: Fixing a variable so that an error is eliminated may cause new errors. Also, it makes sense to chose the edits necessary to make a record consistent in a way so as to minimise the total number of edits. This is a minimisation problem and can be automated. Further add-ons for the methods allow finer control by adding confidence weights. A branch-and-bound approach instead of the MIP solver used here is more slow but also allows for more control.
le <- localizeErrors(E, people, method = "mip")
le$adapt
## age agegroup height status yearsmarried
## 1 FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE TRUE FALSE
## 3 TRUE FALSE FALSE FALSE FALSE
## 4 TRUE FALSE FALSE FALSE FALSE
## 5 FALSE TRUE TRUE FALSE FALSE
The le object contains metadata and a logical array which indicates the minimal set of edits needed to make data follow the rules in E (note that E is missing at least one rule). The corrections are not part of the editrules package.
editrules packageThis exercise, again borrowed from de Jonge and van der Loo, follows on the previous but has less hand-holding. If you want, you can do the exercise manually, without using the editrules package - as in exercise 3a
The dirty_iris dataset can be found on the course webpage. Read in the data, and make sure that strings are not converted to factors. Replace special values with NA and compute the proportion of complete observations.
dirty.iris <- read.csv( file = "data/dirty_iris.txt",
stringsAsFactors = F)
summary(dirty.iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. : 0.000 Min. :-3.000 Min. : 0.00 Min. :0.1
## 1st Qu.: 5.100 1st Qu.: 2.800 1st Qu.: 1.60 1st Qu.:0.3
## Median : 5.750 Median : 3.000 Median : 4.50 Median :1.3
## Mean : 6.559 Mean : 3.391 Mean : 4.45 Mean :Inf
## 3rd Qu.: 6.400 3rd Qu.: 3.300 3rd Qu.: 5.10 3rd Qu.:1.8
## Max. :73.000 Max. :30.000 Max. :63.00 Max. :Inf
## NA's :10 NA's :17 NA's :19 NA's :12
## Species
## Length:150
## Class :character
## Mode :character
##
##
##
##
## There are Inf values in Petal.Width
dirty.iris$Petal.Width[ !is.finite(dirty.iris$Petal.Width) ] <- NA
## The complete.cases function checks each record for missing values
prop.table( table( complete.cases(dirty.iris) ))
##
## FALSE TRUE
## 0.3666667 0.6333333
Codify the following rules into an editfile object:
iris.rules <- editset(
c(
"Species %in% c('setosa', 'versicolor', 'virginica')",
"Sepal.Length > 0",
"Sepal.Width > 0",
"Petal.Length > 0",
"Petal.Width > 0",
"Petal.Length >= 2*Petal.Width",
"Sepal.Length <=30",
"Sepal.Length >= Petal.Length"
)
)
Find out how often each rule is broken. Summarise and plot the object.
violated.rules.iris <- violatedEdits( iris.rules, dirty.iris)
summary(violated.rules.iris)
## Edit violations, 150 observations, 0 completely missing (0%):
##
## editname freq rel
## num5 3 2%
## num2 2 1.3%
## num6 2 1.3%
## num7 2 1.3%
## num1 1 0.7%
## num3 1 0.7%
##
## Edit violations per record:
##
## errors freq rel
## 0 90 60%
## 1 17 11.3%
## 2 13 8.7%
## 3 25 16.7%
## 4 4 2.7%
## 5 1 0.7%
plot(violated.rules.iris)
Use violatedEdits to find the observations with too long petals Which column of the matrix returned by the function corresponds to that rule? We first look at the rules in the rules object
iris.rules
##
## Data model:
## dat1 : Species %in% c('setosa', 'versicolor', 'virginica')
##
## Edit set:
## num1 : 0 < Sepal.Length
## num2 : 0 < Sepal.Width
## num3 : 0 < Petal.Length
## num4 : 0 < Petal.Width
## num5 : 2*Petal.Width <= Petal.Length
## num6 : Sepal.Length <= 30
## num7 : Petal.Length <= Sepal.Length
The rule is “num7”. It’s easier to pick out this rule from violated.rules.iris if we make it into a tibble
library(dplyr)
violated.rules.iris %>% as_tibble() %>%
filter(num7 == T)
## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `tibble::enframe(name = NULL)` instead.
## This warning is displayed once per session.
## # A tibble: 2 x 8
## num1 num2 num3 num4 num5 num6 num7 dat1
## <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## 2 TRUE NA FALSE FALSE FALSE FALSE TRUE FALSE
## Find the observations corresponding to this:
## First we find the indices of these observations
indices <- violated.rules.iris %>%
as_tibble() %>%
mutate( id = row_number()) %>%
filter(num7 == T)
dirty.iris[indices$id,]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 35 6.6 2.9 23.0 1.3 versicolor
## 43 0.0 NA 1.3 0.4 setosa
Use boxplot and boxplot.stats to find observations with outlier values of sepal length. Set outliers to NA
## boxplot does a box plot
boxplot(dirty.iris$Sepal.Length)
## boxplot.stats does statistics for the boxplot
bps <- boxplot.stats(dirty.iris$Sepal.Length)
bps
## $stats
## [1] 4.30 5.10 5.75 6.40 7.90
##
## $n
## [1] 140
##
## $conf
## [1] 5.576405 5.923595
##
## $out
## [1] 73 0 49
dirty.iris <- dirty.iris %>%
mutate( Sepal.Length = if_else(Sepal.Length %in% bps$out, NA_real_, Sepal.Length))
Use the correctWithRules function from the deducorrect library to replace non-positive values of of Petal.Width with NA.
library(deducorrect)
## Warning: package 'deducorrect' was built under R version 3.6.1
# We make a set of correction rules
iris.cr <- correctionRules(expression(
if ( is.na(Petal.Length) | Petal.Length <= 0 ) Petal.Length <- NA
))
# We use the correction rules on the data
iris.corrections <- correctWithRules( iris.cr , dirty.iris)
iris.corrections$corrections
## row variable old new
## 1 79 Petal.Length 0 NA
## how
## 1 if (is.na(Petal.Length) | Petal.Length <= 0) Petal.Length <- NA
## It's not very good practise, but we'll replace the data with the corrected data
dirty.iris <- iris.corrections$corrected
Use the (output of the) LocalizeErrors function and the editset defined in exercise 5.2 to replace all faulty values with NA
iris.loc.err <- localizeErrors( iris.rules , dirty.iris)
plot(iris.loc.err)
# DEGENERATE solutions are those with more than one possible edit. We look at them
table(iris.loc.err$status$degeneracy)
##
## 1 2
## 147 3
dirty.iris[iris.loc.err$status$degeneracy>1,]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 30 NA 2.8 0.820 1.3 versicolor
## 35 6.6 2.9 23.000 1.3 versicolor
## 106 5.5 NA 0.925 1.0 versicolor
iris.loc.err$adapt[iris.loc.err$status$degeneracy>1,]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 30 TRUE FALSE TRUE FALSE FALSE
## 35 FALSE FALSE TRUE FALSE FALSE
## 106 FALSE TRUE FALSE TRUE FALSE
# We set all values deemed faulty to NA
dirty.iris[iris.loc.err$adapt] <- NA
summary(violatedEdits( iris.rules, dirty.iris))
## No violations detected, 150 checks evaluated to NA
## NULL