Chapter 4 Descriptive - Preliminary Analysis (Exploratory Data Analysis - EDA)
4.1 Motivation
- To make sure the data cleaning step is automated and consistent.
- To make sure the data contains the relevant variables before analysis.
- To make sure there are no unexpected variable type issues.
- To detect missing values, especially in numeric columns, and decide whether they should be NA or 0.
- To check and compare different users' values for specific variables.
- To calculate must-have indicators such as LCSI and RCSI.
- To make the data tidy… For example: multiple-selection questions are now binary dummy variables.
- To set the scene for reproducible studies.
4.2 Data Cleaning
4.2.1 Column Renaming
If there is an error in this part, check the error details. The following issues might be in place:
- Duplicated column names
- Columns removed from the original data
Basically, if the column names expected by the code and the column names of the imported data set do not match, this error is inevitable.
For solution steps:
- Check the error message and find out which columns cause the error (see the sketch right below for a quick way to spot mismatches).
- Correct the column names with the accurate names, using the chunk of code below.
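By the way, a quick way to see which names do not match is comparing the two sets of names with setdiff. This is just a sketch; expected_names is a hypothetical vector you would keep in your cleaning script.
library(ggplot2) #for the diamonds data
#Hypothetical vector of the column names your script expects.
expected_names <- c("carat", "cut", "color", "clarity", "depth", "table", "price")
#Columns the script expects but the data does not have:
setdiff(expected_names, colnames(diamonds))
#Columns the data has but the script does not expect:
setdiff(colnames(diamonds), expected_names)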
#Code example; I am using the rename function but there might be better ones for sure.
#Let's keep using the diamonds data for this as well.
#rename comes with dplyr.
library(dplyr)
library(tidyverse)
DiamondsRenamed <- diamonds %>% rename("renamed_table" = "table")
colnames(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price" "x" "y" "z" "Type"
colnames(DiamondsRenamed)
## [1] "carat" "cut" "color" "clarity" "depth" "renamed_table" "price" "x" "y" "z"
## [11] "Type"
4.2.2 Converting Multiple Selections into Dummy Variables
This issue happens with ODK or Device Magic, as they put multiple-selection answers into one variable. KOBO does this dummy coding itself, which is amazing. Converting these columns might take a bit of time, so let me show you how I manage this bothersome situation… If you have a better way to do it, let me know!
#Let's add a column to diamonds to work with.
#We added multiple selections divided by commas.
DiamondsRenamed$MultSelec <- "Selection1,Selection2,Selection5"

#Take this column as a data frame from our main data frame.
df_multselec <- as.data.frame(DiamondsRenamed$MultSelec)

#Rename this data frame column.
colnames(df_multselec) <- "examplecolumn"

#We will use the cSplit_e function from the splitstackshape package.
library(splitstackshape)

#Each category becomes a column in binary format:
#if it is selected the dummy goes to 1, otherwise to 0.
df_multselec_seperated <- cSplit_e(df_multselec, split.col = "examplecolumn",
                                   sep = ",", type = "character", mode = "binary", fixed = TRUE, fill = 0)

#Remove the original column before merging with the original data, so that you will not have duplicate columns.
df_multselec_seperated$examplecolumn <- NULL

#With column bind, add the new dummy columns to the original data. Done.
DiamondsRenamed <- cbind(DiamondsRenamed, df_multselec_seperated)

#Let's see what happened. Check the last three columns as the example columns.
library(dplyr)
glimpse(DiamondsRenamed)
## Rows: 53,940
## Columns: 15
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.30, 0.23, 0.22, 0.31, 0.20, 0.32, 0.30, 0.30, 0.30, 0.30, 0.30, 0.23…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Very Good, Fair, Very Good, Good, Ideal, Premium, Ideal, Premium, Premi…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I, E, H, J, J, G, I, J, D, F, F, F, E, E, D, F, E, H, D, I, I, J, D, D…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, SI1, SI2, SI2, I1, SI2, SI1, SI1, SI1, SI2, VS2, VS1, SI1, SI1, VVS2…
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64.0, 62.8, 60.4, 62.2, 60.2, 60.9, 62.0, 63.4, 63.8, 62.7, 63.3, 63.8…
## $ renamed_table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58, 54, 54, 56, 59, 56, 55, 57, 62, 62, 58, 57, 57, 61, 57, 57, 57, 59…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 342, 344, 345, 345, 348, 351, 351, 351, 351, 352, 353, 353, 353, 354, …
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.25, 3.93, 3.88, 4.35, 3.79, 4.38, 4.31, 4.23, 4.23, 4.21, 4.26, 3.85…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.28, 3.90, 3.84, 4.37, 3.75, 4.42, 4.34, 4.29, 4.26, 4.27, 4.30, 3.92…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.73, 2.46, 2.33, 2.71, 2.27, 2.68, 2.68, 2.70, 2.71, 2.66, 2.71, 2.48…
## $ Type <chr> "Population", "Population", "Population", "Population", "Population", "Population", "Population", "Population", "Population", "Pop…
## $ MultSelec <chr> "Selection1,Selection2,Selection5", "Selection1,Selection2,Selection5", "Selection1,Selection2,Selection5", "Selection1,Selection2…
## $ examplecolumn_Selection1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ examplecolumn_Selection2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ examplecolumn_Selection5 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
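If you prefer staying within the tidyverse, here is a minimal sketch of the same idea using tidyr. The row_id column is only a temporary helper created for this example, not something in the original data.
library(dplyr)
library(tidyr)

DiamondsRenamed %>%
  mutate(row_id = row_number()) %>%                 #temporary id so each row keeps its identity
  separate_rows(MultSelec, sep = ",") %>%           #one row per selected option
  mutate(selected = 1) %>%
  pivot_wider(names_from = MultSelec, values_from = selected,
              values_fill = 0, names_prefix = "MultSelec_")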
4.2.3 Missing Values
4.2.3.1 Defining Missing Values of Dataframe
Usually we do not have missing values if the questionnaire design is tailored well in KOBO or ODK; conditional or mandatory questions can be defined during that phase. Still, checking for missing data is always a must. The data cleaning step should be carried out in consultation with whoever designed the KOBO/ODK forms. These people can explain the tips and tricks of the data and give you a walkthrough of it.
#Required packages are visdat and naniar.
library(visdat)
library(naniar)
#Let's see if the diamonds data has some missing values here and there.
vis_dat(diamonds)
There are no missing values in the diamonds data. One of the amazing things about this visual is that it gives you the type and structure of each variable. This thing sure is an art! Amazing! Let's remove some values (put in some NA values intentionally) and see that amazing visual again.
#The required package for adding random NAs to the data is missMethods.
library(missMethods)

#With the delete_MCAR function, we set a random 25% of the values to NA.
diamonds_missing <- delete_MCAR(diamonds, p = 0.25)
vis_dat(diamonds_missing)
So now, the grey lines in the visual reflect NAs. Amazing! Let's check the percentage of missing values now:
#Percentage of missing by variable.
vis_miss(diamonds_missing)
Amazing. The delete_MCAR function is king. You can see 25% missing at random from each column. To learn more about these operations, please visit: https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html
Missing frequency.
gg_miss_var(diamonds_missing) + theme_bw()
Missing frequency disaggregated by categorical variable.
gg_miss_var(diamonds_missing, facet = cut)
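If you prefer the same information as a table rather than a plot, naniar also has a handy summary function:
#Number and percentage of missing values per variable, returned as a tibble.
miss_var_summary(diamonds_missing)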
4.2.3.2 Imputation of Missing Values
There are several methods for imputation, such as median or mean imputation, or one can go all in and even use regression. In most cases, for some numeric values in humanitarian data, you need to replace NAs with zeros. Let's say the questionnaire designer linked two questions: "Do you have income?" (yes/no) and, if yes, "How much?" If the answer is NO, then the "how much" question does not appear in the data, meaning this value will be NA. On the other hand, for analysis purposes, you need to set those values to zero. Happens a lot! Let me show you how I handle this in R (also adding a few useful pieces of code that I use during data cleaning);
#Let's do this operation for only the numeric columns of the diamonds data.
#setnafill comes with the data.table package. For mapping the numeric columns, map_lgl comes with the purrr package.
library(purrr)
library(data.table)
diamonds_missing[ , purrr::map_lgl(diamonds_missing, is.numeric)]
## # A tibble: 53,940 × 7
## carat depth table price x y z
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 NA NA 326 3.95 NA 2.43
## 2 NA 59.8 61 326 NA 3.84 2.31
## 3 NA NA NA 327 NA 4.07 NA
## 4 0.29 62.4 58 NA NA 4.23 2.63
## 5 0.31 63.3 58 335 4.34 4.35 2.75
## 6 0.24 NA 57 336 3.94 3.96 NA
## 7 0.24 62.3 57 336 3.95 3.98 2.47
## 8 0.26 61.9 55 337 4.07 4.11 2.53
## 9 0.22 65.1 61 NA 3.87 3.78 2.49
## 10 0.23 NA NA 338 4 NA 2.39
## # … with 53,930 more rows
#Fill NA values with zero only for numeric variables.
setnafill(diamonds_missing[ , purrr::map_lgl(diamonds_missing, is.numeric)], fill=0)
vis_miss(diamonds_missing)
Now we do not have any missing values in the numeric columns, as they have been converted to (filled with) zeros. Job well done! setnafill is a great function that you can use; please do check it further.
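One caution for the income example above: you may not always want to zero out every numeric NA blindly. Here is a minimal sketch, with hypothetical column names income_yesno and income_amount and a tiny made-up data frame, that only fills the NAs created by the skip logic:
library(dplyr)

#Tiny hypothetical example: income_amount is NA because the question was skipped after a "No".
hh <- tibble(income_yesno  = c("Yes", "No", "Yes"),
             income_amount = c(1200,  NA,   800))

hh <- hh %>%
  mutate(income_amount = if_else(income_yesno == "No" & is.na(income_amount),
                                 0, income_amount))
hh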
4.2.3.3 Imputation of Missing Values with Regression
This method is amazing when you need to fill in values by category. Think of humanitarian data where you have some missing values in income.
# da1 <- impute_lm(dat, Sepal.Length ~ Sepal.Width + Species)
# da2 <- impute_median(da1, Sepal.Length ~ Species)
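Those commented lines come from the simputation package (shown with the iris data). A minimal runnable sketch on a fresh copy of the diamonds data, assuming simputation is installed, could look like this:
library(simputation)
library(missMethods)

#Fresh copy with random missingness, since we already zero-filled diamonds_missing above.
diamonds_na <- delete_MCAR(ggplot2::diamonds, p = 0.25)

#Impute missing prices with a linear model on carat and cut,
#then fill whatever the model could not predict with the median price per cut.
diamonds_na <- impute_lm(diamonds_na, price ~ carat + cut)
diamonds_na <- impute_median(diamonds_na, price ~ cut)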
Some useful lines of code;
#These pieces of code show you how to select columns that start or end with a given string.
#Not directly related to this concept, but I will add them here; very useful pieces of code.
#CashUsageReplace <- colnames(x1 %>% dplyr::select(ends_with("_cash",ignore.case = TRUE)))
#EcmenReplace <- colnames(x1 %>% dplyr::select(starts_with("ecmen_",ignore.case = TRUE)))
#Also, if you want to delete variables that have missing values in more than 80% of their records, you can use the piece of code given below.
#x2 <- x1[colSums(is.na(x1))/nrow(x1) < .8]
4.2.4 Outliers
Outliers are more bothersome than missing values. Humanitarian data contains outliers, but the trick is that you do not know whether they are true or not. Let's say you are collecting expenditure data disaggregated by components such as rent, health, education, celebrations, etc. You may end up with cases like 10,000-25,000 while the mean is 100 and the median is 150. So these cases are clearly outliers. Are they? When you contact your enumerators about these types of cases, you may realize some of them are key-entry errors, yet some have stories behind them. A man might take on a debt for a really serious health surgery, or a family holds a wedding that boosts their expenditure on celebrations.
To deal with those cases, I have two coping strategies (see the sketch after this list for flagging candidates first);
- Use the median if you are to report those values. Medians are more resistant to outliers.
- If you are doing simple reporting, include them (the outliers with stories, since they reflect the truth). If you are doing some sort of regression or any inferential analysis, remove them, because they will mess up your whole algorithm.
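Before deciding, it helps to flag the suspicious cases first and send them to the field teams. Here is a minimal sketch using the classic 1.5 * IQR rule on the diamonds price; this is just one possible rule and not the capping method used below.
library(dplyr)

q1  <- quantile(diamonds$price, 0.25)
q3  <- quantile(diamonds$price, 0.75)
iqr <- q3 - q1

#Flag values outside the 1.5 * IQR fences so you can follow them up with your enumerators.
diamonds_flagged <- diamonds %>%
  mutate(price_outlier = price < q1 - 1.5 * iqr | price > q3 + 1.5 * iqr)

table(diamonds_flagged$price_outlier)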
library(dlookr)
library(dplyr)

# The mean of the price variable before and after the imputation of outliers.
diamonds %>%
  mutate(price_imp = imputate_outlier(diamonds, price,
                                      method = "capping", no_attrs = FALSE)) %>%
  group_by(cut) %>%
  summarise(orig = mean(price, na.rm = TRUE),
            imputation = mean(price_imp, na.rm = TRUE))
## # A tibble: 5 × 3
## cut orig imputation
## <ord> <dbl> <dbl>
## 1 Fair 4359. 4258.
## 2 Good 3929. 3823.
## 3 Very Good 3982. 3864.
## 4 Premium 4584. 4414.
## 5 Ideal 3458. 3362.
# If the variable of interest is a numerical variable:
price <- imputate_outlier(diamonds, price)
plot(price)
4.3 Exploratory Data Analysis (EDA)
There are some really cool packages for EDA in R. The packages and code below give really long, boring reports. They go variable by variable and report;
- Information of variables (n, % missing, unique count etc.)
- Univariate analysis
- Distributions, normality test, transformations
- Comparison of variables by distributions
- Outlier densities and more…
Here is the deal: these outputs are very long and it takes time to read them all. But I strongly recommend creating and reading these reports. If you go somewhere as a delegate and your task is to write a report on monitoring survey data, this exercise allows you to become intimate with the data, all of the variables and their distributions. Spending an hour with this output will make your data analysis step far more efficient.
library(SmartEDA)
ExpReport(diamonds, op_file = "smarteda.html")

library(dlookr)
diagnose_report(diamonds)

diamonds %>% eda_report(target = clarity, output_format = "html", output_file = "EDA_diamonds.html")
Kindly run this code and check the output. It will be saved as "EDA_diamonds.html".
4.4 Designing the Logframe - Calculation of Exampled Indicators
On this page, you will see basic logframe indicators for cash-based interventions. As mentioned, these indicators are generally accepted, including by ECHO. I will also include documentation for the mentioned indicators. Personally, I do love the WFP VAM Resource Centre. I would also recommend checking this amazing cookbook: https://www.icrc.org/en/publication/4505-economic-security-indicators-cookbook
Also, it would be nice to go and check https://www.indikit.net/ where you can find detailed explanations and examples of most humanitarian indicators.
Disaggregation of indicators is important. Here, disaggregation is only done for the resilience capacity score to show how I personally do it.
Visualization of the indicators and the corresponding code will be shared in the "dashboards" section.
4.4.1 RCSI (Reduced Coping Strategy Index)
https://resources.vam.wfp.org/data-analysis/quantitative/resilience/resilience-capacity-score-rcs
PAB$rCSI <- PAB$cope_lessexpfood * 1 +
  PAB$cope_borrowfood * 2 +
  PAB$cope_reduceportionsize * 1 +
  PAB$cope_reduceadultfood * 3 +
  PAB$cope_reducemealno * 1
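Here the cope_* variables are assumed to hold the number of days (0-7) each strategy was used in the last seven days. If you also need a categorical version of rCSI, a commonly used set of cut-offs (please double-check against your own programme or IPC guidance before reporting) is 0-3 low, 4-18 medium and 19 or more high coping:
PAB$rCSI_cat <- ifelse(PAB$rCSI <= 3, "LowCoping",
                       ifelse(PAB$rCSI <= 18, "MediumCoping", "HighCoping"))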
4.4.2 FCS (Food Consumption Score)
https://resources.vam.wfp.org/data-analysis/quantitative/food-security/food-consumption-score
PAB$FCS <- PAB$cons_cereal*2 + PAB$cons_pulses*3 + PAB$cons_veg*1 + PAB$cons_fruit*1 +
  PAB$cons_meat*4 + PAB$cons_dairy*4 + PAB$cons_sugar*0.5 + PAB$cons_oil*0.5

PAB$FCG <- ifelse(PAB$FCS <= 28, "PoorConsumption",
                  ifelse(PAB$FCS < 42, "Borderline", "AcceptableConsumption"))
prop.table(table(PAB$FCG))*100
##
## AcceptableConsumption Borderline PoorConsumption
## 43.03109 29.43005 27.53886
4.4.3 LCSI (Livelihood Coping Strategy Index)
#Extract
livelihoodcoping <- PAB %>% dplyr::select(dplyr::starts_with("lhood_"))

#Recode
vars <- names(livelihoodcoping)
livelihoodcoping <- livelihoodcoping %>%
  mutate(across(all_of(vars), ~ recode(., "No" = 0, "Yes" = 1)))

#LCSI Calculation
livelihoodcoping$LCSI <-
  #Stress
  (livelihoodcoping$lhood_ST_soldhhasset + livelihoodcoping$lhood_ST_spentsavings + livelihoodcoping$lhood_ST_foodcredit +
     livelihoodcoping$lhood_ST_borrowmoney + livelihoodcoping$lhood_ST_unusualfood) +
  #Crisis
  (livelihoodcoping$lhood_CR_soldprodasset + livelihoodcoping$lhood_CR_noschool + livelihoodcoping$lhood_CR_lesshealth +
     livelihoodcoping$lhood_CR_lessedu) * 2 +
  #Emergency
  (livelihoodcoping$lhood_EM_hhmove + livelihoodcoping$lhood_EM_childwork + livelihoodcoping$lhood_EM_beg + livelihoodcoping$lhood_EM_return) * 3

PAB$LCSI <- livelihoodcoping$LCSI
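Besides the weighted score, many reports also categorize each household by the most severe type of strategy it used (none, stress, crisis or emergency). Here is a minimal sketch building on the recoded 0/1 columns above; please verify the grouping logic against your own indicator guidance:
library(dplyr)

livelihoodcoping <- livelihoodcoping %>%
  mutate(
    used_stress    = rowSums(across(starts_with("lhood_ST_"))) > 0,
    used_crisis    = rowSums(across(starts_with("lhood_CR_"))) > 0,
    used_emergency = rowSums(across(starts_with("lhood_EM_"))) > 0,
    LCSI_cat = case_when(
      used_emergency ~ "EmergencyCoping",
      used_crisis    ~ "CrisisCoping",
      used_stress    ~ "StressCoping",
      TRUE           ~ "NoCoping"))

PAB$LCSI_cat <- livelihoodcoping$LCSI_cat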
4.4.4 Resilience Capacity Score
https://resources.vam.wfp.org/data-analysis/quantitative/resilience/resilience-capacity-score-rcs
4.4.4.1 Calculation of RCS
PAB <- PAB %>%
  rowwise() %>%
  mutate(
    RCS = sum(c(Anticipatory + Absorptive + Transformative + Adaptive + Financial + Social +
                  Institutional + Humancapital + Information)))
mean(PAB$RCS)
## [1] 28.98212
4.4.4.2 Normalization of RCS
PAB$RCSX <- (((PAB$RCS / 9) - 1) / (5 - 1)) * 100
mean(PAB$RCSX)
## [1] 55.5059
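As a quick sanity check of the formula: each of the nine statements is scored from 1 to 5, so RCS / 9 is the mean statement score, subtracting 1 and dividing by (5 - 1) rescales it from the 1-5 range onto 0-1, and multiplying by 100 puts it on a 0-100 scale. With the mean RCS of 28.98212 above, ((28.98212 / 9) - 1) / 4 * 100 = 55.5059, which matches the mean RCSX.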
4.4.4.3 Categorization of RCS
PAB$RCSG <- ifelse(PAB$RCSX < 33, "LOWRCS",
                   ifelse(PAB$RCSX < 66, "MediumRCS", "HighRCS"))
table(PAB$RCSG)
##
## HighRCS LOWRCS MediumRCS
## 759 74 3027
4.4.4.4 Percentage of Groups
prop.table(table(PAB$RCSG)) * 100
##
## HighRCS LOWRCS MediumRCS
## 19.663212 1.917098 78.419689
4.4.4.5 Disaggregation by Eligibility
PAB %>% dplyr::group_by(Programme_Status, Control) %>% summarise(mean(RCSX))
## # A tibble: 4 × 3
## # Groups: Programme_Status [2]
## Programme_Status Control `mean(RCSX)`
## <chr> <chr> <dbl>
## 1 Non-Recipient Controlz 54.8
## 2 Non-Recipient Testz 56.6
## 3 Recipient Controlz 55.0
## 4 Recipient Testz 55.5
4.4.4.6 Disaggregation by GenderHHH
PAB %>% group_by(gender_hh, Control) %>% summarise(mean(RCSX))
## # A tibble: 4 × 3
## # Groups: gender_hh [2]
## gender_hh Control `mean(RCSX)`
## <chr> <chr> <dbl>
## 1 female Controlz 54.2
## 2 female Testz 56.5
## 3 male Controlz 55.0
## 4 male Testz 56.0
4.4.4.7 Disaggregation by Stratum
PAB %>% group_by(stratum, Control) %>% summarise(mean(RCSX))
## # A tibble: 10 × 3
## # Groups: stratum [5]
## stratum Control `mean(RCSX)`
## <chr> <chr> <dbl>
## 1 RegionA Controlz 56.1
## 2 RegionA Testz 55.9
## 3 RegionB Controlz 53.7
## 4 RegionB Testz 56.3
## 5 RegionC Controlz 54.5
## 6 RegionC Testz 55.7
## 7 RegionD Controlz 55.1
## 8 RegionD Testz 56.3
## 9 RegionE Controlz 55.1
## 10 RegionE Testz 56.3
4.4.4.8 Individual Statement Score Calculation
a <- colMeans(subset(PAB, select = c(Anticipatory, Absorptive, Transformative, Adaptive, Financial, Social,
                                     Institutional, Humancapital, Information)),
              na.rm = TRUE)
a <- as.matrix(a)
colnames(a) <- "AverageValue"
a
## AverageValue
## Anticipatory 5.000000
## Absorptive 2.996632
## Transformative 3.030829
## Adaptive 2.964249
## Financial 2.992487
## Social 3.024611
## Institutional 2.984197
## Humancapital 2.948705
## Information 3.040415
4.4.4.9 Resilience Capacity Score Spider Map
sResilienceCapacities <- PAB %>% group_by(Control) %>% summarise(Anticipatory = mean(Anticipatory),
                                                                 Absorptive = mean(Absorptive),
                                                                 Transformative = mean(Transformative),
                                                                 Adaptive = mean(Adaptive))

sRowMax <- c(5, 5, 5, 5)
sRowMin <- c(1, 1, 1, 1)

sResilienceCapacities <- rbind(sResilienceCapacities, sRowMax, sRowMin)
sResilienceCapacities <- as.data.frame(sResilienceCapacities)
sResilienceCapacities <- data.frame(sResilienceCapacities[,-1], row.names = sResilienceCapacities[,1])
sResilienceCapacities <- sResilienceCapacities[c(3,4,1,2),]

# Fill colors
areas <- c(rgb(1, 0, 0, 0.25),
           rgb(0, 1, 0, 0.25))

#radarchart comes with the fmsb package.
library(fmsb)
radarchart(sResilienceCapacities,
           cglty = 1,        # Grid line type
           cglcol = "gray",  # Grid line color
           pcol = 2:4,       # Color for each line
           plwd = 2,         # Width for each line
           plty = 1,         # Line type for each line
           pfcol = areas)    # Color of the areas
4.4.5 Other SMART Logframe Indicators for CVA
- Average total debt / percentage of households with debt (see the sketch after this list)
- MEB-based analysis
- Exclusion and inclusion errors
- Resilience scores
- Beneficiary perception positive ratios
- ECMEN-based indicators (please kindly Google this aspect, very important)
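As an example for the first bullet, here is a minimal sketch with hypothetical column names (debt_yesno and debt_amount) and a tiny made-up data frame; replace it with your own household data:
library(dplyr)

#Toy example; debt_yesno and debt_amount are hypothetical column names.
hh_debt <- tibble(debt_yesno  = c("Yes", "No", "Yes", "No"),
                  debt_amount = c(500,   0,    1500,  0))

hh_debt %>%
  summarise(pct_hh_with_debt = mean(debt_yesno == "Yes") * 100,
            avg_debt_among_indebted = mean(debt_amount[debt_yesno == "Yes"]))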
Check more indicators at: https://www.icrc.org/en/publication/4505-economic-security-indicators-cookbook
Check more indicators at: https://www.indikit.net/