Exploring the Tanzanian Water Well Data Further

Nick Kachanyuk 2021-08-05 5 min read

Introduction

Data science is a data-driven process that is iterative in nature. We explore the data, clean the data, make visualizations, build models, and repeat the process until our questions are answered.

In my previous blog post, “The Data Behind the Tanzanian Water Crisis”, I made suggestions about improving the data quality for further research. I will briefly go over what these suggestions were and my reasoning:

  • Convert the target variable from a multiclass to a binary classification problem to address class imbalance and lack of variance between FNR and non-functional classes.
    • The original data has imbalanced target classes. This is problematic because machine learning algorithms can become “fixed” on the most frequent class and make assumptions about the data based on that class more than the other classes. This can lead to decreased predictive model performance.
    • For the Tanzanian data specifically, there are 3 target classes (functional, functional needs repair, and non-functional). After examining the data visualizations, I realized that there isn’t much difference between functional needs repair (FNR) and non-functional target classes. It made sense to combine these two into a new class called “needs attention”. My logic was that if a well needs repairs it is likely to become non-functional in the near future. This preventative approach is useful because it may help reduce costs of repair (easier to repair functional than completely broken wells) and can help reduce the burden on the people relying on the well to provide them with water.
  • Reduce the number of predictors in the model from 12 to 6.
    • A model that has less independent variables (predictors) but similar accuracy to a model that has more predictors is often desired.
    • Interpreting results becomes easier when a smaller selection of predictors needs to be considered.

The Approach

In the previous blog post I built an XGBoost model that predicted the three target classes (functional, functional need repair, and non-functional) with 0.79 accuracy and 0.60 kappa. These values indicate that the model did a good job in predicting the target classes but there was an opportunity to improve the models further.

In addition to the baseline model from the previous blog post, I created three additional models. Below is a description of all 4 models:

  • model 1 (baseline model):
    • accuracy: 0.79
    • kappa: 0.60
    • 3 target classes: functional, functional need repair, non-functional
    • 12 predictors
  • model 2:
    • accuracy: 0.79
    • kappa: 0.54
    • 3 target classes: functional, functional need repair, non-functional
    • 6 predictors
  • model 3:
    • accuracy: 0.80
    • kappa: 0.60
    • 2 target classes: functional, needs attention
    • 12 predictors
  • model 4:
    • accuracy: 0.80
    • kappa: 0.60
    • 2 target classes: functional, needs attention
    • 6 predictors

As we can see there is not much difference in the performance of these models and any of these 4 models can be a good candidate. However, for the purpose of this exercise, I will compare the baseline model 1 with model 4. Compared to the baseline model, model 4 performs well considering that only the top 6 out of 12 predictors were used in model 4.

Let’s examine some visualizations!

The Tale of Two Models

Target variable histograms and variable of importance plots

Baseline model

Once again we can see the class imbalance, especially for the functional needs repair target class.

The 12 features that were used in the baseline model are shown above.

Model 4 (binary classification with 6 predictors)

The target classes are more balanced and concise compared to the baseline model target classes.

The order of importance for the top 6 variables for the new model did not change when compared with the baseline model. However, the magnitude of importance did change for some of the variable.

Let’s compare how the target variable is distributed among the 6 predictors for the baseline model 1 and the new model 4.

Target variable distribution for the top 6 predictors

Other extraction from rivers, lakes principal component variable

With model 4 graph we can see a slightly better picture: functional wells are less likely to use “other” extraction methods and rely less on water sources that come from rivers and/or lakes. Wells that need attention are more uniformly distributed for this principal component variable.

Medium to large sized, southern and eastern regions principal component variable

With model 4 graph we can see that functional wells tend to be located in medium to larger sized regions, and regions that are located in more eastern and/or southern parts of Tanzania.

Small to medium sized, northern and western regions principal componenet variable

Examining the model 4 graph we can see that functional wells tend to be located in northern and western regions to a lesser degree than wells that need attention, although there are many wells from both groups that reside in northern and western regions.

Handpump groundwater shallow well principal component variable

From the model 4 graph we can see that functional wells tend to rely less on handpump extraction methods and these wells are usually not shallow.

Amount tsh variable

Looking at model 4 graph, we can see that functional wells will indeed have more water available to waterpoint than wells that need attention.

Well strain

Model 4 graph shows that functional wells will experience less well strain than wells that need attention.

Recommendations

In this blog post we examined quite a few things about the data on the Tanzania’s well quality. With the results above, I make the following suggestions and observations:

  • Use model 4 (or a similar model) that uses 2 classes for the target variable and 6 predictors. As we saw, the model accuracy metrics are comparable and interpretation of the model results is more manageable compared to the baseline model.

  • Functional wells:

    • Tend to not rely handpump or “other” extraction methods compared to wells that need attention
    • Tend to be located more in southern and eastern regions of Tanzania
    • Have more water volume available within the well/waterpoint
    • Experience less well strain
  • Needs attention wells:

    • Rely more on handpump or “other” extraction methods compared to functional wells
    • Tend to be located more in northern and western regions of Tanzania
    • Have less water volume available within the well/waterpoint
    • Experience higher well strain