The Data Behind the Tanzanian Water Crisis

Nick Kachanyuk 2021-08-02 23 min read

Overview of this blog post

According to Water.org [1], Tanzania is experiencing a water and sanitation crisis. According to their records 4 million people lack access to safe water resources. People often spend a significant amount of time traveling long distances to collect water, and this burden often falls on women and girls. Using the data provided for the Pump it Up competition by DrivenData [2] along with Tanzania’s 2012 census report [3] we will go through the machine learning approach using XGBoost to create a model that can predict whether a water well is functional, needs repair, or non-functional. In addition, we will explore the features that were used in the model.

Background

Earlier this year, DrivenData hosted an educational data science competition that requires a machine learning approach to identify water pump functionality in Tanzania. The data provided for the competition contains variables that focus on the specific features of these water wells such as location of the well, construction year, record date, and quality of the water in the well. Identifying the functionality of water wells in Tanzania is important in addressing the water crisis in Tanzania but there is a greater opportunity to explore additional factors that may show why a particular water shortage is occurring in a given region of the country.

What is the problem?

Tanzania is experiencing a water crisis where about 4 million people lack access to safe water resources. While there is data on both Tanzania’s water well quality and the 2012 census, there is no attempt to combine these resources to assess the overall water situation in the country. Solving the water crisis requires both a problem-specific approach (identifying functionality of the wells) and a broader analysis of how different regions in the country are affected. Although there is no clear cause and effect pattern, identifying which regions are doing well in relation to others may help develop preventative strategies for the future and can also shed light on the challenges that the citizens of this country are facing when trying to access safe drinking water.

Why is it important to solve?

In 2016, Water.org found that Tanzania is eligible for what they call a “water credit solution” which allows for lending programs to households and water companies. My project attempts at providing practical guidelines in solving the water crisis by first addressing the functionality of the water wells in the country. Combining socioeconomic data into the analysis may also help answer questions such as how the water crisis is persistent for a given region. Examining the socioeconomic differences between regions may help get at issues like, which regions are experiencing water access and quality issues the most.

The Data

Outcome variable in the model

The dependent variable in the model is called status_group. This is a categorical variable that contains three classes: functional, functional needs repair (FNR), and non-functional. Since this is a categorical variable, a supervised machine learning classification approach will be used. The best model will be selected via the highest Cohen’s kappa score.

Kappa is about how much better your model classifier is performing over the performance of a classifier that simply guesses at random according to the frequency of each class. Kappa takes on a range of values from -1 to 1.

According to [4], Kappa scores can be categorized as follows:

less than 0.20 (not good)
0.21 to 0.40 (fair)
0.41 to 0.60 (pretty good)
0.61 to 0.80 (great)
greater than 0.80 (almost perfect)

What types of features were considered?

The features and their descriptions from the Pump it Up data can be found here: [http://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/]

The features from the Tanzania 2012 census are:

total_region_pop: total population for a given Tanzanian region
households_per_region: number of households by region
mhead_avg_household_size_per_region: average number of male head of household by region
fhead_avg_household_size_per_region: average number of female head of household by region
literacy_rate_by_region: percent of the literate population by region
pct_unemployed_by_region: percent of the unemployed population by region
region_area_sq_mi: area in square miles of a given region

Before I get into the final model and results. I would like to briefly mention a few things about the data challenges encountered while wroking on this project.

Dealing with missing data

Here is a plot of the missing data:

The data appears to not be missing at random because if we look at installer and funder variables, there is a 100% match in the rows where the data is missing. In addition, there were 4 regions in which all the observations had a zero value for amount_tsh variable. An imputation strategy is needed because we can not simply delete the rows where the data is missing because the data is not missing at random. I will not elaborate too much on this in this post, because it is a whole process of its own, but the code for this entire project can be found in my GitHub repository [https://github.com/nickkachanyuk/tanzania_project_files].

Looking at the missing values graph we can exclude variables like scheme_name, scheme_management, installer, and funder from further analysis. The wpt_name is a unique identifying name value for each observation and can also be excluded from the analysis.

Target variable: class imbalance or unnecessary classes present?

Class imbalance is an important topic to discuss when preparing data for predictive modeling. From the graph below we can see that the functional class is the most popular class in the target variable (status_group).

## # A tibble: 3 x 2
##   status_group            well_count
##   <fct>                        <int>
## 1 functional                   32259
## 2 functional needs repair       4317
## 3 non functional               22824

Class imbalance can be problematic for some classification problems because a model can become ‘fixed’ on the most frequent class and fail to learn as much about the less frequent classes thus decreasing the accuracy in the predictions. A solid strategy is to upscale a less frequent class by generating more observations that resemble that class.

Another possible approach is to collapse FNR class with either functional or non-functional category. This will result in only two factor levels for the target variable, decrease the gap in the class imbalance, and also impose a more general classification system. If a pump needs repair it is likely to become non-functional in the near future and I think it makes sense to have two classes that could be: functional pumps versus pumps that need to fixed. This becomes more evident as I discuss the final features and how well the three classes are distributed among them.

Too many features and dimensionality reduction

The number of predictor variables originally present in the data set was 39 coming from “Pump it Up” data set and 9 additional variables were added from the 2012 Tanzanian census report for a total of 48 variables. That number quickly grew to 84 predictors after some of the categorical variables were converted to dummy variables to be used in the principal component analysis and machine learning algorithm.

Having a large number of predictors (features) is problematic because it can lead to overfitting where the noise of the data is picked up by the machine learning algorithm. This can cause bad out of sample (test) performance of the algorithm on new data. It is also more difficult to explain a model that has many features and dimensionality reduction techniques like PCA help make the data more concise. Later in this blog I will mention how the variables from the principal component analysis were created and what they represent.

The model

For this project I chose to focus on the XGBoost model. The data has several outliers which XGBoost handles well. XGBoost is also good for large sized data sets that other models take longer to process and is less prone to overfitting thus making it notorious for good model performance.

On the other hand, XGBoost model is difficult to interpret when the tree plot has too many nodes (which was the case for my best performing model). It is also a bit more challenging to tune due to having many hyperparameters.

Hyperparameter tuning

XGBoost tree model has the following hyperparameters [5]:

I selected what seemed to be most appropriate hyperparameter range values based on the definitions of the hyperparameters and also using cross-validation during the model tuning.

Below are hyperparameter values of the best performing XGBoost model and also a visual representation of hyperparameter tune grid.

##    nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 61    1000         8 0.05     0              0.9                5       0.5

It seems classifiers with higher values of max_depth perform better than more shallow tree classifiers, and the lower to intermediate eta values are a good combination as well.

Next, let’s examine the confusion matrix.

Confusion matrix

## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                functional functional needs repair non functional
##   functional                    7101                     608           1366
##   functional needs repair        206                     307             72
##   non functional                 764                     142           4284
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7873          
##                  95% CI : (0.7807, 0.7939)
##     No Information Rate : 0.5435          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.599           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: functional Class: functional needs repair
## Sensitivity                     0.8798                        0.29044
## Specificity                     0.7088                        0.97984
## Pos Pred Value                  0.7825                        0.52479
## Neg Pred Value                  0.8320                        0.94742
## Prevalence                      0.5435                        0.07118
## Detection Rate                  0.4782                        0.02067
## Detection Prevalence            0.6111                        0.03939
## Balanced Accuracy               0.7943                        0.63514
##                      Class: non functional
## Sensitivity                         0.7487
## Specificity                         0.9007
## Pos Pred Value                      0.8254
## Neg Pred Value                      0.8511
## Prevalence                          0.3853
## Detection Rate                      0.2885
## Detection Prevalence                0.3495
## Balanced Accuracy                   0.8247

The model classifies functional wells better than any other target class. It seems to struggle with classifying FNR wells and does a mediocre job at classifying non-functional wells. Collapsing FNR and non-functional class may seem like a good suggestion for further analysis considerations.

The Kappa value of 0.60 shows that this is a good classifier after all and it seems like there is no apparent overfitting.

Examining the improtance of features in the model and feature characterstics (distribution, suggestions, flaws, etc…)

Feature importance

The top four most important variables in the model are all feature engineered variables that were created via PCA (principal component analysis). This is probably because they contain a good proportion of the variance in the data. Surprisingly the original categorical variable “permit” is the least predictive even though one might believe that wells that have a permit are more likely to be functional because they have undergone more inspection and compliance.

Let’s examine each variable more closely now.

Amount tsh variable

The variable amount_tsh is the original variable in the Tanzania water well data. The description of this variable reads “total static head (amount water available to waterpoint)” with no units of measurement specified.

Examining the box plot shows that it is a reasonable variable to use after-all though. If we keep the outliers, which XGBoost handles well, they can be used to show the difference between the three target groups. It adds up, functional wells will indeed have more water available to waterpoint than any other two categories. Once again though, the argument for collapsing functional needs repair class and non-functional class maybe a purposeful suggestion to try out in future analysis.

Literacy rate by region variable

According to the VIP plot, literacy rate by region is in 9/12th position for importance which is also evident by how the IQR ranges overlap for each target category. Once again collapsing FNR(functional needs repair) with non-functional is recommended as an avenue to explore.

Perecent unemployed by region

According to VIP plot, this variable is 10/12th place for importance. There also seems to be a pattern for how importance is measured. Reading the following article [6] describes how importance is calculated:

“For classification problems, an area under the ROC curve (AUC) statistic can be used to quantify predictor importance. The AUC statistic is computed by using the predictor x as input to the ROC curve. If x can reasonably separate the classes of Y, that is a clear indicator that x is an important predictor (in terms of class separation) and this is captured in the corresponding AUC statistic.”

The target variable does not separate well in this predictor. This variable could be further dropped from analysis which may improve the accuracy of the model.

Well strain variable

This variable is 6/12th position on importance plot. Well strain is a variable that was feature engineered using the following formula:

\[wellstrain = ((population/totalregionpop) * regionpopdensity)\]

where population is the number of people living around the well
total_region_pop is the total number of people living in a given region of Tanzania
region_pop_density is total_region_pop/region_area_sq_mi

I wonder if there is any difference in the IQR ranges of the well strain variable between the different status groups?

## # A tibble: 3 x 2
##   status_group               iqr
##   <fct>                    <dbl>
## 1 functional              0.0185
## 2 functional needs repair 0.0144
## 3 non functional          0.0185

There is indeed 100% overlap between functional and non-functional groups in their IQR. That’s not too informative but if we look at the outliers we can see that for non-functional class there is a greater range than for functional and FNR classes. My understanding is that these outliers were useful in separating the target classes for this variable, which is not ideal because outliers are just few observations in the data and usually are not representative of the general pattern.

Permit variable de-dummified

From the density plot we can see that the permit variable distribution does not separate well between the target classes and probably the reason behind the dummy versions scoring 11th and last place on the variable importance graph.

Handpump groundwater shallow well principal component variable

Handpump_groundwater_shallow_well is a feature engineered variable that was created via PCA. The distribution of the target variable classes within this variable is shown below.

The density plot shows that there isn’t much separation of the target variable classes within this principal component but the distributions are not exact matches. This variable is the 4th most important variable according to the variable importance plot yet it seems to show similar distribution patterns of the target variable in the same way the other less important features are doing. One explanation for this is that since this is a principal component variable, it captures more variance in the data than other variables in the data set would.

Let’s briefly explore how this variable was created via PCA.

PCA process

A collection of variables that described the characteristics of the well such as construction year, extraction type (handpump, motorpump, gravity, etc), water quality and quantity produced by the well, and other features were used in PCA.

In PCA, a principal component is a linear combination of variables. Much like in linear regression, how well a variable “fits” to a principal component is dictated by coefficients called factor loadings. These factor loadings do not have a restrictive range of -1 to 1 like in regression. Despite this difference a factor loading essentially signifies a similar concept; the magnitude of the factor loading shows how strong a given observation fits to a principal component. Larger values show stronger correlations and the sign dictates the direction of the correlation. Based on these loadings we can characterize what a principal component is about.

Below is a dataframe that shows this:

##                                     rowname          PC1          PC2
## 1             extraction_type_class_gravity -0.342751850 -0.185647268
## 2            extraction_type_class_handpump  0.335936525  0.059006948
## 3           extraction_type_class_motorpump -0.007075295 -0.005395664
## 4               extraction_type_class_other  0.106533974  0.172035206
## 5         extraction_type_class_submersible -0.046179143  0.043303786
## 6                        quality_group_good -0.177272987 -0.179835961
## 7                     quality_group_unknown  0.047618098  0.124080539
## 8                      source_type_borehole  0.105208562 -0.044169821
## 9                    source_type_river/lake -0.271409389  0.309188396
## 10                 source_type_shallow well  0.336194014  0.134268654
## 11                       source_type_spring -0.148146320 -0.474649430
## 12                 source_class_groundwater  0.301430490 -0.408878162
## 13                     source_class_surface -0.303101053  0.394957704
## 14 waterpoint_type_group_communal standpipe -0.378991249 -0.118053561
## 15          waterpoint_type_group_hand pump  0.346072935  0.064400990
## 16              waterpoint_type_group_other  0.099224599  0.131973271
##            PC3         PC4
## 1   0.02380424  0.21930836
## 2   0.33288005 -0.04802169
## 3  -0.15612971 -0.27955970
## 4  -0.30114945  0.29609450
## 5  -0.11524774 -0.37816994
## 6   0.31115799  0.14650125
## 7  -0.29232852 -0.01395835
## 8  -0.17396980 -0.49015934
## 9   0.14596129 -0.06131773
## 10  0.13921953  0.22970403
## 11 -0.10574600  0.21885392
## 12 -0.13046322  0.01418860
## 13  0.15754372 -0.02088470
## 14 -0.09686690 -0.17848770
## 15  0.32896234 -0.04411451
## 16 -0.32216130  0.31381435

We can see that the variables extraction_type_class_handpump, source_class_groundwater, and source_type_shallow well have the largest positive values for principal component 1. Here it makes sense to rename this principal component to something like handpump_groundwater_shallow_wells principal component.

Looking at the density plot above we can see how the observations(rows) of the data are distributed for this principal component and also how the target variable classes are distributed. It seems that functional and FNR wells are distributed towards more negative loading values (with some exceptions) indicating that these wells usually do not rely on handpump extraction methods from shallow wells. Whereas non-functional wells seem to be distributed towards more positive loading values indicating that these wells tend to rely on handpump extraction from shallow and/or groundwater wells.

Other extraction from rivers and lakes principal component variable

Other_extraction_from_rivers_lakes is another feature engineered variable that was created via PCA. The distribution of the target variable classes within this variable is shown below.

This principal component was created using the same variables and process used to create the handpump groundwater shallow well principal component variable.

Again, there is not much separation in the distribution of the status group classes. It seems that functional wells are more likely to be negatively correlated with this principal component than FNR and non-functional wells.

The table below again shows how this principal component (PC2) was conceptualized.

##                                     rowname          PC1          PC2
## 1             extraction_type_class_gravity -0.342751850 -0.185647268
## 2            extraction_type_class_handpump  0.335936525  0.059006948
## 3           extraction_type_class_motorpump -0.007075295 -0.005395664
## 4               extraction_type_class_other  0.106533974  0.172035206
## 5         extraction_type_class_submersible -0.046179143  0.043303786
## 6                        quality_group_good -0.177272987 -0.179835961
## 7                     quality_group_unknown  0.047618098  0.124080539
## 8                      source_type_borehole  0.105208562 -0.044169821
## 9                    source_type_river/lake -0.271409389  0.309188396
## 10                 source_type_shallow well  0.336194014  0.134268654
## 11                       source_type_spring -0.148146320 -0.474649430
## 12                 source_class_groundwater  0.301430490 -0.408878162
## 13                     source_class_surface -0.303101053  0.394957704
## 14 waterpoint_type_group_communal standpipe -0.378991249 -0.118053561
## 15          waterpoint_type_group_hand pump  0.346072935  0.064400990
## 16              waterpoint_type_group_other  0.099224599  0.131973271
##            PC3         PC4
## 1   0.02380424  0.21930836
## 2   0.33288005 -0.04802169
## 3  -0.15612971 -0.27955970
## 4  -0.30114945  0.29609450
## 5  -0.11524774 -0.37816994
## 6   0.31115799  0.14650125
## 7  -0.29232852 -0.01395835
## 8  -0.17396980 -0.49015934
## 9   0.14596129 -0.06131773
## 10  0.13921953  0.22970403
## 11 -0.10574600  0.21885392
## 12 -0.13046322  0.01418860
## 13  0.15754372 -0.02088470
## 14 -0.09686690 -0.17848770
## 15  0.32896234 -0.04411451
## 16 -0.32216130  0.31381435

Interestingly enough this variable is the most important variable in the XGBoost model.

Densely populated wells principal component variable

This is another feature engineered variable created from PCA. The data frame below shows which variables were used in this PCA; these variables were about certain demographics of the regions in Tanzania such as the population around the well, total region population, number of households per region, and region population density.

##                               rowname         PC1         PC2          PC3
## 1                          population  0.03862235 -0.08444050 -0.994656812
## 2                    total_region_pop -0.56417133 -0.28661307  0.015408068
## 3               households_per_region -0.62128558 -0.03872151 -0.005305968
## 4 mhead_avg_household_size_per_region  0.12556695 -0.68710750  0.045236312
## 5 fhead_avg_household_size_per_region  0.19079984 -0.65843206  0.078240323
## 6                  region_pop_density -0.49198742 -0.05978208 -0.047163454
##           PC4
## 1  0.03548992
## 2  0.37270229
## 3  0.31017649
## 4  0.03053786
## 5 -0.13860836
## 6 -0.86225238

The dense populated wells is represented as PC1 in the table above. Let’s examine how the target variable is distributed for this variable.

A lot of overlap for this variable and is probably the reason why it placed 7/12 on the variable importance graph. Interestingly enough functional wells tend to be located in more densely populated areas than FNR and non-functional wells although this difference is not too great. In my opinion, it may be possible to complete further studies without this variable being included.

Sparsely populated wells principal component variable

This is another feature engineered variable created from PCA. The same variables used in creating the densely populated wells variable were also used for this variable. In the table below this variable is labeled as PC2.

##                               rowname         PC1         PC2          PC3
## 1                          population  0.03862235 -0.08444050 -0.994656812
## 2                    total_region_pop -0.56417133 -0.28661307  0.015408068
## 3               households_per_region -0.62128558 -0.03872151 -0.005305968
## 4 mhead_avg_household_size_per_region  0.12556695 -0.68710750  0.045236312
## 5 fhead_avg_household_size_per_region  0.19079984 -0.65843206  0.078240323
## 6                  region_pop_density -0.49198742 -0.05978208 -0.047163454
##           PC4
## 1  0.03548992
## 2  0.37270229
## 3  0.31017649
## 4  0.03053786
## 5 -0.13860836
## 6 -0.86225238

Once again there is a lot of overlap between the different target classes in the distribution for this variable. It seems that to some degree the functional wells tend to have slightly more sparsely populated wells than functional and FNR wells.

It is also no surprise that this variable placed 8/12 on variable of importance plot which suggests that this variable can possibly be removed from further analysis in the future.

Medium to large sized, southern and eastern regions principal component variable

Another feature engineered variable via PCA. This variable was created using geographic predictors such as altitude (gps_height), longitude, latitude, region name, and basin location. Below is the data frame used to conceptualize this variable.

##                rowname         PC1         PC2          PC3         PC4
## 1           gps_height  0.05619224 -0.21141683  0.477726573 -0.05803286
## 2            longitude  0.34653888 -0.36707572 -0.188749378  0.05053039
## 3             latitude -0.46408156 -0.16351403 -0.016820011  0.19658503
## 4    region_area_sq_mi  0.31591239  0.20510453  0.097559739  0.33034008
## 5       basin_Internal -0.03518589 -0.06380350  0.147865140  0.33925344
## 6     basin_Lake Nyasa  0.13493181  0.15380358  0.117293684 -0.26553047
## 7  basin_Lake Victoria -0.37969426  0.10895805 -0.231601633 -0.12772004
## 8        basin_Pangani -0.03477895 -0.51353837  0.075264572 -0.10273304
## 9    basin_Wami / Ruvu  0.12985574 -0.02365658 -0.316719971  0.29133059
## 10       region_Iringa  0.13641520  0.03458970  0.262524000 -0.17685527
## 11  region_Kilimanjaro -0.04818027 -0.37948085  0.060506902 -0.12153959
## 12        region_Mbeya  0.07990353  0.19458539  0.008310275 -0.26606545
## 13     region_Morogoro  0.16192694  0.06070716 -0.109406661  0.25655965

Looking at PC1 we can see that the variables that take on large and positive values are longitude and region area in square miles, and large negative value is shown for latitude. According to [7], positive longitudinal values indicate more eastern locations and negative latitudinal values indicate more southern locations.

Let’s look at the target variable distribution for this variable.

Not much separation is happening here either but looking at the functional group peaks we can see that functional wells tend to be clustered in more southern and eastern regions of Tanzania. This variable is the second most important variable in the variable of importance plot.

Small to medium sized, northern and western regions principal componenet variable

This is the last variable we will examine that was used in the XGBoost model. This variable was also feature engineered using the same geographic variables in PCA, and is shown as PC2 in the data frame below.

##                rowname         PC1         PC2          PC3         PC4
## 1           gps_height  0.05619224 -0.21141683  0.477726573 -0.05803286
## 2            longitude  0.34653888 -0.36707572 -0.188749378  0.05053039
## 3             latitude -0.46408156 -0.16351403 -0.016820011  0.19658503
## 4    region_area_sq_mi  0.31591239  0.20510453  0.097559739  0.33034008
## 5       basin_Internal -0.03518589 -0.06380350  0.147865140  0.33925344
## 6     basin_Lake Nyasa  0.13493181  0.15380358  0.117293684 -0.26553047
## 7  basin_Lake Victoria -0.37969426  0.10895805 -0.231601633 -0.12772004
## 8        basin_Pangani -0.03477895 -0.51353837  0.075264572 -0.10273304
## 9    basin_Wami / Ruvu  0.12985574 -0.02365658 -0.316719971  0.29133059
## 10       region_Iringa  0.13641520  0.03458970  0.262524000 -0.17685527
## 11  region_Kilimanjaro -0.04818027 -0.37948085  0.060506902 -0.12153959
## 12        region_Mbeya  0.07990353  0.19458539  0.008310275 -0.26606545
## 13     region_Morogoro  0.16192694  0.06070716 -0.109406661  0.25655965

Negative longitudinal values indicate that this principal component is more about the western regions of Tanzania. The latitudinal values are slightly less negative values than in PC1 which indicates that that PC2 is more about northern regions of Tanzania. Also, region area value is slightly less positive than PC1, indicating that the regions that load well onto PC2 are small to medium sized.

Let’s look at the target variable distribution for this variable.

The FNR wells seem to be mostly clustered in the northern and western regions of Tanzania. Looking at the functional well group, we can see the two tails on the negative x-axis further support the idea that functional wells tend to be located in southern and eastern regions of Tanzania rather than in the north and/or west.

Overall, this variable is placed third in the variable of importance plot.

Conclusion

Using XGBoost tree model, an algorithm that predicts the status of the well was created. The Kappa score for the test set was 0.599 indicating that XGBoost did a solid job in determining which wells were functional, needed repairs, or non-functional.

The trained model includes 12 predictors. After exploring the predictors via data visualizations, there seems to be evidence that the 6 least important variables in the importance plot may be removed for future projects dealing with this data. The 6 least important variables are:

dense_populated_wells
sparsely_populated_wells
literacy_rate_by_region
pct_unemployed_by_region
permit_yes
permit_unknown

Looking at the 6 remaining predictors, the following conclusions can be made:

Functional wells have more water available to waterpoint than FNR and/or non-functional wells.
Non-functional wells experience larger strain than functional and/or FNR wells.
Functional and functional need repair wells usually do not rely on handpump extraction methods and/or these wells are usually not shallow wells.
Functional need repair wells tend to use “other” extraction methods and/or the water source for these wells comes from either rivers or lakes.
In general, functional wells are more present in the southern and/or eastern parts of Tanzania.
Functional need repair wells are more common in the northern and/or western parts of Tanzania, and non-functional wells also follow this pattern but to a lesser degree.

How can these findings be used in the real-world

This project attempts to create an initial framework for dealing with water-shortage issues in Tanzania. One intervention strategy that may be useful is to continue monitoring the amount of water present at a given well (amount_tsh variable). For this variable to be useful in future predictions, the units of measurement would first need to be defined so that there are no errors during data collection. In addition, a threshold value would need to be set that can serve as indicator if a well is at risk for being non-functional. According to the data visualization for this variable, this threshold boundary is located somewhere among the outlier values for the amount_tsh variable.

Well strain is also a measurement variable that can indicate functionality of a given well. The data visualization of this variable shows that non-functional wells have larger outliers for this variable than functional and FNR wells.

Finally, indicator variables created by PCA can help highlight the geographic and demographic differences that may exist between the three classes of wells, and characteristics of a given well are also useful indicators. Unfortunately, due to missing data issues, it was difficult to address the differences between regions of Tanzania. This was because four regions of Tanzania had no valid observations for variables like amount_tsh. If addressing water quality by region is desired, then more accurate data collection is needed and is something that could be done for further analysis.

Future project direction

The following is a list of possible things to explore in the future for this project:

Convert this from a multiclass to a binary classification problem to address class imbalance and lack of variance between FNR and non-functional classes.
Address the water well functionality by regions. This requires for the data to be recollected (at least for the four regions where data is missing). The four regions are: Dodoma, Kagera, Mbeya, and Tabor
Try a new model that includes only the top 6 predictors in the variable of importance plot.

References

[1] “Tanzania’s Water Crisis - Tanzania’s Water in 2021.” Water.org, water.org/our-impact/where-we-work/tanzania/.
[2] DrivenData. “Pump It Up: Data Mining the Water Table.” DrivenData, www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/.
[3] “2012_Census_General_Report.pdf” National Bureau of Statistics Ministry of Finance Dar Es Salaam , Mar. 2013.
[4] Landis, J. Richard, and Gary G. Koch. “The Measurement of Observer Agreement for Categorical Data.” Biometrics, vol. 33, no. 1, 1977, pp. 159–174., doi:10.2307/2529310.
[5] Saraswat, Manish. “Beginners Tutorial on Xgboost and Parameter Tuning in R Tutorials & NOTES: Machine Learning.” HackerEarth, www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/.
[6] Greenwell, Brandon,M., and Bradley,C. Boehmke. “Variable Importance Plots—an Introduction to the Vip Package.” The R Journal, vol. 12, no. 1, 2020, doi:10.32614/rj-2020-013.
[7] “Understanding Latitude and Longitude.” Understanding Latitude and Longitude, University of Wisconsin-Madison, journeynorth.org/tm/LongitudeIntro.html.