Project Proposal

Nick Kachanyuk 2021-06-02 7 min read

Overview (1 paragraph)

According to Water.org, Tanzania is experiencing a water and sanitation crisis: by their records, 4 million people lack access to safe water resources. People often spend a significant amount of time traveling long distances to collect water, and this burden often falls on women and girls. Using the data provided by DrivenData for the Pump it Up competition along with Tanzania’s 2012 census report, it is possible not only to identify the functionality of water pumps in Tanzania but also to identify socioeconomic indicators of the different regions that may be contributing to the crisis. The goal of this project is to use data aggregation and machine learning techniques to classify water well functionality (functional, needs repairs, non-functional) and to identify the socioeconomic factors (population, household composition, economic activity, etc.) that may be contributing to the water crisis in a given region of Tanzania.

Context (2 paragraphs)

Background

Earlier this year, DrivenData hosted an educational data science competition that called for a machine learning approach to identifying the functionality of Tanzania’s water pumps. The data provided for the competition contains variables that describe specific features of these water wells, such as the location of the well, construction year, record date, and quality of the water. Identifying the functionality of water wells is important in addressing the water crisis in Tanzania, but there is a greater opportunity to explore additional socioeconomic factors that may show why water shortages occur in particular regions of the country.

What is the problem?

Tanzania is experiencing a water crisis in which about 4 million people lack access to safe water resources. While there is data on both Tanzania’s water well quality and the 2012 census, there is no attempt to combine these resources to assess the overall water situation in the country. Solving the water crisis requires both a problem-specific approach (identifying the functionality of the wells) and a broader analysis of how different regions in the country are affected. Although there is no clear cause-and-effect pattern, identifying which regions are doing well relative to others may help develop preventive strategies for the future and can also shed light on the challenges that citizens face when trying to access safe drinking water.

Why is it important to solve?

In 2016, Water.org found that Tanzania is eligible for what they call a “water credit solution,” which allows for lending programs to households and water companies. My project attempts to provide practical guidelines for solving the water crisis by first addressing the functionality of the water wells in each region of the country. Bringing socioeconomic data into the analysis may also help answer questions such as why the water crisis persists in a given region. Examining the socioeconomic differences between regions may help identify which regions are experiencing the most severe water access and quality issues.

Proposal (3 paragraphs)

What specific question are you trying to answer?

For this project I will use the DrivenData and Tanzania 2012 census report data to build a machine learning model that classifies whether a water well is functional, needs repairs, or is non-functional. In addition, I am curious to see how the problem varies by region. Learning which factors matter for the success or failure of a given region in addressing the water crisis provides an opportunity to design a specific, needs-based approach for each region, and thus to improve the overall water quality in the country.

How will you answer it?

What is your method?

The first task is to combine the DrivenData water well data with the socioeconomic data for each region from the 2012 census. An exploratory data analysis (EDA) will then be conducted to examine how the regions differ in socioeconomic factors such as population, household composition, and economic activity, as well as in the quality of their wells. Finally, a standard data science workflow will be used to engineer features, select important variables, and build a model that can predict the functionality of the water wells.
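The combining and summarizing steps described above could look roughly like the following pandas sketch. It is only an illustration under stated assumptions: the column names (`region`, `status_group`, `total_population`, `households`) and the status labels are assumed, and every value in the two toy tables is made up rather than taken from the DrivenData or census files.

```python
import pandas as pd

# Toy stand-ins for the two sources. All values are placeholders, not real data.
wells = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "region": ["Iringa", "Iringa", "Mara", "Arusha"],
    "status_group": [
        "functional",
        "non functional",
        "functional",
        "functional needs repair",
    ],
})
census = pd.DataFrame({
    "region": ["Iringa", "Mara", "Arusha"],
    "total_population": [1000, 2000, 3000],  # placeholder numbers
    "households": [200, 400, 600],           # placeholder numbers
})

# Left join: keep every well and attach the census indicators of its region.
# validate="many_to_one" raises if a region appears twice in the census table.
merged = wells.merge(census, on="region", how="left", validate="many_to_one")

# Simple EDA summary: share of each well status within each region.
status_share = pd.crosstab(
    merged["region"], merged["status_group"], normalize="index"
)
print(status_share)
```

A left join is used here so that no well is silently dropped when its region is missing from the census table; such rows would simply carry missing values in the census columns.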

What is your outcome variable(s)?

The outcome variable is the classification of a water well as functional, in need of repairs, or non-functional. This outcome variable is important for addressing the water crisis in Tanzania because water wells serve as a major source of water supply for the country.
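As a rough illustration of how a three-class outcome like this can be modeled, the sketch below fits a random forest with scikit-learn. The choice of a random forest is an assumption made for the example, not a method fixed by this proposal. The label strings, the `quantity` and `well_age` columns, and all rows are hypothetical toy data, so the fitted model says nothing about real wells.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed label strings for the three classes.
LABELS = ["functional", "functional needs repair", "non functional"]

# Synthetic toy rows (made up): ten copies of three prototypical wells.
toy = pd.DataFrame({
    "quantity": ["enough", "insufficient", "dry"] * 10,
    "well_age": [5, 20, 35] * 10,
    "status_group": LABELS * 10,
})

X = toy[["quantity", "well_age"]]
y = toy["status_group"]

# Stratify so each class keeps the same proportion in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# One-hot encode the categorical column; pass the numeric column through.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["quantity"]),
    remainder="passthrough",
)
model = make_pipeline(
    preprocess, RandomForestClassifier(n_estimators=50, random_state=0)
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(pd.Series(predictions).value_counts())
```

Wrapping the encoder and the classifier in one pipeline keeps the preprocessing fitted on the training split only, which avoids leaking information from the held-out wells.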

What types of features will you be looking at?

Independent variables (features) that describe a particular water well include: amount of water available at a given well, the date when the row was entered, who funded the well, altitude of the well, installer (organization that installed the well), longitude and latitude coordinates of the well, geographic water basin, region of the well, population around the well, who operates the water well, whether or not the waterpoint is permitted, year of construction for the well, kind of water extraction used, how the waterpoint is managed, cost of water, quality of water, quantity of water, source of water, and the type of waterpoint.

Socioeconomic variables include: total population by region, number of households by region, survival of parent (whether a parent in the household is alive or not) by region, education statistics by region, employment by region, and sources of drinking water by region.
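Some of the well-level variables listed above lend themselves to simple derived features. The sketch below computes a well's age at the time it was recorded from the record date and the construction year. The column names `date_recorded` and `construction_year`, the convention that a construction year of 0 means "unknown", and the three example rows are all assumptions made for illustration.

```python
import pandas as pd

# Made-up example rows; 0 is assumed to encode an unknown construction year.
wells = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-02-04", "2012-10-01"],
    "construction_year": [1999, 0, 2010],
})

recorded = pd.to_datetime(wells["date_recorded"])

# Replace the assumed "unknown" code with a missing value before subtracting,
# so that unknown years do not turn into ages of roughly 2000 years.
built = wells["construction_year"].where(wells["construction_year"] > 0)

# Age of the well, in years, at the time the row was recorded.
wells["well_age"] = recorded.dt.year - built
print(wells)
```

Leaving the age missing for wells with an unknown construction year keeps the decision about imputation explicit and separate from the feature definition.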

What do you expect/hope to find?

The first step is to understand how the regions differ in water well quality and socioeconomic indicators. For example, some regions may have disproportionately low access to water compared to others. Understanding why regions differ in access to water is crucial in helping solve the overall water crisis in Tanzania. In particular, knowing whether a region suffers from poor access to water resources because of its geographical location or because of its economic situation would help the local government and humanitarian agencies tackle the challenge effectively.

In addition to the exploratory data analysis (EDA), a machine learning approach is also desired. Such a model can help identify which waterpoints are functional and which are not. This not only addresses the water crisis directly but can also offer a framework for deciding which regions are more vulnerable to the water crisis.

Conclusion (2-3 paragraphs)

Summarize the problem and your solution

Developing a machine learning model to detect which waterpoints are functional is helpful in solving the water crisis in Tanzania, while the additional socioeconomic data can provide a more insightful explanation of why access to water is so uneven across the different regions of Tanzania. Uncovering the underlying factors allows for a strategic approach to the problem, with proper risk management and a resource allocation that is fair to the different populations living in Tanzania.

If applicable, describe how your solution could be important beyond your specific context–i.e. can it be generalized.

Understanding how socioeconomic and problem-related features contribute to differences in water access between regions is not only applicable to the water crisis. Other infrastructure quality assessment projects can be undertaken in the same way. This project is an opportunity to show how a proper selection of features can help tackle a problem with an analytical, data-driven approach. Knowing whether or not an issue exists is important, but for a corrective action to be implemented successfully, understanding the magnitude of the problem is crucial to allocating resources properly and fairly to a given region.

Discuss limitations and potential future directions to take the project

One limitation of this project is that the data is obtained from different sources. This requires proper aggregation of the data from DrivenData and the 2012 Tanzania census report. Thankfully, a solution is available: the data can be combined on a common region variable. There is also some concern about the data being collected at different points in time. The DrivenData dataset includes data that was collected from late 2002 to late 2013, while the census data was collected starting August 26th, 2012.
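Both concerns above can be checked before any modeling. The sketch below uses a merge indicator to list well regions that find no match in the census table, and measures how far each well record is from the census start date. The column names, the region spellings, and the record dates are hypothetical; only the census start date of August 26th, 2012 comes from the text above.

```python
import pandas as pd

# Made-up example rows for illustration only.
wells = pd.DataFrame({
    "region": ["Iringa", "Dar es Salaam", "Mara"],
    "date_recorded": pd.to_datetime(["2004-01-07", "2012-10-15", "2013-03-02"]),
})
# The second spelling differs on purpose to show an unmatched region.
census = pd.DataFrame({"region": ["Iringa", "Dar-es-Salaam", "Mara"]})

# indicator=True adds a "_merge" column telling where each row was found.
check = wells.merge(census, on="region", how="left", indicator=True)
unmatched = sorted(check.loc[check["_merge"] == "left_only", "region"].unique())
print("Regions without a census match:", unmatched)

# Signed distance, in years, between each well record and the census start.
CENSUS_START = pd.Timestamp("2012-08-26")
wells["years_from_census"] = (
    (wells["date_recorded"] - CENSUS_START).dt.days / 365.25
)
print(wells)
```

Records with a large absolute `years_from_census` could then be flagged or analyzed separately, since the census indicators may describe their region less accurately.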

With this project there is potential for different organizations to benefit from this analytical approach. As mentioned before, in 2016 Water.org began assisting local banking institutions in providing loans for water and sanitation needs. A proper identification of costs and needs can help such organizations make plans that don’t overlook any region and that approach each need at its appropriate level. This can provide a transparent view of how funding is used and help ensure that resources are optimized to meet every need.