The Relationship Between Water Scarcity and Precipitation: A Linear Regression Project

I recently completed a regression project on the relationship between precipitation and water scarcity in countries around the world. Precipitation was measured as a depth in millimeters per year, a per-area measure, which eliminated the possibility of a country’s size skewing the results. Water scarcity was measured in deaths due to thirst, expressed as a percentage of each country’s total deaths. I used the programming language R to visualize the data, which came from two different sources. Although the data is from 2017, the findings of this project can still be applied to the world today. The primary result was that precipitation and water scarcity deaths show almost no linear correlation. Below is a shortened version of the project summary.


Examining the Existence of an Association Between Precipitation and Water Scarcity in Countries Around the World

Introduction:

One of the most pressing issues humanity currently faces is water scarcity, which affects billions worldwide and has significant implications for the future of society. According to the United Nations Children’s Fund (UNICEF), “four billion people…experience severe water scarcity for at least one month each year,” and “half of the world’s population could be living in areas facing water scarcity by as early as 2025.” However, the toll of water scarcity varies considerably between countries; in a country like the United States, it accounts for only approximately 0.01% of total deaths, while in Chad it accounts for roughly 10.62% of total deaths.


In order to solve this global issue, it is important to first examine which factors can be associated with water scarcity. Intuitively, it follows that if freshwater sources such as rivers, lakes, and groundwater are the predominant sources of drinking water, and they are replenished by precipitation, then lack of precipitation and water scarcity could be linked. This yields the question: is the amount of precipitation a country receives associated with the available drinking water in that country?


This project aims to determine if there is an association between these two variables by comparing the precipitation in 2017 in 179 countries around the world against their respective water scarcity deaths in 2017. Since the goal is to model water scarcity deaths, this will be the response variable; thus, precipitation will be the explanatory variable.


Data & Analysis: 

The following graphs illustrate the data: a histogram for each variable, a scatterplot of their association, and a residual plot.

Histograms:

The histogram for the explanatory variable, precipitation, shows a unimodal distribution that peaks at approximately 700 mm/year. The distribution is skewed to the right, with no visible gaps, and no points stand apart from the overall pattern, so there are no apparent outliers. Most of the points cluster on the left side of the graph, below 1500 mm/year. According to the five-number summary, the median is 1032.0 mm/year, while the mean is 1172.0 mm/year. This is what would be expected: because the distribution is skewed right, the mean is greater than the median. The range, a key measure of spread, is 3240.0 mm/year − 18.1 mm/year = 3221.9 mm/year.

The interquartile range is calculated as follows:

IQR = Q3 − Q1 = 1708.5 mm/year − 563.5 mm/year = 1145.0 mm/year

The fact that the maximum lies much farther above Q3 than the minimum lies below Q1 reinforces the right skew. Finally, the standard deviation of the dataset is 803.0296 mm/year, which means that a typical data point lies roughly 803 mm/year from the mean.

The histogram for the response variable, the percentage of deaths due to water scarcity, also shows a unimodal distribution. However, the peak is in the very first interval, between 0% and 1% of deaths. Like the precipitation histogram, this distribution is heavily skewed to the right, with no significant gaps, and again no points visibly break from the pattern, so there are no apparent outliers. The main cluster of points is on the left side of the graph, below 1% of deaths. According to the five-number summary, the median is 0.290% and the mean is 1.642%; a mean this far above the median supports the right-skewed observation. The range of the distribution is 10.620 percentage points.

The interquartile range is 2.735%.

Again, the maximum is much farther from Q3 than the minimum is from Q1. Finally, the standard deviation of the dataset is 2.3677 percentage points, indicating that a typical data point lies roughly 2.37 percentage points from the mean.
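The spread measures quoted for both variables can be re-derived from the five-number summaries alone. The following is a minimal Python sketch (the project itself used R, and its code is not reproduced here). The dictionary and function names are illustrative, the values are copied from the summary statistics in the appendix, and the third quartile for deaths is assumed to be 2.765, the value implied by the reported IQR and first quartile.

```python
# Five-number summaries copied from the summary statistics in the appendix.
# Precipitation is in mm/year; deaths are in % of total deaths.
SUMMARIES = {
    "precipitation": {"min": 18.1, "q1": 563.5, "median": 1032.0,
                      "q3": 1708.5, "max": 3240.0},
    # Assumption: q3 = 2.765, implied by the reported IQR (2.735) and Q1 (0.030).
    "deaths": {"min": 0.000, "q1": 0.030, "median": 0.290,
               "q3": 2.765, "max": 10.620},
}


def spread_measures(s):
    """Return the range, the IQR, and the two tail lengths used in the skew check."""
    return {
        "range": s["max"] - s["min"],
        "iqr": s["q3"] - s["q1"],
        "upper_tail": s["max"] - s["q3"],  # distance from Q3 up to the maximum
        "lower_tail": s["q1"] - s["min"],  # distance from the minimum up to Q1
    }


RESULTS = {name: spread_measures(s) for name, s in SUMMARIES.items()}

for name, m in RESULTS.items():
    # A longer upper tail than lower tail is consistent with right skew.
    skew = "right" if m["upper_tail"] > m["lower_tail"] else "left or symmetric"
    print(f"{name}: range={m['range']:.3f}, IQR={m['iqr']:.3f}, tails suggest {skew} skew")
```

For both variables the upper tail is longer than the lower tail, which matches the right skew described in the histograms.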

Scatterplot: 

The central part of this study is the scatterplot, which compares precipitation to water scarcity deaths. No clear pattern is visible. In a positive association, above-average values of one variable tend to occur with above-average values of the other (and below-average with below-average), whereas a negative association pairs above-average values of one variable with below-average values of the other. The graph alone does not clearly indicate either direction. The scatterplot has no easily identifiable form, though the points are clustered near water scarcity death percentages of zero. Visually, there also don’t appear to be any outliers.

Visually, a linear model does not seem well suited to the scatterplot. The points follow no discernible trend, so there is no evidence of a constant rate of change.

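Based on the summary of the model, the LSRL equation is approximately (with the intercept rounded):

predicted water scarcity death percentage = 1.6 + (6.212 × 10⁻⁵) × (precipitation in mm/year)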

Given this equation, a country that receives no precipitation is expected to have about 1.6% of its deaths result from water scarcity. The slope is 6.212 × 10⁻⁵ percentage points per mm/year. Because the slope is positive, an increase in precipitation corresponds to a predicted increase in water scarcity deaths. Scaled up, the slope means that for every additional 100 meters (100,000 mm) of annual precipitation, the predicted percentage of a country’s deaths due to water scarcity rises by about 6.2 percentage points. This is a massive amount of precipitation, as the maximum in the data was about 3.24 meters per year. An increase of 100 meters is virtually impossible, so the slope is negligible in practical terms.

The correlation r measures the strength and direction of the linear relationship. Taking the square root of the r² value given by the model, r is 0.02107. The small, positive value indicates that the linear relationship is positive but very weak, which is why it wasn’t clear from the scatterplot itself. The relationship, therefore, is not well described as linear. The r² value is 0.0004439, meaning that only 0.04439% of the variation in the percentage of water scarcity deaths is accounted for by the linear association with precipitation, which reinforces the idea that this data is not easily modeled linearly.

The regression equation can still be used to predict the percentage of total deaths due to water scarcity in a country. For example, in Belize, the precipitation in 2017 was 1705 mm/year. According to the model, this yields a predicted death percentage of 1.67%. The actual death percentage due to water scarcity was 0.46%, so the equation overestimated by 1.21 percentage points.
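The arithmetic in this section can be checked from the reported statistics. The sketch below is in Python rather than the R used for the project, and it rests on one assumption worth flagging: the intercept is not quoted to full precision in the text, so it is recovered here from the fact that a least-squares line always passes through the point of means. All variable names are illustrative.

```python
import math

SLOPE = 6.212e-5        # reported slope, percentage points per mm/year
R_SQUARED = 0.0004439   # reported r-squared
MEAN_PRECIP = 1172.0    # mm/year, from the summary statistics
MEAN_DEATHS = 1.642     # % of deaths, from the summary statistics
SD_PRECIP = 803.0296    # mm/year
SD_DEATHS = 2.3677      # percentage points

# Assumption: intercept recovered from the point of means, because a
# least-squares line passes through (mean of x, mean of y).
INTERCEPT = MEAN_DEATHS - SLOPE * MEAN_PRECIP


def predict(precip_mm):
    """Predicted water scarcity death percentage for a given precipitation."""
    return INTERCEPT + SLOPE * precip_mm


# Correlation: the positive square root is taken because the slope is positive.
r = math.sqrt(R_SQUARED)

# Consistency check: for a least-squares line, slope = r * (sd of y / sd of x).
slope_from_r = r * SD_DEATHS / SD_PRECIP

# Belize example: 1705 mm/year of precipitation, 0.46% observed deaths.
belize_pred = predict(1705.0)
belize_resid = 0.46 - belize_pred  # negative means the model overestimates

# Predicted change for an extra 100 meters (100,000 mm) of precipitation.
change_per_100m = SLOPE * 100_000

print(f"intercept ~ {INTERCEPT:.3f}, r ~ {r:.5f}, slope from r ~ {slope_from_r:.3e}")
print(f"Belize: predicted ~ {belize_pred:.2f}%, residual ~ {belize_resid:.2f}")
print(f"change per 100 m of precipitation ~ {change_per_100m:.2f} percentage points")
```

Under that assumption the intercept comes out near 1.57, which rounds to the 1.6% quoted above, and the Belize prediction lands within rounding of the reported 1.67%. The slope rebuilt from r and the two standard deviations also agrees closely with the reported slope, so the published figures are internally consistent.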

Residual Plot: 

The residual plot helps assess the appropriateness of the linear model. The points show no curved pattern, which suggests that no other simple form would clearly fit the data better than a line. However, because of the magnitude of the residuals, as well as the low r and r² values, the linear regression model will not predict the observed values with much accuracy. There is no fanning, so the line does not suit certain data points better than others. The visible lower boundary that the points hit at the bottom of the residual plot arises because the water scarcity death percentage cannot be negative: with predicted values this close to zero, a residual can never fall below the negative of the predicted value.
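The lower boundary in the residual plot follows directly from the definition of a residual (observed minus predicted). As a rough Python sketch, not the project's R code, the boundary can be traced using the reported slope and, as an assumption, an intercept recovered from the reported means; the function and constant names are illustrative.

```python
SLOPE = 6.212e-5                    # reported slope, percentage points per mm/year
# Assumption: intercept recovered from the reported means (1.642% and 1172.0 mm/year),
# since a least-squares line passes through the point of means.
INTERCEPT = 1.642 - SLOPE * 1172.0


def lowest_possible_residual(precip_mm):
    """Residual for a country whose observed death percentage is 0, its minimum."""
    predicted = INTERCEPT + SLOPE * precip_mm
    return 0.0 - predicted


# Evaluate the bound at the smallest and largest precipitation values in the data.
BOUND_AT_MIN = lowest_possible_residual(18.1)
BOUND_AT_MAX = lowest_possible_residual(3240.0)

print(f"lower bound at 18.1 mm/year: {BOUND_AT_MIN:.2f}")
print(f"lower bound at 3240.0 mm/year: {BOUND_AT_MAX:.2f}")
```

Because so many countries have death percentages near zero, their residuals pile up along this gently downward-sloping bound, which is what produces the line-like edge in the plot.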


Conclusions:

The histograms demonstrate a few notable characteristics. There is a wide disparity in the amount of precipitation that countries around the globe receive, and the countries with the highest amounts skew the distribution right. Water scarcity does not account for a high percentage of deaths in most countries, though it is still a pressing issue.

The analysis suggests that the amount of precipitation a country receives has little bearing on the water/thirst issues in that country, which seems counterintuitive. Based on the look of the scatterplot and the correlation value close to zero, precipitation has almost no linear correlation with water scarcity deaths. Despite these unexpected results, the residual plot does indicate that the relationship between precipitation and water scarcity is best modeled linearly, if at all. In other words, the linear model is the best available model, but even the best model is very weak.

The analysis could be improved by incorporating other factors that affect water scarcity deaths and comparing their effects to that of precipitation in a multifactor analysis. This would help identify the root causes of the global water crisis and allow humanity to find the best measures for solving it. Another improvement would be to consider only countries where water scarcity is a significant issue. Running a linear regression on the subset of data where water scarcity causes a relatively high percentage of deaths could yield a clearer result, though the “cutoff” point would need to be determined.

Future studies could test other factors, such as a country’s economic conditions, against water scarcity deaths. This could yield noteworthy results, because the infrastructure that carries water from freshwater sources to the population could also be a factor in water scarcity deaths. A country’s infrastructure is closely tied to its economy, so the average annual income of a citizen could be a quantitative way to measure it. Regardless, information on how other aspects of a region are associated with the water crisis would be beneficial.

Counterintuitively, this project concludes that precipitation shows no meaningful linear association with water scarcity issues in a given country.


Appendix

Average precipitation in depth (mm per year) – Country Ranking. IndexMundi. (n.d.). Retrieved October 17, 2022, from https://www.indexmundi.com/facts/indicators/AG.LND.PRCP.MM/rankings

Morelli, B. (2017, April 30). Water crowding, precipitation shifts, and a new paradigm in water governance. Yale Environment Review. Retrieved October 24, 2022, from https://environment-review.yale.edu/water-crowding-precipitation-shifts-and-new-paradigm-water-governance-0 

Ritchie, H., & Roser, M. (2021, July 1). Clean Water. Our World in Data. Retrieved October 17, 2022, from https://ourworldindata.org/water-access

Water Scarcity. UNICEF. (n.d.). Retrieved October 23, 2022, from https://www.unicef.org/wash/water-scarcity#:~:text=Four%20billion%20people%20%E2%80%94%20almost%20two,by%20as%20early%20as%202025. 

Histogram Summary Statistics:

Statistic              Precipitation        Water Scarcity Deaths
Minimum                18.1 mm/year         0.000%
1st Quartile           563.5 mm/year        0.030%
Median                 1032.0 mm/year       0.290%
3rd Quartile           1708.5 mm/year       2.765%
Maximum                3240.0 mm/year       10.620%
Mean                   1172.0 mm/year       1.642%
Interquartile Range    1145.0 mm/year       2.735%
Standard Deviation     803.0296 mm/year     2.3677%

Data Table (example portion):