
The Relationship Between Water Scarcity and Precipitation: A Linear Regression Project

I recently conducted a regression project on the relationship between precipitation and water scarcity in countries around the world. Precipitation was measured as depth, in millimeters per year, which eliminates the possibility of a country's area skewing the results. Water scarcity was measured as the percentage of a country's total deaths attributable to water scarcity. I used the programming language R to visualize the data, which came from two different sources. Although the data is from 2017, the findings of this project can still be applied to the world today. The primary result was that precipitation and water scarcity deaths are not correlated. Below is a shortened version of the project summary.


Examining the Existence of an Association Between Precipitation and Water Scarcity in Countries Around the World

Introduction:

One of the most pressing issues humanity currently faces is water scarcity, which affects billions worldwide and has significant implications for the future of society. According to the United Nations Children's Fund (UNICEF), "four billion people…experience extreme water scarcity for at least one month each year," and "half of the world's population could be living in areas facing water scarcity by as early as 2025." However, water scarcity varies considerably between countries: in the United States, it accounts for approximately 0.01% of total deaths, while in Chad it accounts for roughly 10.62%.


To solve this global issue, it is important first to examine which factors are associated with water scarcity. Intuitively, if freshwater sources such as rivers, lakes, and groundwater are the predominant sources of drinking water, and they are replenished by precipitation, then a lack of precipitation and water scarcity could be linked. This raises the question: is the amount of precipitation a country receives associated with the available drinking water in that country?


This project aims to determine if there is an association between these two variables by comparing the precipitation in 2017 in 179 countries around the world against their respective water scarcity deaths in 2017. Since the goal is to model water scarcity deaths, this will be the response variable; thus, precipitation will be the explanatory variable.


Data & Analysis: 

The following graphs illustrate the data: a histogram for each variable, a scatterplot illustrating their association, and a residual plot.

Histograms:

The histogram for the explanatory variable, precipitation, indicates a unimodal distribution with its peak at approximately 700 mm/year. The distribution is skewed to the right, and there are no visible gaps in the data; all of the points conform to this pattern, so there are no outliers. The main cluster of points lies toward the left side of the graph, below 1500 mm/year. According to the five-number summary, the median is 1032.0 mm/year and the mean is 1172.0 mm/year. This matches what would be expected: since the distribution is skewed right, the mean is greater than the median. The range of the distribution, a key measure of spread, was calculated to be 3221.9 mm/year.

The interquartile range can be calculated by doing the following: 

IQR = Q3 − Q1 = 1708.5 mm/year − 563.5 mm/year = 1145 mm/year

The fact that Q3 is further from the maximum than Q1 is from the minimum reinforces the right-skewed distribution. Finally, the standard deviation of the dataset is 803.0296 mm/year, which means that on average, a given data point lies about 803 mm/year from the mean.
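
These spread measurements can be double-checked directly from the five-number summary. Here is a minimal sketch in Python (the project itself used R; the values are taken from the summary statistics table in the appendix):

```python
# Five-number summary for precipitation (mm/year), from the appendix table
minimum, q1, median, q3, maximum = 18.1, 563.5, 1032.0, 1708.5, 3240.0

iqr = q3 - q1                    # interquartile range: 1708.5 - 563.5
value_range = maximum - minimum  # overall range of the distribution

print(iqr)                    # 1145.0
print(round(value_range, 1))  # 3221.9
```

Note also that the mean (1172.0) exceeding the median (1032.0) is consistent with the right skew.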

The histogram for the response variable, water scarcity death percentage, also indicates a unimodal distribution, but its peak is in the very first interval, between 0% and 1% of deaths accounted for by water scarcity. Like the precipitation histogram, this distribution is heavily skewed to the right, with no significant gaps in the data; every point follows the pattern, so again there are no outliers. The main cluster of points is on the left side of the graph, below 1% of deaths. According to the five-number summary, the median is 0.290% and the mean is 1.642%; the mean exceeding the median supports the right-skew observation. The range of the distribution is 10.620%.

The interquartile range is 2.735%.

Again, Q3 is much further from the maximum than Q1 is from the minimum. Finally, the standard deviation of the dataset is 2.3677%, indicating that on average, a given data point lies about 2.37 percentage points from the mean.

Scatterplot: 

The central part of this study is the scatterplot comparing precipitation to water scarcity deaths. Visually, the pattern of the graph is unclear. A positive association would mean that above-average values of one variable tend to occur with above-average values of the other, and likewise for below-average values; a negative association would be the opposite. Simply observing this graph does not clearly indicate the direction of the association. The scatterplot does not follow an easily identifiable form, though the points are clustered near water scarcity death percentages of zero. Visually, there also do not appear to be any outliers.

Visually, the linear model does not seem appropriate for the scatterplot: the points show no clear form, and therefore no constant rate of change.

Based on the summary of the model, the LSRL equation (with ŷ the predicted percentage of deaths due to water scarcity and x the precipitation in mm/year) is approximately:

ŷ = 1.6 + (6.212 × 10⁻⁵)x

Given this equation, a country that receives no precipitation is expected to have 1.6% of its deaths result from water scarcity. The slope is 6.212 × 10⁻⁵; the fact that it is positive indicates that an increase in precipitation corresponds with an increase in water scarcity deaths. Interpreted literally, for every 100-meter increase in a country's precipitation, the percentage of total deaths due to water scarcity increases by about 6.2. This is a massive amount of precipitation: the maximum rainfall observed was about 3.24 meters per year, so an increase of 100 meters is virtually impossible, and the slope is practically insignificant.

To measure the strength and direction of this linear model, the correlation r is used. Based on the r² value given by the model, r is calculated to be 0.02107. This small, positive value indicates that the linear relationship is positive but very weak, which is why it was not apparent from the scatterplot itself. The dataset, therefore, is not linear. The r² value is 0.0004439, meaning that only 0.04439% of the variation in the percentage of water scarcity deaths is accounted for by the linear association with precipitation (mm/year), reinforcing the idea that this data is not easily modeled linearly.

The regression line can be used to predict the percentage of total deaths due to water scarcity in a country. For example, in Belize, the precipitation in 2017 was 1705 mm/year. According to the model, this yields a death percentage of 1.67%; the actual percentage was 0.46%, so the equation overestimated the deaths by 1.21 percentage points.
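
The prediction step can be sketched in a few lines (Python here, though the project itself used R). The slope is as reported; the intercept below (about 1.564) is backed out from the reported Belize prediction, since the model summary gives it only to one decimal place (about 1.6), so treat it as an assumption for illustration:

```python
# Least-squares line: predicted death % = intercept + slope * precipitation
slope = 6.212e-5   # reported slope (per mm/year of precipitation)
intercept = 1.564  # inferred from the reported Belize prediction, not stated exactly

def predict_death_pct(precip_mm_per_year: float) -> float:
    """Predicted percentage of total deaths due to water scarcity."""
    return intercept + slope * precip_mm_per_year

belize_pred = predict_death_pct(1705)  # Belize received 1705 mm/year in 2017
print(round(belize_pred, 2))           # 1.67
print(round(0.46 - belize_pred, 2))    # residual (actual minus predicted): -1.21
```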

Residual Plot: 

The residual plot is indicative of the appropriateness of the linear model. The points on the residual plot are randomly scattered, indicating that a linear model is as suitable as any for this data. However, because of the magnitude of the residuals, along with the low r and r² values, the linear regression model will not predict the observed values with a high degree of certainty. There is no fanning, so the line does not suit certain data points better than others. The visible "line" that the points form at the bottom of the residual plot arises because the predicted values are all close to zero and the water scarcity death percentage cannot be negative, which puts a floor on how low a residual can go.
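
That floor on the residuals can be made concrete: since observed percentages cannot go below 0%, the smallest possible residual at precipitation x is 0 - ŷ(x). A quick check using the reported coefficients (a Python sketch; the intercept of about 1.6 is the rounded value from the text):

```python
# Residual = observed - predicted. Since observed >= 0%, the lowest possible
# residual at precipitation x is -(intercept + slope * x).
slope = 6.212e-5  # reported slope
intercept = 1.6   # intercept as rounded in the text

for x in (0, 1000, 2000, 3000):
    floor = -(intercept + slope * x)
    print(x, round(floor, 3))
```

The floors fall on a nearly flat line just below -1.6, which is exactly the visible "line" of points along the bottom of the residual plot.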


Conclusions:

The analyses of the histograms demonstrate a few notable characteristics. First, there is a wide disparity in the amount of precipitation countries around the globe receive, with the countries receiving the most precipitation skewing the distribution right. Second, water scarcity does not account for a high percentage of deaths in most countries, though it remains a pressing issue.

The analysis suggests that the amount of precipitation a country receives has no bearing on the water scarcity issues in that country, which seems counterintuitive. Based on the scatterplot and a correlation value close to zero, precipitation has essentially no correlation with water scarcity deaths. Despite these unexpected results, the residual plot does indicate that the relationship between precipitation and water scarcity is best modeled linearly, if at all. The linear model is therefore the best model, but the best is not very strong.

The analysis could be improved by incorporating other factors' effects on water scarcity deaths and comparing them with the effect of precipitation in a multifactor analysis. This would help identify the root causes of the global water crisis and allow humanity to find the best measures for solving it. Another improvement would be to consider only countries where water scarcity is a significant issue: running a linear regression on the subset of data where water scarcity causes a relatively high percentage of deaths could yield better results, though the "cutoff" point would need to be determined.

For future studies, testing other factors, such as a country's economic characteristics, against water scarcity deaths could yield noteworthy results, because the infrastructure needed to transfer water from freshwater sources to the population could also drive water scarcity deaths. A country's infrastructure is closely tied to its economy, so the average annual income of a citizen could serve as a quantitative proxy. Regardless, learning how other aspects of a region are associated with the water crisis would be beneficial.

Counterintuitively, this project yields the conclusion that precipitation is not associated with water scarcity issues in a given country.


Appendix

Average precipitation in depth (mm per year) – Country Ranking. IndexMundi. (n.d.). Retrieved October 17, 2022, from https://www.indexmundi.com/facts/indicators/AG.LND.PRCP.MM/rankings

Morelli, B. (2017, April 30). Water crowding, precipitation shifts, and a new paradigm in water governance. Yale Environment Review. Retrieved October 24, 2022, from https://environment-review.yale.edu/water-crowding-precipitation-shifts-and-new-paradigm-water-governance-0 

Ritchie, H., & Roser, M. (2021, July 1). Clean Water. Our World in Data. Retrieved October 17, 2022, from https://ourworldindata.org/water-access

Water Scarcity. UNICEF. (n.d.). Retrieved October 23, 2022, from https://www.unicef.org/wash/water-scarcity#:~:text=Four%20billion%20people%20%E2%80%94%20almost%20two,by%20as%20early%20as%202025. 

Histogram Summary Statistics:

Statistic            | Precipitation      | Water Scarcity Deaths
Minimum              | 18.1 mm/year       | 0.000%
1st Quartile         | 563.5 mm/year      | 0.030%
Median               | 1032.0 mm/year     | 0.290%
3rd Quartile         | 1708.5 mm/year     | 2.765%
Maximum              | 3240.0 mm/year     | 10.620%
Mean                 | 1172.0 mm/year     | 1.642%
Interquartile Range  | 1145.0 mm/year     | 2.735%
Standard Deviation   | 803.0296 mm/year   | 2.3677%

Data Table (example portion):

Data Analytics in the Workplace: My First Experience

This summer, I've been working at American Molecular Laboratories as a data analyst. Over the winter, I got some lab experience, working as a lab assistant before transitioning to helping with COVID testing. This time, I was stationed just outside the laboratory, working at a desk and analyzing the data provided by the lab workers. The software I used relied on Next-Generation Sequencing (NGS) to find mutations in the forward and reverse DNA strands. I would then look at the mutations and determine which ones were indicative of H. pylori-positive results. It was my first time applying data analysis to a modern problem, and it was an informative and enjoyable experience.

For background, Helicobacter pylori (H. pylori) is a type of bacteria that inhabits the digestive tract when it enters a host. This can lead to ulcers and damage to the stomach lining and the upper small intestine, and long-term infection can increase the risk of stomach cancer. There are medicines that can treat H. pylori and soothe ulcers, but clean water and sanitation are the best ways to prevent it in the first place. Next-generation sequencing is a DNA sequencing technology that yields the nucleotide sequence of entire genomes very quickly. Also called massively parallel sequencing, NGS has high sensitivity to low-frequency mutations while operating at a fast rate.

To test for H. pylori, the lab uses fresh stool or tissue samples to obtain DNA. Once the DNA is extracted, it is sequenced, then analyzed (the job I did). Analyzing the data wasn't overly difficult; I drew on the research of other workers at the company, as well as previous studies, to determine which DNA mutations were indicative of H. pylori. I compared these against the mutations listed in the table produced by the NGS software and recorded the common ones in an Excel sheet. It was great to get some experience in the data analysis industry and to apply some of the knowledge I learned in Biology this past school year.
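
The comparison step amounts to a set intersection. Here is a minimal sketch (Python; the mutation names are invented placeholders, not real H. pylori markers):

```python
# Hypothetical marker mutations indicative of H. pylori (placeholder names)
known_markers = {"A123T", "G456C", "T789A"}

# Mutations the NGS software reported for one sample (also placeholders)
sample_mutations = {"G456C", "C222G", "A123T"}

# The overlap is what gets recorded in the Excel sheet
common = sorted(known_markers & sample_mutations)
print(common)  # ['A123T', 'G456C']
```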

Most of the samples I tested were clinical, meaning they came from people who went to the lab for testing because they were sick. However, some of the samples were from Boston Children's Hospital for a study conducted by Harvard, so it was an incredible feeling to know that I was helping with something much greater than myself.


Data in the Stanley Cup Finals: A Testament to the NHL's Trend Toward More Data

One of the big themes in this blog so far has been the hockey world increasingly incorporating data and more advanced statistics into the sport, which is thought to increase consumer engagement, one of the NHL's goals. The Colorado Avalanche vs. Tampa Bay Lightning Stanley Cup Final recently came to a close, and there were a few key instances throughout in which data analytics was on display. Game 5 was played in Denver's Ball Arena, with the Lightning coming away with a crucial win to keep their season alive. What was truly significant about this game from a data standpoint, however, occurred at Tampa Bay's Amalie Arena that Friday night: the Lightning used puck and player tracking technology to host a watch party in their own arena, attracting thousands of fans. If this technology continues to be used, consumer engagement in the sport could skyrocket.

A video that was taken at Amalie Arena during Game 5 of the Stanley Cup Finals

Data on the positions of the players and puck on the ice is not something the hockey world had offered prior to this point. Sports such as football, golf, and baseball were the pioneers of this movement, so it was inspiring to see the advances made on the ice, as demonstrated by the video above.

Another relatively new piece of data analytics, introduced in the 2021 Stanley Cup Playoffs, is shot and save analytics. In hockey, many players are taught that shooting low blocker, or five-hole, has the highest chance of scoring. Often, data visualizations accompany these claims, labeling the different targets in a net with the percentage of goals scored there. However, these were simplistic and lacked context for each goal. For example, many five-hole goals come on rebounds or in tight to the net; shots from medium range might be best suited for the low blocker corner, while longer shots may go top-shelf. These tendencies might also change based on the game situation. With new advances in data analytics and technology, the 2021 playoffs unveiled a product of the NHL's partnership with Amazon Web Services (AWS): advanced shot/save data. According to Joe Lemire, "The shooting metrics will include data on high shot-to-goal conversion rates from different spots on the ice and show how they change during power play situations. The save analytics will also pare down raw data into granular scenarios, such as shot location and type of save (stick, glove, pads, etc.)."

As mentioned in my first post, hockey is trending down a path that is becoming more and more data-driven, which is awesome to see. Given that it’s been behind for much of recent history, the steps the sport has been making are colossal. Hopefully, puck and player tracking become increasingly used next season. 

https://sporttechie.com/nhl-aws-to-debut-shot-and-save-analytics-for-stanley-cup-playoffs

Always Play to Win…True or False?

Former NFL player Herm Edwards, who currently coaches football at Arizona State University, once remarked, "You play to win the game." In most cases this holds, and trying your hardest to win has even come to be considered a sign of good sportsmanship. However, there have been rare occasions on which teams don't follow this idea, for various reasons. Most often, this happens when teams declare they're in a rebuild: they don't necessarily play to lose, but they don't play to win either, hoping a worse record secures a better draft pick the following year. This most recent NFL season offered a different example.


The first thing to touch on is the most common example of teams not following Edwards' logic. By the halfway point of the 2020-2021 NFL season, everyone thought the New York Jets were going to get the first overall pick, the highly touted Trevor Lawrence. It was essentially a race for last place between them and the Jaguars; at that point, the Jaguars had one win while the Jets had none, so it seemed a simple task for the Jets to "tank" for the first overall pick. People were therefore astonished when the Jets beat the Rams in Week 15, which dropped the Jaguars into last place once the tiebreaker fell to the Jets. While NFL fans and analysts were confused, many Jets fans were angry and Jaguars fans were grateful. These sentiments exemplify the idea that playing to win isn't always expected; in fact, playing to lose is what was actually thought to be best for the Jets.

However, the 2021-22 season saw a case in which neither team was rebuilding, yet neither necessarily needed to play to win…

Week 18 of the past NFL season could have been a historic loss for sportsbooks around the world. Because the Jaguars had surprisingly beaten the Colts earlier in the day and the Steelers had beaten the Ravens, the Raiders and Chargers each needed only a win or a tie to advance to the playoffs: the Colts were eliminated, leaving spots for two teams, and the Steelers finished at 9-7-1. A tie would have put both teams at 9-7-1 as well, but both held the tiebreaker over the Steelers. For this reason, many speculated the teams would deliberately play to a tie and placed their bets accordingly; a tie would have resulted in the greatest sportsbook payout of all time. I've attached a payoff matrix (from game theory in economics) to illustrate the potential outcomes.
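
Since the payoff matrix image may not survive here, the scenario can also be sketched in code (team names from the post; the matrix structure is my reconstruction of the situation described above):

```python
# Each joint outcome of the single Raiders-Chargers game maps to the set of
# teams that make the playoffs (the Steelers finish 9-7-1; both teams hold the
# tiebreaker over them, so a 9-7-1 tie sends both through).
outcomes = {
    ("win", "lose"): {"Raiders"},
    ("lose", "win"): {"Chargers"},
    ("tie", "tie"): {"Raiders", "Chargers"},
}

# Cooperating (playing to a tie) is the only cell where both advance,
# which is what makes this a prisoner's-dilemma-style setup.
print(sorted(outcomes[("tie", "tie")]))  # ['Chargers', 'Raiders']
```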

The Raiders ended up winning the game with a field goal in overtime. However, playing to win may not have been their intention. With under a minute to go, the Raiders were actually considering kneeling to run out the clock, a display of altruism toward the Chargers. But when Brandon Staley, the Chargers' head coach, called a timeout, it signaled his desire to win the game and get the ball back with time on the clock. Had he not called the timeout, it is very possible the Raiders would have let the clock run out.

Of course, other factors come into play, namely each team's potential playoff opponents after a tie versus a win. Regardless, despite nearly exhibiting a tactic that goes against Edwards' ideal, the game ended in a Raiders victory, and acting selfishly in the face of the Prisoner's Dilemma was once again the chosen course of action.

Although this post wasn’t as data/statistics focused as some of my previous ones, I still thought it was interesting to talk about. 

Inspired by: https://lukebenz.com/post/eng_bel/


Analytics in Playoff Hockey

It’s playoff season. The NHL’s New York Rangers and Tampa Bay Lightning are vying for a spot in the Stanley Cup Finals to face off against the Colorado Avalanche. The Boston Celtics are taking on the Golden State Warriors for the NBA title. These NHL games have reflected the growth of the data analytics field in the hockey world. We’ve seen shot speed radar animations above the net with each shot on goal, faceoff win probabilities, and more. When highlights with these statistics are posted to NHL’s Instagram, comments about them are plenty—an illustration of the increased consumer engagement that can result from sports analytics (see blog one!).

I recently saw a graphic that detailed each team’s odds to win the Stanley Cup. It looked like this: 

It made me wonder how this could be calculated using data analysis. Although this is a relatively new field in the hockey world, these charts have been around for a few years. It would be interesting to see the algorithm behind them and what factors are taken into consideration. At the time, the odds actually made sense to me on paper: the Avalanche were the "best" team remaining, the Lightning and Oilers were somewhat strong, and the Rangers seemed to bring up the rear in terms of likelihood to win. However, the first two games of the Lightning-Rangers series saw the Rangers go up 2-0 (granted, it's now 2-2, but still closer than I was expecting, and closer than the pie chart made it seem). This raises the questions: how can we improve the algorithms, and what factors that affect team success aren't being included?
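
As a guess at the simplest possible version of such an algorithm: assign each remaining team a strength rating and normalize the ratings into championship probabilities. The ratings below are invented for illustration; a real model would be far richer:

```python
# Toy Cup-odds model: normalize arbitrary strength ratings into probabilities.
# Ratings are invented placeholders, not actual published odds.
ratings = {"Avalanche": 40, "Lightning": 25, "Oilers": 20, "Rangers": 15}

total = sum(ratings.values())
odds = {team: rating / total for team, rating in ratings.items()}

for team, p in odds.items():
    print(f"{team}: {p:.0%}")
```

A real algorithm would layer in schedule, matchups, goaltending, injuries, and so on, and would update as series results come in, which is exactly the kind of improvement the questions above are asking about.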

I also wanted to touch on the shot speed radar. Although I couldn't find a picture of it, it looks like a little circle that appears above the net on certain shots. While this isn't the most complicated piece of data, it's likely to increase viewership and consumer enjoyment of watching hockey. Those little tidbits of understandable data might speak volumes to someone who isn't well-versed in the sport.

Overall, I’m really excited to see the direction data analysis in hockey is trending, as progress has already been made since the last time I covered a similar topic. 


The Faults in “Big Data”

Cathy O'Neil's book Weapons of Math Destruction delves into the idea that mathematical models aren't as perfect as we think. O'Neil calls these "weapons" WMDs for short, defining them as algorithms that try to quantitatively rank certain characteristics, such as teaching skill or creditworthiness, but actually have harmful outcomes, contributing to the inequality that permeates society; these algorithms keep the rich rich and the poor poor.

O'Neil opens with a story: teachers in a poor community are judged by an algorithm to determine which ones are best, and thus which ones get fired. One of the criteria in this mathematical model is student performance on a standardized test: specifically, whether students' scores hold stable or increase year to year once difficulty is adjusted. There are obvious problems with this; for one, students' experiences outside the classroom significantly affect their in-class performance. Despite its many flaws, the model was accepted as a good way to choose between teachers, on the grounds that it avoided humans' inherent biases. When it led to many popular educators being fired, though, sentiments changed. Sarah Wysocki was one of them, and she investigated the standardized tests taken by her students. At the beginning of the year, she had been pleasantly surprised that the fifth graders entering her class had scored very well on their tests, with 29% of them at the "advanced reading level." When those same students struggled to read simple sentences, however, she became suspicious. O'Neil writes that a high rate of erasures on the answer sheets suggested the previous teachers had fudged the scores to avoid being fired, which left Wysocki needing to maintain stable year-to-year scores for students who were not at the level their tests showed.

How does this relate to WMDs? O’Neil writes,

“After the shock of her firing, Sarah Wysocki was out of a job for a few days. She had plenty of people, including her principal, to vouch for her as a teacher, and she promptly landed a position at a school in an affluent district in northern Virginia. So thanks to a highly questionable model, a poor school lost a good teacher, and a rich school, which didn’t fire people on the basis of their students’ scores, gained one.” 

This story illustrates the “dark side” of data science: one that fuels inequality across communities worldwide. 

As mathematical models are increasingly used worldwide, whether in banking or the justice system, O'Neil describes WMDs as meeting three criteria: they are opaque, their effects are widespread, and they are harmful. In other words, the actual algorithms inside WMDs are secret, they affect a lot of people, and they can damage lives and contribute to economic inequality. Despite their many flaws, she asserts that it would be difficult to remove WMDs from American society because of how interconnected everything is. So rather than aiming to get rid of them, we should remove their biases so that, instead of increasing inequality, the models do what they're intended to do: help people.

I recently read an article in computer science class about the ethics of data science that resonated with my experience reading this book. It told the story of an African American woman who sought medical care for an illness; a model was used to predict the amount of health care she would need, and its primary input was money spent on health care in previous years, perhaps on the assumption that higher spending indicates greater need. Looking at the data, though, this model was biased: African Americans on average spent less money on health care for the same level of need, which ties into the link between racial inequality and economic inequality. This woman was consequently allocated less health care than she actually needed, which could have led to catastrophic results.

Although data analysis is often helpful, we must be careful that our machines don’t inherit the same biases we hold. Perhaps math modeling is the way of the future, but there’s a lot to be ironed out before then.


The Future of Hockey Analytics and Its Current Complications

The 2022 MIT Sports Analytics Conference held a panel titled, “Data, Snipe, Celly: The State of Hockey Analytics.” It was moderated by Greg Wyshynski and the speakers were Dominic Moore, Namita Nandakumar, Meghan Chayka, and Brant Berglund. This post will highlight some key ideas brought up in the panel and look at the future of hockey analytics.

Watching Tom Brady throw a touchdown on TV is a pretty common occurrence, and it's pretty easy to see where the ball is traveling. Once the play is over, though, there was no way to know exactly where every player and the ball had been. That was the case until the NFL began expanding data collection: player tracking arrived in 2014 through quarter-sized chips, and in 2017 the league decided to implant chips in footballs as well.

Hockey Data in the Past:

In the past, sports like golf, football, and baseball have paved the way for sports data analytics. Hockey, with its continuous game flow, has lagged behind. This is largely because in other sports, individual plays are easy to distinguish from each other. In football, there are whistles and play-clocks to indicate when a play begins and ends. In golf, every shot is a new “play.” For baseball, each pitch is a new play. 

Conversely, hockey is essentially one continuous play, punctuated by faceoffs, which were for a long time the only tangible, trackable aspect of the sport. For this reason, some of the best hockey data to this point concerns faceoffs, since they occur when play finally stops. Brant Berglund offered an analogy: "if football is like a bunch of short sentences, hockey is one long run-on." He reinforced the focus on faceoffs by introducing a recently deployed faceoff win probability metric: given which players are taking the faceoff, the algorithm uses prior data to determine each team's probability of winning the draw.
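
A bare-bones version of such a metric could estimate the probability from the two centers' prior head-to-head results, with smoothing so small samples aren't pushed to extremes. This is an illustrative sketch, not the NHL's actual algorithm, and the counts are invented:

```python
# Estimate P(center A beats center B on a faceoff) from prior meetings,
# using add-one (Laplace) smoothing so sparse matchups stay near 50/50.
def faceoff_win_prob(wins: int, losses: int) -> float:
    return (wins + 1) / (wins + losses + 2)

print(round(faceoff_win_prob(62, 38), 3))  # 0.618  (62 wins in 100 prior draws)
print(faceoff_win_prob(0, 0))              # 0.5    (no history: even odds)
```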

There’s a common story within the hockey community that one of the reasons Wayne Gretzky grew to be so dominant is because as a kid, he’d watch NHL games and draw where the puck went. The dark shaded areas would be places for him to take note of during his games. Until recently, this was pretty much the extent of puck tracking technology. 

Looking Ahead:

Over this past offseason, the NHL implemented chips and cameras to revolutionize puck and player tracking technology. In fact, the All-Star Game featured a digital blue ring around the puck for TV viewers, an example of puck tracking in action. Soon, perhaps, the NHL will be able to create graphics like this:

One other topic the panelists touched upon was consumer engagement with the sport. Typically, people who aren't avid fans won't sit down, watch an entire hockey game, and enjoy it. Meghan Chayka brought up the idea of offering little tidbits of data that would interest a casual consumer: for example, sports betting built on the new faceoff probabilities, or the conversion rate of 6-on-5 situations when a team needs a goal late in the game. These small pieces of information might make hockey more enjoyable for the consumer to watch. Berglund expanded on this, asserting that an increase in hockey betting, driven by advances in data, would be instrumental in increasing consumer support for the sport.

While data analytics in hockey has faced complications in the past, the sport is quickly gaining ground in this field, with the data available to teams at an all-time high. With teams able to draw insights like never before, the NHL is trending in the right direction for greater consumer enjoyment.
