Data, Intelligence & Magic

Current controversy of the 2013 FINA World Championships

Published on 11-Dec-2021

This article uses a case study to show how data science can help solve complex real-world problems, whatever the field. The case we investigate today comes from sport, specifically swimming, and dates from 2013. FINA is the international federation recognized by the International Olympic Committee for administering international competitions in water sports, and it organizes the biennial World Championships, which include swimming. After the 2013 World Championships in Barcelona, rumours surfaced that the swimmers' performance had been affected by a lane bias caused by swirling currents in the competition pool. In other words, some swimmers had an unfair advantage depending on the lane they swam in. The World Championships are among the most prestigious events in the sport, so the claim was serious and needed to be investigated.

The investigation was hindered by the fact that the pool was a temporary construction and was dismantled after the competition. With the pool gone, there was no way to physically test for a swirling current, which left anyone interested in the issue with very few options. It was then that a group of data analysts from Indiana University attempted an investigation using statistics. The idea was simple: the pool no longer exists, but the data it generated during the competition remains.

Before proceeding further, it should be noted that this article is based on the analysis report ‘Analysis of the 2013 FINA World Swimming Championships’ published in the journal ‘Medicine & Science in Sports & Exercise’ and on the DataCamp course ‘Case Studies in Statistical Thinking’, which is itself based on that report. The data used in this analysis was made freely available by OMEGA, the official timekeeper of the 2013 World Championships. Before we can investigate the problem, it is important to have some background knowledge of how swimming competitions are held. This is called domain knowledge, and it is crucial for any data science problem.

The competitions are held in a 50-meter pool, which means that for races longer than 50 meters the swimmers need to turn around after touching the ends of the pool. The pool is divided into 10 lanes numbered from 0 to 9, and the swimmers swim in lanes 1 through 8. Short-distance events (50m, 100m & 200m) are held as multiple heats, two semi-finals & a final, whereas long-distance events (400m, 800m & 1500m) have multiple heats & a final (there is no semi-final).

There are four different strokes in swimming competitions, namely freestyle, butterfly, breaststroke & backstroke. Because the mechanics of the strokes differ, the time taken to complete the same distance is different for each stroke. So, any event is defined by three parameters: gender, distance & stroke. That explains the structure of the pool and the format of the competition, which leaves us with the criteria for lane allotment.

The fastest swimmer in the semi-finals (or heats for long distance competitions) is allotted lane 4 in the final. The next fastest swimmer is allotted lane 5, the third fastest swimmer is allotted lane 3 and so on.
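The allotment rule can be sketched in a few lines of Python. This is only an illustration: the text above gives the first three lanes (4, 5, 3), and the sketch assumes that "and so on" means continuing to alternate outward from the centre (6, 2, 7, 1, 8). The function name and the swimmer labels are hypothetical.

```python
def assign_final_lanes(qualifiers):
    """Map qualifiers (ordered fastest first) to lanes for the final.

    Assumption: the pattern 4, 5, 3 from the text keeps alternating
    outward from the centre lanes, giving 4, 5, 3, 6, 2, 7, 1, 8.
    """
    lane_order = [4, 5, 3, 6, 2, 7, 1, 8]
    return {lane: swimmer for lane, swimmer in zip(lane_order, qualifiers)}


# Hypothetical qualifiers, ordered from fastest to slowest qualifying time.
qualifiers = ["A", "B", "C", "D", "E", "F", "G", "H"]
lanes = assign_final_lanes(qualifiers)

for lane in sorted(lanes):
    print(f"lane {lane}: swimmer {lanes[lane]}")
```

Under this rule the slowest qualifiers end up in lanes 1 & 8, which is why, in a fair pool, the two outer groups of lanes should perform about equally.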

The problem of swirling current

In a fair pool, the swimmers in lanes 4 & 5 will be the fastest in the final, and typically there won’t be much difference among swimmers in the outer lanes. But if a swirling current exists like the one shown above, swimmers in lane 8 will be swimming with the current while those in lane 1 will be swimming against it. The effects of a swirling current are felt most in the outer lanes because the current is faster towards the edges of the pool.

The data science journey

Let’s begin our data analysis journey. To start with, let us look at the medal tally of swimmers in 50m events against the lanes in which they swam during recent world championships. There are 3 medals to be won in each event, and the championship has 4 strokes & 2 genders. Thus, there are a total of 24 medals to be won in the 50m events (3 × 4 × 2). In a fair pool, swimmers in outer lanes 1-3 (low lanes) should have the same probability of winning medals as swimmers in outer lanes 6-8 (high lanes), because the performance difference among them is small. Below is the chart created as part of the Exploratory Data Analysis (E.D.A) on the medal tally (between the years 2005 & 2017).

It is clear from the E.D.A that the year 2013 is an outlier in terms of the medals won by swimmers in lanes 1-3. Typically, if x medals are won by swimmers in the outer lanes (lanes 1-3 & 6-8), the number won by swimmers in the low lanes is binomially distributed with x trials and a success probability of 50%. In other words, because there isn’t much difference in performance between swimmers in low & high lanes, the split of medals should resemble a coin-toss experiment with a fair coin. In 2013, 11 medals were won by swimmers in lanes 6-8 compared to just 1 by those in lanes 1-3. How probable is that, if the outcomes are binomially distributed? Let’s find out using hacker statistics.

In hacker statistics, we simulate the winning of medals (in 50m events) using the medal tally of swimmers in the outer lanes alone and observe how the medals split between the low lanes (1-3) and high lanes (6-8). The process is simulated 10,000 times using a method known as bootstrap sampling, and the results show that the probability of swimmers in low lanes winning fewer than 2 medals in a championship is less than 0.4%. This is highly improbable in a fair pool, so we are starting to see evidence of a lane bias. But the medal tally alone cannot rule out that the 2013 result was a case of random chance.
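As a rough check of that figure, here is a minimal sketch that treats the 12 outer-lane medals of 2013 (11 + 1) as 12 fair coin tosses. Note that it draws directly from a binomial distribution instead of resampling the medal table, so it is a simplification of the simulation described above and not a reproduction of it. The variable names and the random seed are illustrative.

```python
import math

import numpy as np

n_outer_medals = 12  # 11 medals in lanes 6-8 plus 1 in lanes 1-3 (2013, 50m events)
p_low = 0.5          # fair-pool assumption: low and high lanes equally likely
n_sims = 10_000

# Exact binomial probability of the low lanes winning 0 or 1 of the 12 medals.
p_exact = sum(math.comb(n_outer_medals, k) for k in range(2)) * p_low ** n_outer_medals

# Simulated version: draw the number of low-lane medals n_sims times.
rng = np.random.default_rng(42)
low_lane_medals = rng.binomial(n_outer_medals, p_low, size=n_sims)
p_sim = np.mean(low_lane_medals <= 1)

print(f"exact probability:     {p_exact:.5f}")
print(f"simulated probability: {p_sim:.5f}")
```

The exact value is 13/4096, roughly 0.32%, which agrees with the "less than 0.4%" quoted above.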

Methodology of investigation & a problem

The methodology we are using for our further investigation focuses on two analyses:

1. For the 2013 championships, is there an improvement in the time clocked by a swimmer if he/she switches from a low lane (1-3) to a high lane (6-8)?

2. If there is any improvement, what is the probability that the improvement is only due to random chance?

For our investigation, we will focus on the 50-meter final events because these are completed in one length of the pool, so the effects of a swirling current (if any) are easier to observe. The data of the 50-meter final events is a good starting point because we will be studying the fastest swimmers. But there is a problem. In the final, lanes are allotted to each swimmer based on their performance in the semi-finals, and there is no way to simulate switching the swimmers' lanes.

Overcoming the problem by investigating a different case

Having said that, there can be cases in which a swimmer's lane switches from low to high (or vice versa) when moving from semi-final to final. If we can show that swimmers exhibit the same level of performance in the semi-final & final rounds, we can use the data from these cases to look for evidence of a fractional improvement when moving from a low lane (1-3) to a high lane (6-8). We are not considering the data of the heats because elite swimmers may not use their full potential in the heats while trying to conserve energy for the later rounds. And we are looking at fractional improvement because the time clocked varies with the stroke: backstroke events take longer than freestyle events over the same distance.

So, we first investigate the analyses below using data from a different championship: the 2015 World Championships held in Kazan, Russia, where there were no reports of lane bias.

3. For the 2015 championships, is there an improvement in the time clocked by a swimmer when advancing from the semi-final to final?

4. If there is any improvement, what is the probability that the improvement is only due to random chance?

We will use the data of female swimmers in the semi-final & final for 50-, 100- & 200-meter events of all strokes. We have the arrays ‘semi_times’ & ‘final_times’ storing the times clocked by the swimmers such that semi_times[i] & final_times[i] correspond to the same swimmer, distance & stroke combination. Using these arrays, we calculate the fractional improvement when going from semi-final to final as follows:

Fractional improvement = ([semifinal time] - [final time]) / [semifinal time]
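With NumPy arrays the calculation is a one-liner. The sketch below uses three made-up pairs of times (in seconds) purely for illustration; they are not values from the championship data. A positive result means the swimmer was faster in the final, and a negative result means slower.

```python
import numpy as np

# Hypothetical paired times in seconds; index i is the same
# swimmer, distance & stroke in both arrays.
semi_times = np.array([24.50, 58.10, 125.30])
final_times = np.array([24.40, 58.30, 125.30])

# The subtraction is done before the division, element by element.
f = (semi_times - final_times) / semi_times

print(f)
print("mean fractional improvement:", f.mean())
```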

The Empirical Cumulative Distribution Function (ECDF) for the observed fractional improvements is displayed in the chart below.
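For readers who want to build such a chart themselves, an ECDF can be computed with a small helper like the one below. The function name `ecdf` and the sample values are illustrative assumptions; the returned arrays can be passed to any plotting library as points.

```python
import numpy as np


def ecdf(data):
    """Return sorted values x and the fraction of points <= each value, y."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Hypothetical fractional improvements, not the championship values.
f_demo = np.array([0.004, -0.003, 0.001, 0.000, -0.002])
x, y = ecdf(f_demo)

for xi, yi in zip(x, y):
    print(f"{xi:+.3f} -> {yi:.1f}")
```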

The observed mean (average) fractional improvement is 0.00041, which is very low. The 95% confidence interval of the mean fractional improvement, estimated using bootstrap sampling, is [-0.00094, 0.00173]. This means that if the events could be repeated many times, about 95% of the resulting mean fractional improvements would be expected to fall in this range. Because the interval contains zero, we see no significant fractional improvement, which suggests that the athletes compete at the same level of performance in the semi-final & final. We will test a null hypothesis to check this.
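A bootstrap confidence interval of a mean can be sketched as follows: resample the observed values with replacement, record the mean of each resample, and take the 2.5th and 97.5th percentiles. The function name, the seed and the demo values are assumptions for illustration and will not reproduce the interval quoted above.

```python
import numpy as np


def bootstrap_mean_ci(data, n_reps=10_000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Each row is one bootstrap resample of the same size as the data.
    samples = rng.choice(data, size=(n_reps, len(data)), replace=True)
    reps = samples.mean(axis=1)
    tail = (100 - level) / 2
    return np.percentile(reps, [tail, 100 - tail])


# Hypothetical fractional improvements, not the championship values.
f_demo = np.array([0.004, -0.003, 0.001, 0.000, -0.002, 0.003, 0.002, -0.001])
ci_low, ci_high = bootstrap_mean_ci(f_demo)

print("observed mean:", f_demo.mean())
print("95% CI:", ci_low, ci_high)
```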

Null hypothesis: There is no difference between the semi final & final times for a swimmer-distance-stroke combination.

To simulate data assuming that the above null hypothesis is true, we scan the entries in the two arrays, semi_times & final_times, and swap each pair of records with a 50% probability. We then calculate the fractional improvement of each swimmer and the mean fractional improvement. The test statistic is the mean fractional improvement. The question we are asking is this: if we make the times clocked by swimmers indistinguishable between semi-final & final, calculate the fractional improvements and repeat the process over & over again, what is the probability (p-value) that the mean fractional improvement is at least the observed value, 0.00041? If the p-value is greater than 0.1 (10% probability), the observed improvement is not statistically significant and we fail to reject the null hypothesis; in other words, the data are consistent with it.
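The swapping procedure can be sketched as a paired permutation test. The helper names (`swap_random`, `mean_frac_improvement`, `paired_permutation_p_value`), the seed and the demo times are assumptions made for this illustration; only the array names semi_times & final_times come from the text.

```python
import numpy as np


def swap_random(a, b, rng):
    """Swap a[i] and b[i] independently, each with 50% probability."""
    swap = rng.random(len(a)) < 0.5
    return np.where(swap, b, a), np.where(swap, a, b)


def mean_frac_improvement(first, second):
    """Mean of (first - second) / first over all pairs."""
    return np.mean((first - second) / first)


def paired_permutation_p_value(semi, final, n_reps=10_000, seed=0):
    """One-sided p-value: share of replicates at least as large as observed."""
    rng = np.random.default_rng(seed)
    observed = mean_frac_improvement(semi, final)
    reps = np.empty(n_reps)
    for i in range(n_reps):
        s, f = swap_random(semi, final, rng)
        reps[i] = mean_frac_improvement(s, f)
    return observed, np.mean(reps >= observed)


# Hypothetical paired times in seconds, not the championship values.
semi_times = np.array([24.50, 24.71, 58.10, 59.02, 125.30, 126.10])
final_times = np.array([24.40, 24.80, 58.30, 58.95, 125.30, 126.40])

observed, p_value = paired_permutation_p_value(semi_times, final_times)
print("observed mean fractional improvement:", observed)
print("p-value:", p_value)
```

Swapping within a pair keeps each swimmer's two times together, so the test only asks whether the labels "semi-final" and "final" matter.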

The process was simulated 10,000 times, and the p-value for observing a mean fractional improvement of at least 0.00041 was found to be 0.2725 (27%). The visualization of the simulated data is shown below.

Thus, we find no evidence of a difference in the performance of swimmers between semi-final & final, and the observed fractional improvement of 0.00041 is consistent with random chance. We can now wrap up our investigation of analyses 3 & 4, which serves as the starting point for the investigation of analyses 1 & 2.

Back to our main investigation

For our investigation, we use the data of swimmers (only for 50-meter events) whose lanes got switched from high to low when going from semi-final to final and vice-versa. We have two arrays – high_times & low_times such that high_times[i] & low_times[i] correspond to the same swimmer-stroke combination. We have the data of 26 such swimmers. Using these arrays, we calculate the fractional improvement when going from low lane to high lane as follows:

Fractional improvement = ([low lane time] - [high lane time]) / [low lane time]

The Empirical Cumulative Distribution Function (ECDF) for the observed fractional improvements is displayed in the chart below.

The observed mean fractional improvement is 0.01051, which is about 25 times the 0.00041 seen between semi-final & final in 2015. This already suggests an unfair advantage for swimmers in high lanes (6-8), because all but 3 swimmers swam faster in a high lane than in a low lane. Now let us test the null hypothesis.

Null hypothesis: In the 2013 championships, the mean fractional improvement of a swimmer when switching from a low lane to a high lane is zero.

We will use the mean fractional improvement as the test statistic.

To simulate data under the assumption that the null hypothesis is true, we shift the array of observed fractional improvements so that the mean of the shifted array is zero. Then we draw 10,000 bootstrap samples from the shifted array, calculate the mean fractional improvement (test statistic) for each sample and calculate the p-value as the fraction of simulated test statistics that are at least the observed value of 0.01051. If the p-value is less than 0.1 (10% probability), we reject the null hypothesis and conclude that there is an improvement when a swimmer switches from a low lane to a high lane.
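The shift-then-resample procedure can be sketched as below. The function name, the seed and the demo values are illustrative assumptions; the demo array is made up and is not the data of the 26 swimmers.

```python
import numpy as np


def shifted_bootstrap_p_value(frac_improvements, n_reps=10_000, seed=0):
    """One-sided bootstrap test of the null hypothesis 'mean is zero'."""
    rng = np.random.default_rng(seed)
    f = np.asarray(frac_improvements, dtype=float)
    observed = f.mean()
    # Shift the data so its mean is exactly the null value (zero).
    f_shift = f - observed
    # Each row is one bootstrap resample of the shifted data.
    samples = rng.choice(f_shift, size=(n_reps, len(f)), replace=True)
    reps = samples.mean(axis=1)
    # Fraction of replicates at least as large as the observed mean.
    return observed, np.mean(reps >= observed)


# Hypothetical fractional improvements (low lane -> high lane).
f_demo = np.array([0.010, 0.012, 0.009, 0.011, 0.013, 0.008, 0.010, 0.011])
observed, p_value = shifted_bootstrap_p_value(f_demo)

print("observed mean fractional improvement:", observed)
print("p-value:", p_value)
```

Shifting keeps the spread of the observed data while forcing its mean to the null value, so the replicates show how large a mean could arise by chance alone.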

The simulations show that the p-value for observing a mean fractional improvement of at least 0.01051 is 0.0003 (0.03%), far below the 0.1 threshold. The visualization of the simulated data is shown below.

If we test a similar null hypothesis for the 2015 championships (the mean fractional improvement of a swimmer when switching from a low lane to a high lane is zero), we get a high p-value of 0.29 (29%), which means we cannot reject the hypothesis: the data show no evidence of a lane bias in 2015. The visualization of the simulated data of 2015 is shown below.

To confirm our findings, a second null hypothesis was tested.

Null hypothesis: The lane allotted to a swimmer has no bearing on his/her swim time.

To simulate data for this hypothesis, we swapped each swimmer's low-lane and high-lane times with a 50% probability and calculated the p-value for obtaining a mean fractional improvement of at least 0.01051. The test statistic was the mean fractional improvement.

The p-value for the hypothesis was found to be 0, which means that none of the simulated replicates produced a mean fractional improvement as large as 0.01051. The true probability is not literally zero, but it is too small for the simulation to resolve. This is strong evidence that there was a lane bias in favour of lanes 6-8 during the 2013 FINA championships.