Additional Datasets considered

CDC ILINet

As analysis continued, we found that the FluNet data was at times difficult to model due to the gradual increase in reported positive cases of influenza over time. In particular, after the 2009 influenza pandemic, far more influenza samples were detected as governments around the world took the threat of another influenza outbreak increasingly seriously. To alleviate this issue, alternative sources of data were investigated.

We found that the Centers for Disease Control and Prevention (CDC), a government agency of the United States of America, monitors the spread of influenza-like illnesses (ILI) 1 via ILINet 2 and makes the data freely available online. Unlike the FluNet data, the ILINet data did not have a constantly increasing background. However, this data was only available for the US and so was deemed unsuitable for our use case.

Physicians

A factor that we believed might affect the spread of influenza was the number of physicians per unit population, which was provided in the health_indicators dataset. While the data was clean when provided, there were numerous periods with missing data, so interpolation was needed.

Assuming that the measurements were made in the January of every year, monthly data was created using quadratic spline interpolation. This interpolation method was chosen for a number of reasons: we wanted to ensure that the interpolation equalled the actual measurements at the points where the measurements were made, that the interpolation did not overfit the data, and that the interpolation was smooth, as the data often fluctuated. Quadratic spline interpolation meets all of these requirements because it is a smooth interpolation method with very few parameters.

There were also cases where a country had only two measurements in total; quadratic spline interpolation was not sensible there, so linear interpolation was used instead. Finally, where there was only one data point for a country, that value was set for the January of that year, and all other values were recorded as not available.
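The three cases above can be sketched as follows. This is a minimal illustration rather than the code used in the analysis: the function name `monthly_from_annual` and its inputs are hypothetical, and SciPy's `interp1d` is assumed as the interpolation routine.

```python
import numpy as np
from scipy.interpolate import interp1d


def monthly_from_annual(years, values):
    """Interpolate sparse annual measurements onto a monthly grid.

    Each measurement is assumed to fall in January of its year. Returns
    (months, monthly_values), where months counts months since the first
    measured January.
    """
    years = np.asarray(years, dtype=int)
    values = np.asarray(values, dtype=float)
    order = np.argsort(years)
    years, values = years[order], values[order]

    # Month offset of each January measurement from the first one.
    measured = (years - years[0]) * 12
    months = np.arange(measured[-1] + 1)

    if len(years) == 1:
        # Single measurement: only that January has a value; the caller
        # treats every other month as not available.
        return months, values.copy()

    # Quadratic spline needs at least three points; fall back to linear.
    kind = "quadratic" if len(years) >= 3 else "linear"
    interpolate = interp1d(measured, values, kind=kind)
    return months, interpolate(months)
```

Working in integer month offsets keeps the evaluation grid exactly inside the measured range, so no extrapolation is attempted.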

While this data was promising and a good variable to consider globally, we found that for the region we investigated, Europe, the number of physicians was not a good explanatory variable: each nation had a slightly different healthcare system, making each physician more or less effective.

Healthcare Expenditure

While it is well known that life expectancy correlates with total healthcare expenditure 3, we also wanted to investigate the effect of healthcare expenditure on controlling influenza. Total expenditure was computed by summing the domestic government healthcare expenditure per capita and the domestic private healthcare expenditure per capita, both adjusted for purchasing power parity in current international dollars. This data was not interpolated, as these budgets are usually set on an annual basis; the value for any time in a given year was taken to be the value measured for that year.
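As a rough sketch of this step, the two per-capita series can be summed and each annual total repeated for every month of its year. The function name and the year-keyed dictionaries are hypothetical stand-ins for the actual data structures.

```python
def total_expenditure_by_month(government, private):
    """Sum two annual per-capita series and hold each total constant for a year.

    Both arguments map year -> expenditure per capita (PPP, current
    international dollars). Years missing from either series are skipped.
    Returns a dict keyed by (year, month).
    """
    monthly = {}
    for year in sorted(set(government) & set(private)):
        total = government[year] + private[year]
        for month in range(1, 13):
            # No interpolation: every month carries the annual figure.
            monthly[(year, month)] = total
    return monthly
```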

We discarded this dataset for the same reason we discarded the dataset on physicians.

Smoking prevalence

Although smoking is not commonly recognised as a risk factor for influenza, small-scale studies have indicated that it increases both the risk of contracting influenza and the severity of such infections 4. The data was extracted from the health_indicators dataset. As smoking rates were decreasing linearly with time around the world, linear interpolation was found to be a good fit and was used between the given measurements to find the smoking rate at any given time.
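A minimal sketch of the linear interpolation, assuming time is expressed as a fractional year and using NumPy's `np.interp`; the function name `smoking_rate_at` is hypothetical. Times outside the measured range are returned as NaN rather than extrapolated, which is an assumption of this sketch.

```python
import numpy as np


def smoking_rate_at(t, years, rates):
    """Linearly interpolate smoking prevalence at time t (fractional year)."""
    years = np.asarray(years, dtype=float)
    rates = np.asarray(rates, dtype=float)
    order = np.argsort(years)  # np.interp requires increasing x values
    return np.interp(t, years[order], rates[order], left=np.nan, right=np.nan)
```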

This dataset was discarded as the data was relatively consistent within Europe and there were a large number of missing values.

Number of hours worked

So-called presenteeism, where ill workers come into work due to societal pressure, can contribute to the spread of disease; one study estimated that presenteeism costs the U.S. economy a staggering $150 billion a year 5. We wanted to factor the presenteeism culture of different countries into our models; presumably, the higher the degree of presenteeism, the faster the spread of influenza. However, without expensive primary research it is nearly impossible to estimate the degree of presenteeism, and even then it is not possible to extrapolate this data to the past.

Instead, we looked at the number of hours worked as a proxy for this. If there is a high degree of presenteeism, this should manifest in the number of hours that people work. This data was found for OECD countries in the form of number of hours worked per year 6. Because the measurements were given as hours worked per year, the value was held constant throughout each calendar year; it did not make sense to divide the data any further, as in reality the number of hours worked per month is seasonal.

We attempted to model influenza spread using a classical time-series model with this dataset as an explanatory variable, as this appeared to be a promising approach after feature selection with ElasticNet. However, after much effort, we concluded that this model would not meet our high standards and decided to cancel the project. More details on this are also included in the appendix.
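For readers unfamiliar with ElasticNet-based feature selection, the idea can be sketched as below. This is an illustration on synthetic data only, not the analysis performed here: the feature matrix, the penalty settings (`alpha`, `l1_ratio`) and the function name are all assumptions, and scikit-learn's `ElasticNet` is used as the estimator.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler


def elastic_net_coefficients(X, y, alpha=0.1, l1_ratio=0.5):
    """Fit ElasticNet on standardised features and return the coefficients.

    Features whose coefficient is shrunk to (or near) zero are candidates
    for removal; large absolute coefficients flag promising variables.
    """
    X_scaled = StandardScaler().fit_transform(X)
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    model.fit(X_scaled, y)
    return model.coef_


# Synthetic example: only the first of four features drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
coefficients = elastic_net_coefficients(X, y)
```

Standardising first matters because the ElasticNet penalty is applied to the raw coefficient sizes, so features on larger scales would otherwise be penalised less.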