Malaria

Introduction

the objective of this blog is to demostrate three informative visualization about Malaria from the dataset

I mainly focus on two datasets:

  • Malaria_death_age.csv which describe number of Malaria deaths incidence from 1990 until 2016 over 200 countries across different age groups
  • Malaria_death.csv which describe Malaria deaths by over 200 countries for all ages across the world and over time

Visualization 1

First, I would like to understand which age group has the highest number of malaria death from 1990 to 2016.

fig, ax = plt.subplots(figsize=(15,10))
sns.lineplot(x='year',y='deaths',hue='age_group',data=malaria_death_age,ax=ax)
ax.set_title('malaria death by age group')
fig.savefig('age.png')

image alter text

From the above graph, I can conclude that babies (ages under 5) have the highest number of the deaths over 26 years.

Visualization 2

Second, I would like to understand which continent has the highest death Rate from 1990 to 2016. Since there are no continent columns. I find a csv file that contains all the country around the world and its corresponding continent. I then merge the table with malaria death table as following

image alter text

I then plot the death rate by each region from 1990 until 2016.

fig,ax = plt.subplots(figsize=(20,15))
sns.lineplot(x='Year',y='Rate',hue='region',data=malaria_death,ax=ax)
ax.set_title('malaria death rate each year by region')
fig.savefig('region.png')

image alter text

From the above graph, I can conclude that Africa has the highest death rate from 1990 until 2016.

Visualization 3

Last, I would like to predict the trend of the country that has the highest death rate from 2017 until 2021.

I first need to find out the country that has the highest death rate over the past 20 years.

idx = malaria_death[['Entity','Year','Rate']].groupby(['Year'])['Rate'].transform(max) == malaria_death['Rate']
malaria_death[idx].sort_values('Year')['Entity']

after running the above code, Burkina Faso has the highest death rate since 2010. Thus, I would like to predict its death rate from 2017 until 2021. I plot the death rate of Burkina Faso from 1990 until 2016.

df = malaria_death[malaria_death['Entity'] == 'Burkina Faso'][['Year','Rate']]
df.Year=pd.to_datetime(df.Year,format='%Y')
df.Rate = df.Rate.astype('float')
fig,ax = plt.subplots(figsize=(10,8))
df.plot.line(x='Year',y='Rate',ax=ax)
ax.set_xlabel('year')
ax.set_ylabel('Rate')
ax.set_title('Burkina Faso death rate over 20 year')

image alter text

The death rate over the years does not appear to be stationary. I would like to test the stationarity of the death rate

def test_stationarity(timeseries):
    '''plot the rolling mean vs rolling standard deviation vs original timseries
       test stationarity of a timeseries with dicky fuller test

    Parameters
    ----------
    timesereis: pandas.Series
    
    returns
    -------
    None
    
    '''
    #determine rolling statistics, make sure the time series is not white noise 
    rolmean = timeseries.rolling(3).mean()
    rolstd = timeseries.rolling(3).std()
    #plot rolling statistics
    fig , ax = plt.subplots(figsize = (10,8))
    original = ax.plot(timeseries,color='blue',label='original')
    mean = ax.plot(rolmean,color='red',label='rolling_mean')
    std = ax.plot(rolstd,color='black',label='rolling_std')
    ax.legend(loc='best')
    ax.set_title('original vs rolling_mean vs rolling_std')
    plt.savefig('original_vs_rolling_mean_vs_rolling_std')
    
    #perform dicky-fuller test:
    print("perform dicky fuller test: ")
    dftest = adfuller(timeseries,autolag='AIC')
    
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)

The dicky fuller test statistics shows that the test statistic is 0.055 > 0.05. We can’t reject the null hypothesis. the data does not have a unit root and is not stationary.

image alter text

from the above graph, the standard deviation is relatively constant and there appears to have serrial correlation at different timestamps and the rolling mean is not 0. Thus, the timeseries is not a white noise.

Determine the p,d,q term for ARIMA model

d term

i decide to use the ndiffs from the utils to determine the differencing term parameter

from pmdarima.arima.utils import ndiffs
y = df.values
ndiffs(y, test='adf')

the result shows that d = 2.

p term and q term

I use the ACF and PACF plot to determine the auto regression lags and moving average lags.

image alter text

from the graph, a good starting point for p and q term is 2.

build ARIMA model

model = ARIMA(df['Rate'],order=(2,2,2))
ARIMA = model.fit(disp=0)
fig, axes = plt.subplots(2,1,figsize=(15,10))
axes[0].plot(df,label='original')
axes[0].plot(ARIMA.fittedvalues, color='red',label='fitted_value')
axes[0].legend()
residuals = pd.DataFrame(ARIMA.resid)
residuals.plot(kind='kde',ax=axes[1])

I also plot the original data vs the fitted value and I have also plot the kernal density plot of the residual.

image alter text

The above plot shows that the ARIMA model does not have a great fit but the residual appears to be normal and centered at 0 which indicates that there are no more information that can be learned with ARIMA model.

predict the death rate with ARIMA model

import datetime
predict_series = {}
prediction,_,_ = ARIMA.forecast(steps=5)
for i in range(2017,2021):
    predict_series[datetime.datetime.strptime(str(i),'%Y')] = prediction[i-2017]
#prediction from 2017 to 2021
predict_df = pd.DataFrame(predict_series.items(),columns=['Year','Rate'])
predict_df.index = predict_df.Year
predict_df.drop('Year',axis=1,inplace=True)
#plot the prediction vs original
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(df,label='original')
ax.plot(predict_df,label='prediction')
ax.set_xlabel('year')
ax.set_ylabel('deaths')
ax.legend()
ax.set_title('Burkina Faso death rate prediction vs original')

image alter text

The above plot shows that the number of Malaria death in Burkina Faso will continue to decrease since 2016.

You can find my code here. You can also find more details about time series here