COVID-19 Data Analysis Using Python

Neha suryagouni
7 min read · Aug 27, 2020

In this project, you will learn how to preprocess and merge datasets, calculate the measures you need, and prepare the data for analysis. We are going to work with the COVID-19 dataset published by Johns Hopkins University, which contains the cumulative number of confirmed cases, per day, in each country. We also have a second dataset consisting of various life factors, scored by the people living in each country around the globe. We are going to merge these two datasets to see if there is any relationship between the spread of the virus in a country and how happy the people living in that country are. Note: This project works best for learners who are based in the North America region. We're currently working on providing the same experience in other regions.

SKILLS YOU WILL DEVELOP

Python Programming

Data Analysis

Pandas

Seaborn

Statistics

Learn step-by-step

  1. Importing the COVID19 dataset and preparing it for the analysis by dropping columns and aggregating rows.
  2. Deciding on and calculating a good measure for our analysis.
  3. Merging two datasets and finding correlations among our data.
  4. Visualizing our analysis results using Seaborn.

We will reach the project goals by completing each task in the project:

  • Introduction
  • Importing Covid19 dataset
  • Finding a good Measure
  • Importing and preparing World happiness report dataset
  • Merge two datasets and find correlations among your data
  • Visualize your results using Seaborn

While you are watching me work on each step, you will get a cloud desktop with all the required software pre-installed. This will allow you to follow along with the instructions to complete the above-mentioned tasks. After all, we learn best with active, hands-on learning.

Data about COVID-19 cases is available online, and it’s updated frequently.

To parse the data, you first must download it:

https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv

Load the data directly into a Pandas DataFrame. Pandas provides a function, read_csv(), which can take a URL and return a DataFrame object, as shown below:

import pandas as pd
URL_DATASET = r'https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv'
df1 = pd.read_csv(URL_DATASET)
print(df1.head(3)) # Get first 3 entries in the dataframe
print(df1.tail(3)) # Get last 3 entries in the dataframe

The top row of the data set contains column names:

  1. Date
  2. Country
  3. Confirmed
  4. Recovered
  5. Deaths

The output of the head query includes a row index (not listed as a column) plus an entry for each column:

0 2020-01-22 Afghanistan 0 0 0
1 2020-01-22 Albania 0 0 0
2 2020-01-22 Algeria 0 0 0

The output of the tail query is similar but contains the tail end of the data set:

12597 2020-03-31 West Bank and Gaza 119 18 1
12598 2020-03-31 Zambia 35 0 0
12599 2020-03-31 Zimbabwe 8 0 1

From the output, you can see that the DataFrame (df1) has the following columns:

  1. Date
  2. Country
  3. Confirmed
  4. Recovered
  5. Deaths

Further, you can see that the Date column has entries from January 22 to March 31. This dataset is updated daily, so your own output will include later dates.
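To check the columns and the date range without a network connection, the six rows printed above can be loaded from an in-memory string. This is only a sketch: the name SAMPLE_CSV and the use of parse_dates are choices made for this illustration, not part of the original script.

```python
import io

import pandas as pd

# The six rows shown by head(3) and tail(3) above, reused as an offline sample
SAMPLE_CSV = """Date,Country,Confirmed,Recovered,Deaths
2020-01-22,Afghanistan,0,0,0
2020-01-22,Albania,0,0,0
2020-01-22,Algeria,0,0,0
2020-03-31,West Bank and Gaza,119,18,1
2020-03-31,Zambia,35,0,0
2020-03-31,Zimbabwe,8,0,1
"""

# parse_dates converts the Date column from text to real timestamps
sample = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["Date"])

print(list(sample.columns))  # column names taken from the header row
print(sample["Date"].min().date(), "to", sample["Date"].max().date())
print(sample.dtypes)  # Date is a datetime column, the counts are integers
```

Parsing the dates up front makes later steps such as sorting or filtering by date range behave as expected, because the values are compared as dates and not as strings.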

2. Select data for India

In this step, we will select only those rows in the DataFrame that include India. This is shown in the script below:

#### ----- Step 2 (Select data for India)----
df_india = df1[df1['Country'] == 'India']
print(df_india.head(3))
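The same boolean-mask filter can be wrapped in a small helper so it is easy to reuse for other countries. The sketch below is an illustration only: select_country is a hypothetical name, and it is run on three rows from the tail output above (which do not include India) so that it works offline.

```python
import io

import pandas as pd

# Three rows from the tail output above, reused as an offline sample
SAMPLE_CSV = """Date,Country,Confirmed,Recovered,Deaths
2020-03-31,West Bank and Gaza,119,18,1
2020-03-31,Zambia,35,0,0
2020-03-31,Zimbabwe,8,0,1
"""


def select_country(df, country):
    """Return the rows for one country as an independent copy."""
    mask = df["Country"] == country  # True for rows that match the country
    # copy() avoids pandas warnings if the result is modified later
    return df.loc[mask].copy().reset_index(drop=True)


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
df_zambia = select_country(sample, "Zambia")
print(df_zambia)
```

With the full dataset, select_country(df1, 'India') would return the same rows as the line df1[df1['Country'] == 'India'] above, just with a fresh index.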

3. Plot data

Here we create a bar chart. We will put the dates on the X-axis and the number of confirmed cases and the number of deaths on the Y-axis. There are a few noteworthy things about this part of the script which are as follows:

  • The line of code plt.rcParams["figure.figsize"]=20,20 enlarges the plot, which is mainly useful in Jupyter. You can remove it if you are using another IDE.
  • Notice the line of code ax1 = plt.gca(). To ensure that both plots, i.e., confirmed cases and deaths, are drawn on the same graph, we need to pass the Axes object of the first plot to the second plot. We use gca() to get it. (By the way, 'gca' stands for 'get current axes'.)

The complete script is given below:

#  Author:- Anurag Gupta # email:- 999.anuraggupta@gmail.com
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
#### ----- Step 1 (Download data)----
URL_DATASET = r'https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv'
df1 = pd.read_csv(URL_DATASET)
# print(df1.head(3)) # Uncomment to see the dataframe
#### ----- Step 2 (Select data for India)----
df_india = df1[df1['Country'] == 'India']
print(df_india.head(3))
#### ----- Step 3 (Plot data)----
# Increase size of plot
plt.rcParams["figure.figsize"]=20,20 # Remove if not on Jupyter
# Plot column 'Confirmed'
df_india.plot(kind = 'bar', x = 'Date', y = 'Confirmed', color = 'blue')
ax1 = plt.gca()
df_india.plot(kind = 'bar', x = 'Date', y = 'Deaths', color = 'red', ax = ax1)
plt.show()

The entire script is available on GitHub.
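As an alternative to plt.gca(), the Axes object can be created first and handed to both plot calls. The sketch below is an assumption about how one might do this, not the author's script: it uses three rows from the tail output above and plots per country instead of per date, and it selects the Agg backend only so that it runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Three rows from the tail output above, reused as an offline sample
SAMPLE_CSV = """Date,Country,Confirmed,Recovered,Deaths
2020-03-31,West Bank and Gaza,119,18,1
2020-03-31,Zambia,35,0,0
2020-03-31,Zimbabwe,8,0,1
"""
sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Create the figure and its Axes once, then draw both series on the same Axes
fig, ax = plt.subplots(figsize=(8, 4))
sample.plot(kind="bar", x="Country", y="Confirmed", color="blue", ax=ax)
sample.plot(kind="bar", x="Country", y="Deaths", color="red", ax=ax)
ax.set_ylabel("Cases")
fig.tight_layout()
```

Creating the Axes explicitly makes it clear which plot each call draws on, which helps once a notebook contains more than one figure.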

Creating an animated horizontal bar graph for five countries

Note for Jupyter: To run this in Jupyter as a dynamic animation rather than as a static png, you need to add a magic command at the beginning of your cell, namely: %matplotlib notebook. This will keep the figure alive instead of displaying a static png file and can hence also show animations. If you are on another IDE, remove this line.

1. Download the data

This step is exactly the same as in the previous script, and therefore, it need not be repeated.

2. Create a list of all dates

If you examine the data you downloaded, you will notice that it has a Date column. This column holds a date value for each country, so the same date occurs many times. We need a list in which each date appears only once; the animation will step through this list, one frame per date. We get it with a line of code like: list_dates = df['Date'].unique(). The unique() method returns each distinct date once.
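A quick way to see what unique() does is to run it on the six rows printed earlier in this article, which contain only two distinct dates. This is a small sketch; the SAMPLE_CSV name is an assumption made for the illustration.

```python
import io

import pandas as pd

# The six rows shown by head(3) and tail(3) earlier, reused as an offline sample
SAMPLE_CSV = """Date,Country,Confirmed
2020-01-22,Afghanistan,0
2020-01-22,Albania,0
2020-01-22,Algeria,0
2020-03-31,West Bank and Gaza,119
2020-03-31,Zambia,35
2020-03-31,Zimbabwe,8
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# unique() keeps each date once, in order of first appearance
list_dates = df["Date"].unique()
print(len(df), "rows,", len(list_dates), "distinct dates")
print(list(list_dates))
```

Because the dates are kept in order of first appearance, the frames of the animation follow the order of the rows in the file.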

3. Pick five countries and create an ax object

Take a list of five countries. (You can choose whatever countries you prefer, or even increase or decrease the number of countries). I have also taken a list of five colors for the bars of each country. (You can change this too if you like). One important line of code here is: fig, ax = plt.subplots(figsize=(15, 8)). This is needed to create an ax object.
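One thing to keep in mind is that a plain list of colors is applied by bar position, so if the order of the bars changes from frame to frame, a country's color changes with it. A possible way to keep one fixed color per country is a dictionary lookup. The names color_by_country and colors_for are hypothetical and are not used in the script in this article.

```python
list_countries = ['India', 'China', 'US', 'Italy', 'Spain']
list_colors = ['black', 'red', 'green', 'blue', 'yellow']

# Pair each country with its color once, up front
color_by_country = dict(zip(list_countries, list_colors))


def colors_for(countries):
    """Return the colors for the given countries, in the given order."""
    return [color_by_country[name] for name in countries]


# The order of the bars can change; the color of each country does not
print(colors_for(['US', 'Italy', 'China']))  # → ['green', 'blue', 'red']
```

In a plotting call, the result of colors_for(df4['Country']) could be passed as the color argument instead of the fixed list.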

4. Write the callback function

If you want to do animation in Matplotlib, you need to create an object of a class called matplotlib.animation.FuncAnimation. The signature of this class is available online. The constructor of this class, apart from other parameters, also takes a parameter called func, and you have to give this parameter a callback function. So in this step, we will write the callback function, which is repeatedly called in order to render the animation.

5. Create FuncAnimation object

This step has partly been explained in the previous step.

Our code to create an object of this class is:

my_anim = animation.FuncAnimation(fig = fig, func = plot_bar,
                                  frames= list_dates, blit=True,
                                  interval=20)

The three important parameters to be given are:

  • fig, which must be given a fig object, which we created earlier.
  • func, which must be the callback function.
  • frames, which must contain the variable on which the animation is to be done. Here in our case, it will be the list of dates we created earlier.
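To see how fig, func, and frames fit together, here is a stripped-down sketch that is separate from the full script. It assumes a dictionary counts_by_date built from the sample rows printed earlier, a callback named draw_frame (both hypothetical names), and the Agg backend so that it runs without a display. It leaves out blit to keep the example short.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; in Jupyter use %matplotlib notebook
import matplotlib.animation as animation
import matplotlib.pyplot as plt

# Confirmed counts from the sample output shown earlier in this article
counts_by_date = {
    "2020-01-22": {"Afghanistan": 0, "Albania": 0, "Algeria": 0},
    "2020-03-31": {"West Bank and Gaza": 119, "Zambia": 35, "Zimbabwe": 8},
}

fig, ax = plt.subplots()


def draw_frame(some_date):
    """Callback: redraw the bars for one entry of the frames list."""
    ax.clear()  # remove the bars of the previous frame
    counts = counts_by_date[some_date]
    bars = ax.barh(list(counts.keys()), list(counts.values()))
    ax.set_title(some_date)
    return bars


# FuncAnimation calls draw_frame once per element of frames
anim = animation.FuncAnimation(fig, draw_frame,
                               frames=list(counts_by_date),
                               interval=500, repeat=False)
```

Each element of frames is passed to the callback as its first argument, which is why the callback in this article takes a single date.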

6. Save the animation to an mp4 file

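You can save the animation to an mp4 file, but for this you need ffmpeg. With conda (for example, alongside Jupyter) you can install it using conda install -c conda-forge ffmpeg. Note that pip install ffmpeg-python installs only a Python wrapper around ffmpeg, not the ffmpeg program itself, so the program still has to be installed separately.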

And finally, you can run the animation using plt.show(). Please note that on many platforms, ffmpeg may not work properly and may require further "tweaking."
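Since ffmpeg is not always present, it can help to check for it before saving. The sketch below is one possible approach, not part of the original script: choose_writer is a hypothetical helper that falls back to Matplotlib's Pillow writer and a GIF file when ffmpeg is not found.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the check runs without a display
import matplotlib.animation as animation


def choose_writer():
    """Return (writer_name, file_suffix) based on what is installed."""
    if animation.writers.is_available("ffmpeg"):
        return "ffmpeg", ".mp4"
    # Pillow can write animated GIFs and needs no external program
    return "pillow", ".gif"


writer_name, suffix = choose_writer()
print(f"Saving with writer={writer_name!r} to a file ending in {suffix}")

# With an existing animation object this could then be used as, for example:
# my_anim.save("covid_animation" + suffix, writer=writer_name, fps=30)
```

The extra_args used for the mp4 file in the script below are specific to ffmpeg, so they would be left out when the Pillow writer is chosen.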

%matplotlib notebook
# Author:- Anurag Gupta # email:- 999.anuraggupta@gmail.com
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from time import sleep
#### ---- Step 1:- Download data
URL_DATASET = r'https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv'
df = pd.read_csv(URL_DATASET, usecols = ['Date', 'Country', 'Confirmed'])
# print(df.head(3)) # uncomment this to see output
#### ---- Step 2:- Create list of all dates
list_dates = df['Date'].unique()
# print(list_dates) # Uncomment to see the dates
#### --- Step 3:- Pick 5 countries. Also create ax object
fig, ax = plt.subplots(figsize=(15, 8))
# We will animate for these 5 countries only
list_countries = ['India', 'China', 'US', 'Italy', 'Spain']
# colors for the 5 horizontal bars
list_colors = ['black', 'red', 'green', 'blue', 'yellow']
### --- Step 4:- Write the callback function
# plot_bar() is the callback function used by the FuncAnimation object
def plot_bar(some_date):
    # Keep only the rows for the date of the current frame
    df2 = df[df['Date'].eq(some_date)]
    ax.clear()
    # Sort by the Confirmed column in descending order
    df3 = df2.sort_values(by = 'Confirmed', ascending = False)
    # Keep only the five chosen countries
    df4 = df3[df3['Country'].isin(list_countries)]
    # print(df4) # Uncomment to see that the data is only for 5 countries
    sleep(0.2) # To slow down the animation
    # ax.barh() makes a horizontal bar plot.
    return ax.barh(df4['Country'], df4['Confirmed'], color= list_colors)
###----Step 5:- Create FuncAnimation object---------
my_anim = animation.FuncAnimation(fig = fig, func = plot_bar,
                                  frames= list_dates, blit=True,
                                  interval=20)
### --- Step 6:- Save the animation to an mp4
# Place where to save the mp4. Give your file path instead
path_mp4 = r'C:\Python-articles\population_covid2.mp4'
# my_anim.save(path_mp4, fps=30, extra_args=['-vcodec', 'libx264'])
my_anim.save(filename = path_mp4, writer = 'ffmpeg',
             fps=30,
             extra_args= ['-vcodec', 'libx264', '-pix_fmt', 'yuv420p'])
plt.show()