Beautiful Data Visualization in Python

Aboli Marathe
Aug 8, 2020

Giving EDA a glam makeover.

This week our new admits to PISB started their journey in machine learning with their first task, Exploratory Data Analysis (EDA), and we were amazed by how much they learnt in just 3 days. The students were mentored by Aboli Marathe and Ayush Das. Out of all the contributors, Kirti Palve and Neil Deshpande stood out with an excellent submission that wowed our judges. Here we would like to present their work in the form of a short tutorial covering the basics of EDA. For the challenge we used the Student Performance dataset available on Kaggle.

We will start by importing the necessary libraries. In this tutorial we will be using some of the most popular libraries in Python: NumPy, Pandas, Matplotlib, Seaborn and SciPy.

Next, we load our dataset into a Pandas dataframe. In this work a plain pandas read_csv() call was enough, but if your CSV or Excel sheet throws a UnicodeDecodeError while reading, you can pass encoding="ISO-8859-1" or encoding="UTF-8" as an additional argument.
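As a minimal sketch of that fallback (the file name, the sample rows and the column name are made up for illustration, not taken from the Kaggle file), the loading step could look like this:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real file: a tiny CSV saved in ISO-8859-1
# with one non-ASCII character, so that a default UTF-8 read fails.
csv_text = "name,math score\nZoë,72\nAmit,65\n"
path = os.path.join(tempfile.mkdtemp(), "students_sample.csv")
with open(path, "w", encoding="ISO-8859-1") as f:
    f.write(csv_text)

try:
    df = pd.read_csv(path)  # pandas decodes as UTF-8 by default
except UnicodeDecodeError:
    # Retry with the encoding the file was actually written in
    df = pd.read_csv(path, encoding="ISO-8859-1")

print(df)
```

Trying the default first and only then falling back keeps the common case simple.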

Let’s look at the basics: the features, the shape and the summary statistics of the data. To start, we will inspect the data by viewing the first 5 rows.

Then we will check the shape, i.e. number of rows and columns in the dataset.

Here we can check summary statistics such as the count, mean and maximum, which describe the central tendency, dispersion and shape of the dataset’s distribution, excluding NaN values.
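The three inspection steps above can be sketched as follows. The rows are invented sample data, and the column names are assumed to match the Kaggle file:

```python
import pandas as pd

# Hypothetical sample rows; column names are an assumption.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "math score": [72, 69, 90, 47, 76, 71],
    "reading score": [72, 90, 95, 57, 78, 83],
    "writing score": [74, 88, 93, 44, 75, 78],
})

first_rows = df.head()      # first 5 rows
n_rows, n_cols = df.shape   # (number of rows, number of columns)
summary = df.describe()     # count, mean, std, min, quartiles, max of numeric columns

print(first_rows)
print(n_rows, n_cols)
print(summary)
```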

Now we will look at the distribution graphically, by plotting the features individually.
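One simple way to do this, assuming histograms are an acceptable view of each feature, is pandas' built-in hist(). The data below is invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample scores; column names are an assumption.
df = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76, 71],
    "reading score": [72, 90, 95, 57, 78, 83],
    "writing score": [74, 88, 93, 44, 75, 78],
})

# One histogram per numeric column, laid out side by side
axes = df.hist(bins=5, figsize=(9, 3), layout=(1, 3))
plt.tight_layout()
```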

Checking for missing data is really important, so let’s do that before going further. Luckily we got a perfect dataset without any missing values, but if yours has any, you can use dropna(), fillna() or a more interesting approach like KNN imputation, depending on the type of dataset you have.
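Since the real dataset has no gaps, here is a small sketch on invented data with two missing values, showing the count of missing entries per column and the two simplest remedies:

```python
import numpy as np
import pandas as pd

# Hypothetical data with deliberately missing values
df = pd.DataFrame({
    "math score": [72, np.nan, 90, 47],
    "reading score": [72, 90, np.nan, 57],
})

missing_per_column = df.isnull().sum()  # number of NaN values in each column
dropped = df.dropna()                   # drop every row that has a NaN
filled = df.fillna(df.mean())           # or fill NaN with the column mean

print(missing_per_column)
```

Filling with the mean is only one choice; for KNN imputation, scikit-learn's KNNImputer is one option.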

To find the number of values, or in this case students, falling in each category, we can count how often each label occurs in a column.
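A likely candidate for this step is pandas' value_counts(). The sketch below uses invented rows and an assumed column name:

```python
import pandas as pd

# Hypothetical sample; the column name is an assumption.
df = pd.DataFrame({
    "race/ethnicity": ["group B", "group C", "group C",
                       "group A", "group C", "group B"],
})

# Number of students per category, most frequent first
counts = df["race/ethnicity"].value_counts()
print(counts)
```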

Let’s make a cool nested plot describing the race and gender representation in the data set.
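A nested plot of this kind can be sketched with two Matplotlib pie charts drawn as rings. Which variable sits on which ring is our assumption here, and the rows are invented:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; column names are an assumption.
df = pd.DataFrame({
    "race/ethnicity": ["group A", "group A", "group B", "group B",
                       "group B", "group C", "group C", "group C"],
    "gender": ["female", "male", "female", "female",
               "male", "male", "female", "male"],
})

# groupby sorts its keys, so the inner slices line up with the outer ones
outer = df.groupby("race/ethnicity").size()
inner = df.groupby(["race/ethnicity", "gender"]).size()

fig, ax = plt.subplots()
ring = dict(width=0.3, edgecolor="w")  # a width below the radius leaves a hole
ax.pie(outer, radius=1.0, labels=list(outer.index), wedgeprops=ring)
ax.pie(inner, radius=0.7, labels=[g for _, g in inner.index],
       labeldistance=0.75, wedgeprops=ring)
ax.set(aspect="equal", title="Race/ethnicity (outer) and gender (inner)")
```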

Now we will process the data a bit, so as to extract meaningful information that can be easily interpreted. We will replace values that belong to minority categories with main labels, and find measures like averages and ranges.
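As a rough sketch of this processing step: the label mapping and the bin edges below are our own illustrative choices, not necessarily the ones used in the submission.

```python
import pandas as pd

# Hypothetical sample; column names are an assumption.
df = pd.DataFrame({
    "parental level of education": ["some college", "high school",
                                    "some high school", "bachelor's degree"],
    "math score": [72, 40, 90, 59],
    "reading score": [72, 45, 95, 61],
    "writing score": [74, 38, 93, 60],
})

# Fold the smaller categories into a main label (illustrative mapping)
df["parental level of education"] = df["parental level of education"].replace(
    {"some high school": "high school", "some college": "college"}
)

# Average of the three scores, then bucket it into ranges
score_cols = ["math score", "reading score", "writing score"]
df["average"] = df[score_cols].mean(axis=1)
df["range"] = pd.cut(
    df["average"],
    bins=[0, 40, 60, 80, 100],
    labels=["0-40", "41-60", "61-80", "81-100"],
    include_lowest=True,
)
print(df[["parental level of education", "average", "range"]])
```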

Let’s visualize these newly sorted ranges in a stacked bar chart.

The chart shows how each gender is represented across the score ranges in the tests taken by the students.
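One way to build such a chart, assuming a "range" column like the one described above already exists, is to cross-tabulate range against gender and plot the table as stacked bars. The rows here are invented:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Hypothetical sample; the "range" column stands in for the binned averages.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "range": ["61-80", "41-60", "81-100", "41-60", "61-80", "61-80"],
})

# Rows are score ranges, columns are genders, cells are student counts
table = pd.crosstab(df["range"], df["gender"])

ax = table.plot(kind="bar", stacked=True)
ax.set_xlabel("average score range")
ax.set_ylabel("number of students")
```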

We have also converted some categorical labels to numeric values, and created a correlation heat map to find the hidden relationships in the data.
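A minimal sketch of this step, assuming gender is one of the encoded labels (the 0/1 coding and the sample rows are ours):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical sample; column names are an assumption.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "math score": [72, 69, 90, 47, 76, 71],
    "reading score": [72, 90, 95, 57, 78, 83],
    "writing score": [74, 88, 93, 44, 75, 78],
})

# Encode the categorical label as a number (arbitrary coding)
df["gender_code"] = df["gender"].map({"female": 0, "male": 1})

# Pairwise Pearson correlations of the numeric columns
corr = df.drop(columns="gender").corr()

ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```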

We can see that students who are good at reading are also good at writing.

Another method of looking at the values is a scatter plot, which shows the values of different features over different axes.

To find out how test preparation affects the test scores, we have drawn a density plot of the math scores in the data set.

Taking the preparatory course does not seem to have a significant effect on students’ math scores.

To visualize the 3 scores simultaneously, we shall use scatter plots.
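One possible layout, assuming a panel per pair of subjects, uses three Matplotlib scatter plots side by side. The data is invented:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; column names are an assumption.
df = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76, 71],
    "reading score": [72, 90, 95, 57, 78, 83],
    "writing score": [74, 88, 93, 44, 75, 78],
})

pairs = [
    ("math score", "reading score"),
    ("math score", "writing score"),
    ("reading score", "writing score"),
]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (x, y) in zip(axes, pairs):
    ax.scatter(df[x], df[y], alpha=0.7)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
fig.tight_layout()
```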

Writing and reading scores are more strongly correlated with each other than with the math score.

To check out the outliers and distributions in detail we can use box plots.

The outliers appear as the individual points that lie beyond the whiskers of the box plots.
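To make the whisker rule concrete, here is a sketch on invented scores. It draws a box plot with Matplotlib's default whisker length of 1.5 times the interquartile range (IQR) and lists the points that fall outside those limits:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical scores with one unusually low value
df = pd.DataFrame({"math score": [8, 55, 60, 62, 65, 66, 68, 70, 72, 75]})

fig, ax = plt.subplots()
ax.boxplot(df["math score"].to_numpy(), whis=1.5)
ax.set_ylabel("math score")

# The same rule by hand: points beyond 1.5 * IQR from the quartiles
q1, q3 = df["math score"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["math score"] < q1 - 1.5 * iqr) | (df["math score"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```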

For understanding the relationship between features, we can use pair plots to visualize them in detail.

If you observe the data set, you can see scores of students marked for 3 subjects. Thus, we can exploit this feature, convert the scores to grades, and visualize the results.
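A sketch of the conversion with pandas' cut(); the grade boundaries here are hypothetical, so substitute whatever scale your school uses:

```python
import pandas as pd

# Hypothetical scores; the column name is an assumption.
df = pd.DataFrame({"math score": [72, 40, 90, 39, 100]})

# Illustrative boundaries: [0, 40) F, [40, 60) C, [60, 80) B, [80, 100] A
df["math grade"] = pd.cut(
    df["math score"],
    bins=[0, 40, 60, 80, 101],
    labels=["F", "C", "B", "A"],
    right=False,  # each bin includes its lower edge and excludes its upper edge
)
print(df)
```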

For a creative representation of the features we can use Joint Plots as well.

Finally, to end the analysis, we made a small gif that shows all the students’ math scores and grades dynamically, which could be used by teachers to see the results.
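We do not reproduce the original gif here. As one possible approach, Matplotlib's FuncAnimation can reveal one student's score per frame and save the result with PillowWriter. The student labels, scores and file name are made up:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

# Hypothetical students and math scores
students = ["S1", "S2", "S3", "S4"]
math_scores = [72, 69, 90, 47]

fig, ax = plt.subplots()
bars = ax.bar(students, [0] * len(students))
ax.set_ylim(0, 100)
ax.set_ylabel("math score")

def update(frame):
    # Show the scores of students 0..frame and keep the rest at zero
    for i, bar in enumerate(bars):
        bar.set_height(math_scores[i] if i <= frame else 0)
    return bars

anim = FuncAnimation(fig, update, frames=len(students), interval=500)

gif_path = os.path.join(tempfile.mkdtemp(), "math_scores.gif")
anim.save(gif_path, writer=PillowWriter(fps=2))
print(gif_path)
```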

Overall Conclusions

  • Students with ‘standard’ lunch performed better than those with ‘free’ lunch (probably due to better food habits and a balanced diet).
  • Female students performed better than male students.
  • Students who scored high were children of parents with college degrees.
  • High reading scores usually meant high writing scores and vice-versa.

References

Thanks for reading this article. The complete code is available on GitHub, with even more visualizations. We thought that this was an interesting EDA and we hope that you liked our work!

Aboli Marathe

Machine Learning Engineer @ Omdena| AI for Social Good | Writer @ The Innovation