Exploring Nobel Prize winners with R: how to visualise data using ggplot2
Data visualisation is the art of converting raw numbers into useful information. It helps businesses to extract valuable insights from an ever increasing amount of data.
In the same fashion, being able to share stories with a non-technical audience is the aim of every BI/Data Analyst and today there are a number of BI tools that support these needs – but they can be costly. On the other hand, R programming offers a set of free inbuilt libraries and functions to extract, manipulate data and create elegant visualisations.
In this tutorial, I will show how to load the Nobel Prize Winners dataset to the R Studio environment in two different ways, and transform it using dplyr to build appealing graphs with the help of ggplot2.
To make the tutorial more engaging, I had to choose a dataset on a topic that was very well known, represented a topical subject and that could generate valuable conclusions when analysed. Searching on kaggle.com, I found an interesting dataset on Nobel Prize laureates between 1901 and 2016 and I immediately thought that it met all the characteristics I was looking for (well known, topical, relevant).
The 2018 Ceremony will take place on the 10th of December in Stockholm and Oslo (where the Nobel Peace Prize is awarded). For this reason, I decided to update the dataset to include all the laureates from the last two years. I also added the age of the winners (at the time of the award) as well as five different age ranges.
The final dataset contains 935 observations (split into 908 Laureates and 27 Organisations) and 18 variables. As the official website states, “A small number of individuals and organisations have been honoured more than once, which means that 904 individuals and 24 unique organisations have received the Nobel Prize in total. ”
The first step consists of removing the pre-existent objects from R Studio environment and loading the three libraries that will be used (DBI, dplyr and ggplot2). Remember to install them first, using the install.packages() command.
The easiest way to read data in R is by using the read.csv() function, specifying that a header is present and the strings should not all be treated as factors. A more general function to read a wider pool of formats is read.table() but, in this case, the separator character needs to be added.
Once read, the dataset is saved in the nobel_prize_winners object. The variable names can be displayed using the names() function, while the structure of the object can be explored using head() and str().
You could also decide to attach() a specific dataset to the environment to avoid retyping the data frame name over and over again. Only three variables (Gender, Category and Age_Range) are then converted into factors and our object is now ready to be manipulated:
Let’s suppose for a moment that the dataset you planned to use was not available in a csv format but stored in a table hosted in a relational database instead – and this is often the case in the industry. How could data be loaded into R? One of the many available options is to create a connection with a specific server through the dbConnect() function from the DBI package and then to query from a table using dbGetQuery() that includes a simple SELECT * FROM SQL statement. In the following example, I show how to create a connection with MySQL server. In this case, a host address is not necessary but usually you will need it in a business context.
When the connection is successful, no error is returned. Exploring the table contained in the data_analysis database is possible through the dbListTables() command (in this case, as you can see only the np_data table was found).
When a Postgres DB is used instead, the server should be replaced with RPostgres::Postgres(). If, like us at Marketinvoice, you are using AWS Redshift you should follow the instruction on Amazon’s website.
After loading the dataset in R Studio, I used dplyr and ggplot2 to answer the following questions:
1. Which country gave birth to the highest number of Nobel Prize laureates? Has the trend changed over time?
In order to answer the first question, I created a new dataset nb_grouped showing the total number of Nobel Prizes assigned by Birth Country (n_prizes) and the split between prizes awarded to males (n_male) and prizes awarded to females (n_female). Only the top 10 countries in the ranking have been returned:
Doing this with dplyr is extremely intuitive: the function select() has been used to select only three variables and those have been grouped by Birth_Country. Two new columns have then been created with mutate() and eventually a count and a sum have been applied through summarize(). Note the use of the pipe operator (%>%) to connect the actions. When displayed, the new data frame looks like the following:
The nb_grouped dataset can now be used to build the first visualisation with ggplot2. At first sight, the code that follows might seem complicated but in fact, the basic syntax always consists of three layers: data + aesthetics + geometries. In this case, inside aes() I specified Birth Country on the x axis, n_prizes on the y axis and mapped the fill options to the Birth Country variable (a different filling for each country). By using geom_bar(), I have chosen to plot a bar chart with a grey outline (col = “grey25”).
The real power of ggplot stems from the fact that you could keep on adding as many layers as you wish (including multiple geoms) to improve the quality and detail of the visualisation. For example, the plot title as well as the axis labels can easily be added using labs(), xlab() and ylab(), while its elements can be altered inside the theme() layer.
Once the script is executed, the following visualisation is displayed:
The USA is clearly the country that gave birth to the highest number of Nobel Prize laureates, followed by the United Kingdom and Germany (the Birth_Country variable has been updated to display the name of the country based on the current national borders).
In order to analyse whether the trend has changed over time, I saved the top 10 countries in a separate object, that has then been used in the filter() function to generate the nb_top_countries dataframe:
To visualise the Nobel Prize distribution over time, I have used geom_point() which generates a scatter plot with Year on the x axis and Birth_Country on the y axis. Note that by using this type of geom, a color is assigned through the col() aesthetic, instead of the fill() one. A position equal to “jitter” has been specified to add some random noise and avoid overplotting, while the the aesthetic alpha() has been used to control the transparency of the elements in the plot:
While the trend of laureates born in the United Kingdom and in Germany appears to be fairly consistent over time, with a balanced number of winners in the period 1901-2018, other countries like Canada, Japan and the USA have been giving birth to an incremental number of laureates. This is particularly true for the USA where, 177 out of 271 Nobel Prizes (65.3%) have been assigned after 1975.
2. Is the average age of Nobel Prizes Winners similar across categories? Has the trend changed over time?
To verify whether the average age range of laureates is similar across the six categories (Chemistry, Economics, Literature, Medicine, Peace, Physics) or an evident difference can be spotted, I have plotted a histogram for each category displaying the Age_Range on the x axis and the count of laureates falling in a specific range on the y axis. The factor has been reordered to display younger laureates first. The code behind the plot that follows is very similar to the one used for the first visualisation, with the difference that this time facet_wrap() has been added to display a single bar chart for each Category. In general, faceting is used to divide a plot in subplots based on the values of one or more discrete variables. Moreover, a new palette has been selected inside scale_fill_brewer() to change the standard bar filling (you can find a complete list of palettes here).
It seems like the two age ranges in which the Nobel Prize winners tend to fall with a higher frequency are 45-59 and 60-74, with an overall average age of 57. However, the plot offers additional insights across categories:
- The Nobel Prize for Economic Sciences has never been assigned to individuals falling in the 30-44 age range or younger.
- All the individuals younger than 30 who won the Nobel Peace Prize are women.
- The Physics category presents the highest proportion of laureates younger than 45 (31.5% or 66 out of 210 awarded prizes).
- At a high level, sciences (Chemistry, Physics, Medicine) seems to have younger prize winners than arts / social sciences (Literature, Peace, Economics).
To verify whether the average age has changed over time across categories, I calculated the field avg_age using mean() inside the summarize() function and excluded the missing values by setting na.rm = TRUE (no age is assigned to Organisations):
The results have been grouped by Category and Year and nb_by_age has been plotted using a combination of geom_point(), geom_smooth() and geom_hline().The first geometry generates a scatterplot, the second adds a trend line following a linear model, while the third displays the average age for each category in the period 1901-2018:
The visualisation obtained using a facet_wrap() function shows that:
- The average age of Nobel Prize laureates in Physics has been growing noticeably over time but is still the category with the lowest mean.
- A similar tendency can be identified for Chemistry, Medicine, Literature and Economics Nobel Prizes, despite the trend line slope varying across categories and not being as clear in some of these.
- The Nobel Peace Prize category is the only one for which the average age has been decreasing over time.
3. Has the proportion of winners (by gender) changed across categories over time? Could these changes be connected with major political/social factors?
If you have been so brave (…or curious) to reach this point of the tutorial, the aim of the last visualisation is to explore how the proportion of male and female laureates has changed across categories over time. To create this plot, the now well known geom_point() has been used in combination with facet_grid(). I took advantage of the latter function to facet into rows based on Category:
This plot is particularly powerful in highlighting a large number of details about the Nobel Prize winners and the historical relationship between genders, and in particular that:
- All the 27 Nobel Prizes assigned to an Organisation, fall in the Nobel Peace Prize category. The dataset contains 24 unique Organisations, of which 3 have been awarded twice.
- Apart from the Organisations, 856 prizes were assigned to (853 unique) males and 52 to (51 unique) females. Marie Curie was the first woman to be awarded in 1903 (Physics Prize) and the only woman to be awarded twice. The number of female laureates has grown noticeably after 1975 (in 2009 out of 13 winners, 5 were women), although it remains the case that a large majority of prizes are awarded to men.
- So far, women seem more likely to be awarded prizes in the Peace, Literature and Medicine fields, categories in which fall 43 out of 52 Nobel Prizes (82.7%) for women, followed by Chemistry, Physics and Economics (where there is a sole female laureate).
- The Nobel prize in Economic Sciences was only created in 1968 and assigned in 1969 for the first time.
- There are some years when the Nobel Prizes have not been awarded for some or all the categories and as expected, most of them during World War I (1914-1918) and II (1939-1945). Because of this, the Nobel Peace Prize has been the most affected, as it remained unassigned on 19 occasions.
In this tutorial, I have shown how easy and convenient it is to load, manipulate and visualise data using R (particularly if you think that this software environment is open source!). For this scope, I have used the Nobel Prize winners dataset that has been first read as a csv file, exploiting the standard R functions or alternatively queried from a MySQL database using DBI package capabilities. After that, dplyr verbs (select, filter, group by, arrange, summarize) have been applied to manipulate the dataset, readying it to be visualised with ggplot2.
Ggplot2 (basic) syntax takes advantage of layering (data + aesthetics + geometries) to generate appealing plots quickly and effectively. It could be a powerful solution to explore large datasets for many data driven businesses that, like Marketinvoice, use a number of BI and data analysis tools to meet their need for valuable insights.
If you’d like to leave your feedback about this tutorial get in touch with me by emailing firstname.lastname@example.org