Evaluating the Rise and Fall of Programming Languages Using the Tidyverse
With a myriad of programming languages and technologies at our disposal today, every language has its own peculiarities and complexity. Knowing which languages are growing and which are shrinking gives a better idea of which ones are worth investing time in.
In this DataCamp project, the main objective is to take data extracted from Stack Overflow and visualize it using ggplot2, dplyr and readr to get an approximate idea of how many people are using a particular technology. This is done by looking at the number of questions asked about each technology on Stack Overflow.
For those unfamiliar with Stack Overflow, it is a programming question-and-answer site with more than 16 million questions on a wide range of programming topics.
Every Stack Overflow question has one or more tags that categorize it by technology. For example, there are tags for languages like Python or Java and also for packages like ggplot2 and pandas.
To begin with, I loaded the readr, dplyr and ggplot2 packages. I am going to use the open data from the Stack Exchange Data Explorer. This dataset contains the number of questions asked for a particular tag in a given year and the total number of questions asked in that year.
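As a rough sketch of this loading step: the code below writes a tiny stand-in CSV to a temporary file and reads it back with read_csv. The column names (year, tag, number, year_total) and the file itself are my assumptions for illustration. Only the .net row for 2008 uses figures quoted in this post; the other rows are invented.

```r
library(readr)
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the real export: a tiny CSV written to a temp file.
# Assumed columns: year, tag, number, year_total.
# Only the ".net" / 2008 row uses numbers from the post; the rest are made up.
csv_lines <- c(
  "year,tag,number,year_total",
  "2008,.net,5910,58390",
  "2008,r,8,58390",
  "2009,.net,12000,343000",
  "2009,r,520,343000"
)
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# col_types "icii" means: integer, character, integer, integer
by_tag_year <- read_csv(csv_path, col_types = "icii")

# Quick look at the column names, types and first values
glimpse(by_tag_year)
```

With the real data, csv_path would simply point at the downloaded file instead of a temporary one.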
Disclaimer: The Stack Overflow data gives the number of questions asked about a particular technology. This does not necessarily indicate the technology's popularity and may instead reflect how difficult it is. If we also had the number of users who submitted those questions, the assumption that more questions means higher popularity would be more reliable. The focus of this project is therefore to take a sample of complex, publicly available data and visualize it with the tidyverse so that trends are easier to see.
Here is a glimpse of the dataset:
The table above has one observation for each pair of tag and year, showing the number of questions asked for that tag in that year. For example, there were 5910 questions tagged .net out of 58390 total questions in 2008. Instead of analyzing only the raw counts, we can add a fraction column to the table that gives the share of that year's questions carrying a specific tag.
Since I am doing this project in R, I would like to dig deeper into the popularity of the R language and its packages.
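A minimal sketch of adding that column with dplyr's mutate. The column names and the data frame name by_tag_year_fraction are assumptions, and every row except .net in 2008 is invented for illustration.

```r
library(dplyr)

# Toy data; only the ".net" / 2008 row uses numbers quoted in the post.
by_tag_year <- data.frame(
  year       = c(2008, 2008, 2009, 2009),
  tag        = c(".net", "r", ".net", "r"),
  number     = c(5910, 8, 12000, 520),
  year_total = c(58390, 58390, 343000, 343000),
  stringsAsFactors = FALSE
)

# fraction = share of that year's questions that carry this tag
by_tag_year_fraction <- by_tag_year %>%
  mutate(fraction = number / year_total)

print(by_tag_year_fraction)
```

For the .net row this works out to 5910 / 58390, or roughly 10% of the questions asked in 2008.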
Has the R language been growing or shrinking?
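One way to answer this is to keep only the rows for the r tag and draw the fraction as a line over time. This is a sketch under assumptions: the tag is taken to be spelled "r" in lowercase, the column names are assumed, and the values are invented rather than real Stack Overflow figures.

```r
library(dplyr)
library(ggplot2)

# Toy data with a precomputed fraction column (all values are made up).
by_tag_year_fraction <- data.frame(
  year     = rep(2015:2018, times = 2),
  tag      = rep(c("r", "c#"), each = 4),
  fraction = c(0.010, 0.012, 0.014, 0.016,
               0.060, 0.055, 0.050, 0.045),
  stringsAsFactors = FALSE
)

# Keep only the rows for the R tag
r_over_time <- by_tag_year_fraction %>%
  filter(tag == "r")

# Line plot of the share of questions tagged r, by year
r_plot <- ggplot(r_over_time, aes(x = year, y = fraction)) +
  geom_line() +
  labs(x = "Year", y = "Fraction of questions tagged r")

print(r_plot)
```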
From the graph above, we can see that R has been growing over the last decade, woot!
Here is a snapshot of the r_over_time table:
Next, I was excited to see what the data had to say about dplyr and ggplot2, both of which I have already used in this project.
From the above graph, I saw that dplyr and ggplot2 may not have as many questions as R, but they seem to be growing rapidly.
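A sketch of how several tags can be compared at once: filter with %in% and map the tag to the line color. The names selected_tags and selected_tags_over_time are my own, the column names are assumed, and the numbers are invented.

```r
library(dplyr)
library(ggplot2)

# Toy data (all values are made up for illustration).
by_tag_year_fraction <- data.frame(
  year     = rep(c(2016, 2017, 2018), times = 4),
  tag      = rep(c("r", "dplyr", "ggplot2", "python"), each = 3),
  fraction = c(0.0120, 0.0140, 0.0160,
               0.0008, 0.0012, 0.0016,
               0.0015, 0.0018, 0.0021,
               0.0500, 0.0600, 0.0700),
  stringsAsFactors = FALSE
)

# Tags to compare
selected_tags <- c("r", "dplyr", "ggplot2")

# Keep only rows whose tag is in the chosen set
selected_tags_over_time <- by_tag_year_fraction %>%
  filter(tag %in% selected_tags)

# One line per tag, distinguished by color
tags_plot <- ggplot(selected_tags_over_time,
                    aes(x = year, y = fraction, color = tag)) +
  geom_line()

print(tags_plot)
```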
Using the group_by and summarize functions, let’s now visualize the popularity of some of the most asked-about tags over time.
Impressive! From the above graph, we can see that C# is getting fewer questions than it used to, while Python has climbed the chart over the last few years.
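A rough sketch of that step: total the questions per tag, sort in descending order, keep the top few tags, and plot their fractions. The names sorted_tags, highest_tags and by_tag_subset, the column names, and the cutoff of three tags are all assumptions for illustration, and the counts are invented.

```r
library(dplyr)
library(ggplot2)

# Toy data (all counts and totals are made up).
by_tag_year <- data.frame(
  year       = rep(c(2008, 2018), times = 4),
  tag        = rep(c("c#", "java", "python", "r"), each = 2),
  number     = c(7000, 4000,  4000, 6000,  2000, 8500,  10, 3000),
  year_total = rep(c(50000, 100000), times = 4),
  stringsAsFactors = FALSE
)

by_tag_year_fraction <- by_tag_year %>%
  mutate(fraction = number / year_total)

# Total questions per tag across all years, largest first
sorted_tags <- by_tag_year %>%
  group_by(tag) %>%
  summarize(tag_total = sum(number)) %>%
  arrange(desc(tag_total))

# Names of the three most asked-about tags (3 is an arbitrary cutoff)
highest_tags <- head(sorted_tags$tag, 3)

# Keep only those tags and plot one line per tag
by_tag_subset <- by_tag_year_fraction %>%
  filter(tag %in% highest_tags)

top_plot <- ggplot(by_tag_subset, aes(x = year, y = fraction, color = tag)) +
  geom_line()

print(top_plot)
```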
Stack Overflow’s data is an extremely rich resource. It lets us easily analyze how any technology or programming language has fared over time. Given how reliable R and its libraries are, we can now obtain insights about any technology.
Since this code is highly reusable, it can serve as a tool for deriving insights about any technology or programming language just by changing the tag names.
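To illustrate that reusability, the filtering and plotting steps can be wrapped in a small helper that takes the tag names as an argument. The function plot_tag_trends is hypothetical, not part of the original project, and it assumes a data frame with year, tag and fraction columns. The values below are invented.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: plot the yearly fraction for any set of tags.
# Assumes `data` has columns year, tag and fraction.
plot_tag_trends <- function(data, tags) {
  data %>%
    filter(tag %in% tags) %>%
    ggplot(aes(x = year, y = fraction, color = tag)) +
    geom_line()
}

# Toy data (all values are made up).
by_tag_year_fraction <- data.frame(
  year     = rep(c(2016, 2017, 2018), times = 3),
  tag      = rep(c("python", "c#", "r"), each = 3),
  fraction = c(0.050, 0.060, 0.070,
               0.060, 0.055, 0.050,
               0.012, 0.014, 0.016),
  stringsAsFactors = FALSE
)

# Changing the tag vector is all it takes to look at other technologies
trend_plot <- plot_tag_trends(by_tag_year_fraction, c("python", "c#"))
print(trend_plot)
```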