How can we tell what programming languages and technologies are used by the most people? How about what languages are growing and which are shrinking, so that we can tell which are most worth investing time in?
One excellent source of data is Stack Overflow, a programming question and answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, we can get an approximate sense of how many people are using it. We're going to use open data from the Stack Exchange Data Explorer to examine the relative popularity of languages like R, Python, Java and Javascript have changed over time.
Each Stack Overflow question has a tag, which marks a question to describe its topic or technology. For instance, there's a tag for languages like R or Python, and for packages like ggplot2 or pandas.
We'll be working with a dataset with one observation for each tag in each year. The dataset includes both the number of questions asked in that tag in that year, and the total number of questions asked in that year.
# Load libraries
library(readr)
library(dplyr)
# Load dataset
by_tag_year <- read_csv("datasets/by_tag_year.csv")
# Inspect the dataset
by_tag_year
Parsed with column specification:
cols(
year = ďż˝[32mcol_double()ďż˝[39m,
tag = ďż˝[31mcol_character()ďż˝[39m,
number = ďż˝[32mcol_double()ďż˝[39m,
year_total = ďż˝[32mcol_double()ďż˝[39m
)
year | tag | number | year_total |
---|---|---|---|
<dbl> | <chr> | <dbl> | <dbl> |
2008 | .htaccess | 54 | 58390 |
2008 | .net | 5910 | 58390 |
2008 | .net-2.0 | 289 | 58390 |
2008 | .net-3.5 | 319 | 58390 |
2008 | .net-4.0 | 6 | 58390 |
2008 | .net-assembly | 3 | 58390 |
2008 | .net-core | 1 | 58390 |
2008 | 2d | 42 | 58390 |
2008 | 32-bit | 19 | 58390 |
2008 | 32bit-64bit | 4 | 58390 |
2008 | 3d | 73 | 58390 |
2008 | 64bit | 149 | 58390 |
2008 | abap | 10 | 58390 |
2008 | absolute | 1 | 58390 |
2008 | abstract | 5 | 58390 |
2008 | abstract-class | 27 | 58390 |
2008 | abstract-syntax-tree | 6 | 58390 |
2008 | accelerometer | 3 | 58390 |
2008 | access | 1 | 58390 |
2008 | access-control | 12 | 58390 |
2008 | accessibility | 26 | 58390 |
2008 | access-vba | 50 | 58390 |
2008 | access-violation | 4 | 58390 |
2008 | accordion | 9 | 58390 |
2008 | acl | 11 | 58390 |
2008 | acrobat | 10 | 58390 |
2008 | action | 10 | 58390 |
2008 | actionlistener | 4 | 58390 |
2008 | actionmailer | 3 | 58390 |
2008 | actionscript | 136 | 58390 |
... | ... | ... | ... |
2018 | yaml | 648 | 1085170 |
2018 | yarn | 357 | 1085170 |
2018 | yeoman | 36 | 1085170 |
2018 | yesod | 41 | 1085170 |
2018 | yield | 69 | 1085170 |
2018 | yii | 269 | 1085170 |
2018 | yii2 | 1181 | 1085170 |
2018 | yii2-advanced-app | 209 | 1085170 |
2018 | yocto | 288 | 1085170 |
2018 | youtube | 676 | 1085170 |
2018 | youtube-api | 473 | 1085170 |
2018 | youtube-api-v3 | 223 | 1085170 |
2018 | youtube-data-api | 203 | 1085170 |
2018 | yui | 5 | 1085170 |
2018 | yum | 98 | 1085170 |
2018 | z3 | 124 | 1085170 |
2018 | zend-db | 11 | 1085170 |
2018 | zend-form | 13 | 1085170 |
2018 | zend-framework | 188 | 1085170 |
2018 | zend-framework2 | 108 | 1085170 |
2018 | zeromq | 168 | 1085170 |
2018 | z-index | 107 | 1085170 |
2018 | zip | 410 | 1085170 |
2018 | zipfile | 115 | 1085170 |
2018 | zk | 35 | 1085170 |
2018 | zlib | 89 | 1085170 |
2018 | zoom | 196 | 1085170 |
2018 | zsh | 175 | 1085170 |
2018 | zurb-foundation | 182 | 1085170 |
2018 | zxing | 95 | 1085170 |
# These packages need to be loaded in the first `@tests` cell.
library(testthat)
library(IRkernel.testthat)
# Then follows one or more tests of the students code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to
# give the student a hint on how to resolve these errors.
run_tests({
test_that("the answer is correct", {
expect_true("readr" %in% .packages(), info = "Did you load the readr package?")
expect_true("dplyr" %in% .packages(), info = "Did you load the dplyr package?")
expect_is(by_tag_year, "tbl_df",
info = "Did you read in by_tag_year with read_csv (not read.csv)?")
expect_equal(nrow(by_tag_year), 40518,
info = "Did you read in by_tag_year with read_csv?")
})
})
1/1 tests passed
This data has one observation for each pair of a tag and a year, showing the number of questions asked in that tag in that year and the total number of questions asked in that year. For instance, there were 54 questions asked about the .htaccess
tag in 2008, out of a total of 58390 questions in that year.
Rather than just the counts, we're probably interested in a percentage: the fraction of questions that year that have that tag. So let's add that to the table.
# Add fraction column
by_tag_year_fraction <- mutate(by_tag_year, fraction = number / year_total)
# Print the new table
by_tag_year_fraction
year | tag | number | year_total | fraction |
---|---|---|---|---|
<dbl> | <chr> | <dbl> | <dbl> | <dbl> |
2008 | .htaccess | 54 | 58390 | 9.248159e-04 |
2008 | .net | 5910 | 58390 | 1.012160e-01 |
2008 | .net-2.0 | 289 | 58390 | 4.949478e-03 |
2008 | .net-3.5 | 319 | 58390 | 5.463264e-03 |
2008 | .net-4.0 | 6 | 58390 | 1.027573e-04 |
2008 | .net-assembly | 3 | 58390 | 5.137866e-05 |
2008 | .net-core | 1 | 58390 | 1.712622e-05 |
2008 | 2d | 42 | 58390 | 7.193013e-04 |
2008 | 32-bit | 19 | 58390 | 3.253982e-04 |
2008 | 32bit-64bit | 4 | 58390 | 6.850488e-05 |
2008 | 3d | 73 | 58390 | 1.250214e-03 |
2008 | 64bit | 149 | 58390 | 2.551807e-03 |
2008 | abap | 10 | 58390 | 1.712622e-04 |
2008 | absolute | 1 | 58390 | 1.712622e-05 |
2008 | abstract | 5 | 58390 | 8.563110e-05 |
2008 | abstract-class | 27 | 58390 | 4.624079e-04 |
2008 | abstract-syntax-tree | 6 | 58390 | 1.027573e-04 |
2008 | accelerometer | 3 | 58390 | 5.137866e-05 |
2008 | access | 1 | 58390 | 1.712622e-05 |
2008 | access-control | 12 | 58390 | 2.055146e-04 |
2008 | accessibility | 26 | 58390 | 4.452817e-04 |
2008 | access-vba | 50 | 58390 | 8.563110e-04 |
2008 | access-violation | 4 | 58390 | 6.850488e-05 |
2008 | accordion | 9 | 58390 | 1.541360e-04 |
2008 | acl | 11 | 58390 | 1.883884e-04 |
2008 | acrobat | 10 | 58390 | 1.712622e-04 |
2008 | action | 10 | 58390 | 1.712622e-04 |
2008 | actionlistener | 4 | 58390 | 6.850488e-05 |
2008 | actionmailer | 3 | 58390 | 5.137866e-05 |
2008 | actionscript | 136 | 58390 | 2.329166e-03 |
... | ... | ... | ... | ... |
2018 | yaml | 648 | 1085170 | 5.971415e-04 |
2018 | yarn | 357 | 1085170 | 3.289807e-04 |
2018 | yeoman | 36 | 1085170 | 3.317453e-05 |
2018 | yesod | 41 | 1085170 | 3.778210e-05 |
2018 | yield | 69 | 1085170 | 6.358451e-05 |
2018 | yii | 269 | 1085170 | 2.478874e-04 |
2018 | yii2 | 1181 | 1085170 | 1.088309e-03 |
2018 | yii2-advanced-app | 209 | 1085170 | 1.925966e-04 |
2018 | yocto | 288 | 1085170 | 2.653962e-04 |
2018 | youtube | 676 | 1085170 | 6.229439e-04 |
2018 | youtube-api | 473 | 1085170 | 4.358764e-04 |
2018 | youtube-api-v3 | 223 | 1085170 | 2.054978e-04 |
2018 | youtube-data-api | 203 | 1085170 | 1.870675e-04 |
2018 | yui | 5 | 1085170 | 4.607573e-06 |
2018 | yum | 98 | 1085170 | 9.030843e-05 |
2018 | z3 | 124 | 1085170 | 1.142678e-04 |
2018 | zend-db | 11 | 1085170 | 1.013666e-05 |
2018 | zend-form | 13 | 1085170 | 1.197969e-05 |
2018 | zend-framework | 188 | 1085170 | 1.732447e-04 |
2018 | zend-framework2 | 108 | 1085170 | 9.952358e-05 |
2018 | zeromq | 168 | 1085170 | 1.548145e-04 |
2018 | z-index | 107 | 1085170 | 9.860206e-05 |
2018 | zip | 410 | 1085170 | 3.778210e-04 |
2018 | zipfile | 115 | 1085170 | 1.059742e-04 |
2018 | zk | 35 | 1085170 | 3.225301e-05 |
2018 | zlib | 89 | 1085170 | 8.201480e-05 |
2018 | zoom | 196 | 1085170 | 1.806169e-04 |
2018 | zsh | 175 | 1085170 | 1.612651e-04 |
2018 | zurb-foundation | 182 | 1085170 | 1.677157e-04 |
2018 | zxing | 95 | 1085170 | 8.754389e-05 |
# one or more tests of the students code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to
# give the student a hint on how to resolve these errors.
run_tests({
test_that("the answer is correct", {
expect_is(by_tag_year_fraction, "tbl_df",
info = "Did you create the by_tag_year_fraction object?")
expect_true("fraction" %in% colnames(by_tag_year_fraction),
info = "Did you use mutate() to add a fraction column?")
expect_equal(by_tag_year_fraction$fraction,
by_tag_year_fraction$number / by_tag_year_fraction$year_total,
info = "Check how you computed the fraction column: is it the number divided by that year's total?")
})
# You can have more than one test
})
1/1 tests passed
So far we've been learning and using the R programming language. Wouldn't we like to be sure it's a good investment for the future? Has it been keeping pace with other languages, or have people been switching out of it?
Let's look at whether the fraction of Stack Overflow questions that are about R has been increasing or decreasing over time.
# Filter for R tags
r_over_time <- filter(by_tag_year_fraction, tag=='r')
# Print the new table
r_over_time
year | tag | number | year_total | fraction |
---|---|---|---|---|
<dbl> | <chr> | <dbl> | <dbl> | <dbl> |
2008 | r | 8 | 58390 | 0.0001370098 |
2009 | r | 524 | 343868 | 0.0015238405 |
2010 | r | 2270 | 694391 | 0.0032690516 |
2011 | r | 5845 | 1200551 | 0.0048685978 |
2012 | r | 12221 | 1645404 | 0.0074273552 |
2013 | r | 22329 | 2060473 | 0.0108368321 |
2014 | r | 31011 | 2164701 | 0.0143257660 |
2015 | r | 40844 | 2219527 | 0.0184021190 |
2016 | r | 44611 | 2226072 | 0.0200402323 |
2017 | r | 54415 | 2305207 | 0.0236052554 |
2018 | r | 28938 | 1085170 | 0.0266667895 |
# one or more tests of the students code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to
# give the student a hint on how to resolve these errors.
run_tests({
test_that("the answer is correct", {
expect_is(r_over_time, "tbl_df",
info = "Did you create an r_over_time object with filter()?")
expect_equal(nrow(r_over_time), 11,
info = "Did you filter just for the rows with the 'r' tag?")
expect_true(all(r_over_time$tag == "r"),
info = "Did you filter just for the rows with the 'r' tag?")
})
# You can have more than one test
})
1/1 tests passed
Rather than looking at the results in a table, we often want to create a visualization. Change over time is usually visualized with a line plot.
# Load ggplot2
library(ggplot2)
# Create a line plot of fraction over time
ggplot(r_over_time, aes(x=year, y=fraction)) + geom_line()
# one or more tests of the students code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to
# give the student a hint on how to resolve these errors.
get_aesthetics <- function(p) {
unlist(c(list(p$mapping), purrr::map(p$layers, "mapping")))
}
run_tests({
test_that("the answer is correct", {
expect_true("ggplot2" %in% .packages(), info = "Did you load the ggplot2 package?")
# expect_true("scales" %in% .packages(), info = "Did you load the scales package?")
p <- last_plot()
expect_is(p, "ggplot", info = "Did you create a ggplot figure?")
expect_equal(length(p$layers), 1, info = "Did you create a plot with geom_line()?")
expect_is(p$layers[[1]]$geom, "GeomLine", info = "Did you create a plot with geom_line()?")
aesthetics <- get_aesthetics(p)
expect_equal(rlang::quo_name(aesthetics$x), "year",
info = "Did you put year on the x-axis?")
expect_equal(rlang::quo_name(aesthetics$y), "fraction",
info = "Did you put fraction on the y-axis?")
# expect_equal(length(p$scales$scales), 1, info = "Did you add scale_y_continuous?")
# expect_equal(p$scales$scales[[1]]$labels(.03), "3.00%", info = "Did you make the y-axis a percentage?")
})
})
1/1 tests passed
Based on that graph, it looks like R has been growing pretty fast in the last decade. Good thing we're practicing it now!
Besides R, two other interesting tags are dplyr and ggplot2, which we've already used in this analysis. They both also have Stack Overflow tags!
Instead of just looking at R, let's look at all three tags and their change over time. Are each of those tags increasing as a fraction of overall questions? Are any of them decreasing?
# A vector of selected tags
selected_tags <- c("r", "dplyr", "ggplot2")
# Filter for those tags
selected_tags_over_time <- filter(by_tag_year_fraction, tag %in% selected_tags)
# Plot tags over time on a line plot using color to represent tag
ggplot(selected_tags_over_time, aes(x=year, y=fraction, color=tag)) + geom_line()
# one or more tests of the students code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to
# give the student a hint on how to resolve these errors.
get_aesthetics <- function(p) {
unlist(c(list(p$mapping), purrr::map(p$layers, "mapping")))
}
run_tests({
test_that("the answer is correct", {
expect_true("ggplot2" %in% .packages(), info = "Did you load the ggplot2 package?")
expect_is(selected_tags_over_time, "tbl_df",
info = "Did you create a selected_tags_over_time data frame?")
expect_equal(nrow(selected_tags_over_time), 28,
info = "Did you filter for r, dplyr, and ggplot2 and save it to selected_tags_over_time?")
expect_equal(sort(unique(selected_tags_over_time$tag)), c("dplyr", "ggplot2", "r"),
info = "Did you filter for r, dplyr, and ggplot2 and save it to selected_tags_over_time?")
p <- last_plot()
aesthetics <- get_aesthetics(p)
expect_is(p, "ggplot", info = "Did you create a ggplot figure?")
expect_equal(p$data, selected_tags_over_time, info = "Did you create your plot out of selected_tags_over_time?")
expect_equal(length(p$layers), 1, info = "Did you create a plot with geom_line()?")
expect_is(p$layers[[1]]$geom, "GeomLine", info = "Did you create a plot with geom_line()?")
expect_true(!is.null(aesthetics$x), info = "Did you put year on the x-axis?")
expect_equal(rlang::quo_name(aesthetics$x), "year",
info = "Did you put year on the x-axis?")
expect_true(!is.null(aesthetics$y), info = "Did you put fraction on the y-axis?")
expect_equal(rlang::quo_name(aesthetics$y), "fraction",
info = "Did you put fraction on the y-axis?")
expect_true(!is.null(aesthetics$colour), info = "Did you put color on the x-axis?")
expect_equal(rlang::quo_name(aesthetics$colour), "tag",
info = "Did you map the tag to the color?")
# expect_equal(length(p$scales$scales), 1, info = "Did you add scale_y_continuous?")
# expect_equal(p$scales$scales[[1]]$labels(.03), "3.00%", info = "Did you make the y-axis a percentage?")
})
# You can have more than one test
})
1/1 tests passed
It's sure been fun to visualize and compare tags over time. The dplyr and ggplot2 tags may not have as many questions as R, but we can tell they're both growing quickly as well.
We might like to know which tags have the most questions overall, not just within a particular year. Right now, we have several rows for every tag, but we'll be combining them into one. That means we want group_by()
and summarize()
.
Let's look at tags that have the most questions in history.
# Find total number of questions for each tag
sorted_tags <- by_tag_year %>%
group_by(tag) %>% summarize(tag_total = sum(number)) %>% arrange(desc(tag_total))
# Print the new table
sorted_tags
tag | tag_total |
---|---|
<chr> | <dbl> |
javascript | 1632049 |
java | 1425961 |
c# | 1217450 |
php | 1204291 |
android | 1110261 |
python | 970768 |
jquery | 915159 |
html | 755341 |
c++ | 574263 |
ios | 566075 |
css | 539818 |
mysql | 522287 |
sql | 445419 |
asp.net | 334479 |
ruby-on-rails | 293432 |
objective-c | 284451 |
c | 279915 |
.net | 269578 |
arrays | 266578 |
angularjs | 252951 |
r | 243016 |
json | 236552 |
sql-server | 234713 |
node.js | 229843 |
iphone | 219161 |
swift | 196253 |
ruby | 195860 |
regex | 190061 |
ajax | 188184 |
xml | 173524 |
... | ... |
impala | 1011 |
box-api | 1010 |
drawrect | 1010 |
expo | 1010 |
package.json | 1010 |
credit-card | 1009 |
data-conversion | 1009 |
omnet++ | 1009 |
c-strings | 1008 |
google-docs-api | 1008 |
publishing | 1008 |
jogl | 1007 |
node-red | 1007 |
postgresql-9.4 | 1007 |
uinavigationitem | 1007 |
playframework-2.1 | 1006 |
cakephp-2.1 | 1005 |
device-driver | 1005 |
jasperserver | 1004 |
webdeploy | 1004 |
cat | 1003 |
date-formatting | 1003 |
java-2d | 1003 |
lattice | 1003 |
directory-structure | 1002 |
relation | 1002 |
doctype | 1001 |
rvest | 1001 |
tableviewcell | 1000 |
yahoo | 1000 |
# one or more tests of the students code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to
# give the student a hint on how to resolve these errors.
run_tests({
test_that("the answer is correct", {
expect_is(sorted_tags, "tbl_df",
info = "Did you create a selected_tags_over_time data frame?")
expect_equal(colnames(sorted_tags), c("tag", "tag_total"),
info = "Did you group by tag and summarize to create a tag_total column?")
expect_equal(nrow(sorted_tags), length(unique(by_tag_year$tag)),
info = "Did you group by tag and summarize to create a tag_total column?")
expect_equal(sorted_tags$tag_total,
sort(sorted_tags$tag_total, decreasing = TRUE),
info = "Did you arrange in descending order of tag_total?")
})
})
1/1 tests passed
We've looked at selected tags like R, ggplot2, and dplyr, and seen that they're each growing. What tags might be shrinking? A good place to start is to plot the tags that we just saw that were the most-asked about of all time, including JavaScript, Java and C#.
# Get the six largest tags
highest_tags <- head(sorted_tags$tag)
# Filter for the six largest tags
by_tag_subset <- by_tag_year_fraction %>% filter(tag==highest_tags)
# Plot tags over time on a line plot using color to represent tag
ggplot(by_tag_subset, aes(x = year, y = fraction, color = tag)) + geom_line()
# one or more tests of the students code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to
# give the student a hint on how to resolve these errors.
get_aesthetics <- function(p) {
unlist(c(list(p$mapping), purrr::map(p$layers, "mapping")))
}
run_tests({
test_that("the answer is correct", {
expect_equal(sort(unique(by_tag_subset$tag)), sort(head(sorted_tags$tag, 6)),
info = "Did you filter by_tag_year_fraction for only the 6 most asked-about tags, and save it as by_tag_subset?")
expect_equal(colnames(by_tag_subset), colnames(by_tag_year_fraction),
info = "Did you filter by_tag_year_fraction for only the 6 most asked-about tags, and save it as by_tag_subset?")
p <- last_plot()
expect_is(p, "ggplot", info = "Did you create a ggplot figure?")
expect_equal(p$data, by_tag_subset, info = "Did you create your plot out of by_tag_subset?")
expect_equal(length(p$layers), 1, info = "Did you create a plot with geom_line()?")
expect_is(p$layers[[1]]$geom, "GeomLine", info = "Did you create a plot with geom_line()?")
aesthetics <- get_aesthetics(p)
expect_equal(rlang::quo_name(aesthetics$x), "year",
info = "Did you put year on the x-axis?")
expect_equal(rlang::quo_name(aesthetics$y), "fraction",
info = "Did you put fraction on the y-axis?")
expect_equal(rlang::quo_name(aesthetics$colour), "tag",
info = "Did you map the tag to the color?")
# expect_equal(length(p$scales$scales), 1, info = "Did you add scale_y_continuous?")
# expect_equal(p$scales$scales[[1]]$labels(.03), "3.00%", info = "Did you make the y-axis a percentage?")
})
})
1/1 tests passed
Wow, based on that graph we've seen a lot of changes in what programming languages are most asked about. C# gets fewer questions than it used to, and Python has grown quite impressively.
This Stack Overflow data is incredibly versatile. We can analyze any programming language, web framework, or tool where we'd like to see their change over time. Combined with the reproducibility of R and its libraries, we have ourselves a powerful method of uncovering insights about technology.
To demonstrate its versatility, let's check out how three big mobile operating systems (Android, iOS, and Windows Phone) have compared in popularity over time. But remember: this code can be modified simply by changing the tag names!
# Get tags of interest
my_tags <- c("android", "ios", "windows-phone")
# Filter for those tags
by_tag_subset <- by_tag_year_fraction %>% filter(tag==my_tags)
# Plot tags over time on a line plot using color to represent tag
ggplot(by_tag_subset, aes(x = year, y = fraction, color = tag)) + geom_line()
# one or more tests of the students code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to
# give the student a hint on how to resolve these errors.
get_aesthetics <- function(p) {
unlist(c(list(p$mapping), purrr::map(p$layers, "mapping")))
}
run_tests({
test_that("the answer is correct", {
expect_equal(sort(my_tags), c("android", "ios", "windows-phone"),
info = "Did you create a vector my_tags of just android, ios, and windows-phone?")
expect_equal(sort(unique(by_tag_subset$tag)), c("android", "ios", "windows-phone"),
info = "Did you filter by_tag_year_fraction for only ios, android, and windows-phone?")
expect_equal(colnames(by_tag_subset), colnames(by_tag_year_fraction),
info = "Did you filter by_tag_year_fraction for only the three requested tags, and save it as by_tag_subset?")
p <- last_plot()
expect_is(p, "ggplot", info = "Did you create a ggplot figure?")
expect_equal(p$data, by_tag_subset, info = "Did you create your plot out of by_tag_subset?")
expect_equal(length(p$layers), 1, info = "Did you create a plot with geom_line()?")
expect_is(p$layers[[1]]$geom, "GeomLine", info = "Did you create a plot with geom_line()?")
aesthetics <- get_aesthetics(p)
expect_equal(rlang::quo_name(aesthetics$x), "year",
info = "Did you put year on the x-axis?")
expect_equal(rlang::quo_name(aesthetics$y), "fraction",
info = "Did you put fraction on the y-axis?")
expect_equal(rlang::quo_name(aesthetics$colour), "tag",
info = "Did you map the tag to the color?")
# expect_equal(length(p$scales$scales), 1, info = "Did you add scale_y_continuous?")
# expect_equal(p$scales$scales[[1]]$labels(.03), "3.00%", info = "Did you make the y-axis a percentage?")
})
})
1/1 tests passed