Introduction to openalexR: a webinar

Author

Trang Le

Published

July 24, 2024

🌻 Introduction

Welcome to the webinar on openalexR! To view the video recording of the webinar, go to https://youtu.be/AUd6-mK76a0. Today, we will explore how to use the openalexR package to fetch data from OpenAlex, a free and open database of the entire research landscape. With openalexR, you can easily access and analyze scholarly data directly from R.

🌱 Installation and setup

If you haven’t already installed the openalexR package, you can do so from CRAN:

install.packages("openalexR")

Before we go any further, we highly recommend you set the openalexR.mailto option so that your requests go to the polite pool for faster response times. If you have OpenAlex Premium, you can add your API key to the openalexR.apikey option as well. To do so, you can open .Renviron with file.edit("~/.Renviron") and add:

openalexR.mailto = example@email.com
openalexR.apikey = EXAMPLE_APIKEY
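Alternatively, you can set these as options for the current R session only:

options(openalexR.mailto = "example@email.com")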

We will now load openalexR and the tidyverse, which we will use for the rest of this webinar.

library(openalexR)
library(tidyverse)
packageVersion("openalexR")
[1] '1.4.0'

Throughout the webinar, you will see the symbol |>. Simply put, x |> f() is equivalent to f(x). Using R's base pipe allows us to chain functions together in a more readable way.
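For example, these two lines compute the same value:

sum(sqrt(1:10))
1:10 |> sqrt() |> sum()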

🌿 Basic usage

The main function of openalexR is oa_fetch().

?oa_fetch

Fetching information from identifiers

Let’s start by fetching data on a specific scholarly work using its OpenAlex ID:

When you know the OpenAlex ID of the work, you do not need to specify entity = "works" because entity can be inferred from the first character of id. However, I specify it here for clarity. In other use cases (search, filter), you will almost always want to specify entity. For a list of all supported entities, see oa_entities().

work <- oa_fetch(entity = "works", id = "W2741809807", verbose = TRUE)
Requesting url: https://api.openalex.org/works/W2741809807
work
# A tibble: 1 × 38
  id                title display_name author ab    publication_date so    so_id
  <chr>             <chr> <chr>        <list> <chr> <chr>            <chr> <chr>
1 https://openalex… The … The state o… <df>   Desp… 2018-02-13       PeerJ http…
# ℹ 30 more variables: host_organization <chr>, issn_l <chr>, url <chr>,
#   pdf_url <chr>, license <chr>, version <chr>, first_page <chr>,
#   last_page <chr>, volume <chr>, issue <lgl>, is_oa <lgl>,
#   is_oa_anywhere <lgl>, oa_status <chr>, oa_url <chr>,
#   any_repository_has_fulltext <lgl>, language <chr>, grants <lgl>,
#   cited_by_count <int>, counts_by_year <list>, publication_year <int>,
#   cited_by_api_url <chr>, ids <list>, doi <chr>, type <chr>, …

Now, we can view the output tibble/dataframe, work, interactively in RStudio or inspect it with base functions like str or head.

str(work, max.level = 2)
tibble [1 × 38] (S3: tbl_df/tbl/data.frame)
 $ id                         : chr "https://openalex.org/W2741809807"
 $ title                      : chr "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles"
 $ display_name               : chr "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles"
 $ author                     :List of 1
 $ ab                         : chr "Despite growing interest in Open Access (OA) to scholarly literature, there is an unmet need for large-scale, u"| __truncated__
 $ publication_date           : chr "2018-02-13"
 $ so                         : chr "PeerJ"
 $ so_id                      : chr "https://openalex.org/S1983995261"
 $ host_organization          : chr "PeerJ, Inc."
 $ issn_l                     : chr "2167-8359"
 $ url                        : chr "https://doi.org/10.7717/peerj.4375"
 $ pdf_url                    : chr "https://peerj.com/articles/4375.pdf"
 $ license                    : chr "cc-by"
 $ version                    : chr "publishedVersion"
 $ first_page                 : chr "e4375"
 $ last_page                  : chr "e4375"
 $ volume                     : chr "6"
 $ issue                      : logi NA
 $ is_oa                      : logi TRUE
 $ is_oa_anywhere             : logi TRUE
 $ oa_status                  : chr "gold"
 $ oa_url                     : chr "https://peerj.com/articles/4375.pdf"
 $ any_repository_has_fulltext: logi TRUE
 $ language                   : chr "en"
 $ grants                     : logi NA
 $ cited_by_count             : int 785
 $ counts_by_year             :List of 1
 $ publication_year           : int 2018
 $ cited_by_api_url           : chr "https://api.openalex.org/works?filter=cites:W2741809807"
 $ ids                        :List of 1
 $ doi                        : chr "https://doi.org/10.7717/peerj.4375"
 $ type                       : chr "article"
 $ referenced_works           :List of 1
 $ related_works              :List of 1
 $ is_paratext                : logi FALSE
 $ is_retracted               : logi FALSE
 $ concepts                   :List of 1
 $ topics                     :List of 1

openalexR also provides the show_works() function to simplify the result for easy viewing (e.g., remove some columns, keep only the first and last author). Let us define print_oa() to wrap the output table in knitr::kable() so that it displays nicely on this webpage; you will most likely not need this function yourself.

print_oa <- function(x, fun = show_works) {
  x |>
    select(-any_of("url")) |>
    fun() |>
    knitr::kable() |>
    identity() # no-op: lets you comment out any step above without breaking the pipe
}
print_oa(work)
id:           W2741809807
display_name: The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles
first_author: Heather Piwowar
last_author:  Stefanie Haustein
so:           PeerJ
is_oa:        TRUE
top_concepts: Citation, License, Bibliometrics
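oa_fetch() also accepts other identifier types. For example, a minimal sketch fetching the same work by its DOI (using the doi filter):

oa_fetch(entity = "works", doi = "10.7717/peerj.4375") |>
  print_oa()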

🛤️ Coding challenges

Challenge 1: Advanced filters

According to the OpenAlex API documentation, what is the filter we should use to:

  • Get funders with a description containing “engineering”?
  • Get topics with more than 1000 works?
  • Get 10 institutions located in Asia?

oa_fetch(
  entity = "funders",
  _____ = "engineering"
)

oa_fetch(
  entity = "topics",
  _____ = ">1000"
)

oa_fetch(
  entity = "institutions",
  _____ = "Asia",
  options = list(_____ = 10)
)
Solution:

oa_fetch(
  entity = "funders",
  description.search = "engineering"
)

oa_fetch(
  entity = "topics",
  works_count = ">1000"
)

oa_fetch(
  entity = "institutions",
  continent = "Asia",
  options = list(sample = 10, seed = 1)
)

Challenge 2: Humpback whale

Identify works on a specific topic (e.g., “humpback whale”) that have been cited more than 100 times. When were these works published? Where are the authors based? Create a bar plot showing the number of works at each institution.

work <- oa_fetch(
  entity = "works",
  # decide whether you want search/title.search/abstract.search/etc.
  ______
)

______
Solution:

humpback <- oa_fetch(
  entity = "works",
  title.search = "humpback whale",
  cited_by_count = ">100",
  options = list(sort = "cited_by_count:desc")
)
print(humpback$title[1:10])
 [1] "Songs of Humpback Whales"                                                                            
 [2] "Leading-edge tubercles delay stall on humpback whale (<i>Megaptera novaeangliae</i>) flippers"       
 [3] "Hydrodynamic design of the humpback whale flipper"                                                   
 [4] "Dynamics of two populations of the humpback whale, Megaptera novaeangliae (Borowski)"                
 [5] "Network-Based Diffusion Analysis Reveals Cultural Transmission of Lobtail Feeding in Humpback Whales"
 [6] "Microplastic in a macro filter feeder: Humpback whale Megaptera novaeangliae"                        
 [7] "Genetic tagging of humpback whales"                                                                  
 [8] "9. The Seasonal Migratory Cycle of Humpback Whales"                                                  
 [9] "Abundant mitochondrial DNA variation and world-wide population structure in humpback whales."        
[10] "Male Competition in Large Groups of Wintering Humpback Whales"                                       
n_authors <- sapply(humpback$author, nrow)
# Fractional counting: each author of an n-author work gets weight 1/n,
# so every work contributes a total weight of 1
hb_authors <- humpback$author |>
  bind_rows() |>
  mutate(weight = 1 / unlist(lapply(n_authors, \(x) rep(x, x))))

pal <- c("#D3B484", "#F89C74", "#C9DB74", "#87C55F", "#B497E7","#66C5CC")
hb_authors |>
  drop_na(institution_display_name) |>
  group_by(
    inst = institution_display_name, 
    country = institution_country_code
  ) |>
  summarise(n = sum(weight),  .groups = "drop") |> 
  arrange(desc(n)) |> 
  filter(n > 0.5) |>
  ggplot() +
  aes(x = n, y = fct_reorder(inst, n), fill = country) +
  geom_col() +
  scale_fill_manual(values = rev(pal)) +
  coord_cartesian(expand = FALSE) +
  labs(
    x = "Weighted value of most cited works",
    y = NULL,
    title = "Institutions with most cited works on humpback whale"
  )

🪴 Example analyses

Heterocycles authors

How many authors published in Heterocycles?

LIVE DEMO
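This part was demonstrated live, but here is a minimal sketch of one possible approach (the display_name.search and primary_location.source.id filters come from the OpenAlex filter list; au_id is the author ID column in the author list-column):

# 1. Find the journal's OpenAlex ID
heterocycles <- oa_fetch(
  entity = "sources",
  display_name.search = "heterocycles"
)

# 2. Fetch its works (consider narrowing by publication_year for a faster response)
hc_works <- oa_fetch(
  entity = "works",
  primary_location.source.id = heterocycles$id[1]
)

# 3. Count distinct author IDs
hc_works$author |>
  bind_rows() |>
  distinct(au_id) |>
  nrow()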

Dartmouth datasets

Goal: Analyze the open access status of datasets published by researchers at Dartmouth College over the years.

w_dartmouth <- oa_fetch(
  "works",
  authorships.institutions.lineage = "i107672454",
  publication_year = ">2000",
  publication_year = "<2024",
  type = "dataset",
  options = list(
    select = c("id", "open_access", "publication_year")
  #   sample = 500, seed = 1 # sample for a faster response
  ),
  # count_only = TRUE,
  verbose = TRUE
)
Requesting url: https://api.openalex.org/works?filter=authorships.institutions.lineage%3Ai107672454%2Cpublication_year%3A%3E2000%2Cpublication_year%3A%3C2024%2Ctype%3Adataset&select=id%2Copen_access%2Cpublication_year
Getting 21 pages of results with a total of 4121 records...
w_dartmouth
# A tibble: 4,126 × 6
   id    is_oa_anywhere oa_status oa_url any_repository_has_f…¹ publication_year
   <chr> <lgl>          <chr>     <chr>  <lgl>                             <int>
 1 http… TRUE           gold      https… TRUE                               2014
 2 http… TRUE           hybrid    https… TRUE                               2014
 3 http… TRUE           gold      https… TRUE                               2020
 4 http… TRUE           hybrid    https… TRUE                               2021
 5 http… TRUE           gold      https… TRUE                               2019
 6 http… TRUE           gold      https… TRUE                               2015
 7 http… TRUE           gold      https… TRUE                               2017
 8 http… TRUE           gold      https… TRUE                               2019
 9 http… TRUE           gold      https… TRUE                               2023
10 http… TRUE           gold      https… TRUE                               2023
# ℹ 4,116 more rows
# ℹ abbreviated name: ¹​any_repository_has_fulltext
w_dartmouth |> 
  mutate(
    oa_status = oa_status |>
      fct_relevel("gold", "hybrid", "bronze", "green", "closed") |>
      fct_recode(
        "Gold: Published in an OA journal" = "gold",
        "Green: Toll-access on publisher page, but free copy in an OA repository" = "green",
        "Bronze: Free to read on publisher page, but no identifiable license" = "bronze",
        "Hybrid: Free under an open license in a toll-access journal" = "hybrid",
        "Closed" = "closed"
      )
  ) |>
  count(publication_year, oa_status) |> 
  ggplot() +
  aes(y = as.factor(publication_year), x = n, fill = oa_status) +
  geom_col() +
  scale_fill_manual(values = c("#FFD700", "#1E90FF", "#CD7F32", "#32CD32", "#A9A9A9")) +
  labs(
    x = "Number of datasets", y = NULL
    # title = "Open access status of datasets published by Dartmouth researchers",
  ) +
  coord_cartesian(expand = FALSE) 

Journal clocks

Goal: Visualize big journals’ topics.

We first download all records of sources with more than 200,000 works, then visualize the topic fields of the most-cited journals:

jours_all <- oa_fetch(
  entity = "sources",
  works_count = ">200000",
  verbose = TRUE
)
Requesting url: https://api.openalex.org/sources?filter=works_count%3A%3E200000
Getting 1 page of results with a total of 42 records...

The following is a lot of code, but it is mainly for processing the data and customizing the final plot.

Code
clean_journal_name <- function(x) {
  x |>
    gsub("\\(.*?\\)", "", x = _) |>
    gsub("Journal of the|Journal of", "J.", x = _) |>
    gsub("/.*", "", x = _)
}

jours <- jours_all |>
  filter(type == "journal") |>
  slice_max(cited_by_count, n = 9) |>
  distinct(display_name, .keep_all = TRUE) |>
  select(jour = display_name, topics) |>
  tidyr::unnest(topics) |>
  filter(name == "field") |>
  group_by(id, jour, display_name) |> 
  summarise(score = (sum(count))^(1/3), .groups = "drop") |> 
  left_join(concept_abbrev, by = join_by(id, display_name)) |>
  mutate(
    abbreviation = gsub(" ", "<br>", abbreviation),
    jour = clean_journal_name(jour),
  ) |>
  tidyr::complete(jour, abbreviation, fill = list(score = 0)) |>
  group_by(jour) |>
  mutate(
    color = if_else(score > 10, "#1A1A1A", "#D9D9D9"),
    label = paste0("<span style='color:", color, "'>", abbreviation, "</span>")
  ) |>
  ungroup()

jours |>
  ggplot() +
  aes(fill = jour, y = score, x = abbreviation, group = jour) +
  facet_wrap(~jour) +
  geom_hline(yintercept = c(25, 50), colour = "grey90", linewidth = 0.2) +
  geom_segment(
    aes(x = abbreviation, xend = abbreviation, y = 0, yend = 55),
    color = "grey95"
  ) +
  geom_col(color = "grey20") +
  coord_polar(clip = "off") +
  theme_bw() +
  theme(
    plot.background = element_rect(fill = "transparent", colour = NA),
    panel.background = element_rect(fill = "transparent", colour = NA),
    panel.grid = element_blank(),
    panel.border = element_blank(),
    axis.text = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  ggtext::geom_richtext(
    aes(y = 75, label = label),
    fill = NA, label.color = NA, size = 3
  ) +
  scale_fill_brewer(palette = "Set1", guide = "none") +
  labs(y = NULL, x = NULL, title = "Journal clocks")

🌵 Advanced topics

Other parameters of oa_fetch()

So far, we have seen the argument options, which is a list of additional parameters that can be passed to the API. Some of these options include:

  • select: Top-level fields to show in the output.

  • sort: Attribute to sort by, e.g., “display_name” for sources or “cited_by_count:desc” for works.

  • sample: Number of (random) records to return.

  • seed: A seed value that retrieves the same set of random records, in the same order, across repeated calls with sample. See the sketch after this list for these options in action.
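For example, a minimal sketch combining select, sample, and seed (the filter and values are illustrative):

oa_fetch(
  entity = "works",
  title.search = "humpback whale",
  options = list(
    select = c("id", "display_name", "publication_year"),
    sample = 5,
    seed = 42
  )
)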

Another helpful argument is output. By default, oa_fetch() returns a tibble with pre-specified columns that we think are useful for most users. We’re working on better documenting this process in this pull request. However, if you need more control over the output (e.g., you want fields that are not returned in the dataframe), you can set output = "list" to get the raw JSON (R list) object returned by the API. Please make an issue if you think a field is important enough to return in the dataframe output.
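For example, to inspect every field the API returns for the work we fetched earlier (a minimal sketch; the structure of the returned list follows the OpenAlex API schema):

work_list <- oa_fetch(entity = "works", id = "W2741809807", output = "list")
str(work_list, max.level = 1)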

Building your own query

Behind the scenes, oa_fetch() composes the three functions below so the user can execute everything in one step, i.e., oa_query() |> oa_request() |> oa2df():

  • oa_query(): generates a valid query, written following the OpenAlex API syntax, from a set of arguments provided by the user.

  • oa_request(): downloads a collection of entities matching the query created by oa_query or manually written by the user, and returns a JSON object in a list format.

  • oa2df(): converts the JSON object into a classical bibliographic tibble/data frame.

Therefore, instead of using oa_fetch(), you can use oa_query() and oa_request() separately to build your own query and make an API request, then (optionally) convert the result to a dataframe using oa2df(). This way, you have more control over the process.
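For example, the humpback whale query from Challenge 2 can be built and executed step by step (a sketch; the three calls mirror the pipeline above):

query_url <- oa_query(
  entity = "works",
  title.search = "humpback whale",
  cited_by_count = ">100"
)
res <- oa_request(query_url, verbose = TRUE)
humpback <- oa2df(res, entity = "works")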

oa_generate()

oa_generate() is a generator for making requests to the OpenAlex API, returning one record at a time. This is useful when you want to process a large number of records without loading them all into memory at once. You will first need to build your own query (e.g., using oa_query() or interactively from the interface provided directly by OpenAlex) to use as an argument to oa_generate().

example(oa_generate)
Output

o_gnrt> if (require("coro")) {
o_gnrt+   # Example 1: basic usage getting one record at a time
o_gnrt+   query_url <- "https://api.openalex.org/works?filter=cites%3AW1160808132"
o_gnrt+   oar <- oa_generate(query_url, verbose = TRUE)
o_gnrt+   p1 <- oar() # record 1
o_gnrt+   p2 <- oar() # record 2
o_gnrt+   p3 <- oar() # record 3
o_gnrt+   head(p1)
o_gnrt+   head(p3)
o_gnrt+ 
o_gnrt+   # Example 2: using `coro::loop()` to iterate through the generator
o_gnrt+   query_url <- "https://api.openalex.org/works?filter=cited_by%3AW1847168837"
o_gnrt+   oar <- oa_generate(query_url)
o_gnrt+   coro::loop(for (x in oar) {
o_gnrt+     print(x$id)
o_gnrt+   })
o_gnrt+ 
o_gnrt+   # Example 3: save records in blocks of 100
o_gnrt+   query_url <- "https://api.openalex.org/works?filter=cites%3AW1160808132"
o_gnrt+   oar <- oa_generate(query_url)
o_gnrt+   n <- 100
o_gnrt+   recs <- vector("list", n)
o_gnrt+   i <- 0
o_gnrt+ 
o_gnrt+   coro::loop(for (x in oar) {
o_gnrt+     j <- i %% n + 1
o_gnrt+     recs[[j]] <- x
o_gnrt+     if (j == n) {
o_gnrt+       # saveRDS(recs, sprintf("rec-%s.rds", i %/% n))
o_gnrt+       recs <- vector("list", n) # reset recs
o_gnrt+     }
o_gnrt+     i <- i + 1
o_gnrt+   })
o_gnrt+   head(x)
o_gnrt+   j
o_gnrt+   # 398 works total, so j = 98 makes sense.
o_gnrt+ 
o_gnrt+   # You can also manually call the generator until exhausted
o_gnrt+   # using `while (!coro::is_exhausted(record_i))`.
o_gnrt+   # More details at https://coro.r-lib.org/articles/generator.html.
o_gnrt+ 
o_gnrt+ }
Loading required package: coro
Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'coro'

oa_snowball()

The user can also perform snowballing with oa_snowball(). Snowballing is a literature search technique in which the researcher starts with a set of articles and finds the articles that cite or are cited by the original set. oa_snowball() returns a list of two elements: nodes and edges. Like oa_fetch(), oa_snowball() finds and returns information on a core set of articles satisfying certain criteria, but unlike oa_fetch(), it also returns information on the articles that cite and are cited by this core set.
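A minimal sketch, snowballing from the work we fetched at the start (snowball2df() flattens the result into a single works dataframe):

snowball <- oa_snowball(
  identifier = "W2741809807",
  verbose = TRUE
)
names(snowball) # "nodes" "edges"
flat <- snowball2df(snowball)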

🌾 N-grams

OpenAlex offers (limited) support for fulltext N-grams of Work entities (these have IDs starting with "W"). Given a vector of work IDs, oa_ngrams() returns a dataframe of N-gram data (in the ngrams list-column) for each work.

ngrams_data <- oa_ngrams(
  works_identifier = c("W1964141474", "W1963991285"),
  verbose = TRUE
)
ngrams_data
# A tibble: 2 × 4
  id                               doi                              count ngrams
  <chr>                            <chr>                            <int> <list>
1 https://openalex.org/W1964141474 https://doi.org/10.1016/j.conb.…  2733 <df>  
2 https://openalex.org/W1963991285 https://doi.org/10.1126/science…  2338 <df>  
lapply(ngrams_data$ngrams, head, 3)
[[1]]
                                       ngram ngram_count ngram_tokens
1                 brain basis and core cause           2            5
2                     cause be not yet fully           2            5
3 include structural and functional magnetic           2            5
  term_frequency
1   0.0006637902
2   0.0006637902
3   0.0006637902

[[2]]
                                         ngram ngram_count ngram_tokens
1          intact but less accessible phonetic           1            5
2 accessible phonetic representation in Adults           1            5
3       representation in Adults with Dyslexia           1            5
  term_frequency
1   0.0003756574
2   0.0003756574
3   0.0003756574
ngrams_data |>
  unnest(ngrams) |>
  filter(ngram_tokens == 2) |>
  select(id, ngram, ngram_count) |>
  group_by(id) |>
  slice_max(ngram_count, n = 10, with_ties = FALSE) |>
  ggplot(aes(ngram_count, fct_reorder(ngram, ngram_count))) +
  geom_col(aes(fill = id), show.legend = FALSE) +
  scale_fill_manual(values = c("#A16928", "#2887a1")) +
  facet_wrap(~id, scales = "free_y") +
  labs(
    title = "Top 10 fulltext bigrams",
    x = "Count",
    y = NULL
  )

oa_ngrams() can sometimes be slow because the N-grams data can get pretty big, but given that the N-grams are “cached via CDN”, this is one special case where you may want to parallelize your requests, as sketched below.
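A minimal sketch using base R's parallel package (mclapply() relies on forking, so on Windows use mc.cores = 1 or another backend; the chunk size of 25 is arbitrary):

library(parallel)
ids <- c("W1964141474", "W1963991285") # in practice, a much longer vector of work IDs
chunks <- split(ids, ceiling(seq_along(ids) / 25)) # up to 25 IDs per request
res <- mclapply(chunks, oa_ngrams, mc.cores = max(1, detectCores() - 1))
ngrams_all <- dplyr::bind_rows(res)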

🌸 Q&A

TODO: to fill after webinar

🍁 Conclusion

Thank you for participating in the openalexR webinar! We hope you found the session informative and engaging. For more information, visit the openalexR GitHub page or reach out to me on the OpenAlex Community Google Group with any questions.