Scraping for a Booklist of the Chinese Classics
May 8, 2018
Rvest scraping Chinese classics Ctext.org
Last week I was considering a project that would be interesting and unique. I decided I would like to do a text analysis on classical Chinese texts, but wasn’t sure what kind of analysis to run or on which texts. I decided to keep it small - and use five of the “core” Chinese classics - The Analects, The Mengzi, Dao De Jing, Zhuangzi, and Mozi. While there are many books in Confucianism, Daoism, and Mohism, these texts are often used as the most representative examples of each “genre”.
Of course, the first key question was, from where can I get the data? One website with a rich amount of Chinese text data regarding the classics is ctext.org.
But when looking at the site design, I wondered “How can I get this in R?” Scraping wasn’t entirely feasible because the site’s terms prohibit the practice. Secondly, scraping is a delicate operation: if text isn’t formatted uniformly across pages, you might be in for a headache. You also don’t want to put unnecessary stress on the server. In the end, if you opt for it, you’ll have to make your functions work with the site structure - and as evidenced by the screenshot, it seemed a bit… messy. (Well, it turns out that it actually wasn’t.)
To get the text of the Chinese classics into R, the solution was to build an API. There is an API available on ctext.org’s website, but it’s made in Python. I’ve never built an API or proto-API functions before, but the latter was easier than I thought. I’ll save that for a future post.
To wrap up this post: many of the key functions in the site API take a book or chapter as an argument, so I needed a list of them. Scraping therefore turned out to be a necessary evil, and I kept it limited and not too demanding.
Without further ado, here is the (very limited) scraping I did to create a book list with chapters, which I put to use later in my homemade API.
library(tidyverse)
library(rvest)
library(stringr)
First Scrape
## 1st Scrape - Get list of genres available on ctext website.
url <- "https://ctext.org/"
path <- read_html(url)
genre_data <- path %>%
  html_nodes(css = ".container > .etext") %>%
  html_attr("href")
## Delete first observation, which is not a genre
genre_data <- genre_data[-1] %>% tibble("genre" = .)
## Append the base url to the sub-links
genre_data <- genre_data %>%
  mutate(genre_links = paste("https://ctext.org", "/", genre, sep = ""))
## Warning: package 'bindrcpp' was built under R version 3.3.2
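To make the selector and link-building steps easier to follow without touching the live site, here is a small offline sketch. The HTML string is a made-up stand-in for the front page (the real markup may well differ), and the names toy_html, toy_genres, and toy_links are hypothetical.

```r
library(rvest)  # re-exports read_html() and %>%

# Made-up stand-in for the front page; NOT the real ctext.org markup
toy_html <- '<html><body><div class="container">
  <a class="etext" href="./">Home</a>
  <a class="etext" href="confucianism">Confucianism</a>
  <a class="etext" href="daoism">Daoism</a>
  <span><a class="etext" href="nested">Not a direct child</a></span>
</div></body></html>'

# ".container > .etext" keeps only direct children of .container,
# so the link wrapped in <span> is skipped
toy_genres <- read_html(toy_html) %>%
  html_nodes(css = ".container > .etext") %>%
  html_attr("href")

# Drop the first entry (not a genre), then prepend the base url
toy_genres <- toy_genres[-1]
toy_links <- paste0("https://ctext.org/", toy_genres)
toy_links
```

The child combinator (>) is what keeps the selection narrow; a plain ".container .etext" would also pick up the nested link.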
Next I set up a scraping function which needs to iterate over each genre from the “genre_data” dataframe just created. Note the “Sys.sleep” call at the end, which pauses between requests to avoid overloading the server and to play nicely with the website.
Function - Preparing for the 2nd scrape
## 2nd Scrape - Make function to apply to each genre page, to get books and chapters
scraping_function <- function(genre, genre_links) {
  url <- genre_links[[1]]
  path <- read_html(url)
  data <- path %>%
    html_nodes(css = "#content3 > a") %>%
    html_attr("href")
  ## tibble() replaces the deprecated data_frame(); genre is recycled to every row
  data <- tibble(data, genre)
  ## Some string cleaning with stringr and mutate commands
  ## book: the part of the href before the slash
  data <- data %>%
    mutate(book = str_extract(data, "^[a-z].*[\\/]")) %>%
    mutate(book = str_replace(book, "\\/", ""))
  ## chapter: the part of the href after the slash
  data <- data %>%
    mutate(chapter = str_extract(data, "[\\/].*$")) %>%
    mutate(chapter = str_replace(chapter, "/", ""))
  data <- data %>%
    mutate(links = paste("https://ctext.org/", book, "/", chapter, sep = ""))
  ## Drop the raw href column and any rows where book or chapter is missing
  data <- data %>%
    select(-data) %>%
    filter(complete.cases(.))
  ## Pause between requests to go easy on the server
  Sys.sleep(2.5)
  data
}
If there was one takeaway from writing that function, it was that I should deepen my proficiency in regex. Finding the right regular expressions to capture the book and chapter names wasn’t HARD, but I did have to make several attempts before getting it all right. Previously web content was clean enough that I didn’t have to do this. Anyway, let’s apply the hard work to our original genre dataframe so that we can get a dataframe of books and their chapters. It’s going to be a big one.
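For anyone who wants to poke at those regular expressions in isolation, here is a minimal sketch on a few made-up hrefs (the values in hrefs are illustrative, not scraped). It also shows why the complete.cases() filter matters: an href without a slash produces NA for both book and chapter.

```r
library(stringr)

# Illustrative hrefs; the second one has no slash on purpose
hrefs <- c("analects/xue-er", "dao-de-jing", "mozi/book-1")

# Same patterns as in scraping_function()
book <- str_replace(str_extract(hrefs, "^[a-z].*[\\/]"), "\\/", "")
chapter <- str_replace(str_extract(hrefs, "[\\/].*$"), "/", "")

# One possible alternative (an assumption, not what the function uses):
# lookarounds that capture the text without the slash in a single step
book_alt <- str_extract(hrefs, "^[^/]+(?=/)")
chapter_alt <- str_extract(hrefs, "(?<=/).+$")

data.frame(hrefs, book, chapter)
```

On these inputs both approaches give the same result; the lookaround version just skips the extra str_replace() pass.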
Apply the function and get the data. I have come to love purrr for this.
##Apply function to genre_data dataframe, create a data frame of books and chapters
all_works <- map2(genre_data$genre, genre_data$genre_links, ~ scraping_function(..1, ..2))
book_list <- all_works %>% do.call(rbind, .)
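The same map-then-combine pattern can be tried offline. The sketch below swaps the real scraper for a hypothetical stub, fake_scrape(), that returns a fixed two-row tibble per genre, and uses dplyr's bind_rows() as one possible alternative to do.call(rbind, .). All names and values here are made up for illustration.

```r
library(dplyr)  # tibble(), bind_rows(), %>%
library(purrr)  # map2()

# Hypothetical offline stand-in for scraping_function(); no network access
fake_scrape <- function(genre, genre_links) {
  chapters <- c("ch-1", "ch-2")
  tibble(
    genre = genre,
    book = "some-book",
    chapter = chapters,
    links = paste0(genre_links, "/", chapters)
  )
}

toy_genres <- tibble(
  genre = c("confucianism", "daoism"),
  genre_links = c("https://ctext.org/confucianism", "https://ctext.org/daoism")
)

# map2() walks the two columns in parallel and returns a list of tibbles;
# bind_rows() stacks them into a single long tibble
toy_list <- map2(toy_genres$genre, toy_genres$genre_links, fake_scrape) %>%
  bind_rows()
toy_list
```

bind_rows() matches columns by name, so it is a little more forgiving than rbind() if one page ever returns its columns in a different order.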
And here it is. The final variable “book_list” is a collection of books and chapters of each book, as listed on Ctext.org.
head(book_list)
## # A tibble: 6 x 4
##   genre        book     chapter
##   <chr>        <chr>    <chr>
## 1 confucianism analects xue-er
## 2 confucianism analects wei-zheng
## 3 confucianism analects ba-yi
## 4 confucianism analects li-ren
## 5 confucianism analects gong-ye-chang
## 6 confucianism analects yong-ye
## # ... with 1 more variables: links <chr>
It is clearly in long format, with one row per chapter (convenient but not necessary; in fact, this is more a side effect of my scraping).
str(book_list)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5869 obs. of 4 variables:
## $ genre : chr "confucianism" "confucianism" "confucianism" "confucianism" ...
## $ book : chr "analects" "analects" "analects" "analects" ...
## $ chapter: chr "xue-er" "wei-zheng" "ba-yi" "li-ren" ...
## $ links : chr "https://ctext.org/analects/xue-er" "https://ctext.org/analects/wei-zheng" "https://ctext.org/analects/ba-yi" "https://ctext.org/analects/li-ren" ...
It is quite lengthy, at nearly 6,000 rows and 130 different books. This is an important dataframe, which I will use in my homemade API to pull textual data into R from Ctext.org.
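As a rough idea of how a table like this can be narrowed down to the five core texts, here is a sketch on a tiny made-up tibble with the same columns. The rows in toy_book_list and the slugs in core_books are assumptions for illustration; check them against the real book_list before relying on them.

```r
library(dplyr)

# Tiny made-up stand-in for book_list (same column names, invented rows)
toy_book_list <- tibble(
  genre = c("confucianism", "confucianism", "confucianism", "daoism", "mohism"),
  book = c("analects", "analects", "mengzi", "zhuangzi", "mozi"),
  chapter = c("xue-er", "wei-zheng", "liang-hui-wang-i", "xiao-yao-you", "book-1")
)

# Assumed slugs for the five core texts
core_books <- c("analects", "mengzi", "dao-de-jing", "zhuangzi", "mozi")

# Keep only the core texts and count chapters per book
chapter_counts <- toy_book_list %>%
  filter(book %in% core_books) %>%
  count(book, name = "n_chapters")
chapter_counts

# Number of distinct books in the table
n_distinct(toy_book_list$book)
```

Because the table is in long format, per-book summaries like this are a single count() call.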
Next post, I plan on sharing the process and results of my Chinese Classics text analysis.