A Tidytext Analysis of 3 Chinese Classics

May 29, 2018

ctextclassics tidytext tmcn

For a long time I’ve admired the tidytext package and its wonderful companion book Text Mining with R. After reading it I thought, “Why not undertake a project of Chinese text analysis?” I am deeply interested in Chinese philosophy, but I decided to keep the analysis narrow by selecting just three works: the Analects, the Zhuangzi, and the Mozi.

Following a similar sequence to the tidytext book, I first download my data. Here I use my package ctextclassics, specifically the function get_books(c(...)). Note that the API limit is very low, so I had to spread the downloads over two different days. For more information on ctextclassics, check out my previous post, or install it with devtools::install_github("Jjohn987/ctextclassics").

library(tidyverse)
library(stringr)
library(ctextclassics)
library(tmcn)
library(tidytext)
library(topicmodels)
library(readr)

my_classics <- read_csv("~/Desktop/anything_data/content/post/my_classics.csv")

With any text analysis, tokenizing the text and filtering out stop words is fundamental. Tidytext segments English quite naturally, considering words are easily separated by spaces. However, I’m not so sure how it performs with Chinese characters.

There are specific segmenters for Chinese text. One main tool is jiebaR, which is also included in the tmcn package.

However, when comparing the two methods, I noticed that jiebaR segments text in a way best suited to modern Chinese (mostly two-character words). Since I’m dealing with classical Chinese here, tidytext’s mostly one-character segmentation is preferable.

tidytext_segmented <- my_classics %>% 
  unnest_tokens(word, word)
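
To make concrete what one-character segmentation means, here is a minimal base R sketch that splits a short classical Chinese phrase into individual characters. It is only an illustration: unnest_tokens() uses its own word-boundary tokenizer rather than this code, and the helper name split_chars is hypothetical.

```r
# Hypothetical helper: split a string into single characters via code points.
# This only illustrates one-character tokens; it is not how tidytext tokenizes.
split_chars <- function(text) {
  intToUtf8(utf8ToInt(text), multiple = TRUE)
}

phrase <- "君子而不仁者有矣夫"
tokens <- split_chars(phrase)

# One row per character, similar in shape to a one-token-per-row data frame
token_table <- data.frame(position = seq_along(tokens), word = tokens,
                          stringsAsFactors = FALSE)
print(token_table)
```

A true one-character split would break 君子 into 君 and 子, whereas a word-aware tokenizer can keep such a pair together as a single token.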

For dealing with stop words, the tmcn package offers a useful stop word list (stopwordsCN()), but obviously more should be added since we’re dealing with classical Chinese. Many of the words I added are amorphous grammar particles, but there are other low-value phrases in these works, such as “子曰” (“The Master said”), “天下” (Tian Xia, a common but amorphous concept roughly meaning a country, realm, or the world), and more.

Let’s filter out those words and make 2 data frames - word frequencies for each book and each chapter.

stopwordsCN <- data.frame(word = c(tmcn::stopwordsCN(),
"子曰", "曰", "於", "則","吾", "子", "不", "無", "斯","與", "為", "必",
"使", "非","天下", "以為","上", "下", "人", "天", "不可", "謂", "是以",
"而不", "皆", "不亦", "乎", "之", "而", "者", "本", "與", "吾", "則",
"以", "其", "為", "不以", "不可", "也", "矣", "子", "由", "子曰", "曰",
"非其", "於", "不能", "如", "斯", "然", "君", "亦", "言", "聞", "今",
"君", "不知", "无"))
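
The list above repeats several entries (for example, "子曰" and "曰" each appear twice). That is harmless for a filtering join, since a word is dropped whether it matches once or several times. The idea can be sketched in base R on a toy example; toy_tokens and toy_stopwords are hypothetical names with made-up data, standing in for the real tokens and stop list.

```r
# Toy data (hypothetical): tokens from one imaginary passage
toy_tokens <- data.frame(
  book = "analects",
  word = c("子曰", "君子", "不", "仁", "曰", "仁"),
  stringsAsFactors = FALSE
)

# A tiny stop list with a deliberate duplicate, deduplicated before use
toy_stopwords <- unique(c("子曰", "曰", "不", "曰"))

# Keep rows whose word is NOT in the stop list,
# which is the effect of anti_join() when joining on `word`
kept <- toy_tokens[!toy_tokens$word %in% toy_stopwords, ]
print(kept)
```

Only the rows for 君子 and the two occurrences of 仁 survive the filter.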

## Add columns with the simplified Chinese form and the pinyin of each word.
## Count words by book, then compute word frequencies to account for different book lengths.

counts_by_book <- tidytext_segmented %>%
  ungroup() %>%
  mutate(simplified = tmcn::toTrad(word, rev = TRUE), pinyin = tmcn::toPinyin(word)) %>%
  anti_join(stopwordsCN) %>%
  count(book, word, pinyin, simplified) %>%
  group_by(book) %>%
  mutate(word_freq = `n`/sum(`n`)) %>%
  arrange(-n) %>%
  ungroup()

## Warning: Column `word` joining character vector and factor, coercing into
## character vector
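
Dividing each count by its book's total is what makes the numbers comparable across books of different lengths. The sketch below reproduces the word_freq arithmetic in base R on a made-up table; toy_counts and all of its numbers are invented for illustration, and ave() stands in for the group_by() plus mutate() step.

```r
# Toy counts (hypothetical): the same word can appear in a short and a long book
toy_counts <- data.frame(
  book = c("short", "short", "long", "long", "long"),
  word = c("仁", "知", "仁", "知", "道"),
  n    = c(30, 10, 30, 50, 120),
  stringsAsFactors = FALSE
)

# Per-book totals, repeated on every row of that book
toy_counts$book_total <- ave(toy_counts$n, toy_counts$book, FUN = sum)

# Relative frequency within each book
toy_counts$word_freq <- toy_counts$n / toy_counts$book_total

print(toy_counts[order(-toy_counts$word_freq), ])
```

In this toy table the same raw count of 30 is 75% of the short book but only 15% of the long one, which is why raw counts alone would mislead.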

Now let’s do the familiar ritual of examining the top 10 words in each book (taken from counts_by_book) and plotting them.

book_top_words <- counts_by_book %>%
  ungroup() %>%
  group_by(book) %>%
  top_n(10, word_freq) %>%
  ungroup()
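
One detail worth knowing: top_n() keeps ties, so a group can return more than 10 rows when several words tie at the cutoff. The base R sketch below mimics that behaviour on toy data with a cutoff of 2; keep_top and toy_freqs are hypothetical names, and the ranking is my approximation of the idea rather than dplyr's actual code.

```r
# Toy frequencies (hypothetical); book "a" has a tie for second place
toy_freqs <- data.frame(
  book      = c("a", "a", "a", "a", "b", "b", "b"),
  word      = c("w1", "w2", "w3", "w4", "w5", "w6", "w7"),
  word_freq = c(0.4, 0.3, 0.3, 0.1, 0.5, 0.3, 0.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the k largest values per book, including ties
keep_top <- function(df, k) {
  # Rank 1 = largest value; tied values share the smallest rank
  r <- ave(-df$word_freq, df$book,
           FUN = function(v) rank(v, ties.method = "min"))
  df[r <= k, ]
}

top2 <- keep_top(toy_freqs, 2)
print(top2)
```

Book "a" contributes three rows because w2 and w3 tie for second place, while book "b" contributes exactly two.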

## Format the above data frame for a pretty display with kable
formatted_words <- book_top_words %>%
  select(book, word, simplified, n, word_freq) %>%
  arrange(book, -word_freq)

## Set the output format for kable
options(knitr.table.format = "html")

knitr::kable(formatted_words) %>%
  kableExtra::kable_styling(font_size = 15, full_width = TRUE) %>%
  kableExtra::row_spec(1:10, color = "white", background = "#232528") %>%
  kableExtra::row_spec(11:20, color = "white", background = "#6A656B") %>%
  kableExtra::row_spec(21:30, color = "white", background = "#454d4c") %>%
  kableExtra::row_spec(0, bold = FALSE, color = "black", background = "white") %>%
  kableExtra::scroll_box(width = "100%", height = "350px")

book	word	simplified	n	word_freq
analects	問	问	110	0.0154321
analects	君子	君子	108	0.0151515
analects	仁	仁	76	0.0106622
analects	孔子	孔子	68	0.0095398
analects	行	行	57	0.0079966
analects	知	知	54	0.0075758
analects	路	路	52	0.0072952
analects	見	见	51	0.0071549
analects	民	民	45	0.0063131
analects	子貢	子贡	44	0.0061728
mozi	民	民	257	0.0068141
mozi	治	治	229	0.0060717
mozi	利	利	227	0.0060187
mozi	墨	墨	207	0.0054884
mozi	知	知	200	0.0053028
mozi	說	说	197	0.0052232
mozi	行	行	192	0.0050907
mozi	欲	欲	190	0.0050376
mozi	長	长	189	0.0050111
mozi	國	国	175	0.0046399
zhuangzi	夫	夫	313	0.0099356
zhuangzi	知	知	302	0.0095864
zhuangzi	見	见	222	0.0070469
zhuangzi	物	物	217	0.0068882
zhuangzi	大	大	204	0.0064756
zhuangzi	行	行	176	0.0055868
zhuangzi	邪	邪	165	0.0052376
zhuangzi	德	德	164	0.0052059
zhuangzi	道	道	164	0.0052059
zhuangzi	心	心	142	0.0045075

So far so good - these results are very intuitive. Of course, a plot makes the comparison easier to see at a glance. Let’s plot the top 10 words and their respective frequencies from each of these texts, in calligraphy-inspired colors!

ink_colors <- rev(c("ivory", "#454d4c", "#6A656B", "#232528"))

ggplot(book_top_words, aes(x = reorder(word, word_freq), y = word_freq, fill = book)) +
  geom_col(show.legend = FALSE) + 
  geom_text(aes(label = pinyin), color = "white", position = position_stack(vjust = 0.5)) + 
  facet_wrap(~book, scales = "free") + 
  coord_flip() +
  scale_fill_manual(values = ink_colors) +
  theme_dark(base_family= "HiraKakuProN-W3") + 
  theme(axis.text.x = element_text(color = "#232528", angle = 90)) +
  theme(axis.text.y = element_text(color = "#232528", size = 15)) +
  theme(panel.background = element_rect(fill = "#87969B"), plot.background = element_rect(fill = "ivory"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + 
  labs(x = NULL, y = NULL) +
  ggtitle("Word Frequencies") +
  theme(plot.title = element_text(size = 20, color = "#232528", hjust = 0.5))

Summary

In this post, I…

Used ctextclassics to download classic Chinese texts
Split the text into (mostly) single-character tokens
Filtered out common stop words
Grouped the data by book and word, calculating total words and word frequencies
Made a (calligraphy inspired) bar plot of the top 10 most frequent words in each text.

Conclusion

These word frequencies are very pleasing!

The Analects has prevalent usage of words such as “Benevolence” (仁), “Gentleman” (君子) and “Confucius” (孔子). (I capitalize these terms to show that they differ from their contemporary English equivalents.)

The Zhuangzi, a Taoist text, mentions cosmological concepts such as the Tao (道), morality (德), and evil (邪).

The Mozi seems to have terms that are mostly civic in nature, such as country (国), citizen (民), and govern (治).

These frequencies do a good job of capturing the context of the works. In the Analects, for example, Benevolence and the Gentleman are often mentioned; one exemplary sentence may be:

“君子而不仁者有矣夫。未有小人而仁者也.” My own (shorthand) translation:

Of Gentlemen, there are some who do not possess Benevolence; but of Villains, there is not a single one that possesses it.

In my next post, I would like to follow the same procedure with bigrams, apply LDA (Latent Dirichlet Allocation) to see whether chapters can be distinguished from one another, or both.

Although the frequent words differ greatly among the texts, I’m not so sure that each book can be completely distinguished from the others. There are many shared words: Dao isn’t mentioned solely in Taoist texts, each text includes civic concepts related to proper governance, and the Mozi was likely written by several different people!

On that note, to be continued!
