Linguistic characteristics of Chinese register based on the Menzerath–Altmann law and text clustering

Hou, R; Huang, CR; Ahrens, K; Lee, YMS

Issue Date:	Apr-2020
Source:	Digital scholarship in the humanities, Apr. 2020, v. 35, no. 1, p. 54-66
Abstract:	This article explores the linguistic features of different registers in Chinese through text clustering driven by the Menzerath–Altmann (MA) law. We propose to calculate the average word length distribution according to clause length. The MA law predicts that texts from different registers will differ in their average word length distributions. As predicted by the MA law, the analysis shows that average word length decreases as clause length increases in each register, and that the relationship can be fitted by the formula $y = a x^{b} e^{-cx}$. We hypothesize that it is the situation type, i.e. whether the text is dialectic or monologue, that is the linguistic characteristic behind the dichotomy of word length distribution. To confirm these register-distinguishing linguistic features, texts were represented by the average word length distribution and the fitted parameters using the vector space model and clustered according to their register categories. Good clustering results show that the average word length distribution in clauses of certain lengths, together with the fitted parameters, can be used as distinctive characteristics of these three registers.
Publisher:	Oxford University Press
Journal:	Digital scholarship in the humanities
ISSN:	2055-7671
EISSN:	2055-768X
DOI:	10.1093/llc/fqz005
Appears in Collections:	Journal/Magazine Article
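
Illustration (not part of the record): the abstract states that average word length y can be fitted against clause length x by y = a x^b e^(-cx). The sketch below shows one possible way to estimate a, b and c, by ordinary least squares on the log-transformed relation ln y = ln a + b ln x - c x. The log-linear approach, the function name fit_menzerath_altmann, and the synthetic data are all assumptions made for illustration; the abstract does not say how the authors performed the fit, and the numbers below are not taken from the article.

```python
import math


def fit_menzerath_altmann(xs, ys):
    """Estimate (a, b, c) in y = a * x**b * exp(-c * x).

    Hypothetical helper: fits by least squares on ln y, which is
    linear in the unknowns (ln a, b, c). All xs and ys must be > 0.
    """
    # Each design row multiplies the unknowns (ln a, b, c).
    rows = [(1.0, math.log(x), -float(x)) for x in xs]
    targets = [math.log(y) for y in ys]
    n = 3

    # Build the augmented normal equations [A^T A | A^T t].
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(n)
    ]

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]

    # Back substitution.
    p = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * p[k] for k in range(i + 1, n))
        p[i] = (m[i][n] - tail) / m[i][i]

    ln_a, b, c = p
    return math.exp(ln_a), b, c


# Synthetic, noise-free data generated from assumed parameters
# (a=2.0, b=-0.1, c=0.01); these are NOT values from the article.
xs = list(range(1, 21))
ys = [2.0 * x ** -0.1 * math.exp(-0.01 * x) for x in xs]
a, b, c = fit_menzerath_altmann(xs, ys)
```

With real corpus data, xs would be clause lengths and ys the average word length observed at each clause length. Fitting in log space weights relative rather than absolute errors, so a nonlinear least-squares fit on the original scale may give somewhat different parameters.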
