Gender Prediction for Japanese Authors
Abstract
We compare the performance of several automatic classification systems across a collection of different feature sets in detecting the gender of Japanese authors. The Japanese language is notable for the distinctiveness of the manners and modes of speech used by the two genders: speakers of one gender often use different verb forms, sentence-final particles, and even personal pronouns than speakers of the other. However, this "gendered speech" predominates in informal conversation, and Japanese formal and narrative texts typically contain little or no direct indication of the author's gender. We investigate the possibility that gender nonetheless influences the writing style of Japanese authors, and that this style can be used to predict the gender of a text's author. We explored several subtle features for guessing an author's gender and evaluated them with Naïve Bayes, SVM, and logistic regression classification models.

Dataset

We hand-collected two corpora of Japanese text from online sources: first, a collection of 30 essays written by Japanese middle schoolers on a common topic, and second, 485 installments of fantasy novels posted online by 40 different authors. These were selected specifically to provide similarity of content across documents, since our intended focus was on lexical and grammatical features rather than choice of content. This also lowered the likelihood of mistaking features associated with genre or age range for indicators of gender. We chose fictional texts for the second genre in order to lessen the impact of first-person references: in Japanese there are a variety of first-person pronouns which speakers may opt to use, and their usage is often divided by gender. For this reason, occurrences of gendered pronouns tended to dominate our classification of the first-person essays. Passages of fiction provided a useful environment in which first-person references were less trustworthy (though perhaps not entirely uninformative, since it is plausible that authors favor writing from the perspective of a character of like gender).

Methodology

Japanese Tokenization

Our first obstacle was to render our raw Japanese data in a tokenized form accessible to classification systems. Since Japanese text uses no overt word separator like the space used in many European languages, we required a more hands-on approach. We chose to tokenize our data using the ChaSen morphological parser developed by the Matsumoto laboratory at the Nara Institute of Science and Technology, available online. ChaSen tokenization provided not only word separation but other morphological information as well, including:

- Stem (an individual morpheme of a Japanese word)
- Lemma (the "dictionary form," or uninflected rendering, of a morpheme)
- Part of speech
- Pronunciation

Feature Set

Personal pronouns: The most obvious feature for gender classification is the occurrence of gender-specific first-person pronouns such as masculine 僕 (boku) and 俺 (ore) versus feminine-neutral 私 (watashi). However, we also explored less overt features that we hoped would be more stable in noisy environments like fiction.

Word choice: Since we restricted our classifiers to texts of similar topic and content, we chose to treat the occurrence of every word as a potentially informative feature, in order to test whether diction and word choice within a common subject might prove indicative of gender. (A sketch of how such per-document counts can be tabulated from the tokenizer output appears below.)
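As an illustration only, the following minimal sketch tabulates lemma frequencies from ChaSen-style output. The field layout assumed here (surface form, reading, lemma, part of speech) and the file name are illustrative assumptions, since the exact columns depend on the ChaSen configuration.

# Minimal sketch: tabulate lemma frequencies from ChaSen-style output.
# Assumed (configuration-dependent) format: one tab-separated line per
# morpheme with at least surface form, reading, lemma, and part of speech,
# and "EOS" marking the end of each sentence.
from collections import Counter

def lemma_counts(chasen_lines):
    """Return a Counter mapping lemma -> frequency for one document."""
    counts = Counter()
    for line in chasen_lines:
        line = line.rstrip("\n")
        if not line or line == "EOS":
            continue  # skip blank lines and sentence boundaries
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # unexpected line; column layout depends on configuration
        surface, reading, lemma, pos = fields[:4]
        counts[lemma] += 1
    return counts

# Hypothetical usage on one pre-tokenized essay:
# with open("essay_01.chasen", encoding="utf-8") as f:
#     features = lemma_counts(f)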
Pronunciation and word shape: The Japanese language uses three distinct writing systems: hiragana, a phonetic script used for native Japanese words as well as most inflectional suffixes; katakana, a phonetic script used for foreign loanwords and onomatopoeia, among other functions; and kanji, logograms primarily borrowed from Chinese which typically occur as content words and verb stems. A given Japanese word may often be represented in a number of ways: purely in hiragana, in a combination of hiragana and kanji, or even in katakana. Previous research on Japanese gendered speech indicates that males use more words of Chinese origin, and thus more words written in kanji, than females. We therefore hypothesized that documents written by male authors would contain a larger total number of kanji on average than those written by female authors. In Japanese there is also a prevalent cultural association between the phonetic script hiragana and femininity, so we further hypothesized that, given a word that can be written either in kanji or in hiragana, females would be more likely than males to write the word in hiragana. We designed features to test both of these hypotheses.

Use of quotations: While parsing ChaSen output, our feature extractor tagged words that occurred between quotation marks. Initially we intended to use this information to filter out quoted speech, which might be taken from outside sources and therefore not be indicative of the author. However, we found that the counts of words inside and outside quotation marks are very poor indicators of the author's gender. We therefore used these counts as a baseline feature, allowing us to compare the performance of our chosen features both to chance and to features that are not indicative of gender.

Part of speech: One defining feature of Japanese gendered speech is a difference in the usage of sentence-final particles, some of which the ChaSen tokenizer glosses as different parts of speech. We therefore theorized that male and female authors might produce sufficiently different part-of-speech tallies to be indicative of gender.

Lemma: One other possibility was that male and female authors would, on average, use different words to express themselves. For our purposes, the "lemma" was simply the dictionary form of the stem as parsed by ChaSen. Lemmas that we felt might be particularly indicative of the author's gender include personal pronouns such as "boku," "ore," and "watashi," politeness markers such as "desu" or the "masu" verb stem, and content words that may indicate different author preferences in the content of the story (e.g., "battle"). We counted each lemma as a separate feature; however, since some lemmas are much more predictive than others, this also increased the risk of overfitting our data.

Classifiers

Naïve Bayes classifier: We first adopted a Naive Bayes approach to text classification. Naive Bayes is often described as a "bag-of-words" approach, in which the probability of each word occurring in each class in the training data is used to compute the probability of the class of a new document given the words in that document. At the heart of Naive Bayes is the "naive" assumption that the features are conditionally independent given the class. This assumption is clearly false for the words contained in a single document, since, as mentioned in lecture, word-sense disambiguation can be aided by the presence of words as far as 10,000 words away in the same text.
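Written out in standard notation (this restatement is ours; c denotes the gender class and w_1, ..., w_n the words of a document), the assumption yields the familiar decision rule:

P(c \mid w_1, \ldots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} \;=\; \arg\max_{c \in \{\text{male},\, \text{female}\}} P(c) \prod_{i=1}^{n} P(w_i \mid c)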
Nonetheless, Naive Bayes classifiers are a simple but often effective tool for text classification and can produce surprisingly accurate results. In our case as well, Naive Bayes proved surprisingly accurate, and it was the simplest way to obtain decent results without worrying about parameter estimation.

Fig 1. The Naive Bayes assumption

Logistic regression: The second approach we took to data analysis was logistic regression. Logistic regression, like other machine learning algorithms of its kind, involves training a vector of weights that scale the relative contributions of the features so as to maximize the probability of the training data. We adapted a logistic regression program that one of us had previously written for binary input features so that it would also accept real-valued input features. Logistic regression tends to overestimate the importance of features when the sample size is small. In our case, we found that it was extremely sensitive to the number of training epochs and that, in many cases, it seemed to converge on roughly a 50% probability for either gender.

Support Vector Machine: Next, we analyzed our dataset using the LIBSVM tool produced by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University. The fundamental idea behind the SVM classifier is to treat documents as points in n-dimensional space, where n is the number of features, segregated into two (or conceivably more) classes. A hyperplane is then found which divides the segregated sets with the greatest possible margin. Test data are then plotted in this space and classified according to the side of the hyperplane on which they fall.

However, our specific application of SVM required some optimization, since certain constants used in training can be problem-specific: namely, the constant C, which penalizes divisions that do not cleanly separate the known-class data points, and γ, a parameter of the kernel function K used to map values into a transformed space in order to accommodate problems for which the class separation is non-linear in the original space. To select suitable values of C and γ for our application, we used a grid search with cross-validation on our corpus: an initial sweep over possible parameter values was examined for regions of high cross-validation accuracy, which were then searched again with finer increments until well-performing parameters were found. Our intent here was to prevent over-fitting of the training data; by selecting a more lenient error-penalty parameter C, we were able to account for the anticipated fuzziness of our dataset, which (in the absence of stark divisions like boku/watashi usage in passages written from the author's perspective) we did not expect to exhibit a clear gulf between the male and female classes. The C parameter therefore needed to permit training-set points to fall on the incorrect side of the hyperplane, so that the classifier would not select a division which cleanly but narrowly separated the data.

Additionally, to prevent feature values which inherently accumulate much larger counts across all documents from washing out the effects of relevant but low-frequency features, we scaled all feature counts in the training data to the range [-1, 1] and applied the identical transformation to the feature counts tabulated for the testing data. A sketch of this scaling and grid search appears below.
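The sketch below is an illustrative reconstruction of the scaling and coarse grid-search step, written with scikit-learn's SVC (which wraps LIBSVM) rather than the LIBSVM tools used in the experiments; the parameter grids, function name, and variable names are assumptions, not the exact values used.

# Illustrative sketch: [-1, 1] feature scaling plus a cross-validated grid
# search over C and gamma. scikit-learn's SVC wraps LIBSVM; the grids below
# mirror common LIBSVM defaults and are assumptions, not the values used here.
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

def train_gender_svm(X_train, y_train):
    """X_train: document-by-feature count matrix; y_train: 0/1 gender labels."""
    pipeline = Pipeline([
        # Scale every feature to [-1, 1] on the training data; the fitted
        # scaler applies the identical transformation to test documents.
        ("scale", MinMaxScaler(feature_range=(-1, 1))),
        ("svm", SVC(kernel="rbf")),
    ])
    # Coarse sweep over C and gamma; a finer sweep around the best-scoring
    # region would follow, as described above.
    coarse_grid = {
        "svm__C": [2.0 ** k for k in range(-5, 16, 2)],
        "svm__gamma": [2.0 ** k for k in range(-15, 4, 2)],
    }
    search = GridSearchCV(pipeline, coarse_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_

# Hypothetical usage:
# model, best_params = train_gender_svm(X_train, y_train)
# predictions = model.predict(X_test)  # test features are scaled identically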