Development of a concept and methodology for multi-level monitoring of the state of interethnic relations using data from social media
Project Head: Olessia Koltsova
The major result of this project is a conceptual approach and a methodology for a system devised to monitor ethnic relations in the post-Soviet space. Its main goal is to trace, in a semi-automatic way, the distribution of discussions about ethnicity in Russian-language social media over time and space. The primary purpose of this tracing is the early prevention of emerging inter-ethnic conflicts. Our conceptual approach is based on a large number of experiments that resulted in an integral vision of the sequence of steps necessary to accomplish all monitoring tasks. Those steps have been translated into a system of concrete methods and algorithms, which in turn have been implemented in user-friendly software available online. This easy-to-use system is accompanied by methodological recommendations that contain both a description of the underlying approach and a practical guide for analysts and researchers interested in monitoring ethnic relations.
- Online system is available at: https://topicminer.hse.ru/
- Technical user manual is available at: https://topicminer.hse.ru/docs/
- Methodological recommendations are available at: https://linis.hse.ru/rnf2015/
- Offline desktop version of the system: available on the enclosed CD.
- Project website with publications and other materials: https://linis.hse.ru/rnf2015/
Our system contains the following functionality and components.
First, the methodology takes into account that a user may have access only to noisy, unfiltered data with a low proportion of texts about ethnicity (e.g. raw dumps of social media messages). Our system does not collect data, but it contains a number of unique instruments for text preprocessing, the core of which is a methodology for filtering out texts irrelevant to the topic of ethnicity. All our experiments have shown that detecting any ethnicity-related trends in large collections of texts is impossible without pre-filtering. The methodology consists of two components: text selection based on a lexicon of ethnonyms containing 3,680 individual words and 12,670 bigrams (precision up to 74%), and machine-learning-based selection (precision and recall around 74%). We recommend combining these two approaches to increase recall.
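The recommended combination of the two filters can be illustrated with a minimal Python sketch. It is not the system's implementation: the tiny lexicon, the function names and the `classifier_predict` callable are hypothetical stand-ins, and the sketch ignores lemmatization, which Russian-language texts would require in practice. A text is kept if either component marks it as relevant, which is what raises recall.

```python
import re

# Hypothetical toy lexicon; the project's lexicon holds 3,680 words and 12,670 bigrams.
ETHNONYM_WORDS = {"tatars", "chechens", "ukrainians"}
ETHNONYM_BIGRAMS = {("central", "asians")}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def lexicon_match(text):
    """Return True if the text contains a lexicon word or a lexicon bigram."""
    tokens = tokenize(text)
    if any(token in ETHNONYM_WORDS for token in tokens):
        return True
    return any(pair in ETHNONYM_BIGRAMS for pair in zip(tokens, tokens[1:]))


def combined_filter(texts, classifier_predict):
    """Keep a text if the lexicon OR the classifier marks it as relevant."""
    return [t for t in texts if lexicon_match(t) or classifier_predict(t)]


# Stand-in for a trained classifier (assumption: any callable text -> bool).
def stub_classifier(text):
    return "migrants" in text.lower()


sample = [
    "Tatars celebrate a holiday",
    "Nice weather today",
    "Central Asians in Moscow",
    "New rules for migrants",
]
kept = combined_filter(sample, stub_classifier)
```

Taking the union of the two selections trades some precision for recall; an intersection would do the opposite.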
Second, building on this collection enrichment, our system allows users to extract topics, that is, contexts in which ethnic issues are discussed and which are not known to researchers beforehand. For this, we offer a number of improvements to topic modeling algorithms, whose quality has been tested both manually and with a specially developed quality metric, tf-idf coherence. Our experiments have shown that a basic pLSA algorithm with our lexicon of ethnonyms yields the best results among all BigARTM algorithms. It is best suited for revealing the entire range of ethnicity-related topics existing in a given collection, for comparing those topics by volume, and for detecting topics devoted to several ethnic groups simultaneously. To extract contexts related to a single pre-defined ethnic group, a better option is our other algorithm with more aggressive partial supervision, ISLDA, which also exceeds basic LDA both in the proportion of ethnically relevant topics and in their tf-idf coherence.
The algorithms introduced above were tested on different collections containing from 100,000 to 9,000,000 texts, with different average text lengths and different shares of relevant texts. Good results were achieved with collections containing a certain proportion of relevant and long texts. The main contribution to the quality of the tested models came from our lexicon of ethnonyms. The overall conclusion from the experiments is that, although topic modeling cannot be used to extract relevant texts from collections with a low proportion of such texts (this task was solved via supervised classification), it nevertheless works well for detecting the contexts in which ethnicity is discussed. All the algorithms listed above are implemented in our system, which can also suggest ethnically relevant topics to the user by comparing the topics' top words with our lexicon of ethnonyms.
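The comparison of topics' top words with the lexicon of ethnonyms can be sketched as a simple set intersection. This is only an illustration under stated assumptions: the function name, the `min_hits` threshold and the toy data are hypothetical, and the system's actual matching rule is not described here.

```python
def flag_ethnic_topics(topic_top_words, lexicon, min_hits=1):
    """Return {topic_id: matching words} for topics with enough lexicon hits.

    topic_top_words: dict mapping a topic id to its list of top words.
    lexicon: set of ethnonyms.
    min_hits: hypothetical threshold on the number of matching top words.
    """
    flagged = {}
    for topic_id, words in topic_top_words.items():
        hits = sorted(set(words) & lexicon)
        if len(hits) >= min_hits:
            flagged[topic_id] = hits
    return flagged


# Toy data, invented for illustration only.
toy_topics = {
    0: ["football", "match", "goal"],
    1: ["tatars", "kazan", "language", "chechens"],
    2: ["ukrainians", "border", "news"],
}
toy_lexicon = {"tatars", "chechens", "ukrainians"}

flagged = flag_ethnic_topics(toy_topics, toy_lexicon)
strict = flag_ethnic_topics(toy_topics, toy_lexicon, min_hits=2)
```

Raising `min_hits` narrows the suggestions to topics where ethnonyms are more prominent among the top words.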
Third, the system can produce distributions of ethnically relevant topics over time and space and visualize them on a timeline or on a map of Russia, respectively. Besides simply summing the probabilities of a given topic over all texts from a given region or time period, our methodology includes specially tuned multimodal topic modeling algorithms in which timestamps and geolocation tags are treated as a separate modality. Our experiments have shown that this approach works better than simple summing for revealing topics concentrated in time, although it penalizes topics evenly distributed over time. To obtain a more precise distribution across Russian regions, we have also calculated a set of correction coefficients that account for the uneven penetration of social networks across the federal subjects of Russia.
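The simple summing baseline can be written in a few lines of Python. The sketch below assumes a document-topic probability matrix given as a list of rows and one region (or time period) label per document; the function name is hypothetical, and applying the correction coefficients as multipliers is an assumption made for illustration, since the exact form of the correction is not specified here.

```python
from collections import defaultdict


def topic_volume_by_group(doc_topic_probs, doc_groups, topic_id, corrections=None):
    """Sum the probability of one topic over all documents in each group.

    doc_topic_probs: list of rows, row[k] = probability of topic k in a document.
    doc_groups: group label (region or time period) of each document.
    corrections: optional {group: coefficient}; multiplicative use is an assumption.
    """
    totals = defaultdict(float)
    for probs, group in zip(doc_topic_probs, doc_groups):
        totals[group] += probs[topic_id]
    if corrections:
        return {g: v * corrections.get(g, 1.0) for g, v in totals.items()}
    return dict(totals)


# Toy data, invented for illustration only.
probs = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
regions = ["A", "A", "B"]

raw = topic_volume_by_group(probs, regions, topic_id=0)
corrected = topic_volume_by_group(probs, regions, topic_id=0, corrections={"B": 0.5})
```

The same function yields a time series if `doc_groups` holds time periods instead of regions.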
Fourth, our methodology makes it possible to reveal a number of aspects of attitudes to the problems of ethnicity. This part of the methodology is based on algorithms trained on a marked-up collection containing 15,000 messages about 115 post-Soviet ethnic groups. Experiments were run with the following set of attitude aspects. Text-level aspects: (1) general problematization of the topic in the text (Does the text contain negative / positive sentiment?); (2) conflict presence (Does the text mention inter-ethnic conflict or positive inter-ethnic interaction?). Instance-level aspects: (3) general attitude (What is the general attitude of the text author to a given ethnic group? Negative / positive / neutral); (4) perception of ethnic hierarchy (Does the author treat a given ethnic group as superior / inferior?); (5) danger perception (Does the author perceive a given ethnic group as dangerous?); (6) blame attribution (In case of conflict, does the author present a given ethnic group as a victim / an aggressor?); (7) call for violence (Does the author call for violence against a given ethnic group?). For aspects 5 and 7, there was insufficient data to train a classifier. The other instance-level aspects produced mixed results; among them, the best quality was obtained for the classes “superior” and “aggressor”. At the text level, negative aspects (conflict presence and negative sentiment) are predicted better than positive ones; algorithms trained to predict these two aspects have been integrated into our online system. In addition, the system has been equipped with a function for sentiment analysis of topics based on comparing the topics' top words with our sentiment lexicon.
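Lexicon-based sentiment analysis of a topic can be illustrated as follows. This is a minimal sketch, not the system's scoring rule: it assumes a sentiment lexicon that maps words to numeric scores and averages the scores of the top words found in the lexicon. Both the averaging and the toy lexicon are assumptions made for illustration.

```python
def topic_sentiment(top_words, sentiment_lexicon):
    """Average the lexicon scores of a topic's top words.

    Returns None when no top word is covered by the lexicon,
    so that uncovered topics are not mistaken for neutral ones.
    """
    scores = [sentiment_lexicon[w] for w in top_words if w in sentiment_lexicon]
    if not scores:
        return None
    return sum(scores) / len(scores)


# Hypothetical lexicon: negative scores for negative words, positive for positive.
toy_sentiment = {"war": -1.0, "conflict": -0.5, "holiday": 1.0}

negative_topic = topic_sentiment(["war", "conflict", "city"], toy_sentiment)
uncovered_topic = topic_sentiment(["city", "street"], toy_sentiment)
```

Weighting each word by its probability in the topic would be a natural refinement, but it requires the topic-word distribution rather than just the list of top words.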
Our experiments have also shown that doubling the size of the marked-up collection improves the quality of classification but does not solve the problem radically; furthermore, the quality seems to be unrelated to the level of inter-coder agreement. This suggests that further improvement in the classification of attitudes to ethnic issues should be sought in the extraction of specific grammatical constructions.
In addition to the development of our methodology, a number of social media text collections have been constructed and analyzed. First, we created a number of samples from previously collected posts of popular LiveJournal bloggers, containing from 100,000 to 1.58 million messages, of which 2,000 were manually marked up. Second, we constructed a sample of 74,000 random VKontakte users drawn from each federal subject of Russia, which produced 9 million posts and 1 million comments. Third, we obtained a collection of all messages containing at least one of the keywords denoting post-Soviet ethnic groups from all Russian-language social networks over two years; after filtering, this collection comprised 2.7 million texts. The marked-up collection of 15,000 posts was sampled mostly from this keyword collection.
In the random VK sample, texts with ethnonyms constitute only a fraction of a per cent, and the most frequently mentioned groups are not post-Soviet ethnic groups but nations with global influence (Americans, Germans). This sample is dominated by short texts and recreational topics. In the keyword-based ethnonym sample, although it is dominated by VK, texts are on average 20 times longer, and the prevailing topics are related to public affairs and lean towards negative sentiment. All this suggests an overall problematization of discussions of ethnic issues and confirms the importance of monitoring them. The occurrence of post-Soviet ethnonyms and quasi-ethnonyms in the latter sample is also very uneven (the most frequent are Russians, Ukrainians, Jews, Slavs, Asians, Europeans, Tatars and Chechens). Furthermore, the regional distribution is highly uneven: in the local frequency ratings of the “national republics”, the respective titular ethnic groups gain on average 45 positions compared to the general frequency rating. This confirms the importance of the regional dimension of the monitoring system.
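The notion of positions gained in a local frequency rating can be made concrete with a short Python sketch. The function names and the toy counts are hypothetical; ties are broken alphabetically here, which is an arbitrary choice made only to keep the ranking deterministic.

```python
def rank_of(word, counts):
    """1-based rank of a word by descending frequency (ties broken alphabetically)."""
    ranking = sorted(counts, key=lambda w: (-counts[w], w))
    return ranking.index(word) + 1


def rank_gain(word, general_counts, local_counts):
    """Number of positions a word gains in the local rating versus the general one."""
    return rank_of(word, general_counts) - rank_of(word, local_counts)


# Toy frequencies, invented for illustration only.
general = {"russians": 100, "ukrainians": 80, "tatars": 10}
local = {"russians": 50, "tatars": 60, "ukrainians": 5}

gain = rank_gain("tatars", general, local)
```

A positive value means the ethnonym ranks higher locally than in the general rating; averaging such values over the titular groups of all republics gives a figure comparable to the one reported above.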
Around 45% of texts mentioning ethnonyms contain more than one ethnonym; in the marked-up collection this proportion amounts to 66%. The co-occurrence analysis shows that the ethnic groups that co-occur most often are those that are culturally or geographically proximate, not those that are in conflict with each other, as we initially assumed. Conflicting pairs are also sometimes mentioned together, but the fact that only 6% of texts contain opposite attitudes to the mentioned ethnic groups suggests that such pairs are indeed a minority. At the same time, another 15% of texts contain a combination of a neutral attitude and a positive or negative one, and all this complicates the automatic detection of attitudes to ethnic groups in texts. It has also been revealed that these attitudes are unevenly distributed among different ethnic groups. For instance, manual analysis of LiveJournal posts has shown that Caucasian ethnic groups attract a significantly larger number of negative attitude aspects than Central Asians, which is in accordance with earlier survey-based studies.
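A co-occurrence analysis of this kind can be sketched as pair counting over the sets of ethnonyms found in each text. The sketch assumes that ethnonyms have already been extracted per text; the function name and the toy input are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(texts_ethnonyms):
    """Count unordered ethnonym pairs per text and the share of multi-ethnonym texts.

    texts_ethnonyms: list of lists, each holding the ethnonyms found in one text.
    Returns (Counter of sorted pairs, share of texts with more than one ethnonym).
    """
    pair_counts = Counter()
    multi = 0
    for ethnonyms in texts_ethnonyms:
        unique = sorted(set(ethnonyms))  # each pair is counted once per text
        if len(unique) > 1:
            multi += 1
        pair_counts.update(combinations(unique, 2))
    share_multi = multi / len(texts_ethnonyms) if texts_ethnonyms else 0.0
    return pair_counts, share_multi


# Toy input, invented for illustration only.
toy_texts = [
    ["russians", "ukrainians"],
    ["tatars"],
    ["ukrainians", "russians", "tatars"],
    ["russians"],
]
pairs, share = cooccurrence_counts(toy_texts)
```

Raw pair counts favour frequent ethnonyms, so comparing them with the counts expected from individual frequencies would be a reasonable next step.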
It should be noted that direct calls for violence against any ethnic group occur in less than 1% of ethnically relevant texts. Beyond LiveJournal, positive aspects of attitude prevail over negative ones, although this may be explained by the over-representation of small nationalities in the marked-up sample. At the same time, these marked-up texts are characterized more by a generalized vision of ethnic groups, negative sentiment and mentions of conflict than by positive sentiment, mentions of positive inter-ethnic interaction or mentions of concrete persons of a given ethnicity. In other words, users problematize the topic of ethnicity in general more often than they express a direct negative attitude to certain ethnic groups or persons.