
Improving the Methodology of Automatic Text Analysis

 

 

 

Project leader: Sergei Koltcov

Project participants (at different times): Sergey Nikolenko, Konstantin Vorontsov, Murat Apishev, Vladimir Filippov, Maxim Koltsov, Vera Ignatenko

 

Topic modeling is a promising instrument for computational social science and digital humanities, as it makes it possible to automatically reveal the topic structure of large text collections, an immensely important task in the era of big Internet data. However, topic modeling has a number of problems that prevent its efficient use by social scientists, including social media analysts. First, it does not yield reproducible results and fluctuates greatly from one algorithm run to another. Second, it gives no clues on how to optimize its parameters, such as the number of topics and other parameters of the model, and, third, there are no reliable quality metrics that could be used for such optimization or for assessing the algorithm's performance. A possible solution can be sought in the application of concepts from statistical physics.

This project represents LINIS's ongoing effort to solve these problems.

First, the project tests existing topic modeling quality metrics and seeks to develop new ones. It also develops approaches to metric testing and theoretical concepts of topic modeling quality and ground truth. One of the project's publications proposes a tf-idf coherence metric that performs better than ordinary coherence and generalizes easily from evaluating a single topic to evaluating the entire solution.
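The general idea can be sketched as follows: the standard document co-occurrence coherence sums log((D(w_i, w_j) + 1) / D(w_j)) over pairs of a topic's top words, where D(.) counts documents; in the tf-idf variant the raw counts are replaced by sums of tf-idf weights. Below is a minimal illustrative sketch of such a score in Python; the function name, the dense matrix inputs, and the exact weighting are assumptions for illustration, not the published definition.

import numpy as np

def tfidf_coherence(top_words, tfidf, doc_term):
    # Sketch of a tf-idf coherence score for one topic.
    # top_words : indices of the topic's top-N terms
    # tfidf     : documents x terms matrix of tf-idf weights (dense ndarray)
    # doc_term  : documents x terms occurrence matrix (dense ndarray)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            in_wi = doc_term[:, wi] > 0
            in_wj = doc_term[:, wj] > 0
            co = in_wi & in_wj
            # replace raw document counts with sums of tf-idf weights
            num = (tfidf[co, wi] * tfidf[co, wj]).sum() + 1.0
            den = tfidf[in_wj, wj].sum() + 1e-12
            score += np.log(num / den)
    return score

A whole-solution score can then be obtained simply by averaging this value over all topics, which is what makes the metric easy to generalize from a single topic to the entire solution.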

Second, the project aims to regularize topic modeling algorithms so as to improve their stability. The team offers various approaches, such as sampling neighboring words from texts (gLDA, granulated LDA), seed-word-based semi-supervised solutions (ISLDA, interval semi-supervised LDA), and experiments with additive regularization of pLSA (in collaboration with Konstantin Vorontsov's team at HSE Moscow).
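The seed-word idea behind ISLDA can be illustrated with the topic-sampling step of a collapsed Gibbs sampler: for a seed word, the draw is restricted to the topic interval assigned to it, while all other words are sampled as usual. The sketch below shows only this single step, with illustrative names and arguments; it is not the project's actual gLDA or ISLDA implementation.

import numpy as np

def sample_topic(w, d, n_wt, n_dt, n_t, alpha, beta, allowed_topics=None):
    # One collapsed-Gibbs draw for token w in document d
    # (counts n_wt, n_dt, n_t exclude the current token).
    W = n_wt.shape[0]
    p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
    if allowed_topics is not None:
        # ISLDA-style constraint: a seed word may only take
        # topics from its pre-assigned interval.
        mask = np.zeros_like(p)
        mask[allowed_topics] = 1.0
        p = p * mask
    p = p / p.sum()
    return np.random.choice(len(p), p=p)

Pinning seed words to fixed topic intervals anchors those topics across runs, which is the mechanism through which the semi-supervised variants improve stability.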

Third, the project develops methods to efficiently detect the optimal number of topics, given that parameter optimization is a computationally intensive task. The project lays theoretical foundations for greedy algorithms based on concepts from thermodynamics, such as non-extensive entropy and free energy. This approach makes it possible to look at the problem of the ambiguity of stochastic matrix decomposition from a new angle and to formulate topic number optimization as a search for the entropy minimum and the information maximum. One of the project's recent publications proposes a new metric based on Rényi entropy for determining the optimal number of topics (Koltcov, 2018).
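As a rough illustration of the entropy-based approach, the sketch below computes a Rényi-entropy-like quantity from a topics-by-words probability matrix: words above the uniform probability level 1/W contribute an energy term and a density-of-states term, the number of topics plays the role of temperature, and the deformation parameter is taken as q = 1/T. The exact normalizations of the published metric (Koltcov, 2018) may differ from this sketch.

import numpy as np

def renyi_entropy(phi):
    # phi : topics x words matrix of p(word | topic)
    T, W = phi.shape
    if T < 2:
        return np.inf
    mask = phi > 1.0 / W                   # "informative" words above the uniform level
    N = mask.sum()
    if N == 0:
        return np.inf
    energy = -np.log(phi[mask].sum() / T)  # internal-energy analogue
    entropy_s = np.log(N / (W * T))        # entropy of the density of states
    free_energy = energy - T * entropy_s   # temperature identified with the topic number
    return free_energy / (T - 1)           # Rényi-type entropy with q = 1/T

In practice, one fits models over a range of topic numbers and takes the number at which this quantity reaches its minimum, which corresponds to the information maximum mentioned above.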

Fourth, the project team invests a lot of effort in developing and maintaining TopicMiner, GUI-based research software for topic modeling. By freeing researchers from coding and scripting, the software lets them concentrate on substantive topic modeling tasks: it allows computer and NLP scientists to quickly apply and evaluate various models, and it allows social scientists and humanities scholars to efficiently examine and interpret topic modeling results. The current version of TopicMiner implements basic pLSA, LDA (both the E-M algorithm and Gibbs sampling), BigARTM-based models, a number of quality metrics, and visualization of modeling progress. It also contains a user-friendly preprocessing module and a module for working with model output (visualization, scrolling through and sorting millions of documents, and output export).

 

Download TopicMiner software

Download TopicMiner manual (Russian)

 

Publications:

 


 
