LiveJournal Topical Structure Analysis
Project Head: Olessia Koltsova
Project Participants: Kirill Maslinsky, Sergey Koltcov, and Iskander Yasaveev.
This project is a major part of the project ‘A Study of the Construction of Social Problems in Blogs with Advanced Methods of Text and Network Analysis’, carried out as part of the thematic plan of the Centre for Basic Research in 2012.
This methodological project aimed to develop instruments that social scientists can use for sociological research on socially important issues discussed in blogs. The research identified discussion themes in blogs during a particular time period.
The empirical base of the research was collected automatically; it contains posts and their characteristics, comments, and authors from the blog platform LiveJournal: several hundred thousand posts in total and roughly twenty times as many comments. As a result, unique full-text databases of LiveJournal were collected that can be used for a variety of studies. Several samples with different parameters were extracted from these databases. Technological chains were created for downloading, sample extraction, preprocessing (preparing for automatic analysis), and analysis, which would help social researchers work with big data without resorting to software development.
The thematic structure of posts was determined automatically via topic modeling algorithms. The commenting communities were found via community-detection algorithms in graphs. The researchers conducted a series of experiments with different algorithms and concluded that the problem of choosing the optimal number of topics in topic modeling was far from resolved.
An approach was suggested for optimizing the number of topics in topic modeling and the number of clusters in cluster analysis. Software was developed that helped confirm that the topic modeling and clustering algorithms had roughly the same optimum on the same data.
The research described the main statistical characteristics of LiveJournal’s top blog posts, their distribution and relationships, and suggested various indexes for measuring blogger activity and success rates. The study showed that the number of comments hardly correlated with the number of posts. This made it possible to introduce a blogger efficiency index, based on the average number of comments per blogger’s post. Though blog topics are numerous, in several blogs certain topic groups prevail, which makes it possible to create thematic profiles of bloggers and to cluster bloggers according to these profiles. The research revealed the variation range of bloggers’ activity according to the day of the week and time of day (on weekends the activity decreases by a quarter), which helped calculate adjustment coefficients for the correct identification of activity peaks.
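The two quantities described above can be illustrated with a minimal Python sketch. It is not the project's own software. The function names and input shapes are hypothetical, and the adjustment coefficient is defined here, as an assumption, as the overall mean activity divided by the mean activity of a given weekday; the project summary does not state the exact formula that was used.

```python
from collections import defaultdict


def efficiency_index(posts):
    """Average number of comments per post for each blogger.

    `posts` is an iterable of (author, comment_count) pairs, one per post.
    The input shape is an assumption made for this sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # author -> [post count, comment count]
    for author, comment_count in posts:
        totals[author][0] += 1
        totals[author][1] += comment_count
    return {author: comments / n_posts
            for author, (n_posts, comments) in totals.items()}


def weekday_adjustment(mean_activity):
    """One possible adjustment coefficient per weekday (an assumption).

    `mean_activity` maps a weekday label to its mean activity, which must
    be positive. Multiplying raw activity by the coefficient rescales
    every weekday to the overall mean, so a peak is judged against a
    common baseline rather than against a quiet or busy day.
    """
    overall = sum(mean_activity.values()) / len(mean_activity)
    return {day: overall / value for day, value in mean_activity.items()}
```

With coefficients of this kind, a quiet weekend day receives a coefficient above 1 and a busy weekday a coefficient below 1, so a weekend burst of activity is not hidden by the generally lower weekend level.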
During the network analysis, the researchers found that the co-commenting communities depended to some extent on the author of the posts. In the course of the project, the knowledge of the thematic structure of posts and its changes over time was refined (the initial results were obtained in the course of the research project “Developing a Methodology of the Network and Semantic Analysis of Blogs for Sociological Purposes”).
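A co-commenting network of the kind analysed above can be built as follows. This is a minimal sketch under stated assumptions, not the project's own procedure: two commenters are linked if they commented on the same post, and the edge weight counts the posts they share. The function name and the (post_id, commenter) input shape are hypothetical. The community-detection step itself is not shown; the resulting weighted edge list could be passed to any graph library that provides such algorithms.

```python
from collections import Counter
from itertools import combinations


def co_commenting_edges(comments):
    """Build weighted edges of a co-commenting network.

    `comments` is an iterable of (post_id, commenter) pairs.
    Returns a Counter that maps an alphabetically sorted pair of
    commenters to the number of posts on which both of them commented.
    """
    commenters_by_post = {}
    for post_id, commenter in comments:
        # A set ignores repeated comments by the same person on one post.
        commenters_by_post.setdefault(post_id, set()).add(commenter)

    edges = Counter()
    for commenters in commenters_by_post.values():
        for pair in combinations(sorted(commenters), 2):
            edges[pair] += 1
    return edges
```

Grouping the same input by post author before building the edges would give one way to examine how far the resulting communities depend on the author of the posts.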
BlogMiner is a software product for downloading the database. It was created in the Laboratory for Internet Studies, implemented in Delphi 7, and synchronized with a standard shell (Microsoft SQL Server) for creating and working with databases.
Stanford Topic Modeling Toolbox is a topic modeling program for sociologists and other researchers who conduct text data analysis.
gCluto (George Karypis Lab) is a graphical version of the program Cluto, academic software for offline clustering of large text collections, based on the bag-of-words approach. The program contains 17 algorithms, including flat and hierarchical clustering, and graph-based algorithms.