A thermodynamic approach is applied to the problem of choosing the number of clusters/topics in topic modeling. The main principles of the approach are formulated, and the behavior of topic models under temperature variation is investigated. Using the thermodynamic formalism, the existence of an entropy phase transition in topic models is shown, and criteria for choosing the optimal number of topics/clusters are formulated.
This article reports on the 9th Russian Summer School in Information Retrieval (RuSSIR 2015).
Applicability limits of the particle-in-cell (PIC) method for the calculation of jet gasdynamic flows under conditions of pressure variations by four or five orders of magnitude are studied. Three approaches permitting one to determine real limits of the model adequacy from the side of low pressures are considered. Based on the analysis of the results, it is shown that the PIC method adequately operates in the pressure range of 5–10⁵ Pa in spite of the fact that, formally, the PIC method can operate also at lower pressures.
Buffering architectures and policies for their efficient management constitute one of the core ingredients of a network architecture. In this work we introduce a new specification language, BASEL, that allows one to express virtual buffering architectures and management policies representing a variety of economic models. BASEL does not require the user to implement policies in a high-level language; rather, the entire buffering architecture and its policy are reduced to several comparators and simple functions. We show examples of buffer management policies in BASEL and demonstrate empirically the impact of various settings on performance.
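To make the idea of reducing a buffer policy to a few comparators more concrete, here is a minimal Python sketch. It is not BASEL syntax and not the authors' implementation; the class `PolicyBuffer`, its methods, and the packet fields `value` and `arrival` are hypothetical names chosen for illustration. The buffer's behavior is fully determined by two ordering functions: one that picks the packet to push out on overflow, and one that picks the packet to transmit next.

```python
class PolicyBuffer:
    """Bounded buffer whose policy is given by two ordering keys (hypothetical sketch)."""

    def __init__(self, capacity, drop_key, send_key):
        self.capacity = capacity
        self.drop_key = drop_key   # smallest key is pushed out on overflow
        self.send_key = send_key   # largest key is transmitted first
        self.packets = []

    def admit(self, packet):
        """Add a packet; on overflow, drop and return the lowest-ranked packet."""
        self.packets.append(packet)
        if len(self.packets) > self.capacity:
            victim = min(self.packets, key=self.drop_key)
            self.packets.remove(victim)
            return victim
        return None

    def transmit(self):
        """Remove and return the packet ranked highest by send_key, if any."""
        if not self.packets:
            return None
        best = max(self.packets, key=self.send_key)
        self.packets.remove(best)
        return best


# Example policy: push out the least valuable packet, transmit in FIFO order.
buf = PolicyBuffer(
    capacity=2,
    drop_key=lambda p: p["value"],
    send_key=lambda p: -p["arrival"],
)
dropped = [buf.admit({"value": v, "arrival": i}) for i, v in enumerate([5, 1, 3])]
first_out = buf.transmit()
```

Swapping the two key functions changes the policy (for example, to value-ordered transmission) without touching the buffer code, which is the kind of separation the abstract describes.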
Purpose – The paper addresses the problem of what drives the formation of latent discussion communities, if any, in the blogosphere: topical composition of posts or their authorship? The purpose of this paper is to contribute to the knowledge about the structure of co-commenting.
Design/methodology/approach – The research is based on a dataset of 17,386 full-text posts written by the top 2,000 LiveJournal bloggers and over 520,000 comments that result in about 4.5 million edges in the network of co-commenting, where posts are vertices. The Louvain algorithm is used to detect communities of co-commenting. Cosine similarity and topic modeling based on latent Dirichlet allocation are applied to study topical coherence within these communities.
Findings – Bloggers unite into moderately manifest communities by commenting on roughly the same sets of posts. The graph of co-commenting is sparse and connected by a minority of active non-top commenters. Communities are centered mainly around blog authors as opinion leaders and, to a lesser extent, around a shared topic or topics.
Research limitations/implications – The research has to be replicated on other datasets with more thorough hand coding to ensure the reliability of results and to reveal average proportions of topic-centered communities.
Practical implications – Knowledge about factors around which co-commenting communities emerge, in particular clustered opinion leaders that often attract such communities, can be used by policy makers in marketing and/or political campaigning when individual leadership is not enough or not applicable.
Originality/value – The research contributes to the social studies of online communities. It is the first study of communities based on co-commenting that combines examination of the content of commented posts and their topics.
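The cosine similarity mentioned in the design section above can be illustrated with a short sketch. This is a generic bag-of-words version written for illustration, not the authors' pipeline; the helper names `bag_of_words` and `cosine_similarity` and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, whitespace-tokenized term counts (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty post is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


same = cosine_similarity(bag_of_words("election vote"), bag_of_words("vote election"))
disjoint = cosine_similarity(bag_of_words("election vote"), bag_of_words("football match"))
```

Averaging such pairwise scores over the posts in one detected community gives one possible measure of its topical coherence.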
We study topic models designed to be used for sentiment analysis, i.e., models that extract certain topics (aspects) from a corpus of documents and mine sentiment labels related to individual aspects. For both direct applications in sentiment analysis and other uses, it is desirable to have a good lexicon of sentiment words, preferably an aspect-specific one. We have previously developed a modification for several popular sentiment-related LDA extensions that trains prior hyperparameters β for specific words. We continue this work and show how this approach leads to new aspect-specific lexicons of sentiment words based on a small set of “seed” sentiment words; the lexicons are useful by themselves and lead to improved sentiment classification.
Efficient packet classification is a core concern for network services. Traditional multi-field classification approaches, in both software and ternary content-addressable memories (TCAMs), entail tradeoffs between (memory) space and (lookup) time. TCAMs cannot efficiently represent range rules, a common class of classification rules confining values of packet fields to given ranges. The exponential space growth of TCAM entries relative to the number of fields is exacerbated when multiple fields contain ranges. In this work, we present a novel approach that identifies properties of many classifiers that allow them to be implemented in linear space with guaranteed logarithmic worst-case lookup time, and that allows more fields, including range constraints, to be added without affecting space and time complexity. On real-life classifiers from Cisco Systems and additional classifiers from ClassBench (with real parameters), 90–95% of rules are thus handled, and the other 5–10% of rules can be stored in TCAM to be processed in parallel.
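The cost of range rules in TCAM can be seen with the standard range-to-prefix expansion, sketched below. This shows the well-known blowup that motivates the work, not the approach proposed in the paper; the function name `range_to_prefixes` is an assumption made for this example.

```python
def range_to_prefixes(lo, hi, width):
    """Cover the integer range [lo, hi] of a width-bit field with ternary prefixes."""
    prefixes = []
    while lo <= hi:
        # Largest aligned block starting at lo (limited by lo's trailing zeros).
        size = lo & -lo if lo > 0 else 1 << width
        # Shrink the block until it fits into what is left of the range.
        while size > hi - lo + 1:
            size >>= 1
        bits = size.bit_length() - 1  # number of wildcard bits in this prefix
        fixed = width - bits          # number of exact bits in this prefix
        head = format(lo >> bits, "0{}b".format(fixed)) if fixed > 0 else ""
        prefixes.append(head + "*" * bits)
        lo += size
    return prefixes


# One 4-bit field constrained to [1, 14] needs six TCAM entries.
one_field = range_to_prefixes(1, 14, 4)

# A rule with the same range in two fields needs the cross product of entries.
two_fields = [(a, b) for a in one_field for b in one_field]
```

Here `one_field` has 6 entries and `two_fields` has 36, so a single rule with ranges in several fields multiplies its entry count per field.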
High-mass-resolution imaging mass spectrometry promises to localize hundreds of metabolites in tissues, cell cultures, and agar plates with cellular resolution, but it is hampered by the lack of bioinformatics tools for automated metabolite identification. We report pySM, a framework for false discovery rate (FDR)-controlled metabolite annotation at the level of the molecular sum formula, for high-mass-resolution imaging mass spectrometry (https://github.com/alexandrovteam/pySM). We introduce a metabolite-signal match score and a target–decoy FDR estimate for spatial metabolomics.
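As a rough illustration of the target–decoy idea, the sketch below estimates FDR at a score threshold as the ratio of decoy hits to target hits. This is the generic textbook form, assuming equally sized target and decoy sets; it is not taken from pySM, whose actual scoring and estimate may differ, and the function name `target_decoy_fdr` and the scores are made up for the example.

```python
def target_decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR as (#decoys passing) / (#targets passing) at a score threshold."""
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    if targets == 0:
        return 0.0  # nothing is annotated, so there are no false discoveries
    return min(1.0, decoys / targets)


# Illustrative scores only (not real data).
target_scores = [0.9, 0.8, 0.7, 0.4]
decoy_scores = [0.75, 0.3, 0.2, 0.1]
fdr = target_decoy_fdr(target_scores, decoy_scores, threshold=0.7)
```

Lowering the threshold admits more annotations but typically lets more decoys through, which raises the estimated FDR.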
A new variant of the method of probability density distribution recovery for solving topic modeling problems is described. Disadvantages of the Gibbs sampling algorithm are considered, and a modified variant, called the “granulated sampling method,” is proposed. Based on the results of statistical modeling, it is shown that the proposed algorithm is characterized by higher stability as compared to other variants of Gibbs sampling.
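For context, the baseline that the abstract above modifies can be sketched as a standard collapsed Gibbs sampler for LDA. This is the ordinary algorithm, not the granulated sampling method, whose details the abstract does not give; the function name `lda_gibbs`, the hyperparameter defaults, and the toy corpus are assumptions.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assign, doc_topic, topic_word


# Toy corpus over a 4-word vocabulary.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
assign, doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=4)
```

Because each token is resampled independently, runs with different seeds can converge to noticeably different topics, which is the kind of instability the abstract refers to.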