Leveraging Topic Specificity and Social Relationships for Expert Finding in Community Question Answering Platforms

Maddalena Amendola (ORCID 0000-0001-6556-4032), IIT-CNR and ISTI-CNR, Pisa, Italy; Andrea Passarella (ORCID 0000-0002-1694-612X), IIT-CNR, Pisa, Italy; and Raffaele Perego (ORCID 0000-0001-7189-4724), ISTI-CNR, Pisa, Italy
Abstract.

Online Community Question Answering (CQA) platforms have become indispensable tools for users seeking expert solutions to their technical queries. The effectiveness of these platforms relies on their ability to identify and direct questions to the most knowledgeable users within the community, a process known as Expert Finding (EF). EF accuracy is crucial for increasing user engagement and the reliability of provided answers. Despite recent advancements in EF methodologies, blending the diverse information sources available on CQA platforms for effective expert identification remains challenging. In this paper, we present TUEF, a Topic-oriented User-Interaction model for Expert Finding, which aims to fully and transparently leverage the heterogeneous information available within online question-answering communities. TUEF integrates content and social data by constructing a multi-layer graph that maps out user relationships based on their answering patterns on specific topics. By combining these sources of information, TUEF identifies the most relevant and knowledgeable users for any given question and ranks them using learning-to-rank techniques. Our findings indicate that TUEF’s topic-oriented model significantly enhances performance, particularly in large communities discussing well-defined topics. Additionally, we show that the interpretable learning-to-rank algorithm integrated into TUEF offers transparency and explainability with minimal performance trade-offs. The exhaustive experiments conducted on six different CQA communities of Stack Exchange show that TUEF outperforms all competitors with a minimum performance boost of 42.42% in P@1, 32.73% in NDCG@3, 21.76% in R@5, and 29.81% in MRR, excelling in both the evaluation approaches present in the previous literature.

Expert Finding, Community Question Answering, Evaluation methodology, Social User Interactions, Learning to Rank, Explainable models

1. Introduction

Online Community Question-Answering (CQA) platforms, such as StackOverflow and AskUbuntu, have become indispensable tools for users seeking expert solutions to their technical queries. These platforms are built upon the collaborative efforts of users who pose questions and provide answers, thus creating an extensive repository of shared knowledge. The success of CQA platforms relies on their ability to effectively identify and direct questions to the most knowledgeable experts within the community, thus engaging the users. This crucial process, known as Expert Finding (EF), is vital for ensuring the accuracy and reliability of the answers provided.

The EF task focuses on identifying users with a high level of expertise who can offer accurate and timely responses to posted questions. It significantly enhances user engagement, trust, and satisfaction by routing questions to the most knowledgeable community members. Despite advancements in EF methodologies, there are still challenges in effectively integrating the diverse information sources available on CQA platforms. Current proposals for addressing the EF task in CQA contexts rely on information derived from the textual content of questions and answers and from features and signals that model the users’ engagement with this content and their interactions within the networked community. The complexity of such interactions and the dynamic nature of expertise necessitate a comprehensive approach that can adapt to varying contexts and user behaviors, which, to the best of our knowledge, has not yet been fully addressed in the literature.

In this paper, we comprehensively address the EF task by proposing TUEF, a Topic-oriented User-Interaction model for Expert Finding, which aims to fully and transparently utilize the vast amount of heterogeneous information available within online question-answering communities. TUEF integrates content and social data by constructing a topic-based multi-layer graph that maps out user relationships based on their topical answering patterns. By combining these sources of information, TUEF aims to identify the most relevant and knowledgeable users for any given question. TUEF operates by first generating the multi-layer graph, where each layer represents a major topic within the community. These topics are automatically identified through the analysis of tags from past questions. Within each layer, nodes represent active users in topic discussions, and edges model the similarities and relationships among these users. When a question is posed, TUEF uses the multi-layer graph and a ranking model to determine the relevant topics and corresponding graph layers. In each relevant layer, TUEF selects candidate experts from two perspectives: (i) Network Perspective: identifying central users who have significant influence within the community; (ii) Content Perspective: identifying users who have previously answered similar questions. Navigating the graph from seed nodes identified according to these criteria, TUEF explores and collects candidate experts through appropriate exploration policies. It then extracts static and query-dependent features based on text and graph relationships for these candidates. Finally, TUEF applies Learning-to-Rank techniques to score and rank the candidates by their expected relevance to the question.

This work builds upon a previous paper (Amendola et al., 2024), extending it with the following main new contributions:

  • We introduce a new categorization of state-of-the-art EF models based on the evaluation approach they adopt: Expert Ranking and Expert Subsample Ranking. Such categorization contributes to the definition of good practices for a fair EF model evaluation.

  • We conduct an ablation study of TUEF across six scientific communities: StackOverflow, Unix, AskUbuntu, Server Fault, Physics, and Mathematics. This analysis allows us to understand the contribution of the various CQA information sources and TUEF components to the end-to-end predictive performance and to validate the generality of TUEF performance.

  • We integrate an interpretable Learning-to-Rank algorithm (Lucchese et al., 2022) into the TUEF framework and provide examples of interpretation, showing how we can make the model fully transparent without losing its accuracy.

  • We compare TUEF with state-of-the-art competitor models within the Expert Ranking category, to which TUEF belongs. Moreover, we adapt TUEF for a fair comparison with state-of-the-art models in the Expert Subsample Ranking category, accommodating this different evaluation configuration. We thus highlight the overall advantage of TUEF concerning relevant alternatives in the state of the art.

  • We study the ability of TUEF to scale with larger datasets by considering four datasets corresponding to 1, 2, 3, and 4 months of StackOverflow data.

Our findings indicate that the multi-layer graph significantly enhances performance when the topics discussed in the community involve a well-defined clustering of question tags, as in StackOverflow. In small communities characterized by a narrower distribution of discussion topics, the single-layer and multi-layer variants of TUEF exhibit comparable performance. As discussed in (Amendola et al., 2024), content information consistently emerges as the most crucial component for EF in all communities. Additionally, we show that integrating an interpretable Learning-to-Rank solution into TUEF typically results in a slight performance decrease. However, in some communities the reduction in prediction performance is minimal, making the interpretable version of TUEF a valuable option for gaining insights into the decision-making process. Finally, the results of extensive experiments conducted on CQA communities with varying characteristics show that TUEF surpasses all baseline models, excelling in both the Expert Ranking and Expert Subsample Ranking evaluation categories.

The article is structured as follows. In Section 2, we present state-of-the-art models, categorized from two points of view: (i) the information used (Text-, Feature-, and Network-based methods) and (ii) the evaluation approach adopted (Expert Ranking and Expert Subsample Ranking). Section 3 details the TUEF framework. In Section 4, we present the experimental setup and detail the communities examined, the settings of TUEF hyperparameters, and the metrics used to assess its performance. Finally, in Section 5, we present the results of the TUEF ablation study across all communities and the comparison between TUEF and the state-of-the-art baselines under the two evaluation-based categories.

2. Related Work

In this section, we categorize state-of-the-art models from two perspectives: the type of information the models use (textual, features, and network) and the evaluation methodology the authors adopt for their assessment (with or without subsampling).

2.1. Information-based classification

Research proposals in the field of the EF task for CQA platforms can be grouped into three broad groups based on the information they explore to address the task: text-based, feature-based, and network-based methods.

Text-based methods

A variety of methods have been proposed to address the EF task by leveraging similarities between current and previously answered questions. Dehghan and Abin (Dehghan and Abin, 2019) introduced a model that clusters question terms based on semantic similarity and co-occurrence. Similarly, Liang et al. (Liang, 2019) proposed a semantic unsupervised generative adversarial network to assess similarities between word representations and experts. Moreover, Dehghan et al. (Dehghan et al., 2019) modeled user expertise by utilizing the tree structure of various domains, tags, and the temporal dimension of user response behavior. In a related vein, Zhang et al. (Zhang et al., 2020) considered the temporal dimension by proposing models with multi-shift and multi-resolution settings to capture temporal dynamics effectively. Fu et al. (Fu et al., 2020) introduced a novel approach using the Recurrent Memory Reasoning Network (RMRN), which employs different reasoning memory cells equipped with attention mechanisms to focus on various aspects of the question. Building on the concept of attention mechanisms, Peng et al. (Peng et al., 2022a, b; Liu et al., 2022b; Peng et al., 2023) implemented these mechanisms in multi-view, multi-grained, semi-supervised pre-trained, and pre-trained models with personalized fine-tuning for expert finding. Moreover, Qian et al. (Qian et al., 2022b) proposed a Multi-Hop Interactive Attention-based Classification Network (MIACN) that utilizes attention mechanisms to detect latent interactions among question subjects and bodies.

Feature-based methods

The second group of methods in EF relies on hand-crafted features to model the expertise of community members. Roy et al. (Roy et al., 2018) formalized a scoring function to capture various aspects of expertise, including the propensity to answer questions related to profile tags, the ability to provide accepted answers, and recent activity levels. Mumtaz et al. (Mumtaz et al., 2019) proposed a framework integrating activity, community, and time-aware features, with the temporal aspect also considered in (Fu, 2019) to track and incorporate the evolution of user roles, as well as in (Kundu et al., 2019). Similarly, Tondulkar et al. (Tondulkar et al., 2018) included features that favor experts providing high-quality answers to complex questions, utilizing a comprehensive set of features capturing user availability and knowledge, and applying Learning-to-Rank (LtR) methods for expert ranking. The quality and consistency of answers are further addressed by Faisal et al. (Faisal et al., 2019), who proposed a model based on an adaptation of the bibliometric g-index. LtR techniques are further utilized by Sorkhani et al. (Sorkhani et al., 2022), who introduced 74 content-based and social-based features. An important aspect in addressing the EF task is considering the intimacy between the asker and answerer, studied by Fu et al. (Fu, 2020). As in text-based approaches, Tang et al. (Tang et al., 2020) employed attention mechanisms by proposing a method using Hierarchical Attentional Factorization Machines (HAFM). This approach combines factorization machines with hierarchical attention mechanisms to effectively model user-expert interactions and to highlight the importance of various features in expert recommendation tasks. The hierarchical attention mechanism helps in prioritizing relevant information at both the user and expert levels, enabling more accurate and personalized recommendations.
Contrary to many studies focusing primarily on identifying the most senior platform users as experts, Roy et al. (Roy and Singh, 2024) aim to predict promising expert users at an early stage in community question answering sites.

Network-based methods

Network-based methods integrate interactions and relationship information within networks. Kundu et al. (Kundu and Mandal, 2019) defined a framework that comprises a text-based component for assessing expert knowledge on specific topics, accompanied by a Competition Based Expertise Network (CBEN) (Aslay et al., 2013). This network uses link analysis techniques, such as AuthorRank (Liu et al., 2005) and Weighted HITS (Li et al., 2002). The CBEN approach is further utilized in (Kundu et al., 2020), where connections between two users are established if they have responded to the same question. Additionally, this study incorporates both intra-profile and inter-profile preferences of community users. In (Kundu et al., 2021b), the same authors propose a topic-sensitive hybrid expertise retrieval system (TSHER) that integrates assessments of knowledge, reputation, and authority. Le et al. (Le and Shah, 2018) measure the similarity between answerer and asker using social network techniques, employing Random Walks with Restart to calculate the proximity between two nodes.

In contrast, Sun et al. (Sun et al., 2018a) propose a language-agnostic perspective of user expertise, constructing a competition graph with user and question nodes where edges represent increasing levels of question difficulty, thus depicting the hierarchical structure of questions. This concept of hierarchy is further explored in (Sun et al., 2019a). Addressing data sparsity, Sang et al. (Sang et al., 2019) introduced a Multi-modal Multi-view Semantic Embedding (MMSE) framework that learns semantic embeddings from both local and global perspectives, incorporating social structure information to enhance embedding quality. Ghasemi et al. (Ghasemi et al., 2021) focus on user embedding, proposing a joint model for text and node similarity.

Moreover, Li et al. (Li et al., 2019) represent the CQA platform as a Heterogeneous Information Network (HIN), aiming to learn embeddings for nodes representing question contents, raisers, and answerers. They employ an LSTM-equipped Metapath-based Embedding algorithm and a CNN scoring function to rank experts. The use of HIN and metapath-based algorithms is also noted in (Qian et al., 2022a). Liu et al. (Liu et al., 2022c) apply advanced semantic analysis and interest drift modeling to evaluate experts based on the relevance and depth of their contributions in specific domains. The concept of interest drift is similarly addressed by Krishna et al. (Krishna and Antulov-Fantulin, 2023), who propose a simple graph diffusion-based expert recommendation model that accounts for semantic and temporal information. Temporal dynamics in expertise assessment are further examined by Costa et al. in (Costa and Ortale, 2023b, a), who introduce temporally-discounted, tag-based models.

2.2. Evaluation-based classification

The EF task is commonly cast as an item recommendation task in which the items are community users, ranked by the likelihood that they can effectively answer the new question. The users with the highest scores are the ones to whom the new question should be routed. When evaluating recommender systems, there are two important methods to consider: online evaluation and offline evaluation. Online, end-to-end evaluation is the primary and most reliable way to evaluate a recommender system (Hofmann et al., 2016; Liang et al., 2018), but it is not applicable during the development of a new model (Dallmann et al., 2021). On the other hand, offline evaluation is the main instrument available for academic research and allows for exploring hyperparameter settings (Cañamares and Castells, 2020; Shani and Gunawardana, 2011). It is commonly used to test a new model’s effectiveness before moving to online evaluation. Moreover, in item recommendation tasks, the catalog of items to retrieve from is usually large, and finding matching items from this large pool is challenging (Krichene and Rendle, 2020a). Considering these factors, recent studies have sped up the evaluation process by calculating the performance metrics of interest on a target set only. This target set is usually a small sample of the items possibly matching the query that includes all relevant items and a defined number of negative (non-relevant) items. However, in recent years, researchers have analyzed and critically evaluated these sampled-metric strategies. Krichene et al. (Krichene and Rendle, 2020a) thoroughly investigated sampled metrics, revealing that they are inconsistent with their exact version. This finding is significant as it means that these sampled metrics do not preserve relative statements, such as recommender A is better than B, not even in expectation. Furthermore, they found that the smaller the sample size, the smaller the differences between metrics. Cañamares et al. (Cañamares and Castells, 2020) found that comparative evaluation using reduced target sets contradicts, in many cases, the corresponding outcome using large targets. Finally, Dallmann et al. (Dallmann et al., 2021) explored two widely used sampling strategies, sampling by popularity and uniform random sampling, and found that both can produce rankings of the models that are inconsistent with the full ranking and that change across different sample sizes.

When considering the EF context, one effective strategy to simplify the task and reduce complexity is identifying expert users based on specific criteria, such as their consistent provision of high-quality answers. However, it’s crucial to acknowledge that this approach can intensify the cold-start problem, potentially leading to the exclusion of users who could be highly relevant to new questions. Despite this, the method significantly accelerates metric computation and enhances system accuracy. So far, different strategies have been adopted to define a subset of users representing the experts: 10% of the most active users (Kundu et al., 2021b; Qian et al., 2022a; Tang et al., 2020; Peng et al., 2022a, b; Liu et al., 2022b; Peng et al., 2023; Roy and Singh, 2024; Li et al., 2019; Kundu and Mandal, 2019; Liu et al., 2022a), all users with the number of accepted answers greater than a threshold (Kundu et al., 2021a; Krishna and Antulov-Fantulin, 2023; Qian et al., 2022b; Sun et al., 2018b, 2019b; Fu et al., 2020; Fu, 2020; Kundu et al., 2020; Fu, 2019), or a two-stage approach that considers the acceptance ratio (Nobari et al., 2020; Fallahnejad and Beigy, 2022; Kasela et al., 2023; Dargahi Nobari et al., 2017; Mumtaz et al., 2019). No specific heuristic can determine whether a user should be considered an expert. However, to conduct a fair comparison among different models for the EF task, it is important to distinguish studies according to the sampling approach adopted during the ranking phase, ensuring that the models are compared under the same assumptions. Specifically, we recognize two different methods:

  • Expert Ranking: For a new question, this group of works essentially ranks all users labeled as expert users in the first stage. The limitation of this approach during the evaluation phase is that it requires the test set to consist of questions for which the best answerer is a user who was previously labeled as an expert. On the other hand, this evaluation methodology provides realistic figures of the end-to-end performance of the observed system. The following works belong to this group: (Nobari et al., 2020; Kundu et al., 2021b; Fallahnejad and Beigy, 2022; Liu et al., 2022c; Costa and Ortale, 2023b; Kundu et al., 2021a; Costa and Ortale, 2023a; Krishna and Antulov-Fantulin, 2023; Qian et al., 2022b; Roy and Singh, 2024; Sun et al., 2019b; Fu, 2019; Fu et al., 2020; Mumtaz et al., 2019; Kundu and Mandal, 2019; Sun et al., 2018b; Fu, 2020; Kundu et al., 2020).

  • Expert Subsample Ranking: Given a new question along with the information of all the users who actually answered it (ground truth), the studies in this group rank only a fixed-size subsample of users. This set of users is generally small and always includes the ground-truth users plus some other users who did not answer the question (negative examples). In most of the works following this approach, the size of the subsample is 20, and negative examples chosen randomly among the 10% most active users are added to the subsample until it reaches the desired cardinality. While this approach overcomes one limitation of the Expert Ranking methodology by requiring only the knowledge of the answers to the questions in the training set, it relies on a sampling strategy that might produce inconsistent rankings if the sample size is varied and can result in unreliable system rankings (Krichene and Rendle, 2020a; Cañamares and Castells, 2020; Dallmann et al., 2021). The works belonging to this group include the following: (Sang et al., 2019; Ghasemi et al., 2021; Qian et al., 2022a; Tang et al., 2020; Sorkhani et al., 2022; Peng et al., 2022a, b; Liu et al., 2022b; Peng et al., 2023; Li et al., 2019; Zhang et al., 2020; Liu et al., 2022a).
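The candidate-set construction used by the subsample-based protocol can be sketched as follows. This is only an illustration of the description above, not code from any of the cited works: the function name `build_subsample`, the input `users_by_activity` (assumed to be a list of user ids sorted by decreasing activity), and the fixed seed are hypothetical choices.

```python
import random

def build_subsample(answerers, users_by_activity, size=20, top_fraction=0.10, seed=0):
    """Return the candidate list for one question: every ground-truth answerer
    plus random negatives drawn from the most active users, up to `size` users.

    answerers: set of user ids who actually answered the question.
    users_by_activity: all user ids, sorted by decreasing activity (assumed input).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    n_top = max(1, int(len(users_by_activity) * top_fraction))
    # Negatives come from the top-activity slice and must not be answerers.
    pool = [u for u in users_by_activity[:n_top] if u not in answerers]
    n_neg = max(0, min(size - len(answerers), len(pool)))
    candidates = sorted(answerers) + rng.sample(pool, n_neg)
    rng.shuffle(candidates)  # avoid leaking the ground truth through position
    return candidates
```

Because the negatives are sampled, metrics computed on such a list depend on `size` and on the random draw, which is the source of the inconsistencies discussed above.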

Figure 1. Illustration of the TUEF approach highlighting the distinct components. At inference time, TUEF first determines the main topics to which the question $q$ belongs and the corresponding graph layers (Multi-Layer Graph). Next, for each layer, it selects the candidate experts from two perspectives: i) Network, by identifying central users that may have considerable influence within the community; ii) Content, by identifying users who previously answered questions similar to $q$. The Multi-Layer Graph is used to collect candidate experts through Random Walks (Expert Selection). Then, TUEF extracts features based on text, tags, and graph relationships for each selected expert (Feature Extraction). Finally, TUEF uses a learned, precision-oriented model to score the candidates and rank them by expected relevance (Experts Ranking).

3. Topic-oriented User-interaction model for Expert Finding (TUEF)

The TUEF framework leverages topic-oriented content similarity and user relationships to facilitate the selection and ranking of experts. This approach is based on two fundamental considerations:

  • CQA platform users utilize tags to characterize their questions and increase the likelihood of receiving relevant answers. The questions’ tags help identify topical areas within the community relevant to specific questions.

  • Unlike traditional social networks, user relationships on CQA platforms are not explicitly defined. However, implicit relationships can be discerned by analyzing users’ question-answering behaviors. The knowledge gained from these interactions can enhance answer quality and user engagement with the platform.

Figure 1 illustrates the logical organization of TUEF. To effectively integrate the valuable information derived from tags, content, and user relationships, TUEF adopts a multi-layer graph representation (Section 3.1) that encapsulates the macro topics discussed within the online community. Each layer of this graph models user relationships at the level of specific macro topics, considering their similarities based on tags, questions, and answering behaviors. TUEF identifies community experts—users known for consistently providing accepted answers—and implements an exploratory algorithm that fully exploits the multi-layer graph structure to select a group of candidate experts for each relevant macro topic associated with the current question (Section 3.2).

This exploratory algorithm integrates social and content-based information, considering experts’ centrality within the network and their expertise in the topics of interest. Central users often represent individuals who have garnered significant attention and engagement within the community. Similarly, users close to the query are more likely to connect with experts relevant to the specific topic of interest. By strategically exploring the multi-layer graph starting from these nodes, the algorithm increases the probability of selecting appropriate experts early in the process.

Finally, TUEF extracts features that represent the identified candidate experts and applies Learning-to-Rank techniques (Liu, 2009) (Section 3.3) to rank them based on their expertise and likelihood of effectively answering the question. This systematic approach ensures the selection of high-quality experts tailored to the specific information needs of the user asking the question.

3.1. User Interaction Model

In TUEF, user relationships related to each specific topic are independently modeled. From here on, let $U$ denote the set of active users on the CQA platform under consideration, and let $Q$ represent the set of past questions posted and answered by users in $U$.

Topic Identification.

TUEF employs a tag clustering approach to identify the main topics discussed on the CQA platform based on past questions. The clustering approach, discussed in (Hoffa, 2019), utilizes the k-means algorithm (MacQueen et al., 1967) on a tag co-occurrence matrix $M$. Tags’ co-occurrence patterns are leveraged to group semantically related tags, as tags often associated with the same questions tend to be conceptually similar (Giannakidou et al., 2008). For each question $q \in Q$, the tags associated with $q$ are denoted as $tags(q)$. Let $T$ be the set of all unique tags across questions in $Q$, and let $F$ represent the set of the most frequent tags (top $\lambda$) that serve as clustering features. The co-occurrence matrix $M^{|T| \times |F|}$ is constructed as follows:

(1) $m_{i,j} = |\{q \in Q \;|\; \{t_i, f_j\} \subseteq tags(q),\; t_i \in T,\; f_j \in F\}|$

Here, $m_{i,j}$ indicates the number of questions in $Q$ where the $i_{th}$ tag and the $j_{th}$ feature co-occur. Finally, the matrix rows are normalized to express the relative frequency of tag co-occurrences with each feature. For a given tag $t_i$ and feature $f_j$, the normalized matrix $\hat{M}$ is defined as:

(2) $\hat{m}_{i,j} = \frac{m_{i,j}}{\sum_{n=1}^{|F|} m_{i,n}}$
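A minimal Python sketch of Equations (1) and (2) may help make the construction concrete. It is an illustration under stated assumptions rather than the authors' implementation: the function name `cooccurrence_matrix`, the alphabetical tie-breaking among equally frequent tags, and the all-zero row for a tag that never co-occurs with any feature are choices made here for the example.

```python
from collections import Counter

def cooccurrence_matrix(question_tags, top_lambda):
    """Build the row-normalized tag/feature co-occurrence matrix.

    question_tags: list of tag sets, one per question, i.e. tags(q).
    top_lambda: number of most frequent tags used as features (lambda).
    Returns (tags, features, m_hat) where m_hat[i][j] relates tag i to feature j.
    """
    freq = Counter(t for tags in question_tags for t in set(tags))
    # Most frequent tags first; ties broken alphabetically (assumption).
    ranked = sorted(freq, key=lambda t: (-freq[t], t))
    features = ranked[:top_lambda]
    tags = sorted(freq)

    t_idx = {t: i for i, t in enumerate(tags)}
    f_idx = {f: j for j, f in enumerate(features)}

    # Eq. (1): count questions whose tag set contains both t_i and f_j.
    # Read literally, when t_i == f_j this counts the tag's own frequency.
    m = [[0] * len(features) for _ in tags]
    for q in question_tags:
        q = set(q)
        for t in q:
            for f in q:
                if f in f_idx:
                    m[t_idx[t]][f_idx[f]] += 1

    # Eq. (2): divide each row by its sum; all-zero rows stay zero (assumption).
    m_hat = []
    for row in m:
        total = sum(row)
        m_hat.append([v / total if total else 0.0 for v in row])
    return tags, features, m_hat


questions = [{"python", "pandas"}, {"python", "numpy"}, {"java", "spring"}, {"python"}]
tags, features, m_hat = cooccurrence_matrix(questions, top_lambda=2)
# features == ['python', 'java']
```

In this toy input, the specific tags `pandas` and `numpy` end up with identical rows because both co-occur only with the broad tag `python`, which is the property the clustering step relies on.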

Clustering the tag co-occurrence matrix to devise macro topics is based on the observation that CQA users often pair broad, common tags with more specific ones to achieve greater precision in topic identification. By considering the co-occurrences of specific tags with broader tags, we can enhance the categorization of questions. Using the normalized matrix $\hat{M}$ and a specified value of $k$ for the k-means algorithm, the clustering process yields $k$ disjoint clusters of tags representing primary community domains. The optimal $k$ value is determined based on silhouette maximization criteria (Rousseeuw, 1987), as described in Section 4. This comprehensive approach to topic identification and clustering ensures that TUEF effectively captures and represents the main topical areas discussed within the online community, facilitating a good understanding of community dynamics and interactions.
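The selection of $k$ by silhouette maximization can be sketched in pure Python as follows. This is a didactic sketch, not the authors' pipeline: the farthest-point initialization, the fixed number of Lloyd iterations, and the helper names `kmeans`, `silhouette`, and `best_k` are assumptions introduced for the example, and a production setting would more likely rely on an established clustering library.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for idx, l in enumerate(labels):
        clusters.setdefault(l, []).append(idx)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treated as worst case here
    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(p, points[j]) for j in other) / len(other)
            for l, other in clusters.items() if l != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

def best_k(points, candidates):
    """Return the candidate k with the highest mean silhouette, plus all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores
```

Here `points` would be the rows of the normalized matrix $\hat{M}$, one per tag, and `candidates` the range of $k$ values to explore.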

Multi-Layer Graph.

TUEF models user relationships within each layer by treating users $U$ as nodes in a multi-layer graph and establishing connections based on similar patterns of providing accepted answers within specific layers. Formally, the structure of TUEF is defined using a multi-layer graph $G=[L_1,\ldots,L_k]$, where each layer $L_i$ corresponds to a tag cluster. Each layer $L_i=(V_i,E_i)$ is an independent graph, with nodes $V_i$ representing users associated with the layer and edges $E_i$ denoting their relationships. To ensure consistent and accurate answers within a specific layer, TUEF includes in $V_i$ only those users who have provided a number of accepted answers equal to or exceeding a given $\epsilon$ percentile. Moreover, to characterize users’ knowledge in $V_i$, each user $u \in V_i$ is assigned a topic vector $b^i_u$. This vector represents the user’s contribution to various tags within the layer, normalized by their total contribution across all layers.
Specifically, for a tag $t_j$ within $L_i$, the $j$-th position of $b^i_u$ is calculated as:

$$b^i_u[j]=\frac{accepted(L_i,u,t_j)}{\sum_{\forall z,m} accepted(L_z,u,t_m)} \tag{3}$$

Here, $accepted(L_i,u,t_j)$ denotes the number of accepted answers provided by user $u$ for questions labeled with tag $t_j$ in layer $L_i$. Finally, after computing the topic vectors for all the users in $V_i$, the cosine similarity is calculated between each pair of users within the layer. If the similarity between two users $u_a$ and $u_b$ exceeds a predefined threshold $\delta$, an edge $(u_a,u_b)$ weighted by the cosine similarity value is added to $E_i$.
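The construction of one layer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-based inputs, and the toy users are all hypothetical, and each user's total number of accepted answers across all layers is assumed to be given and positive.

```python
import math
from itertools import combinations


def topic_vector(accepted_in_layer, total_accepted, layer_tags):
    """Eq. (3): accepted answers per tag of this layer, divided by the
    user's accepted answers summed over all layers (assumed > 0)."""
    return [accepted_in_layer.get(tag, 0) / total_accepted for tag in layer_tags]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_layer_edges(vectors, delta):
    """Return {(u_a, u_b): weight} for pairs whose similarity exceeds delta."""
    edges = {}
    for u_a, u_b in combinations(sorted(vectors), 2):
        sim = cosine(vectors[u_a], vectors[u_b])
        if sim > delta:
            edges[(u_a, u_b)] = sim
    return edges


# Toy layer with three tags and three hypothetical users.
layer_tags = ["python", "pandas", "numpy"]
accepted = {
    "alice": {"python": 6, "pandas": 2},
    "bob": {"python": 3, "pandas": 1},
    "carol": {"numpy": 4},
}
totals = {"alice": 10, "bob": 4, "carol": 8}  # accepted answers over all layers
vectors = {u: topic_vector(accepted[u], totals[u], layer_tags) for u in accepted}
edges = build_layer_edges(vectors, delta=0.5)
```

In this toy layer, alice and bob have proportional topic vectors and are therefore linked, while carol, who only answers on a different tag, remains isolated.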

It is crucial to emphasize that each question in TUEF is associated with tags assigned to specific layers within $G$. Consequently, a question can belong to multiple layers, and users answering these questions are represented across all associated layers. This multidimensional representation of user expertise and relationships across layers enables TUEF to capture a comprehensive view of users’ knowledge and interactions within the CQA platform. This approach ensures that users’ expertise and social connections transcend individual layers, reflecting the nature of community participation and expertise within the platform.

3.2. Expert Selection

The Expert Selection component, depicted in Algorithm 1, selects, for each new question $q$ submitted to the CQA platform, a set of users who are likely to be highly knowledgeable about the topics of $q$, i.e., the set of candidate experts. The component operates iteratively on each layer to which question $q$ belongs. The selection process is structured into three phases: (i) the Sorting phase, where the nodes within each layer are arranged in a specified order to facilitate subsequent expert selection (line 2); (ii) the Collection phase, which selects an initial set of candidate experts from the sorted node lists (line 4); and (iii) the Exploratory phase, which expands the initial set by exploring the graph structure to identify additional potential experts (line 6). The remainder of this section first describes the Expert Identification process, which classifies users $u \in U$ as either experts or non-experts, and then details the three phases mentioned above.

Input: the layer $L_i$, the question $q$, the method $m$ (Network or Content), the probability $p$
Output: the set $D_i$ of candidate experts for $q$ in $L_i$
1 // sort $L_i$ nodes based on the method criteria
2 $sorted\_nodes \leftarrow sort\_nodes(L_i, q, m)$
3 // select the initial set of experts based on probability
4 $initial\_set \leftarrow reach\_probability(sorted\_nodes, p)$
5 // expand the initial set by exploring the graph
6 $D_i \leftarrow explore\_graph(L_i, initial\_set)$
return $D_i$
Algorithm 1 Expert Selection

Expert Identification

Experts in CQA platforms can be identified heuristically by considering the number of accepted answers they provided in the past. Another signal of user trust is the acceptance ratio $r_u$, i.e., the ratio between the number of accepted answers and the total number of answers a given user provides. As in (Dargahi Nobari et al., 2017), TUEF adopts an expert selection criterion based on these two measures. First, we select the set $C \subseteq U$ of candidate experts, comprising all users with a number of accepted answers greater than or equal to a specified threshold $\beta$. Next, we label as experts the set $E \subseteq C$ of users whose acceptance ratio $r_u$ is greater than the average acceptance ratio $\overline{r}$ computed over $C$:

$$E=\{u\in C \mid r_u>\overline{r}\},\quad \overline{r}=\frac{\sum_{u\in C} r_u}{|C|} \tag{4}$$

Each layer within the multi-layer graph comprises nodes labeled as experts and non-experts. While the former are the primary targets of the Expert Selection process, the latter facilitate a comprehensive exploration of the graph structure.

By implementing this process, we aim to identify users who consistently provide valuable and accepted solutions within the community, as they are likely to be proficient candidates for effectively addressing new questions posted on the platform.
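The two-step labeling rule of Eq. (4) can be illustrated with a short sketch. The function name, the input layout (a dictionary mapping each user to a pair of counts), and the toy values are assumptions made for illustration only.

```python
def identify_experts(stats, beta):
    """Label experts following Eq. (4).

    stats: {user: (accepted_answers, total_answers)} (hypothetical layout).
    Returns (C, E): candidates with at least `beta` accepted answers, and
    the candidates whose acceptance ratio is above the mean ratio over C.
    """
    ratios = {
        user: accepted / total
        for user, (accepted, total) in stats.items()
        if accepted >= beta and total > 0
    }
    if not ratios:
        return set(), set()
    mean_ratio = sum(ratios.values()) / len(ratios)
    experts = {user for user, ratio in ratios.items() if ratio > mean_ratio}
    return set(ratios), experts


stats = {"alice": (30, 40), "bob": (20, 50), "carol": (12, 15), "dave": (2, 2)}
candidates, experts = identify_experts(stats, beta=10)
```

Here dave has a perfect acceptance ratio but too few accepted answers to enter $C$, while bob enters $C$ but falls below the mean ratio and is therefore kept as a non-expert node.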

Sorting

TUEF identifies candidate experts for a given question $q$ by leveraging both content and social information. This process involves two perspectives: the Network-based perspective, focusing on users’ centrality within the network, and the Content-based perspective, assessing users’ relevance to the newly posted query based on their past interactions. In the Network-based approach, TUEF utilizes Betweenness centrality (Freeman, 1977), which quantifies the centrality of nodes within a graph by evaluating their influence over information flow. Nodes with higher Betweenness centrality are considered more central within the network. Each node $v_j^i$ in layer $L_i$ is assigned a Betweenness score $s_j^i$, which is used to sort nodes in descending order. The Content-based approach, on the other hand, sorts layer nodes according to their similarity to the query $q$ based on questions previously answered by experts. This is achieved using Information Retrieval techniques and pre-built indexes, specifically the TextIndex and TagIndex, which index the text and associated tags of historical questions, respectively. When a new question is presented with its tags, the Content-based method employs a retrieval model (we specifically use BM25) independently on both indexes to retrieve a sorted list of relevant questions, along with information about the experts who provided accepted answers. The query for the TextIndex is the concatenation of the question title and body, while the query for the TagIndex is the concatenation of the question tags.
Subsequently, the lists retrieved from each index are merged alternately, preserving the original order of elements.

This approach results in a query-specific ordering of nodes within each layer $L_i$, effectively leveraging network centrality and content relevance to identify and rank candidate experts tailored to the context of the new query $q$. The combination of these perspectives enhances the precision and relevance of expert selection within the TUEF framework.
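The alternate merge of the two retrieved lists used by the Content-based method can be sketched as below. Two details are assumptions of this sketch rather than statements from the text: the TextIndex list is consumed first, and an expert appearing in both lists is kept only at its first occurrence.

```python
from itertools import zip_longest


def merge_alternately(text_hits, tag_hits):
    """Interleave two ranked lists of expert IDs, preserving each list's order.

    Assumptions of this sketch: text_hits goes first, and duplicates are
    dropped after their first occurrence.
    """
    merged, seen = [], set()
    for pair in zip_longest(text_hits, tag_hits):
        for user in pair:
            if user is not None and user not in seen:
                seen.add(user)
                merged.append(user)
    return merged


merged = merge_alternately(["u1", "u2", "u3"], ["u2", "u4"])
```

Using `zip_longest` lets the longer list continue once the shorter one is exhausted, so no retrieved expert is lost.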

Candidate Collection

In the candidate collection phase, the objective is to select a subset of experts $D_i \subseteq V_i$ from each layer $L_i$. This process balances the need for a sufficiently large candidate pool to ensure high recall (the probability of obtaining the correct answer) with the desire to include only relevant experts for high precision. To achieve this balance, we estimate the probability $p$ of not receiving an answer from any user in $D_i$. Initially, $p$ is set to 1 when $D_i$ is empty and is incrementally reduced as experts are added to $D_i$. Each expert $u$ added to $D_i$ lowers $p$ according to their acceptance ratio $\mu_u$, defined as the ratio of accepted answers to total answers provided by the user. Furthermore, to refine the modeling of topic-based expertise, we smooth $\mu_u$ by considering the user’s activity within the specific layer: the smoothing factor is computed by adjusting $\mu_u$ based on the ratio of expert answers within the layer to the maximum number of answers provided by any user within that layer. The probability $p$ is updated iteratively using the formula:

$$p = p\cdot(1-\mu_u) \tag{5}$$

This iterative process continues until $p$ becomes less than or equal to a predefined threshold $\alpha$. Once the threshold is met, no further candidates are added to $D_i$, and the exploratory phase of the expert selection process begins. This ensures that the candidate pool $D_i$ comprises experts likely to provide valuable and accurate answers to the specific question while representing good starting points for the exploration of the multi-layer graph (MLG).
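The stopping rule of the collection phase can be sketched in a few lines. The sketch assumes that the smoothed acceptance ratios $\mu_u$ are already available in a dictionary; the smoothing step itself is not reproduced, and the function name and toy values are hypothetical.

```python
def collect_candidates(sorted_experts, mu, alpha):
    """Add experts in sorted order until p <= alpha (Eq. 5).

    p is the estimated probability that no selected expert answers.
    mu: {user: smoothed acceptance ratio}, assumed precomputed.
    """
    p = 1.0
    selected = []
    for user in sorted_experts:
        if p <= alpha:
            break
        selected.append(user)
        p *= 1.0 - mu[user]
    return selected, p


mu = {"u1": 0.9, "u2": 0.8, "u3": 0.5, "u4": 0.7}
selected, p = collect_candidates(["u1", "u2", "u3", "u4"], mu, alpha=0.05)
```

With these toy values, adding u1 lowers $p$ to 0.1 and adding u2 lowers it to about 0.02, which is below the threshold, so the collection stops with two candidates.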

Exploratory Phase

In the Exploratory phase, TUEF explores the graph structure to identify additional candidate experts that the previous phases may have missed, exploiting the implicit relationships between users to enhance recall. The starting point for this exploration is the set of experts $D_i$ from each layer $L_i$. Specifically, for each node $v_i \in D_i$, TUEF initiates a series of fixed-length Random Walks. At each step of the Random Walk, the next node to visit is selected randomly based on a probability distribution $d$, computed considering the neighbors of the current node $v_i$ and their link weights, representing the similarities with neighboring nodes.

More formally, let $\mathcal{N}_i=\{v_{i1},\ldots,v_{iN_i}\}$ denote the neighbors of expert node $v_i$, where $N_i=|\mathcal{N}_i|$ is the total number of neighbors of $v_i$. The probability $d_j$ of visiting a neighbor node $v_{ij}$ is calculated as:

$$d_j=\frac{w_{ij}}{\sum_{z=1}^{N_i} w_{iz}} \tag{6}$$

Here, $w_{ij}$ represents the weight of the edge between node $v_i$ and its neighbor $v_{ij}$. Nodes with stronger connections to the current node $v_i$ are more likely to be selected during the Random Walks. Whenever an expert node is encountered during the exploration, it is added to the current set of candidate experts if not already included.

This exploratory process is carried out independently for each layer to which the question belongs, ensuring a comprehensive exploration of both Network-based and Content-based perspectives within each layer. Combining Random Walks with tailored probability distributions enriches the candidate pool $D_i$ with potentially neglected experts, enhancing the overall recall and effectiveness of the expert selection process in the TUEF framework.
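A weighted Random Walk following Eq. (6) can be sketched with the standard library alone. The adjacency layout (a dictionary of neighbor-to-weight dictionaries), the function names, and the fixed seed are assumptions of this sketch, not details taken from the TUEF code base.

```python
import random


def random_walk(adjacency, start, steps, rng):
    """Walk at most `steps` hops from `start`.

    The next node is drawn with probability w_ij / sum_z w_iz (Eq. 6).
    """
    path = [start]
    current = start
    for _ in range(steps):
        neighbors = adjacency.get(current, {})
        if not neighbors:
            break  # isolated node: the walk cannot continue
        nodes = list(neighbors)
        weights = [neighbors[node] for node in nodes]
        current = rng.choices(nodes, weights=weights, k=1)[0]
        path.append(current)
    return path


def explore(adjacency, initial_set, experts, walks, steps, seed=0):
    """Extend the initial candidates with expert nodes met during the walks."""
    rng = random.Random(seed)
    candidates = list(initial_set)
    for start in initial_set:
        for _ in range(walks):
            for node in random_walk(adjacency, start, steps, rng):
                if node in experts and node not in candidates:
                    candidates.append(node)
    return candidates


# Toy layer: edge weights play the role of cosine similarities.
adjacency = {
    "a": {"b": 0.9, "c": 0.6},
    "b": {"a": 0.9, "c": 0.7},
    "c": {"a": 0.6, "b": 0.7, "d": 0.8},
    "d": {"c": 0.8},
}
candidates = explore(adjacency, ["a"], experts={"a", "c", "d"}, walks=5, steps=10)
```

Non-expert nodes such as b are traversed but never added to the candidate set, which mirrors their role as connectors during the exploration.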

3.3. Ranking candidate experts

TUEF leverages Learning to Rank (LtR) algorithms (Liu, 2009) to learn an effective ranking function from training data. LtR methods exploit a labeled dataset to learn a scoring function $\sigma$ that approximates the ideal ranking function inherent in the training examples. TUEF selects a subset of past questions, used to model user relationships as discussed in Section 3.1, to serve as the training set $T \subset Q$ for the LtR algorithm. Each query $q$ in the training set $T$ is associated with a set of candidate experts $CE \subset U$. Moreover, for each query-candidate pair $(q,u_i)$, where $q \in T$ and $u_i \in CE$, there exists a relevance judgment $l_i$ that indicates whether $u_i$ is an expert for query $q$. Each query-candidate pair $(q,u_i)$ is represented by a feature vector $x$ that encapsulates information about the query, the candidate expert, and their relationship. The LtR algorithm learns a function $\sigma(x)$ that predicts a relevance score for the input feature vector $x$. This learned function $\sigma(x)$ is subsequently used during inference to compute scores for candidate experts and rank them accordingly.

The features used to model the query and candidate expert can be categorized into two groups: Static features and Query-dependent features. The group of Static features comprises those that remain the same for each query: the Reputation of the expert, the number of Answers and AcceptedAnswers, the Ratio (i.e., the ratio of answers to accepted answers), and AvgActivity and StdActivity, representing the average and standard deviation of time intervals between consecutive answers provided by the expert. The Query-dependent features, instead, are computed every time for each query and they include:

  • LayerCount: Number of distinct graph layers in which the expert is selected during the Expert Selection process.

  • QueryKnowledge: Ratio of answers to accepted answers provided by the expert in layers relevant to the query $q$.

  • VisitCountContent and VisitCountNetwork: Total number of times an expert is encountered in the Collection and Exploratory phases using Content-based and Network-based techniques.

  • StepsContent and StepsNetwork: Number of steps required to discover the expert during Collection or Exploratory phases.

  • BetweennessPos and BetweennessScore: Expert’s rank in the list of users ordered by Betweenness score and the Betweenness score itself.

  • ScoreIndexTag and ScoreIndexText: Sum of BM25 scores of historical questions answered by the expert in the TagIndex and TextIndex, respectively.

  • FrequencyIndexTag and FrequencyIndexText: Number of distinct questions answered by the expert returned by the respective indexes.

  • Eigenvector, PageRank, Closeness, Degree, AvgWeights: Centrality measures and network features computed based on the graph layers relevant to the query.

Specific rules are applied to aggregate feature values for candidate experts selected in multiple layers. Network-based features (e.g., BetweennessScore, Closeness, PageRank) consider maximum values, while Content-based features (e.g., FrequencyIndex, ScoreIndex, QueryKnowledge) utilize summation. Additionally, certain features like StepsContent, StepsNetwork, and BetweennessPos use minimum values to characterize the candidate experts effectively within the ranking framework. These features collectively contribute to the comprehensive ranking process in TUEF, accommodating both static and query-specific attributes of candidate experts for optimal ranking performance.
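The aggregation rules for experts selected in several layers can be sketched as a lookup table of reducers. The table below covers only the features named explicitly in the rules above (with ScoreIndexTag and FrequencyIndexTag standing in for their Text counterparts); the input layout and the function name are assumptions of this sketch.

```python
# Reducer per feature, following the stated rules:
# network features -> max, content features -> sum, positions/steps -> min.
AGGREGATORS = {
    "BetweennessScore": max,
    "Closeness": max,
    "PageRank": max,
    "ScoreIndexTag": sum,
    "FrequencyIndexTag": sum,
    "QueryKnowledge": sum,
    "StepsNetwork": min,
    "StepsContent": min,
    "BetweennessPos": min,
}


def aggregate_features(per_layer):
    """Merge per-layer feature dicts of one candidate into a single dict.

    per_layer: one {feature: value} dict for each layer in which the
    expert was selected (hypothetical layout).
    """
    aggregated = {"LayerCount": len(per_layer)}
    for name, reducer in AGGREGATORS.items():
        values = [layer[name] for layer in per_layer if name in layer]
        if values:
            aggregated[name] = reducer(values)
    return aggregated


per_layer = [
    {"BetweennessScore": 0.12, "ScoreIndexTag": 14.0, "StepsNetwork": 3, "BetweennessPos": 7},
    {"BetweennessScore": 0.30, "ScoreIndexTag": 6.5, "StepsNetwork": 1, "BetweennessPos": 2},
]
features = aggregate_features(per_layer)
```

Keeping the rules in a single table makes it easy to check that every query-dependent feature has exactly one aggregation policy.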

4. Experimental Setup

The following section presents the experimental setup and details the datasets used, the settings of the TUEF hyperparameters, and the metrics used to assess performance.

4.1. Datasets

We selected six real-world datasets from the most extensive communities on the StackExchange platform (https://stackexchange.com/), namely StackOverflow, Unix, AskUbuntu, ServerFault, Physics, and Mathematics. These datasets are publicly accessible (https://archive.org/details/stackexchange) and encompass all the questions and answers posted by StackExchange users within their respective communities. Each question has a title, a body, and a list of tags. Moreover, the dataset for each community includes all the posted answers for each question. Besides question and answer content, we know the positive or negative votes received by a question or an answer, the number of views, the number of users that selected a given question as a favorite one, and the comments that other users might have written under a question or an answer. Questions can be categorized as closed, indicating the presence of an accepted answer, or open, which may have no answers. Information about the user who posted a question or an answer is available. Since an answer can be labeled as accepted only by the user who posted the originating question, the authors of accepted answers provide the human-assessed ground truth for the EF task. Additionally, each participating user in the community has associated additional information, such as a Reputation score, which is also useful for the EF task.

4.2. Parameters setting

We pre-processed the data to ensure a good set of questions and answers. Moreover, the different steps of TUEF require setting some parameters detailed below.

Pre-processing.

We applied established data cleaning procedures, as outlined in (Mumtaz et al., 2019), to ensure the quality of our dataset. We opted for an eight-year timespan for most communities, excluding StackOverflow and Mathematics. We use a shorter timeframe for these two communities because of their high daily question volumes; this keeps the dataset size roughly equivalent across communities. We removed questions and answers lacking a specified question ID or OwnerUserID (i.e., the ID of the user who posted the question). Subsequently, we retained questions with an AcceptedAnswerID (i.e., closed questions) and answers with a valid ParentID (i.e., the respective question ID). Questions where the asker and the best answerer were the same user were also excluded. We split the dataset, allocating 80% for training/validation and the remaining 20% for the test set. Notably, we maintained the chronological order of the questions to preserve temporal integrity. Comprehensive statistics for the resulting datasets are provided in Table 1.
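A chronological 80/20 split of this kind can be sketched as follows. The `creation_date` field name and the ISO-formatted date strings are assumptions made for illustration; the point is only that the split is taken after sorting by time, so every test question is more recent than every training question.

```python
def chronological_split(questions, train_fraction=0.8):
    """Sort questions by time and cut once, oldest part first.

    `creation_date` is a hypothetical field holding ISO-formatted dates,
    which sort correctly as plain strings.
    """
    ordered = sorted(questions, key=lambda q: q["creation_date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


questions = [
    {"id": i, "creation_date": f"2023-01-{day:02d}"}
    for i, day in enumerate([5, 1, 9, 3, 7, 2, 10, 4, 8, 6])
]
train, test = chronological_split(questions)
```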

User Interaction Model.

To identify the primary topics discussed in the community, we employed the clustering technique outlined in Section 3.1, focusing on the tags associated with questions in the training set. Considering the top $\lambda=10$ most frequent tags as features, we determined the optimal number of clusters within the range $K=[2,10]$ that maximizes the Silhouette score. Each cluster represents a specific macro-area, corresponding to a layer in the MLG, which captures interactions among users who have provided a number of accepted answers equal to or exceeding the $\epsilon=90$th percentile. Users are represented by their topic vectors used to compute the pair-wise cosine similarities (see Section 3.1). When modeling relationships, we retained edges with a similarity equal to or greater than $\delta=0.5$.

Expert Selection.

We identify as experts the users with a number of accepted answers greater than or equal to the $\omega=95$th percentile of the distribution, following the procedure detailed in Section 3.2. The minimum number of accepted answers and the total number of users labeled as experts for each dataset are outlined in Table 1 under the columns MinAccAns and Experts, respectively. During expert selection, TagIndex and TextIndex each return the 1,000 most similar past questions for each query. For both Network and Content methods, after node sorting, the collection phase identifies the initial set of experts $D$ that reaches a probability threshold of $p=0.001$, representing the probability of not receiving an answer. Subsequently, starting from each selected expert in $D$, we conduct 5 Random Walks of, at most, 10 steps.

Ranking.

We utilize the LightGBM (https://lightgbm.readthedocs.io/en/stable/) (Ke et al., 2017) implementation of LambdaMART (Burges, 2010) to learn the TUEF ranking model. The LtR training set is constructed using queries from the training dataset where the accepted answerer is labeled as an expert, including $0<\zeta\leq 50{,}000$ queries ordered from the most recent to the oldest. Note that $\zeta$ is a TUEF parameter: if the dataset contains more than 50,000 queries answered by experts, TUEF takes the latest 50,000; otherwise, it uses all the available queries. After performing the MLG exploration for each query of the LtR training set, we excluded all the queries for which TUEF could not include in the candidate set the expert who provided the accepted answer. For the remaining queries, we extracted features for each query-candidate expert pair, as outlined in Section 3.3. The training set was split into training and validation sets following an 80/20 split. Hyper-parameter tuning is carried out using MRR on the validation set, leveraging the HyperOpt library (Bergstra et al., 2013), and optimizing five learning parameters: $learning\_rate \in [0.0001, 0.15]$, $num\_leaves \in [50, 200]$, $n\_estimators \in [50, 150]$, $max\_depth \in [8, 15]$, and $min\_data\_in\_leaf \in [150, 500]$.

Table 1 shows the number of queries used for the LtR and the average length of expert lists to rank under the columns LtR and AvgList, respectively.

Table 1. Statistics of the StackExchange communities used for the experiments.

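| Community | Time Period (start) | Time Period (end) | Tags | Clusters | Silhouette | Train | Test | MinAccAns | Experts | LtR | AvgList |
|---|---|---|---|---|---|---|---|---|---|---|---|
| StackOverflow | 2020-12-01 | 2021-01-01 | 5365 | 10 | 0.805 | 39581 | 9521 | 11 | 350 | 8008 | 111 |
| Unix | 2015-01-01 | 2023-09-04 | 1876 | 6 | 0.463 | 51995 | 12985 | 21 | 183 | 22098 | 118 |
| AskUbuntu | 2015-01-01 | 2023-09-04 | 2021 | 9 | 0.391 | 43469 | 10166 | 9 | 265 | 16918 | 129 |
| Server Fault | 2015-01-01 | 2023-09-04 | 2137 | 10 | 0.419 | 30797 | 7398 | 15 | 180 | 10041 | 102 |
| Physics | 2015-01-01 | 2023-09-04 | 811 | 8 | 0.406 | 53412 | 13860 | 31 | 146 | 19192 | 102 |
| Mathematics | 2022-01-01 | 2023-09-04 | 1351 | 9 | 0.499 | 46190 | 11649 | 41 | 141 | 15128 | 85 |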

4.3. Evaluation metrics

We use Precision@1 (P@1), Normalized Discounted Cumulative Gain @3 (NDCG@3), Mean Reciprocal Rank (MRR), and Recall@5 (R@5) as our evaluation metrics. The cutoffs considered are short, as finding the relevant results at the top of the ranked lists for the EF task is essential. P@1 assumes a pivotal role in the EF task, emphasizing the primary objective of identifying the most suitable answerer for a given query. NDCG@3 and MRR metrics serve as valuable tools when models exhibit identical P@1 scores, offering a detailed comparison by considering the actual position of the best answerer in the list. R@5, in turn, measures the model’s efficacy in locating the best answerer within the top five positions. Performance metrics and statistical significance tests are computed using the RanX Library (Bassani, 2022).
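Since each query has a single relevant user (the accepted answerer), the four metrics reduce to simple functions of that user's rank. The sketch below is a plain-Python illustration under this single-relevant-item assumption; it is not the RanX implementation used for the reported results, and the function names and toy runs are hypothetical.

```python
import math


def rank_of(ranking, relevant):
    """1-based position of the accepted answerer, or None if absent."""
    return ranking.index(relevant) + 1 if relevant in ranking else None


def evaluate(runs):
    """runs: non-empty list of (ranked_users, accepted_answerer) pairs."""
    totals = {"P@1": 0.0, "NDCG@3": 0.0, "R@5": 0.0, "MRR": 0.0}
    for ranking, relevant in runs:
        rank = rank_of(ranking, relevant)
        if rank is None:
            continue  # a missed answerer contributes 0 to every metric
        totals["P@1"] += 1.0 if rank == 1 else 0.0
        # With one relevant item the ideal DCG is 1, so NDCG equals DCG.
        totals["NDCG@3"] += 1.0 / math.log2(rank + 1) if rank <= 3 else 0.0
        totals["R@5"] += 1.0 if rank <= 5 else 0.0
        totals["MRR"] += 1.0 / rank
    return {name: value / len(runs) for name, value in totals.items()}


runs = [
    (["u1", "u2", "u3"], "u1"),                    # rank 1
    (["u4", "u5", "u6", "u7"], "u5"),              # rank 2
    (["u1", "u2", "u3", "u4", "u5", "u6"], "u6"),  # rank 6
    (["u1", "u2"], "u9"),                          # not retrieved
]
scores = evaluate(runs)
```

The second query shows why NDCG@3 and MRR complement P@1: the answerer at rank 2 earns no P@1 credit but still contributes to both position-aware metrics.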

Reproducibility

TUEF is prototyped in Python 3.8.17. All experiments are conducted on an Intel(R) Xeon(R) Platinum 8164 CPU 2.00GHz processor with 503GB RAM on Linux 5.4.0-153-generic. The source code of TUEF is publicly available (https://github.com/maddalena-amendola/TUEF).

5. Experimental analysis

This section discusses two blocks of experiments. The first block (Section 5.1) is an ablation study of TUEF across all the communities, aimed at understanding the contribution of TUEF’s different components (detailed in Section 3) to the overall performance. To make EF decisions transparent and interpretable, the experiments also assess the impact of integrating into TUEF an interpretable LtR algorithm (IlMart (Lucchese et al., 2022)), along with examples of TUEF’s decision-making process. In the second block of experiments (Section 5.2), we conduct a comprehensive evaluation of TUEF aimed at assessing its effectiveness in two scenarios: an end-to-end Expert Ranking scenario, and an offline Expert Subsampling Ranking scenario, where we conduct experiments with sampled metrics (Krichene and Rendle, 2020b), ranking only a small set of candidates for each query. In both scenarios, we compare TUEF with state-of-the-art competitors for which the implementation is publicly available. Finally, in Section 5.3, we study the ability of TUEF to scale with larger datasets by considering four datasets corresponding to 1, 2, 3, and 4 months of StackOverflow data.

5.1. Ablation study

We compare TUEF in an end-to-end scenario with variants of the proposed solution, each exploiting only a subset of the TUEF components. By examining these different configurations, we can assess and quantify the impact of each component and gain insights into the effectiveness of combining social and content information in our approach for addressing the EF task for CQA platforms:

  • BC: It uses the MLG and sorts the experts in the layers related to the new question based on their Betweenness centrality score.

  • BM25: It uses the MLG and sorts the experts in the question’s layers based on the BM25 score computed between the query and their previously answered questions. As in TUEF, the ranked lists of candidate experts retrieved for the tag and content indexes are merged by alternating their elements.

  • TUEF NB: It only uses the MLG and applies the Network-based method. Candidates are ranked using a LtR model exploiting the static and the following query-dependent features: LayerCount, QueryKnowledge, VisitCountNetwork, StepsNetwork, BetweennesPos, BetweennessScore, Eigenvector, PageRank, Closeness, Degree, and AvgWeights.

  • TUEF CB: It only uses the MLG and applies the Content-based method. Candidates are ranked using a LtR model exploiting the static and the following query-dependent features: LayerCount, QueryKnowledge, VisitCountContent, StepsContent, ScoreIndexTag, ScoreIndexText, FrequencyIndexTag, FrequencyIndexTex, Degree, and AvgWeights.

  • TUEF SL: In contrast to TUEF, it represents users’ relationships in a graph with a Single Layer. All other phases are unchanged.

  • TUEF NoRW: In contrast to TUEF, it skips the Exploratory phase and does not perform Random Walks to extend the set of candidate experts selected considering the probability of receiving an answer.

  • TUEF Lin: It ranks the experts according to the solution of (Roy et al., 2018), which uses a linear combination of features modeling experts, questions, and users’ expertise.

  • TUEF IlMart: It uses IlMart, the interpretable version of the LambdaMART algorithm proposed in (Lucchese et al., 2022), to learn the expert ranking model. IlMart is a learning algorithm that generates explainable LtR models based on a version of LambdaMART constrained to use only univariate and bivariate functions at training time. The IlMart learning strategy consists of three steps: i) Main Effects Learning, which uses the LambdaMART boosting strategy to learn an ensemble of trees $\tau(x_j)$ in which the only feature allowed in each split is $x_j$; ii) Interaction Effects Selection, which selects the top-K most important feature pairs $(x_i, x_j)$; iii) Interaction Effects Learning, which uses LambdaMART to learn an ensemble of trees $\tau_{ij}(x_i, x_j)$, where the only features allowed in a split of each tree $t \in \tau_{ij}(x_i, x_j)$ are $x_i$ or $x_j$, i.e., the features of one of the pairs selected in the previous step. The resulting ranking models can achieve ranking performance close to that of unconstrained LambdaMART models without all their complexity, thus trading off effectiveness and explainability.
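
The alternating merge mentioned for the BM25 variant in the list above can be sketched as follows. This is only an illustration: the function name is hypothetical, and two details are assumptions not stated in the text, namely that the tag-index list contributes the first element and that an expert appearing in both lists is kept only at its first occurrence.

```python
from itertools import zip_longest

def alternate_merge(tag_ranking, text_ranking):
    """Round-robin merge of two ranked lists of candidate experts.

    Assumption: the tag-index list goes first, and duplicates are dropped
    so that each expert keeps its earliest position.
    """
    merged, seen = [], set()
    for pair in zip_longest(tag_ranking, text_ranking):
        for expert in pair:
            if expert is not None and expert not in seen:
                seen.add(expert)
                merged.append(expert)
    return merged

# Hypothetical ranked lists retrieved from the tag and content indexes.
merged = alternate_merge(["a", "b", "c"], ["b", "d", "e", "f"])
print(merged)
```

When one list is longer than the other, its remaining elements are appended in their original order once the shorter list is exhausted.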

Table 2. Ablation Study Results. The table reports the P@1, NDCG@3, R@5, and MRR scores of TUEF and the baselines representing its different components. Scores marked with the ▼ symbol indicate that TUEF statistically outperforms the corresponding baseline according to the paired t-test with Bonferroni correction and p-value < 0.05. Conversely, the ▲ symbol indicates that the corresponding baseline statistically outperforms TUEF. Underlined values indicate that the baseline has a higher score than TUEF, but the difference is not statistically significant.

Dataset          StackOverflow                   Unix                            AskUbuntu
Model            P@1     NDCG@3  R@5     MRR     P@1     NDCG@3  R@5     MRR     P@1     NDCG@3  R@5     MRR
BC               0.009▼  0.017▼  0.045▼  0.030▼  0.046▼  0.087▼  0.165▼  0.122▼  0.006▼  0.010▼  0.051▼  0.044▼
BM25             0.209▼  0.259▼  0.325▼  0.259▼  0.116▼  0.217▼  0.396▼  0.254▼  0.098▼  0.196▼  0.352▼  0.225▼
$TUEF_{NB}$      0.062▼  0.127▼  0.236▼  0.140▼  0.27▼   0.353▼  0.494▼  0.376▼  0.126▼  0.182▼  0.299▼  0.208▼
$TUEF_{Lin}$     0.263▼  0.383▼  0.569▼  0.406▼  0.297▼  0.382▼  0.542▼  0.41▼   0.184▼  0.285▼  0.440▼  0.307▼
$TUEF_{SL}$      0.431   0.565▼  0.750   0.564▼  0.336   0.430   0.585   0.452   0.255▲  0.363▲  0.513▲  0.375▲
$TUEF_{CB}$      0.455   0.589   0.761   0.587   0.332   0.424   0.573▼  0.444▼  0.244   0.350   0.502   0.364
$TUEF_{NoRW}$    0.446   0.582   0.744▼  0.573▼  0.334   0.425   0.575   0.436▼  0.249   0.342   0.472▼  0.345▼
$TUEF_{IlMart}$  0.360▼  0.496▼  0.676▼  0.502▼  0.324   0.416▼  0.565▼  0.438▼  0.203▼  0.305▼  0.468▼  0.327▼
TUEF             0.459   0.592   0.760   0.590   0.331   0.425   0.581   0.448   0.241   0.345   0.500   0.363

Dataset          Server Fault                    Physics                         Mathematics
Model            P@1     NDCG@3  R@5     MRR     P@1     NDCG@3  R@5     MRR     P@1     NDCG@3  R@5     MRR
BC               0.010▼  0.040▼  0.099▼  0.071▼  0.014▼  0.027▼  0.093▼  0.059▼  0.002▼  0.040▼  0.136▼  0.059▼
BM25             0.094▼  0.152▼  0.251▼  0.167▼  0.103▼  0.189▼  0.334▼  0.226▼  0.111▼  0.199▼  0.361▼  0.236▼
$TUEF_{NB}$      0.21▼   0.253▼  0.364▼  0.285▼  0.101▼  0.162▼  0.277▼  0.196▼  0.084▼  0.145▼  0.278▼  0.178▼
$TUEF_{Lin}$     0.260   0.368   0.524▼  0.388   0.139▼  0.217▼  0.357▼  0.253▼  0.177▼  0.27▼   0.427▼  0.301▼
$TUEF_{SL}$      0.279   0.389   0.547   0.406   0.220   0.299   0.440   0.331   0.208   0.315   0.471   0.335
$TUEF_{CB}$      0.267   0.377   0.538   0.395   0.215   0.294   0.429   0.325▼  0.219▲  0.327▲  0.482   0.345
$TUEF_{NoRW}$    0.272   0.378   0.527▼  0.382▼  0.218   0.295   0.429   0.313▼  0.210   0.313   0.469   0.321▼
$TUEF_{IlMart}$  0.250   0.364▼  0.529▼  0.384▼  0.195▼  0.276▼  0.418▼  0.309▼  0.197▼  0.301▼  0.475   0.329▼
TUEF             0.266   0.382   0.547   0.400   0.221   0.300   0.440   0.332   0.207   0.316   0.478   0.340

Discussion

Table 2 reports the results of the ablation study. We mark statistically significant performance losses and gains of each variant with respect to TUEF with the symbols ▼ and ▲, respectively (paired t-test with Bonferroni correction and p-value < 0.05), and we underline the scores that are numerically greater than those of TUEF.

Across all examined communities, TUEF consistently outperforms the BC, BM25, $TUEF_{NB}$, and $TUEF_{Lin}$ baselines. A recurring pattern emerges across communities: BC and $TUEF_{NB}$, which both rely exclusively on network aspects, are the least effective baselines. In contrast, BM25’s performance highlights the significance of the content-based component: in every community, roughly 10% of the queries or more are best addressed by experts who have previously answered similar questions. $TUEF_{Lin}$ consistently performs worse than TUEF, underscoring the efficacy of TUEF’s LtR methodology in capturing hidden relationships among expert features.

$TUEF_{SL}$ shows slightly higher or equal performance across communities, except for StackOverflow and AskUbuntu. On StackOverflow, TUEF outperforms $TUEF_{SL}$ by a statistically significant margin, while on AskUbuntu the opposite holds, hinting at community-specific influences. This observation aligns with the Silhouette values reported in Table 1, which quantify the clustering effectiveness of the layers within the MLG. StackOverflow’s high Silhouette value of 0.805 indicates a robust subdivision of tags into clusters, which benefits TUEF’s exploitation of topic layers. Conversely, AskUbuntu, with the lowest Silhouette value of 0.391, has a weaker separation of tags into clusters, which contributes to the statistically significant advantage of $TUEF_{SL}$ in that community. Consequently, the MLG proves beneficial when the community discusses clearly distinct topics. Moreover, the scores of $TUEF_{CB}$ show that the content component matters more than the network component, although incorporating both components in TUEF marginally improves performance for most communities. Finally, $TUEF_{NoRW}$ exhibits lower recall (R@5) without a decrease in precision (P@1). The random walk mechanism explains this behavior: by identifying a larger set of relevant experts, TUEF more often includes the correct expert within the top five positions, which improves R@5. However, the larger candidate set also makes it harder to rank the best expert in the first position, so P@1 remains similar for the two models. Overall, TUEF’s random walk improves recall by expanding the candidate set, at the cost of making the highest precision harder to achieve.

Figure 2. LtR Algorithm’s Feature Importances. Figure (a) on the left displays the eight most important features for the TUEF LtR algorithm. Figure (b) illustrates the feature importance values for all features considered by TUEF IlMart.
Figure 3. The three most important main and interaction effects of TUEF IlMart on the StackOverflow dataset. Figures (a), (b), and (c) show the main effects of FreqIndexTag, FreqIndexText, and VisitCountContent, respectively. The x-axis represents the values the feature can have, while the y-axis represents the corresponding contribution of the feature to the predicted final score. Figure (d) illustrates the most important interaction effect learned by TUEF IlMart, composed of the features FreqIndexTag (x-axis) and FreqIndexText (y-axis). The color bar indicates the interaction contribution.

Interpretability

$TUEF_{IlMart}$ generally performs worse than TUEF, in most cases by a statistically significant margin. The most notable disparity occurs in the StackOverflow community, where the relative difference reaches 27.5% for P@1 and 12.43% for R@5. In the other communities, the P@1 differences range from a maximum of 18.71% (AskUbuntu) to a minimum of 2.16% (Unix). However, the differences diminish considerably when considering R@5, with the highest value outside StackOverflow being 6.83% (AskUbuntu) and no statistically significant difference in Mathematics, indicating the proficiency of $TUEF_{IlMart}$ in placing the best answerer within the first five positions.
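
The relative differences quoted above can be checked against Table 2 with a few lines of arithmetic. The sketch below assumes, consistently with the reported figures, that the difference is expressed relative to the score of the interpretable variant; the function and variable names are illustrative.

```python
def relative_gain(tuef_score, variant_score):
    """Relative gain of TUEF over a variant, in percent.

    Assumption: the denominator is the variant's score, which reproduces
    the percentages quoted in the text.
    """
    return 100.0 * (tuef_score - variant_score) / variant_score

# (TUEF, TUEF_IlMart) score pairs copied from Table 2.
p_at_1 = {"StackOverflow": (0.459, 0.360), "Unix": (0.331, 0.324)}
r_at_5 = {"StackOverflow": (0.760, 0.676)}

p1_gains = {c: round(relative_gain(*s), 2) for c, s in p_at_1.items()}
r5_gains = {c: round(relative_gain(*s), 2) for c, s in r_at_5.items()}
print(p1_gains, r5_gains)
```

Small discrepancies in the last decimal for other communities can arise because the table reports scores rounded to three decimals.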

The capacity of $TUEF_{IlMart}$ to explain model results may offset the observed performance decline. TUEF, on the other hand, employs a more sophisticated decision tree-based LtR approach, which yields higher performance but offers only minimal clarity about the significance of the different factors influencing the ranking result. Figure 2 illustrates the feature importance of both models (LambdaMART and $TUEF_{IlMart}$) for the StackOverflow community. It reports LambdaMART’s top eight features and all those considered by $TUEF_{IlMart}$, all of which are present among LambdaMART’s top eight features, albeit in a different order. For example, the ScoreIndexTag feature, the most used by LambdaMART, is the least used by $TUEF_{IlMart}$. Most of the crucial features for LambdaMART are associated with the content-based method. Nevertheless, features from the network-based method, such as Closeness, PageRank, and AvgWeight, exhibit high importance values. On the other hand, all the features selected by $TUEF_{IlMart}$ are content-based, highlighting the importance of the text component for the EF task.

Figure 3 illustrates how $TUEF_{IlMart}$ can be used for interpretability purposes. It shows the contribution to the predicted score of the three IlMart main effects with the highest feature importance values (i.e., FreqIndexTag, VisitCountContent, and FreqIndexText) and of the most important interaction effect learned by $TUEF_{IlMart}$ for the StackOverflow community. High values of these features indicate proficient expert competence. The 2D plots and the heatmap are built by aggregating all the trees that use the same features, i.e., by representing the contribution of each main effect $\tau(x_j)$ and each interaction effect $\tau_{ij}(x_i, x_j)$, where $\tau(x_j)$ and $\tau_{ij}(x_i, x_j)$ are ensembles of trees. Through Figure 3, we can analyze instances where TUEF successfully places the best answerer in the first position and cases where it fails. Table 3 presents four example questions. In the first two, TUEF successfully identifies the best expert by placing them in the first position; for a better understanding, we report the features of the top two ranked experts. Conversely, for the last two examples, where TUEF fails, we show the feature values of the experts in ranking order up to the position where the ranker placed the best expert. The Rank column indicates the expert’s position in the final ranked list, while the BestExpert column serves as a label, with a value of 1 for the ground-truth expert and 0 otherwise.

A notable pattern emerges in both TUEF’s successes and failures: the expert occupying the first position consistently has higher values for the features evaluated by the $TUEF_{IlMart}$ ranker. This results in a greater contribution and, consequently, a higher final score, as shown in Figure 3. For example, consider the first instance in Table 3, with QID = 65438592, and focus on the feature FreqIndexTag. The BestExpert (first row of Table 3) has a value of 45, corresponding to a contribution of approximately 1, as shown in Figure 3(a). In contrast, the expert placed in the second position (second row of Table 3) has a value of 7, corresponding to a contribution of less than 0.5 (less than half of the first). By combining all the feature contributions read from Figure 3, we can reconstruct the final ranking and understand why the model could or could not place the BestExpert in the first position.
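
To make the idea of combining per-feature contributions concrete, the following sketch scores the two experts of the first example in Table 3 with an additive model made of piecewise-constant main effects. All thresholds and contribution values are hypothetical placeholders: only the FreqIndexTag levels are chosen to agree with the contributions described in the text (about 1 for a value of 45, less than 0.5 for a value of 7). Interaction effects are omitted, so the totals are not expected to reproduce the Score column of Table 3.

```python
from bisect import bisect_right

# Hypothetical main effects: (thresholds, contributions).
# contributions[i] applies to feature values falling in the i-th interval
# delimited by the sorted thresholds.
MAIN_EFFECTS = {
    "FreqIndexTag": ([5, 20, 40], [0.0, 0.4, 0.7, 1.0]),
    "VisitCountContent": ([5, 15], [0.0, 0.3, 0.6]),
    "FreqIndexText": ([10, 50], [0.0, 0.3, 0.7]),
}

def contribution(feature, value):
    """Look up the (hypothetical) contribution of one feature value."""
    thresholds, contributions = MAIN_EFFECTS[feature]
    return contributions[bisect_right(thresholds, value)]

def explain(expert):
    """Return the per-feature contributions and their sum for one expert."""
    parts = {f: contribution(f, v) for f, v in expert.items()}
    return parts, sum(parts.values())

# Feature values of the two experts for QID 65438592 (Table 3).
first = {"FreqIndexTag": 45, "VisitCountContent": 17, "FreqIndexText": 54}
second = {"FreqIndexTag": 7, "VisitCountContent": 7, "FreqIndexText": 7}

parts_first, total_first = explain(first)
parts_second, total_second = explain(second)
print(parts_first, total_first)
print(parts_second, total_second)
```

Because the model is additive, each entry of the returned dictionary can be read as the amount by which that feature raises or lowers the expert's score, which is the kind of inspection that the plots in Figure 3 support.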

Table 3. Examples of TUEF IlMart Decision-Making Process. The table presents four examples of questions: two where TUEF IlMart successfully identified the right expert (i.e., placing the right expert in the first position of the final ranking) and two where it was unsuccessful. For each question, the table reports the values of FreqIndexTag, VisitCountContent, and FreqIndexText, along with the final expert Score, the position in the final ranking, and the BestExpert label.
QID       FreqIndexTag  VisitCountContent  FreqIndexText  Score  Rank  BestExpert
65438592  45            17                 54             2.39   1     1
65438592  7             7                  7              1.96   2     0
65438859  2             10                 13             -0.12  1     1
65438859  1             1                  9              -1.7   2     0
65438631  81            62                 82             3.29   1     0
65438631  30            25                 30             2.83   2     1
65439018  229           14                 202            3.5    1     0
65439018  14            5                  12             2.58   2     0
65439018  11            9                  16             2.45   3     1

5.2. State-of-the-art Baselines

In this section, we evaluate TUEF under the two EF evaluation approaches discussed in Section 2. As in Section 5.1, we assess the statistical significance of the performance gains and losses measured in our experiments (▲ and ▼) with a paired t-test with a p-value threshold of 0.05 and Bonferroni correction. Moreover, we underline the scores that are numerically, but not statistically significantly, greater than those of TUEF.

Expert Ranking.

Table 4. Expert Ranking Comparison. The table reports the P@1, NDCG@3, R@5, and MRR scores of TUEF and the baselines in the Expert Ranking category for all selected StackExchange communities. Scores marked with the ▼ symbol indicate that TUEF statistically outperforms the corresponding baseline according to the paired t-test with Bonferroni correction and p-value < 0.05. Conversely, the ▲ symbol indicates that the corresponding baseline statistically outperforms TUEF. Underlined values indicate that the baseline has a higher score than TUEF, but the difference is not statistically significant.

Dataset          Stack Overflow                  Unix                            Ask Ubuntu
Model            P@1     NDCG@3  R@5     MRR     P@1     NDCG@3  R@5     MRR     P@1     NDCG@3  R@5     MRR
BM25             0.257▼  0.366▼  0.513▼  0.381▼  0.253▼  0.348▼  0.492▼  0.365▼  0.141▼  0.237▼  0.389▼  0.261▼
BM25+TAG         0.307▼  0.431▼  0.604▼  0.441▼  0.260▼  0.345▼  0.494▼  0.368▼  0.151▼  0.256▼  0.415▼  0.277▼
MiniLM           0.282▼  0.384▼  0.527▼  0.399▼  0.262▼  0.352▼  0.485▼  0.367▼  0.158▼  0.259▼  0.412▼  0.279▼
MiniLM+TAG       0.320▼  0.441▼  0.624▼  0.456▼  0.269▼  0.358▼  0.492▼  0.371▼  0.172▼  0.273▼  0.419▼  0.292▼
DistilBERT       0.288▼  0.394▼  0.542▼  0.407▼  0.268▼  0.356▼  0.492▼  0.373▼  0.161▼  0.264▼  0.415▼  0.283▼
DistilBERT+TAG   0.337▼  0.455▼  0.627▼  0.468▼  0.276▼  0.363▼  0.492▼  0.376▼  0.178▼  0.282▼  0.438▼  0.302▼
t-BGER           0.254▼  0.338▼  0.478▼  0.36▼   0.25▼   0.314▼  0.43▼   0.345▼  0.202▼  0.312▼  0.508   0.335▼
TUEF             0.459   0.592   0.760   0.590   0.331   0.425   0.581   0.448   0.241   0.345   0.500   0.363

Dataset          Server Fault                    Physics                         Mathematics
Model            P@1     NDCG@3  R@5     MRR     P@1     NDCG@3  R@5     MRR     P@1     NDCG@3  R@5     MRR
BM25             0.199▼  0.286▼  0.430▼  0.311▼  0.164▼  0.234▼  0.364▼  0.267▼  0.149▼  0.235▼  0.378▼  0.265▼
BM25+TAG         0.218▼  0.309▼  0.475▼  0.331▼  0.166▼  0.239▼  0.374▼  0.271▼  0.149▼  0.232▼  0.377▼  0.258▼
MiniLM           0.200▼  0.293▼  0.436▼  0.313▼  0.172▼  0.248▼  0.382▼  0.278▼  0.159▼  0.248▼  0.403▼  0.277▼
MiniLM+TAG       0.224▼  0.315▼  0.476▼  0.337▼  0.172▼  0.25▼   0.386▼  0.279▼  0.161▼  0.248▼  0.405▼  0.277▼
DistilBERT       0.215▼  0.302▼  0.449▼  0.325▼  0.178▼  0.256▼  0.386▼  0.284▼  0.170▼  0.263▼  0.411▼  0.291▼
DistilBERT+TAG   0.223▼  0.319▼  0.478▼  0.341▼  0.179▼  0.254▼  0.384▼  0.283▼  0.17▼   0.262▼  0.414▼  0.291▼
t-BGER           0.275▲  0.356▼  0.474▼  0.375▼  0.080▼  0.151▼  0.258▼  0.180▼  0.122▼  0.169▼  0.295▼  0.215▼
TUEF             0.266   0.382   0.547   0.400   0.221   0.300   0.440   0.332   0.207   0.316   0.478   0.340

The first set of baselines follows the Expert Ranking configuration, which involves an end-to-end process of selecting users as experts, computing the similarity between a candidate expert and the query, and subsequently ranking the candidate experts. TUEF adopts this configuration, and we compare it with the models introduced in (Kasela et al., 2023), enabling us to evaluate the effectiveness of current neural re-rankers on this task, as well as the t-BGER model (Krishna and Antulov-Fantulin, 2023).

Kasela et al. (Kasela et al., 2023) employ a two-stage ranking architecture, in which a recall-oriented retriever is followed by a precision-oriented re-ranker. In the first stage, they use Elasticsearch’s BM25 model to generate a limited, ordered list of candidate experts. In the second stage, they re-rank these experts with a linear combination of scores. This combination includes (i) the BM25 score, (ii) the similarity score computed by neural network-based re-rankers, and, for three of the six proposed baselines, (iii) the score generated by a personalized model based on user history. The neural network-based re-rankers are DistilBERT (https://huggingface.co/distilbert-base-uncased) and MiniLM (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). Both pre-trained models are fine-tuned for ten epochs on each community considered. The personalized model, called TAG, captures the similarity between the topics previously addressed by the asker and the answerer based on the tags of the questions asked and the answers provided. Finally, the scores are combined as a weighted sum of the normalized model scores, with the models’ weights determined by a grid search on the validation set. The proposed baselines are BM25, BM25+TAG, MiniLM, MiniLM+TAG, DistilBERT, and DistilBERT+TAG. Although their names omit it for readability, the last four baselines also incorporate BM25.
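
The score combination described above can be illustrated with a small sketch. It is not the implementation of Kasela et al.: the min-max normalization, the constraint that weights sum to 1, the grid resolution, the choice of P@1 as the selection criterion, and all names and toy scores are assumptions made only for illustration.

```python
from itertools import product

def min_max(scores):
    """Min-max normalize a {expert: score} mapping (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {e: 0.0 for e in scores}
    return {e: (s - lo) / (hi - lo) for e, s in scores.items()}

def combine(model_scores, weights):
    """Weighted sum of normalized scores; all models score the same experts."""
    normalized = {m: min_max(s) for m, s in model_scores.items()}
    experts = next(iter(model_scores.values())).keys()
    return {e: sum(w * normalized[m][e] for m, w in weights.items())
            for e in experts}

def rank(combined):
    """Experts sorted by decreasing combined score."""
    return sorted(combined, key=combined.get, reverse=True)

def grid_search(validation, models, steps=10):
    """Pick the weights (multiples of 1/steps, summing to 1) maximizing P@1."""
    best_weights, best_p1 = None, -1.0
    for combo in product(range(steps + 1), repeat=len(models)):
        if sum(combo) != steps:
            continue
        weights = {m: c / steps for m, c in zip(models, combo)}
        hits = sum(rank(combine(scores, weights))[0] == best
                   for scores, best in validation)
        p1 = hits / len(validation)
        if p1 > best_p1:
            best_weights, best_p1 = weights, p1
    return best_weights, best_p1

# Hypothetical validation queries: raw scores per model and the best answerer.
validation = [
    ({"bm25": {"u1": 12.0, "u2": 9.0, "u3": 3.0},
      "neural": {"u1": 0.2, "u2": 0.9, "u3": 0.1},
      "tag": {"u1": 0.1, "u2": 0.5, "u3": 0.4}}, "u2"),
    ({"bm25": {"u1": 5.0, "u2": 8.0, "u3": 2.0},
      "neural": {"u1": 0.8, "u2": 0.3, "u3": 0.2},
      "tag": {"u1": 0.6, "u2": 0.1, "u3": 0.2}}, "u1"),
]
best_weights, best_p1 = grid_search(validation, ["bm25", "neural", "tag"])
print(best_weights, best_p1)
```

Normalizing before summing matters because BM25 scores and neural similarity scores live on different scales; without it, the weights found by the grid search would mostly compensate for the scale mismatch.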

In contrast, Krishna et al. (Krishna and Antulov-Fantulin, 2023) introduce t-BGER, a temporal weighted bipartite graph model. Their study emphasizes the importance of utilizing tags for expert identification while reducing task complexity. The model is based on a user-tag bipartite graph, where the edge weight reflects the user’s activity related to the specific tag. Additionally, they incorporate temporal discounting into this graph, assigning higher weights to recent activities. Finally, they apply resource allocation techniques through network-based inference on the bipartite graph, effectively managing highly sparse data.

As reported in Table 4, among the baselines, DistilBERT+TAG shows superior performance. However, t-BGER demonstrates better scores for the AskUbuntu and Server Fault communities. Overall, TUEF surpasses all baselines across various metrics, except for P@1 in the Server Fault community, where t-BGER exhibits a relative improvement of 3.38%, and R@5 in the AskUbuntu community, where t-BGER has a slightly higher score. Excluding these two exceptions, TUEF records minimum relative improvements of 19.31% for P@1 in the AskUbuntu community, 7.30% for NDCG@3, 15.40% for R@5, and 6.67% for MRR in the Server Fault community. In the remaining communities, TUEF reports higher relative improvements across all other metrics.

Expert Subsample Ranking.

The second set of baselines follows the Expert Subsample Ranking evaluation approach, which aims to rank a small set of 20 experts when presented with a new query (see Section 2). This set always includes (i) the users who have answered the question, including the best answerer, and (ii) a variable number of users, chosen from the top 10% of users identified as the most active answerers, added to reach a total of 20 candidate experts. Since this definition of the EF task deviates from the one considered so far by requiring prior knowledge of who will answer the query, we adapted TUEF to allow a fair comparison with the following two state-of-the-art models for the EF task, whose code is publicly available:

  • NeRank (Li et al., 2019): It models the CQA platform as a heterogeneous network to learn representations for question raisers and question answerers through a metapath-based algorithm. Using this heterogeneous network, NeRank preserves relationship information while modeling the question’s content with a single-layer LSTM. Finally, a CNN assigns a score to each expert for a given question, representing the probability of the expert providing the accepted answer.

  • PMEF (Peng et al., 2022a): It consists of three main modules: a multi-view question encoder for learning comprehensive question features, an intra-view encoder to discover view-specific interactions among experts and target questions, and an inter-view encoder designed to extract expert/question features by integrating different view information in a personalized manner.

We always consider the questions selected for the experimentation phase for each community. However, the test set is smaller due to constraints imposed by NeRank, PMEF, and, in this specific setup, TUEF as well. Specifically, both NeRank and PMEF require that the user posing the question has previously asked other questions, so that the interactions between the questioner and the responder can be captured. Additionally, the test set includes only those questions for which TUEF, through graph exploration, included the best answerer in the set of potential experts. Moreover, to ensure a fair comparison among the three models, we required each to rank the same set of experts. This set comprises users who have answered the question, including the best answerer, and additional experts randomly chosen from those identified by TUEF as candidates. This procedure is the fairest, considering the way TUEF selects candidate experts. Specifically, as a topic-based model, TUEF chooses hard-negative samples because they are always users who have previously answered precisely the topics related to the query. In contrast, following the procedure of NeRank and PMEF and choosing from the top 10% of the community’s best answerers would lead to selecting users who could be irrelevant to the query topics, thus making it easier to rank them lower. We also include $TUEF_{IlMart}$ in the comparison, alongside TUEF, to study the behavior of the interpretable version of TUEF under this different configuration.
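
The construction of the shared candidate set described above can be sketched as follows. This is an illustrative sketch, not the code used in the experiments: the function name `build_candidate_set`, its parameters, and the fixed random seed are assumptions.

```python
import random


def build_candidate_set(answerers, tuef_candidates, size=20, seed=0):
    """Build the list of experts that every model has to rank.

    answerers: users who answered the question (best answerer included).
    tuef_candidates: users retrieved by the graph exploration (assumed input).
    The list starts with the answerers and is filled up to `size` with
    users drawn at random from the remaining candidates.
    """
    rng = random.Random(seed)
    selected = list(dict.fromkeys(answerers))  # drop duplicates, keep order
    already_selected = set(selected)
    pool = [u for u in dict.fromkeys(tuef_candidates) if u not in already_selected]
    n_fill = max(0, size - len(selected))
    selected += rng.sample(pool, min(n_fill, len(pool)))
    return selected


# Toy usage: two answerers plus 18 randomly chosen topic-related candidates.
candidates = build_candidate_set(
    answerers=["u1", "u2"],
    tuef_candidates=[f"c{i}" for i in range(50)] + ["u1"],
)
```

Because the fill-in users come from the topic-related candidates rather than from the globally most active answerers, the negatives in this sketch are hard negatives in the sense discussed above.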

Table 5. Expert Subsample Ranking Comparison. The table reports the P@1, NDCG@3, R@5, and MRR scores of TUEF and the baselines in the Expert Ranking category for all selected StackExchange communities. Scores marked with the ▼ symbol indicate that TUEF statistically outperforms the corresponding baseline according to the paired t-test with Bonferroni correction and p-value < 0.05. Conversely, the ▲ symbol indicates that the corresponding baseline statistically outperforms TUEF. Underlined values indicate that the baseline has a higher score than TUEF, but the difference is not statistically significant.

Dataset        Model            P@1      NDCG@3   R@5      MRR
StackOverflow  NeRank           0.313▼   0.460▼   0.718▼   0.491▼
StackOverflow  PMEF             0.116▼   0.213▼   0.422▼   0.273▼
StackOverflow  $TUEF_{IlMart}$  0.700▼   0.815▼   0.939▼   0.804▼
StackOverflow  TUEF             0.800    0.883    0.989    0.877
Unix           NeRank           0.385▼   0.490▼   0.646▼   0.514▼
Unix           PMEF             0.429▼   0.556▼   0.718▼   0.567▼
Unix           $TUEF_{IlMart}$  0.619    0.744▲   0.899▼   0.741
Unix           TUEF             0.611    0.738    0.906    0.736
Ask Ubuntu     NeRank           0.217▼   0.364▼   0.651▼   0.406▼
Ask Ubuntu     PMEF             0.298▼   0.472▼   0.740▼   0.492▼
Ask Ubuntu     $TUEF_{IlMart}$  0.528▼   0.673▼   0.875▼   0.673▼
Ask Ubuntu     TUEF             0.574    0.718    0.901    0.712
Server Fault   NeRank           0.292▼   0.403▼   0.597▼   0.442▼
Server Fault   PMEF             0.319▼   0.447▼   0.654▼   0.471▼
Server Fault   $TUEF_{IlMart}$  0.498▼   0.668▼   0.850▼   0.660▼
Server Fault   TUEF             0.594    0.745    0.916    0.733
Physics        NeRank           0.187▼   0.341▼   0.621▼   0.380▼
Physics        PMEF             0.214▼   0.320▼   0.556▼   0.378▼
Physics        $TUEF_{IlMart}$  0.520▲   0.674▲   0.860    0.673▲
Physics        TUEF             0.487    0.635    0.858    0.640
Mathematics    NeRank           0.233▼   0.372▼   0.608▼   0.407▼
Mathematics    PMEF             0.130▼   0.270▼   0.523▼   0.313▼
Mathematics    $TUEF_{IlMart}$  0.507▼   0.627▼   0.805▼   0.642▼
Mathematics    TUEF             0.648    0.669    0.907    0.760

Table 5 shows the results of the three models across different communities. PMEF generally outperforms NeRank; the exceptions are the StackOverflow and Mathematics communities and, for every metric except P@1, the Physics community. TUEF, in turn, performs significantly better than PMEF: the smallest relative improvements are 42.42% for P@1, 32.73% for NDCG@3, and 29.81% for MRR in the Unix community, and 21.76% for R@5 in the AskUbuntu community. The relative improvements for the other metrics and communities are even larger.
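
The relative improvements for the Unix community can be recomputed from the Table 5 scores. The snippet below is a small worked example; the helper name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(score, baseline):
    """Relative improvement of `score` over `baseline`, in percent."""
    return 100.0 * (score - baseline) / baseline


# Unix community scores copied from Table 5.
tuef_unix = {"P@1": 0.611, "NDCG@3": 0.738, "R@5": 0.906, "MRR": 0.736}
pmef_unix = {"P@1": 0.429, "NDCG@3": 0.556, "R@5": 0.718, "MRR": 0.567}

improvements = {
    metric: round(relative_improvement(tuef_unix[metric], pmef_unix[metric]), 2)
    for metric in tuef_unix
}
print(improvements)
# {'P@1': 42.42, 'NDCG@3': 32.73, 'R@5': 26.18, 'MRR': 29.81}
```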

$TUEF_{IlMart}$ consistently outperforms NeRank and PMEF. However, it exhibits lower performance across various communities compared to TUEF, except in the Unix and Physics communities, where it demonstrates statistical superiority in specific metrics. $TUEF_{IlMart}$ has already demonstrated performance comparable to TUEF in these communities during the ablation study (Table 2). Consequently, $TUEF_{IlMart}$ is a promising compromise between interpretability and system accuracy for less topic-sensitive communities.

Finally, note that TUEF performs considerably better in this set of experiments compared to the cases analyzed in Tables 2 and 4. This improvement is the result of two factors: (i) the test set includes only queries for which TUEF successfully identified the best answerer during graph exploration, and (ii) the set of candidates to be ranked is limited to twenty users, approximately one-fifth of those that TUEF typically selects and sorts on average (see Table 1). We remark that this configuration and the specific settings for these experiments were required to ensure a fair comparison with the other considered algorithms.

5.3. TUEF scalability

Table 6. Statistics for four datasets corresponding to 1, 2, 3, and 4 months of StackOverflow data, including the number of questions used for train and LtR, the number of answers, the number of users labeled as experts, the minimum number of accepted answers the experts have (MinAccAns), the number of tags associated with questions, the number of MLG’s layers, and the average length of experts lists to rank at inference time (AvgLists).
Months Train Answers LtR Experts MinAccAns Tags Layers AvgLists
1 38616 53717 8008 350 11 5365 10 111.03
2 79451 110169 19991 577 14 7546 10 146.74
3 122612 170131 33164 819 15 9136 10 182.68
4 165320 230246 44821 1015 16 10514 10 186.06
Figure 4. TUEF performance with four datasets corresponding to 1, 2, 3, and 4 months of StackOverflow data. The x-axis specifies the number of months considered, while the y-axis reports the performance metrics, including P@1, NDCG@3, R@5, MRR computed on the same test set of 1,342 queries. The rightmost plot indicates the per-query TUEF average inference time in seconds.

Table 6 and Figure 4 illustrate TUEF’s ability to scale and adapt to larger datasets while maintaining its effectiveness. The table shows key statistics for four datasets corresponding to 1, 2, 3, and 4 months of StackOverflow data, including the number of questions used for training the MLG and the LtR model, the number of answers, the number of users labeled as experts, the minimum number of accepted answers the experts have (MinAccAns), the number of tags associated with questions, the number of MLG layers, and the average length of the expert lists to rank at inference time (AvgLists). The training set size increases approximately linearly, with a consistent monthly growth rate, expanding from 38,616 questions in the 1-month dataset to 165,320 questions in the 4-month dataset. The increase in the number of questions is associated with an increase in the number of unique tags, which rises from 5,365 to 10,514, indicating a richer set of topics being covered over time. As expected, using more data increases the number of identified experts (from 350 to 1,015), thereby enriching the expert base. This increase in the experts identified by TUEF leads to a corresponding increment in the number of experts retrieved at inference time, as reported in the AvgLists column.
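
The approximately linear growth of the training set can be checked directly on the Train column of Table 6. The snippet below is a small worked example; the variable names are chosen here for illustration.

```python
# Train column of Table 6: months of data -> number of training questions.
train_questions = {1: 38616, 2: 79451, 3: 122612, 4: 165320}

# Questions added by each additional month of data.
monthly_increments = [
    train_questions[m] - train_questions[m - 1] for m in (2, 3, 4)
]
print(monthly_increments)
# [40835, 43161, 42708]

# Size of each dataset relative to the 1-month dataset.
growth_factors = {
    m: round(n / train_questions[1], 2) for m, n in train_questions.items()
}
```

Each additional month contributes between roughly 40,800 and 43,200 questions, close to the 38,616 questions of the first month, which is consistent with an approximately linear trend.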

Despite the significant growth in data volume, TUEF maintains noteworthy performance on a test set of 1,432 questions, as evidenced by the performance metrics provided (Figure 4). The P@1 metric slightly decreases from 0.492 (1-month) to 0.449 (4-month). The NDCG@3 metric decreases from 0.631 to 0.590, while the MRR slightly drops from 0.625 to 0.589. This decline is relatively minor, considering the significant increase in the training data and the set of experts. In contrast, the R@5 remains more stable for the datasets corresponding to 1, 2, and 3 months, and only slightly decreases to 0.771 for the 4-month dataset, indicating that the model consistently ranks the best answerer within the top five positions. The stability of the R@5 metric is particularly noteworthy, as it suggests that while TUEF may slightly reduce its performance in identifying the best experts at the top position, it still effectively includes them within the first 5 positions. One explanation for the observed slight decrease in performance in metrics other than R@5 is the increased difficulty the LtR algorithm faces when ranking a larger number of experts, as evidenced by the rise in the average length of the lists of candidate experts (from 111 to 186). As the dataset grows, the expanding pool of candidate experts makes it more challenging to distinguish the best expert, leading to minor drops in the Precision, NDCG, and MRR metrics. Finally, the rightmost plot shows the average latency of a single query, which increases from 0.107 seconds for the 1-month dataset to 0.167 seconds for the 4-month dataset. This increment can be attributed to the larger size of the CQA community in the 4-month dataset. Specifically, the greater number of questions leads to a larger and denser MLG as more users and their interactions are included. Consequently, the RW applied to this denser MLG may perform more steps, resulting in the selection of more candidate experts to rank, thereby increasing the time required to perform the inference. Overall, TUEF demonstrates robust scalability, effectively managing larger datasets while maintaining a high level of performance, with slight declines in precision due to the increased data volume and complexity.
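
The different sensitivity of P@1 and R@5 to longer candidate lists can be illustrated with a minimal sketch of the four metrics. It assumes a single relevant expert per query (the best answerer), so that NDCG@3 reduces to 1/log2(1 + rank); the function name `metrics_from_ranks` and the toy ranks are assumptions of this example, not values from the experiments.

```python
import math


def metrics_from_ranks(ranks):
    """Compute P@1, NDCG@3, R@5, and MRR from best-answerer positions.

    ranks: 1-based position of the best answerer in each ranked list,
    or None when the best answerer is not in the list.
    Assumes exactly one relevant expert per query.
    """
    n = len(ranks)
    found = [r for r in ranks if r is not None]
    return {
        "P@1": sum(1 for r in found if r == 1) / n,
        "NDCG@3": sum(1.0 / math.log2(r + 1) for r in found if r <= 3) / n,
        "R@5": sum(1 for r in found if r <= 5) / n,
        "MRR": sum(1.0 / r for r in found) / n,
    }


# Toy example: the best answerer slips by one position in three queries.
before = metrics_from_ranks([1, 2, 4, 1])
after = metrics_from_ranks([2, 3, 5, 1])
```

In this toy case, P@1 drops from 0.5 to 0.25 and MRR and NDCG@3 also decrease, while R@5 stays at 1.0 because every best answerer remains within the first five positions.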

6. Conclusion

In this paper, we extended the analysis of TUEF, a Topic-oriented User-Interaction model for Expert Finding in Community Question Answering platforms. TUEF integrates content and social data by constructing a Multi-Layer Graph to represent user interactions based on topical answering patterns. This approach allows TUEF to leverage both topic specificity and social relationships within the community to identify and rank the most knowledgeable users for any given question. The ablation study results show that TUEF’s performance is particularly notable in larger communities with well-defined topic clusters. Additionally, by incorporating interpretable Learning-to-Rank algorithms (i.e., IlMart (Lucchese et al., 2022)), TUEF achieves complete transparency, allowing a comprehensive understanding of the decision-making processes without significantly decreasing performance. Our extensive experiments conducted on multiple Stack Exchange communities demonstrate that TUEF outperforms state-of-the-art EF models in both the Expert Ranking and Expert Subsample Ranking categories. It excels in precision-oriented metrics such as P@1, NDCG@3, MRR, and R@5, with minimum improvements over the state of the art of 42.42% in P@1, 32.73% in NDCG@3, 21.76% in R@5, and 29.81% in MRR. The study also highlighted TUEF’s ability to handle larger datasets, as evidenced by its performance on datasets corresponding to 1, 2, 3, and 4 months of StackOverflow data collection. Specifically, TUEF consistently ranks relevant experts in top positions, demonstrating its reliability and robustness.

Acknowledgements.
This work was partially supported by: the H2020 SoBigData++ project (#871042); the CAMEO PRIN project (#2022ZLL7MW) funded by the MUR; the HEU EFRA project (#101093026) funded by the EC under the NextGeneration EU programme. A. Passarella’s and R. Perego’s work was partly funded under the PNRR - M4C2 - Investimento 1.3, PE00000013 - “FAIR” project. However, the views and opinions expressed are those of the authors only and do not necessarily reflect those of the EU or European Commission-EU. Neither the EU nor the granting authority can be held responsible for them.

References

  • Amendola et al. (2024) Maddalena Amendola, Andrea Passarella, and Raffaele Perego. 2024. Towards Robust Expert Finding in Community Question Answering Platforms. In Advances in Information Retrieval, Nazli Goharian, Nicola Tonellotto, Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, and Iadh Ounis (Eds.). Springer Nature Switzerland, Cham, 152–168.
  • Aslay et al. (2013) Çiğdem Aslay, Neil O’Hare, Luca Maria Aiello, and Alejandro Jaimes. 2013. Competition-based networks for expert finding. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. 1033–1036.
  • Bassani (2022) Elias Bassani. 2022. ranx: A blazing-fast python library for ranking evaluation and comparison. In European Conference on Information Retrieval. Springer, 259–264.
  • Bergstra et al. (2013) James Bergstra, Daniel Yamins, and David Cox. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning. PMLR, 115–123.
  • Burges (2010) Christopher JC Burges. 2010. From ranknet to lambdarank to lambdamart: An overview. Learning 11, 23-581 (2010), 81.
  • Cañamares and Castells (2020) Rocío Cañamares and Pablo Castells. 2020. On target item sampling in offline recommender system evaluation. In Proceedings of the 14th ACM Conference on Recommender Systems. 259–268.
  • Costa and Ortale (2023a) Gianni Costa and Riccardo Ortale. 2023a. Ask and Ye shall be Answered: Bayesian tag-based collaborative recommendation of trustworthy experts over time in community question answering. Information Fusion 99 (2023), 101856.
  • Costa and Ortale (2023b) Gianni Costa and Riccardo Ortale. 2023b. Here are the answers. What is your question? Bayesian collaborative tag-based recommendation of time-sensitive expertise in question-answering communities. Expert Systems with Applications 225 (2023), 120042.
  • Dallmann et al. (2021) Alexander Dallmann, Daniel Zoller, and Andreas Hotho. 2021. A case study on sampling strategies for evaluating neural sequential item recommendation models. In Proceedings of the 15th ACM Conference on Recommender Systems. 505–514.
  • Dargahi Nobari et al. (2017) Arash Dargahi Nobari, Sajad Sotudeh Gharebagh, and Mahmood Neshati. 2017. Skill translation models in expert finding. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval. 1057–1060.
  • Dehghan and Abin (2019) Mahdi Dehghan and Ahmad Ali Abin. 2019. Translations Diversification for Expert Finding: A Novel Clustering-based Approach. ACM Trans. Knowl. Discov. Data 13, 3 (2019), 32:1–32:20. https://doi.org/10.1145/3320489
  • Dehghan et al. (2019) Mahdi Dehghan, Maryam Biabani, and Ahmad Ali Abin. 2019. Temporal expert profiling: With an application to T-shaped expert finding. Inf. Process. Manag. 56, 3 (2019), 1067–1079. https://doi.org/10.1016/j.ipm.2019.02.017
  • Faisal et al. (2019) Muhammad Shahzad Faisal, Ali Daud, Abubakr Usman Akram, Rabeeh Ayaz Abbasi, Naif Radi Aljohani, and Irfan Mehmood. 2019. Expert ranking techniques for online rated forums. Computers in Human Behavior 100 (2019), 168–176.
  • Fallahnejad and Beigy (2022) Zohreh Fallahnejad and Hamid Beigy. 2022. Attention-based skill translation models for expert finding. Expert Systems with Applications 193 (2022), 116433.
  • Freeman (1977) Linton C Freeman. 1977. A set of measures of centrality based on betweenness. Sociometry (1977), 35–41.
  • Fu (2019) Chaogang Fu. 2019. Tracking user-role evolution via topic modeling in community question answering. Information Processing & Management 56, 6 (2019), 102075.
  • Fu (2020) Chaogang Fu. 2020. User correlation model for question recommendation in community question answering. Applied Intelligence 50 (2020), 634–645.
  • Fu et al. (2020) Jinlan Fu, Yi Li, Qi Zhang, Qinzhuo Wu, Renfeng Ma, Xuanjing Huang, and Yu-Gang Jiang. 2020. Recurrent memory reasoning network for expert finding in community question answering. In Proceedings of the 13th international conference on web search and data mining. 187–195.
  • Ghasemi et al. (2021) Negin Ghasemi, Ramin Fatourechi, and Saeedeh Momtazi. 2021. User embedding for expert finding in community question answering. ACM Transactions on Knowledge Discovery from Data (TKDD) 15, 4 (2021), 1–16.
  • Giannakidou et al. (2008) Eirini Giannakidou, Vassiliki Koutsonikola, Athena Vakali, and Yiannis Kompatsiaris. 2008. Co-Clustering Tags and Social Data Sources. In 2008 The Ninth International Conference on Web-Age Information Management. 317–324. https://doi.org/10.1109/WAIM.2008.61
  • Hoffa (2019) Felipe Hoffa. 2019. Making Sense of the Metadata: Clustering 4,000 Stack Overflow tags with BigQuery k-means. https://stackoverflow.blog/2019/07/24/making-sense-of-the-metadata-clustering-4000-stack-overflow-tags-with-bigquery-k-means/
  • Hofmann et al. (2016) Katja Hofmann, Lihong Li, Filip Radlinski, et al. 2016. Online evaluation for information retrieval. Foundations and Trends® in Information Retrieval 10, 1 (2016), 1–117.
  • Kasela et al. (2023) Pranav Kasela, Gabriella Pasi, and Raffaele Perego. 2023. SE-PEF: a Resource for Personalized Expert Finding. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. 288–309.
  • Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30 (2017).
  • Krichene and Rendle (2020a) Walid Krichene and Steffen Rendle. 2020a. On sampled metrics for item recommendation. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 1748–1757.
  • Krichene and Rendle (2020b) Walid Krichene and Steffen Rendle. 2020b. On Sampled Metrics for Item Recommendation. In KDD 2020. https://dl.acm.org/doi/10.1145/3394486.3403226
  • Krishna and Antulov-Fantulin (2023) Vaibhav Krishna and Nino Antulov-Fantulin. 2023. Temporal-Weighted Bipartite Graph Model for Sparse Expert Recommendation in Community Question Answering. In Proceedings of the 31st ACM Conference on User Modeling, Adaptation and Personalization. 156–163.
  • Kundu and Mandal (2019) Dipankar Kundu and Deba Prasad Mandal. 2019. Formulation of a hybrid expertise retrieval system in community question answering services. Applied Intelligence 49 (2019), 463–477.
  • Kundu et al. (2019) Dipankar Kundu, Rajat Kumar Pal, and Deba Prasad Mandal. 2019. Finding active experts for question routing in community question answering services. In Pattern Recognition and Machine Intelligence: 8th International Conference, PReMI 2019, Tezpur, India, December 17-20, 2019, Proceedings, Part II. Springer, 320–327.
  • Kundu et al. (2020) Dipankar Kundu, Rajat Kumar Pal, and Deba Prasad Mandal. 2020. Preference enhanced hybrid expertise retrieval system in community question answering services. Decision Support Systems 129 (2020), 113164.
  • Kundu et al. (2021a) Dipankar Kundu, Rajat Kumar Pal, and Deba Prasad Mandal. 2021a. Time-aware hybrid expertise retrieval system in community question answering services. Applied Intelligence 51, 10 (2021), 6914–6931.
  • Kundu et al. (2021b) Dipankar Kundu, Rajat Kumar Pal, and Deba Prasad Mandal. 2021b. Topic sensitive hybrid expertise retrieval system in community question answering services. Knowledge-Based Systems 211 (2021), 106535.
  • Le and Shah (2018) Long T Le and Chirag Shah. 2018. Retrieving people: Identifying potential answerers in community question-answering. Journal of the association for information science and technology 69, 10 (2018), 1246–1258.
  • Li et al. (2002) Longzhuang Li, Yi Shang, and Wei Zhang. 2002. Improvement of HITS-based algorithms on web documents. In Proceedings of the 11th international conference on World Wide Web. 527–535.
  • Li et al. (2019) Zeyu Li, Jyun-Yu Jiang, Yizhou Sun, and Wei Wang. 2019. Personalized question routing via heterogeneous network embedding. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 192–199.
  • Liang et al. (2018) Dawen Liang, Rahul G Krishnan, Matthew D Hoffman, and Tony Jebara. 2018. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 world wide web conference. 689–698.
  • Liang (2019) Shangsong Liang. 2019. Unsupervised Semantic Generative Adversarial Networks for Expert Retrieval. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, Ling Liu, Ryen W. White, Amin Mantrach, Fabrizio Silvestri, Julian J. McAuley, Ricardo Baeza-Yates, and Leila Zia (Eds.). ACM, 1039–1050. https://doi.org/10.1145/3308558.3313625
  • Liu et al. (2022a) Hongtao Liu, Zhepeng Lv, Qing Yang, Dongliang Xu, and Qiyao Peng. 2022a. Efficient Non-sampling Expert Finding. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 4239–4243.
  • Liu et al. (2022b) Hongtao Liu, Zhepeng Lv, Qing Yang, Dongliang Xu, and Qiyao Peng. 2022b. Expertbert: Pretraining expert finding. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 4244–4248.
  • Liu (2009) Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Found. Trends Inf. Retr. 3, 3 (mar 2009), 225–331. https://doi.org/10.1561/1500000016
  • Liu et al. (2005) Xiaoming Liu, Johan Bollen, Michael L Nelson, and Herbert Van de Sompel. 2005. Co-authorship networks in the digital library research community. Information processing & management 41, 6 (2005), 1462–1480.
  • Liu et al. (2022c) Yue Liu, Weize Tang, Zitu Liu, Lin Ding, and Aihua Tang. 2022c. High-quality domain expert finding method in CQA based on multi-granularity semantic analysis and interest drift. Information Sciences 596 (2022), 395–413.
  • Lucchese et al. (2022) Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, and Alberto Veneri. 2022. ILMART: Interpretable Ranking with Constrained LambdaMART. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2255–2259.
  • MacQueen et al. (1967) James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1. Oakland, CA, USA, 281–297.
  • Mumtaz et al. (2019) Sara Mumtaz, Carlos Rodriguez, and Boualem Benatallah. 2019. Expert2vec: Experts representation in community question answering for question routing. In Advanced Information Systems Engineering: 31st International Conference, CAiSE 2019, Rome, Italy, June 3–7, 2019, Proceedings 31. Springer, 213–229.
  • Nobari et al. (2020) Arash Dargahi Nobari, Mahmood Neshati, and Sajad Sotudeh Gharebagh. 2020. Quality-aware skill translation models for expert finding on StackOverflow. Information Systems 87 (2020), 101413.
  • Peng et al. (2023) Qiyao Peng, Hongtao Liu, Zhepeng Lv, Qing Yang, and Wenjun Wang. 2023. Contrastive Pre-training for Personalized Expert Finding. In The 2023 Conference on Empirical Methods in Natural Language Processing.
  • Peng et al. (2022a) Qiyao Peng, Hongtao Liu, Yinghui Wang, Hongyan Xu, Pengfei Jiao, Minglai Shao, and Wenjun Wang. 2022a. Towards a multi-view attentive matching for personalized expert finding. In Proceedings of the ACM Web Conference 2022. 2131–2140.
  • Peng et al. (2022b) Qiyao Peng, Wenjun Wang, Hongtao Liu, Yinghui Wang, Hongyan Xu, and Minglai Shao. 2022b. Towards comprehensive expert finding with a hierarchical matching network. Knowledge-Based Systems 257 (2022), 109933.
  • Qian et al. (2022a) Lingfei Qian, Jian Wang, Hongfei Lin, Bo Xu, and Liang Yang. 2022a. Heterogeneous information network embedding based on multiperspective metapath for question routing. Knowledge-Based Systems 240 (2022), 107842.
  • Qian et al. (2022b) Lingfei Qian, Jian Wang, Hongfei Lin, Liang Yang, and Yu Zhang. 2022b. Multi-hop interactive attention based classification network for expert recommendation. Neurocomputing 488 (2022), 436–443.
  • Rousseeuw (1987) Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20 (1987), 53–65.
  • Roy and Singh (2024) Pradeep Kumar Roy and Jyoti Prakash Singh. 2024. Early prediction of promising expert users on community question answering sites. International Journal of System Assurance Engineering and Management (2024), 1–12.
  • Roy et al. (2018) Pradeep Kumar Roy, Jyoti Prakash Singh, and Amitava Nag. 2018. Finding active expert users for question routing in community question answering sites. In Machine Learning and Data Mining in Pattern Recognition: 14th International Conference, MLDM 2018, New York, NY, USA, July 15-19, 2018, Proceedings, Part II 14. Springer, 440–451.
  • Sang et al. (2019) Lei Sang, Min Xu, ShengSheng Qian, and Xindong Wu. 2019. Multi-modal multi-view Bayesian semantic embedding for community question answering. Neurocomputing 334 (2019), 44–58.
  • Shani and Gunawardana (2011) Guy Shani and Asela Gunawardana. 2011. Evaluating recommendation systems. Recommender systems handbook (2011), 257–297.
  • Sorkhani et al. (2022) Soroosh Sorkhani, Roohollah Etemadi, Amin Bigdeli, Morteza Zihayat, and Ebrahim Bagheri. 2022. Feature-based question routing in community question answering platforms. Information Sciences 608 (2022), 696–717.
  • Sun et al. (2019a) Jiankai Sun, Bortik Bandyopadhyay, Armin Bashizade, Jiongqian Liang, P. Sadayappan, and Srinivasan Parthasarathy. 2019a. ATP: Directed Graph Embedding with Asymmetric Transitivity Preservation. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 265–272. https://doi.org/10.1609/aaai.v33i01.3301265
  • Sun et al. (2019b) Jiankai Sun, Bortik Bandyopadhyay, Armin Bashizade, Jiongqian Liang, P Sadayappan, and Srinivasan Parthasarathy. 2019b. Atp: Directed graph embedding with asymmetric transitivity preservation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 265–272.
  • Sun et al. (2018a) Jiankai Sun, Sobhan Moosavi, Rajiv Ramnath, and Srinivasan Parthasarathy. 2018a. QDEE: Question Difficulty and Expertise Estimation in Community Question Answering Sites. In Proceedings of the Twelfth International Conference on Web and Social Media, ICWSM 2018, Stanford, California, USA, June 25-28, 2018. AAAI Press, 375–384. https://aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/view/17854
  • Sun et al. (2018b) Jiankai Sun, Sobhan Moosavi, Rajiv Ramnath, and Srinivasan Parthasarathy. 2018b. QDEE: question difficulty and expertise estimation in community question answering sites. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 12.
  • Tang et al. (2020) Weizhao Tang, Tun Lu, Dongsheng Li, Hansu Gu, and Ning Gu. 2020. Hierarchical attentional factorization machines for expert recommendation in community question answering. IEEE Access 8 (2020), 35331–35343.
  • Tondulkar et al. (2018) Rohan Tondulkar, Manisha Dubey, and Maunendra Sankar Desarkar. 2018. Get me the best: predicting best answerers in community question answering sites. In Proceedings of the 12th ACM Conference on Recommender Systems. 251–259.
  • Zhang et al. (2020) Xuchao Zhang, Wei Cheng, Bo Zong, Yuncong Chen, Jianwu Xu, Ding Li, and Haifeng Chen. 2020. Temporal Context-Aware Representation Learning for Question Routing. In WSDM ’20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, James Caverlee, Xia (Ben) Hu, Mounia Lalmas, and Wei Wang (Eds.). ACM, 753–761. https://doi.org/10.1145/3336191.3371847