Leveraging Topic Specificity and Social Relationships for Expert Finding in Community Question Answering Platforms

Maddalena Amendola [email protected] 0000−0001−6556−4032 IIT-CNR, ISTI-CNRPisaItaly , Andrea Passarella [email protected] 0000−0002−1694−612X IIT-CNRPisaItaly and Raffaele Perego [email protected] 0000−0001−7189−4724] ISTI-CNRPisaItaly

(2018; 20 February 2007; 12 March 2009; 5 June 2009)

Abstract.

Online Community Question Answering (CQA) platforms have become indispensable tools for users seeking expert solutions to their technical queries. The effectiveness of these platforms relies on their ability to identify and direct questions to the most knowledgeable users within the community, a process known as Expert Finding (EF). EF accuracy is crucial for increasing user engagement and the reliability of provided answers. Despite recent advancements in EF methodologies, blending the diverse information sources available on CQA platforms for effective expert identification remains challenging. In this paper, we present TUEF, a Topic-oriented User-Interaction model for Expert Finding, which aims to fully and transparently leverage the heterogeneous information available within online question-answering communities. TUEF integrates content and social data by constructing a multi-layer graph that maps out user relationships based on their answering patterns on specific topics. By combining these sources of information, TUEF identifies the most relevant and knowledgeable users for any given question and ranks them using learning-to-rank techniques. Our findings indicate that TUEF’s topic-oriented model significantly enhances performance, particularly in large communities discussing well-defined topics. Additionally, we show that the interpretable learning-to-rank algorithm integrated into TUEF offers transparency and explainability with minimal performance trade-offs. The exhaustive experiments conducted on six different CQA communities of Stack Exchange show that TUEF outperforms all competitors with a minimum performance boost of 42.42% in P@1, 32.73% in NDCG@3, 21.76% in R@5, and 29.81% in MRR, excelling in both the evaluation approaches present in the previous literature.

Expert Finding, Community Question Answering, Evaluation methodology, Social User Interactions, Learning to Rank, Explainable models

^†^†copyright: acmlicensed^†^†journalyear: 2018^†^†doi: XXXXXXX.XXXXXXX^†^†conference: Make sure to enter the correct conference title from your rights confirmation emai; June 03–05, 2018; Woodstock, NY^†^†isbn: 978-1-4503-XXXX-X/18/06^†^†ccs: Do Not Use This Code Generate the Correct Terms for Your Paper^†^†ccs: Do Not Use This Code Generate the Correct Terms for Your Paper^†^†ccs: Do Not Use This Code Generate the Correct Terms for Your Paper^†^†ccs: Do Not Use This Code Generate the Correct Terms for Your Paper

1. Introduction

Online Community Question-Answering (CQA) platforms, such as StackOverflow and AskUbuntu, have become indispensable tools for users seeking expert solutions to their technical queries. These platforms are built upon the collaborative efforts of users who pose questions and provide answers, thus creating an extensive repository of shared knowledge. The success of CQA platforms relies on their ability to effectively identify and direct questions to the most knowledgeable experts within the community, thus engaging the users. This crucial process, known as Expert Finding (EF), is vital for ensuring the accuracy and reliability of the answers provided.

The EF task focuses on identifying and recognizing users with a high level of expertise who can offer accurate and timely responses to posted questions. It significantly enhances user engagement, trust, and satisfaction by routing questions to the most knowledgeable community members. Despite advancements in EF methodologies, there are still challenges in effectively integrating the diverse information sources available on CQA platforms. Current proposals for addressing the EF task in CQA contexts rely on information derived from the textual content of questions and answers and from features and signals that model the users’ engagement with this content and their interactions within the networked community. The complexity of such interactions and the dynamic nature of expertise necessitates a comprehensive approach that can adapt to varying contexts and user behaviors, which, to the best of our knowledge, still needs to be fully addressed in the literature.

In this paper, we comprehensively address the EF task by proposing TUEF, a Topic-oriented User-Interaction model for Expert Finding, which aims to fully and transparently utilize the vast amount of heterogeneous information available within online question-answering communities. TUEF integrates content and social data by constructing a topic-based multi-layer graph that maps out user relationships based on their topical answering patterns. By combining these sources of information, TUEF aims to identify the most relevant and knowledgeable users for any given question. TUEF operates by first generating the multi-layer graph, where each layer represents a major topic within the community. These topics are automatically identified through the analysis of tags from past questions. Within each layer, nodes represent active users in topic discussions, and edges model the similarities and relationships among these users. When a question is posed, TUEF uses the multi-layer graph and a ranking model to determine the relevant topics and corresponding graph layers. In each relevant layer, TUEF selects candidate experts from two perspectives: (i) Network Perspective: identifying central users who have significant influence within the community; (ii) Content Perspective: identifying users who have previously answered similar questions. Navigating the graph from seed nodes identified according to these criteria, TUEF explores and collects candidate experts through appropriate exploration policies. It then extracts static and query-dependent features based on text and graph relationships for these candidates. Finally, TUEF applies Learning-to-Rank techniques to score and rank the candidates by their expected relevance to the question.

This work builds upon a previous paper (Amendola et al., 2024), extending it with the following main new contributions:

•

We introduce a new categorization of state-of-the-art EF models based on the evaluation approach they adopt: Expert Ranking and Expert Subsample Ranking. Such categorization contributes to the definition of good practices for a fair EF model evaluation.
•

We conduct an ablation study of TUEF across six scientific communities: StackOverflow, Unix, AskUbuntu, Server Fault, Physics, and Mathematics. This analysis allows us to understand the contribution of the various CQA information sources and TUEF components to the end-to-end predictive performance and to validate the generality of TUEF performance.
•

We integrate an interpretable Learning-to-Rank algorithm (Lucchese et al., 2022) into the TUEF framework and provide examples of interpretation, showing how we can make the model fully transparent without losing its accuracy.
•

We compare TUEF with state-of-the-art competitor models within the Expert Ranking category, to which TUEF belongs. Moreover, we adapt TUEF for a fair comparison with state-of-the-art models in the Expert Subsample Ranking category, accommodating this different evaluation configuration. We thus highlight the overall advantage of TUEF concerning relevant alternatives in the state of the art.
•

We study the ability of TUEF to scale with larger datasets by considering four datasets corresponding to 1, 2, 3, and 4 months of StackOverflow data.

Our findings indicate that the multi-layer graph significantly enhances performance when the topics discussed in the community involve a well-defined clustering of question tags, as in StackOverflow. In the cases of small communities characterized by a less broad distribution of discussion topics, TUEF using single- or multi-layer graphs exhibit comparable performance. As discussed in (Amendola et al., 2024), content information consistently emerges as the most crucial component for EF in all communities. Additionally, we show that integrating an interpretable Learning-to-Rank solution into TUEF typically results in a slight performance decrease. However, particularly in some communities, the reduction in prediction performance is minimal, making the interpretable version of TUEF a valuable option for gaining insights into the decision-making process. Finally, the results of extensive experiments conducted on CQA communities with varying characteristics show that TUEF surpasses all baseline models, excelling in both the Expert Ranking and Expert Subsample Ranking evaluation categories.

The article is structured as follows: in Section 2, we present state-of-the-art models divided based on two different points of view: (i) the information used (Text-, Feature-, and Network-based methods) and (ii) the evaluation approach adopted (Expert Ranking and Expert Subsample Ranking). Next, Section 3 details the TUEF framework. Following, in Section 4, we present the experimental setup and detail the communities examined, the settings of TUEF hyperparameters, and the metrics used to assess its performance. Finally, in Section 5, we present the results of the TUEF ablation study across all communities and the comparison between TUEF and the state-of-the-art baselines under the two evaluation-based categories.

2. Related Work

In this section, we categorize state-of-the-art models from two perspectives: the type of information the models use (textual, features, and network) and the evaluation methodology the authors adopt for their assessment (with or without subsampling).

2.1. Information-based classification

Research proposals in the field of the EF task for CQA platforms can be grouped into three broad groups based on the information they explore to address the task: text-based, feature-based, and network-based methods.

Text-based methods

A variety of methods have been proposed to address the EF task by leveraging similarities between current and previously answered questions. Dehghan and Ansell (Dehghan and Abin, 2019) introduced a model that clusters question terms based on semantic similarity and co-occurrence. Similarly, Liang et al. (Liang, 2019) proposed a semantic unsupervised generative adversarial network to assess similarities between word representations and experts. Moreover, Dehghan et al. (Dehghan et al., 2019) modeled user expertise by utilizing the tree structure of various domains, tags, and the temporal dimension of user response behavior. In a related vein, Zhang et al. (Zhang et al., 2020) considered the temporal dimension by proposing models with multi-shift and multi-resolution settings to capture temporal dynamics effectively. Fu et al. (Fu et al., 2020) introduced a novel approach using the Recurrent Memory Reasoning Network (RMRN), which employs different reasoning memory cells equipped with attention mechanisms to focus on various aspects of the question. Building on the concept of attention mechanisms, Peng et al. (Peng et al., 2022a, b; Liu et al., 2022b; Peng et al., 2023) implemented these mechanisms in multi-view, multi-grained, semi-supervised pre-trained, and pre-trained models with personalized fine-tuning for expert finding. Moreover, Qian et al. (Qian et al., 2022b) proposed a Multi-Hop Interactive Attention-based Classification Network (MIACN) that utilizes attention mechanisms to detect latent interactions among question subjects and bodies.

Feature-based methods

The second group of methods in EF relies on hand-crafted features to model the expertise of community members. Roy et al. (Roy et al., 2018) formalized a scoring function to capture various aspects of expertise, including the propensity to answer questions related to profile tags, the ability to provide accepted answers, and recent activity levels. Mumtaz et al. (Mumtaz et al., 2019) proposed a framework integrating activity, community, and time-aware features, with the temporal aspect also considered in (Fu, 2019) to track and incorporate the evolution of user roles, as well as in (Kundu et al., 2019). Similarly, Tondulkar et al. (Tondulkar et al., 2018) included features that favor experts providing high-quality answers to complex questions, utilizing a comprehensive set of features capturing user availability and knowledge, and applying Learning-to-Rank (LtR) methods for expert ranking. The quality and consistency of answers are further addressed by Faisal et al. (Faisal et al., 2019), who proposed a model based on an adaptation of the bibliometric g-index. LtR techniques are further utilized by Sorkhani et al. (Sorkhani et al., 2022), introducing 74 content-based and social-based features. An important aspect in addressing the EF task is considering the intimacy between the asker and answerer, studied by Fu et al. (Fu, 2020). As in text-based approaches, Tan et al. (Tang et al., 2020) employed attention mechanisms by proposing a method using Hierarchical Attentional Factorization Machines (HAFM). This approach combines factorization machines with hierarchical attention mechanisms to effectively model user-expert interactions and to highlight the importance of various features in expert recommendation tasks. The hierarchical attention mechanism helps in prioritizing relevant information at both the user and expert levels, enabling more accurate and personalized recommendations. Contrary to many studies focusing primarily on identifying the most senior platform users as experts, Roy et al. (Roy and Singh, 2024) aim to predict promising expert users at an early stage in community question answering sites.

Network-based methods

Network-based methods integrate interactions and relationship information within networks. Kundu et al. (Kundu and Mandal, 2019) defined a framework that comprises a text-based component for assessing expert knowledge on specific topics, accompanied by a Competition Based Expertise Network (CBEN) (Aslay et al., 2013). This network uses link analysis techniques, such as AuthorRank (Liu et al., 2005) and Weighted HITS (Li et al., 2002). The CBEN approach is further utilized in (Kundu et al., 2020), where connections between two users are established if they have responded to the same question. Additionally, this study incorporates both intra-profile and inter-profile preferences of community users. In (Kundu et al., 2021b), the same authors propose a topic-sensitive hybrid expertise retrieval system (TSHER) that integrates assessments of knowledge, reputation, and authority. Le et al. (Le and Shah, 2018) measure the similarity between answerer and asker using social network techniques, employing Random Walks with Restart to calculate the proximity between two nodes.

Contrarily, Sun et al. (Sun et al., 2018a) propose a language-agnostic perspective of user expertise, constructing a competition graph with user and question nodes where edges represent increasing levels of question difficulty, thus depicting the hierarchical structure of questions. This concept of hierarchy is further explored in (Sun et al., 2019a). Addressing data sparsity, Sang et al. (Sang et al., 2019) introduced a Multi-modal Multi-view Semantic Embedding (MMSE) framework that learns semantic embeddings from both local and global perspectives, incorporating social structure information to enhance embedding quality. Ghasemi et al. (Ghasemi et al., 2021) focus on user embedding, proposing a joint model for text and node similarity.

Moreover, Li et al. (Li et al., 2019) represent the CQA platform as a Heterogeneous Information Network (HIN), aiming to learn embeddings for nodes representing question contents, raisers, and answerers. They employ an LSTM-equipped Metapath-based Embedding algorithm and a CNN scoring function to rank experts. The use of HIN and metapath-based algorithms is also noted in (Qian et al., 2022a). Liu et al. (Liu et al., 2022c) apply advanced semantic analysis and interest drift modeling to evaluate experts based on the relevance and depth of their contributions in specific domains. The concept of interest drift is similarly addressed by Krishna et al. (Krishna and Antulov-Fantulin, 2023), who propose a simple graph diffusion-based expert recommendation model that accounts for semantic and temporal information. Temporal dynamics in expertise assessment are further examined by Costa et al. in (Costa and Ortale, 2023b, a), who introduce temporally-discounted, tag-based models.

2.2. Evaluation-based classification

The EF task is commonly cast to an item recommendation task in which the items are community users to be ranked by a measure of likelihood they can effectively answer the new question. The users with the highest scores are the ones to whom the new question should be routed. When evaluating recommender systems, there are two important methods to consider: online evaluation and offline evaluation. Online, end-to-end evaluation is the primary and most reliable way to evaluate a recommender system (Hofmann et al., 2016; Liang et al., 2018), but it is not applicable during the development of a new model (Dallmann et al., 2021). On the other hand, offline evaluation is the main instrument available for academic research and allows for exploring hyperparameter settings (Cañamares and Castells, 2020; Shani and Gunawardana, 2011). It is commonly used to test a new model’s effectiveness before moving to online evaluation. Moreover, in item recommendation tasks, the catalog of items to retrieve from is usually large, and finding matching items from this large pool is challenging (Krichene and Rendle, 2020a). Considering these factors, recent studies have speeded up the evaluation process by calculating the performance metrics of interest on a target set only. This target set is usually a small sample of the items possibly matching the query that includes all relevant items and a defined number of negative (non-relevant) items. However, in recent years, researchers have analyzed and critically evaluated these sample metric strategies. Krichene et al. (Krichene and Rendle, 2020a) thoroughly investigated sampled metrics, revealing that they are inconsistent with their exact version. This finding is significant as it means that these sampled metrics do not persist relative statements, such as recommender A is better than B, not even in expectation. Furthermore, they found that the smaller the sampling size, the less difference between metrics. Canamares et al. (Cañamares and Castells, 2020) found that comparative evaluation using reduced target sets contradicts, in many cases, the corresponding outcome using large targets. Finally, Dallmann et al. (Dallmann et al., 2021) explored two widely used sampling strategies, sampling by popularity and uniform random sampling, and found that both can produce inconsistent rankings compared with the full ranking of the models and consistently produce different ranking when compared over different sample sizes.

When considering the EF context, one effective strategy to simplify the task and reduce complexity is identifying expert users based on specific criteria, such as their consistent provision of high-quality answers. However, it’s crucial to acknowledge that this approach can intensify the cold-start problem, potentially leading to the exclusion of users who could be highly relevant to new questions. Despite this, the method significantly accelerates metric computation and enhances system accuracy. So far, different strategies have been adopted to define a subset of users representing the experts: 10% of the most active users (Kundu et al., 2021b; Qian et al., 2022a; Tang et al., 2020; Peng et al., 2022a, b; Liu et al., 2022b; Peng et al., 2023; Roy and Singh, 2024; Li et al., 2019; Kundu and Mandal, 2019; Liu et al., 2022a), all users with the number of accepted answers greater than a threshold (Kundu et al., 2021a; Krishna and Antulov-Fantulin, 2023; Qian et al., 2022b; Sun et al., 2018b, 2019b; Fu et al., 2020; Fu, 2020; Kundu et al., 2020; Fu, 2019), or a two-stage approach that considers the acceptance ratio (Nobari et al., 2020; Fallahnejad and Beigy, 2022; Kasela et al., 2023; Dargahi Nobari et al., 2017; Mumtaz et al., 2019). No specific heuristic can determine whether a user should be considered an expert. However, to conduct a fair comparison among different models for the EF task, it is important to distinguish studies according to the sampling approach adopted during the ranking phase, ensuring that the models are compared under the same assumptions. Specifically, we recognize two different methods:

•

Experts Ranking: For a new question, this group of works essentially ranks all users labeled as expert users in the first stage. The limitation of this approach during the evaluation phase is that it requires the test set to consist of questions for which the best answerer is a user who was previously labeled as an expert. On the other hand, this evaluation methodology provides realistic figures of the end-to-end performance of the observed system. The following works belong to this group: (Nobari et al., 2020; Kundu et al., 2021b; Fallahnejad and Beigy, 2022; Liu et al., 2022c; Costa and Ortale, 2023b; Kundu et al., 2021a; Costa and Ortale, 2023a; Krishna and Antulov-Fantulin, 2023; Qian et al., 2022b; Roy and Singh, 2024; Sun et al., 2019b; Fu, 2019; Fu et al., 2020; Mumtaz et al., 2019; Kundu and Mandal, 2019; Sun et al., 2018b; Fu, 2020; Kundu et al., 2020).
•

Experts Subsample Ranking: Given the new questions along with the information of all the users who actually answered the question (ground truth), the studies in this group rank only a fixed-size subsample of users. This set of users is generally not large and always includes the ground truth users plus some other users who didn’t answer the question (negative examples). In most of the works following this approach the size of the subsample is 20 and negative examples chosen randomly among the 10% most active users are added to the subsample until it reaches the desired cardinality. While this approach overcomes one limitation of the Experts Ranking methodology by requiring only the knowledge of the answers of the questions in the training set, it relies on a sampling strategy that might produce inconsistent rankings if the sample size is varied and can result in unreliable system rankings(Krichene and Rendle, 2020a; Cañamares and Castells, 2020; Dallmann et al., 2021). The works belonging to this group include the following: (Sang et al., 2019; Ghasemi et al., 2021; Qian et al., 2022a; Tang et al., 2020; Sorkhani et al., 2022; Peng et al., 2022a, b; Liu et al., 2022b; Peng et al., 2023; Li et al., 2019; Zhang et al., 2020; Liu et al., 2022a).

Refer to caption — Figure 1. Illustration of the TUEF approach highlighting the distinct components. At inference time, TUEF first determines the main topics to which the question $q$ belongs and the corresponding graph layers (Multi-Layer Graph). Next, for each layer, it selects the candidate experts from two perspectives: i) Network, by identifying central users that may have considerable influence within the community; ii) Content, by identifying users who previously answered questions similar to $q$ . The Multi-Layer Graph is used to collect candidate experts through Random Walks (Expert Selection). Following, TUEF extracts features based on text, tags, and graph relationships for each selected experts (Feature Extraction). Finally, TUEF uses a learned, precision-oriented model to score the candidates and rank them by expected relevance (Experts Ranking).

3. Topic-oriented User-interaction model for Expert Finding (TUEF)

The TUEF framework leverages topic-oriented content similarity and user relationships to facilitate the selection and ranking of experts. This approach is based on two fundamental considerations:

•

CQA platform users utilize tags to characterize their questions and increase the likelihood of receiving relevant answers. The questions’ tags help identify topical areas within the community relevant to specific questions.
•

Unlike traditional social networks, user relationships on CQA platforms are not explicitly defined. However, implicit relationships can be discerned by analyzing users’ question-answering behaviors. The knowledge gained from these interactions can enhance answer quality and user engagement with the platform.

Figure 1 illustrates the logical organization of TUEF. To effectively integrate the valuable information derived from tags, content, and user relationships, TUEF adopts a multi-layer graph representation (Section 3.1) that encapsulates the macro topics discussed within the online community. Each layer of this graph models user relationships at the level of specific macro topics, considering their similarities based on tags, questions, and answering behaviors. TUEF identifies community experts—users known for consistently providing accepted answers—and implements an exploratory algorithm that fully exploits the multi-layer graph structure to select a group of candidate experts for each relevant macro topic associated with the current question (Section 3.2).

This exploratory algorithm integrates social and content-based information, considering experts’ centrality within the network and their expertise in the topics of interest. Central users often represent individuals who have garnered significant attention and engagement within the community. Similarly, users close to the query are more likely to connect with experts relevant to the specific topic of interest. By strategically exploring the multi-layer graph starting from these nodes, the algorithm increases the probability of selecting appropriate experts early in the process.

Finally, TUEF extracts features that represent the identified candidate experts and applies Learning-to-Rank techniques (Liu, 2009) (Section 3.3) to rank them based on their expertise and likelihood of effectively answering the question. This systematic approach ensures the selection of high-quality experts tailored to the specific information needs of the user asking the question.

3.1. User Interaction Model

In TUEF, user relationships related to each specific topic are independently modeled. From here on, let $U$ denote the set of active users on the CQA platform under consideration, and let $Q$ represent the set of past questions posted and answered by users in $U$ .

Topic Identification.

TUEF employs a tag clustering approach to identify the main topics discussed on the CQA platform based on past questions. The clustering approach, discussed in (Hoffa, 2019), utilizes the k-means algorithm (MacQueen et al., 1967) on a tag co-occurrence matrix $M$ . Tags’ co-occurrence patterns are leveraged to group semantically related tags, as tags often associated with the same questions tend to be conceptually similar (Giannakidou et al., 2008). For each question $q\in Q$ , the tags associated with $q$ are denoted as $tags(q)$ . Let $T$ be the set of all unique tags across questions in $Q$ , and let $F$ represent the set of the most frequent tags (top $\lambda$ ) that serve as clustering features. The co-occurrence matrix $M^{|T|\times|F|}$ is constructed as follows:

(1)

m_{i,j}=|\{q\in Q\;|\;\{t_{i},f_{j}\}\subseteq tags(q),\;t_{i}\in T,\;f_{j}\in F\}|

Here, $m_{i,j}$ indicates the number of questions in $Q$ where the $i_{th}$ tag and the $j_{th}$ feature co-occur. Finally, the matrix rows are normalized to express the relative frequency of tag co-occurrences with each feature. For a given tag $t_{i}$ and feature $f_{j}$ , the normalized matrix $\hat{M}$ is defined as:

(2)

\hat{m}_{i,j}=\frac{m_{i,j}}{\sum_{n=1}^{|F|}{m_{i,n}}}

Clustering the tag co-occurrence matrix to devise macro topics is based on the observation that CQA users often pair broad, common tags with more specific ones to achieve greater precision in topic identification. By considering the co-occurrences of specific tags with broader tags, we can enhance the categorization of questions. Using the normalized matrix $\hat{M}$ and a specified value of k for the k-means algorithm, the clustering process yields k disjoint clusters of tags representing primary community domains. The optimal k value is determined based on silhouette maximization criteria (Rousseeuw, 1987), as described in Section 4. This comprehensive approach to topic identification and clustering ensures that TUEF effectively captures and represents the main topical areas discussed within the online community, facilitating a good understanding of community dynamics and interactions.

Multi-Layer Graph.

TUEF meticulously models user relationships within each layer by treating users $U$ as nodes in a multi-layer graph and establishing connections based on similar patterns of providing accepted answers within specific layers. Formally, the structure of TUEF is defined using a multi-layer graph $G=[L_{1},...,L_{k}]$ , where each layer $L_{i}$ corresponds to a tag cluster. Each layer $L_{i}=(V_{i},E_{i})$ is an independent graph, with nodes $V_{i}$ representing users associated with the layer and edges $E_{i}$ denoting their relationships. To ensure consistent and accurate answers within a specific layer, TUEF includes in $V_{i}$ only those users who have provided a number of accepted answers equal to or exceeding a given $\epsilon$ percentile. Moreover, to characterize users’ knowledge in $V_{i}$ , each user $u\in V_{i}$ is assigned a topic vector $b^{i}_{u}$ . This vector represents the user’s contribution to various tags within the layer, normalized by their total contribution across all layers. Specifically, for a tag $t_{j}$ within $L_{i}$ , the $j_{th}$ position of $b^{i}_{u}$ is calculated as:

(3)

b^{i}_{u}[j]=\frac{accepted(L_{i},u,t_{j})}{\sum_{\forall z,m}accepted(L_{z},u% ,t_{m})}

Here, $accepted(L_{i},u,t_{j})$ denotes the number of accepted answers provided by user $u$ for questions labeled with tag $t_{j}$ in layer $L_{i}$ . Finally, after computing the topic vectors for all the users in $V_{i}$ , the cosine similarity is calculated between each pair of users within the layer. If the similarity between two users $u_{a}$ and $u_{b}$ exceeds a predefined threshold $\delta$ , an edge $(u_{a},u_{b})$ weighted by the cosine similarity value is added to $E_{i}$ .

It is crucial to emphasize that each question in TUEF is associated with tags assigned to specific layers within $G$ . Consequently, a question can belong to multiple layers, and users answering these questions are represented across all associated layers. This multidimensional representation of user expertise and relationships across layers enables TUEF to capture a comprehensive view of users’ knowledge and interactions within the CQA platform. This approach ensures that users’ expertise and social connections transcend individual layers, reflecting the nature of community participation and expertise within the platform.

3.2. Expert Selection

The Expert Selection component, as depicted in Algorithm 1, has the goal of selecting for each new question $q$ submitted to the CQA platform a set of users who are likely to be highly knowledgeable of the topics of $q$ , i.e., the set of candidate experts. The component operates iteratively for each layer to which question $q$ belongs. This selection process is structured into three distinct phases: (i) Sorting phase:, where the nodes within each layer are arranged in a specified order to facilitate subsequent expert selection (line 2); (ii) Collection phase:, which selects an initial set of candidate experts from the sorted node lists (line 4); (iii) Exploratory phase:, that expands the initial set by exploring the graph structure to identify additional potential experts (line 6). The remainder of this section investigates the Expert Identification process, which classifies users $u\in U$ as either experts or non-experts, followed by a detailed exploration of the three phases mentioned above.

Input : the layer

L_{i}

, the question

q

the method

m

(Network or Content),

the probability

p

Output : the set

D_{i}

of candidate experts for

q

L_{i}

1 //sort

L_{i}

nodes based on the method criteria;

sorted\_nodes\leftarrow sort\_nodes(L_{i},\>q,\>m)

;

3 //select the initial set of experts based on probability;

initial\_set\leftarrow reach\_probability(sorted\_nodes,p)

;

5 //expand the initial set by exploring the graph;

D_{i}\leftarrow explore\_graph(L_{i},\>initial\_set)

;

return $D_{i}$

Algorithm 1 Expert Selection

Expert Identification

Experts in CQA platforms can be identified heuristically by considering the number of accepted answers they provided in the past. Another signal of user trust is given by the acceptance ratio $r_{u}$ , i.e., the ratio between the number of accepted answers and the total number of answers a given user provides. As in (Dargahi Nobari et al., 2017), in TUEF, we adopt an expert selection criterion based precisely on these two measures: first, we select the set $C\subseteq U$ of candidate experts by considering all the users having a number of accepted answers greater or equal to a specified threshold $\beta$ : next, we label as experts the set $E\subseteq C$ of users whose acceptance ratio $r_{u}$ is greater than the overall average $\overline{r}$ :

(4)

E=\{u\in C\;|\;r_{u}>\overline{r}\},\quad\overline{r}=\frac{\sum_{u\in C}{r_{u% }}}{|C|}

Each layer within the multi-layer graph comprises nodes labeled as experts and non-experts. While the former are the primary targets of the Expert Selection process, the latter facilitates comprehensive exploration of the graph structure.

By implementing this process, we aim to identify users who consistently provide valuable and accepted solutions within the community, as they are likely to be proficient candidates for effectively addressing new questions posted on the platform.

Sorting

TUEF adopts a comprehensive approach to identify candidate experts for a given question $q$ by leveraging content and social information. This process involves two key perspectives: the Network-based perspective, focusing on users’ centrality within the network, and the Content-based perspective, assessing users’ relevance to the newly posted query based on their past interactions. In the Network-based approach, TUEF utilizes Betweenness centrality (Freeman, 1977), which quantifies the centrality of nodes within a graph by evaluating their influence over information flow. Nodes with higher Betweenness centrality are Considered more central within the network. Each node $v_{j}^{i}$ in layer $L_{i}$ is assigned a Betweenness score $s_{j}^{i}$ , which is used to sort nodes in descending order. On the other hand, the Content-based approach involves sorting layer nodes according to their similarity to the query $q$ based on questions previously answered by experts. This is achieved using Information Retrieval techniques and pre-built indexes, specifically the TextIndex and TagIndex, which index the text and associated tags of historical questions, respectively. When a new question is presented with its tags, the Content-based method employs a retrieval model (we specifically use BM25) independently on both indexes to retrieve a sorted list of relevant questions, along with information about the experts who provided accepted answers. The query for the TextIndex includes the concatenated question, title, and body, while for the TagIndex it consists of the concatenated question tags. Subsequently, the lists retrieved from each index are merged alternately, preserving the original order of elements.

This approach results in a query-specific ordering of nodes within each layer $L_{i}$ , effectively leveraging network centrality and content relevance to identify and rank candidate experts tailored to the context of the new query $q$ . The combination of these perspectives enhances the precision and relevance of expert selection within the TUEF framework.

Candidate Collection

In the candidate collection phase, the objective is to select a subset of experts $D_{i}\subseteq V_{i}$ from each layer $L_{i}$ . This process balances the need for a sufficiently large candidate pool to ensure high recall (the probability of obtaining the correct answer) with the desire to include relevant experts for high precision. To achieve a balance between precision and recall, we estimate the probability $p$ of not receiving an answer from any user in $D_{i}$ . Initially, $p$ is set to 1 when $D_{i}$ is empty and is incrementally reduced as experts are added to $D_{i}$ . Each expert $u$ added to $D_{i}$ contributes to lowering $p$ based on their acceptance ratio $\mu_{u}$ , defined as the ratio of accepted answers to total answers provided by the user. Furthermore, to refine the modeling of topic-based expertise, we smooth $\mu_{u}$ by considering the user’s activity within the specific layer: the smoothing factor is computed by adjusting $\mu_{u}$ based on the ratio of expert answers within the layer to the maximum number of answers provided by any user within that layer. The probability $p$ is updated iteratively using the formula:

(5)

p=p\cdot(1-\mu_{u})

This iterative process continues until $p$ becomes less than or equal to a predefined threshold $\alpha$ . Once the threshold $\alpha$ is met, the inclusion of new candidates to $D_{i}$ is halted, and the exploratory phase of the expert selection process begins. This strategic approach ensures that the candidate pool $D_{i}$ comprises experts likely to provide valuable and accurate answers to the specific question while representing good starting points for MLG exploration.

Exploratory Phase

In the Exploratory phase, TUEF explores the graph structure to identify additional candidate experts that may not have been identified with the previously described phases, specifically exploiting the implicit relationships between users, aiming to enhance recall. The starting point for this exploration is the set of experts $D_{i}$ from each layer $L_{i}$ . Specifically, for each node $v_{i}\in D_{i}$ , TUEF initiates a series of fixed-length Random Walks. At each step of the Random Walk, the next node to visit is selected randomly based on a probability distribution $d$ , computed considering the neighbors of the current node $v_{i}$ and their link weights, representing the similarities with neighboring nodes.

More formally, let $\mathcal{N}i={v{i1},\ldots,v_{iN_{i}}}$ denote the neighbors of expert node $v_{i}$ , where $N_{i}=|\mathcal{N}_{i}|$ is the total number of neighbors of $v_{i}$ . The probability $d_{j}$ of visiting a neighbor node $v_{ij}$ is calculated as:

(6)

d_{j}=\frac{w_{ij}}{\sum_{z=1}^{|N_{i}|}{w_{iz}}}

Here, $w_{ij}$ represents the weight of the edge between node $v_{i}$ and its neighbor $v_{ij}$ . Nodes with stronger connections to the current node $v_{i}$ are more likely to be selected during the Random Walks. Whenever an expert node is encountered during the exploration, it is added to the current set of candidate experts if not already included.

This exploratory process is carried out independently for each layer to which the question belongs, ensuring a comprehensive exploration of both Network-based and Content-based perspectives within each layer. Combining Random Walks with tailored probability distributions enriches the candidate pool $D_{i}$ with potentially neglected experts, enhancing the overall recall and effectiveness of the expert selection process in the TUEF framework.

3.3. Ranking candidate experts

TUEF leverages Learning to Rank (LtR) algorithms (Liu, 2009) to learn an effective ranking function from training data. LtR methods exploit a labeled dataset to learn a scoring function $\sigma$ that approximates the ideal ranking function inherent in the training examples. TUEF selects a subset of past questions, used to model user relationships as discussed in Section 3.1, to serve as the training set $T\subset Q$ for the LtR algorithm. Each query $q$ in the training set $T$ is associated with a set of candidate experts $CE\subset U$ . Moreover, for each query-candidate pair $(q,u_{i})$ , where $q\in T$ and $u_{i}\in CE$ , there exists a relevance judgment $l_{i}$ that indicates whether $u_{i}$ is an expert for query $q$ . Each query-candidate pair $(q,u_{i})$ is represented by a feature vector $x$ that encapsulates information about the query, the candidate expert, and their relationship. The LtR algorithm learns a function $\sigma(x)$ that predicts a relevance score for the input feature vector $x$ . This learned function $\sigma(x)$ is subsequently used during inference to compute scores for candidate experts and rank them accordingly.

The features used to model the query and candidate expert can be categorized into two groups: Static features and Query-dependent features. The group of Static features comprises those that remain the same for each query: the Reputation of the expert, the number of Answers and AcceptedAnswers, the Ratio (i.e., the ratio of answers to accepted answers), and AvgActivity and StdActivity, representing the average and standard deviation of time intervals between consecutive answers provided by the expert. The Query-dependent features, instead, are computed every time for each query and they include:

•

LayerCount: Number of distinct graph layers in which the expert is selected during the Expert Selection process.
•

QueryKnowledge: Ratio of answers to accepted answers provided by the expert in layers relevant to the query $q$ .
•

VisitCountContent and VisitCountNetwork: Total number of times an expert is encountered in the Collection and Exploratory phases using Content-based and Network-based techniques.
•

StepsContent and StepsNetwork: Number of steps required to discover the expert during Collection or Exploratory phases.
•

BetweennessPos and BetweennessScore: Expert’s rank in the list of users ordered by Betweenness score and the Betweenness score itself.
•

ScoreIndexTag and ScoreIndexText: Sum of BM25 scores of historical questions answered by the expert in IndexTag and IndexText, respectively.
•

FrequencyIndexTag and FrequencyIndexText: Number of distinct questions answered by the expert returned by the respective indexes.
•

Eigenvector, PageRank, Closeness, Degree, AvgWeights: Centrality measures and network features computed based on the graph layers relevant to the query.

Specific rules are applied to aggregate feature values for candidate experts selected in multiple layers. Network-based features (e.g., BetweennessScore, Closeness, PageRank) consider maximum values, while Content-based features (e.g., FrequencyIndex, ScoreIndex, QueryKnowledge) utilize summation. Additionally, certain features like StepsContent, StepsNetwork, and BetweennessPos use minimum values to characterize the candidate experts effectively within the ranking framework. These features collectively contribute to the comprehensive ranking process in TUEF, accommodating both static and query-specific attributes of candidate experts for optimal ranking performance.

4. Experimental Setup

The following section presents the experimental setup and details the datasets used, the settings of the TUEF hyperparameters, and the metrics used to assess performance.

4.1. Datasets

We selected six real-world datasets from the most extensive communities on the StackExchange¹¹1https://stackexchange.com/ platform, namely StackOverflow, Unix, AskUbuntu, ServerFault, Physics, and Mathematics. These datasets are publicly accessible²²2https://archive.org/details/stackexchange and encompass all the questions and answers posted by StackExchange users within their respective communities. Each question has a title, a body, and a list of tags. Moreover, the dataset for each community includes all the posted answers for each question. Besides question and answer content, we know the positive or negative votes received by a question or an answer, the number of views, the number of users that selected a given question as a favorite one, and the comments that other users might have written under a question or an answer. Questions can be categorized as closed, indicating the presence of an accepted answer, or open, which may have no answers. Information about the user who posted a question or an answer is available. Since an answer can be labeled as accepted only by the user who posted the originating question, the authors of accepted answers provide the human-assessed ground truth for the EF task. Additionally, each participating user in the community has associated additional information, such as a Reputation score, which is also useful for the EF task.

4.2. Parameters setting

We pre-processed the data to ensure a good set of questions and answers. Moreover, the different steps of TUEF require setting some parameters detailed below.

Pre-processing.

We pre-processed the data to ensure a good set of questions and answers. Moreover, the different steps of TUEF require setting some parameters detailed below. We applied established data cleaning procedures, as outlined in (Mumtaz et al., 2019), to ensure the quality of our dataset. We opted for an eight-year timespan for most communities, excluding StackOverflow and Mathematics. The decision to use a shorter timeframe for these two communities arises from their high daily question volumes, ensuring a more balanced representation across all communities. This strategy aims to maintain a roughly equivalent dataset size for each community. We removed questions and answers without a specified question’s ID and OwnerUserID (i.e., the ID of the user who posted the question). Subsequently, we retained questions with an AcceptedAnswerID (i.e., closed questions) and answers with a valid ParentID (i.e., the respective question ID). Questions where the asker and the best answerer were the same user were also excluded. We split the dataset, allocating 80% for training/validation and the remaining 20% for the test set. Notably, we maintained the chronological order of the questions to preserve temporal integrity. Comprehensive statistics for the resulting datasets are provided in Table 1.

User Interaction Model.

To identify the primary topics discussed in the community, we employed the clustering technique outlined in Section 3.1, focusing on the tags associated with questions in the training set. Considering the top $\lambda=10$ most frequent tags as features, we determined the optimal number of clusters within the range $K=[2,10]$ that maximizes the Silhouette score. Each cluster represents a specific macro-area, corresponding to a layer in the MLG, which captures interactions among users who have provided a number of accepted answers equal to or exceeding the $\epsilon=90th$ percentile. Users are represented by their topic vectors used to compute the pair-wise cosine similarities (see Section 3.1). When modeling relationships, we retained edges with a similarity equal to or greater than $\delta=0.5$ .

Expert Selection.

We identify as experts the users with a number of accepted answers greater than or equal to the $\omega=95th$ percentile of the distribution, following the procedure detailed in Section 3.2. The minimum number of accepted answers and the total number of users labeled as experts for each dataset are outlined in Table 1 under the columns MinAccAns and Experts, respectively. TagIndex and TextIndex return the 1,000 most similar past questions for each query in the expert selection process, respectively. For both Network and Content methods, after node sorting, the collection phase identifies the initial set of experts $D$ that reaches a probability threshold of $p=0.001$ , representing the probability of not receiving an answer. Subsequently, starting for each selected expert in $D$ , we conduct 5 Random Walks of, at most, 10 steps.

Ranking.

We utilize the LightGBM³³3https://lightgbm.readthedocs.io/en/stable/ (Ke et al., 2017) implementation of LambdaMART (Burges, 2010) to learn the TUEF ranking model. The LtR training set is constructed using queries from the training dataset where the accepted answerer is labeled as an expert, including $0<\zeta\leq 50,000$ queries ordered from the most recent to the oldest. It is important to note that this is a TUEF parameter: if the dataset contains more than 50,000 queries answered by experts, TUEF will take the latest 50,000; otherwise, it will consider only the available queries. After performing the MLG exploration for each query of the LtR training set, we excluded all the queries for which TUEF couldn’t include in the candidate set the expert who provided the accepted answer. For the remaining queries, we extracted features for each query-candidate expert pair, as outlined in Section 3.3. The training set was split into training and validation sets following an 80/20 split. Hyper-parameter tuning is carried out using MRR on the validation set, leveraging the HyperOpt library (Bergstra et al., 2013), and optimizing four learning parameters: $learning\_rate\in[0.0001,0.15]$ , $num\_leaves\in[50,200]$ , $n\_estimators\in[50,150]$ , $max\_depth\in[8,15]$ , and $min\_data\_in\_leaf\in[150,500]$ .

Table 1 shows the number of queries used for the LtR and the average length of expert lists to rank under the columns LtR and AvgList, respectively.

Table 1. Statistics of the StackExchange communities used for the experiments.

Time Period Tags Clusters Silhouette Train Test MinAccAns Experts LtR AvgList StackOverflow 2020-12-01 2021-01-01 5365 10 0.805 39581 9521 11 350 8008 111 Unix 2015-01-01 2023-09-04 1876 6 0.463 51995 12985 21 183 22098 118 AskUbuntu 2015-01-01 2023-09-04 2021 9 0.391 43469 10166 9 265 16918 129 Server Fault 2015-01-01 2023-09-04 2137 10 0.419 30797 7398 15 180 10041 102 Physics 2015-01-01 2023-09-04 811 8 0.406 53412 13860 31 146 19192 102 Mathematics 2022-01-01 2023-09-04 1351 9 0.499 46190 11649 41 141 15128 85

4.3. Evaluation metrics

We use Precision@1 (P@1), Normalized Discounted Cumulative Gain @3 (NDCG@3), Mean Reciprocal Rank (MRR), and Recall@5 (R@5) as our evaluation metrics. The cutoffs considered are short, as finding the relevant results at the top of the ranked lists for the EF task is essential. P@1 assumes a pivotal role in the EF task, emphasizing the primary objective of identifying the most suitable answerer for a given query. NDCG@3 and MRR metrics serve as valuable tools when models exhibit identical P@1 scores, offering a detailed comparison by considering the actual position of the best answerer in the list. R@5, in turn, measures the model’s efficacy in locating the best answerer within the top five positions. Performance metrics and statistical significance tests are computed using the RanX Library (Bassani, 2022).

Reproducibility

TUEF is prototyped in Python 3.8.17. All experiments are conducted on an Intel(R) Xeon(R) Platinum 8164 CPU 2.00GHz processor with 503GB RAM on Linux 5.4.0-153-generic. The source code of TUEF is publicly available⁴⁴4https://github.com/maddalena-amendola/TUEF.

5. Experimental analysis

The following Section discusses two blocks of experiments. The first block (Section 5.1) regards an ablation study of TUEF across all the communities to understand the contribution to the overall performance of TUEF’s different components (detailed in Section 3). To make EF decisions transparent and interpretable, the experiments also include assessing the impact of the integration in TUEF of the interpretable LtR algorithm (IlMart (Lucchese et al., 2022)) along with examples of TUEF’s decision-making process. In the second block of experiments (Section 5.2), we conduct a comprehensive evaluation of TUEF aimed at assessing its effectiveness in two scenarios: end-to-end Expert Ranking scenario, and an offline Expert Subsampling Ranking scenario, where we conduct experiments with sampled metrics (Krichene and Rendle, 2020b), ranking only a small set of candidates for each query. In both categories, we compare TUEF with state-of-the-art competitors for which the implementation is publicly available. Finally, in Section 5.3, we study the ability of TUEF to scale with larger datasets by considering four datasets corresponding to 1, 2, 3, and 4 months of StackOverflow data.

5.1. Ablation study

We compare TUEF in an end-to-end scenario with variants of the proposed solution, each exploiting only a subset of the TUEF components. By examining these different configurations, we can assess and quantify the impact of each component and gain insights into the effectiveness of combining social and content information in our approach for addressing the EF task for CQA platforms:

•

BC: It uses the MLG and sorts the experts in the layers related to the new question based on their Betweenness centrality score.
•

BM25: It uses the MLG and sorts the experts in the question’s layers based on the BM25 score computed between the query and their previously answered questions. As in TUEF, the ranked lists of candidate experts retrieved for the tag and content indexes are merged by alternating their elements.
•

TUEF _NB: It only uses the MLG and applies the Network-based method. Candidates are ranked using a LtR model exploiting the static and the following query-dependent features: LayerCount, QueryKnowledge, VisitCountNetwork, StepsNetwork, BetweennesPos, BetweennessScore, Eigenvector, PageRank, Closeness, Degree, and AvgWeights.
•

TUEF _CB: It only uses the MLG and applies the Content-based method. Candidates are ranked using a LtR model exploiting the static and the following query-dependent features: LayerCount, QueryKnowledge, VisitCountContent, StepsContent, ScoreIndexTag, ScoreIndexText, FrequencyIndexTag, FrequencyIndexTex, Degree, and AvgWeights.
•

TUEF _SL: In contrast to TUEF, it represents users’ relationships in a graph with a Single Layer. All other phases are unchanged.
•

TUEF _NoRW: In contrast to TUEF, it skips the Exploratory phase and does not perform Random Walks to extend the set of candidate experts selected considering the probability of receiving an answer.
•

TUEF _Lin: It ranks the experts according to the solution of (Roy et al., 2018), which uses a linear combination of features modeling experts, questions, and users’ expertise.
•

TUEF _IlMart: It uses IlMart, the interpretable version of LambdaMart algorithm proposed in (Lucchese et al., 2022) to learn the expert ranking model. IlMart is a learning algorithm that generates explainable LTR models based on a version of LambdaMART constrained to use only univariate and bivariate functions at training time. The IlMart learning strategy is made of three steps: i) Main Effects Learning, which learns with the LambdaMART boosting strategy an ensemble of trees $\tau(x_{j})$ in which the only feature allowed for each split is $x_{j}$ ; ii) Interaction Effects Selection, which selects the top-K most important feature pairs $(x_{i},x_{j})$ ; iii) Interaction Effects Learning, which learns with LambdaMART an ensemble of trees $\tau_{ij}(x_{i},x_{j})$ , where the only features allowed in a split of each tree $t\in\tau_{ij}(x_{i},x_{j})$ are either $x_{i}$ or $x_{j}$ , i.e., one of the features pairs selected by the previous step. The resulting ranking models can achieve ranking performance close to unconstrained LambdaMART models without all their complexity, thus trading off between effectiveness and explainability.

Table 2. Ablation Study Results. The table reports the P@1, NDCG@3, R@5, and MRR scores of TUEF and the baselines representing its different components. Scores marked with the

\blacktriangledown

symbol indicate that TUEF statistically outperforms the corresponding baseline according to the paired t-test with Bonferroni correction and p-value¡0.05. Conversely, the

\blacktriangle

symbol indicates that the corresponding baseline statistically outperforms TUEF. Underlined values indicate that the baseline has a higher score than TUEF, but the difference is not statistically significant.

StackOverflow Unix AskUbuntu P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR $BC$ 0.009 $\blacktriangledown$ 0.017 $\blacktriangledown$ 0.045 $\blacktriangledown$ 0.030 $\blacktriangledown$ 0.046 $\blacktriangledown$ 0.087 $\blacktriangledown$ 0.165 $\blacktriangledown$ 0.122 $\blacktriangledown$ 0.006 $\blacktriangledown$ 0.010 $\blacktriangledown$ 0.051 $\blacktriangledown$ 0.044 $\blacktriangledown$ $BM25$ 0.209 $\blacktriangledown$ 0.259 $\blacktriangledown$ 0.325 $\blacktriangledown$ 0.259 $\blacktriangledown$ 0.116 $\blacktriangledown$ 0.217 $\blacktriangledown$ 0.396 $\blacktriangledown$ 0.254 $\blacktriangledown$ 0.098 $\blacktriangledown$ 0.196 $\blacktriangledown$ 0.352 $\blacktriangledown$ 0.225 $\blacktriangledown$ $TUEF_{NB}$ 0.062 $\blacktriangledown$ 0.127 $\blacktriangledown$ 0.236 $\blacktriangledown$ 0.140 $\blacktriangledown$ 0.27 $\blacktriangledown$ 0.353 $\blacktriangledown$ 0.494 $\blacktriangledown$ 0.376 $\blacktriangledown$ 0.126 $\blacktriangledown$ 0.182 $\blacktriangledown$ 0.299 $\blacktriangledown$ 0.208 $\blacktriangledown$ $TUEF_{Lin}$ 0.263 $\blacktriangledown$ 0.383 $\blacktriangledown$ 0.569 $\blacktriangledown$ 0.406 $\blacktriangledown$ 0.297 $\blacktriangledown$ 0.382 $\blacktriangledown$ 0.542 $\blacktriangledown$ 0.41 $\blacktriangledown$ 0.184 $\blacktriangledown$ 0.285 $\blacktriangledown$ 0.440 $\blacktriangledown$ 0.307 $\blacktriangledown$ $TUEF_{SL}$ 0.431 0.565 $\blacktriangledown$ 0.750 0.564 $\blacktriangledown$ 0.336 0.430 0.585 0.452 0.255 $\blacktriangle$ 0.363 $\blacktriangle$ 0.513 $\blacktriangle$ 0.375 $\blacktriangle$ $TUEF_{CB}$ 0.455 0.589 0.761 0.587 0.332 0.424 0.573 $\blacktriangledown$ 0.444 $\blacktriangledown$ 0.244 0.350 0.502 0.364 $TUEF_{NoRW}$ 0.446 0.582 0.744 $\blacktriangledown$ 0.573 $\blacktriangledown$ 0.334 0.425 0.575 0.436 $\blacktriangledown$ 0.249 0.342 0.472 $\blacktriangledown$ 0.345 $\blacktriangledown$ $TUEF_{IlMart}$ 0.360 $\blacktriangledown$ 0.496 $\blacktriangledown$ 0.676 $\blacktriangledown$ 0.502 $\blacktriangledown$ 0.324 0.416 $\blacktriangledown$ 0.565 $\blacktriangledown$ 0.438 $\blacktriangledown$ 0.203 $\blacktriangledown$ 0.305 $\blacktriangledown$ 0.468 $\blacktriangledown$ 0.327 $\blacktriangledown$ TUEF 0.459 0.592 0.760 0.590 0.331 0.425 0.581 0.448 0.241 0.345 0.500 0.363 Server Fault Physics Mathematics P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR $BC$ 0.010 $\blacktriangledown$ 0.040 $\blacktriangledown$ 0.099 $\blacktriangledown$ 0.071 $\blacktriangledown$ 0.014 $\blacktriangledown$ 0.027 $\blacktriangledown$ 0.093 $\blacktriangledown$ 0.059 $\blacktriangledown$ 0.002 $\blacktriangledown$ 0.040 $\blacktriangledown$ 0.136 $\blacktriangledown$ 0.059 $\blacktriangledown$ $BM25$ 0.094 $\blacktriangledown$ 0.152 $\blacktriangledown$ 0.251 $\blacktriangledown$ 0.167 $\blacktriangledown$ 0.103 $\blacktriangledown$ 0.189 $\blacktriangledown$ 0.334 $\blacktriangledown$ 0.226 $\blacktriangledown$ 0.111 $\blacktriangledown$ 0.199 $\blacktriangledown$ 0.361 $\blacktriangledown$ 0.236 $\blacktriangledown$ $TUEF_{NB}$ 0.21 $\blacktriangledown$ 0.253 $\blacktriangledown$ 0.364 $\blacktriangledown$ 0.285 $\blacktriangledown$ 0.101 $\blacktriangledown$ 0.162 $\blacktriangledown$ 0.277 $\blacktriangledown$ 0.196 $\blacktriangledown$ 0.084 $\blacktriangledown$ 0.145 $\blacktriangledown$ 0.278 $\blacktriangledown$ 0.178 $\blacktriangledown$ $TUEF_{LIN}$ 0.260 0.368 0.524 $\blacktriangledown$ 0.388 0.139 $\blacktriangledown$ 0.217 $\blacktriangledown$ 0.357 $\blacktriangledown$ 0.253 $\blacktriangledown$ 0.177 $\blacktriangledown$ 0.27 $\blacktriangledown$ 0.427 $\blacktriangledown$ 0.301 $\blacktriangledown$ $TUEF_{SL}$ 0.279 0.389 0.547 0.406 0.220 0.299 0.440 0.331 0.208 0.315 0.471 0.335 $TUEF_{CB}$ 0.267 0.377 0.538 0.395 0.215 0.294 0.429 0.325 $\blacktriangledown$ 0.219 $\blacktriangle$ 0.327 $\blacktriangle$ 0.482 0.345 $TUEF_{NORW}$ 0.272 0.378 0.527 $\blacktriangledown$ 0.382 $\blacktriangledown$ 0.218 0.295 0.429 0.313 $\blacktriangledown$ 0.210 0.313 0.469 0.321 $\blacktriangledown$ $TUEF_{IlMart}$ 0.250 0.364 $\blacktriangledown$ 0.529 $\blacktriangledown$ 0.384 $\blacktriangledown$ 0.195 $\blacktriangledown$ 0.276 $\blacktriangledown$ 0.418 $\blacktriangledown$ 0.309 $\blacktriangledown$ 0.197 $\blacktriangledown$ 0.301 $\blacktriangledown$ 0.475 0.329 $\blacktriangledown$ TUEF 0.266 0.382 0.547 0.400 0.221 0.300 0.440 0.332 0.207 0.316 0.478 0.340

Discussion

Table 2 reports the results of the ablation study. We mark statistically significant performance gains/losses with respect to TUEF with the symbols $\blacktriangledown$ and $\blacktriangle$ (paired t-test with p-value¡0.05 and Bonferroni correction), and we underline the performance figures numerically greater than those of TUEF.

TUEF consistently outperforms in all examined communities some baseline models, including $BC$ , $BM25$ , $TUEF_{NB}$ , and $TUEF_{Lin}$ . A recurring pattern emerges across communities, revealing $BC$ and $TUEF_{NB}$ as the least effective baselines, both relying exclusively on network aspects. In contrast, $BM25$ ’s performance highlights the significance of the content-based component, as evidenced by a minimum of 10% of queries across communities being best addressed by experts who have previously answered similar questions. $TUEF_{Lin}$ consistently exhibits lower performance compared to TUEF, underscoring the efficacy of TUEF’s LtR methodology in capturing hidden relationships among expert features.

$TUEF_{SL}$ shows slightly higher or equal performance across communities except for StackOverflow and AskUbuntu. In the case of StackOverflow, TUEF outperforms a statistically significant margin $TUEF_{SL}$ , while on AskUbuntu, the opposite holds, hinting at community-specific influences. This observation aligns with the Silhouette values described in Table 1, which quantify the clustering effectiveness of layers within the MLG. StackOverflow’s superior Silhouette value of 0.805 indicates a robust subdivision of tags into clusters, benefiting TUEF’s exploitation of topic layers. Conversely, AskUbuntu, with the lowest Silhouette value of 0.391, experiences challenges, contributing to $TUEF_{SL}$ ’s higher statistical performance. Consequently, the MLG proves beneficial when distinct community topics are evident. Moreover, the values of $TUEF_{CB}$ underscore content supremacy over the network component, although incorporating both components in TUEF marginally enhances system performances for most communities. Finally, $TUEF_{NoRW}$ exhibits lower recall (R@5) without a decrease in precision (P@1). The similar P@1 scores between TUEF and $TUEF_{NoRW}$ result from TUEF’s random walk mechanism, which identifies a larger set of relevant experts. TUEF significantly improves recall (R@5) by consistently including the correct expert within the top five positions. However, this larger candidate set makes it more challenging to rank the best expert in the first position, thus keeping P@1 similar for both models. Overall, TUEF’s random walk enhances recall by expanding the candidate set, even though it complicates achieving the highest precision score.

Interpretability

$TUEF_{IlMart}$ consistently exhibits statistically lower performance than TUEF, with the most notable disparity occurring in the StackOverflow community, where the difference reaches 27.5% for P@1 and 12.43% for R@5. In other communities, the variations are still statistically significant, ranging from a maximum of 18.71% (AskUbuntu) to a minimum of 2.16% (Unix) in terms of P@1. However, the differences diminish significantly when considering R@5, with the highest value being 6.83% (AskUbuntu) and no statistical differences in Mathematics, indicating $TUEF_{IlMart}$ proficiency in placing the top respondent within the first five positions.

$TUEF_{IlMart}$ capacity to offer model result explanations may offset the observed performance decline. TUEF, on the other hand, employs a more sophisticated decision tree-based LtR approach, which contributes to higher performance but offers only minimal clarity regarding the significance of different factors influencing the ranking result. Figure 2 illustrates the feature importance of both models (LambdaMart and $TUEF_{IlMart}$ ) for the StackOverflow community. It reports LambdaMart’s top height features and all those considered by $TUEF_{IlMart}$ , all of which are present among LambdaMart’s top eight features, albeit in a different order. For example, the ScoreIndexTag feature, most used by LambdaMart, is the least used by $TUEF_{IlMart}$ . Most of the crucial features for LambdaMart are associated with the content-based method. Nevertheless, features from the network-based method, such as Closeness, PageRank, and AvgWeight, exhibit high importance values. On the other hand, all the features selected by $TUEF_{IlMart}$ are content-based, highlighting the importance of the text component for the EF task.

Figure 3 illustrates how $TUEF_{IlMart}$ can be used for interpretability purposes, showcasing the contribution to the predicted score of the three IlMart main effects with the highest feature importance values (i.e., FreqIndexTag, VisitCountContent, and FreqIndexText) and the most important interaction effect learned from $TUEF_{IlMart}$ for the StackOverflow community. High values of these features indicate proficient expert competence. The 2D plots and the heatmap are built by aggregating all the trees using the same features, i.e., by representing the contribution of each main effect $\tau(x_{j})$ and each interaction effect $\tau(x_{i},x_{j})$ , where $\tau(x_{j})$ and $\tau_{ij}(x_{i},x_{j})$ are ensembles of trees. Through Figure 3, we can analyze instances where TUEF successfully places the best answerer in the first position and cases where it fails. Table 3 presents four question examples: the first two showcase instances where TUEF successfully identifies the best expert by placing them in the first position. For a better understanding, we report the features of the top two ranked experts. Conversely, for the last two examples where TUEF fails, we show the feature values of the experts in ranking order up to the position where the ranker placed the best expert. The Rank column indicates the expert’s ranking position in the final list. In contrast, the BestExpert column serves as a label, with a value equal to 1 for the ground truth expert and 0 otherwise.

A notable pattern emerges in both instances of TUEF’s success and failure: the expert occupying the first position consistently demonstrates higher values for the features evaluated by the $TUEF_{IlMart}$ ranker. This results in a greater contribution and, consequently, a higher final score, as shown in Figure 3. For example, consider the first instance in Table 3, with $QID=65438592$ , and focus on the feature FreqIndexTag. The BestExpert (first row of Table 3) has a value of 45, corresponding to a contribution of approximately 1, as shown in the upper left figure. In contrast, the expert placed in the second position (second row of Table 3) has a value of 7, contributing to a measure of less than 0.5 (half of the first). By combining all feature contributions computed with the aid of Figure 3, we can comprehend the final ranking and understand why the model could or could not place the BestExpert in the first position.

Table 3. Examples of TUEF _IlMart Decision-Making Process. The table presents four examples of questions: two where TUEF _IlMart successfully identified the right expert (i.e., placing the right expert in the first position of the final ranking) and two where it was unsuccessful. For each question, the table reports the values of FreqIndexTag, VisitCountContent, and FreqIndexText, along with the final expert Score, the position in the final ranking, and the BestExpert label.

QID	FreqIndexTag	VisitCountContent	FreqIndexText	Score	Rank	BestExpert
65438592	45	17	54	2.39	1	1
65438592	7	7	7	1.96	2	0
65438859	2	10	13	-0.12	1	1
65438859	1	1	9	-1.7	2	0
65438631	81	62	82	3.29	1	0
65438631	30	25	30	2.83	2	1
65439018	229	14	202	3.5	1	0
	14	5	12	2.58	2	0
	11	9	16	2.45	3	1

5.2. State-of-the-art Baselines

In this section, we evaluate TUEF based on the two approaches for the EF task discussed in Section 2. As done in Section 5.1, to assess the statistical significance of the performance gains/losses ( $\blacktriangle$ and $\blacktriangledown$ ) measured with our experiments, we conducted a paired t-test with a p-value threshold of 0.05 and Bonferroni correction. Moreover, we underline the performance numerically (not statistically) greater than those of TUEF.

Expert Ranking.

Table 4. Expert Ranking Comparison. The table reports the P@1, NDCG@3, R@5, and MRR scores of TUEF and the baselines in the Expert Ranking category for all selected StackExchange communities. Scores marked with the

\blacktriangledown

symbol indicate that TUEF statistically outperforms the corresponding baseline according to the paired t-test with Bonferroni correction and p-value¡0.05. Conversely, the

\blacktriangle

Stack Overflow Unix Ask Ubuntu P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR BM25 0.257 $\blacktriangledown$ 0.366 $\blacktriangledown$ 0.513 $\blacktriangledown$ 0.381 $\blacktriangledown$ 0.253 $\blacktriangledown$ 0.348 $\blacktriangledown$ 0.492 $\blacktriangledown$ 0.365 $\blacktriangledown$ 0.141 $\blacktriangledown$ 0.237 $\blacktriangledown$ 0.389 $\blacktriangledown$ 0.261 $\blacktriangledown$ BM25+TAG 0.307 $\blacktriangledown$ 0.431 $\blacktriangledown$ 0.604 $\blacktriangledown$ 0.441 $\blacktriangledown$ 0.260 $\blacktriangledown$ 0.345 $\blacktriangledown$ 0.494 $\blacktriangledown$ 0.368 $\blacktriangledown$ 0.151 $\blacktriangledown$ 0.256 $\blacktriangledown$ 0.415 $\blacktriangledown$ 0.277 $\blacktriangledown$ MiniLM 0.282 $\blacktriangledown$ 0.384 $\blacktriangledown$ 0.527 $\blacktriangledown$ 0.399 $\blacktriangledown$ 0.262 $\blacktriangledown$ 0.352 $\blacktriangledown$ 0.485 $\blacktriangledown$ 0.367 $\blacktriangledown$ 0.158 $\blacktriangledown$ 0.259 $\blacktriangledown$ 0.412 $\blacktriangledown$ 0.279 $\blacktriangledown$ MiniLM+TAG 0.320 $\blacktriangledown$ 0.441 $\blacktriangledown$ 0.624 $\blacktriangledown$ 0.456 $\blacktriangledown$ 0.269 $\blacktriangledown$ 0.358 $\blacktriangledown$ 0.492 $\blacktriangledown$ 0.371 $\blacktriangledown$ 0.172 $\blacktriangledown$ 0.273 $\blacktriangledown$ 0.419 $\blacktriangledown$ 0.292 $\blacktriangledown$ DistilBERT 0.288 $\blacktriangledown$ 0.394 $\blacktriangledown$ 0.542 $\blacktriangledown$ 0.407 $\blacktriangledown$ 0.268 $\blacktriangledown$ 0.356 $\blacktriangledown$ 0.492 $\blacktriangledown$ 0.373 $\blacktriangledown$ 0.161 $\blacktriangledown$ 0.264 $\blacktriangledown$ 0.415 $\blacktriangledown$ 0.283 $\blacktriangledown$ DistilBERT+TAG 0.337 $\blacktriangledown$ 0.455 $\blacktriangledown$ 0.627 $\blacktriangledown$ 0.468 $\blacktriangledown$ 0.276 $\blacktriangledown$ 0.363 $\blacktriangledown$ 0.492 $\blacktriangledown$ 0.376 $\blacktriangledown$ 0.178 $\blacktriangledown$ 0.282 $\blacktriangledown$ 0.438 $\blacktriangledown$ 0.302 $\blacktriangledown$ t-BGER 0.254 $\blacktriangledown$ 0.338 $\blacktriangledown$ 0.478 $\blacktriangledown$ 0.36 $\blacktriangledown$ 0.25 $\blacktriangledown$ 0.314 $\blacktriangledown$ 0.43 $\blacktriangledown$ 0.345 $\blacktriangledown$ 0.202 $\blacktriangledown$ 0.312 $\blacktriangledown$ 0.508 0.335 $\blacktriangledown$ TUEF 0.459 0.592 0.760 0.590 0.331 0.425 0.581 0.448 0.241 0.345 0.500 0.363 Server Fault Physics Mathematics P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR BM25 0.199 $\blacktriangledown$ 0.286 $\blacktriangledown$ 0.430 $\blacktriangledown$ 0.311 $\blacktriangledown$ 0.164 $\blacktriangledown$ 0.234 $\blacktriangledown$ 0.364 $\blacktriangledown$ 0.267 $\blacktriangledown$ 0.149 $\blacktriangledown$ 0.235 $\blacktriangledown$ 0.378 $\blacktriangledown$ 0.265 $\blacktriangledown$ BM25+TAG 0.218 $\blacktriangledown$ 0.309 $\blacktriangledown$ 0.475 $\blacktriangledown$ 0.331 $\blacktriangledown$ 0.166 $\blacktriangledown$ 0.239 $\blacktriangledown$ 0.374 $\blacktriangledown$ 0.271 $\blacktriangledown$ 0.149 $\blacktriangledown$ 0.232 $\blacktriangledown$ 0.377 $\blacktriangledown$ 0.258 $\blacktriangledown$ MiniLM 0.200 $\blacktriangledown$ 0.293 $\blacktriangledown$ 0.436 $\blacktriangledown$ 0.313 $\blacktriangledown$ 0.172 $\blacktriangledown$ 0.248 $\blacktriangledown$ 0.382 $\blacktriangledown$ 0.278 $\blacktriangledown$ 0.159 $\blacktriangledown$ 0.248 $\blacktriangledown$ 0.403 $\blacktriangledown$ 0.277 $\blacktriangledown$ MiniLM+TAG 0.224 $\blacktriangledown$ 0.315 $\blacktriangledown$ 0.476 $\blacktriangledown$ 0.337 $\blacktriangledown$ 0.172 $\blacktriangledown$ 0.25 $\blacktriangledown$ 0.386 $\blacktriangledown$ 0.279 $\blacktriangledown$ 0.161 $\blacktriangledown$ 0.248 $\blacktriangledown$ 0.405 $\blacktriangledown$ 0.277 $\blacktriangledown$ DistilBERT 0.215 $\blacktriangledown$ 0.302 $\blacktriangledown$ 0.449 $\blacktriangledown$ 0.325 $\blacktriangledown$ 0.178 $\blacktriangledown$ 0.256 $\blacktriangledown$ 0.386 $\blacktriangledown$ 0.284 $\blacktriangledown$ 0.170 $\blacktriangledown$ 0.263 $\blacktriangledown$ 0.411 $\blacktriangledown$ 0.291 $\blacktriangledown$ DistilBERT+TAG 0.223 $\blacktriangledown$ 0.319 $\blacktriangledown$ 0.478 $\blacktriangledown$ 0.341 $\blacktriangledown$ 0.179 $\blacktriangledown$ 0.254 $\blacktriangledown$ 0.384 $\blacktriangledown$ 0.283 $\blacktriangledown$ 0.17 $\blacktriangledown$ 0.262 $\blacktriangledown$ 0.414 $\blacktriangledown$ 0.291 $\blacktriangledown$ t-BGER 0.275 $\blacktriangle$ 0.356 $\blacktriangledown$ 0.474 $\blacktriangledown$ 0.375 $\blacktriangledown$ 0.080 $\blacktriangledown$ 0.151 $\blacktriangledown$ 0.258 $\blacktriangledown$ 0.180 $\blacktriangledown$ 0.122 $\blacktriangledown$ 0.169 $\blacktriangledown$ 0.295 $\blacktriangledown$ 0.215 $\blacktriangledown$ TUEF 0.266 0.382 0.547 0.400 0.221 0.300 0.440 0.332 0.207 0.316 0.478 0.340

The first set of baselines follows the Expert Ranking configuration, which involves an end-to-end process of selecting users as experts, computing the similarity between a candidate expert and the query, and subsequently ranking the candidate experts. TUEF adopts this configuration, and we compare it with the models introduced in (Kasela et al., 2023), enabling us to evaluate the effectiveness of current neural re-rankers on this task, as well as the t-BGER model (Krishna and Antulov-Fantulin, 2023).

Kasela et al. (Kasela et al., 2023) employ a two-stage ranking architecture, where the initial stage uses a recall-oriented retriever, and the subsequent stage utilizes a precision-oriented approach. In the first stage, they use Elastic Search’s BM25 model to generate an ordered and limited list of candidate experts. In the second stage, they apply a linear combination of scores to re-rank these experts. This combination includes (i) the BM25 score, (ii) the similarity score computed by neural network-based re-ranker approaches, and for three of the six proposed baselines, (iii) the score generated by a personalized model based on user history. The neural network-based re-ranker models are $DistilBERT$ ⁵⁵5https://huggingface.co/distilbert-base-uncased and $MiniLM$ ⁶⁶6https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2. Both pre-trained models are fine-tuned for ten epochs for each community considered. The personalized model, called TAG, captures the similarity between the topics previously addressed by the asker and the answerer based on the tags in the questions asked and the answers provided. Finally, the scores are combined and computed as the weighted sum of the normalized model scores. The models’ weights are determined using a grid search on the validation set. The proposed baselines are BM25, BM25+TAG, MiniLM, MiniLM+TAG, DistilBERT, and DistilBERT+TAG. Although not explicitly stated for readability, the last four baselines also incorporate BM25.

In contrast, Krishna et al. (Krishna and Antulov-Fantulin, 2023) introduce t-BGER, a temporal weighted bipartite graph model. Their study emphasizes the importance of utilizing tags for expert identification while reducing task complexity. The model is based on a user-tag bipartite graph, where the edge weight reflects the user’s activity related to the specific tag. Additionally, they incorporate temporal discounting into this graph, assigning higher weights to recent activities. Finally, they apply resource allocation techniques through network-based inference on the bipartite graph, effectively managing highly sparse data.

As reported in Table 4, among the baselines, DistilBERT+TAG shows superior performance. However, t-BGER demonstrates better scores for the AskUbuntu and Server Fault communities. Overall, TUEF surpasses all baselines across various metrics, except for P@1 in the Server Fault community, where t-BGER exhibits a relative improvement of 3.38%, and R@5 in the AskUbuntu community, where t-BGER has a slightly higher score. Excluding these two exceptions, TUEF records minimum relative improvements of 19.31% for P@1 in the AskUbuntu community, 7.30% for NDCG@3, 15.40% for R@5, and 6.67% for MRR in the Server Fault community. In the remaining communities, TUEF reports higher relative improvements across all other metrics.

Expert Subsample Ranking.

The second set of baselines follows the Expert Ranking evaluation approach and aims to rank a small set of 20 experts when presented with a new query (see Section 2). This set always includes (i) the users who have provided the answers, including the best answerer, and (ii) a variable number of users selected to reach a total of 20 candidate experts, chosen from the top 10% of users identified as the most active answerers. Since this definition of the EF task deviates from the task previously considered by requiring the prior knowledge of who will answer the query, we adapted TUEF to allow a fair comparison with the following two state-of-the-art models for the EF task, whose code is publicly available:

•

NeRank (Li et al., 2019): It models the CQA platform as a heterogeneous network to learn representations for question raisers and question answerers through a metapath-based algorithm. Using this heterogeneous network, NeRank preserves relationship information while modeling the question’s content with a single-layer LSTM. Finally, a CNN assigns a score to each expert for a given question, representing the probability of the expert providing the accepted answer.
•

PMEF (Peng et al., 2022a): It consists of three main modules: a multi-view question encoder for learning comprehensive question features, an intra-view encoder to discover view-specific interactions among experts and target questions, and an inter-view encoder designed to extract expert/question features by integrating different view information in a personalized manner.

We always consider the questions selected for the experimentation phase for each community. However, the test set is smaller due to constraints imposed by NeRank, PMEF, and, in this specific setup, TUEF as well. Specifically, both NeRank and PMEF models require that the user posing the question has previously asked other questions to capture the interactions between the questioner and responder. Additionally, the test set includes only those questions for which TUEF, through graph exploration, included the best answerer in the set of potential experts. Moreover, to ensure a fair comparison among the three models, we required each to rank the same set of experts. This set comprises users who have answered the question, including the best answerer, and additional experts randomly chosen from those identified by TUEF as candidates. This procedure is the fairest, considering the way TUEF selects candidate experts. Specifically, as a topic-based model, TUEF chooses hard-negative samples because they are always users who have previously answered precisely the topics related to the query. In contrast, following the procedure of NeRank and PMEF and choosing from the top 10% of the community’s best answerers would lead to selecting users who could be irrelevant to the query topics, thus making it easier to rank them lower. We consider in the comparison also TUEF and $TUEF_{IlMart}$ to study the behavior of the interpretable version of TUEF under this different configuration.

Table 5. Expert Subsample Ranking Comparison. The table reports the P@1, NDCG@3, R@5, and MRR scores of TUEF and the baselines in the Expert Ranking category for all selected StackExchange communities. Scores marked with the

\blacktriangledown

symbol indicate that TUEF statistically outperforms the corresponding baseline according to the paired t-test with Bonferroni correction and p-value¡0.05. Conversely, the

\blacktriangle

StackOverflow Unix Ask Ubuntu P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR Nerank 0.313 $\blacktriangledown$ 0.460 $\blacktriangledown$ 0.718 $\blacktriangledown$ 0.491 $\blacktriangledown$ 0.385 $\blacktriangledown$ 0.490 $\blacktriangledown$ 0.646 $\blacktriangledown$ 0.514 $\blacktriangledown$ 0.217 $\blacktriangledown$ 0.364 $\blacktriangledown$ 0.651 $\blacktriangledown$ 0.406 $\blacktriangledown$ PMEF 0.116 $\blacktriangledown$ 0.213 $\blacktriangledown$ 0.422 $\blacktriangledown$ 0.273 $\blacktriangledown$ 0.429 $\blacktriangledown$ 0.556 $\blacktriangledown$ 0.718 $\blacktriangledown$ 0.567 $\blacktriangledown$ 0.298 $\blacktriangledown$ 0.472 $\blacktriangledown$ 0.740 $\blacktriangledown$ 0.492 $\blacktriangledown$ $TUEF_{IlMart}$ 0.700 $\blacktriangledown$ 0.815 $\blacktriangledown$ 0.939 $\blacktriangledown$ 0.804 $\blacktriangledown$ 0.619 0.744 $\blacktriangle$ 0.899 $\blacktriangledown$ 0.741 0.528 $\blacktriangledown$ 0.673 $\blacktriangledown$ 0.875 $\blacktriangledown$ 0.673 $\blacktriangledown$ TUEF 0.800 0.883 0.989 0.877 0.611 0.738 0.906 0.736 0.574 0.718 0.901 0.712 Server Fault Physics Mathematics P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR P@1 NDCG@3 R@5 MRR Nerank 0.292 $\blacktriangledown$ 0.403 $\blacktriangledown$ 0.597 $\blacktriangledown$ 0.442 $\blacktriangledown$ 0.187 $\blacktriangledown$ 0.341 $\blacktriangledown$ 0.621 $\blacktriangledown$ 0.380 $\blacktriangledown$ 0.233 $\blacktriangledown$ 0.372 $\blacktriangledown$ 0.608 $\blacktriangledown$ 0.407 $\blacktriangledown$ PMEF 0.319 $\blacktriangledown$ 0.447 $\blacktriangledown$ 0.654 $\blacktriangledown$ 0.471 $\blacktriangledown$ 0.214 $\blacktriangledown$ 0.320 $\blacktriangledown$ 0.556 $\blacktriangledown$ 0.378 $\blacktriangledown$ 0.130 $\blacktriangledown$ 0.270 $\blacktriangledown$ 0.523 $\blacktriangledown$ 0.313 $\blacktriangledown$ $TUEF_{IlMart}$ 0.498 $\blacktriangledown$ 0.668 $\blacktriangledown$ 0.850 $\blacktriangledown$ 0.660 $\blacktriangledown$ 0.520 $\blacktriangle$ 0.674 $\blacktriangle$ 0.860 0.673 $\blacktriangle$ 0.507 $\blacktriangledown$ 0.627 $\blacktriangledown$ 0.805 $\blacktriangledown$ 0.642 $\blacktriangledown$ TUEF 0.594 0.745 0.916 0.733 0.487 0.635 0.858 0.640 0.648 0.669 0.907 0.760

Table 5 shows the results of the three models across different communities. PMEF consistently outperforms NeRank, except in the StackOverflow and Mathematics communities. However, TUEF exhibits significantly better performance than PMEF, with the smallest relative improvement being 42.42% for P@1 and 29.8% for MRR in the Unix community, and 21.92% for R@5 in the AskUbuntu community. Other metrics in the remaining communities show even more substantial relative improvements than those mentioned.

$TUEF_{IlMart}$ consistently outperforms NeRank and PMEF. However, it exhibits lower performance across various communities compared to TUEF, except in the Unix and Physics communities, where it demonstrates statistical superiority in specific metrics. $TUEF_{IlMart}$ has already demonstrated performance comparable to TUEF in these communities during the ablation study (Table 2). Consequently, $TUEF_{IlMart}$ is a promising compromise between interpretability and system accuracy for less topic-sensitive communities.

Finally, note that TUEF performs considerably better in this set of experiments compared to the cases analyzed in Tables 2 and 4. This improvement is the result of two factors: (i) the test set includes only queries for which TUEF successfully identified the best answerer during graph exploration, and (ii) the set of candidates to be ranked is limited to twenty users, approximately one-fifth of those that TUEF typically selects and sorts on average (see Table 1). We remark that this configuration and the specific settings for these experiments were required to ensure a fair comparison with the other considered algorithms.

5.3. TUEF scalability

Table 6. Statistics for four datasets corresponding to 1, 2, 3, and 4 months of StackOverflow data, including the number of questions used for train and LtR, the number of answers, the number of users labeled as experts, the minimum number of accepted answers the experts have (MinAccAns), the number of tags associated with questions, the number of MLG’s layers, and the average length of experts lists to rank at inference time (AvgLists).

Months	Train	Answers	LtR	Experts	MinAccAns	Tags	Layers	AvgLists
1	38616	53717	8008	350	11	5365	10	111.03
2	79451	110169	19991	577	14	7546	10	146.74
3	122612	170131	33164	819	15	9136	10	182.68
4	165320	230246	44821	1015	16	10514	10	186.06

Table 6 and Figure 4 illustrate TUEF ability to scale and adapt to larger datasets while maintaining its effectiveness. The table shows key statistics for four datasets corresponding to 1, 2, 3, and 4 months of StackOverflow data, including the number of questions used for training the MLG and the LtR model, the number of answers, the number of users labeled as experts, the minimum number of accepted answers the experts have (MinAccAns), the number of tags associated with questions, the number of MLG’s Layers, and the average length of experts lists to rank at inference time (AvgLists). The training set size increases approximately linearly, with a consistent monthly growth rate, expanding from 38,616 questions in the 1-month dataset to 165,320 questions in the 4-month dataset. The increase in the number of questions is associated with an increase in the number of unique tags, which rise from 5,365 to 10,514, indicating a richer set of topics being covered over time. As expected, using more data increases the number of identified experts (from 350 to 1,015), thereby enriching the expert base. This increase in identified experts by TUEF leads to a corresponding increment in the number of retrieved experts at inference time, as reported in the AvgList column.

Despite the significant growth in data volume, TUEF maintains noteworthy performance on a test set of 1,432 questions, as evidenced by the performance metrics provided (Figure 4). The P@1 metric slightly decreases from 0.492 (1-month) to 0.449 (4-month). The NDCG@3 metric decreases from 0.631 to 0.590, while the MRR slightly drops from 0.625 to 0.589. This decline is relatively minor, considering the significant increase in the training data and the set of experts. However, the R@5 remains more stable for the datasets corresponding to 1, 2, and 3 months, and only slightly decreases to 0.771 for the 4-month dataset, indicating that the model consistently ranks the best answerer within the top five positions. The stability of the R@5 metric is particularly noteworthy, as it suggests that while TUEF may slightly reduce its performance in identifying the best experts at the top position, it still effectively includes them within the first 5 positions. One explanation for the observed slight decrease in performance in metrics other than R@5 is the increased difficulty the LtR algorithm faces when ranking a larger number of experts, as evidenced by the rise in the average length of lists of candidate experts (111 to 186). As the dataset grows, the expanding pool of candidate experts makes it more challenging to distinguish the best expert, leading to minor drops in Precision, NDCG, and MRR metrics. Finally, the rightmost plot shows the average latency of a single query, which increases from 0.107 seconds for the 1-month dataset to 0.167 seconds for the 4-month dataset. This increment can be attributed to the larger size of the CQA community in the 4-month dataset. Specifically, the greater number of questions leads to a larger and more dense MLG as more users and their interactions are included. Consequently, the RW applied to this denser MLG may perform more steps, resulting in the selection of more candidate experts to rank, thereby increasing the time required to perform the inference. Overall, TUEF demonstrates robust scalability, effectively managing larger datasets while maintaining a high level of performance, with slight declines in precision due to the increased data volume and complexity.

6. Conclusion

In this paper, we extended the analysis of TUEF, a Topic-oriented User-Interaction model for Expert Finding in Community Question&Answering platforms. TUEF integrates content and social data by constructing a Multi-Layer Graph to represent user interactions based on topical answering patterns. This approach allows TUEF to leverage both topic specificity and social relationships within the community to identify and rank the most knowledgeable users for any given question. The ablation study results show that TUEF’s performance is particularly notable in larger communities with well-defined topic clusters. Additionally, by incorporating interpretable Learning-to-Rank algorithms (i.e., IlMart (Lucchese et al., 2022)), TUEF achieves complete transparency, allowing a comprehensive understanding of the decision-making processes without significantly decreasing performance. Our extensive experiments conducted on multiple Stack Exchange communities demonstrate that TUEF outperforms state-of-the-art EF models in both the Expert Ranking and Expert Subsample Ranking categories. It excels in precision-oriented metrics such as P@1, NDCG@3, MRR, and R@5, with improvements over the state-of-the-art by a minimum of 42.42% in P@1, 32.73% in NDCG@3, 21.76% in R@5, and 29.81% in MRR. The study also highlighted TUEF’s ability to handle larger datasets, as evidenced by its performance on datasets corresponding to 1, 2, 3, and 4 months of StackOverflow data collection. Specifically, TUEF consistently ranks relevant experts in top positions, demonstrating its reliability and robustness.

Acknowledgements.

This work was partially supported by: the H2020 SoBig- Data++ project (#871042); the CAMEO PRIN project (#2022ZLL7MW) funded by the MUR; the HEU EFRA project (#101093026) funded by the EC under the NextGeneration EU programme. A. Passarella’s and R. Perego’s work was partly funded under the PNRR - M4C2 - Investimento 1.3, PE00000013 - “FAIR” project. However, the views and opinions expressed are those of the authors only and do not necessarily reflect those of the EU or European Commission-EU. Neither the EU nor the granting authority can be held responsible for them.

References

(1)
Amendola et al. (2024) Maddalena Amendola, Andrea Passarella, and Raffaele Perego. 2024. Towards Robust Expert Finding in Community Question Answering Platforms. In Advances in Information Retrieval, Nazli Goharian, Nicola Tonellotto, Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, and Iadh Ounis (Eds.). Springer Nature Switzerland, Cham, 152–168.
Aslay et al. (2013) Çiğdem Aslay, Neil O’Hare, Luca Maria Aiello, and Alejandro Jaimes. 2013. Competition-based networks for expert finding. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. 1033–1036.
Bassani (2022) Elias Bassani. 2022. ranx: A blazing-fast python library for ranking evaluation and comparison. In European Conference on Information Retrieval. Springer, 259–264.
Bergstra et al. (2013) James Bergstra, Daniel Yamins, and David Cox. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning. PMLR, 115–123.
Burges (2010) Christopher JC Burges. 2010. From ranknet to lambdarank to lambdamart: An overview. Learning 11, 23-581 (2010), 81.
Cañamares and Castells (2020) Rocío Cañamares and Pablo Castells. 2020. On target item sampling in offline recommender system evaluation. In Proceedings of the 14th ACM Conference on Recommender Systems. 259–268.
Costa and Ortale (2023a) Gianni Costa and Riccardo Ortale. 2023a. Ask and Ye shall be Answered: Bayesian tag-based collaborative recommendation of trustworthy experts over time in community question answering. Information Fusion 99 (2023), 101856.
Costa and Ortale (2023b) Gianni Costa and Riccardo Ortale. 2023b. Here are the answers. What is your question? Bayesian collaborative tag-based recommendation of time-sensitive expertise in question-answering communities. Expert Systems with Applications 225 (2023), 120042.
Dallmann et al. (2021) Alexander Dallmann, Daniel Zoller, and Andreas Hotho. 2021. A case study on sampling strategies for evaluating neural sequential item recommendation models. In Proceedings of the 15th ACM Conference on Recommender Systems. 505–514.
Dargahi Nobari et al. (2017) Arash Dargahi Nobari, Sajad Sotudeh Gharebagh, and Mahmood Neshati. 2017. Skill translation models in expert finding. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval. 1057–1060.
Dehghan and Abin (2019) Mahdi Dehghan and Ahmad Ali Abin. 2019. Translations Diversification for Expert Finding: A Novel Clustering-based Approach. ACM Trans. Knowl. Discov. Data 13, 3 (2019), 32:1–32:20. https://doi.org/10.1145/3320489
Dehghan et al. (2019) Mahdi Dehghan, Maryam Biabani, and Ahmad Ali Abin. 2019. Temporal expert profiling: With an application to T-shaped expert finding. Inf. Process. Manag. 56, 3 (2019), 1067–1079. https://doi.org/10.1016/j.ipm.2019.02.017
Faisal et al. (2019) Muhammad Shahzad Faisal, Ali Daud, Abubakr Usman Akram, Rabeeh Ayaz Abbasi, Naif Radi Aljohani, and Irfan Mehmood. 2019. Expert ranking techniques for online rated forums. Computers in Human Behavior 100 (2019), 168–176.
Fallahnejad and Beigy (2022) Zohreh Fallahnejad and Hamid Beigy. 2022. Attention-based skill translation models for expert finding. Expert Systems with Applications 193 (2022), 116433.
Freeman (1977) Linton C Freeman. 1977. A set of measures of centrality based on betweenness. Sociometry (1977), 35–41.
Fu (2019) Chaogang Fu. 2019. Tracking user-role evolution via topic modeling in community question answering. Information Processing & Management 56, 6 (2019), 102075.
Fu (2020) Chaogang Fu. 2020. User correlation model for question recommendation in community question answering. Applied Intelligence 50 (2020), 634–645.
Fu et al. (2020) Jinlan Fu, Yi Li, Qi Zhang, Qinzhuo Wu, Renfeng Ma, Xuanjing Huang, and Yu-Gang Jiang. 2020. Recurrent memory reasoning network for expert finding in community question answering. In Proceedings of the 13th international conference on web search and data mining. 187–195.
Ghasemi et al. (2021) Negin Ghasemi, Ramin Fatourechi, and Saeedeh Momtazi. 2021. User embedding for expert finding in community question answering. ACM Transactions on Knowledge Discovery from Data (TKDD) 15, 4 (2021), 1–16.
Giannakidou et al. (2008) Eirini Giannakidou, Vassiliki Koutsonikola, Athena Vakali, and Yiannis Kompatsiaris. 2008. Co-Clustering Tags and Social Data Sources. In 2008 The Ninth International Conference on Web-Age Information Management. 317–324. https://doi.org/10.1109/WAIM.2008.61
Hoffa (2019) Felipe Hoffa. 2019. Making Sense of the Metadata: Clustering 4,000 Stack Overflow tags with BigQuery k-means. https://stackoverflow.blog/2019/07/24/making-sense-of-the-metadata-clustering-4000-stack-overflow-tags-with-bigquery-k-means/
Hofmann et al. (2016) Katja Hofmann, Lihong Li, Filip Radlinski, et al. 2016. Online evaluation for information retrieval. Foundations and Trends® in Information Retrieval 10, 1 (2016), 1–117.
Kasela et al. (2023) Pranav Kasela, Gabriella Pasi, and Raffaele Perego. 2023. SE-PEF: a Resource for Personalized Expert Finding. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. 288–309.
Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30 (2017).
Krichene and Rendle (2020a) Walid Krichene and Steffen Rendle. 2020a. On sampled metrics for item recommendation. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 1748–1757.
Krichene and Rendle (2020b) Walid Krichene and Steffen Rendle. 2020b. On Sampled Metrics for Item Recommendation. In KDD 2020. https://dl.acm.org/doi/10.1145/3394486.3403226
Krishna and Antulov-Fantulin (2023) Vaibhav Krishna and Nino Antulov-Fantulin. 2023. Temporal-Weighted Bipartite Graph Model for Sparse Expert Recommendation in Community Question Answering. In Proceedings of the 31st ACM Conference on User Modeling, Adaptation and Personalization. 156–163.
Kundu and Mandal (2019) Dipankar Kundu and Deba Prasad Mandal. 2019. Formulation of a hybrid expertise retrieval system in community question answering services. Applied Intelligence 49 (2019), 463–477.
Kundu et al. (2019) Dipankar Kundu, Rajat Kumar Pal, and Deba Prasad Mandal. 2019. Finding active experts for question routing in community question answering services. In Pattern Recognition and Machine Intelligence: 8th International Conference, PReMI 2019, Tezpur, India, December 17-20, 2019, Proceedings, Part II. Springer, 320–327.
Kundu et al. (2020) Dipankar Kundu, Rajat Kumar Pal, and Deba Prasad Mandal. 2020. Preference enhanced hybrid expertise retrieval system in community question answering services. Decision Support Systems 129 (2020), 113164.
Kundu et al. (2021a) Dipankar Kundu, Rajat Kumar Pal, and Deba Prasad Mandal. 2021a. Time-aware hybrid expertise retrieval system in community question answering services. Applied Intelligence 51, 10 (2021), 6914–6931.
Kundu et al. (2021b) Dipankar Kundu, Rajat Kumar Pal, and Deba Prasad Mandal. 2021b. Topic sensitive hybrid expertise retrieval system in community question answering services. Knowledge-Based Systems 211 (2021), 106535.
Le and Shah (2018) Long T Le and Chirag Shah. 2018. Retrieving people: Identifying potential answerers in community question-answering. Journal of the association for information science and technology 69, 10 (2018), 1246–1258.
Li et al. (2002) Longzhuang Li, Yi Shang, and Wei Zhang. 2002. Improvement of HITS-based algorithms on web documents. In Proceedings of the 11th international conference on World Wide Web. 527–535.
Li et al. (2019) Zeyu Li, Jyun-Yu Jiang, Yizhou Sun, and Wei Wang. 2019. Personalized question routing via heterogeneous network embedding. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 192–199.
Liang et al. (2018) Dawen Liang, Rahul G Krishnan, Matthew D Hoffman, and Tony Jebara. 2018. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 world wide web conference. 689–698.
Liang (2019) Shangsong Liang. 2019. Unsupervised Semantic Generative Adversarial Networks for Expert Retrieval. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, Ling Liu, Ryen W. White, Amin Mantrach, Fabrizio Silvestri, Julian J. McAuley, Ricardo Baeza-Yates, and Leila Zia (Eds.). ACM, 1039–1050. https://doi.org/10.1145/3308558.3313625
Liu et al. (2022a) Hongtao Liu, Zhepeng Lv, Qing Yang, Dongliang Xu, and Qiyao Peng. 2022a. Efficient Non-sampling Expert Finding. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 4239–4243.
Liu et al. (2022b) Hongtao Liu, Zhepeng Lv, Qing Yang, Dongliang Xu, and Qiyao Peng. 2022b. Expertbert: Pretraining expert finding. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 4244–4248.
Liu (2009) Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Found. Trends Inf. Retr. 3, 3 (mar 2009), 225–331. https://doi.org/10.1561/1500000016
Liu et al. (2005) Xiaoming Liu, Johan Bollen, Michael L Nelson, and Herbert Van de Sompel. 2005. Co-authorship networks in the digital library research community. Information processing & management 41, 6 (2005), 1462–1480.
Liu et al. (2022c) Yue Liu, Weize Tang, Zitu Liu, Lin Ding, and Aihua Tang. 2022c. High-quality domain expert finding method in CQA based on multi-granularity semantic analysis and interest drift. Information Sciences 596 (2022), 395–413.
Lucchese et al. (2022) Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, and Alberto Veneri. 2022. ILMART: Interpretable Ranking with Constrained LambdaMART. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2255–2259.
MacQueen et al. (1967) James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1. Oakland, CA, USA, 281–297.
Mumtaz et al. (2019) Sara Mumtaz, Carlos Rodriguez, and Boualem Benatallah. 2019. Expert2vec: Experts representation in community question answering for question routing. In Advanced Information Systems Engineering: 31st International Conference, CAiSE 2019, Rome, Italy, June 3–7, 2019, Proceedings 31. Springer, 213–229.
Nobari et al. (2020) Arash Dargahi Nobari, Mahmood Neshati, and Sajad Sotudeh Gharebagh. 2020. Quality-aware skill translation models for expert finding on StackOverflow. Information Systems 87 (2020), 101413.
Peng et al. (2023) Qiyao Peng, Hongtao Liu, Zhepeng Lv, Qing Yang, and Wenjun Wang. 2023. Contrastive Pre-training for Personalized Expert Finding. In The 2023 Conference on Empirical Methods in Natural Language Processing.
Peng et al. (2022a) Qiyao Peng, Hongtao Liu, Yinghui Wang, Hongyan Xu, Pengfei Jiao, Minglai Shao, and Wenjun Wang. 2022a. Towards a multi-view attentive matching for personalized expert finding. In Proceedings of the ACM Web Conference 2022. 2131–2140.
Peng et al. (2022b) Qiyao Peng, Wenjun Wang, Hongtao Liu, Yinghui Wang, Hongyan Xu, and Minglai Shao. 2022b. Towards comprehensive expert finding with a hierarchical matching network. Knowledge-Based Systems 257 (2022), 109933.
Qian et al. (2022a) Lingfei Qian, Jian Wang, Hongfei Lin, Bo Xu, and Liang Yang. 2022a. Heterogeneous information network embedding based on multiperspective metapath for question routing. Knowledge-Based Systems 240 (2022), 107842.
Qian et al. (2022b) Lingfei Qian, Jian Wang, Hongfei Lin, Liang Yang, and Yu Zhang. 2022b. Multi-hop interactive attention based classification network for expert recommendation. Neurocomputing 488 (2022), 436–443.
Rousseeuw (1987) Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20 (1987), 53–65.
Roy and Singh (2024) Pradeep Kumar Roy and Jyoti Prakash Singh. 2024. Early prediction of promising expert users on community question answering sites. International Journal of System Assurance Engineering and Management (2024), 1–12.
Roy et al. (2018) Pradeep Kumar Roy, Jyoti Prakash Singh, and Amitava Nag. 2018. Finding active expert users for question routing in community question answering sites. In Machine Learning and Data Mining in Pattern Recognition: 14th International Conference, MLDM 2018, New York, NY, USA, July 15-19, 2018, Proceedings, Part II 14. Springer, 440–451.
Sang et al. (2019) Lei Sang, Min Xu, ShengSheng Qian, and Xindong Wu. 2019. Multi-modal multi-view Bayesian semantic embedding for community question answering. Neurocomputing 334 (2019), 44–58.
Shani and Gunawardana (2011) Guy Shani and Asela Gunawardana. 2011. Evaluating recommendation systems. Recommender systems handbook (2011), 257–297.
Sorkhani et al. (2022) Soroosh Sorkhani, Roohollah Etemadi, Amin Bigdeli, Morteza Zihayat, and Ebrahim Bagheri. 2022. Feature-based question routing in community question answering platforms. Information Sciences 608 (2022), 696–717.
Sun et al. (2019a) Jiankai Sun, Bortik Bandyopadhyay, Armin Bashizade, Jiongqian Liang, P. Sadayappan, and Srinivasan Parthasarathy. 2019a. ATP: Directed Graph Embedding with Asymmetric Transitivity Preservation. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 265–272. https://doi.org/10.1609/aaai.v33i01.3301265
Sun et al. (2019b) Jiankai Sun, Bortik Bandyopadhyay, Armin Bashizade, Jiongqian Liang, P Sadayappan, and Srinivasan Parthasarathy. 2019b. Atp: Directed graph embedding with asymmetric transitivity preservation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 265–272.
Sun et al. (2018a) Jiankai Sun, Sobhan Moosavi, Rajiv Ramnath, and Srinivasan Parthasarathy. 2018a. QDEE: Question Difficulty and Expertise Estimation in Community Question Answering Sites. In Proceedings of the Twelfth International Conference on Web and Social Media, ICWSM 2018, Stanford, California, USA, June 25-28, 2018. AAAI Press, 375–384. https://aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/view/17854
Sun et al. (2018b) Jiankai Sun, Sobhan Moosavi, Rajiv Ramnath, and Srinivasan Parthasarathy. 2018b. QDEE: question difficulty and expertise estimation in community question answering sites. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 12.
Tang et al. (2020) Weizhao Tang, Tun Lu, Dongsheng Li, Hansu Gu, and Ning Gu. 2020. Hierarchical attentional factorization machines for expert recommendation in community question answering. IEEE Access 8 (2020), 35331–35343.
Tondulkar et al. (2018) Rohan Tondulkar, Manisha Dubey, and Maunendra Sankar Desarkar. 2018. Get me the best: predicting best answerers in community question answering sites. In Proceedings of the 12th ACM Conference on Recommender Systems. 251–259.
Zhang et al. (2020) Xuchao Zhang, Wei Cheng, Bo Zong, Yuncong Chen, Jianwu Xu, Ding Li, and Haifeng Chen. 2020. Temporal Context-Aware Representation Learning for Question Routing. In WSDM ’20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, James Caverlee, Xia (Ben) Hu, Mounia Lalmas, and Wei Wang (Eds.). ACM, 753–761. https://doi.org/10.1145/3336191.3371847