Practitioner Evaluations on Software Testing Tools

In software engineering practice, evaluating and selecting the software testing tools that best fit the project at hand is an important and challenging task. In scientific studies of software engineering, practitioner evaluations and beliefs have recently gained interest, and some studies suggest that practitioners find beliefs of peers more credible than empirical evidence. To study how software practitioners evaluate testing tools, we applied online opinion surveys (n=89). We analyzed the reliability of the opinions utilizing Krippendorff's alpha, intra-class correlation coefficient (ICC), and coefficients of variation (CV). Negative binomial regression was used to evaluate the effect of demographics. We find that opinions towards a specific tool can be conflicting. We show how increasing the number of respondents improves the reliability of the estimates measured with ICC. Our results indicate that on average, opinions from seven experts provide a moderate level of reliability. From demographics, we find that technical seniority leads to more negative evaluations. To improve the understanding, robustness, and impact of the findings, we need to conduct further studies by utilizing diverse sources and complementary methods.


INTRODUCTION
Software projects face demands for delivering high-quality software at top speed. At the same time, there is the pressure for cost reduction. Test automation can be the solution but only after the problem of finding the right tool(s) has been solved. Therefore, selecting the correct tool(s) is important for profitable high speed and high quality testing. However, there are hundreds of commercial and open source tools available for software testing. Finding the right tool(s) even for evaluation and comparison can be challenging.
When faced with such choices, practitioners often turn to fellow practitioners. In fact, in the context of software process improvement, it has been shown that practitioners prefer the opinions of their peers over empirical evidence [43]. There is no plausible evidence suggesting the situation would be any different for test tool selection. It can be questioned whether such beliefs are uniform or credible, in general.
Our goal is to study how experts evaluate quality attributes of popular software testing tools, to assess whether such expert advice can be trusted or not, and to study the effect of background (demographic) variables. We set out to answer the following research questions: RQ1: Do survey respondents agree or have consistent opinions on the criteria? RQ2: How do background variables affect the survey evaluations (response variable)?
As a contribution, we show that increasing the number of respondents improves the reliability of the estimates measured with ICC, but the number of experts required for reliable evaluations is rather small.

BACKGROUND
We identified three branches of prior work relevant to our study: software test tool selection in Section 2.1, surveys of developers' opinions in Section 2.2 and assessment of responses in Section 2.3. In the following, we present a brief overview of these fields.

Software Test Tool Selection
Software test tool selection can be seen as a special case of software tool selection. Test automation, where tools play an integral part, can be considered as a solution to save (testing) costs and to improve quality and speed in software development [15]. Software testing tools impact the work of professionals across an organization. For a software testing tool to work in an organization, there are interconnections that need to be checked during evaluations [39]. Core capabilities of tools can be helpful in evaluation and selection of suitable tools [32]. However, challenges and obstacles in software testing are reported to be related not only to lack of time and resources, but also to lack of tools [13,32,41]. Costs have been reported to be among the topmost barriers to the use of automated testing tools [16,17]. Despite the proliferation of practically free open source tools, the inevitable barrier of costs has not disappeared. Testing budgets are expected to continue to consume a big proportion of the overall budgets [6,7].
For tool selection, there are different, more or less commercial comparison matrices available, e.g., [3,20,40]. Such sources may be useful for identifying tools, but the contents are neither generalizable nor validated for tool selection. There are software testing related academic studies which rely on surveys as the key methodology, e.g., [8,11,13,16,24,32,41,44], but only a few report software test tools (used by the practitioners) by name (e.g., [8,16,17]).
In grey literature, test tool evaluations tend to propose and include tasks like live trials, proof-of-concepts and demos [45]. Such tasks require resources and competence, and are considered to bear the risk of wrong decisions [39]. Thus, investigating solutions and methodologies that help make sense of the software testing tools is topical and warranted.

Developers' Beliefs and Opinion Surveys
Passos et al. [37] and Devanbu et al. [10] conclude that people are influenced by strong beliefs obtained from personal experiences rather than from empirical research. Similarly, Rainer et al. [43] report that for software process improvement, software practitioners find local opinion more credible than empirical evidence. Test tool automation consultation has been claimed to be the service most required from external consultants [23]. Beliefs may be the triggers for initiatives to adopt new technologies or tools, but the decisions are based on opinions of experts [37]. Pano et al. [36] found social influence to be an important factor in the process of adopting the best JavaScript framework, while prior research has little meaning to the practitioners.
An opinion survey is a common means of gauging and describing the public's collective sentiment for some defined need [14]. Online opinion surveys have emerged as a promising complementary way for understanding the collective public opinion [22]. Hosio et al. [21] have developed a light-weight decision support tool for surveying large pools of users for subjective opinions on how a given solution fares in light of various criteria. The data can then be modeled for answers that best match a desired criteria configuration. Such a system is based on the concept of the wisdom of the crowds [48].

Assessment of Responses
To evaluate software testing tools, we need collective information, knowledge from people having invested time in choosing and using tools [44]. Kitchenham et al. [26] define a survey as a "comprehensive system for collecting information to describe, compare or explain knowledge, attitudes and behavior", and representativeness of responses can be justified by analyzing the demographics of the respondents [26].
In software engineering (SE), there are studies reporting low values for expert agreement/reliability using Krippendorff's alpha and/or ICC, by e.g., Borg et al. [4], Anvaari et al. [1] and Kitchenham et al. [27]. Evaluations depend on the interpretation of a construct under study, i.e., include some degree of subjectivity [5,47]. Interrater reliability is always specific to a given setting, i.e., respondents, instrument and time [5,47]. Yet, Libby and Blashfield [33] claim that a small group of experts can provide as accurate evaluations as a large group.
[30] as a measure for the agreement among observers (respondents), intra-class correlation (ICC) as a measure of reliability of evaluations, and coefficient of variation (CV) that we use to evaluate agreement. In sections 3.2.5 and 3.2.6 we describe the approaches to study the effect of the number of respondents on the accuracy of the evaluations, and the effect of demographics on tool evaluations, respectively.

Opinion Surveys
We constructed a survey questionnaire including questions about background information and 15 questions for evaluating criteria (on different selected tools of choice), see Table 1. We used the criteria for the survey from a set of characteristics considered important by practitioners in test tool selection [44,45] and resting on the ISO/IEC 25010 quality model.
The respondents were able to select one or more tools and evaluate the criteria of choice for each tool, one tool at a time. The list of tools (100) was created from a set of tools identified by practitioners for software testing [44]. The respondents could indicate the basis of their evaluations for the tool(s), i.e., whether those were based on personal experience using the tool, or on a generic opinion, e.g., from observing others using the tool. The criteria were evaluated on a scale from 0 to 10, at intervals of 0.5 (the default value being 5) and using a slider as the UI input element. The online opinion survey method used was adopted from the studies of Hosio et al. [21] and Goncalves et al. [18]. Both the questionnaire and the survey tool were validated by the authors and by an industry partner.
Survey#1 was published online on August 29th, 2016. First, we promoted the survey to Finnish software testing professionals in a testing assembly in Finland, then posted a link to the survey (to selected groups) on Twitter, LinkedIn and Reddit, and sent a link to the survey to the public e-mail list of a testing association in Finland. We received 21 (of the 48 unique) responses with useful data (60 tool evaluations for 30 tools), and decided to conduct a second survey (survey#2).
For survey#2, we utilized the same online tool, but with a clear focus on ensuring a fair number of valid responses for at least one tool. We contacted a number of practitioners from a set of Finnish collaborating companies in the EUREKA ITEA3 TESTOMAT research project. The selected practitioners were known either to be familiar with Robot Framework, an open source, "generic test automation framework for acceptance testing and acceptance test-driven development (ATDD)" (as having used the tool and/or participated in the development of the tool), or to work in a company that utilized the tool. Survey#2 focused on Robot Framework, but the respondents were requested to evaluate other tools, too. Survey#2 was published on March 1st, 2018. We promoted it by e-mail to seven professional software consultants (from six companies), asking them to distribute the link to colleagues they considered relevant for answering the questions. A similar approach, known as snowball or chain sampling [38], has been used by, e.g., Ågerfalk and Fitzgerald [42]. To reach a wider audience, the survey was promoted in the Robot Framework Slack and on Twitter with the hashtag robotframework. Survey#2 was open for a month. We received 68 (of the 80 unique) responses with useful data (101 tool evaluations for 17 tools). All collected data for both surveys are anonymous. See the study related material in Appendix A.

Table 1 (excerpt). Survey questions.
ID    Question or criterion
B5    Country where you work.
B6    The domain of business you work on.
B7    Describe the software or system the company or organization you work for is developing.
B8    List the programming languages you use in your work.
B9    My answers are based on a) personal experience (using the tool) b) personal conception (e.g., observing others use it).
C1    Compatibility with existing tools (e.g., CI-tools).
C2    Applicability of the tool to the tasks, methods & processes.
C3    Easy to deploy (initial effort to take the tool into use).
C8    Possibility to remold or expand the tool.
C9    Cross-platform support.
C10   Maintenance and re-use of test cases & test data.
C11   Active further development of the tool.
C12   Popularity of the tool.
C13   Low cost price or licensing of the tool (expected costs for acquisition and usage).
C14   Performance of the tool (e.g. speed) for its purpose.

Tukey [49] proposed a rule of thumb for detecting distant values, i.e., outliers, on the basis of the quartiles of the data: an outlier is a value more than 1.5 times the interquartile range (IQR, i.e., Q3 − Q1) from the quartiles, i.e., either below Q1 − 1.5 * IQR or above Q3 + 1.5 * IQR. Osborne and Overbay [35] and Chandola et al. [9] emphasize the importance of studying the outliers, as those may have real life relevance [9], and include relevant information. We intended to study outliers in the data to see if some criteria for a tool have more outliers than others. Outliers may be a sign of nuisance, error or legitimate data, but can also be "inspiration for inquiry" [35].
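The quartile-based outlier rule can be illustrated with a minimal Python sketch. It is not the code used in the study; the function name tukey_outliers and the sample ratings are hypothetical, and the quartiles are computed with the "inclusive" method of Python's statistics module, which is an assumption about how Q1 and Q3 should be estimated.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    # Quartile estimation method is an assumption; other methods can
    # shift the fences slightly for small samples.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical ratings for one criterion of one tool (not survey data).
ratings = [10, 80, 85, 90, 90, 95, 95, 100, 100]
outliers = tukey_outliers(ratings)
print(outliers)
```

For these ratings Q1 is 85 and Q3 is 95, so the fences are 70 and 110 and only the rating 10 falls outside them.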

Krippendorff's alpha.
Krippendorff's alpha (α) is a statistical measure for determining inter-rater reliability. A value of α = 1 indicates perfect agreement and α = 0 indicates agreement no better than chance; negative values indicate systematic disagreement.
The values α ≥ 0.800 are suggested for drawing reliable conclusions while values 0.667 ≤ α < 0.800 are claimed for tentative conclusions only [29].
We used the R function kripp.alpha to measure the level of agreement among the respondents (raters) on the criteria (subjects) of the top 6 most evaluated tools. We considered the level of measurement for the data to be ratio, since the possible values (from 0 to 10 at intervals of 0.5, i.e., 21 levels) were ordered units having the same difference and an absolute zero. As the values were limited to our scale, the α values were calculated for ordinal type of data, too.
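To make the computation behind α concrete, the following is a from-scratch Python sketch of α = 1 − D_o / D_e, where D_o is the observed and D_e the expected disagreement. It is not the kripp.alpha implementation used in the study. As a simplifying assumption it uses the interval distance (a − b)² by default, whereas the study treated the data as ratio and ordinal; the function names and the tiny rating sets are hypothetical.

```python
from itertools import permutations

def interval_distance(a, b):
    """Squared difference, the usual distance for interval data."""
    return (a - b) ** 2

def krippendorff_alpha(units, distance=interval_distance):
    """units: one list of ratings per rated item (here: per criterion);
    None marks a missing rating."""
    units = [[v for v in unit if v is not None] for unit in units]
    units = [unit for unit in units if len(unit) >= 2]  # pairable units only
    values = [v for unit in units for v in unit]
    n = len(values)
    # Observed disagreement: ordered pairs within each unit.
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(unit, 2)) / (len(unit) - 1)
        for unit in units
    ) / n
    # Expected disagreement: ordered pairs across all pairable values.
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e

# Hypothetical examples: two raters, a few rated items.
alpha_agree = krippendorff_alpha([[1, 1], [2, 2], [3, 3]])
alpha_disagree = krippendorff_alpha([[1, 2], [2, 1]])
```

In the first example the raters agree on every item, so the observed disagreement is zero and α is 1. In the second example the raters systematically contradict each other and α drops below zero.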

Intra-class Correlation Coefficient (ICC).
ICC is a common statistic used for measuring inter-rater reliability for ratio type of data [19]. ICC values typically fall between 0 and 1, higher values indicating greater reliability. The commonly referenced ICC values are ICC ≥ 0.90 for excellent, 0.75 ≤ ICC < 0.90 for good, 0.50 ≤ ICC < 0.75 for moderate and ICC < 0.50 for poor reliability [28].
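The reference ranges above can be written as a small lookup, shown here as a Python sketch. The function name classify_icc is hypothetical; the thresholds are the ones quoted from [28].

```python
def classify_icc(icc):
    """Map an ICC value to the reliability label quoted from [28]."""
    if icc >= 0.90:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"

labels = [classify_icc(v) for v in (0.94, 0.881, 0.60, 0.12)]
print(labels)
```

Applying the lookup to the lower bound of a confidence interval, rather than to the point estimate, gives the more conservative label.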
We used the R function ICC to estimate the association among the respondents for the top 6 tools. The function provides results for six different forms, each denoted by two symbols, i.e., ICC(x,y) or ICCxy. The first symbol (x) indicates the model (1, 2 or 3) and the second (y) the type of the measurement protocol (either "1" for a single rater/measurement, or "k" for the mean of k respondents/measurements) [46]. As the results may differ and lead to different interpretations, it is suggested to report both the results and the computational variant [19,28]. To select the correct form [46], we analyzed the prerequisites suggested by Koo and Li [28]: (1) Do we have the same set of respondents for all criteria? Yes, the same set of respondents evaluated all criteria. (2) Is the sample of respondents randomly selected from a larger population or is it a specific sample of respondents? We had a specific sample of respondents, a convenience sample [25]. The respondents evaluated the same criteria, but the underlying contexts and constructs may vary for samples (even for respondents). Thus, there is no intention to generalize the tool related results regarding the values as such, but to analyze reliability of responses. (3) Are we interested in the reliability of a single respondent or the mean value of multiple respondents? We were interested in reliability of the mean value of many respondents. (4) Are we concerned about consistency or agreement? We wanted to check consistency (not absolute agreement). The first two questions guide the selection of the model. The third question determines the type, i.e., whether the measurement protocol is "single respondent" or "mean of k respondents". The last question determines the definition, i.e., consistency or absolute agreement.
We measured ICC using a two-way mixed effects model with average measures for consistency, i.e., ICC(3,k) [46], to estimate the degree to which the respondents were consistent in their evaluations across the criteria. (ICC2 and ICC3 differ in whether respondents are treated as random or fixed effects.) In reporting the results, we followed the guidelines suggested by Hallgren [19] and Koo and Li [28]. In cases where the single-measure ICCs, i.e., ICC(x,1), are low and the average-measure ICCs, i.e., ICC(x,k), are high, it is suggested to report both to demonstrate the discrepancy [46].
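As an illustration of the consistency forms ICC(3,1) and ICC(3,k), the following Python sketch computes both from the mean squares of a two-way layout with rows as rated subjects (here: criteria) and columns as raters (respondents). It is a from-scratch sketch and not the R function used in the study; the function name icc3 and the small ratings matrix are illustrative and are not survey data.

```python
def icc3(ratings):
    """Return (ICC(3,1), ICC(3,k)) for a complete subjects x raters matrix."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    single = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    average = (ms_rows - ms_err) / ms_rows
    return single, average

# Illustrative matrix: 6 rated subjects (rows) x 4 raters (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_single, icc_average = icc3(example)
print(round(icc_single, 2), round(icc_average, 2))
```

For this matrix the single-measure value is about 0.71 and the average-measure value about 0.91, which shows how averaging over raters raises the reliability estimate even when individual raters are only moderately consistent.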

Coefficient of Variation (CV).
We measured the coefficient of variation (CV) for the criteria evaluations for the top 6 tools, to analyze the extent of variability in evaluations in relation to the mean of the population. In practice, the lower the CV, the less variation there is. As our criteria are very different in nature (e.g., some, like "Programming Skills", are more human oriented than others, like "Costs"), CVs make it possible to compare the variation across criteria having different means.
As our data was considered to be of type ratio, but was limited to our scale (values from 0 to 10 at intervals of 0.5, i.e., 21 levels), we calculated the CV for both ratio and ordinal type of data. For ratio type of data the CV was calculated as the ratio of the standard deviation to the mean (1). For calculating the CV for the ordinal type of data we used the formula (2) presented by Kvålseth [31].
(1) $CV = \sigma / \mu$, where $\sigma$ is the standard deviation and $\mu$ the mean (CV for ratio type of data)
(2) CV for ordinal type of data, as in Kvålseth [31].
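For the ratio-type case, the CV is simply the standard deviation divided by the mean, as in this Python sketch. The function name and the sample values are hypothetical, and whether the sample or the population standard deviation was used in the study is not stated, so the sketch exposes that choice as a parameter. The ordinal variant by Kvålseth [31] is not reproduced here.

```python
import statistics

def coefficient_of_variation(values, sample=True):
    """CV for ratio data: standard deviation divided by the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    # Assumption: the choice between sample and population SD is left open.
    sd = statistics.stdev(values) if sample else statistics.pstdev(values)
    return sd / mean

# Hypothetical ratings (not survey data): mean 5, population SD 2.
cv_population = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9], sample=False)
cv_sample = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
```

Because the CV is scaled by the mean, two criteria with the same spread but different mean ratings yield different CVs, which is what allows the comparison across criteria.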

Number of Respondents for ICC.
To analyze the effect of the number of respondents on the incremental accuracy of tool evaluations, we applied the example modeled by Libby and Blashfield [33]. They empirically tested the effects of group size in decision making, and concluded that on average, having three accurate judges could improve average performance (in most cases). Employment of a small number of judges would be practical and cost efficient [33]. We generated random sets of respondents (from 2 to n respondents, n being the total number of respondents for a tool, see Table 2) for each of the top 6 tools. For each set size (from 2 to n) we ran 100 iterations of ICC (each run with a new random set of respondents) with the intention of comparing the medians of the groups to the common ICC reference values [28]. Thus, the total number of ICC values per tool, (n − 1) * 100, was 400 for Appium (n=5), 900 for Jenkins (n=10), 300 for Jira (n=4), 400 for JMeter (n=5), 7600 for Robot Framework (n=77) and 400 for Selenium (n=5).
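The resampling procedure can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the study's script: the ratings are synthetic (15 criteria, 5 respondents, generated from a seeded random generator), ICC(3,k) is re-implemented from mean squares, and the names icc3k and icc_runs_by_group_size are hypothetical.

```python
import random
import statistics

def icc3k(ratings):
    """ICC(3,k) for a complete criteria x respondents matrix."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    # Assumption: report 0.0 when the criteria do not differ at all.
    return (ms_rows - ms_err) / ms_rows if ms_rows > 0 else 0.0

def icc_runs_by_group_size(ratings, iterations=100, seed=0):
    """For each group size 2..n, compute ICC(3,k) on random respondent subsets."""
    rng = random.Random(seed)
    n_raters = len(ratings[0])
    runs = {}
    for size in range(2, n_raters + 1):
        runs[size] = []
        for _ in range(iterations):
            cols = rng.sample(range(n_raters), size)  # new random set each run
            subset = [[row[j] for j in cols] for row in ratings]
            runs[size].append(icc3k(subset))
    return runs

# Synthetic data: a "true" score per criterion plus respondent noise.
data_rng = random.Random(42)
true_scores = [data_rng.uniform(3.0, 10.0) for _ in range(15)]
ratings = [[t + data_rng.gauss(0.0, 1.5) for _ in range(5)] for t in true_scores]

runs = icc_runs_by_group_size(ratings)
medians = {size: statistics.median(values) for size, values in runs.items()}
total_icc_values = sum(len(values) for values in runs.values())
```

With n = 5 respondents the sketch produces (5 − 1) * 100 = 400 ICC values, matching the counts given above for tools with five respondents. For the largest group size every draw contains the same respondents, so those 100 values are identical up to rounding.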

Effect of the Demographics.
For studying the effect of demographics on the evaluations, we carried out a negative binomial regression analysis (for modeling count variables) with the R function glm.nb. For variable selection we used an automatic method, the R function stepAIC. For the baseline model, we included seven variables: familiarity with Robot Framework (see the question ID RFW in Table 1), experience regarding the use of the tool, years in the current role and in the work area, type of role and work area, and business domain.

RESULTS
The two surveys included evaluations for 2128 criteria, for 38 unique tools, in total. We filtered out any evaluations known to be test cases, duplicates or having only default values. The top 6 most evaluated tools in the surveys, namely Robot Framework, Jenkins, Appium, JMeter, Selenium and Jira, received 1525 evaluations, in total, see Table 2.
The arithmetic means of the evaluations for the criteria in the surveys for the top 6 tools are shown in Table 4. That practitioners tend to perceive Jira as a tool for software testing seemed rather reasonable, as the tool is part of a whole: "Bringing testing capabilities within Jira helps tightly integrate product management, development, and testing to streamline efficiency and productivity."
In both surveys, Robot Framework was the most evaluated tool, see Table 2. That is expected to be a by-product of two obvious facts: 1) Robot Framework as "a local tool" among the respondents (majority working in Finland) and 2) the utilization of convenience sampling [25] for survey#2. The respondents (n = 89) reported the country they work in as Australia (3) (4), United Kingdom (2) and USA (5). Three respondents did not provide that information. See background details in Table 3.

RQ1 -Opinions of the Criteria
To answer RQ1, "Do survey respondents agree or have consistent opinions on the criteria?", we analyzed the top 6 tools, see Table 4. We intended to identify the criteria that require focusing or investing in, and to analyze the reliability of the data. Robot Framework had a total of 1117 evaluations (about 52% of all evaluations) by 77 respondents, see Table 2. In the boxplot for Robot Framework, the median value was ≥ 80.0 for all criteria except Popularity and Programming Skills. Those two criteria also had the highest variance and the lowest lower quartile values (60.0 and 40.0, respectively). The criterion having the smallest IQR for Robot Framework was Costs (100 − 91.25 = 8.75), while the largest IQR was 45 (85 − 40) for Programming Skills.
The evaluations for the top 6 tools included 62 outlier values, see Table 4. There were no outliers for Appium and Jira, just one for Selenium (2%), two for JMeter (3%), four for Jenkins (3%) and 55 (5%) for Robot Framework. Those outlier values were given by 27 unique respondents (30% of all respondents) with years of experience, i.e., on average 14.1 years in the industry (median 12). The number of years in the current role was on average 4.0 years (median 3). The respondents were inclined to evaluate the tools critically.
The measurements for Krippendorff's α resulted in low values, see Table 6. Although the criteria being evaluated were the same, the evaluation of a criterion for a tool is a factor of some specific context, underlying construct for a tool and level of experience of an expert. Krippendorff's α was not considered as the best measure in our case as there was a wide range of possible values, and the evaluations were based on personal perceptions and experiences. Thus, we did not expect all respondents to interpret the criteria the same way. In fact, Dybå et al. [12] emphasize that "seemingly unpatterned and disagreeing findings from quantitative studies may have underlying consistency when omnibus context is taken into account".
As suggested by Shrout and Fleiss [46], we report ICC for single-measure and average-measure values for both random and fixed effects, i.e., ICC(2,1) & ICC(2,k) and ICC(3,1) & ICC(3,k), see Table 5. The resulting ICC(3,k), i.e., average consistency among fixed respondents, varied between 0.60 for Selenium and 0.94 for Robot Framework. Although the point estimate of 0.94 for Robot Framework falls in the "excellent" range [28], we classify it as "good", because the 95% confidence interval runs from 0.881 to 0.97 and its lower bound lies in the "good" range. The single-measure reliability for Robot Framework was low, like Krippendorff's alpha: 0.12 for ICC(2,1) (random effects) and 0.16 for ICC(3,1) (fixed effects).
As the criteria had different means, we measured the coefficient of variation (CV) for the evaluations of the top 6 tools, see Table 6. The CVs calculated for ratio and ordinal data were strongly positively correlated (Pearson's correlation), although the CVs for ordinal data were slightly better, in general. For Robot Framework, Cross-Platform Support, Cost-Effectiveness and Costs had the lowest uncertainty (the lower the value, the more precise the estimate). Thus, in the case of Robot Framework, it would be beneficial to study issues related to Programming Skills, Popularity, Easy to Deploy and Maintenance of Test Cases & Data in more detail.
To analyze how many experienced practitioners are needed to improve the accuracy of tool evaluation, we ran iterations as described in Section 3.2.5. We used the values of 0.5 ("moderate"), 0.75 ("good") and 0.9 ("excellent") as threshold values for indicating the levels of reliability [28] for the group medians. For Robot Framework, 7 respondents were required to reach the "moderate" level of reliability, 16 to reach the "good" level and 47 to reach the "excellent" level, see Figure 1. For Appium, Jenkins, Jira and JMeter, the "moderate" level of reliability was reached with 3 respondents, and for Selenium with 4. For Jenkins, a combination of 7 respondents reached the "good" level of reliability. The medians for the other four tools (Appium, Jira, JMeter and Selenium) did not reach either the "good" or the "excellent" level, indicating a need for more respondents.

RQ2 -Background of the Respondents
RQ2 covered the effect of the respondents' background on the evaluations: "How do background variables affect the survey evaluations (response variable)?". The results for evaluations for all tools (n = 2128 evaluations) are shown in Table 7.
We carried out a negative binomial regression to analyze the effect of demographics on evaluations. To select a subset of the explanatory variables, we used model simplification as described in Section 3.2.6. The proposed best model included four variables: experience regarding the use of the tool, familiarity with Robot Framework (see the question ID RFW in Table 1) and years in the work area and in the current role. However, we decided to keep all original seven variables, see Table 7.
The background variables were not expected to make a very accurate model, as the respondents rated their personal experiences related to a tool, and there were evaluations for different tools. In fact, the missingness information about the model indicates that there were 745 partial observations, i.e., observations not including all required data, and those were not used in fitting the model. The AIC of the model is 12710, which is uninformative on its own as we have just one model to compare. Deviance residuals indicate our model is not biased in one direction (1Q (−0.4737), 3Q (0.5586) and median (0.0893)).
The respondents reported the basis of their evaluations, i.e., either experience (personal experience using the tool, 0) or opinion (generic opinion e.g., from observing others using the tool, 1). An opinion based evaluation is significantly associated with a decrease of 0.1349 in evaluation (n = 149, r (1358) = −0.1349, p = 0.0001), compared to one based on experience using the tool (n = 1979).
Regarding the familiarity with Robot Framework, the baseline is "NA". The factor "No", i.e., the evaluations of those respondents that had not contributed to the development of the tool, is significant (n = 629, r (1358) = 0.0788, p = 0.0014) with respect to the baseline.
The coefficients for role imply that, with all other variables held constant, an evaluation by an individual contributor would be expected to be 0.0600 lower than an evaluation for the baseline (executive role), i.e., n = 999, r (1358) = −0.0600, p = 0.1855. Similarly, for the categorical variables lead and specialist, the values with respect to the baseline are n = 331, r (1358) = −0.0469, p = 0.2830 for a lead and n = 617, r (1358) = −0.1018, p = 0.0195 for a specialist. Of the three roles, only the specialist coefficient is significant at the 0.05 level.
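One way to read the reported estimates is sketched below in Python. The sketch rests on an assumption that is not confirmed by the text: that the values reported as r (1358) are regression coefficients on the log scale, since glm.nb fits a log link by default. Under that assumption, exponentiating a coefficient gives a multiplicative ratio of expected evaluations relative to the baseline. The dictionary labels are hypothetical shorthand for the three significant factors.

```python
import math

# Estimates quoted in the text; treating them as log-link coefficients
# is an assumption made for illustration only.
coefficients = {
    "opinion-based evaluation": -0.1349,
    "RFW = No": 0.0788,
    "specialist role": -0.1018,
}

# exp(coefficient) = ratio of expected evaluations versus the baseline.
ratios = {name: math.exp(b) for name, b in coefficients.items()}
percent_change = {name: (ratio - 1.0) * 100.0 for name, ratio in ratios.items()}

for name in coefficients:
    print(f"{name}: ratio {ratios[name]:.4f}, change {percent_change[name]:+.1f}%")
```

Under this reading, an opinion based evaluation would be expected to be roughly 13% lower than an experience based one, all else being equal. If the estimates were instead reported on the original rating scale, the additive reading given in the text applies and this conversion does not.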

DISCUSSION
Regarding the RQ1, "Do survey respondents agree or have consistent opinions on the criteria?", we acknowledge that tool evaluations are context-specific, practitioner-related and conducted in retrospect to experiences [51]. As our surveys were anonymous, we could not ask the respondents to reason their evaluations, but just analyze the variability of the values for the criteria. As there were 21 options for each criterion and agreement requires absolute consistency, the low results for Krippendorff's alpha were not surprising. However, when analyzing the relative ordering of the ratings, deviations from the mean with ICC(3,k), the average consistency among the respondents for the top 6 tools was in the "moderate" or "good" level of reliability, in general, see Table 5.
Costs are considered as barriers to the use of automated testing tools [16,17]. The top 6 tools were considered as low cost and cost-effective, in general, see Table 4. Wagenaar et al. [50] reported Scrum teams to prefer perceived usefulness over perceived ease of use after using a tool. Our findings from the surveys seem to support that observation as the rankings for e.g., Applicability (#6) or Cost-Effectiveness (#4) are higher than for Easy to Use (#9). Azizyan et al. [2] observed diversity of opinions in the form of conflicting comments for simple versus more comprehensive tools. Our findings from the surveys for the criterion of Programming Skills (high variability among respondents, see Table 6) support that observation when considering programming skills as a prerequisite for the use of a more comprehensive tool.
The CV's tend to be higher for Programming Skills than for the other criteria. Agreements on the criteria are important for confirming assumptions, but the disagreements (i.e., low values and outliers) are valuable for identifying possibly problematic issues. For example, low evaluations for criteria considered important in tool selection [44] are worth studying, in more detail. Our findings suggest that collective opinion can be used to point out issues, worth focusing on or investigating, in more detail.
Mannes et al. [34] consider expert knowledge as "accurate, robust and appealing as a mechanism for helping individuals tap collective wisdom". Our findings support the suggestion by Libby and Blashfield [33] that performance of a group as a function of the number of raters improves with a few accurate raters only. However, more raters may be required for minimizing the probability of making poor decisions [33]. Our findings from running ICC by pooling different combinations of respondents (raters) suggest that seven experienced respondents are enough for "moderate" level of reliability, but considerably more experienced respondents are required for "good" or "excellent" level of reliability. Thus, we find that trusting an opinion of just one or opinions of a few practitioners may be questionable or misleading, and can lead to wrong decisions.
To study RQ2, "How do background variables affect the survey evaluations (response variable)?", we carried out a negative binomial regression, see Table 7. We observed the opinion based evaluations to be significantly lower than those based on experience of using the tool. Years of experience in the work area seem to have a negative effect on the evaluations. The years in the current role, in turn, seem to be a significant factor. The tools have been available only for some time, e.g., JMeter since 1998 but Appium only since 2012.
Nowadays, popular open source tools have active development communities. Technical seniority (e.g., having a specialist role) was a significant factor, with specialists providing slightly more critical evaluations. Thus, the role of a respondent is expected to be a useful variable in similar types of surveys.
Earlier, expert tool users were not considered reliable for evaluating software testing tools, as they were not expected to have the experience or knowledge to make distinctions between various aspects of tools usage [39]. Nowadays, expert tool users can be active in the development of some open source tool(s) and thus, have in depth understanding of the functionality and possible special characteristics of such tool(s). Practitioners seem to value perceptions of local crowds as credible empirical evidence [18,43]. Software testing tools are used in various business domains. However, studying tool evaluations in a single company or within a single domain could provide a limited view on the criteria. Anvaari et al. [1] reported that neither long experience in the area of interest nor the same domain of expertise provided agreement among raters.

THREATS TO VALIDITY
We followed the guidelines presented by Wohlin et al. [52] for evaluating the validity of the study. Regarding internal validity, we acknowledge the bias of the sampling techniques for the surveys (to reach experts from several organizations, to get a rich set of data, at least on one case tool). Threats to external validity are related to the small (n = 89) sample size.
As tool evaluations are construct and context specific, bound to time and experiences, the results are not generalizable as such.
There is no single truth to confirm, but the results provide a basis for analyzing possible problematic perceptions. To address construct validity, the survey was piloted in advance. Based on the results (e.g., variance for evaluations of some criteria), the questionnaire would need to be refined for further studies. Thus, our results may be due to confounding variables not taken into account.

CONCLUSIONS AND FUTURE WORK
Tool evaluations are construct and context specific, and bound to time and experiences. Thus, opinions on software testing tools can be diverging or conflicting. Recollection of personal experiences is error-prone, but beliefs should be given attention in research to help to provide and to disseminate verified evidence to the practitioners [10]. Trusting the beliefs or perceptions of a small group of practitioners can be inaccurate or misleading. Therefore, perceptions and beliefs of practitioners should be analyzed with caution. We find it possible to harness realistic personal insights of the subject area into crowd-based insights.
We find that collective opinion, in the context of interest, is important in pointing out the criteria of importance or with polarized opinions worth investigating in more detail. According to our findings, experience based evaluations (on using a tool) seem to be more positive than those based on pure opinion (not having used a tool), and expert respondents tend to provide consistent evaluations for some criteria. However, some specific roles (with technical seniority, like specialists) are significantly associated with more negative evaluations. Practitioners with different backgrounds may not reach consensus in their evaluations, but the differences in how they apply the given scale may be predictable.
Our findings suggest that more than just three expert respondents are required to gain reliable evidence for testing tool evaluations. We conclude that on average, opinions from seven experts can provide reliable evidence for moderate level of accuracy. There is a need for practical and efficient ways for conducting tool evaluations that provide reliable empirical evidence for software practitioners. Considerably more work needs to be conducted for better understanding and for establishing more definitive, tool specific evidence.