How Pair Programming in Code Cities Unlocked Hidden Software Insights

Table of Links
Abstract and I. Introduction
II. Approach
A. Architectural Design
B. Proof of Concept Implementation
III. Envisioned Usage Scenarios
IV. Experiment Design and Demographics
A. Participants
B. Target System and Task
C. Procedure
V. Results and Discussion
VI. Related Work
VII. Conclusions and Future Work, Acknowledgment, and References
V. RESULTS & DISCUSSION
Our mid-questionnaires and post-questionnaire contained statements for which participants had to indicate their level of (dis)agreement on a 5-point Likert scale. The questionnaires also included free-reply fields to leave a comment on any experiment-related matter. Additionally, the instructor took note of observations such as noteworthy usages of specific features, noticeable emotions [27], and remarks of the participants. In the following, we present and use the results of our user study to revisit our posed research questions. Furthermore, we discuss the threats to validity of our evaluation. Although we use the term SV in this paper, we do not want our results to be understood as generalizing to SV as a whole. We again emphasize that the results and their interpretation are restricted to our particular prototype using collaborative code cities and to our experiment. Therefore, the findings should be seen as first insights and indicators for refinements rather than statistically grounded results.
Task evaluation
We measured an overall task correctness of 90 %. The related time spent solving the tasks is depicted in Figure 5. Both the mean and median time spent on T1 are 19 minutes. The fastest participant correctly solved T1 in seven minutes. This person was already familiar with ExplorViz. For T2, we see 29 minutes for the mean and 24 minutes for the median. Neither task had a time limit, hence the outlier group for T2. Figure 5 also depicts the participants' perceived task difficulty. T1 and T2 were each found to be difficult by four participants, with T1 also found to be very difficult by one person. Given the overall distribution, we conclude that the tasks were neither too easy nor too difficult.
RQ1: *How do subjects use the embedded SV and code editor during task solving?*
To the best of our knowledge, this work presents a novel approach that combines code editors with remote pair programming techniques and embedded, collaborative code cities. Therefore, we first intend to understand how the participants in our study used the approach when given free choice of the tool, i.e., embedded SV and code editor, with tasks based on SC1. In that context, Figure 6 depicts the time spent using each tool per task. For the measurement, a VS Code event was used to capture the time at which participants clicked on the code editor or the ExplorViz extension and therefore switched their focused context (see the sketch after this paragraph). Due to VS Code's limitations for extensions, it was technically only possible to measure the time spent between context switches. Thus, if a participant did not change the context but, for example, only used the SV, our measurements indicate a time spent of one minute for the SV. This is the case for the fastest participant for T1 mentioned above, who actively interacted only with the SV during this task (as confirmed by the video recording). The average time spent using the SV for T1 is seven minutes, and nine minutes for VS Code (both mean and median). During this task, participants comprehended the source code for the first time and probably spent more time reading it. It is therefore surprising that the time difference for the first task is already quite small. The reason for this is that code cities can facilitate the understanding of structures and are therefore suitable for the task of obtaining an overview [34], [35]. This was also explicitly mentioned by three participants in the free-text fields.

For T2, the average time spent using the SV is fifteen minutes, and eight minutes for VS Code. The almost doubled time spent using the SV results from the two outliers. For this task, however, the median time spent using the SV is thirteen minutes, and eight minutes for VS Code. We suppose that this comes from the shared software cities and the ability to highlight objects in question. The instructor's notes mention the frequent use of shared popups within two groups. The video recordings confirm that these groups often used the popups as a basis for discussion. Participants also often used the ping feature of our tool to highlight certain details for their collaborator. Therefore, they spent more time using the SV. However, collaboration is not the only reason for that. T2 explicitly requires participants to understand and extend a program flow. The SV provides a visual overview of the software system's structure and, in our case, also of a runtime behavior snapshot (see Section IV-B). As a result, it is far easier and more intuitive to use this available visualization and, for example, trace imaginary method calls with the mouse cursor (especially when combined with collaborative features).
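For readers unfamiliar with the VS Code extension API, the following TypeScript sketch illustrates one way such focus tracking can be implemented. It is a minimal approximation under the assumption of a webview-based SV panel, not the authors' actual instrumentation code; all names are illustrative.

```typescript
import * as vscode from 'vscode';

// Tools between which participants can switch their focus.
type Tool = 'editor' | 'sv';

let focusedTool: Tool = 'editor';
let lastSwitch = Date.now();
const timeSpentMs: Record<Tool, number> = { editor: 0, sv: 0 };

// Attribute the elapsed time to the tool that was focused until now.
// Note: as described above, only time *between* switches is measurable.
function recordSwitch(next: Tool): void {
  if (next === focusedTool) {
    return; // no actual context switch
  }
  const now = Date.now();
  timeSpentMs[focusedTool] += now - lastSwitch;
  focusedTool = next;
  lastSwitch = now;
}

export function activate(context: vscode.ExtensionContext): void {
  // Hypothetical webview panel hosting the embedded SV.
  const svPanel = vscode.window.createWebviewPanel(
    'softwareVisualization',
    'Software Visualization',
    vscode.ViewColumn.Two,
    { enableScripts: true }
  );

  // Fires when the SV panel gains (or loses) the active state.
  svPanel.onDidChangeViewState((e) => {
    if (e.webviewPanel.active) {
      recordSwitch('sv');
    }
  }, null, context.subscriptions);

  // Fires when a text editor receives focus; the event argument is
  // undefined while a webview is active, so any defined editor counts
  // as a switch back to the code editor.
  vscode.window.onDidChangeActiveTextEditor((editor) => {
    if (editor) {
      recordSwitch('editor');
    }
  }, null, context.subscriptions);
}
```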
Figure 6 also presents the number of context switches for each task. We observe that for T1 the number of switches between SV and code editor is much more widely distributed among the participants than for T2. Again, the reason for this is presumably the collaboration in T2. Most of the time, the participants work together and therefore change their tool when initiated by the other collaborator. For both T1 and T2, the median number of context switches is around forty, indicating that the number of context switches is independent of our tasks and collaboration.
Since our approach incorporates the runtime behavior of the target system, we also wanted to know how participants perceived the usefulness of the two tools for comprehending the posed program flow of T1. In this context, Figure 7 shows that the SV was perceived as more useful than the code editor. One participant mentioned that the communication lines are one of the most beneficial properties of the SV. In ExplorViz, the communication lines incorporate runtime information such as a method call's frequency in the visualized snapshot. This information is important for comprehending runtime behavior. Additionally, the SV already maps the runtime information that the users would otherwise have to find and understand on their own.
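To make this concrete, the following TypeScript sketch shows one plausible shape of the data a communication line could carry; the type and field names are hypothetical and not taken from the ExplorViz code base.

```typescript
// Hypothetical data carried by a single communication line; the names
// are illustrative and not taken from the ExplorViz implementation.
interface CommunicationLine {
  callerClass: string;   // fully qualified name of the calling class
  calleeClass: string;   // fully qualified name of the called class
  methodName: string;    // invoked method
  callFrequency: number; // calls observed in the visualized trace snapshot
}

// Scaling the rendered line width with the observed call frequency makes
// hot paths visually prominent without reading any source code.
function lineWidth(line: CommunicationLine, maxFrequency: number): number {
  const minWidth = 1;
  const maxWidth = 6;
  return minWidth + (maxWidth - minWidth) * (line.callFrequency / maxFrequency);
}
```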
RQ2: *Is the code editor perceived as more useful than the embedded SV?*
Traditionally, understanding a software system's behavior is primarily achieved by comprehending the source code [1]. For this experiment, the results related to RQ1 show that our approach was, for example, used by the participants to gain an overview of the target system. This is a common and suitable use case for SV, as shown in the past [34]. However, professional developers question the need for SV [15], [36]. In our opinion, reasons for this include the lack of properties such as code proximity [5], [6] and the required setup of SV tools [4]. In that context, we now examine how participants rate the usefulness of our approach.
Figure 7 depicts the results of the mid-questionnaires regarding the perceived usefulness of the tools for a task. For T1, overall 71 % agree with the posed statement 'SV helped with the task'. Agreement with the code editor's usefulness was slightly higher (a one-person difference). However, for the SV, the number of participants who neither agree nor disagree is higher, and the number of those who disagree is lower. Regarding T2, we see that overall 86 % agree with the posed statement 'SV helped with the task'. In comparison, agreement with the code editor's usefulness was slightly lower (a one-person difference).
RQ3: *Do subjects recognize the usefulness of collaborative SV features for specific tasks?*
With RQ3, we expand the results of our previous work [16] regarding the perceived usefulness of collaborative code cities. In this context, we asked all participants to state their level of agreement with two posed statements.
Figure 8 presents the related results. We see that 43 % of the participants each agree or strongly agree with the statement 'Collaborative SV features helped with the task'. The one person who disagrees with the statement mentioned that the collaborative SV features did not help in his case, since there was barely any input from the other participant. However, he agrees that the communication would be a big help in pair-programming-supported development in the real world. Presumably due to the low contribution of his collaborator, the same person also disagrees with the second statement, which refers to the perceived usefulness of voice communication. Nevertheless, all of the remaining thirteen participants strongly agree that voice communication was helpful for the task. This is consistent with our previous findings indicating that voice communication is one of the most useful collaborative tools in SV [16].
RQ4: *What is the general perception of the usefulness and usability of the approach?*
The post-questionnaire was designed to capture participants' overall perceptions of the approach's usefulness and usability. By answering RQ1, we have seen that the participants indeed use the SV as a supplement during the comprehension task. For RQ2, we concluded that participants perceived the code editor and the SV as about equally useful in the context of a real-world task. Finally, Figure 9 shows the related results.
Collaboration obviously depends on many factors, e.g., the mutual perception of collaborators or motivation. In our context, we have seen this for RQ3 and in previously published results [16]. Participants rate the collaborative SV features slightly differently when they are evaluated independently of a task. Figure 9 shows a shift in the distribution of approval ratings. The one person who previously disagreed with the usefulness of the collaborative features now neither agrees nor disagrees, which fits his previous remarks. Compared to the perceived usefulness for T2, the overall perceived usefulness of collaborative SV features shows less strong agreement. As a matter of fact, we could not find a reason why two participants downgraded their level of agreement to 'agree'. However, the overall approval rate remains the same.
Although this evaluation is overall more concerned with the perceived usefulness of embedded SV, identified usability problems can help to identify desirable refinements. In this context, Figure 9 also presents the participants' perceived usability of our approach. The results show that 86 % of the participants find the combination of embedded SV and code editor usable. Some desirable improvements were mentioned in the free-text responses, e.g., better performance. However, the biggest usability problem was the unintended minimization of the embedded SV. The reason for this is that VS Code opens files that are clicked in the package explorer in the currently focused editor group. This behavior can be disabled by locking an editor group; however, at the time of writing, the lock mechanism cannot be triggered from within a VS Code extension (see the sketch below). Figure 9 also shows that another 86 % would use this approach for private purposes such as collaborative program comprehension with fellow students.
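As a sketch of the desired fix, the following TypeScript shows how an extension could lock the group hosting the SV, under the assumption that the built-in `workbench.action.lockEditorGroup` command becomes invocable from extension code (which, as noted above, it was not at the time of writing).

```typescript
import * as vscode from 'vscode';

// Sketch of the desired fix: lock the editor group hosting the embedded
// SV so that files clicked in the package explorer open in another group.
// This assumes the built-in lock command becomes callable from extension
// code, which it was not at the time of the study.
async function lockSvGroup(svPanel: vscode.WebviewPanel): Promise<void> {
  // Focus the group that hosts the SV panel ...
  svPanel.reveal(vscode.ViewColumn.Two);
  // ... then lock it; locked groups are skipped when opening new editors.
  await vscode.commands.executeCommand('workbench.action.lockEditorGroup');
}
```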
RQ5: *Is the approach perceived as useful in the envisioned usage scenarios?*
Our pilot study found that a single experiment run would take about an hour to complete. To avoid discouraging potential participants due to the required time, we decided to set aside the other usage scenarios and base the experiment's tasks on SC1 only. Nevertheless, the post-questionnaire was also used to capture the participants' perceived usefulness of applying the approach in the remaining envisioned scenarios. The scenarios were described textually, and subjects were asked to state their agreement on a 5-point Likert scale. Figure 10 depicts the related results. The complete scenario descriptions are available in the supplementary package of this paper [30], but essentially summarize the envisioned usage scenarios of Section III. The participants rated SC1 with the highest overall agreement and strong agreement, respectively. The experiment's tasks and their introduction originate from SC1. SC2 has the highest amount of neutrality and disagreement. One person who answered with neither agreement nor disagreement mentioned that code changes are usually reviewed before deploying them. Since our approach only shows runtime behavior, he was not sure how changes would be visualized for the code review. This detail was in fact omitted in the textual description of SC2. We believe that this uncertainty is the reason for the highest amount of neutrality and disagreement for SC2. However, the majority consensus was positive for all scenarios.
A. Threats to Validity
Remote pair programming solution: As mentioned in Section II-B, we decided to implement our own remote pair programming approach so that the reproducibility of our evaluation does not depend on the availability of external services. However, this custom implementation lacks useful features compared to full-fledged solutions for remote pair programming. For example, one participant mentioned that he was unable to draw the attention of his collaborator to a specific code part. Although our study did not aim to evaluate whether one tool is better than the other, this custom implementation may have influenced the perceived usefulness or usability of the SV or code editor. Nevertheless, Figure 7 shows that the participants find the SV more suitable for understanding dynamic program flows. That being said, we conclude that more empirical research is required in this context.
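A minimal sketch of the missing "code ping" feature is shown below, assuming that the existing collaboration channel delivers such a message to the other side; the payload shape and all names are hypothetical.

```typescript
import * as vscode from 'vscode';

// Hypothetical payload of a "code ping" received over the collaboration
// channel; field names are illustrative.
interface CodePing {
  fileUri: string;   // URI of the pinged file
  startLine: number; // zero-based first line of the pinged range
  endLine: number;   // zero-based last line of the pinged range
}

// Receiving side: open the pinged file, scroll the range into view, and
// select it so the collaborator's attention is drawn to that code part.
async function showCodePing(ping: CodePing): Promise<void> {
  const document = await vscode.workspace.openTextDocument(
    vscode.Uri.parse(ping.fileUri)
  );
  const editor = await vscode.window.showTextDocument(document);
  const range = new vscode.Range(ping.startLine, 0, ping.endLine, 0);
  editor.revealRange(range, vscode.TextEditorRevealType.InCenter);
  editor.selection = new vscode.Selection(range.start, range.end);
}
```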
Experiment duration: The average time spent on the user study was about one hour (both median and mean). It follows that the participants' attention span, and thus the results, might have been influenced. To mitigate this, we told participants during the introduction that breaks could be taken at any time and that participation could be aborted. Moreover, T2 was solved collaboratively and therefore presumably eased the experimental situation.
Target system: The prepared target system contains 26 application-logic-related Java files that are distributed among four Maven subprojects. As a result, the small project size may have influenced the perceived usability of the SV, as also mentioned by one participant. We agree, but also emphasize that we did not intend to evaluate usability based on the scalability of the visualization, but on the overall concept. Overall, this evaluation is more concerned with the perceived usefulness of SV incorporating distributed tracing for the onboarding process. In addition, we argue that a real-world application of the onboarding scenario with SV should guide new developers through the software system's behavior with increasingly large portions of the code base.
Participants: The use of students in experiments is a valid simplification, although it is often said to potentially compromise external validity [31]. In our case, the participants' experience might have influenced their perception of the SV's usefulness as well as their time spent using the SV. In this context, professional developers can benefit from their experience, e.g., with the Spring framework, and can understand the source code faster. We will therefore repeat the experiment with professional developers.
Authors:
(1) Alexander Krause-Glau, Software Engineering Group, Kiel University, Kiel, Germany ([email protected]);
(2) Wilhelm Hasselbring, Software Engineering Group, Kiel University, Kiel, Germany ([email protected]).