
User Study Evaluates Eyeglass Reflection Risks in Webcam-Based Attacks

Abstract and I. Introduction

II. Threat Model & Background

III. Webcam Peeking through Glasses

IV. Reflection Recognizability & Factors

V. Cyberspace Textual Target Susceptibility

VI. Website Recognition

VII. Discussion

VIII. Related Work

IX. Conclusion, Acknowledgment, and References

APPENDIX A: Equipment Information

APPENDIX B: Viewing Angle Model

APPENDIX C: Video Conferencing Platform Behaviors

APPENDIX D: Distortion Analysis

APPENDIX E: Web Textual Targets

V. CYBERSPACE TEXTUAL TARGET SUSCEPTIBILITY

The evaluations so far are based on the text's physical size and carried out in controlled environments to better characterize the user-independent components of the reflection model as well as the range of theoretical limits for webcam peeking. In this section, we start by mapping the limits to common cyberspace objects in order to understand the potentially susceptible targets. We then conduct a 20-participant user study with both local and Zoom recordings to investigate the feasibility and challenges of peeking at these targets and the impact of various factors.

A. Mapping Theoretical Limits to Targets

We use web texts as an enlightening example of cyberspace textual targets considering their wide use and the relatively mature conventions of HTML and CSS. The discussion is based on (1) a previous report [48] that scraped the 1000 most popular websites on the Alexa web ranking [8], and (2) a manual inspection of 117 big-font websites archived on SiteInspire [10]. We further divide the inspected web texts into 3 groups (G1, G2, G3; see Appendix E and Table III) in order to discuss separately how the webcam peeking attack with current and future cameras could affect them. As pointed out in Section III-B, the conversion between digital point size and physical cap height depends on specific user settings such as the browser zoom ratio. The cap height values in Table III are thus measured on the Acer laptop with default OS and browser settings as a case study.
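The point-size-to-cap-height conversion that Equation 1 performs can be sketched as follows. This is an illustrative approximation only: 1 pt = 1/72 inch is standard typography, but the 0.7 cap-height-to-em ratio and the treatment of zoom and OS scaling as plain multipliers are our assumptions, not values from the paper.

```python
# Illustrative sketch of converting a digital point size to an
# on-screen physical cap height in millimeters. The 0.7 cap-height
# ratio and the multiplier model for zoom/OS scaling are assumptions.

MM_PER_POINT = 25.4 / 72.0  # millimeters per typographic point

def cap_height_mm(point_size: float, zoom: float = 1.0,
                  os_scaling: float = 1.0, cap_ratio: float = 0.7) -> float:
    """Approximate physical cap height of rendered text in mm."""
    return point_size * MM_PER_POINT * zoom * os_scaling * cap_ratio

# E.g., a 30 pt headline at 100% zoom and default scaling:
print(f"{cap_height_mm(30):.1f} mm")  # about 7.4 mm
```

Under these assumptions, a modest headline size already lands in the 7-10 mm range discussed next, which is why user settings such as zoom ratio matter so much for susceptibility.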

Based on the results in Figure 5, we hypothesize that the smallest cap height adversaries can peek at using mainstream 720p cameras is 7-10 mm. We then calculate the corresponding limits with 1080p and 4K cameras using Equation 3 and show them in the Theoretical column of Table III. Considering that participants are most likely to use 720p cameras, we choose point sizes S1-S6 in Table III for the evaluations.
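The scaling from the 720p baseline to higher resolutions can be sketched as below. This assumes the minimum recognizable cap height scales inversely with the camera's vertical pixel count, which is our reading of the intuition behind Equation 3; the exact equation may differ, so the derived numbers are illustrative.

```python
# Hedged sketch: scale the 7-10 mm cap-height limit measured at 720p
# to other resolutions, assuming the limit is inversely proportional
# to vertical pixel count (our assumption, not the paper's Equation 3).

BASELINE_RES = 720          # vertical pixels of the reference webcam
BASELINE_MM = (7.0, 10.0)   # cap-height limits hypothesized at 720p

def min_cap_height_mm(vertical_res: int) -> tuple[float, float]:
    """Scale the 720p cap-height limits to another resolution."""
    scale = BASELINE_RES / vertical_res
    return (BASELINE_MM[0] * scale, BASELINE_MM[1] * scale)

for res in (720, 1080, 2160):   # 720p, 1080p, 4K
    lo, hi = min_cap_height_mm(res)
    print(f"{res}p: {lo:.1f}-{hi:.1f} mm")
```

Under this assumption a 4K (2160-line) camera would resolve text roughly a third the size a 720p camera can, which matches the qualitative claim that future cameras threaten smaller header texts.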

B. User Study

The user study (Section VII-D) is designed as a challenge-response protocol: an author generates HTML files, each with one randomly selected headline sentence containing 7-9 words [4] from the widely used “A Million News Headlines” dataset [46]. Only each word’s first letter is capitalized. The participants display the HTML page in their browsers while being recorded, and another author, acting as the adversary, tries to recognize the words from the videos containing the 20 participants’ reflections without knowing the HTML contents, using the same techniques as in Section IV. We then calculate the percentage of correctly recognized words.
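The scoring step can be sketched as a simple word-level comparison between the hidden headline and the adversary's transcription. The function and the matching rule below are our own minimal interpretation of "percentage of correctly recognized words", not code from the paper.

```python
# Minimal sketch of the challenge-response scoring: word accuracy is
# the fraction of ground-truth words the adversary recovered. The
# case-insensitive set-membership rule is our assumption.

def word_accuracy(ground_truth: str, guess: str) -> float:
    truth_words = ground_truth.split()
    guess_words = {w.lower() for w in guess.split()}
    hits = sum(1 for w in truth_words if w.lower() in guess_words)
    return hits / len(truth_words)

print(word_accuracy("Police Probe Fatal Crash Near City Mall",
                    "police ??? fatal crash near ??? mall"))  # 5 of 7 words
```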

Data Collection. Each participant was given 6 HTML files of increasing point sizes from S1 to S6, as shown in Table III. Note that the 6 sizes are specified in point size in HTML so that user-dependent factors such as screen size and browser zoom ratio can be studied (Equation 1). The participants displayed each HTML file on their own computer displays in their accustomed rooms and behaved normally as in video conferences. We allowed participants to choose their preferred environmental lighting conditions except asking them

Fig. 8. The recognition results of textual reflections collected with local and Zoom-based remote video recordings from 20 user study participants. Participants 4 and 14, and participants 3, 6, 10, and 11, did not generate glass reflections that allow successful recognition, due to out-of-range viewing angles and very low light SNR respectively, and are thus omitted from the figure.

Fig. 9. (a) The degree of influence of different factors on the reflection recognition performance, evaluated by the correlation scores. Factors highlighted with boxes are computed from other raw factors according to our model. (b-d) The joint distribution of three factors and the recognition results.

to avoid close light sources other than the screen in front of their faces. The reason is that we found a close frontal light source can seriously decrease the light SNR, which can potentially be used as a physical mitigation against this attack but would prevent us from examining the impact of all the other factors. We did not tell the participants to stay stationary and let them behave normally as when browsing screen contents. Their webcams recorded them for 30 seconds for each HTML file.

Network bandwidth and the resulting video quality are artifacts of video conferencing platforms that improve rapidly [4] compared to other user-dependent physical factors. To study the present-day and possible future impact of video conferencing platforms, we recorded the 20 participants’ videos both locally and remotely through Zoom. Our experiments focused on Zoom since it is the most widely used platform and also provides the most detailed video and network statistics.

General Adversary Recognition Results. The recognition results achieved by the adversary with local and remote recordings are shown in Figure 8 (upper and lower respectively). Two participants (4 and 14) did not generate glass reflections of their screens in the video recordings due to out-of-range vertical viewing angles, as predicted in Section III-B. Four participants (3, 6, 10, 11) yielded 0% textual recognition accuracy due to very low light SNR.

With local video recordings, the percentages of the 20 participants subject to non-zero recognition accuracy on S6-S1 are 70%, 60%, 30%, 25%, 15%, and 0% respectively. Videos of participants 7 and 17, who used 720p cameras, allowed the adversary to achieve 12.5% and 25% accuracy on recognizing S2. Videos of participant 16, who used a 480p camera, allowed the adversary to achieve a 37.5% accuracy on recognizing S3. These results translate to the predicted susceptible targets with cameras of different resolutions as listed in the User column of Table III, where 720p webcams pose threats to large-font websites (G3) and future 4K cameras pose threats to various header texts on popular websites (G1 and G2). As expected, this result is worse than the theoretical limits in the table, which are derived with prescription glass data in the controlled lab setting (Section IV). Our observations suggest four main reasons: (1) The participants’ environmental lighting conditions are more diverse and less advantageous to screen peeking than the lab setup, generating reflections with worse light SNR. (2) Texts in the user study are mostly lower-case and thus have smaller physical sizes than the upper-case letters used in Section IV. (3) The prescription glasses used in Section IV have a larger focal length than the average participant’s glasses. (4) More intentional movements exist in the user study, leading to more motion blur.

With Zoom-based remote recordings, the percentages of participants with non-zero recognition accuracy on S6-S1 degraded to 65%, 55%, 30%, 25%, 5%, and 0% respectively. We logged the video network bandwidth and resolution reported by Zoom, as shown in Figure 8. The correlation between Zoom bandwidth, resolution, and their impact on video quality agrees with the observations in Section IV-C. Generally, bandwidths smaller than 1500 kbps led to 360p resolution most of the time and decreased the recognizable text size by one level. Zoom’s 720p videos also degraded recognition accuracy but mostly kept the recognizable text size at the same level as the local recordings, suggesting the same predictions of susceptible text sizes and corresponding cyberspace targets.
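The bandwidth effect can be condensed into a small heuristic. The threshold logic below is our reading of these measurements (a single 1500 kbps cutoff costing one size level), not code or a rule from the paper.

```python
# Rough heuristic distilled from the Zoom observations: bandwidths
# under ~1500 kbps mostly forced 360p video and cost the adversary
# about one recognizable text-size level, while 720p Zoom video kept
# the level of the local recordings. The cutoff is an assumption.

def remote_size_penalty(bandwidth_kbps: float) -> int:
    """Extra text-size levels (on the S1-S6 scale) lost vs. local."""
    return 1 if bandwidth_kbps < 1500 else 0

def smallest_recognizable_level(local_level: int, bandwidth_kbps: float) -> int:
    """Shift the local limit by the penalty, capped at S6 (level 6)."""
    return min(6, local_level + remote_size_penalty(bandwidth_kbps))

print(smallest_recognizable_level(2, 1200))  # S2 locally -> S3 over a weak link
```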

Besides Zoom, the most widely used platform, we also acquired remote recordings of participant 19 with Skype and Google Meet. The adversary achieved better results with Skype than with Zoom, recognizing S3 and S2 with 89% and 25% accuracy respectively, likely due to Skype’s capability of maintaining better-quality video streams with a 1200 kbps bandwidth. The web-based Google Meet platform provided the lowest-quality videos and only allowed the adversary to achieve 22% accuracy on recognizing S4.

Underlying Reasons. To identify the dominant factors that make webcam peeking easier, we analyze the correlation between the recognition results and different factors. We first turn each participant’s results (6 sizes) into a single attack score, a rectified weighted sum of the recognition accuracies of the six text sizes tested. Figure 9 (a) shows correlation scores with 11 factors that affect reflection pixel size (left) and light SNR (right) respectively when w = 1.5. The glass types include prescription (15/20) and prescription with BLB coatings (5/20). The physical text size and reflection-environment light ratio highlighted in the boxes are two composite factors. In short, the physical text size represents the ratio between the actual physical size of texts displayed on each participant’s screen and the case-study values in Table III, and is calculated with Equation 1 from other raw factors such as browser zoom ratios. The reflection-environment light ratio represents how strong the screen brightness is compared to the environmental light intensity and is calculated by dividing glass luminance by environmental luminance. These two composite factors represent our model’s predictions of reflection pixel size and light SNR, and they generate higher correlation scores than the other raw factors, which validates the effectiveness of our models. Figure 9 (b-d) further shows the joint distribution of the attack score and three representative factors. It can be seen from (b) that the 40 mm screen-glass distance used in the evaluation of Section IV is about the average of the participants’ values, and the participants’ distances actually only have a very weak correlation with

Fig. 10. Accuracy of recognizing Alexa top 100 websites from eyeglass reflections. Each participant browsed 25 websites. Participants 0 and 4 did not yield recognizable reflections due to bad light SNR and viewing angles.

the ease of the webcam peeking attack. Figure 9 (d) suggests that when the ratio of screen brightness to environmental light intensity falls below a certain threshold, the likelihood of preventing adversaries from peeking is very high, which may be considered a temporary mitigation.
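The attack-score and correlation analysis above can be sketched as follows. The paper states only that the score is a "rectified weighted sum" with w = 1.5; the exact weighting scheme (larger weights for smaller, harder sizes) and the use of plain Pearson correlation are our assumptions for illustration.

```python
# Hedged sketch of the per-participant attack score: a weighted sum
# of the six per-size accuracies, rectified at zero. The weighting
# order (smaller sizes weighted more via w**i) is our guess.

def attack_score(accuracies, w=1.5):
    """accuracies: recognition accuracy for sizes S6 (largest) .. S1."""
    score = sum(acc * (w ** i) for i, acc in enumerate(accuracies))
    return max(0.0, score)

# Correlating scores with a candidate factor (e.g. light SNR) using
# plain Pearson correlation, one reasonable choice of metric.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)
```

A composite factor such as the reflection-environment light ratio should then yield a higher |pearson| against the attack scores than its raw inputs, which is the validation argument the paragraph makes.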

Authors:

(1) Yan Long, Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA ([email protected]);

(2) Chen Yan, College of Electrical Engineering, Zhejiang University, Hangzhou, China ([email protected]);

(3) Shilin Xiao, College of Electrical Engineering, Zhejiang University, Hangzhou, China ([email protected]);

(4) Shivan Prasad, Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA ([email protected]);

(5) Wenyuan Xu, College of Electrical Engineering, Zhejiang University, Hangzhou, China ([email protected]);

(6) Kevin Fu, Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA ([email protected]).


This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-NODERIVS 4.0 INTERNATIONAL license.

[4] Uniform lengths (e.g., all 8 words) are avoided to prevent the adversary from guessing the words by knowing how long the sentences are.
