AI Learns Common Sense from Touch, Not Just Vision

Authors:

(1) Samson Yu, Dept. of Computer Science, National University of Singapore ([email protected]);

(2) Kelvin Lin, Dept. of Computer Science, National University of Singapore;

(3) Anxing Xiao, Dept. of Computer Science, National University of Singapore;

(4) Jiafei Duan, University of Washington;

(5) Harold Soh, Dept. of Computer Science, National University of Singapore and NUS Smart Systems Institute ([email protected]).

VI. EXPERIMENTAL RESULTS

To address the above questions, we evaluated OCTOPI using (i) accuracy on the physical understanding tasks in PHYSICLEAR’s test set, (ii) accuracy on scenario reasoning tasks, (iii) task success rate on a real robot, and (iv) property prediction accuracy on unseen objects. We tested two versions of OCTOPI, OCTOPI-7b and OCTOPI-13b, which use Vicuna-7b v1.5 and Vicuna-13b v1.5 as their LLMs, respectively.

TABLE VII. Results on PHYSICLEAR Scenario Reasoning Tasks. During scenario reasoning, we do not provide ground-truth property descriptions. Our experiments show that leveraging object properties significantly improves scenario reasoning for OCTOPI.

A. Tactile-grounded Physical Understanding with Object Property Descriptions

During tactile feature alignment and end-to-end fine-tuning, we trained OCTOPI with comparison tasks (i.e. PC, PSS and POM) to align its physical understanding of our physical properties and objects with our labels. We evaluated OCTOPI’s physical understanding with the same single-step prompts used during training and on 500 question-answer pairs in total across the three tasks. The results for physical understanding of unseen test objects are shown in Table VI.

Our results show that both OCTOPI-7b and OCTOPI-13b perform well on all three physical understanding tasks when they are trained to predict property descriptions. Using physical property descriptions, OCTOPI-7b achieves accuracies of 48.10% on PC, 74.67% on PSS and 44.39% on POM. OCTOPI-13b outperforms OCTOPI-7b by 6.96% on PC, 9.33% on PSS and 16.04% on POM. This suggests that OCTOPI’s physical understanding improves significantly with LLM size.

Further, we explored the effect of using physical property descriptions by fine-tuning both OCTOPI-7b and OCTOPI-13b on the physical understanding tasks without intermediate physical property predictions. We found that predictions based on object properties notably improve physical understanding in both OCTOPI-7b and OCTOPI-13b.

B. Scenario Reasoning

We assessed the usefulness of our physical property categories by testing how OCTOPI can reason about everyday scenarios using the physical properties. For reference, the different scenario questions are provided in Table V with the prompts shown in Table IV.

Our results are summarized in Table VII. For both OCTOPI-7b and OCTOPI-13b, including the object property significantly improves performance, which supports our overall hypothesis that leveraging these properties is helpful for these tasks. Interestingly, we observed that the 7b model marginally outperformed the 13b model.

We provide two qualitative examples showing OCTOPI-13b performing commonsense physical reasoning effectively. In the first task, we provide a tactile video of a scoop of uncooked rice and first instruct OCTOPI-13b to describe the tactile video. We then follow up with an instruction to determine whether the rice is uncooked or cooked. OCTOPI-13b is able to reason that the scoop of rice is uncooked due to its rough surface, as shown in Fig. 4.

Next, we gave OCTOPI-13b two tactile videos corresponding to two different parts of the same toothbrush – the handle and the bristles. It is instructed to describe both objects using

Fig. 4. Rice (Cooked vs. Uncooked) Reasoning. OCTOPI-13b is prompted to reason about whether a scoop of rice is more likely to be cooked or uncooked based on a tactile video of a scoop of uncooked rice. It reasons about the rice state correctly without being trained to do so.

Fig. 5. Toothbrush Part Reasoning. Given a tactile video of a toothbrush’s handle and the same toothbrush’s bristles, OCTOPI-13b is prompted to reason which tactile readings belong to the handle and which belong to the bristles.

the physical properties. We then instruct it to determine which tactile video belongs to each object part using the physical properties. Fig. 5 shows that OCTOPI-13b is able to reason about the property-object match correctly.

C. Avocado Ripeness Classification

To evaluate OCTOPI’s usefulness as a tactile-grounded physical reasoning system for real-world tasks, we integrated two GelSight sensors on a 7-DoF Franka Emika Panda robot and used it for avocado ripeness classification. While ripe avocados generally appear in a shade of brown, their ripeness is difficult to determine using vision alone. At the same time, ripe avocados are softer than unripe ones; thus, tactile sensations can improve classification.

We performed property prediction and ripeness classification evaluations using a set of 10 avocados, with 20 tactile samples collected from each avocado (i.e., 200 total samples). During ripeness classification, 100 pairs of avocado samples were selected and OCTOPI was tasked with identifying which avocado is riper. Each pair contains avocados at different stages of ripeness. At test time, the Franka robot grasped each avocado once to collect the tactile readings before passing them to the model. The model gave instructions on which avocado to place in the ripe bin, and these were relayed through ROS for execution.

TABLE VIII. Avocado Property Prediction and Ripeness Classification Results. OCTOPI-13b predicts avocado properties reasonably well with only a pressing motion. For avocado ripeness classification, OCTOPI-13b is able to leverage its commonsense knowledge to use both hardness and bumpiness properties.
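The pairwise protocol above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the avocado count and ripeness stages are assumed, and `predict_riper` is a stub standing in for OCTOPI, which in reality consumes GelSight tactile videos rather than labels.

```python
import itertools
import random

random.seed(0)

# Assumed setup: 10 avocados, each assigned one of 5 ripeness stages (higher = riper).
ripeness = {i: i % 5 for i in range(10)}

def predict_riper(a, b):
    """Stub comparator mimicking a ~63%-accurate model; NOT the real OCTOPI."""
    truth = a if ripeness[a] > ripeness[b] else b
    other = b if truth == a else a
    return truth if random.random() < 0.63 else other

# Candidate pairs must contain avocados at *different* ripeness stages.
candidates = [p for p in itertools.combinations(range(10), 2)
              if ripeness[p[0]] != ripeness[p[1]]]
# Draw 100 comparisons (with replacement, since only 40 distinct pairs exist here).
pairs = random.choices(candidates, k=100)

correct = sum(predict_riper(a, b) == (a if ripeness[a] > ripeness[b] else b)
              for a, b in pairs)
accuracy = correct / len(pairs)
print(f"pairwise ripeness accuracy: {accuracy:.2%}")
```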

Table VIII shows OCTOPI-13b’s performance on both avocado property prediction and avocado ripeness classification. We split the evaluation into two parts, evaluating avocado property prediction and ripeness classification separately.

For avocado property prediction, OCTOPI-13b achieves an accuracy of 35.50%, which is significantly higher than that of the random baseline (3.70%). Unlike the PHYSICLEAR dataset, these tactile videos are collected with only pressing and without any rotation. This suggests that OCTOPI-13b’s physical property prediction capability is robust to differences in tactile exploratory procedures.
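The 3.70% random baseline is consistent with guessing each of the three three-way property categories independently, since a sample is counted correct only when all three guesses match:

```python
# Three properties, each with three categories; independent uniform guessing.
baseline = (1 / 3) ** 3  # = 1/27
print(f"{baseline:.2%}")  # 3.70%
```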

OCTOPI-13b obtains a ripeness prediction accuracy of 63.00%. To set up this task, we first elicit OCTOPI’s commonsense reasoning for avocado ripeness using our properties by querying “You will be given tactile descriptions that consist of three physical properties: hardness, roughness, bumpiness. Hardness categories in increasing hardness: soft, moderately hard, hard. Roughness categories in increasing roughness: smooth, slightly smooth, rough. Bumpiness categories in increasing bumpiness: no bumps, small bumps, big bumps. Which of these properties help to determine avocado ripeness? Rank them.” OCTOPI-13b responds with “The properties that help determine avocado ripeness are hardness and bumpiness. Roughness is not a reliable indicator of avocado ripeness. In terms of hardness, a ripe avocado will be moderately hard, while an unripe avocado will be hard. In terms of bumpiness, a ripe avocado will have small bumps, while an unripe avocado will have no bumps or small bumps.” This exchange is provided as context to OCTOPI-13b during ripeness classification.
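The elicitation exchange above can be prepended as dialogue context ahead of each classification query. The sketch below assumes a generic role/content chat format and a hypothetical final query wording; the quoted prompts are abridged (marked `[...]`), and the actual OCTOPI prompt structure may differ.

```python
# Abridged transcripts of the elicitation exchange quoted in the text.
ELICIT_Q = ("You will be given tactile descriptions that consist of three "
            "physical properties: hardness, roughness, bumpiness. [...] "
            "Which of these properties help to determine avocado ripeness? Rank them.")
ELICIT_A = ("The properties that help determine avocado ripeness are hardness "
            "and bumpiness. Roughness is not a reliable indicator of avocado "
            "ripeness. [...]")

def build_context(desc_a: str, desc_b: str) -> list[dict]:
    """Assemble chat turns for one pairwise ripeness comparison (assumed format)."""
    return [
        {"role": "user", "content": ELICIT_Q},
        {"role": "assistant", "content": ELICIT_A},
        {"role": "user", "content": (f"Avocado 1: {desc_a}\nAvocado 2: {desc_b}\n"
                                     "Based on the properties above, which avocado is riper?")},
    ]

ctx = build_context("hard, smooth, no bumps", "moderately hard, smooth, small bumps")
print(len(ctx))  # 3 turns
```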

For comparison, we evaluated a physically-grounded vision-language model, PG-InstructBLIP [17], on property predictions of our avocados. PG-InstructBLIP was trained to infer a predetermined set of physical properties from visual images of real objects in the EgoObjects dataset [65]. Table VIII shows PG-InstructBLIP’s performance on property prediction for our avocados was poor. Possible reasons for this are that (i) the definitions of the physical properties may not be well-aligned with PHYSICLEAR, and/or (ii) the physical properties of avocados are not clearly apparent using only the visual modality. We could not coax the PG-InstructBLIP model to directly classify avocado ripeness despite trying various prompts; it would always pick the first object.

TABLE IX. Results on PHYSICLEAR Object Property Description Test Set. FT CLIP is the combination of the fine-tuned CLIP visual encoder and the three separately trained classification layers. OCTOPI-7b and OCTOPI-13b perform above the random baseline for object property predictions and have similar performance to the fine-tuned CLIP. OCTOPI-13b performs better than OCTOPI-7b on the prediction task.

TABLE X. CLIP Fine-tuning Ablation Results on Object Property Prediction. FT refers to fine-tuned. Using the CLIP fine-tuned on property prediction improves OCTOPI’s performance in property prediction.

D. Object Property Description Prediction

The physical understanding and scenario reasoning capabilities of OCTOPI depend on its initial physical property predictions. We evaluated OCTOPI’s physical property prediction on the PHYSICLEAR test set and show the results in Table IX. Both OCTOPI-7b and OCTOPI-13b perform well above the random baseline for combined and individual property prediction and have similar performance to the fine-tuned CLIP model, indicating that OCTOPI can be used for object property prediction. OCTOPI-13b has a higher combined accuracy (i.e., all three physical properties are correctly predicted for a given object) when compared to OCTOPI-7b, suggesting there are performance gains with larger LLMs for tactile signal grounding (apart from the bumpiness property).
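The distinction between combined and individual accuracy can be illustrated as follows; the predictions and labels here are made up for illustration and are not PHYSICLEAR data.

```python
PROPS = ("hardness", "roughness", "bumpiness")

# Illustrative predictions and ground truth for two objects (assumed labels).
preds = [
    {"hardness": "hard", "roughness": "smooth", "bumpiness": "no bumps"},
    {"hardness": "soft", "roughness": "rough",  "bumpiness": "small bumps"},
]
truth = [
    {"hardness": "hard", "roughness": "smooth", "bumpiness": "no bumps"},
    {"hardness": "soft", "roughness": "smooth", "bumpiness": "small bumps"},
]

# Individual accuracy: each property scored independently.
per_prop = {p: sum(pr[p] == gt[p] for pr, gt in zip(preds, truth)) / len(truth)
            for p in PROPS}
# Combined accuracy: an object counts only if ALL three properties are correct.
combined = sum(all(pr[p] == gt[p] for p in PROPS)
               for pr, gt in zip(preds, truth)) / len(truth)

print(per_prop)  # {'hardness': 1.0, 'roughness': 0.5, 'bumpiness': 1.0}
print(combined)  # 0.5
```

Combined accuracy is the stricter metric, which is why it is the headline number for comparing OCTOPI-7b and OCTOPI-13b.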
