Go to the 1:21 mark to see what was, for me, the eureka moment. Note that "thirsty" does not appear anywhere in my code, just actions like "pick" and "place"; the word "bottle" comes from vision.
The robot stack contains only basic motion functions. The LLM does the higher-level "compute", connecting user intent, robotic affordances, and environmental context: exactly the kind of reasoning where robotics has always hit a ceiling.
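A minimal sketch of that pattern, with everything hypothetical (the skill names, the prompt shape, and the plan format are my illustrative assumptions, not the actual code from the demo): the robot exposes only its primitives, and the LLM is asked to map a free-form utterance plus vision output onto a sequence of those primitives.

```python
# Hypothetical sketch: primitives-only robot stack, LLM as the planner.
SKILLS = {"pick", "place"}  # the only motion functions the robot exposes

def build_prompt(user_utterance, detected_objects):
    """Compose the planning prompt sent to the LLM (assumed format)."""
    return (
        f"Available actions: {', '.join(sorted(SKILLS))}.\n"
        f"Objects seen by vision: {', '.join(detected_objects)}.\n"
        f"User said: {user_utterance!r}\n"
        "Reply with one action per line, e.g. pick(bottle)."
    )

def parse_plan(llm_reply):
    """Turn the LLM's text reply into (action, argument) tuples,
    keeping only actions the robot actually supports."""
    plan = []
    for line in llm_reply.strip().splitlines():
        action, _, rest = line.strip().partition("(")
        arg = rest.rstrip(")")
        if action in SKILLS:
            plan.append((action, arg))
    return plan

# Prompted with build_prompt("I'm thirsty", ["bottle", "cup"]),
# an LLM might answer with the two lines below; the robot never
# sees the word "thirsty", only its own primitives.
reply = "pick(bottle)\nplace(bottle)"
print(parse_plan(reply))  # [('pick', 'bottle'), ('place', 'bottle')]
```

The point of the split is that adding a new behavior needs no new robot code: the LLM just recombines the same primitives for a new intent.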
Imagine the possibilities - let's use it for good.