AmViSH 2.0 — Giving an LLM a body

This project, which is better described as an experiment, tries to give a Large Language Model (LLM) a body and a sense of its environment, and then observes what happens when it is put in a continuous loop of sensing, processing, and acting.

The Observation

When we think deeply about how we humans are actually "living", we arrive at an idea very close to the following:
  • We sense the world around us.
  • We process the environment.
  • We act on the result, and we learn from the events and their consequences.
  • We repeat this loop continuously, each time thinking/processing with an updated version of our opinions and beliefs.

An LLM is trained on a large corpus of text data that also carries the details of our day-to-day lives; it even "knows" how we live, since the terabytes of training data describe more of our lives and the world than any one of us knows.

But actually making the LLM control the actuators all by itself while also talking to the user is a very different thing, and that is what this experiment is about, because the LLM can only consume and produce text.

It is also about making the LLM treat itself as a part of the environment rather than just a text-processing machine, and then seeing how it behaves in such a scenario.

The Hypothesis

If we create a system of computers, sensors, and actuators, where the LLM is integrated as the decision-making layer, we can observe how it interacts with its environment and whether it exhibits intelligent behavior.

To allow the system (likely a robot😅) to both speak and act (i.e., move the actuators), we can use JSON formatting.


  Example LLM output format:
{"ACT":"<ACT>","SAY":"<text or null>"}

Where the ACT field contains a predefined action command for the robot, and the SAY field contains the text to be spoken or "null" if no speech is needed.

Note that the LLM is free to select any ACT from a predefined set of valid actions or it may also decide to do nothing (and maybe eat a Five Star⭐😁).
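
As a sketch of how such output could be consumed, the helper below parses the LLM's reply and falls back to a safe default when the reply is not valid JSON. The function name and the fallback values are assumptions for illustration, not part of the project:

```python
import json

# Hypothetical safe default; the real system may choose differently
SAFE_RESPONSE = {"ACT": "IDLE", "SAY": None}

def parse_llm_output(raw: str) -> dict:
    """Parse the LLM's JSON reply, falling back to a safe state on garbage."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(SAFE_RESPONSE)
    return {
        "ACT": data.get("ACT", "IDLE"),
        "SAY": data.get("SAY"),
    }

print(parse_llm_output('{"ACT":"WAVE_HAND","SAY":"Hello!"}'))
print(parse_llm_output("not json at all"))
```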

So here the project/experiment/system gets its two divisions: the software and the hardware.
The hardware consists of the computer and microcontrollers running STT, TTS, and the programs for the sensors and actuators, while the software runs the LLM API, handles memory and context management, and controls the robot states as a finite state machine (FSM).
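
The robot-state FSM mentioned above can be sketched as a transition table. The state and event names here are illustrative assumptions; the real FSM may use different ones:

```python
# Illustrative (state, event) -> next-state transitions
TRANSITIONS = {
    ("IDLE", "user_spoke"): "THINKING",
    ("THINKING", "act_ready"): "ACTING",
    ("THINKING", "say_ready"): "SPEAKING",
    ("ACTING", "act_done"): "IDLE",
    ("SPEAKING", "say_done"): "IDLE",
}

def step(state: str, event: str) -> str:
    # Unknown (state, event) pairs leave the state unchanged
    return TRANSITIONS.get((state, event), state)

state = "IDLE"
state = step(state, "user_spoke")  # -> THINKING
state = step(state, "act_ready")   # -> ACTING
print(state)
```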

Software Components

The software can be divided into the following components/modules:

  • STT (Speech to Text) — converts the user's words into text the LLM can consume
  • Sensor reading manager — lets the LLM access sensor readings on its own intent, and also automatically, depending on the role of the sensor
  • Brain LLM — LLM-based interpretation, producing output in the JSON format
  • SAY and ACT separator — separates the verbal responses from the physical actions, routing the ACTs to the microcontrollers and the SAYs to the TTS
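
A minimal sketch of the SAY/ACT separator: it splits one parsed response into the two channels. Here `send_to_mcu` and `speak` are stand-ins for the real microcontroller link and TTS, and skipping IDLE/NULL acts is an assumption:

```python
def route(response: dict, send_to_mcu, speak) -> None:
    """Split a parsed {"ACT": ..., "SAY": ...} response into its two channels."""
    act = response.get("ACT")
    say = response.get("SAY")

    if act and act not in ("NULL", "IDLE"):
        send_to_mcu(act)               # physical-action channel
    if say and str(say).lower() != "null":
        speak(say)                     # verbal-response channel

acts, sayings = [], []
route({"ACT": "WAVE_HAND", "SAY": "Hello!"}, acts.append, sayings.append)
route({"ACT": "IDLE", "SAY": "null"}, acts.append, sayings.append)
print(acts, sayings)
```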

Hardware Components

Outputs are constrained into two channels:

  • ACT: physical actions
  • SAY: verbal responses

Continuous Loop Design

The system operates as a non-blocking infinite loop.


while True:

  # 1. Sense: capture an audio chunk and transcribe it
  audio = record_chunk()
  text = transcribe(audio)

  # Skip empty or failed transcriptions
  if not text:
    continue

  # 2. Process: ask the LLM for a {"ACT": ..., "SAY": ...} decision
  response = llm_process(text)

  act = response.get("ACT", "IDLE")
  say = response.get("SAY", "null")

  # 3. Act: execute the physical action, then speak if needed
  handle_act(act)

  if say and say.lower() != "null":
    speak(say)

This loop exposes the LLM to continuous environmental interaction, timing variability, and non-deterministic inputs.

Adaptive Audio Thresholding

Fixed thresholds fail in real-world environments due to changing noise levels.

The system continuously estimates ambient noise using exponential smoothing and detects speech only when signal energy exceeds a dynamic threshold.


ALPHA = 0.1   # smoothing factor for the ambient estimate
MARGIN = 1.2  # how far above ambient a frame must be to count as speech

# Exponentially smoothed estimate of the background noise
ambient_energy = ALPHA * frame_energy + (1 - ALPHA) * ambient_energy

speech_threshold = ambient_energy * MARGIN

This reduces false triggers and improves robustness in noisy conditions.
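
Putting the pieces together, an energy-based detector might look like the sketch below. The frame-energy formula, the initial ambient value, and the policy of updating the ambient estimate only during non-speech frames are assumptions for illustration:

```python
ALPHA = 0.1   # smoothing factor for the ambient estimate
MARGIN = 1.2  # speech must exceed ambient by this factor

def frame_energy(samples) -> float:
    # Mean squared amplitude of one audio frame
    return sum(s * s for s in samples) / len(samples)

class SpeechDetector:
    def __init__(self, initial_ambient: float = 0.01):
        self.ambient = initial_ambient

    def is_speech(self, samples) -> bool:
        energy = frame_energy(samples)
        speech = energy > self.ambient * MARGIN
        if not speech:
            # Track the noise floor only when no speech is present,
            # so loud speech does not inflate the ambient estimate
            self.ambient = ALPHA * energy + (1 - ALPHA) * self.ambient
        return speech

det = SpeechDetector()
quiet = [0.01] * 160   # near-silence frame
loud = [0.5] * 160     # speech-like frame
print(det.is_speech(quiet), det.is_speech(loud))
```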

Speech Recognition Constraints

  • Duration limits applied
  • Language filtering enforced
  • Low-confidence outputs rejected

Only valid, high-confidence speech is processed, preventing unintended actions.
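
These constraints can be sketched as a simple gate in front of the LLM. The field names and the numeric limits below are assumptions, not the project's actual values:

```python
MIN_DURATION_S = 0.3   # assumed duration limits
MAX_DURATION_S = 15.0
MIN_CONFIDENCE = 0.6   # assumed confidence floor
ALLOWED_LANGS = {"en"}

def accept_transcript(text, duration_s, language, confidence) -> bool:
    """Return True only for valid, high-confidence speech."""
    if not text:
        return False
    if not (MIN_DURATION_S <= duration_s <= MAX_DURATION_S):
        return False
    if language not in ALLOWED_LANGS:
        return False
    return confidence >= MIN_CONFIDENCE

print(accept_transcript("move forward", 1.2, "en", 0.9))  # accepted
print(accept_transcript("??", 0.1, "en", 0.2))            # rejected
```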

LLM Control Constraints

The LLM operates in a strictly sandboxed environment.

  • Only predefined actions are allowed
  • Invalid outputs are replaced with safe states
  • All responses are parsed and validated

ALLOWED_ACTS = {
  "MOVE_FORWARD",
  "MOVE_BACKWARD",
  "TURN_LEFT",
  "TURN_RIGHT",
  "STOP_ALL",
  "WAVE_HAND",
  "IDLE",
  "NULL"
}

The system does not execute intelligence blindly. It validates every decision.
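
Validation against the allowed set can be sketched as follows; the choice of IDLE as the safe fallback state is an assumption for illustration:

```python
ALLOWED_ACTS = {
    "MOVE_FORWARD", "MOVE_BACKWARD", "TURN_LEFT", "TURN_RIGHT",
    "STOP_ALL", "WAVE_HAND", "IDLE", "NULL",
}

def validate_act(act) -> str:
    # Replace anything outside the sandbox with a safe state
    if isinstance(act, str) and act.upper() in ALLOWED_ACTS:
        return act.upper()
    return "IDLE"

print(validate_act("wave_hand"))      # normalised and accepted
print(validate_act("LAUNCH_ROCKET"))  # rejected, replaced by the safe state
```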

Dynamic Temperature Control

The LLM temperature is adjusted based on input type.

  • Low temperature → control commands (more deterministic)
  • High temperature → conversation (more flexible)

def select_temperature(command: bool, creative: bool) -> float:
  if command:
    return 0.25  # more deterministic for control commands
  if creative:
    return 0.7   # more flexible for conversation
  return 0.25    # default to the safer, deterministic setting

This reduces hallucinations during control tasks.

Fault Tolerance

  • Dynamic network discovery (UDP broadcast)
  • Subsystem isolation
  • Non-critical errors ignored

Failures in one component do not crash the entire system.
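
Dynamic discovery over UDP broadcast could be sketched as below. The port number, message strings, and reply format are assumptions for illustration, not the project's actual protocol:

```python
import socket

DISCOVERY_PORT = 50000            # assumed port
DISCOVERY_MSG = b"AMVISH_DISCOVER"

def make_announce(name: str, port: int) -> bytes:
    # Assumed reply format a subsystem sends back: "AMVISH_HERE <name> <port>"
    return f"AMVISH_HERE {name} {port}".encode()

def parse_announce(data: bytes):
    parts = data.decode(errors="replace").split()
    if len(parts) == 3 and parts[0] == "AMVISH_HERE" and parts[2].isdigit():
        return parts[1], int(parts[2])
    return None

def broadcast_probe(timeout: float = 1.0):
    """Send one discovery probe and collect (ip, name, port) replies until timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.settimeout(timeout)
    sock.sendto(DISCOVERY_MSG, ("255.255.255.255", DISCOVERY_PORT))
    found = []
    try:
        while True:
            data, addr = sock.recvfrom(1024)
            info = parse_announce(data)
            if info:
                found.append((addr[0], *info))
    except socket.timeout:
        pass
    finally:
        sock.close()
    return found
```

Because the probe simply times out when nobody answers, a missing subsystem degrades discovery instead of crashing it, which matches the subsystem-isolation goal above.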

Observed Behaviour

  • Stable operation in noisy environments
  • Recovery from network interruptions
  • Consistent identity handling

No long-term learning or adaptation was observed.

System behaviour depends on prompt design and inference latency.


Limitations

  • No raw sensor understanding by LLM
  • Dependent on text abstraction
  • Sensitive to prompt design
  • External inference latency

Conclusion

This work demonstrates that a Large Language Model can be integrated into a real-time robotic system only when surrounded by strict deterministic controls.

The model does not learn or understand the environment. Its role is limited to probabilistic interpretation within defined boundaries.

The contribution is not intelligence. The contribution is control.