This project, which is perhaps better called an experiment, tries to give a Large Language Model (LLM) a body and a sense of its environment, and then observes what happens when it is put in a continuous loop of sensing, processing, and acting.
If we create a system of computers, sensors, and actuators, where the LLM is integrated as the decision-making layer, we can observe how it interacts with its environment and whether it exhibits intelligent behavior.
To allow the system (likely a robot😅) to both speak and act (i.e., move the actuators), we can use JSON formatting.
Example LLM output format:
{"ACT":"<ACT>","SAY":"<text or null>"}
Where the ACT field contains a predefined action command for the robot, and the SAY field contains the text to be spoken or "null" if no speech is needed.
Note that the LLM is free to select any ACT from the predefined set of valid actions, or it may decide to do nothing (and maybe eat a Five Star⭐😁).
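Since nothing guarantees the model actually emits valid JSON, the first software step has to be defensive parsing. A minimal sketch (the function name and the IDLE fallback are my assumptions, not part of the format spec above):

```python
import json

def parse_llm_output(raw):
    """Parse the {"ACT": ..., "SAY": ...} envelope; fall back to a safe
    default when the model emits malformed JSON (a real failure mode)."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return "IDLE", None          # safe default on garbage output
    act = msg.get("ACT", "IDLE")
    say = msg.get("SAY")
    if say in (None, "null"):        # the format uses the string "null" for silence
        say = None
    return act, say
```

This keeps one bad generation from ever reaching the actuators.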
So here the project/experiment/system gets its two divisions: the software and the hardware.
The hardware consists of the computer and microcontrollers running STT, TTS, and the programs for the sensors and actuators, while the software runs the LLM API, handles memory and context management, and controls the robot's states as an FSM (finite-state machine).
The software can be divided into the following components/modules: the LLM interface, memory and context management, and the FSM-based state controller.
Outputs are constrained into two channels: ACT (actuator commands) and SAY (spoken text).
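The FSM that governs the robot's states could be sketched like this; the state names and the transition table below are illustrative assumptions, not the project's actual states:

```python
from enum import Enum, auto

class RobotState(Enum):
    LISTENING = auto()   # waiting for speech above the energy threshold
    THINKING = auto()    # LLM request in flight
    ACTING = auto()      # actuators executing the chosen ACT
    SPEAKING = auto()    # TTS playing the SAY text

# Legal transitions; anything else is rejected to keep behaviour predictable.
TRANSITIONS = {
    RobotState.LISTENING: {RobotState.THINKING},
    RobotState.THINKING: {RobotState.ACTING, RobotState.SPEAKING, RobotState.LISTENING},
    RobotState.ACTING: {RobotState.SPEAKING, RobotState.LISTENING},
    RobotState.SPEAKING: {RobotState.LISTENING},
}

def transition(current, target):
    """Return the new state, or stay put if the transition is not allowed."""
    return target if target in TRANSITIONS[current] else current
```

An explicit transition table means the LLM can request a state change but never force an illegal one.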
The system operates as a non-blocking infinite loop.
while True:
    audio = record_chunk()
    text = transcribe(audio)
    if not text:
        continue
    response = llm_process(text)
    act = response.get("ACT", "IDLE")
    say = response.get("SAY", "null")
    handle_act(act)
    if say and say.lower() != "null":
        speak(say)
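As written, the pseudocode above blocks on record_chunk(), so LLM latency would stall audio capture. One way to keep the loop genuinely non-blocking is a bounded queue fed by a capture thread; the helper names and queue size below are assumptions, not the project's actual implementation:

```python
import queue
import threading

audio_q = queue.Queue(maxsize=8)   # bounded: if the consumer stalls, new audio is dropped

def capture_worker(record_chunk, stop):
    """Background thread: keeps recording regardless of LLM latency."""
    while not stop.is_set():
        try:
            audio_q.put_nowait(record_chunk())
        except queue.Full:
            pass                    # dropping frames beats blocking the mic

def loop_step(transcribe, llm_process, handle_act, speak):
    """One iteration of the main loop; returns True if an utterance was handled."""
    try:
        audio = audio_q.get(timeout=0.1)
    except queue.Empty:
        return False                # nothing to do; the loop stays responsive
    text = transcribe(audio)
    if not text:
        return False
    response = llm_process(text)
    handle_act(response.get("ACT", "IDLE"))
    say = response.get("SAY")
    if say and str(say).lower() != "null":
        speak(say)
    return True
```

Factoring the body into loop_step also makes each iteration unit-testable in isolation.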
This loop exposes the LLM to continuous environmental interaction, timing variability, and non-deterministic inputs.
Fixed thresholds fail in real-world environments due to changing noise levels.
The system continuously estimates ambient noise using exponential smoothing and detects speech only when signal energy exceeds a dynamic threshold.
ALPHA = 0.1   # smoothing factor for the noise floor
MARGIN = 1.2  # speech must exceed the noise floor by 20%
ambient_energy = ALPHA * frame_energy + (1 - ALPHA) * ambient_energy
speech_threshold = ambient_energy * MARGIN
This reduces false triggers and improves robustness in noisy conditions.
Only valid, high-confidence speech is processed, preventing unintended actions.
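Putting the smoothing and the dynamic threshold together, a minimal noise gate might look like this; the class is an illustrative sketch, and the initial floor and energy units are assumptions:

```python
ALPHA = 0.1    # smoothing factor: how quickly the noise floor adapts
MARGIN = 1.2   # speech must exceed the noise floor by 20%

class NoiseGate:
    """Tracks ambient energy with an exponential moving average and
    flags frames whose energy clears the adaptive threshold."""

    def __init__(self, initial_floor=1.0):
        self.ambient_energy = initial_floor

    def is_speech(self, energy):
        threshold = self.ambient_energy * MARGIN
        speech = energy > threshold
        if not speech:
            # Only quiet frames update the floor, so speech itself
            # does not inflate the threshold.
            self.ambient_energy = ALPHA * energy + (1 - ALPHA) * self.ambient_energy
        return speech
```

Updating the floor only on non-speech frames is one common design choice; it stops a long utterance from raising the threshold against itself.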
The LLM operates in a strictly sandboxed environment.
ALLOWED_ACTS = {
    "MOVE_FORWARD",
    "MOVE_BACKWARD",
    "TURN_LEFT",
    "TURN_RIGHT",
    "STOP_ALL",
    "WAVE_HAND",
    "IDLE",
    "NULL"
}
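Enforcing the sandbox is then a matter of whitelisting before dispatch. A sketch under assumed names (handle_act and dispatch are not defined by the original spec):

```python
# The whitelist from above.
ALLOWED_ACTS = {"MOVE_FORWARD", "MOVE_BACKWARD", "TURN_LEFT", "TURN_RIGHT",
                "STOP_ALL", "WAVE_HAND", "IDLE", "NULL"}

def handle_act(act, dispatch):
    """Execute an action only if it is whitelisted; anything the model
    invents degrades to IDLE instead of reaching the motors."""
    if act not in ALLOWED_ACTS:
        act = "IDLE"
    if act in ("IDLE", "NULL"):
        return act                 # no actuator movement
    dispatch(act)                  # e.g. forward the command to the microcontroller
    return act
```

The key property is that the set of reachable motor commands is fixed at design time, regardless of what the model generates.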
The LLM temperature is adjusted based on input type.
if command:
    return 0.25
if creative:
    return 0.7
This reduces hallucinations during control tasks.
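One simple way to decide which regime applies is a keyword heuristic; the word list and function below are illustrative assumptions rather than the system's actual classifier:

```python
# Hypothetical command vocabulary; a real system might use intent
# classification instead of keyword matching.
COMMAND_WORDS = {"move", "turn", "stop", "go", "forward", "backward", "wave"}

def select_temperature(text):
    """Low temperature for motion commands, higher for open conversation."""
    words = set(text.lower().split())
    if words & COMMAND_WORDS:
        return 0.25   # near-deterministic decoding for control tasks
    return 0.7        # more variety for chit-chat
```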
Failures in one component do not crash the entire system.
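This isolation can be as simple as wrapping each pipeline stage so an exception degrades to a fallback value instead of killing the loop (a sketch, not the project's actual error handler):

```python
def safe_call(fn, fallback, *args, **kwargs):
    """Run one pipeline stage; on any error, log and return a fallback
    so the main loop keeps running instead of crashing the robot."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        print(f"[WARN] {fn.__name__} failed: {exc}")
        return fallback

# e.g. a TTS crash degrades to silence, not a dead robot:
# safe_call(speak, None, "hello")
```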
No long-term learning or adaptation was observed.
System behaviour depends on prompt design and inference latency.
This work demonstrates that a Large Language Model can be integrated into a real-time robotic system only when surrounded by strict deterministic controls.
The model does not learn or understand the environment. Its role is limited to probabilistic interpretation within defined boundaries.