GPT-4 robot development

This article documents the interactive development of a GPT-4 robot based on Mini Pupper. Development progress will be updated here as it happens, and the final program will be open sourced in the github gpt4_ros and github mini pupper projects.

Mini Pupper GPT-4 development project

Abstract

Important features

  1. The robot asks questions to make the interaction more engaging; it genuinely feels like you are chatting with a real person.
  2. Exit mode: the voice command audio[hello loona] exits ChatGPT mode.
  3. Nothing in this video is pre-scripted: the model is given a basic prompt describing Ameca, giving the robot a description of itself; it is pure AI.

Technology alternatives

1. For processing

[Video Process] Amazon KVS:

Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), playback, and other processing. Kinesis Video Streams automatically provisions and elastically scales all the infrastructure needed to ingest streaming video data from millions of devices. It durably stores, encrypts, and indexes video data in your streams, and allows you to access your data through easy-to-use APIs. Kinesis Video Streams enables you to playback video for live and on-demand viewing, and quickly build applications that take advantage of computer vision and video analytics through integration with Amazon Rekognition Video, and libraries for ML frameworks such as Apache MxNet, TensorFlow, and OpenCV. Kinesis Video Streams also supports WebRTC, an open-source project that enables real-time media streaming and interaction between web browsers, mobile applications, and connected devices via simple APIs. Typical uses include video chat and peer-to-peer media streaming.
Overall, Amazon Kinesis Video Streams (KVS) is used for securely streaming, storing, indexing, and processing video data from connected devices at scale, enabling analytics, machine learning, playback, and the development of applications that leverage computer vision and real-time media streaming.

  1. Data Collection (at the camera/device): The process starts with the camera or other video devices capturing images.

  2. Data Upload (from the device to the cloud): The data is then securely transmitted to Amazon Kinesis Video Streams.

  3. Data Storage and Indexing (in the cloud on AWS servers): KVS automatically stores, encrypts, and indexes the video data, making it accessible through APIs.

  4. Data Processing (in the cloud on AWS servers): The video data is processed and analyzed using Amazon Rekognition Video or other machine learning frameworks such as Apache MxNet, TensorFlow, and OpenCV. In this case, the processing would include facial recognition.

  5. Result Presentation (on the user’s display device): The processed results (for example, recognized faces) can be retrieved via APIs and displayed on the user’s screen through an application.

The main components involved in this process include the video device, Amazon KVS, the machine learning/facial recognition system, and the user interface.
Amazon Kinesis Video Streams
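
As a concrete illustration of steps 3 and 4, the sketch below shows one way the facial-recognition stage could be wired up: a Rekognition Video stream processor that reads frames from a KVS stream and searches them against a face collection. This is a minimal sketch using boto3; the stream, collection, and role identifiers are placeholders, not resources from this project.

import boto3

# Placeholder identifiers (hypothetical, not resources from this project).
KVS_STREAM_ARN = "arn:aws:kinesisvideo:REGION:ACCOUNT:stream/mini-pupper-camera/123"
OUTPUT_KDS_ARN = "arn:aws:kinesis:REGION:ACCOUNT:stream/face-match-results"
FACE_COLLECTION_ID = "known-faces"
ROLE_ARN = "arn:aws:iam::ACCOUNT:role/RekognitionKvsAccess"

rekognition = boto3.client("rekognition")

# Steps 3-4: Rekognition reads frames directly from the KVS stream, searches them
# against a face collection, and writes matches to a Kinesis Data Stream that the
# application (step 5) can consume and display.
rekognition.create_stream_processor(
    Name="mini-pupper-face-search",
    Input={"KinesisVideoStream": {"Arn": KVS_STREAM_ARN}},
    Output={"KinesisDataStream": {"Arn": OUTPUT_KDS_ARN}},
    RoleArn=ROLE_ARN,
    Settings={"FaceSearch": {"CollectionId": FACE_COLLECTION_ID,
                             "FaceMatchThreshold": 85.0}},
)
rekognition.start_stream_processor(Name="mini-pupper-face-search")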

[Command Process] code-as-policies

Large language models (LLMs) trained on code-completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be re-purposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example language commands (formatted as comments) followed by corresponding policy code (via few-shot prompting), LLMs can take in new commands and autonomously re-compose API calls to generate new policy code respectively. By chaining classic logic structures and referencing third-party libraries (e.g., NumPy, Shapely) to perform arithmetic, LLMs used in this way can write robot policies that (i) exhibit spatial-geometric reasoning, (ii) generalize to new instructions, and (iii) prescribe precise values (e.g., velocities) to ambiguous descriptions (“faster”) depending on context (i.e., behavioral commonsense). This paper presents code as policies: a robot-centric formalization of language model generated programs (LMPs) that can represent reactive policies (e.g., impedance controllers), as well as waypoint-based policies (vision-based pick and place, trajectory-based control), demonstrated across multiple real robot platforms. Central to our approach is prompting hierarchical code-gen (recursively defining undefined functions), which can write more complex code and also improves state-of-the-art to solve 39.8% of problems on the HumanEval [1] benchmark. Code and videos are available at https://code-as-policies.github.io
code-as-policies
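
To make the few-shot prompting idea concrete, here is a minimal sketch of what such a prompt and the generated policy code could look like. The perception and control primitives (get_object_position, move_to) are hypothetical stand-ins assumed for this sketch, not the APIs used in the paper.

# Few-shot prompt: each natural language command is a comment, followed by the
# policy code that implements it. get_object_position and move_to are
# hypothetical robot primitives assumed for this sketch.
PROMPT = """
# move 10 cm to the right of the red block
pos = get_object_position("red block")
move_to(pos + np.array([0.10, 0.0, 0.0]))

# move to the blue bowl a bit faster
pos = get_object_position("blue bowl")
move_to(pos, speed=0.8)
"""

# A new command, e.g. "put the gripper halfway between the two blocks", is appended
# as another comment; the LLM then re-composes the same primitives, for example:
#
#   p1 = get_object_position("red block")
#   p2 = get_object_position("blue block")
#   move_to((p1 + p2) / 2)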

[ROS Command Process] ROSGPT by Anis Koubaa

ROSGPT is a pioneering approach that combines the power of ChatGPT and ROS (Robot Operating System) to redefine human-robot interaction. By leveraging large language models like ChatGPT, ROSGPT enables the conversion of unstructured human language into actionable robotic commands. This repository contains the implementation of ROSGPT, allowing developers to explore and contribute to the project.
youtube rosgpt
github rosgpt

Paper: Anis Koubaa. “ROSGPT: Next-Generation Human-Robot Interaction with ChatGPT and ROS”, to appear.

2. For voice input and output

[ASR] Amazon Lex:

Amazon Lex is a service provided by Amazon Web Services (AWS) that enables developers to build conversational interfaces for applications using voice and text. It uses advanced deep learning algorithms in natural language understanding (NLU) and automatic speech recognition (ASR) to interpret user input and convert it into actionable commands or responses.

Amazon Lex is designed to make it easy for developers to create chatbots or virtual assistants for various applications, such as customer support, task automation, or even as part of a larger AI-driven system. With Amazon Lex, developers can build and deploy conversational interfaces on platforms like mobile devices, web applications, and messaging platforms.

The service provides tools for designing conversation flows, defining intents and slots, and specifying prompts and responses. It also integrates with other AWS services, such as AWS Lambda, to facilitate processing of user input and execution of backend logic.
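
A minimal sketch of sending a single text utterance to an already-built Lex V2 bot with boto3; the bot IDs and the sample utterance are placeholders, since the original text does not specify a bot.

import boto3

# Placeholder identifiers for an already-built Lex V2 bot.
lex = boto3.client("lexv2-runtime")
response = lex.recognize_text(
    botId="BOT_ID",
    botAliasId="BOT_ALIAS_ID",
    localeId="en_US",
    sessionId="mini-pupper-session",
    text="walk forward one meter",
)

# The recognized intent and slots would drive the robot command logic.
for interpretation in response.get("interpretations", []):
    print(interpretation["intent"]["name"], interpretation["intent"].get("slots"))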

[ASR] Whisper openai

Whisper is an automatic speech recognition (ASR) system developed by OpenAI. ASR technology is designed to convert spoken language into written text, and Whisper aims to provide high-quality and accurate speech recognition. It is trained on a massive amount of multilingual and multitask supervised data collected from the web.

Whisper itself covers only speech recognition. Speech synthesis, also known as text-to-speech (TTS), is the separate process of converting written text into audible speech, so a TTS system needs to be paired with Whisper to give the robot a voice.

By combining the capabilities of both ASR and TTS systems, it becomes possible to create applications and tools that facilitate communication, improve accessibility, and enable voice-activated systems like virtual assistants, transcription services, and more.
official whisper openai
github whisper openai
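
A minimal transcription sketch using the open-source whisper Python package; the audio file name is a placeholder.

import whisper

# Load a small pretrained model and transcribe a recorded voice command.
model = whisper.load_model("base")
result = model.transcribe("voice_command.wav")  # placeholder file name
print(result["text"])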

[TTS] ElevenLabs

Prime Voice AI
The most realistic and versatile AI speech software, ever. Eleven brings the most compelling, rich and lifelike voices to creators and publishers seeking the ultimate tools for storytelling.
ElevenLabs
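
A minimal sketch of calling the ElevenLabs text-to-speech REST endpoint with requests; the API key and voice ID are placeholders, and the request fields should be checked against the current ElevenLabs API documentation.

import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"   # placeholder
VOICE_ID = "YOUR_VOICE_ID"            # placeholder

# Assumed endpoint shape: POST /v1/text-to-speech/<voice_id> returns audio bytes.
resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={"text": "Hello, I am Mini Pupper."},
)
resp.raise_for_status()
with open("reply.mp3", "wb") as f:
    f.write(resp.content)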

[local TTS/maybe online TTS] waveglow nvidia

In our recent paper, we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.

Our PyTorch implementation produces audio samples at a rate of 1200 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.

Visit our website for audio samples.
github waveglow nvidia
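
A minimal inference sketch, assuming the pretrained WaveGlow checkpoint that NVIDIA publishes on PyTorch Hub and an available GPU; the random mel-spectrogram stands in for the output of an upstream text-to-mel model such as Tacotron 2.

import torch

# Load the pretrained WaveGlow vocoder from PyTorch Hub (assumed entry point name).
waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_waveglow")
waveglow = waveglow.remove_weightnorm(waveglow).to("cuda").eval()

# Dummy 80-band mel-spectrogram standing in for Tacotron 2 output.
mel = torch.randn(1, 80, 200, device="cuda")

with torch.no_grad():
    audio = waveglow.infer(mel, sigma=0.666)  # one waveform per batch element
print(audio.shape)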

[TTS] Text to speech Google cloud

Google Text-to-Speech (TTS) is a speech synthesis technology developed by Google as part of its suite of machine learning and artificial intelligence services. It converts written text into natural-sounding speech, allowing applications and devices to provide auditory output instead of relying solely on visual text displays. Google Text-to-Speech is used in various applications, such as virtual assistants, navigation systems, e-book readers, and accessibility tools for people with visual impairments.

Google’s TTS system is based on advanced machine learning techniques, including deep neural networks, which enable it to generate human-like speech with appropriate intonation, rhythm, and pacing. The technology supports multiple languages and offers a range of voices to choose from, allowing developers to customize the user experience in their applications.

Developers can integrate Google Text-to-Speech into their applications using the Google Cloud Text-to-Speech API, which provides a straightforward way to access and utilize the TTS functionality in their products and services.
text to speech Google cloud
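
A minimal synthesis sketch using the google-cloud-texttospeech Python client; the text and output file name are placeholders.

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(text="Hello, I am Mini Pupper.")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)
with open("reply.mp3", "wb") as f:
    f.write(response.audio_content)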

Reference

[Two Wheeled Robot] Loona by KEYi Tech

1. Answer any question
2. Tell interactive stories
3. Play a variety of games

More than 700 emoji expressions
state -> audio/robot reaction
loop:
video/audio -> AWS Lex -> robot -> AWS KVS -> feedback
audio[chatgpt] -> robot reaction[thinking][screen display state change]
audio[hello loona] -> robot reaction[shake head]
problem:
Loona regards all voice input as a question
youtube loona
kickstarter loona
facebook loona
official loona

[Two Wheeled Robot] Sarcastic robot powered by GPT-4

github selenasun1618/GPT-3PO
youtube Sarcastic robot

[Humanoid Robot] ameca human-like robot

This #ameca demo couples automated speech recognition with GPT-3 - a large language model that generates meaningful answers - the output is fed to an online TTS service which generates the voice and visemes for lip-sync timing. The team at Engineered Arts Ltd pose the questions.
youtube ameca
youtube Ameca expressions with GPT3 / 4

[Quadruped Robot Dog] Boston Dynamics Spot by Underfitted

file[json] -> GPT
Explain what the structure is and how to read the JSON.
GPT can read the JSON and answer questions such as "What is the battery level?" or "What is the next mission?"
Describe the [next mission, such as a location].
One prompt is the system prompt; the other comes from Whisper (a minimal sketch of this pattern follows the link below).
question:[Spot, are you standing up? yes or no] -> robot reaction[shake head]
youtube Boston Dynamics
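
A minimal sketch (not the code from the video) of the prompt pattern described above, assuming the pre-1.0 openai Python package, GPT-4 API access, and a placeholder robot_state.json file.

import json
import openai  # pre-1.0 interface; OPENAI_API_KEY read from the environment

# Placeholder robot-state file, e.g. {"battery": 87, "missions": ["dock", "lobby"]}.
with open("robot_state.json") as f:
    robot_state = json.load(f)

# System prompt: explain what the structure is and how to read the JSON.
system_prompt = (
    "You are the voice of a quadruped robot. The JSON below is your current state: "
    "'battery' is a charge percentage and 'missions' is the ordered list of upcoming "
    "waypoints. Answer questions about it accurately.\n" + json.dumps(robot_state)
)

# User prompt: the transcript produced by Whisper.
user_prompt = "What is your battery level and what is your next mission?"

reply = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
print(reply.choices[0].message.content)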

[Biped Servo Robot] EMO desktop Robot

official emo

[robotic arm] GPT robotic arm by vlangogh

twitter robotic arm

Progress

Amazon KVS

The facial-recognition pipeline over KVS follows the same five-step flow described in the Technology alternatives section above: capture on the camera/device, secure upload to Kinesis Video Streams, storage and indexing on AWS, processing with Amazon Rekognition Video or another ML framework, and presentation of the results to the user through the application.

ROSGPT

  1. Utilize LLMs (specifically ChatGPT) for prompt engineering, taking advantage of features such as ability induction, thought chaining, and command conditioning.
  2. Develop ontologies to transform unstructured natural language commands into structured robot instructions specific to the application context.
  3. Utilize the zero-shot and few-shot learning capabilities of LLMs to elicit structured robot instructions from unstructured human language input.
  4. Well-known LLMs include OpenAI’s GPT-3 [6] and GPT-4 [9,10], Google’s BERT [11] and T5 [12].
  5. The on-the-fly learning capabilities of LLMs are built on prompt engineering techniques.
  6. Structured command data is typically represented in the standard JSON format.
  7. GPTROSProxy: the prompt engineering module, which applies prompt engineering methods to the unstructured text input.
  8. The robot needs to move or rotate. Precisely defining movement and rotation commands requires developing ontologies that cover domain-specific concepts such as Robot Motion and Robot Rotation. To fully describe these commands, the ontology must also contain key parameters such as Distance, Linear Velocity, Direction, Angle, Angular Velocity, and Orientation. By leveraging such an ontology, natural language commands can be structured more accurately and consistently, improving the performance and reliability of the robotic system.
  9. The ROSParser module is a key component of the ROSGPT system, responsible for taking the structured data extracted from unstructured commands and converting it into executable code (a minimal parsing sketch appears after this list and the JSON examples). From a software engineering perspective, ROSParser can be viewed as middleware that facilitates communication between the high-level processing modules and the low-level robot control modules. It is designed to interface with ROS nodes and is responsible for controlling low-level robotic hardware components, such as motor controllers or sensors, using predefined ROS programming primitives.
  10. Considering the navigation use case above, an ontology can be proposed to capture the basic concepts, relationships and properties related to the spatial navigation task, as shown in Figure 3. The ontology states that this use case has three actions: move, goal navigation, and rotate, where each action has one or more parameters that can be inferred from the speech prompt. It includes concepts such as position, motion and speed, and can be viewed as the limited range of possible robot actions from which human commands are expected to be inferred. It therefore maps to the following ROS 2 primitives:
     • move(linear velocity, distance, forward or not)
     • rotate(angular velocity, angle, clockwise or not)
     • go_to_goal(position)
     JSON-serialized structured command design: with this ontology and these ROS 2 primitives in mind, we propose a JSON serialization format that can be used for prompt engineering with ChatGPT.
{
  "action": "go_to_goal",
  "params": {
    "location": {
      "type": "str",
      "value": "Kitchen"
    }
  }
}

{
  "action": "move",
  "params": {
    "linear_speed": 0.5,
    "distance": distance,
    "is_forward": true,
    "unit": "meter"
  }
}

{
  "action": "rotate",
  "params": {
    "angular_velocity": 0.35,
    "angle": 40,
    "is_clockwise": is_clockwise,
    "unit": "degrees"
  }
}
  11. Even though the specific action "take a photo" is not defined in the learning samples, ChatGPT still generates it. This highlights a potential limitation of the model in interpreting and generating contextually accurate commands without the guidance of a structural framework such as an ontology. Without ontological keywords, ChatGPT generates actions based on its own understanding, which may not always be consistent with the actions defined in the learning samples, so the output may not comply with their specific constraints and requirements.
  12. Applying an ontology effectively constrains the model's output, ensuring that it is consistent with the desired schema and adheres to the requirements of the specific context. This experiment illustrates the important role of ontologies in guiding large language models such as ChatGPT to generate contextually relevant and accurate structured robot commands. By incorporating ontologies and other structured frameworks during training and fine-tuning, the model's ability to generate outputs that fit the application scenario and comply with predefined constraints and requirements can be significantly improved.
  13. The package consists of the ROS 2 Python code for the ROSGPT REST server and its corresponding ROS 2 node, plus a web application that uses the Web Speech API to convert human speech into text commands. The web application communicates with ROSGPT through its REST API to submit text commands, and ROSGPT uses the ChatGPT API to convert the human text into the JSON-serialized commands the robot uses to move or navigate. The authors intend ROSGPT to be an open platform for further developments in human-robot interaction.
  14. An ontology is a formal representation of what is known in a specific domain, allowing for a more consistent and explicit understanding of the relationships between entities and concepts. By leveraging this structured knowledge, the ROSGPT system can more efficiently process natural language commands and translate them into precise, executable robot actions.
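
To make the ROSParser idea from item 9 concrete, here is a minimal rclpy sketch (not the actual ROSGPT code) that maps the "move" and "rotate" JSON commands above onto geometry_msgs/Twist messages on cmd_vel; the topic name and the stopping logic are simplified assumptions.

import json

import rclpy
from geometry_msgs.msg import Twist
from rclpy.node import Node


class SimpleROSParser(Node):
    """Minimal stand-in for the ROSParser module: JSON command -> Twist message."""

    def __init__(self):
        super().__init__("simple_ros_parser")
        self.cmd_pub = self.create_publisher(Twist, "cmd_vel", 10)

    def execute(self, command_json: str):
        cmd = json.loads(command_json)
        twist = Twist()
        if cmd["action"] == "move":
            speed = float(cmd["params"]["linear_speed"])
            twist.linear.x = speed if cmd["params"]["is_forward"] else -speed
        elif cmd["action"] == "rotate":
            omega = float(cmd["params"]["angular_velocity"])
            twist.angular.z = -omega if cmd["params"]["is_clockwise"] else omega
        self.cmd_pub.publish(twist)
        # Stopping after the requested distance/angle (and go_to_goal) is omitted here.


def main():
    rclpy.init()
    node = SimpleROSParser()
    node.execute('{"action": "move", "params": {"linear_speed": 0.5, '
                 '"distance": 1.0, "is_forward": true, "unit": "meter"}}')
    rclpy.shutdown()


if __name__ == "__main__":
    main()

In a full system the parser would also gate the motion by time or odometry to cover the requested distance or angle, and go_to_goal would hand off to a navigation stack such as Nav2.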
