System architecture

From perception to robotic execution.

The system links YOLO11 perception, RGB-D localization, coordinate transformation, and robot control into one sorting pipeline.
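The chain of stages can be sketched as a minimal control loop. All function bodies below are mock placeholders (assumed names, not the project's actual code); each real stage is described in the data-flow section.

```python
# Minimal end-to-end pipeline sketch. Every stage here is a stand-in
# with illustrative numbers, not the project's real implementation.

def detect(frame):
    # Stand-in for YOLO11 inference: returns (class_label, bbox_center_px).
    return "bottle", (320, 240)

def depth_to_camera_xyz(center_px, depth_m=0.5):
    # Stand-in for depth sampling + pinhole deprojection.
    u, v = center_px
    return (0.01 * (u - 320) * depth_m, 0.01 * (v - 240) * depth_m, depth_m)

def camera_to_robot(xyz_cam):
    # Stand-in for the calibrated camera-to-robot rigid transform.
    x, y, z = xyz_cam
    return (x + 0.2, y, z + 0.1)

def sort_one(frame):
    label, center = detect(frame)
    target = camera_to_robot(depth_to_camera_xyz(center))
    bin_name = "recyclables" if label in ("bottle", "cup") else "other"
    return label, target, bin_name
```

Calling `sort_one(frame)` once runs a single perceive-localize-decide cycle; the real system repeats this per frame while the arm tracks the target.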

Sensor layer

RealSense D435i input*

RGB and depth streams define the base observation layer for bottle and cup sorting in real time.

Intel RealSense D435 depth camera on a tripod.

* External reference image. Source: Marc Auledas, CC BY-SA 4.0.

Perception layer

Improved YOLO11 detection*

Model inference produces bottle and cup class labels with 2D bounding boxes, which the system then uses to estimate each target's physical position.


Object detection example with labeled bounding boxes.

* External reference image. Source: Hughesperreault, CC BY-SA 4.0.

Calibration layer

Coordinate transformation*

Depth-aware visual data is transformed into the robot's coordinate frame so the manipulator can act on detections in physical space.

Chessboard calibration setup illustration.

* External reference image. Source: Ibai Gorordo, CC BY-SA 4.0.
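One common way to recover the camera-to-robot transform from calibration-board observations is a least-squares rigid fit (the Kabsch algorithm) over paired 3D points. This sketch assumes the corresponding point sets have already been collected; it is not necessarily the procedure the project used.

```python
import numpy as np

def fit_rigid_transform(cam_pts, rob_pts):
    """Least-squares rigid fit (Kabsch): find R, t with rob ≈ R @ cam + t.

    cam_pts, rob_pts: (N, 3) arrays of corresponding 3D points,
    e.g. chessboard corners seen by the camera and touched by the robot.
    """
    cam_c = cam_pts.mean(axis=0)
    rob_c = rob_pts.mean(axis=0)
    H = (cam_pts - cam_c).T @ (rob_pts - rob_c)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])                    # guard against reflections
    R = Vt.T @ D @ U.T
    t = rob_c - R @ cam_c
    return R, t
```

At least three non-collinear correspondences are needed; in practice many board corners are averaged, which also dampens measurement noise.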

Action layer

Seek-and-follow + grasp execution*

The arm aligns above the target, adapts to height changes, and then executes categorized grasp-and-place behavior.

Service robot grasping an object on a table.

* External reference image. Source: Paul Beaudry, CC BY-SA 2.0.

System visuals

Calibration and perception evidence.

These visuals show the calibration setup and live perception outputs used during robot operation.

Calibration

Using the calibration board

This calibration procedure establishes the relationship between camera observations and robot coordinates.

Team member using the calibration board with the robot and camera for coordinate alignment.

Setup

Calibration board installation

This setup image documents the physical preparation needed before coordinate transformation and end-to-end motion could be trusted.

Team members positioning the calibration board in the robot workspace.

Control evidence

Seek-and-follow arm motion

This clip isolates the alignment stage, showing how the arm maintains a controlled relationship to the target before grasp execution.

Perception evidence

Live detection and localization

This detection clip shows the perception output directly, including object labels and the targeting information used by the rest of the pipeline.

Data flow

The core technical chain.

1. RGB-D stream

Sensor frames arrive from the D435i and establish the live scene.

Input: color + depth
2. YOLO11 detection

Bottle and cup targets are classified with live bounding boxes.

Output: class + bbox
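Once detections are in, the pipeline needs a single target. A small helper can filter for the sorting classes, keep the highest-confidence hit, and hand its box center to the depth stage. The tuple layout here is an assumption about how one might flatten YOLO11 results, not the project's exact data structure.

```python
# Picks the highest-confidence bottle/cup detection and returns its
# pixel center. Each detection is (label, confidence, (x1, y1, x2, y2)),
# as one might flatten from YOLO11 result boxes.

TARGET_CLASSES = {"bottle", "cup"}

def pick_target(detections, min_conf=0.5):
    best = None
    for label, conf, (x1, y1, x2, y2) in detections:
        if label in TARGET_CLASSES and conf >= min_conf:
            if best is None or conf > best[1]:
                best = (label, conf, ((x1 + x2) / 2.0, (y1 + y2) / 2.0))
    return best  # None when nothing usable is in frame
```

Returning `None` on an empty frame lets the control loop simply skip a tick instead of acting on a stale target.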
3. Depth query

Depth is sampled around the target to estimate camera-space XYZ.

Output: camera XYZ
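The depth-to-XYZ step is standard pinhole back-projection; pyrealsense2 provides `rs2_deproject_pixel_to_point` for this, and the math it performs looks like the sketch below. Sampling a small window and taking the median is one common way to reject depth holes and outliers; the window size and helper names are assumptions.

```python
import statistics

def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth z into camera XYZ.

    fx, fy, cx, cy are the color/depth intrinsics reported by the camera.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m

def sample_depth(depth_lookup, u, v, win=2):
    """Median depth (m) over a (2*win+1)^2 window around the box center."""
    vals = [depth_lookup(u + du, v + dv)
            for du in range(-win, win + 1)
            for dv in range(-win, win + 1)]
    vals = [z for z in vals if z > 0]  # drop invalid zero-depth pixels
    return statistics.median(vals) if vals else None
```

`depth_lookup` stands in for reading the aligned depth frame; invalid RealSense depth pixels read as zero, which is why they are filtered before the median.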
4. Coordinate transform

Visual coordinates are converted into actionable robot-space targets.

Output: robot target
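Applying the calibrated transform is a homogeneous matrix multiply. A minimal sketch, assuming the rotation `R` and translation `t` came out of the calibration step:

```python
import numpy as np

def make_T(R, t):
    """Assemble a 4x4 homogeneous camera-to-robot transform from R, t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def to_robot(T, xyz_cam):
    """Map a camera-space point into the robot's coordinate frame."""
    p = np.append(np.asarray(xyz_cam, float), 1.0)   # homogeneous coords
    return (T @ p)[:3]
```

Keeping the transform as one 4x4 matrix means chained frames (camera to flange to base, for example) compose by plain matrix multiplication.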
5. Seek and follow

The arm tracks the object and maintains a proper approach distance.

Action: align + approach
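Seek-and-follow behavior of this kind is often a proportional controller toward a hover point above the target, with a per-tick step limit so the arm moves smoothly and adapts as the target's height estimate changes. The gains and hover offset below are illustrative assumptions.

```python
def follow_step(current, target, hover=0.10, gain=0.5, max_step=0.02):
    """One control tick toward a point `hover` metres above the target.

    current, target: (x, y, z) in robot coordinates (metres).
    Returns the next commanded end-effector position.
    """
    goal = (target[0], target[1], target[2] + hover)
    nxt = []
    for c, g in zip(current, goal):
        d = gain * (g - c)
        d = max(-max_step, min(max_step, d))  # clamp per-axis step size
        nxt.append(c + d)
    return tuple(nxt)
```

Run once per control tick with a fresh target estimate, the clamp bounds the arm's speed while the proportional term keeps it centered over a moving object before the grasp is triggered.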
6. Grasp and place

The dexterous hand grips with tuned force and releases into the correct bin.

Feedback: logs + operator view
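Category-dependent grasping reduces to a lookup from detected class to grip force and destination bin. The force values and bin names below are illustrative assumptions, not the project's tuned parameters.

```python
# Maps the detected class to a grip force and a destination bin.
# Numbers and names are placeholders for the tuned per-category values.

GRASP_PLAN = {
    "bottle": {"force_n": 8.0, "bin": "bottle_bin"},
    "cup":    {"force_n": 5.0, "bin": "cup_bin"},
}

def plan_grasp(label):
    plan = GRASP_PLAN.get(label)
    if plan is None:
        raise ValueError(f"no grasp plan for class {label!r}")
    return plan["force_n"], plan["bin"]
```

Failing loudly on an unknown class keeps an unexpected detection from triggering an unplanned grasp.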

Interface boundary

Operator visibility and runtime feedback

The web layer helps operators and judges understand what the robot is seeing, tracking, and doing during the run.

Validation boundary

Subsystem validation before full runs

Each subsystem was validated separately before end-to-end runs, including detection, localization, calibration, and manipulator motion.

Expansion boundary

Prototype first, deployment later

Current results focus on prototype validation, while broader deployment, additional categories, and longer autonomous runs remain future work.