System architecture

From perception to robotic execution.

The system links YOLO11 perception, RGB-D localization, coordinate transformation, and robot control into one sorting pipeline.
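The chain of stages can be sketched as a minimal control loop. All function bodies below are mock placeholders (assumed names, not the project's actual code); each real stage is described in the data-flow section.

```python
# Minimal end-to-end pipeline sketch. Every stage here is a stand-in
# with illustrative numbers, not the project's real implementation.

def detect(frame):
    # Stand-in for YOLO11 inference: returns (class_label, bbox_center_px).
    return "bottle", (320, 240)

def depth_to_camera_xyz(center_px, depth_m=0.5):
    # Stand-in for depth sampling + pinhole deprojection.
    u, v = center_px
    return (0.01 * (u - 320) * depth_m, 0.01 * (v - 240) * depth_m, depth_m)

def camera_to_robot(xyz_cam):
    # Stand-in for the calibrated camera-to-robot rigid transform.
    x, y, z = xyz_cam
    return (x + 0.2, y, z + 0.1)

def sort_one(frame):
    label, center = detect(frame)
    target = camera_to_robot(depth_to_camera_xyz(center))
    bin_name = "recyclables" if label in ("bottle", "cup") else "other"
    return label, target, bin_name
```

Calling `sort_one(frame)` once runs a single perceive-localize-decide cycle; the real system repeats this per frame while the arm tracks the target.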

Sensor layer

RealSense D435i input*

RGB and depth streams define the base observation layer for bottle and cup sorting in real time.

Intel RealSense D435 depth camera on a tripod.

* External reference image. Source: Marc Auledas, CC BY-SA 4.0.

Perception layer

Improved YOLO11 detection*

Model inference produces bottle and cup class labels with 2D bounding boxes, which the system then uses to estimate each target's physical position.


Object detection example with labeled bounding boxes.

* External reference image. Source: Hughesperreault, CC BY-SA 4.0.

Calibration layer

Coordinate transformation*

Depth-aware visual data is transformed into the robot's coordinate frame so the manipulator can act on detections in physical space.

Chessboard calibration setup illustration.

* External reference image. Source: Ibai Gorordo, CC BY-SA 4.0.
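One common way to recover the camera-to-robot transform from calibration-board observations is a least-squares rigid fit (the Kabsch algorithm) over paired 3D points. This sketch assumes the corresponding point sets have already been collected; it is not necessarily the procedure the project used.

```python
import numpy as np

def fit_rigid_transform(cam_pts, rob_pts):
    """Least-squares rigid fit (Kabsch): find R, t with rob ≈ R @ cam + t.

    cam_pts, rob_pts: (N, 3) arrays of corresponding 3D points,
    e.g. chessboard corners seen by the camera and touched by the robot.
    """
    cam_c = cam_pts.mean(axis=0)
    rob_c = rob_pts.mean(axis=0)
    H = (cam_pts - cam_c).T @ (rob_pts - rob_c)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])                    # guard against reflections
    R = Vt.T @ D @ U.T
    t = rob_c - R @ cam_c
    return R, t
```

At least three non-collinear correspondences are needed; in practice many board corners are averaged, which also dampens measurement noise.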

Action layer

Seek-and-follow + grasp execution*

The arm aligns above the target, adapts to height changes, and then executes categorized grasp-and-place behavior.

Service robot grasping an object on a table.

* External reference image. Source: Paul Beaudry, CC BY-SA 2.0.

System visuals

Calibration and perception evidence.

These visuals show the calibration setup and live perception outputs used during robot operation.

Calibration

Using the calibration board

This calibration procedure establishes the relationship between camera observations and robot coordinates.

Team member using the calibration board with the robot and camera for coordinate alignment.

Setup

Calibration board installation

This setup image documents the physical preparation needed before coordinate transformation and end-to-end motion could be trusted.

Team members positioning the calibration board in the robot workspace.

Control evidence

Seek-and-follow arm motion

This clip isolates the alignment stage, showing how the arm maintains a controlled relationship to the target before grasp execution.

Perception evidence

Live detection and localization

This detection clip shows the perception output directly, including object labels and the targeting information used by the rest of the pipeline.

Data flow

The core technical chain.

1. RGB-D stream

Sensor frames arrive from the D435i and establish the live scene.

Input: color + depth
2. YOLO11 detection

Bottle and cup targets are classified with live bounding boxes.

Output: class + bbox
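Once detections are in, the pipeline needs a single target. A small helper can filter for the sorting classes, keep the highest-confidence hit, and hand its box center to the depth stage. The tuple layout here is an assumption about how one might flatten YOLO11 results, not the project's exact data structure.

```python
# Picks the highest-confidence bottle/cup detection and returns its
# pixel center. Each detection is (label, confidence, (x1, y1, x2, y2)),
# as one might flatten from YOLO11 result boxes.

TARGET_CLASSES = {"bottle", "cup"}

def pick_target(detections, min_conf=0.5):
    best = None
    for label, conf, (x1, y1, x2, y2) in detections:
        if label in TARGET_CLASSES and conf >= min_conf:
            if best is None or conf > best[1]:
                best = (label, conf, ((x1 + x2) / 2.0, (y1 + y2) / 2.0))
    return best  # None when nothing usable is in frame
```

Returning `None` on an empty frame lets the control loop simply skip a tick instead of acting on a stale target.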
3. Depth query

Depth is sampled around the target to estimate camera-space XYZ.

Output: camera XYZ
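The depth-to-XYZ step is standard pinhole back-projection; pyrealsense2 provides `rs2_deproject_pixel_to_point` for this, and the math it performs looks like the sketch below. Sampling a small window and taking the median is one common way to reject depth holes and outliers; the window size and helper names are assumptions.

```python
import statistics

def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth z into camera XYZ.

    fx, fy, cx, cy are the color/depth intrinsics reported by the camera.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m

def sample_depth(depth_lookup, u, v, win=2):
    """Median depth (m) over a (2*win+1)^2 window around the box center."""
    vals = [depth_lookup(u + du, v + dv)
            for du in range(-win, win + 1)
            for dv in range(-win, win + 1)]
    vals = [z for z in vals if z > 0]  # drop invalid zero-depth pixels
    return statistics.median(vals) if vals else None
```

`depth_lookup` stands in for reading the aligned depth frame; invalid RealSense depth pixels read as zero, which is why they are filtered before the median.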
4. Coordinate transform

Visual coordinates are converted into actionable robot-space targets.

Output: robot target
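Applying the calibrated transform is a homogeneous matrix multiply. A minimal sketch, assuming the rotation `R` and translation `t` came out of the calibration step:

```python
import numpy as np

def make_T(R, t):
    """Assemble a 4x4 homogeneous camera-to-robot transform from R, t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def to_robot(T, xyz_cam):
    """Map a camera-space point into the robot's coordinate frame."""
    p = np.append(np.asarray(xyz_cam, float), 1.0)   # homogeneous coords
    return (T @ p)[:3]
```

Keeping the transform as one 4x4 matrix means chained frames (camera to flange to base, for example) compose by plain matrix multiplication.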
5. Seek and follow

The arm tracks the object and maintains a proper approach distance.

Action: align + approach
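Seek-and-follow behavior of this kind is often a proportional controller toward a hover point above the target, with a per-tick step limit so the arm moves smoothly and adapts as the target's height estimate changes. The gains and hover offset below are illustrative assumptions.

```python
def follow_step(current, target, hover=0.10, gain=0.5, max_step=0.02):
    """One control tick toward a point `hover` metres above the target.

    current, target: (x, y, z) in robot coordinates (metres).
    Returns the next commanded end-effector position.
    """
    goal = (target[0], target[1], target[2] + hover)
    nxt = []
    for c, g in zip(current, goal):
        d = gain * (g - c)
        d = max(-max_step, min(max_step, d))  # clamp per-axis step size
        nxt.append(c + d)
    return tuple(nxt)
```

Run once per control tick with a fresh target estimate, the clamp bounds the arm's speed while the proportional term keeps it centered over a moving object before the grasp is triggered.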
6. Grasp and place

The dexterous hand grips with tuned force and releases into the correct bin.

Feedback: logs + operator view
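Category-dependent grasping reduces to a lookup from detected class to grip force and destination bin. The force values and bin names below are illustrative assumptions, not the project's tuned parameters.

```python
# Maps the detected class to a grip force and a destination bin.
# Numbers and names are placeholders for the tuned per-category values.

GRASP_PLAN = {
    "bottle": {"force_n": 8.0, "bin": "bottle_bin"},
    "cup":    {"force_n": 5.0, "bin": "cup_bin"},
}

def plan_grasp(label):
    plan = GRASP_PLAN.get(label)
    if plan is None:
        raise ValueError(f"no grasp plan for class {label!r}")
    return plan["force_n"], plan["bin"]
```

Failing loudly on an unknown class keeps an unexpected detection from triggering an unplanned grasp.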

Interface boundary

Operator visibility and runtime feedback

The web layer helps operators and judges understand what the robot is seeing, tracking, and doing during the run.

Validation boundary

Subsystem validation before full runs

Each subsystem was validated separately before end-to-end runs, including detection, localization, calibration, and manipulator motion.

Expansion boundary

Prototype first, deployment later

Current results focus on prototype validation, while broader deployment, additional categories, and longer autonomous runs remain future work.