Fast Active Speaker Detection
This function is an optimized, production-ready implementation of active speaker detection: detecting when the people visible in a video are actively speaking, using object detection and facial analysis.
Key Features
- Reliable Speaker Detection: Speaker detection is powered by our optimized version of the TalkNet-ASD model, which substantially improves on both the quality and the cost of the open-source variant.
- Robust Face Detection: Use either YOLOv8 or MediaPipe Face Detection as the backend for detecting faces.
- Segment by Scene: Get scene-by-scene active speaker data using our optimized implementation of TalkNet-ASD.
- Customize Speed and Cost: Choose the face detection backend and the detection FPS to cut costs or boost quality, and optionally process only selected sections of a video.
Pricing
We price per minute of video. Pricing is bucketed into standard definition (≤ 720p), high definition (≤ 1080p), and 4K (> 1080p). The pay-as-you-go rates are listed below.
Additionally, there is a small compute-based fee for running face detection with YOLOv8 or MediaPipe. It is negligible relative to the rest of the processing, so it is not listed here; note that increasing the detection FPS increases this compute fee.
The fee for TalkNet-ASD is included in the rates below.
| Resolution | Price / Minute |
|---|---|
| > 1080p (4K) | $0.117 |
| > 720p (up to 1080p) | $0.065 |
| ≤ 720p | $0.052 |
Note: If your video is poorly encoded, we re-encode it for you, since it would otherwise make the pipeline prohibitively slow. Re-encoding is charged at $0.01 per compute minute.
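As a quick worked example of the rates above: a 10-minute video at 1080p costs 10 × $0.065 = $0.65 before the compute fee. The sketch below applies the same table in Python; the per-minute rates are real, but the helper itself is hypothetical and not part of the API.

```python
# Hypothetical cost estimator. The per-minute rates come from the table
# above; the compute fee for face detection and any re-encoding charge
# are excluded.
RATES_PER_MINUTE = {
    "sd": 0.052,  # <= 720p
    "hd": 0.065,  # 721p to 1080p
    "4k": 0.117,  # > 1080p
}

def estimate_cost(duration_minutes: float, tier: str) -> float:
    """Base pay-as-you-go price for a video of the given length and tier."""
    return duration_minutes * RATES_PER_MINUTE[tier]

print(f"${estimate_cost(10, 'hd'):.2f}")  # a 10-minute 1080p video -> $0.65
```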
Notes
Output Format
The output is a generated stream of JSON items, one per analyzed frame, where each item is a dictionary in the format shown below. Note that the "related_scene" key is only included when the `return_scene_data` parameter is enabled.
[
  {
    "frame_number": int,
    "time_seconds": float,
    "faces": [
      {
        "x1": int,
        "y1": int,
        "x2": int,
        "y2": int,
        "speaking_score": float,  // negative = not speaking, positive = speaking
        "active": bool            // whether this face is actively speaking
      }
    ],
    "related_scene": {            // only returned if `return_scene_data` is enabled
      "start_seconds": float,
      "end_seconds": float,
      "start_frame": int,
      "end_frame": int,
      "start_timecode": string,   // formatted like "00:00:00.000"
      "end_timecode": string,     // formatted like "00:00:16.614"
      "scene_number": int
    }
  },
  ...
]
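If you consume the output over HTTP, the items can be read incrementally. Below is a minimal sketch assuming the stream arrives as newline-delimited JSON; the endpoint URL, auth header, and `video_url` field are hypothetical placeholders (only `return_scene_data` is a documented parameter here), so consult the API reference for the real request shape.

```python
import json

import requests  # assumed HTTP client; substitute your SDK of choice

# Hypothetical endpoint, auth, and request fields. Only `return_scene_data`
# is documented above; the rest is illustrative.
resp = requests.post(
    "https://api.example.com/v1/active-speaker-detection",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"video_url": "https://example.com/clip.mp4", "return_scene_data": True},
    stream=True,
)
resp.raise_for_status()

# Assuming newline-delimited JSON: each line is one item in the format above.
items = []
for line in resp.iter_lines():
    if not line:
        continue
    item = json.loads(line)
    items.append(item)
    active = [f for f in item["faces"] if f["active"]]
    print(f"t={item['time_seconds']:.2f}s: {len(active)} active speaker(s)")
```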
Speed Boost
The `speed_boost` parameter, when enabled, switches face detection to MediaPipe Face Detection, which is faster but somewhat less reliable than the default YOLOv8 backend.
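For illustration, toggling the flag is just one extra field in the request body. `speed_boost` is the documented parameter name; the surrounding fields are the same hypothetical ones used in the streaming sketch above.

```python
# `speed_boost` is documented; `video_url` is an illustrative placeholder.
payload = {
    "video_url": "https://example.com/clip.mp4",
    "speed_boost": True,  # MediaPipe backend: faster, slightly less reliable
}
```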
Getting Scene Data
Active speaker detection often has to handle cuts in the video that jump from speaker to speaker. Returning scene information is an optional feature of this pipeline, enabled with the `return_scene_data` parameter; the scene data is returned under the "related_scene" key.
If you only want information about the start of each scene, use the `return_scene_cuts_only` parameter, which truncates the output payload to the first frame of each scene.
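As a sketch of how the scene data can be used, the snippet below groups frame items by their "related_scene" key. It assumes `return_scene_data` was enabled and that the streamed items have already been collected into a list named `items`, as in the streaming example above.

```python
from collections import defaultdict

# Group frame items by scene; assumes `return_scene_data` was enabled and
# `items` holds the parsed stream (see the streaming sketch above).
scenes = defaultdict(list)
for item in items:
    scenes[item["related_scene"]["scene_number"]].append(item)

for scene_number, frames in sorted(scenes.items()):
    start = frames[0]["related_scene"]["start_timecode"]
    speaking = sum(1 for fr in frames for f in fr["faces"] if f["active"])
    print(f"Scene {scene_number} (starts {start}): {speaking} speaking-face detections")
```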