Waymo: Utilizing key point and pose estimation for the task of autonomous driving



Imagine: you’re driving a car and you see a person standing on the corner of a street. How do you know if they’re going to cross? Interpreting another road user’s intent and actions can be challenging and complex, even for human drivers. It is a driver’s job to gauge whether another road user wants you to wait and let them cross, whether they are waiting to cross after you pass, or whether they are just standing there for a different reason. Even then, what an individual signals might differ from the action they take. To help navigate these nuanced situations, one of the important signals the Waymo Driver uses is key points.

Making more accurate and efficient models with less compute

Key points are a simplified way to represent the complex human form using a limited number of points, which typically correspond to body joints.
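To make the idea concrete, here is a minimal sketch of how a set of key points might be represented in code. The joint names, the `KeyPoint` and `Pose` classes, and the coordinate convention are all assumptions made for illustration; the article does not describe Waymo's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical joint set for illustration only; not Waymo's actual schema.
JOINT_NAMES = (
    "nose", "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
)


@dataclass(frozen=True)
class KeyPoint:
    xyz: Tuple[float, float, float]  # position in metres (assumed vehicle frame)
    visible: bool = True             # False when the joint is occluded


@dataclass
class Pose:
    joints: Dict[str, KeyPoint]

    def get(self, name: str) -> Optional[KeyPoint]:
        """Return the joint if it was detected and is visible, else None."""
        if name not in JOINT_NAMES:
            raise KeyError(f"unknown joint: {name}")
        kp = self.joints.get(name)
        return kp if kp is not None and kp.visible else None


# A partially observed person: two shoulders seen, one wrist occluded.
pose = Pose({
    "left_shoulder": KeyPoint((10.0, 0.2, 1.4)),
    "right_shoulder": KeyPoint((10.0, -0.2, 1.4)),
    "left_wrist": KeyPoint((10.1, 0.3, 0.9), visible=False),
})
```

A whole person is reduced here to at most 13 named points, which is the compactness the paragraph above refers to.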

Historically, computer vision has relied on rigid bounding boxes to locate and classify objects within a scene. However, one of the limiting factors in detection, tracking, and action recognition of vulnerable road users, such as pedestrians and cyclists, is the lack of precise human pose understanding. While locating and recognizing an object is essential for autonomous driving, a lot of context can go unused in this process. For example, a bounding box won’t inherently tell you if a pedestrian is standing or sitting, or what their actions or gestures are.
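As a rough illustration of the kind of context a bounding box leaves out, the sketch below guesses whether a person is seated from just two key points, the hip and the knee. The function names, the z-up coordinate convention, and the 50-degree threshold are assumptions chosen for the example, not anything Waymo has described.

```python
import math
from typing import Tuple

Point3 = Tuple[float, float, float]  # (x, y, z) in metres, z pointing up (assumed)


def thigh_angle_from_vertical(hip: Point3, knee: Point3) -> float:
    """Angle in degrees between the hip-to-knee segment and straight down."""
    dx, dy, dz = (knee[i] - hip[i] for i in range(3))
    length = math.sqrt(dx * dx + dy * dy + dz * dz)
    if length == 0.0:
        raise ValueError("hip and knee coincide")
    # Cosine of the angle to the downward unit vector (0, 0, -1),
    # clamped to guard against floating-point drift outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, -dz / length))
    return math.degrees(math.acos(cos_angle))


def looks_seated(hip: Point3, knee: Point3, threshold_deg: float = 50.0) -> bool:
    """Heuristic: a thigh closer to horizontal than vertical suggests sitting."""
    return thigh_angle_from_vertical(hip, knee) > threshold_deg


standing = looks_seated(hip=(0.0, 0.0, 0.9), knee=(0.0, 0.0, 0.5))
sitting = looks_seated(hip=(0.0, 0.0, 0.5), knee=(0.4, 0.0, 0.5))
```

A single hand-tuned threshold like this would be fragile in practice (crouching or cycling would confuse it), but it shows how pose carries information that box dimensions alone do not.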

Key points are a compact and structured way to convey the human pose information otherwise encoded in pixels and lidar scans. These points help the Waymo Driver gain a deeper understanding of an individual’s actions and intentions, like whether they’re planning to cross the street. For example, a person’s head direction often indicates where they plan to go, whereas a person’s body orientation tells you which direction they are already heading. While the Waymo Driver can recognize a human’s behavior directly from camera and lidar data without using key points, pose estimation also teaches the Waymo Driver to understand different patterns, like a person propelling a wheelchair, and to correlate them with a predictable future action rather than with a specific object, such as the wheelchair itself.
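Body orientation is one cue that falls out of key points with simple geometry. The sketch below estimates which way a person is facing from their two shoulder points, projected onto the ground plane. The `body_heading` function and the axis convention (x forward, y to the left) are assumptions for illustration only.

```python
import math
from typing import Tuple

Point2 = Tuple[float, float]  # ground-plane (x, y); x forward, y left (assumed)


def body_heading(left_shoulder: Point2, right_shoulder: Point2) -> float:
    """Estimate facing direction in radians, counter-clockwise from the +x axis."""
    dx = left_shoulder[0] - right_shoulder[0]
    dy = left_shoulder[1] - right_shoulder[1]
    if dx == 0.0 and dy == 0.0:
        raise ValueError("shoulders coincide; heading is undefined")
    # The right-to-left shoulder vector points to the person's left.
    # Rotating it by -90 degrees gives the facing direction (dy, -dx).
    return math.atan2(-dx, dy)


# Facing along +x: left shoulder sits at +y, right shoulder at -y.
facing_forward = body_heading((0.0, 0.2), (0.0, -0.2))
# Facing along +y: left shoulder sits at -x, right shoulder at +x.
facing_left = body_heading((-0.2, 0.0), (0.2, 0.0))
```

The same construction applied to head key points (for example, the two ears) would give a head direction, and the difference between the two angles is one way to express the "looking one way, walking another" cue described above.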

Applying state-of-the-art technology to the autonomous driving domain

While we incorporate this exciting technology onboard our vehicles to help the Waymo Driver navigate the real world, key points and pose estimation have been advancing many industries for more than a decade, from animating beloved cartoons and creating realistic video game characters to augmenting reality on popular social media apps. Applying this state-of-the-art technology to the autonomous driving domain, however, is orders of magnitude more challenging.

Up until now, key points have mostly been used in relatively controlled environments that make them easier to apply, such as placing an augmented-reality dinosaur next to a single person or filming a set number of actors to control a video game. The Waymo Driver generates key points in the “wild” for all nearby road users, which is orders of magnitude harder, as our Driver often encounters up to hundreds of pedestrians at a single intersection, many of whom can be occluded by other objects.

The Waymo Driver uses real-time data from our sensor suite, including our lidars, which feeds into our neural-network models to localize key points in three-dimensional space. Waymo created its own methodologies to generate high-quality labels that identify the joints in 3D space, which enabled training human pose models to further improve the safety of the Waymo Driver. This also means that Waymo’s key point technology doesn’t identify an individual person, but rather aggregates data points and gives us a better capability to recognize that a person exists and where they may be going. This is especially beneficial for partially visible pedestrians who might be stepping out of a vehicle or sitting near the road. Additionally, we’ve optimized our system to run onboard the vehicle in real time, with high precision and low latency, to enhance its behavior-prediction models and allow the Waymo Driver to quickly and safely handle any situation.

Key points in action

For some time now, key points have been providing the Waymo Driver a more nuanced understanding of the world around it, creating a more predictable and comfortable driving experience for all road users, including our Waymo One riders. Here are a handful of examples of how key points are helping the Waymo Driver navigate San Francisco.

Crowds

From morning commutes and weekend rallies to farmers markets and the rush of fans on game day, any driver can count on seeing large groups of people in cities like San Francisco. By combining key points with Waymo’s fifth-generation sensing suite, the Waymo Driver has a deeper understanding of how pedestrians, cyclists, and other road users might interact with it to navigate dense and complex situations.

Gestures

While the Waymo Driver can detect various gestures, like a cyclist’s or traffic controller’s hand signals, from raw camera data or lidar point clouds, it is advantageous for the Waymo Driver to use key points to determine a person’s orientation, gesture, and hand signals. Earlier and more accurate detection allows the Waymo Driver to better plan its move, creating a more natural driving experience.

Occlusions

Cities are dense environments by design, which leads to unique challenges. Narrow city streets are lined with cars and large crowds of people, and objects are often blocked or hidden as people walk out of buildings or pop out from behind vehicles. With the addition of key points, the Waymo Driver can better understand and recognize partially occluded objects, such as just a leg or arm of a person stepping out of a vehicle or a person hidden between two vehicles, and reason about their next move.
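One way to picture why key points help with occlusion: even when only a knee and an ankle are visible, those joints alone still say that a person is there and roughly where. The sketch below keeps only confidently detected joints and averages them into a rough position. The per-joint confidence score, the 0.5 threshold, and the function names are assumptions for illustration, not a description of Waymo's system.

```python
from typing import Dict, Optional, Tuple

# joint name -> (x, y, z, confidence); confidence in [0, 1] is a hypothetical
# per-joint score that a pose model might output.
Detections = Dict[str, Tuple[float, float, float, float]]


def visible_joints(
    detections: Detections, min_confidence: float = 0.5
) -> Dict[str, Tuple[float, float, float]]:
    """Keep only joints whose confidence meets the threshold."""
    return {
        name: (x, y, z)
        for name, (x, y, z, conf) in detections.items()
        if conf >= min_confidence
    }


def rough_position(
    detections: Detections, min_confidence: float = 0.5
) -> Optional[Tuple[float, float, float]]:
    """Centroid of the confidently detected joints, or None if there are none."""
    points = list(visible_joints(detections, min_confidence).values())
    if not points:
        return None
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


# Only one leg is clearly visible, as when someone steps out of a parked car.
partial = {
    "left_knee": (5.0, 1.0, 0.5, 0.9),
    "left_ankle": (5.0, 1.5, 0.1, 0.8),
    "nose": (0.0, 0.0, 0.0, 0.1),  # low confidence: treated as occluded
}
position = rough_position(partial)
```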

Indecisive pedestrians

When pedestrians in San Francisco reach a red light, they commonly wait on the very edge of a crosswalk. While it often looks as if they might walk in front of a car, they may end up crossing behind the vehicle. The Waymo Driver is prepared for either reaction, and our key points capability provides the Driver with a more layered understanding of the situation.

SOURCE: Waymo
