Ross Cutler
I’m a Distinguished Engineer at Microsoft. I’ve been with Microsoft since 2000, joining Microsoft Research as a researcher after completing my Ph.D. in computer vision at the University of Maryland, College Park. My bachelor’s degrees are in computer science, math, and physics, and I’m comfortable building both software and hardware technologies. On this page, I’ll include brief descriptions of some of the projects I’ve worked on, mostly in the telecommunications domain - a very rich and satisfying area for applied research.
You can find more info about me on my LinkedIn page and my Microsoft Research page.
Publications
My Google Scholar page is here. A complete list of publications is in my CV.
Recent talks
- IEEE IEMCON Keynote, 2022. Developing Machine Learning-Based Speech Enhancement Models for Teams and Skype
- RTC @ Scale, 2022. Developing Machine Learning-Based Speech Enhancement Models for Teams and Skype
- Intel Speech Conference Keynote, 2021. Developing Machine Learning-Based Speech Enhancement Models for Teams and Skype
- INTERNOISE Keynote, 2021. Developing Machine Learning-Based Speech Enhancement Models for Teams and Skype
Projects
Below are some of the projects I’ve worked on.
ML speech enhancement
The goal of this project is to replace millions of lines of DSP-based speech enhancement code in Teams/Skype with much better-performing ML-based models that also offer new functionality. To do so, we first created a scalable crowdsourcing framework to rate hundreds of thousands of clips cheaply and accurately, and built massive training and test sets for training ML models. We built the first accurate (PCC > 0.95) non-intrusive speech quality assessment models for speech in the presence of noise, echo, packet loss, reverberation, and PSTN distortions. We then created the first academic challenges for noise suppression, echo cancellation, and packet loss concealment. Finally, the models we built were integrated into Teams/Skype and A/B tested to show significant end-to-end improvement. An example video of Deep Noise Suppression is here. A more recent example of Deep Echo Cancellation, Noise Suppression, and Dereverberation in a single model is shown here. Examples of Deep Packet Loss Concealment are shown here. Voice Isolation is shown here. These models are now used by hundreds of millions of Teams/Skype users for all audio calls.
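As an illustration of the validation criterion mentioned above (not the production pipeline), here is a minimal sketch of checking a non-intrusive quality model against crowdsourced ratings via the Pearson correlation coefficient (PCC). The per-clip scores below are hypothetical.

```python
# Minimal sketch: validating a speech quality model against human ratings
# by Pearson correlation (PCC). All scores here are hypothetical examples.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-clip scores: crowdsourced MOS vs. model predictions.
subjective_mos = [1.2, 2.0, 2.9, 3.4, 4.1, 4.6]
predicted_mos  = [1.3, 1.9, 3.0, 3.5, 4.0, 4.7]

pcc = pearson(subjective_mos, predicted_mos)
print(f"PCC = {pcc:.3f}")  # an accurate model targets PCC > 0.95
```

In practice the subjective labels come from a crowdsourcing framework such as the P.808-style one described above, aggregated over many raters per clip.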
Some publications for this project are:
Challenges:
- ICASSP 2024 Speech Signal Improvement Challenge
- ICASSP 2024 Audio Deep Packet Loss Concealment Challenge
- ICASSP 2023 Speech Signal Improvement Challenge
- ICASSP 2023 Acoustic Echo Cancellation Challenge
- ICASSP 2023 Deep Noise Suppression Challenge
- INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge
- ConferencingSpeech 2022 Challenge: Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge for Online Conferencing Applications
- ICASSP 2022 Deep Noise Suppression Challenge
- ICASSP 2022 Acoustic Echo Cancellation Challenge
- INTERSPEECH 2021 Acoustic Echo Cancellation Challenge
- ICASSP 2021 Deep Noise Suppression Challenge
- ICASSP 2021 Acoustic Echo Cancellation Challenge: Datasets, testing framework, and results
- INTERSPEECH 2021 Deep Noise Suppression Challenge
- The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, subjective testing framework, and challenge results
Speech quality assessment models:
- PLCMOS – a data-driven non-intrusive metric for the evaluation of packet loss concealment algorithms
- DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors
- AECMOS: A speech quality assessment metric for echo impairment
- DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors
- Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network
- DNN no-reference PSTN speech quality prediction
- Non-intrusive speech quality assessment using neural networks
- Supervised classifiers for audio impairments with noisy labels
Speech quality labeling:
- Multi-dimensional Speech Quality Assessment in Crowdsourcing
- Crowdsourcing approach for subjective evaluation of echo impairment
- An open source implementation of ITU-T recommendation P.808 with validation
- Subjective evaluation of noise suppression algorithms in crowdsourcing
- Speech Quality Assessment in Crowdsourcing: Comparison Category Rating Method
Models, optimizations, other:
- DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation
- Aura: Privacy-preserving augmentation to improve test set diversity in noise suppression applications
- Performance optimizations on deep noise suppression models
- MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection
- A scalable noisy speech dataset and online subjective test framework
- Weighted speech distortion losses for neural-network-based real-time speech enhancement
ML video codec
The goal of this project is to create an ML-based video codec with >2X the coding efficiency of H.266/VVC. We are following an approach similar to our ML speech enhancement process: we first created an accurate video quality assessment tool that can label millions of clips cheaply, helped organize an ML video codec challenge that uses this tool, and will work with other companies to create a new ML video codec standard.
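Coding efficiency claims like ">2X" are usually made by comparing rate-distortion curves at matched quality. Below is a simplified, hypothetical sketch in the spirit of a Bjøntegaard-style comparison: real BD-rate fits cubic polynomials, while this uses piecewise-linear interpolation of log-bitrate vs. quality, and all rate-distortion points are made up.

```python
# Simplified Bjøntegaard-style rate comparison between two codecs.
# Real BD-rate uses cubic polynomial fits; this sketch interpolates
# log2(bitrate) linearly in quality. All numbers are hypothetical.
import math

def log_rate_at(points, q):
    """Linearly interpolate log2(bitrate) at quality q from (bitrate, quality) points."""
    pts = sorted(points, key=lambda p: p[1])
    for (r0, q0), (r1, q1) in zip(pts, pts[1:]):
        if q0 <= q <= q1:
            t = (q - q0) / (q1 - q0)
            return (1 - t) * math.log2(r0) + t * math.log2(r1)
    raise ValueError("quality out of range")

def avg_rate_ratio(anchor, test, qualities):
    """Average bitrate ratio (test/anchor) over matched quality points."""
    diffs = [log_rate_at(test, q) - log_rate_at(anchor, q) for q in qualities]
    return 2 ** (sum(diffs) / len(diffs))

# Hypothetical rate-distortion points: (bitrate in kbps, quality score).
anchor_codec = [(500, 3.0), (1000, 3.6), (2000, 4.1), (4000, 4.5)]
ml_codec     = [(250, 3.0), (500, 3.6), (1000, 4.1), (2000, 4.5)]

ratio = avg_rate_ratio(anchor_codec, ml_codec, [3.2, 3.6, 4.0, 4.4])
print(f"test codec uses {ratio:.2f}x the anchor bitrate at equal quality")
# a ratio below 0.5 corresponds to >2X coding efficiency
```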
Challenges:
- DCC 2024 Workshop and Challenge on Learned Image Compression
- CVPR 2022 Workshop and Challenge on Learned Image Compression
Publications:
- VCD: A Video Conferencing Dataset for Video Compression
- A crowdsourcing approach to video quality assessment
- Full Reference Video Quality Assessment for Machine Learning-Based Video Codecs
Meeting effectiveness and inclusiveness
Meetings are a pervasive method of communicating in companies and consume a lot of time and resources. The goal of this project was to determine whether we could effectively measure the effectiveness and inclusiveness of meetings, and ultimately improve them. There has been research on modeling meeting effectiveness, but no prior research on measuring or modeling meeting inclusiveness. We provide the first methodology to measure and model meeting effectiveness and inclusiveness and show how they can be improved. One specific way to improve meeting inclusiveness is to better allow remote speakers to interrupt and talk in meetings, which we address with the first model we are aware of for detecting failed speech interruptions. A demo video is here.
Publications:
- Meeting effectiveness and inclusiveness: large-scale measurement, identification of key features, and prediction in real-world remote meetings
- Real-time Speech Interruption Analysis: From Cloud to Client Deployment
- Meeting effectiveness and inclusiveness in remote collaboration
- Improving meeting inclusiveness using speech interruption analysis
ML bandwidth estimation and control
The goal of this project is to estimate how much bandwidth we can send across the internet for video calls. Traditionally this has been solved with classic control theory; we are using machine learning, in particular deep reinforcement learning. This is an especially challenging problem because, unlike audio or video, the environment here is the internet itself, which is impossible to simulate accurately.
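To give a feel for the reinforcement learning framing (this is a toy, not the deep RL system described above), the sketch below has an epsilon-greedy agent pick a send rate on a simulated link with unknown capacity, where the reward favors throughput and penalizes overshoot. Every number - rates, capacity, reward shape - is a made-up illustration.

```python
# Toy sketch: bandwidth estimation framed as reinforcement learning.
# An epsilon-greedy agent learns which send rate maximizes reward on a
# simulated link. The real system uses deep RL and real networks; all
# numbers here are hypothetical.
import random

random.seed(0)

CAPACITY_KBPS = 1800                    # hidden link capacity to discover
RATES = [250, 500, 1000, 2000, 4000]    # candidate send rates (actions)

def reward(rate):
    """Delivered throughput minus a stiff penalty for lost traffic."""
    delivered = min(rate, CAPACITY_KBPS)
    lost = max(rate - CAPACITY_KBPS, 0)
    return delivered - 5.0 * lost       # losing packets hurts more than idling

q_values = {r: 0.0 for r in RATES}
counts = {r: 0 for r in RATES}

for step in range(2000):
    if random.random() < 0.1:                          # explore
        rate = random.choice(RATES)
    else:                                              # exploit
        rate = max(RATES, key=lambda r: q_values[r])
    counts[rate] += 1
    # incremental average update of the action-value estimate
    q_values[rate] += (reward(rate) - q_values[rate]) / counts[rate]

best = max(RATES, key=lambda r: q_values[r])
print(f"learned send rate: {best} kbps (capacity {CAPACITY_KBPS} kbps)")
```

The hard part in practice is exactly what this toy hides: the "environment" is the real internet, whose dynamics cannot be captured by a fixed simulated capacity.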
Publications:
- Reinforcement learning for bandwidth estimation and congestion control in real-time communications
- ACM MMSys 2024 Bandwidth Estimation in Real Time Communications Challenge
Active speaker detection
I’ve been working on active speaker detection (ASD) since graduate school, where I developed the first neural network solution for multi-modal fusion to detect active speakers with a single microphone and camera. Since then I’ve implemented ASD several more times using microphone arrays, depth cameras, and deep learning, and have shipped it in multiple products, including Microsoft RoundTable. ASD is still an active area of research and one I continue to work on.
Publications:
- A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism
- Multimodal active speaker detection and virtual cinematography for video conferencing
- Boosting-based multimodal speaker detection for distributed meeting videos
- Look who’s talking: Speaker detection using video and audio correlation
Light field camera/display video conferencing
The ultimate form of remote conferencing will preserve eye gaze and who is looking at whom, have accurate spatial geometry, and achieve the same level of trust, empathy, meeting effectiveness and inclusiveness, and fatigue as face-to-face meetings. Achieving this will require new types of displays and cameras. Two prototypes I’ve worked on, called TeleWall and TeleWindow, are designed to meet these goals using light field cameras and AR glasses.
Patents:
- Dynamic detection and correction of light field camera array miscalibration
- Light field camera modules and light field camera module arrays
- Device pose detection and pose-related image capture and processing for light field based telepresence communications
- Telepresence devices operation methods
- Telepresence device
Camera designs
I’ve designed several types of cameras for human motion capture and teleconferencing besides the above light field cameras. My first commercial design was the RoundTable camera, a high-resolution 360-degree camera using a pentagonal prism and view cameras to minimize stitching error. That design was later made HD and shipped by Polycom. A newer 360-degree cost-reduced design uses just two cameras. I’ve also designed some front-of-the-room cameras, privacy-preserving webcams, and whiteboard cameras. Some of these are described below.
Patents:
- 360 degree camera
- Multi-view integrated camera system with housing
- Maintenance of panoramic camera orientation
- Omni-directional camera with calibration and up look angle improvements
- Whiteboard view camera
- Omni-directional camera design for video conferencing
- Camera lens shuttering mechanism
Video DSP
I’ve implemented many video DSP components, including panoramic stitching, smart gain control, camera calibration for panoramic cameras, normalizing the size of heads in video conferences, and demosaicing (used in MATLAB and on NASA Mars missions). Some of these are described below:
Publications:
- High-quality linear interpolation for demosaicing of Bayer-patterned color images
- Automatic head-size equalization in panorama images for video conferencing
- Head-size equalization for improved visual perception in video conferencing
- Practical calibrations for a real-time digital omnidirectional camera
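The demosaicing publication above describes gradient-corrected linear interpolation; as background, here is a sketch of the naive baseline it improves on - filling each missing channel of an RGGB Bayer mosaic by averaging same-channel neighbors. This is a simplified illustration, not the published algorithm, and the pixel values are hypothetical.

```python
# Sketch: a naive demosaicing baseline for an RGGB Bayer mosaic.
# The published method adds gradient corrections from the other color
# channels on top of linear interpolation; this shows only the
# neighbor-averaging baseline it improves upon.

def bayer_channel(y, x):
    """Which color an RGGB Bayer sensor captures at pixel (y, x)."""
    if y % 2 == 0:
        return "R" if x % 2 == 0 else "G"
    return "G" if x % 2 == 0 else "B"

def demosaic(mosaic):
    """Fill missing channels by averaging same-channel 8-neighborhood values."""
    h, w = len(mosaic), len(mosaic[0])
    out = [[{} for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ch in "RGB":
                if bayer_channel(y, x) == ch:
                    out[y][x][ch] = float(mosaic[y][x])
                else:
                    vals = [mosaic[ny][nx]
                            for ny in range(max(0, y - 1), min(h, y + 2))
                            for nx in range(max(0, x - 1), min(w, x + 2))
                            if bayer_channel(ny, nx) == ch]
                    out[y][x][ch] = sum(vals) / len(vals)
    return out

# A flat gray scene: every sensor site reads 128, so the reconstruction
# should be gray everywhere (a basic sanity check for any demosaicer).
mosaic = [[128] * 4 for _ in range(4)]
rgb = demosaic(mosaic)
print(rgb[1][1])  # {'R': 128.0, 'G': 128.0, 'B': 128.0}
```

Plain averaging like this blurs edges and causes color fringing; the gradient correction exploits cross-channel correlation to sharpen the result.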
Patents:
- Parametric calibration for panoramic camera systems
- Temperature compensation in multi-camera photographic devices
- Privacy camera
- System and method for camera calibration and images stitching
- Minimizing dead zones in panoramic images
- Speaker and person backlighting for improved AEC and AGC
- Automatic gain and exposure control using region of interest detection
- Automatic detection of panoramic camera position and orientation table parameters
- System and method for camera color calibration and image stitching
- Automatic video framing
- Techniques for detecting a display device
Audio DSP
I’ve also implemented a number of audio DSP components, such as speaking-while-muted detection, improved echo cancellers, and audio-based device discovery. Some of these are described below:
Patents:
- Mute control in audio endpoints
- Removing near-end frequencies from far-end sound
- Endpoint echo detection
- Reducing echo
- Enhanced discovery for ad-hoc meetings
- System and process for discovery of network-connected devices at remote sites using audio-based discovery techniques
- System and method for communicating audio data signals via an audio communications medium
- Audio start service for Ad-hoc meetings
Acoustic designs
I’ve designed the physical acoustics for several shipping speakerphones. Some are described below:
Patents:
- Audio device
- Method and system of varying mechanical vibrations at a microphone
- Satellite microphones for improved speaker detection and zoom
- Satellite microphone array for video conferencing
- Microphone array for a camera speakerphone
- Boundary binaural microphone array
Audio / video device quality
I wrote the first several versions of the Skype / Teams video quality certification specifications and tools, the PC industry standard for measuring webcam video quality. I also co-authored the Skype / Teams audio quality certification specifications, the PC industry standard for measuring send and receive audio quality. Both were developed after designing my own cameras and audio endpoints.
Specifications:
- Microsoft Teams and Skype for Business specifications for USB peripherals, PCs, and Microsoft Teams Room systems, versions 1-3
- Microsoft Lync Specifications for USB peripherals, PCs, and Lync room systems, versions A-F
Periodic motion detection and analysis
Part of my Ph.D. thesis was on detecting and analyzing periodic motion, especially biological motion like human gait. This work showed that biological motion can be detected with very few pixels, and that people can even be identified from this motion.
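The core idea can be illustrated with image self-similarity: build a frame-vs-frame dissimilarity matrix and read the motion period off its diagonals. In this sketch, real video frames are replaced with tiny synthetic per-frame feature vectors, so everything below is a made-up illustration of the technique rather than the thesis implementation.

```python
# Sketch of the image self-similarity idea: periodic motion produces
# low-dissimilarity diagonals in the frame-vs-frame distance matrix,
# offset from the main diagonal by the motion period. Frames here are
# synthetic feature vectors, not real images.

PERIOD = 8  # ground-truth period of the synthetic "gait" sequence

# Hypothetical per-frame features: a periodic pattern sampled at 4 phases.
pattern = [0, 1, 2, 1, 0, -1, -2, -1]
frames = [[pattern[(t + p) % PERIOD] for p in range(4)] for t in range(64)]

def dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Self-similarity matrix over all frame pairs.
n = len(frames)
S = [[dist(frames[i], frames[j]) for j in range(n)] for i in range(n)]

def mean_diag(S, lag):
    """Average dissimilarity along the diagonal at a given frame lag."""
    vals = [S[i][i + lag] for i in range(len(S) - lag)]
    return sum(vals) / len(vals)

# Estimate the period as the smallest lag minimizing average dissimilarity.
period = min(range(1, n // 2), key=lambda lag: mean_diag(S, lag))
print(f"estimated period: {period} frames")  # matches PERIOD = 8
```

Because the matrix depends only on frame-to-frame similarity, the same recipe works even when each "frame" is just a handful of pixels, which is why so few pixels suffice to detect biological motion.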
Publications:
- Gait recognition using image self-similarity
- View-invariant estimation of height and stride for gait recognition
- Stride and cadence as a biometric in automatic person identification and verification
- Motion-based recognition of people in eigengait space
- Eigengait: Motion-based recognition of people using image self-similarity
- Robust real-time periodic motion detection, analysis, and applications
Mathematics genealogy
I have a rich lineage of Ph.D. advisers, including these famous researchers:
- David Hilbert
- Felix Klein
- Rudolf Otto Sigismund Lipschitz
- Carl Friedrich Gauss
- Siméon Denis Poisson
- Jean-Baptiste Joseph Fourier
- Joseph Louis Lagrange
- Pierre-Simon Laplace
- Leonhard Paul Euler
- Johann Bernoulli
- Jacob Bernoulli
- Friedrich Leibniz
- Johannes Kepler
- Nicolaus Copernicus
For more details see here.