Ross Cutler
I’m a Distinguished Engineer at Microsoft. I’ve been with Microsoft since 2000, joining Microsoft Research as a researcher after completing my Ph.D. in computer vision at the University of Maryland, College Park. My bachelor’s degrees are in computer science, math, and physics, and I’m comfortable building both software and hardware technologies. On this page, I’ll include brief descriptions of some of the projects I’ve worked on, mostly in the telecommunications domain - a very rich and satisfying area for applied research.
You can find more info about me on my LinkedIn page and my Microsoft Research page.
Publications
My Google Scholar page is here. A complete list of publications is in my CV.
Recent talks
- IEEE IEMCON Keynote, 2022. Developing Machine Learning-Based Speech Enhancement Models for Teams and Skype
- RTC @ Scale, 2022. Developing Machine Learning-Based Speech Enhancement Models for Teams and Skype
- Intel Speech Conference Keynote, 2021. Developing Machine Learning-Based Speech Enhancement Models for Teams and Skype
- INTERNOISE Keynote, 2021. Developing Machine Learning-Based Speech Enhancement Models for Teams and Skype
Projects
Below are some of the projects I’ve worked on.
ML speech enhancement
The goal of this project is to replace millions of lines of DSP-based speech enhancement code in Teams/Skype with much better-performing ML-based models that also offer new functionality. To do so, we first created a scalable crowdsourcing framework to rate hundreds of thousands of clips cheaply and accurately, and built massive training and test sets for training ML models. We built the first accurate (PCC > 0.95) non-intrusive speech quality assessment models for speech in the presence of noise, echo, packet loss, reverberation, and PSTN distortions. We then created the first academic challenges for noise suppression, echo cancellation, and packet loss concealment. Finally, the models we built were integrated into Teams/Skype and A/B tested to show significant end-to-end improvement. An example video of Deep Noise Suppression is here. A more recent example of Deep Echo Cancellation, Noise Suppression, and Dereverberation in a single model is shown here. Examples of Deep Packet Loss Concealment are shown here. Voice Isolation is shown here. These models are now used by hundreds of millions of Teams/Skype users for all audio calls.
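As an illustration of the validation criterion mentioned above (not the production pipeline), here is a minimal sketch of checking a non-intrusive quality model against crowdsourced ratings via the Pearson correlation coefficient (PCC). The per-clip scores below are hypothetical.

```python
# Minimal sketch: validating a speech quality model against human ratings
# by Pearson correlation (PCC). All scores here are hypothetical examples.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-clip scores: crowdsourced MOS vs. model predictions.
subjective_mos = [1.2, 2.0, 2.9, 3.4, 4.1, 4.6]
predicted_mos  = [1.3, 1.9, 3.0, 3.5, 4.0, 4.7]

pcc = pearson(subjective_mos, predicted_mos)
print(f"PCC = {pcc:.3f}")  # an accurate model targets PCC > 0.95
```

In practice the subjective labels come from a crowdsourcing framework such as the P.808-style one described above, aggregated over many raters per clip.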
Some publications for this project are:
Challenges:
- ICASSP 2024 Speech Signal Improvement Challenge
- ICASSP 2024 Audio Deep Packet Loss Concealment Challenge
- ICASSP 2023 Speech Signal Improvement Challenge
- ICASSP 2023 Acoustic Echo Cancellation Challenge
- ICASSP 2023 Deep Noise Suppression Challenge
- INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge
- ConferencingSpeech 2022 Challenge: Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge for Online Conferencing Applications
- ICASSP 2022 Deep Noise Suppression Challenge
- ICASSP 2022 Acoustic Echo Cancellation Challenge
- INTERSPEECH 2021 Acoustic Echo Cancellation Challenge
- ICASSP 2021 Deep Noise Suppression Challenge
- ICASSP 2021 Acoustic Echo Cancellation Challenge: Datasets, testing framework, and results
- INTERSPEECH 2021 Deep Noise Suppression Challenge
- The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, subjective testing framework, and challenge results
Speech quality assessment models:
- PLCMOS – a data-driven non-intrusive metric for the evaluation of packet loss concealment algorithms
- DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors
- AECMOS: A speech quality assessment metric for echo impairment
- DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors
- Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network
- DNN no-reference PSTN speech quality prediction
- Non-intrusive speech quality assessment using neural networks
- Supervised classifiers for audio impairments with noisy labels
Speech quality labeling:
- Multi-dimensional Speech Quality Assessment in Crowdsourcing
- Crowdsourcing approach for subjective evaluation of echo impairment
- An open source implementation of ITU-T recommendation P.808 with validation
- Subjective evaluation of noise suppression algorithms in crowdsourcing
- Speech Quality Assessment in Crowdsourcing: Comparison Category Rating Method
Models, optimizations, other:
- DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation
- Aura: Privacy-preserving augmentation to improve test set diversity in noise suppression applications
- Performance optimizations on deep noise suppression models
- MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection
- A scalable noisy speech dataset and online subjective test framework
- Weighted speech distortion losses for neural-network-based real-time speech enhancement
ML video codec
The goal of this project is to create an ML-based video codec with >2X the coding efficiency of H.266/VVC. We are following an approach similar to our ML speech enhancement process: we first created an accurate video quality assessment tool that can label millions of clips cheaply, helped organize an ML video codec challenge that uses this tool, and will work with other companies to create a new ML video codec standard.
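Coding efficiency claims like ">2X" are usually made by comparing rate-distortion curves at matched quality. Below is a simplified, hypothetical sketch in the spirit of a Bjøntegaard-style comparison: real BD-rate fits cubic polynomials, while this uses piecewise-linear interpolation of log-bitrate vs. quality, and all rate-distortion points are made up.

```python
# Simplified Bjøntegaard-style rate comparison between two codecs.
# Real BD-rate uses cubic polynomial fits; this sketch interpolates
# log2(bitrate) linearly in quality. All numbers are hypothetical.
import math

def log_rate_at(points, q):
    """Linearly interpolate log2(bitrate) at quality q from (bitrate, quality) points."""
    pts = sorted(points, key=lambda p: p[1])
    for (r0, q0), (r1, q1) in zip(pts, pts[1:]):
        if q0 <= q <= q1:
            t = (q - q0) / (q1 - q0)
            return (1 - t) * math.log2(r0) + t * math.log2(r1)
    raise ValueError("quality out of range")

def avg_rate_ratio(anchor, test, qualities):
    """Average bitrate ratio (test/anchor) over matched quality points."""
    diffs = [log_rate_at(test, q) - log_rate_at(anchor, q) for q in qualities]
    return 2 ** (sum(diffs) / len(diffs))

# Hypothetical rate-distortion points: (bitrate in kbps, quality score).
anchor_codec = [(500, 3.0), (1000, 3.6), (2000, 4.1), (4000, 4.5)]
ml_codec     = [(250, 3.0), (500, 3.6), (1000, 4.1), (2000, 4.5)]

ratio = avg_rate_ratio(anchor_codec, ml_codec, [3.2, 3.6, 4.0, 4.4])
print(f"test codec uses {ratio:.2f}x the anchor bitrate at equal quality")
# a ratio below 0.5 corresponds to >2X coding efficiency
```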
Challenges:
- DCC 2024 Workshop and Challenge on Learned Image Compression
- CVPR 2022 Workshop and Challenge on Learned Image Compression
Publications:
- VCD: A Video Conferencing Dataset for Video Compression
- A crowdsourcing approach to video quality assessment
- Full Reference Video Quality Assessment for Machine Learning-Based Video Codecs
Meeting effectiveness and inclusiveness
Meetings are a pervasive method of communicating in companies and consume a lot of time and resources. The goal of this project was to determine whether we could effectively measure the effectiveness and inclusiveness of meetings, and ultimately improve them. There has been research on modeling meeting effectiveness, but no prior research on measuring or modeling meeting inclusiveness. We provide the first methodology to measure and model meeting effectiveness and inclusiveness and show how they can be improved. One specific way to improve meeting inclusiveness is to better allow remote speakers to interrupt and talk in meetings, which we address with the first model we are aware of for detecting failed speech interruptions. A demo video is here.
Publications:
- Meeting effectiveness and inclusiveness: large-scale measurement, identification of key features, and prediction in real-world remote meetings
- Real-time Speech Interruption Analysis: From Cloud to Client Deployment
- Meeting effectiveness and inclusiveness in remote collaboration
- Improving meeting inclusiveness using speech interruption analysis
ML bandwidth estimation and control
The goal of this project is to estimate how much bandwidth we can send across the internet for video calls. Traditionally this has been solved with classic control theory; we are using machine learning, in particular deep reinforcement learning. This is an especially challenging problem because, unlike audio or video, the environment here is the internet itself, which is impossible to simulate accurately.
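To give a feel for the reinforcement learning framing (this is a toy, not the deep RL system described above), the sketch below has an epsilon-greedy agent pick a send rate on a simulated link with unknown capacity, where the reward favors throughput and penalizes overshoot. Every number - rates, capacity, reward shape - is a made-up illustration.

```python
# Toy sketch: bandwidth estimation framed as reinforcement learning.
# An epsilon-greedy agent learns which send rate maximizes reward on a
# simulated link. The real system uses deep RL and real networks; all
# numbers here are hypothetical.
import random

random.seed(0)

CAPACITY_KBPS = 1800                    # hidden link capacity to discover
RATES = [250, 500, 1000, 2000, 4000]    # candidate send rates (actions)

def reward(rate):
    """Delivered throughput minus a stiff penalty for lost traffic."""
    delivered = min(rate, CAPACITY_KBPS)
    lost = max(rate - CAPACITY_KBPS, 0)
    return delivered - 5.0 * lost       # losing packets hurts more than idling

q_values = {r: 0.0 for r in RATES}
counts = {r: 0 for r in RATES}

for step in range(2000):
    if random.random() < 0.1:                          # explore
        rate = random.choice(RATES)
    else:                                              # exploit
        rate = max(RATES, key=lambda r: q_values[r])
    counts[rate] += 1
    # incremental average update of the action-value estimate
    q_values[rate] += (reward(rate) - q_values[rate]) / counts[rate]

best = max(RATES, key=lambda r: q_values[r])
print(f"learned send rate: {best} kbps (capacity {CAPACITY_KBPS} kbps)")
```

The hard part in practice is exactly what this toy hides: the "environment" is the real internet, whose dynamics cannot be captured by a fixed simulated capacity.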
Publications:
- Reinforcement learning for bandwidth estimation and congestion control in real-time communications
- ACM MMSys 2024 Bandwidth Estimation in Real Time Communications Challenge
Active speaker detection
I’ve been working on active speaker detection (ASD) since graduate school, where I developed the first neural network solution for multi-modal fusion to detect active speakers with a single microphone and camera. Since then I’ve implemented ASD several more times using microphone arrays, depth cameras, and deep learning, and have shipped it in multiple products, including Microsoft RoundTable. ASD is still an active area of research and one I continue to work on.
Publications:
- A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism
- Multimodal active speaker detection and virtual cinematography for video conferencing
- Boosting-based multimodal speaker detection for distributed meeting videos
- Look who’s talking: Speaker detection using video and audio correlation
Light field camera/display video conferencing
The ultimate form of remote conferencing will preserve eye gaze and who is looking at whom, have accurate spatial geometry, and achieve the same level of trust, empathy, meeting effectiveness and inclusiveness, and fatigue as face-to-face meetings. Achieving this will require new types of displays and cameras. Two prototypes I’ve worked on, called TeleWall and TeleWindow, are designed to meet these goals using light field cameras and AR glasses.
Patents:
- Dynamic detection and correction of light field camera array miscalibration
- Light field camera modules and light field camera module arrays
- Device pose detection and pose-related image capture and processing for light field based telepresence communications
- Telepresence devices operation methods
- Telepresence device
Camera designs
I’ve designed several types of cameras for human motion capture and teleconferencing besides the above light field cameras. My first commercial design was the RoundTable camera, a high-resolution 360-degree camera using a pentagonal prism and view cameras to minimize stitching error. That design was later made HD and shipped by Polycom. A newer 360-degree cost-reduced design uses just two cameras. I’ve also designed some front-of-the-room cameras, privacy-preserving webcams, and whiteboard cameras. Some of these are described below.
Patents:
- 360 degree camera
- Multi-view integrated camera system with housing
- Maintenance of panoramic camera orientation
- Omni-directional camera with calibration and up look angle improvements
- Whiteboard view camera
- Omni-directional camera design for video conferencing
- Camera lens shuttering mechanism
Video DSP
I’ve implemented many video DSP components, including panoramic stitching, smart gain control, camera calibration for panoramic cameras, normalizing the size of heads in video conferences, and demosaicing (used in MATLAB and on NASA Mars missions). Some of these are described below:
Publications:
- High-quality linear interpolation for demosaicing of Bayer-patterned color images
- Automatic head-size equalization in panorama images for video conferencing
- Head-size equalization for improved visual perception in video conferencing
- Practical calibrations for a real-time digital omnidirectional camera
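The demosaicing publication above describes gradient-corrected linear interpolation; as background, here is a sketch of the naive baseline it improves on - filling each missing channel of an RGGB Bayer mosaic by averaging same-channel neighbors. This is a simplified illustration, not the published algorithm, and the pixel values are hypothetical.

```python
# Sketch: a naive demosaicing baseline for an RGGB Bayer mosaic.
# The published method adds gradient corrections from the other color
# channels on top of linear interpolation; this shows only the
# neighbor-averaging baseline it improves upon.

def bayer_channel(y, x):
    """Which color an RGGB Bayer sensor captures at pixel (y, x)."""
    if y % 2 == 0:
        return "R" if x % 2 == 0 else "G"
    return "G" if x % 2 == 0 else "B"

def demosaic(mosaic):
    """Fill missing channels by averaging same-channel 8-neighborhood values."""
    h, w = len(mosaic), len(mosaic[0])
    out = [[{} for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ch in "RGB":
                if bayer_channel(y, x) == ch:
                    out[y][x][ch] = float(mosaic[y][x])
                else:
                    vals = [mosaic[ny][nx]
                            for ny in range(max(0, y - 1), min(h, y + 2))
                            for nx in range(max(0, x - 1), min(w, x + 2))
                            if bayer_channel(ny, nx) == ch]
                    out[y][x][ch] = sum(vals) / len(vals)
    return out

# A flat gray scene: every sensor site reads 128, so the reconstruction
# should be gray everywhere (a basic sanity check for any demosaicer).
mosaic = [[128] * 4 for _ in range(4)]
rgb = demosaic(mosaic)
print(rgb[1][1])  # {'R': 128.0, 'G': 128.0, 'B': 128.0}
```

Plain averaging like this blurs edges and causes color fringing; the gradient correction exploits cross-channel correlation to sharpen the result.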
Patents:
- Parametric calibration for panoramic camera systems
- Temperature compensation in multi-camera photographic devices
- Privacy camera
- System and method for camera calibration and images stitching
- Minimizing dead zones in panoramic images
- Speaker and person backlighting for improved AEC and AGC
- Automatic gain and exposure control using region of interest detection
- Automatic detection of panoramic camera position and orientation table parameters
- System and method for camera color calibration and image stitching
- Automatic video framing
- Techniques for detecting a display device
Audio DSP
I’ve also implemented a number of audio DSP components, such as speaking-while-muted detection, improved echo cancellers, and audio-based device discovery. Some of these are described below:
Patents:
- Mute control in audio endpoints
- Removing near-end frequencies from far-end sound
- Endpoint echo detection
- Reducing echo
- Enhanced discovery for ad-hoc meetings
- System and process for discovery of network-connected devices at remote sites using audio-based discovery techniques
- System and method for communicating audio data signals via an audio communications medium
- Audio start service for Ad-hoc meetings
Acoustic designs
I’ve designed the physical acoustics for several shipping speakerphones. Some are described below:
Patents:
- Audio device
- Method and system of varying mechanical vibrations at a microphone
- Satellite microphones for improved speaker detection and zoom
- Satellite microphone array for video conferencing
- Microphone array for a camera speakerphone
- Boundary binaural microphone array
Audio / video device quality
I wrote the first several versions of the Skype / Teams video quality certification specifications and tools, the PC industry standard for measuring webcam video quality. I also co-authored the Skype / Teams audio quality certification specifications, the PC industry standard for measuring send and receive audio quality. Both were developed after designing my own cameras and audio endpoints.
Specifications:
- Microsoft Teams and Skype for Business specifications for USB peripherals, PCs, and Microsoft Teams Room systems, versions 1-3
- Microsoft Lync Specifications for USB peripherals, PCs, and Lync room systems, versions A-F
Periodic motion detection and analysis
Part of my Ph.D. thesis was on detecting and analyzing periodic motion, especially biological motion like human gait. This work showed that biological motion can be detected with very few pixels, and that people can even be identified from this motion.
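The core idea can be illustrated with image self-similarity: build a frame-vs-frame dissimilarity matrix and read the motion period off its diagonals. In this sketch, real video frames are replaced with tiny synthetic per-frame feature vectors, so everything below is a made-up illustration of the technique rather than the thesis implementation.

```python
# Sketch of the image self-similarity idea: periodic motion produces
# low-dissimilarity diagonals in the frame-vs-frame distance matrix,
# offset from the main diagonal by the motion period. Frames here are
# synthetic feature vectors, not real images.

PERIOD = 8  # ground-truth period of the synthetic "gait" sequence

# Hypothetical per-frame features: a periodic pattern sampled at 4 phases.
pattern = [0, 1, 2, 1, 0, -1, -2, -1]
frames = [[pattern[(t + p) % PERIOD] for p in range(4)] for t in range(64)]

def dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Self-similarity matrix over all frame pairs.
n = len(frames)
S = [[dist(frames[i], frames[j]) for j in range(n)] for i in range(n)]

def mean_diag(S, lag):
    """Average dissimilarity along the diagonal at a given frame lag."""
    vals = [S[i][i + lag] for i in range(len(S) - lag)]
    return sum(vals) / len(vals)

# Estimate the period as the smallest lag minimizing average dissimilarity.
period = min(range(1, n // 2), key=lambda lag: mean_diag(S, lag))
print(f"estimated period: {period} frames")  # matches PERIOD = 8
```

Because the matrix depends only on frame-to-frame similarity, the same recipe works even when each "frame" is just a handful of pixels, which is why so few pixels suffice to detect biological motion.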
Publications:
- Gait recognition using image self-similarity
- View-invariant estimation of height and stride for gait recognition
- Stride and cadence as a biometric in automatic person identification and verification
- Motion-based recognition of people in eigengait space
- Eigengait: Motion-based recognition of people using image self-similarity
- Robust real-time periodic motion detection, analysis, and applications
Mathematics genealogy
I have a rich lineage of Ph.D. advisers, including these famous researchers:
- David Hilbert
- Felix Klein
- Rudolf Otto Sigismund Lipschitz
- Carl Friedrich Gauss
- Siméon Denis Poisson
- Jean-Baptiste Joseph Fourier
- Joseph Louis Lagrange
- Pierre-Simon Laplace
- Leonhard Paul Euler
- Johann Bernoulli
- Jacob Bernoulli
- Friedrich Leibniz
- Johannes Kepler
- Nicolaus Copernicus
For more details see here.