Artificial Intelligence in Real Time Communications

Study Overview

Thanks to advances in Machine Learning techniques, intelligence is being added to the products and services we use every day. We routinely speak to voice assistants, use vision processing to identify friends and family in photos, and quietly benefit from behind-the-scenes algorithms that improve quality and reliability. Advances in consumer-oriented AI technologies are now finding new applications and use cases as these capabilities become democratized. The communications industry, which was once at the forefront of many of these technologies, is now presented with a plethora of new options for improving existing applications, finding new cost advantages, and redefining existing communications modalities.

This study examines the role of Artificial Intelligence (AI) and Machine Learning in Real Time Communications. It is designed to help product, strategy, and business development decision makers at communications service providers, technology vendors, communications-centric app providers, and enterprise information technology organizations.

The report authors have years of experience in technical, product management, and consulting roles evaluating and applying new technologies, including practical work with speech analytics, computer vision, voice bots, and performance algorithms. Together they bring unique insight and supporting data for product owners, analysts, and anyone who has a key stake in advancing the communications market.

“AI and ML will transform every industry. Just as computers were like bicycles for your mind, this will be like the space shuttle. This is an excellent and thorough report. It is written by people that actually understand technology and how products are made.”

Omar Javaid, Chief Product Officer @ Vonage

“There is a real demand for AI in communication products. Our partners are expecting us to have a plan for AI. This report came right on time for us. It contains a wide and detailed view of the industry, assisting us in honing our AI roadmap for our web collaboration product.”

John Logsdon, CEO and Founder @ Drum


  • Company Interviews – more than 40 in-depth, 1-on-1 interviews with key industry technology suppliers and leading vendors in speech analytics, computer vision, voicebots, and machine learning tools
  • Web Survey – a survey with nearly 100 distinct company respondents was used for broader real time communications market insights
  • Industry and technology events – the authors attended and reviewed key RTC and AI events to understand general themes, new technologies, and emerging players
  • Product Reviews – detailed reviews of leading machine learning-based products, and AI in RTC solutions identified through primary and secondary research
  • Analysis – using the inputs above, the authors synthesized analytical frameworks, aiming to highlight key trends and present their findings in an easy-to-read report



  • Communications service and application providers – looking to enhance their products and services
  • Contact Centers and IT organizations – aiming to leverage AI in their customer and employee communications
  • AI technology vendors – targeting businesses in the communications vertical and working to enhance telephony-oriented features
  • Investors – looking for an impartial analysis of the opportunities in communications-centric AI applications




KGR is a subsidiary of Kranky Geek, LLC, a technology event company. Kranky Geek has held 7 international events. Its last event focused on AI topics, including speech analytics and natural language processing for telephony, blending WebRTC with Augmented Reality (AR), using computer vision to detect inappropriate behavior on video, Machine Learning for improving RTC video quality, and using TensorFlow to optimize congestion control.


Chad Hart

Chad Hart is an independent consultant and blogger at and, a blog focused on exploring the intersection of AI and communications. Chad recently ran a new product incubator where he brought many new product experiments to market, including Emergency Service 911 calling for Alexa and a production speech analytics service. Chad’s professional experience includes authoring several large market research studies as an analyst, corporate business intelligence and industry analysis, and product management.

Learn more about Chad at


Tsahi Levent-Levi

Tsahi is an independent analyst and consultant for WebRTC, communications, and AI, where he has authored many widely read reports and whitepapers. Tsahi Levent-Levi has over 15 years of experience in telecommunications, exploring and implementing new technologies as an engineer, manager, marketer, and CTO. Tsahi holds a Computer Science Masters Degree from where his thesis was on machine learning in computational linguistics.

Learn more about Tsahi at


Contact & Ordering

You can learn more about the report in our prospectus (PDF).

Please email us at research at if you have any further questions.

Table of Contents

  • List of Figures
  • List of Tables

Executive Summary

  • Use Cases
  • Market Dynamics
    • Drivers
    • Inhibitors
    • Market Landscape
  • Recommendations

Scope & Methodology

  • Scope
    • Real Time Communications Applications
    • Artificial Intelligence Technologies
  • Research Methodology
  • Expertise
    • KGR
    • Chad Hart
    • Tsahi Levent-Levi

Machine Learning Overview

  • Introduction
  • AI and ML
  • Learning Approaches
    • Supervised
    • Unsupervised
  • Deep Learning
  • Data Flow in Machine Learning
  • Product Aspects of Machine Learning
  • Limitations
  • AI in RTC
    • Edge versus Client
    • Voice and Video
  • What Next?

Speech Analytics

  • Introduction
  • Use Cases
  • Speech Analytics Technology Stack
    • Media Server
    • Speech-to-Text Engine
    • NLU Engine
    • Analytics Applications
  • Market Dynamics
    • Drivers
      • Deep Learning
      • Open Source
      • Start-Ups
      • Cloud Platforms Driving STT Commoditization
    • Inhibitors
      • Language Support
      • Legacy System Recording Quality
      • Custom Vocabularies
      • Compliance
      • Using the Data
    • Emerging Features and Trends
      • Paralinguistics
      • Speaker Separation
      • Speaker Recognition
      • Real Time
      • Deeper Analytics
      • Edge Transcription
    • Market Landscape
      • Stakeholder Groups
      • Vendor Groups
  • Selection Criteria
    • Media Server
    • Speech-to-Text (STT) Engine
    • NLU Engine
    • Analytics Applications
  • Recommendations
    • Dealing with Legacy Telephony Environments
    • Be Careful with Word Error Rates
    • SIPREC Recording in Multi-Vendor Environments
    • Speech Engine Build vs. Buy


Voicebots

  • Introduction
  • Use Cases
    • Inbound Interactive Voice Response (IVR)
    • Outbound IVR
    • Agent Assistant
    • User Assistant
    • Conference Call Assistant
    • Smart Conference Room Devices
  • Voicebot Technology Stack
    • RTC-Bot Gateway
    • Wake Word Detector
    • Speech-to-Text Engine (STT)
    • Bot Engine
      • Natural Language Understanding (NLU)
    • Text-to-Speech (TTS)
    • Bot Application
  • Market Dynamics
    • Drivers
      • Chatbots
      • Consumer Voice Market
      • Speech Technology Improvements
      • Call Center Economics
    • Inhibitors
      • Telephony Integration
      • Language Support
      • Voice User Interface (VUI) Expertise
    • Emerging Features and Trends
      • Knowledge Extraction
      • API Simplification
      • Graphical Development
      • Speaker Recognition
    • Market Landscape
      • Major Vendor Groups
      • Implementation Approaches and Implications
  • Selection Criteria
    • RTC-Bot Gateway
    • Wake Word Detector
    • Speech-to-Text
    • Text-to-Speech
    • Bot Engine
  • Recommendations
    • Cloud Implementations for Agility
    • Replace Outdated IVR Technology
    • Many VoIP Devices Could be Smart Devices
    • Owning the Voicebot Technology Stack

Computer Vision

  • Introduction
  • Computer Vision Technology Stack
    • Image Data
      • From Image to Video
    • Train
    • Optimize
    • Inference
      • Cloud Inference
      • Edge Inference
  • Use Cases
    • Silly Hats
    • Image Enhancement
      • Improving Colors
      • Replacing the Background
      • Improving a Participant’s Looks
    • Face Recognition
    • Face Tracking
      • Automatic Zoom
      • Head Counting
      • Assist with Speaker Diarization
    • Emotion Detection
    • Body and Gesture Tracking
    • Not Safe for Work (NSFW)
    • Image Classification and Object Detection
      • Whiteboard Detection
      • AR, IoT, and Healthcare
    • Optical Character Recognition (OCR)
  • Market Dynamics
    • Drivers
      • Deep Learning
      • Open Source Projects
      • Cloud APIs
      • iOS and Android Edge Inference
    • Inhibitors
      • Video Algorithms
      • Real Time Processing
      • Cloud Cost
      • Inference on Edge Devices
      • Big Brother Concerns
    • Market Landscape
  • Selection Criteria
  • Recommendations
    • Start from the Business Value
    • Focus on Quick Wins
    • Decide on Cloud vs. Edge Inference
    • Owning the Data

RTC Quality and Cost Optimization

  • Introduction
  • Technology Overview
    • Network Level Optimization
    • The Media Processing Pipeline
      • Capture
      • Encode
      • Send
      • Receive
      • Decode
      • Play
  • Market Dynamics
    • Drivers
      • AI Adoption
      • Media Quality
    • Inhibitors
      • Existing Heuristic Algorithms
      • ROI Calculation
      • Edge Inference
  • Market Trends
    • Edge Inference
    • In-House Implementations
  • Selection Criteria
  • Recommendations
    • Differentiate on Quality
    • RTC Quality Build vs. Buy
    • Pick Algorithms to Focus On
      • Noise Suppression
      • Bandwidth Estimation
      • Packet Loss Concealment

RTC Survey Results

  • Demographics
  • AI Adoption Challenges
  • AI Adoption Drivers
  • AI Initiatives
  • AI Technology Use
  • Inference Locations
  • Open Source vs. Commercial
  • Top Vendors
    • Machine Learning Tools and Frameworks
    • Speech Analytics
    • Voicebots
    • Computer Vision

List of Figures

  • Figure 1: Top AI in RTC drivers and inhibitors web survey results
  • Figure 2: Example of hotdog not hotdog app made by HBO’s Silicon Valley
  • Figure 3: Twitter trending topics for San Francisco region on May 26, 2018
  • Figure 4: Neural network architecture in deep learning
  • Figure 5: Communications versus machine learning architectures
  • Figure 6: Typical layers of a speech analytics application
  • Figure 7: Speech analytics dashboard application example
  • Figure 8: Google data showing transcription accuracy improvement
  • Figure 9: Speech analytics stakeholder groups and vendor types
  • Figure 10: Possible SIPREC configuration
  • Figure 11: Voicebot technology stack for a telephony system
  • Figure 12: Visualization of a speech waveform
  • Figure 13: Consumer voicebot market; notable milestones
  • Figure 14: Results from a January 2018 consumer survey of smart speaker users
  • Figure 15: Marks & Spencer used voicebot technology to replace DTMF IVRs
  • Figure 16: Cloud API interaction options
  • Figure 17: Voicebot implementation approaches
  • Figure 18: Cisco’s Spark Assistant is a voicebot for its video conferencing hardware
  • Figure 19: Computer vision model development and deployment process
  • Figure 20: Computer vision cloud APIs receiving and decoding the media stream in
    parallel to the actual service
  • Figure 21: Computer vision making use of the edge device encoder to reduce
    processing requirements
  • Figure 22: Cloud inference in computer vision
  • Figure 23: Edge inference in computer vision
  • Figure 24: Image enhancements using Facebook AR Studio
  • Figure 25: Improving a participant’s look
  • Figure 26: The media processing pipeline
  • Figure 27: Survey respondents segmented by company type
  • Figure 28: AI adoption challenges
  • Figure 29: Primary drivers of AI technology adoption for communications
  • Figure 30: AI strategy within company
  • Figure 31: AI technology used or provided by respondents
  • Figure 32: Locations where ML inference is run
  • Figure 33: Preference to leverage open source vs. commercial preferences
  • Figure 34: Top named machine learning tools and frameworks
  • Figure 35: Top named speech analytics solutions
  • Figure 36: Top named voicebot platforms
  • Figure 37: Top named computer vision tools and platforms

List of Tables

  • Table 1: Summary of AI in RTC use cases by domain
  • Table 2: Summary of AI in RTC drivers
  • Table 3: Summary of AI in RTC inhibitors
  • Table 4: Supervised vs. unsupervised learning
  • Table 5: Speech analytics use case examples
  • Table 6: Speech analytics drivers and inhibitors summary
  • Table 7: Examples of speech analytics start-ups that are less than three years old
  • Table 8: Major cloud provider transcription and analytics offers
  • Table 9: List of regulations where compliance may be required and where
    recording and speech analytics can help
  • Table 10: Speech analytics enabling technology and application vendor categories
  • Table 11: Telephony gateway & recorder selection criteria
  • Table 12: STT engine selection criteria
  • Table 13: NLU engine selection criteria
  • Table 14: Analytics application selection criteria
  • Table 15: TTS and NLU technology build vs. buy considerations
  • Table 16: Voicebot in RTC market drivers and inhibitors
  • Table 17: Comparison of major cloud platform TTS vendors
  • Table 18: Sample of LinkedIn job posting titles that include Voice User Interface requirements
  • Table 19: Voicebot stakeholder groups & trends
  • Table 20: RTC-Bot gateway options for major cloud vendor bot systems
  • Table 21: Wake Word Detector selection criteria
  • Table 22: Speech-to-Text for voicebots selection criteria
  • Table 23: Text-to-Speech selection criteria
  • Table 24: Bot Engine selection criteria
  • Table 25: Computer vision use cases in real time communications
  • Table 26: Computer vision drivers and inhibitors summary
  • Table 27: Computer vision services by cloud vendors
  • Table 28: Computer vision-related services in mobile operating systems
  • Table 29: Major cloud provider video analytics features
  • Table 30: Specialized computer vision vendors
  • Table 31: Selection criteria in computer vision
  • Table 32: RTC quality and cost optimization drivers and inhibitors summary
  • Table 33: Vendors and optimization use cases across the media processing stack

Contact us to inquire about our discounts.
