Part Two: Managing the Velocity of Voice

A Decade of Opportunity for the Software Tech Sector

In this second of a three-part series of articles on the rapidly evolving importance of audio (voice) and video content, Mark Edwards and John Symon explore some of the potential technical challenges, as well as the considerable market opportunities, for Voice Recognition technologies in the Software Tech sector.

To gain a broader overview, Boss Equity also speaks with industry veterans Alan Pelz-Sharpe, Harvey Spencer and George Parapadakis to get their views on some of the drivers, challenges, applications and sectors where they see the greatest demand and potential for vendors within the Software Tech sector.
 
Voice to follow the Path of OCR?

Voice recognition is currently at the stage that OCR was at more than 20 years ago. When scanned images were introduced in the 1990s, OCR was not as effective as it is today. As this technology evolved over the next decade, the results became sufficiently accurate to provide time efficiencies and millions of dollars in cost savings.

In the early days, there were many OCR vendors and outsourced service bureaux that overstated the accuracy and created disappointment and mistrust. This hampered the adoption of OCR with clients who misunderstood the real issues regarding accuracy and speed.

Similar pitfalls could slow the adoption of voice transcription. Since those early days, OCR algorithms, capture technology and the processes around the successful use of OCR have improved dramatically, and the scepticism has been allayed.
 
Will IVR be the new IDR for the next Decade?

Apologies for the acronyms!  Although the IT industry is frustratingly riddled with them, "IDR" (Intelligent Document Recognition) is well known to the document and content capture subset of the Software Tech industry. 

The new term, "IVR" (Intelligent Voice Recognition), is now well established as part of the lexicon in the further evolution of this sector. Voice Recognition technologies – particularly speech to text – will soon need to be integrated as another data source to be captured, processed, analysed, searched and stored, with the same ease as other record systems, such as Word documents, emails, instant messages etc.

Even in the early stages of its development, OCR was an easy sell for suppliers in the early nineties. Typically, somebody reading a document of 350 words and making brief notes could manage an A4 document in about two to three minutes.

A scanner and OCR engine would process about 40 pages in the same timeframe. The system could also remember every word it read and dynamically create a searchable database as it went along. Later versions of OCR added Intelligence, based on knowledge of data in the user’s existing records management. This Intelligent Character Recognition (ICR) is now perceived as the benchmark for character recognition.
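As a rough check on the comparison above, the quoted figures can be turned into a throughput ratio. This is a minimal sketch only: the 2.5-minute reading time is an assumed midpoint of the range quoted in the article, and the function name is hypothetical.

```python
# Illustrative arithmetic only. Page size and OCR page count come from the
# article; the 2.5-minute reading time is an assumed midpoint.

WORDS_PER_PAGE = 350            # A4 document size quoted in the article
HUMAN_MINUTES_PER_PAGE = 2.5    # assumption: midpoint of the quoted range
OCR_PAGES_IN_SAME_TIME = 40     # pages handled by scanner and OCR engine


def throughput_words_per_minute(pages: float, minutes: float) -> float:
    """Words processed per minute for a given number of pages."""
    return pages * WORDS_PER_PAGE / minutes


human_wpm = throughput_words_per_minute(1, HUMAN_MINUTES_PER_PAGE)
ocr_wpm = throughput_words_per_minute(OCR_PAGES_IN_SAME_TIME, HUMAN_MINUTES_PER_PAGE)
speedup = ocr_wpm / human_wpm

print(f"Human reader:  {human_wpm:.0f} words per minute")
print(f"Scanner + OCR: {ocr_wpm:.0f} words per minute")
print(f"Speed-up:      {speedup:.0f}x")
```

Whatever reading time is assumed, the ratio stays at the number of pages the OCR engine handles in that time, which is why the sales case was so easy to make.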
 
Learn the Lessons, Don’t Repeat the Mistakes

The lessons learned from the adoption of OCR must be applied to the ever-increasing need for voice technology if the Software Tech industry is not to encounter equally slow and frustrating adoption and growth. Demand is certainly upon us right now, and increasing, as highlighted in our first article, "Part One: The Importance of Audio Capture for Software Tech Vendors".

Just to be clear, we are NOT talking about voice recognition technologies for the commercial and automobile sector with a variety of "Siri-like" offerings. Nor are we referring to the application of intelligent voice recognition for call centres and other customer service applications, although great strides have already been made in this area. Such applications are not relevant to this article and can create a confusing fog around the subject. 

We are focusing instead on the missing piece of content capture and, most importantly, the process and management of video and audio files that are accumulating at a growing rate. By treating voice and video like any other data source, a significant opportunity to drive value arises, which will further the evolution of this sector.

"People generally mistrust automation, even when it has been proven that accuracy over large volumes is better than manual processing. ... voice and video are now becoming more viable options."
George Parapadakis

The caveat here, though, is that Software Tech vendors need to lift their heads now and start to understand the applications, issues and benefits. They need to create solutions to overcome their clients’ innate, unexplored voice problems, which are occurring now and will only continue to multiply dramatically in the future!
 
"  re: Healthcare sector - Voice could well be next and the potential there is not only enormous but the timing is perfect as it is already committed to digital change, has the regulatory and market pressure to do so  "
Alan Pelz-Sharpe
 
Follow the Money!
So, to the real meat of this article:  Software Tech, ECM, Content Management, Document Management, software and services vendors can make more profit, create greater growth for their company and greater differentiation by becoming the early adopters of voice / video technologies for the Enterprise Applications market. 

This means Software Tech vendors need to address the delay in incorporating such technology into their existing solution offerings.  If they wait until clients come knocking on their door, demanding such a solution, then they will inevitably be too late; they will be behind the curve and will have lost ground - as well as the lion’s share of the new business opportunities.

In the true sense of solution selling, Software Tech vendors need to alert their clients to impending problems before they arise.  The solution also needs to be designed for them as, typically, clients are notoriously bad at self-diagnosis and self-medication.  If you are not talking to your clients, alerting them to issues and potential benefits, then you are not selling – just taking orders.

Voice solutions have already been implemented in the Banking, Healthcare and Legal sectors with varying degrees of sophistication and with increasing success.  However, in talking with our clients, Boss Equity understands that the typical Software Tech vendor is not the first point of contact for such customers when it comes to Voice; this compounds the problem of being unaware of the demand or the potential size of the opportunity. 

Voice and Audio are both content management issues but, as we type, new silos are being created for such content.  This is not a good state of affairs for either the client or the Software Tech vendors.

The big profits will come to the true solutions sellers, who can take their clients through a process of discovery geared towards unearthing solutions that circumvent problems and deliver significant benefits and high ROI.  Those vendors who don’t take this course of action will find their companies relegated to the lower divisions of order takers, in competition with the margin-cutting masses.
 
"Needle in a Haystack" or "Smoking Gun"?

In an increasingly regulatory and litigious environment, being able to find and substantiate who said what to whom and when, can mitigate hefty fines and even possible jail sentences, whilst also ensuring compliance and best practice in meeting Information Governance regulations.  

Given the enormous volumes of voice communications being generated, one of the biggest challenges is in looking for a needle in a haystack or the “smoking gun”, and avoiding wasting time searching for the wrong or irrelevant information.
 
Finding the Unknown Unknowns

One of the key challenges in capturing, processing, analysing and searching through thousands of hours of voice recordings is the ability to scrutinise what is said, whether via telephone or other voice recording devices, and establish the relevance, context and sentiment of such communications. These are the "unknown unknowns".

“There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.”
Former US Defence Secretary Donald Rumsfeld
 
Safeguards: Financial Traders
Two of the major areas where Voice Recognition is being deployed are e-Discovery and Litigation Support.  As mentioned in our first article, the fallout from the financial crisis of 2008, which gave rise to the Dodd-Frank Act, and subsequent scandals in the Financial Services sector, such as Libor, have accelerated the need to capture and analyse what is being said by traders.  Manually transcribing telephone calls for only 50 traders for just one year will cost a minimum of $2M.

"  ...today we have pretty accurate technology (voice); the barrier is informing people of its existence and then taking all of that and leveraging it in common work situations  "
Alan Pelz-Sharpe
 
How Quickly Do You Want to Stop a Rogue Trader?
If a trader has gone rogue, how soon do you want him or her stopped? Recording the telephone conversations of 50 traders is fine, but what then?  Do you have another 50 people listening to this audio in real time to check for compliance and regulation breaches? That would be hugely costly and totally unrealistic.

Voice is now at the stage where it has reached the tipping point, just as happened with the impact of OCR many years ago. It’s not perfect but it’s better than the alternative, which is manually checking such voice communications in real time.

According to current estimates, typically no more than 0.1% of audio files are ever heard. Even taking the Enron example, with all the inaccuracies involved in voice transcription at that time, an automatic check of 100% of the audio recordings at, say, 70% accuracy would have been better than no check at all, or than manually checking just a small percentage.
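The trade-off between coverage and accuracy can be illustrated with a minimal sketch. It rests on loud assumptions: "70% accuracy" is read here as the chance that a problem call is flagged, manual review is assumed to catch every problem it listens to, and the call volume and problem rate are hypothetical numbers chosen only to make the arithmetic visible.

```python
# Illustrative only. The 0.1% and 70% figures come from the article;
# everything else is a hypothetical assumption.

def expected_flagged(total_calls: int, problem_rate: float,
                     coverage: float, detection_rate: float) -> float:
    """Expected number of problem calls caught by a review process."""
    return total_calls * problem_rate * coverage * detection_rate


TOTAL_CALLS = 100_000   # hypothetical number of recorded calls
PROBLEM_RATE = 0.001    # hypothetical: 1 call in 1,000 is a problem

# Manual review: 0.1% of recordings heard, assumed perfect detection.
manual = expected_flagged(TOTAL_CALLS, PROBLEM_RATE,
                          coverage=0.001, detection_rate=1.0)

# Automated review: every recording checked, 70% assumed detection rate.
automated = expected_flagged(TOTAL_CALLS, PROBLEM_RATE,
                             coverage=1.0, detection_rate=0.70)

print(f"Manual sample review: {manual:.1f} problem calls expected to be caught")
print(f"Automated full check: {automated:.1f} problem calls expected to be caught")
```

Under these assumptions the imperfect automated check still catches far more than the near-perfect manual sample, simply because it looks at everything.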

Currently, audio is all but inaccessible to an ECM system. As with the introduction of OCR, audio and video technology can now unlock petabytes of information. It’s never going to be 100% accurate but it is a huge step forward.
 
Other Technology Challenges for Voice Recognition

One of the biggest challenges is the ability to accurately capture telephone or mobile voice communications. Telephone speech is highly compressed and sampled at 8 kHz, rather than the 16 kHz used for desktop speech or other voice recording devices. Sound quality and storage space present significant issues. Processing speed, accuracy and significant cost issues need to be factored in to leverage the real value of Voice Recognition technologies.
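To give a feel for the storage side of the 8 kHz and 16 kHz rates mentioned above, the sketch below estimates the size of one hour of audio. It assumes uncompressed, 16-bit, single-channel PCM, which is an assumption for illustration only; compressed telephone audio will be considerably smaller. The function name is hypothetical.

```python
# Illustrative only: assumes uncompressed 16-bit mono PCM audio.

def pcm_bytes_per_hour(sample_rate_hz: int, bits_per_sample: int = 16,
                       channels: int = 1) -> int:
    """Raw storage needed for one hour of uncompressed PCM audio."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return bytes_per_second * 3600


for label, rate in (("Telephone (8 kHz)", 8_000), ("Desktop (16 kHz)", 16_000)):
    megabytes = pcm_bytes_per_hour(rate) / 1_000_000
    print(f"{label}: {megabytes:.1f} MB per hour")
```

Doubling the sampling rate doubles the raw storage, so the capture quality that helps recognition accuracy comes at a direct storage cost.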

Further challenges centre around how to accurately capture, recognise and transcribe foreign languages. Not just languages, but also local accents and dialects, colloquial phrases and slang that you will hear spoken, but not necessarily written.

In addition, idiomatic nuances such as sarcasm and innuendo, as well as sentiment in voice communications also need to be analysed to avoid looking for a needle in the haystack, or searching for the wrong information. Can the technology identify an individual's voice, no matter which phone the audio is taken from? Can it understand concepts as well as keywords? 

Early-stage speech to text has struggled to achieve anywhere near the accuracy levels of early OCR. Character pictures for Latin-based languages are all fairly similar and, with only 26 characters to recognise, options are limited. Voices, on the other hand, rarely sound the same in pronouncing a single word. Include the confusion of different accents and poor recordings and accuracy levels drop alarmingly.

There is another critical problem with Voice, which is not common with paper: in a normal conversation, people talk over each other so there are overlapping streams. It is very difficult for technology to discern multiple, concurrent voices, particularly if the sound quality is also affected by ambient noise.

"It's all about managing business in today's real time world. It is all elements that affect all processes at all times. Documents, transactions, voice, pictures.  If it comes in, you have to deal with it somehow. Ignoring it just does not work.  "
Harvey Spencer

Processing Speed Will be the Biggest Differentiator
People speak at about 150 words per minute, so the equivalent of our A4 document above would take about two to three minutes to listen to and take notes on. Speech to text has been taking about three times as long as the original recording to process, with less accuracy than OCR. Hence the reluctance of people to buy into text conversion of audio and video.

Processing speed will therefore clearly be one of the most critical factors in the ability to transcribe, process, convert to text and understand audio communications.

Despite some challenges, “Intelligent” Voice Recognition now appears to be coming of age. Jen-Hsun Huang, CEO of Nvidia, speaking at their GPU Technology Conference, commented that Deep Learning, using their GPU technology, was at such a stage of development that Intelligent Video Tagging and Voice Translation would never be the same again.  Andrew Ng, Chief Scientist at Baidu Research, speaking at the same conference, went on to predict the death of the phone and the arrival of 99% speech to text accuracy.

It is this technology that is allowing “IVR” to achieve the type of speeds that are available with OCR, running typically at 40-100 times real speech speed. The addition of client and speaker-based intelligence data enables dramatically improved recognition results.
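The difference these speeds make can be sketched with simple arithmetic. The archive size is hypothetical, and the earlier speech to text figure is read here as one hour of audio taking three hours to process, which is an interpretation rather than a measured value.

```python
# Illustrative only: the archive size is a hypothetical assumption.

def processing_hours(audio_hours: float, speed_vs_real_time: float) -> float:
    """Hours needed to process audio at a given multiple of real time.

    speed_vs_real_time: 1.0 is real time, 40 is forty times faster,
    and 1/3 is three times slower than real time.
    """
    return audio_hours / speed_vs_real_time


AUDIO_HOURS = 1_000  # hypothetical archive of recorded calls

scenarios = {
    "Earlier speech to text (3x slower than real time)": 1 / 3,
    "Low end of quoted range (40x real time)": 40,
    "High end of quoted range (100x real time)": 100,
}

for label, speed in scenarios.items():
    hours = processing_hours(AUDIO_HOURS, speed)
    print(f"{label}: {hours:,.0f} hours of processing")
```

On these assumptions, a backlog that would once have taken months of continuous processing shrinks to roughly a day or less.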
 
And what about metadata?
The ability to tag relevant information, such as the time and date of any conversation and the ID of the caller, is essential in the retrieval and analysis of such communications. In addition, Voice Recognition solutions need to comply with Data Protection requirements, including the redaction of sensitive voice data.
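A minimal sketch of what such tagging and redaction might look like on a transcript is shown below. The record layout, the field names and the redaction rule (masking any run of eight or more digits) are all hypothetical and chosen purely for illustration; real Data Protection rules are far broader.

```python
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CallRecord:
    """Hypothetical record: a transcript plus the metadata needed to find it."""
    caller_id: str
    started_at: datetime
    transcript: str


# Hypothetical rule: treat any run of 8 or more digits as sensitive.
SENSITIVE_DIGITS = re.compile(r"\d{8,}")


def redact(record: CallRecord) -> CallRecord:
    """Return a copy of the record with sensitive digit runs masked."""
    masked = SENSITIVE_DIGITS.sub("[REDACTED]", record.transcript)
    return CallRecord(record.caller_id, record.started_at, masked)


record = CallRecord(
    caller_id="trader-017",
    started_at=datetime(2017, 6, 13, 9, 30),
    transcript="Please move it to account 12345678 before noon.",
)
redacted = redact(record)
print(redacted.started_at.isoformat(), redacted.caller_id, redacted.transcript)
```

Keeping the time, date and caller ID alongside the text is what lets a transcript be searched and retrieved like any other record, while the redaction step leaves that metadata intact.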

As industry giants Apple, Amazon, IBM and Microsoft wade into the arena, we are already witnessing how Audio and Video capture are set to become the next big battleground for the Software Tech industry. As this technology continues to mature and evolve over the next 3-5 years, we anticipate increased M&A activity as contenders vie for supremacy.

To address some of the issues raised in this article, Boss Equity speaks with industry veterans, Alan Pelz-Sharpe and George Parapadakis to get their views on some of the drivers, challenges, applications and sectors where they see the highest demand and opportunity for Software Tech vendors.
 
Q1. What has been holding back audio and video content management over the past few years?
Alan Pelz-Sharpe:
The technology for the management of audio and video files – the capture, decoding, tagging and transformation – is not new and has been coming along at a healthy pace. The issue is that it has been considered a completely separate niche market/industry under the banner of Digital Asset Management (DAM). So it has grown up with little input from or connection with the world of ECM. It's also an underfunded and poorly understood area, so though folks in creative and media utilize this technology regularly, its value and use elsewhere has been limited. So the answer to what is holding back audio and video content management is two-fold: the fact that it remains a standalone separate market and the fact that this market is underfunded.

George Parapadakis:
People generally mistrust automation, even when it has been proven that accuracy over large volumes is better than manual processing. Audio and video files have traditionally been stored on a “just in case” basis to support compliance, but rarely as a key input source for content management. Hopefully, that will change as the voice recognition quality matures. Another key reason has been storage capacity: Video and audio files typically require disproportionate amounts of hard disk space to manage and most ECM platforms would struggle to manage such volumes and justify the costs. Again, with ECM platforms improving, Big Data platforms offering massively parallel processing, and storage costs diminishing, voice and video are now becoming more viable options.
 
Q2. What are some of the specific technical barriers that need to be overcome?
Alan Pelz-Sharpe:

To be fair, many of the technical barriers have already been overcome, but they have been optimized for the world of broadcast and web publishing, not true enterprise needs. So, for example, ingesting and decoding a voice file and converting it to text is a common enough practice, but the specific considerations of eDiscovery or enterprise search and associated workflows are not something that vendors have considered much. Of course, accuracy can always improve, but today we have pretty accurate technology; the barrier is informing people of its existence and then taking all of that and leveraging it in common work situations.
 
George Parapadakis:
Voice and audio processing has its own unique issues to overcome: Voice recognition requires good quality input to achieve good results. Sound capture quality varies quite dramatically between devices and the problem grows exponentially when voices of multiple participants overlap or if there is ambient noise. The human ear has a huge dynamic range and can easily distinguish and isolate sounds, something that recording technologies cannot yet match. On top of that, you need to account for different languages, local and international accents, regional variations, local dialects, regional vocabulary, differences between male, female and children’s voices, etc.  From an ECM platform perspective, the key barrier will be the handling of very large audio stream files, both in terms of storage but also in terms of intelligently distributing stream segments – rather than whole files - over networks.
 
Q3. In which sectors and applications do you see voice recognition having the biggest initial impact?
Alan Pelz-Sharpe:

I would say that the world of social media is the next one to really want to leverage voice recognition, particularly as many of its core customers are already reluctant to use keyboards beyond basic texting. Siri and its like are finally gaining popularity at the personal device level and vendors, such as Facebook and Google, as well as gaming sites (which are essentially social sites), are looking to invest and move beyond the keyboard.

In broader terms though, Healthcare for me is at least an obvious sector here and one that is already undergoing a massive digital transformation – They are mass producers of rich media files and struggle to leverage them effectively. Even the best equipped and funded hospitals in the US typically have no real strategy or understanding as to how to manage and get full value from the lifecycle of a digital asset. Yet it’s a sector that is increasingly reliant on video and imaging with handheld devices becoming common on the wards. Voice could well be next and the potential there is not only enormous but the timing is perfect as it is already committed to digital change, has the regulatory and market pressure to do so, along with (most importantly) the deep pockets to do the job well.
 
George Parapadakis:
I believe that real-time voice analytics is one of the most promising application areas in this space. The ability for a system to monitor human dialogue, analyse what is being said, the tone of the conversation and offer just-in-time knowledge - be it targeted marketing or problem resolution -  has a huge potential for improved customer experience. In addition, voice recognition could take over from OCR where manual form filling and data entry is concerned, making a huge impact, both in terms of cost and accuracy.

I would not dismiss Call centres as a main target for voice recognition in the context of the ECM/Information management sector. Not in terms of self-service customer navigation, but in terms of real-time analytics and customer profiling, in order to drive targeted, real-time marketing; or alternatively, real-time knowledge navigation to offer “most likely” problem solving guidance to the telephone operators; or to automatically launch complaints handling or service requests based on automated transcripts instead of the operators having to type the request from the customer. These can show direct bottom line benefits, which is what will drive growth and wider adoption in this segment.

Your example of the need to capture and analyse voice communications in the financial trade floor environment in real time has parallels with other technologies available today, which can identify and stop both email and IM communications in real time, if they contravene regulations, HR guidelines or corporate policy.

Q4. What are the main drivers that you are seeing now and are there additional drivers that we will see in the next few years?
Alan Pelz-Sharpe:

Tools such as FaceTime and Snapchat have already made deep inroads into the way people use applications and devices and, though I love my keyboard, younger people do not and will want to move beyond the limitations of basic texting. Voice and video are the logical next step. So, in other words, the consumerization of IT and the trends we are seeing in the personal app world will be the biggest single driver for change.  The other driver is likely to be simply a dawning realization that these technologies exist and are more advanced than many believe – add analytics and machine learning/AI to the equation (voice and video generate a lot of minable data) and you will have something interactive and of high value in many work situations.
 
George Parapadakis:
Reduced storage costs and improved network bandwidth are pre-requisites for working with audio and video. As infrastructure improves, we will inevitably see more and more ubiquitous use of verbal communication. Compliance will always be a key driver, where audio and video are used as evidentiary artefacts, but commercial benefits in marketing and operations are starting to become strong drivers for these technologies too. Finally, as voice recognition starts to become mainstream through smart devices, people’s trust in the quality of the outcome will increase, driving more demand for these capabilities in the workplace.
 
About George Parapadakis:
George is an internationally recognised ECM strategist and thought leader. With over ten years’ experience in driving IBM's ECM portfolio in the UK, he specialises in business growth and new market development. http://linkd.in/1BPHmgK
About Alan Pelz-Sharpe:
Alan is a renowned analyst whose career spans many decades and includes research roles in Information Management (ECM, RM, DAM, WCM) as well as, more recently, advising on future development, disruption and innovation in the world of business applications. Alan is also a Director on the AIIM (Association for Information and Image Management) Board. http://linkd.in/19zncOX

About Harvey Spencer
Since 1989, Harvey has been President and CEO of HSA, a New York-based company specializing in analyzing and following the worldwide capture software market and its vendors. Traditionally, this has covered paper, fax and PDFs, but it has expanded into voice records, photographs and video information. http://linkd.in/1x3qXX9

13 June 2017