Part One: The Importance of Audio Capture for Software Tech Vendors

In this first article of a three-part series, Boss Equity senior executives identify the next significant opportunity for Software Tech vendors, outlining some of the key developments and drivers that are creating a growing demand for Voice Recognition Technologies.

When most people think about voice technology, they tend to think about voice activated apps on smart phones & tablets that only work optimally approximately 50% of the time. Personal experiences of voice recognition on such devices have probably influenced the attitudes of Software Tech business executives, as many have, so far, failed to grasp the depth of the business opportunity.

Audio & Video to Go Mainstream Content Management  

Despite these experiences, we anticipate that both audio and video will be mainstream within the next 3 years, meaning that they will be a predominant feature in the planning, development and implementation of any Enterprise Content Management system.  In the past, many solution and development companies have been slow to respond to the market, allowing their more flexible competitors to gain the upper hand in delivering innovation and thus, racing ahead of the pack in the learning curve.  To make a sporting analogy - it’s no good arriving ready with your cricket bat after the bowler has bowled.

Content Management Has Gone Multi-channel & Multi-source

The management of content is now multi-channel and multi-source. Audio and video are two additional sources of content and data that need to be managed with all the same objectives as managing a simple Word document. Voice-activated technology is being used in the automobile and electronics industries to activate electronic devices. It is also increasingly used in call centres with IVR (interactive voice response) to automate requests to access information, minimise human intervention and improve customer service. However, these applications only skim the surface of voice technology. They are of less interest for information management and are therefore not the specific focus of this article.

For Software Tech vendors, audio and video content is set to become business-critical, with contenders likely to compete on it for greater differentiation and increased value against the competition. In Part One of this three-part series, we focus on the data capture implications of voice recognition technologies to highlight the increasing importance of audio and video files for Software Tech vendors and service providers.

 

Enormous Potential for Audio & Video Information Management

Any change in the market creates a threat but also a huge opportunity for those who grasp these changes and market drivers and adapt their business accordingly to meet the demand. The ongoing increases in audio and video files are no exception.  In a previous article, “2015 - The Next Big Wave”, we reported that Gartner had forecast that by 2017, 50% of all business content held by companies would be audio and video files. We knew that audio and video files were increasing but that massive leap surprised us.  As with all analyst predictions, one tends to take them with a large pinch of salt. However, the reality is that the increase is enormous and, even if it proves to be only 25%, that is still a highly significant percentage of business content.

The explosion of audio and video files is due, in part, to the prevalence of Social Media platforms which present and share such content, as well as the use of both audio and video in various forms of marketing, alongside images and/or text. Such Social Media communications are increasingly becoming an essential part of information flow and communication sharing, both inside and outside the enterprise. In addition to this, there is a much greater regulatory and compliance driver that is forcing change.  This became critical following the financial crisis of 2008. What resulted from the various government agency enquiries was the Dodd-Frank Wall Street Reform and Consumer Protection Act.  This covers thousands of pages and introduces many provisions to create stability and exert greater control, in the hope of preventing anything similar happening again.

The Dodd-Frank Reform Act

Apart from a whole host of provisions, the Dodd-Frank reform introduces record-keeping and disclosure requirements which mean that, under section 1.31 of the Commodity Exchange Act, all electronic records need to be kept "exclusively in a non-rewritable, non-erasable format".  This includes all communications in all formats - including telephone calls. 

Even outside of the Dodd-Frank reform, storing and retrieving audio is critical for many organisations, including, but not exclusive to, insurance, telecommunications, government and finance.  The ramifications of such record-keeping incorporate legal, security, compliance and forensic investigations.

Where litigation is concerned, there have been several examples where the failure to capture key telephone calls or meeting conversations has cost big corporations sizeable sums of money in fines.  One such scenario arose in the case of Guy Hands and his battle with Citigroup over the purchase of EMI.  Hands claimed that Citigroup had acted unfairly and that there were several telephone conversations in which he was given incorrect information.  His inability to retrieve three alleged telephone calls cost his company billions.  The fact is that, had a recording been attached to the client file, the case might never have gone to court. Consequently, capturing such verbal communication is no longer seen as “Gee whizz!” technology; it is far more about risk mitigation and compliance.  And, as demonstrated above, the business savings of getting this right can be substantial.

If audio and video are captured in vast quantities in the future, they will also become a source of valuable data and knowledge for the benefit of the enterprise. The requirements then, are for efficient, accurate, intelligent multi-channel capture, analysis, storage and, crucially, retrieval.

 

Is Voice Recognition & Translation an Issue for Now or the Future?

As with all advancements in technology, the issues will boil down to just a few important elements. Are there sufficient market drivers? In this instance, just looking at the legal and compliance issues, we can answer an unequivocal “Yes!” to that. After this come the cost and practicality of the technology.  There are numerous players in this area and there has been a growing number of acquisitions of voice recognition companies. Large sums of money have already changed hands and we are a long way from the end of the acquisition trail for voice recognition technologies.

Over the next several years, there will be a growing wave of practical solutions and systems that will be endorsed by end users. No single company has cracked the magic formula that will open up the market for them and afford them dominance. Yet.  The likelihood is that, in the initial phase of uptake, a number of players will dominate in their various vertical sectors.

 

The Missing Piece of the Information Management Ecosphere

In the Information Management ecosphere, the one missing element of content to be properly captured, processed, analysed, accessed and stored, has always been voice and audio communications.  Converting speech to text and processing this with the same ease and accuracy as with documents and other content is likely to become the new frontier for the sector.  There are a number of challenges the Software Tech community needs to understand. The first is the issue of speed and processing power.  If you have 50 traders, each creating hours of audio files each working day, you need the speed and processing power to manage this amount of processing and translating to text. By doing this, companies can be certain to respond rapidly to problems as they occur as well as having the ability to stop them continuing.  If, for example, they have a rogue trader, they can take decisive action before the situation continues for too long and causes a far greater problem.

 

Managing the Velocity of Audio & Video Information

The second point concerns the speed advantage of having audio files translated to text.  If you have 40 hours of audio being recorded daily and you have two people (sometimes highly paid lawyers) checking the audio recordings, they will simply be listening in real time, which will take them a minimum of 20 hours each.  In fact, with going back over parts of recordings and stopping for breaks, it will most likely take them considerably longer.  Therefore, they will only ever be able to spot-check and, being realistic, if they are doing this in real time using audio files, they would be doing exceptionally well to spot-check even 20%. By contrast, if you translate to text and undertake an intelligent search of the translation, you are able to check 100% of the audio recordings. Even though speech to text is only 70% accurate, this is still a far better solution than spot-checking 20%: even if that 20% is checked with 100% accuracy, it is still not as good as checking 100% at 70% accuracy.
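
The comparison above can be made concrete with a rough calculation. The sketch below is illustrative only. It assumes, as a simplification not stated in this article, that an incident is equally likely to occur anywhere in the recordings and that the accuracy figure translates directly into the chance that an incident in reviewed material is found. The function name and its parameters are hypothetical.

```python
def expected_detection_rate(coverage: float, accuracy: float) -> float:
    """Share of incidents expected to be caught.

    coverage: fraction of all recordings that are reviewed (0.0 to 1.0)
    accuracy: chance that an incident in reviewed material is recognised
    Assumption: incidents are spread evenly across the recordings.
    """
    if not (0.0 <= coverage <= 1.0 and 0.0 <= accuracy <= 1.0):
        raise ValueError("coverage and accuracy must be between 0 and 1")
    return coverage * accuracy


# Figures taken from the paragraph above.
manual_spot_check = expected_detection_rate(coverage=0.20, accuracy=1.00)
full_text_search = expected_detection_rate(coverage=1.00, accuracy=0.70)

print(f"Manual spot check: {manual_spot_check:.0%}")
print(f"Full text search:  {full_text_search:.0%}")
```

Under these simplifying assumptions, a perfect manual review of 20% of the audio would be expected to catch about 20% of incidents, against about 70% for a 70%-accurate search across all of it.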

What is clear, is that audio and video will be stored as business content, like any other information or document. Software Tech players need to look carefully at this growing sub-sector of content management. Currently, none of them seems to have taken a lead in this area and, with margins for current software solutions becoming keener, fewer big deals being won and the differentiators between competitors becoming finer, then it seems highly possible that audio and video could be the final battleground to win the war.

It is no longer a case of IF Software Tech vendors should embrace this new technology, but rather, WHEN?

 

Input and Thoughts from Harvey Spencer Associates

Boss Equity explores some of the issues around audio and video content capture with Harvey Spencer, a leading industry expert in Capture technologies, asking for his views on where this may all be heading. As part of his firm's ongoing research and analysis, Harvey terms this new area “Capture 2.0”: systems that represent the new face of capture, leveraging modular cloud technologies to provide real time understanding of unstructured and semi-structured information, including documents, voice, photographic and video.

 

Q.1: Gartner estimated that by 2017 audio and video files will be 50% of the content held by companies.  Would you agree with this statement?  If not, what is your estimate?

I do agree with Gartner’s estimate and think it will go much higher, but the issue is how much of that is relevant to transactions or records management and compliance. 

Much of this is generated through social media – I use Apple Quick Time for example for video talks, replacing a conventional telephone call.  Then, if you look at YouTube, for example – According to latest statistics (jeffbullas.com)

  • 60 hours of video are uploaded every minute, or one hour of video is uploaded to YouTube every second.
  • Over 4 billion videos are viewed a day
  • Over 800 million unique users visit YouTube each month
  • Over 3 billion hours of video are watched each month on YouTube
  • More video is uploaded to YouTube in one month than the 3 major US networks created in 60 years
  • 70% of YouTube traffic comes from outside the US
  • YouTube is localized in 39 countries and across 54 languages
  • In 2011, YouTube had more than 1 trillion views
  • In 2011 there were almost 140 views for every person on Earth

 

Of course, much of this is irrelevant but increasingly, business is using YouTube for advertising and training. People are being encouraged to video accidents and video feeds are used for law enforcement.  This then, starts to cross over into a need to manage the records, to retrieve information and to integrate.

From our needs perspective, I think Video consists of three elements: Photographs, Voice and Movement.  Each contains information.  But this article focuses on the voice element which I don’t think we have focused on enough.
 

Q.2: What have the main barriers been, thus far, for the wider adoption of voice recognition for the Enterprise Applications markets in both the commercial and public sectors?

Accuracy, probably.  I tend to think of voice recognition at about the same stage as OCR was about 10 years ago and, in some ways, analogous to the challenges associated with cursive handwriting recognition.  We were all frustrated with accuracy levels which frequently did not live up to the claims of the manufacturers.  The reasons behind this were many and varied, but slowly, more technology, combined with more processor power, improved the accuracy levels to something that is now key and highly valuable in the capture space. Given the massive amount of data that needs to be recorded and converted to text, processing power and speed will become increasingly critical for voice recognition.

Voice is likewise frustrating, but it is improving very quickly as more history is included and as multiple algorithms are embedded. This is similar to the way that voting between different technologies was a key to improving OCR, and the way that sentence understanding is improving handwriting recognition.

A further key factor, of course, is sentiment analysis. Especially when coupled with social media look ups, this can be extremely important in ECM systems when trying to analyse what a person is actually discussing over the phone. This is further complicated as it is language dependent and the use of figures of speech, sarcasm or innuendo make the contextual meaning of the conversation difficult to analyse.

Q.3: Which particular vertical sectors are using audio or video content management now and which vertical sectors will be next to convert? 

I think we are talking about two different issues here.  Audio can be used a great deal more than at present. We are seeing automated language translation start to appear with Skype from Microsoft, currently just English/Spanish and Spanish/English.  But this sort of language independence has HUGE implications.  A small Canadian company I spoke with recently has developed software that analyses one's accent and intonation and then corrects the pronunciation; we can all relate to the need for that in interacting with some distant call centre staff.

We are seeing this voice understanding being used for training - for example, in hospitals which often have a number of foreign born nurses and aides.  In an operating theatre they need to be able to immediately locate and hand the relevant implement to the requesting surgeon as he or she cannot spend time explaining. 

For the Insurance market we have started to follow sentiment analysis and voice based stress analytics software – where, in the case of accidents, this can be used to identify whether emergency help needs to be summoned.

As mentioned above, law enforcement relies on video feeds. The UK population is now said to be more closely watched by CCTV than the citizens of any other country.  Most footage is ignored until something happens, such as the riots a few years ago, and then video feeds are scanned.  The military relies on understanding video for intelligence.  Likewise, security cameras are key and when there is an insurance claim, video feeds become extremely important.

In fact, I believe that the Insurance market can use video understanding more effectively in claims adjudication.  The call centre world is using voice, but it is logical that they will move to video as it provides more information.

Increasingly, Banking is becoming more remote, with video tellers and loan officers interacting with customers. This begins to force video down the compliance avenue.

 

Q.4:  What are the key implications of the increase in audio and video content for the traditional ECM vendors and where is this heading?

The main ECM markets are in Financial Services, Insurance, Healthcare and Government.  All these are mentioned above.  Video and voice contain content and content must be managed.  Traditional ECM vendors cannot ignore this new media.

 

Contact Details

Harvey Spencer is the owner of Harvey Spencer Associates www.hsassocs.com
Twitter: HSAssocs
Email: harvey@hsassocs.com

HSA is dedicated to helping the industry evolve into this new area.  For more information inquiries@hsassocs.com

Capture 2.0 systems represent the new face of capture, leveraging modular cloud technologies to provide real time understanding of unstructured and semi-structured information including documents, voice, photographic and video.                                            

Mark Edwards CEO of Document Boss

John Symon, SVP, International Markets

 

 

 

12 June 2017