Murdoc

Speech Recognition?


Hey,

 

The community seems to be made up of a lot of diverse people, so on the off chance someone knows anything about speech recognition, I figured I'd post some pretty basic questions.

 

I am not a programmer or much of a technical person and so far my knowledge on the subject is mostly based on this article: http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speech-recognition.htm

 

Any basic speech recognition software I've used, whether over the phone, on the Xbox, or wherever, has been unreliable and not that great. This is understandable given the incredibly complex nature of what it needs to understand and the computing power required. It's not my intention to get or have something made on that level, but since I know very little about this type of software, I'm curious whether there is a middle ground that could be useful for my needs.

 

Does anyone have any experience working with or programming this type of software? How plausible is it to make a very basic system that recognizes a small dictionary rather than trying to understand millions of complex words? Is there middleware for this sort of thing, or is it something where a specialist would be needed to program something bespoke?

 

 


I spent some time messing with the Microsoft .NET speech recognition libraries. They let you specify a dictionary of phrases and fire events when one is hit, or attempt to capture all words. I personally found them pretty inaccurate, but my use case was also pretty weird: I was trying to match audio being "spoken" by a text-to-speech app I wrote and was piping through the voice chat of a game.

There were a few other odd things about these libraries that I think would have made them hard to use for a game or in Unity, like the small Windows application that apparently had to be running constantly for the recognition to capture audio at all. I spoke to some developers at PAX East who were making a squad tactics game (the name of which escapes me) where you control your troops using spoken commands, and they seemed very happy with the MS tools, so maybe I just didn't spend enough time investigating. Docs: http://msdn.microsoft.com/en-us/library/system.speech.recognition(v=vs.110).aspx

I might be wrong, but I believe all this MS speech stuff was originally based on CMU Sphinx, an open source Carnegie Mellon speech recognition project. http://cmusphinx.sourceforge.net/
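To make the "dictionary of phrases that fires events" idea a bit more concrete, here is a rough Python sketch. To be clear, this is not the Microsoft API and it does no audio processing at all: it assumes some recognizer has already handed you a (possibly mangled) text guess, and it just picks the closest phrase from a small fixed list. The class name `CommandRecognizer`, the `hear` method, and the 0.75 confidence cutoff are all made up for illustration.

```python
import difflib


class CommandRecognizer:
    """Toy stand-in for a small-vocabulary recognizer (hypothetical, text only)."""

    def __init__(self, min_confidence=0.75):
        # Guesses scoring below this similarity are rejected instead of forced
        # onto the nearest phrase. The value is an arbitrary assumption.
        self.min_confidence = min_confidence
        self.handlers = {}

    def add_phrase(self, phrase, handler):
        """Register a phrase and the callback to fire when it is recognized."""
        self.handlers[phrase.lower()] = handler

    def hear(self, transcript):
        """Match a text guess against the phrase list.

        Returns (phrase, score) and fires the handler on a match,
        or (None, score) when nothing is close enough.
        """
        text = transcript.lower().strip()
        best_phrase, best_score = None, 0.0
        for phrase in self.handlers:
            score = difflib.SequenceMatcher(None, text, phrase).ratio()
            if score > best_score:
                best_phrase, best_score = phrase, score
        if best_phrase is None or best_score < self.min_confidence:
            return None, best_score
        self.handlers[best_phrase](best_phrase, best_score)
        return best_phrase, best_score


fired = []
recognizer = CommandRecognizer()
recognizer.add_phrase("open door", lambda phrase, score: fired.append(phrase))
recognizer.add_phrase("close door", lambda phrase, score: fired.append(phrase))

recognizer.hear("open dor")          # close enough to "open door", handler fires
recognizer.hear("what time is it")   # not in the dictionary, gets rejected
```

The point of the sketch is just that with only a handful of allowed phrases, a sloppy guess can still be snapped to the right command, and anything outside the list can be thrown away. Real engines do this matching on the audio side, not on text.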

 

Also, not totally related, but while I was looking into this stuff, I ended up reading this white paper from a Shazam developer that mostly went over my head, but was pretty interesting regardless. I imagine some of how their music-matching works would be very similar to matching speech. http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf

 

A couple more libraries I know of:

 

I haven't used it personally, but there's an Intel library on the Unity asset store called Intel Perceptual Computing that has some voice recognition that I believe also uses a dictionary of commands, and I think has some other funky stuff in it like motion controls/gesture recognition.

 

This is a JavaScript voice recognition library for controlling websites via voice. I haven't done any web development or HTML5 game dev that would call for it, but I skimmed the docs and it seemed to be very flexible and incredibly easy to use. https://www.talater.com/annyang/

 

Edit: So to more directly answer your question, while the basics of the tech definitely exist and are actually pretty easy to get started with, it seems like tweaking those tools to fit every situation (different mics, voices, accents/dialects/languages, etc.) is much harder. But I think as people get comfortable with tools like Siri and Kinect and expect more from them, development will accelerate.


Thanks for the links, Dinosaurssssssss.

Haven't had much time these days to delve into it all, but looking at the third-party options/extensions/SDK stuff, I think I have a better idea about the vocabulary I should be using when asking questions.

The MS site talks about creating grammars, which I guess are recognized phrases/words?

Are the libraries the dictionary of words that can be used?

I'm curious if it is at all possible to edit or make your own library, which I assume is the hard/impossible part.

I'll keep reading when I have more of a chance to.


Haha, it didn't even occur to me while writing that post, but I imagine my use of the word "library" is probably really confusing in this context. What I mean by library is really just a software package: a set of functions and objects that let you do a specific task, in this case voice recognition. Wikipedia definition here.

 

But yeah, a grammar is the set of recognizable words and phrases. In the MS suite I believe you just pass a list of strings in to serve as the grammar. 
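Here's a quick Python sketch of what "a list of strings as the grammar" could look like for something like a squad game. This is only an illustration of the idea, not the Microsoft API (as far as I know the .NET library has `Choices` and `GrammarBuilder` classes that play a similar role). The function name `build_grammar` and the example words are made up.

```python
from itertools import product


def build_grammar(*slots):
    """Expand slots (each a list of alternatives) into every full phrase."""
    return [" ".join(words) for words in product(*slots)]


# Hypothetical command vocabulary: who, does what, where.
units = ["alpha", "bravo"]
actions = ["move to", "fire on"]
targets = ["point one", "point two", "point three"]

grammar = build_grammar(units, actions, targets)

print(len(grammar))   # 2 * 2 * 3 = 12 phrases
print(grammar[0])     # alpha move to point one
```

So seven short word lists' worth of vocabulary gives you 12 complete commands, and the recognizer only ever has to decide between those 12 instead of the whole language. Editing the "dictionary" in this setup is just editing those lists.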

 

Also I remembered the name of the voice controlled squad tactics game I mentioned in my other post: There Came An Echo. Which I think is a really good title. Also here's a Polygon piece with some more info. http://www.polygon.com/2014/4/15/5606644/there-came-an-echo-preview-voice-controls-pax-east-2014
