Optimal Speech Recognition: Putting Your Words into Action


Key Takeaways

  • Despite the technology's considerable promise, users continue to view speech recognition skeptically, particularly when it comes to total hands-free dictation.
  • Making speech recognition useful as a personal productivity tool requires defining a personal strategy for where, when, and how to use it.
  • Increasing the speed and accuracy of speech recognition depends on optimizing supporting technologies, including CPU speed and microphone sound quality, as well as properly configuring your speech software — and your speech habits.
  • Speech recognition's benefits can be quickly realized by optimizing your balance between speech and the keyboard.

Speech recognition — we know it's out there, but what can it actually do for us? Vendors are integrating voice interaction features into mobile devices, and credible efforts are being made to perform real-time language translation. Desktop speech recognition has become more natural, more accurate, and less expensive than ever before. But what does this really mean in terms of productivity? Is speech recognition technology helping people do the things they need to do? Is it "there" yet?

Despite all the promise the technology shows, users seem reluctant to embrace speech recognition as a productivity tool. Speech recognition's most proven productivity boost at the PC level is found in text dictation: It can transform a hunt-and-peck keyboardist into a 100+ words-per-minute typist. However, this productivity boost often remains unrealized because users are intimidated by the voice commands entailed in starting the program, moving the cursor, correcting mistakes, and making revisions. At the core of user reluctance is the lack of a strategy for using speech recognition.

I have an advanced form of muscular dystrophy, and speech recognition empowers me to do things completely hands-free. My speech recognition strategy is quite easy to define: use speech to do as much as possible hands-free; invest deeply in learning to use the technology effectively; and adapt myself to the technology as necessary. Ironically, too many people believe that this strategy is the only strategy for using speech recognition. With a relatively small investment of time and effort, however, you can combine speech with keyboard input and reap many benefits. The key to making speech recognition a productive tool is to define your own personal strategy for where, when, and how to use it.

This tutorial focuses on optimizing speech recognition performance and identifying strategies for using Dragon NaturallySpeaking software. Dragon is an industry leader for desktop speech recognition and offers advanced options that let you develop a personal strategy for composing, correcting, and controlling computer input. Figures 1 and 2 show how the software's DragonBar appears when interfacing with Microsoft Word and Internet Explorer, respectively.

Figure 1. DragonBar above Microsoft Word window

Figure 2. DragonBar above Internet Explorer window

Optimize Speed and Accuracy

The first step in realizing speech recognition productivity is to optimize your approach to ensure maximum speed and accuracy. Too often, potential users give speech recognition a "test drive" without paying heed to CPU speed, microphone sound quality, speech software configuration, and proper speech habits. The following steps outline best practices for making the most of your initial speech recognition experience.

Optimize Hardware

With regard to Dragon NaturallySpeaking, and speech recognition software in general, faster processors produce faster performance. A 2.4 GHz Intel Pentium processor (or a 1.8 GHz dual-core processor) or an equivalent AMD processor is recommended. L2/L3 cache is fast temporary read/write memory located on the processor chip, and access to cache is far faster than access to RAM. Thus, 2 Mbytes or more of processor cache can significantly ease the heavy computational load required to accurately convert spoken utterances into digital text. For 32-bit and 64-bit Windows 7 systems, 4 Gbytes of RAM is recommended. You'll notice processor performance in terms of the time delay between speaking words and seeing the words typed on the screen. Minimizing this delay helps maintain your flow of thought while dictating text. Robust CPU power also lets you adjust software options that will increase recognition accuracy without compromising speed.

Video demo of Dragon NaturallySpeaking 11.5 and optimizing hardware for it (2:06 minutes):

Optimize Sound Quality

Speech recognition success is heavily dependent on sound quality, which is influenced by the intentional sound from the microphone and the unintentional sounds from your surrounding environment.

Microphone Input. It is imperative to choose a microphone that is specified for speech recognition and includes noise cancellation. Microphone sound quality is best with a wired USB headset, which keeps the microphone consistently positioned a few inches from your mouth. (Microphone position tends to vary dramatically with handheld or desktop microphones.) Wireless headsets can be quite effective, but performance varies depending on hardware compatibility. Even wireless headsets rated high for compatibility with Dragon still sacrifice some sound quality by virtue of their wireless transmission. In many cases this sacrifice is negligible, but a wired USB connection offers cost-effective, high-quality sound with minimal compatibility concerns.

Having a USB connection lets you avoid sound card compatibility issues and prevents hardware "noise" from diminishing microphone sound quality. Nuance's Hardware Compatibility List provides an accuracy scale for many common wired and wireless microphones.

Background Noise. Background noise in your work environment will also affect system performance in terms of processing speed and recognition accuracy. A binaural (two ear piece) noise-cancelling headset can produce surprisingly good results in noisy environments. However, noise-cancelling technology is designed to filter out lower frequencies associated with ambient noise. As a result, people speaking loudly in the immediate area will often diminish response time and accuracy while the system attempts to differentiate the speaker from this particular background noise, sometimes trying to transcribe both.

Optimize the Software: Training the Technology

The focus of this step is software optimization — that is, training the speech recognition software (in this case, Dragon NaturallySpeaking) to the individual user.

Clean Installation. To make the most of your initial experience, it is important to begin with a clean installation of the latest version of Dragon NaturallySpeaking. Before installing the latest version (currently 12), uninstall any other version you have used. Additionally, you should run a Dragon Remover utility to remove all traces of previous versions.

New Speech Profile. Once you have completed a clean installation of Dragon's latest version, you should create a new user profile. In the past, this was a somewhat involved process requiring users to train their voices by reading a lengthy script and repeating numerous commands. In recent versions the time and effort associated with voice training has been significantly reduced. Within 10 minutes, you can create a new user profile with amazingly accurate speech recognition results. Although Dragon does support importing, exporting, and upgrading of user profiles, creating a new profile after upgrading to the latest version is often the best practice for optimizing accuracy.

This is particularly important if you switch to a new headset microphone, which will produce different acoustic representations than those modeled in your existing user profile. As I describe later, your largest investment in a user profile is in creating, maintaining, and transferring your custom commands and custom words list.

Initial Speech Settings. Dragon NaturallySpeaking has many software options that can accommodate various performance strategies. Initially, it is best to choose settings that simplify your speech dictation experience and minimize the likelihood of unexpected results. The Dragon Options Guide describes the changes you should make to the Dragon default settings. Figure 3 shows Dragon's Options screen with the Commands tab chosen.

Figure 3. Dragon Options screen and Commands tab

Custom Words List. You can significantly improve Dragon recognition accuracy by importing a list of commonly used words and phrases to your user vocabulary. Although you can add uncommon names and acronyms to the vocabulary by correcting dictation errors as you go, it is best to create and import a list immediately after creating a new user profile. This list can also include phrases that can further increase recognition accuracy of street addresses, organization names, and frequently used reference codes and model numbers. The Dragon Custom Words Guide describes how to create a Dragon custom words list.
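
As a rough illustration of preparing such a list, the following Python sketch collects words and phrases, removes blanks and duplicates, and writes one entry per line to a plain-text file. The one-entry-per-line layout, the function names, and the sample entries are assumptions made for illustration; the Dragon Custom Words Guide remains the reference for the format Dragon actually expects.

```python
from pathlib import Path


def build_word_list(entries):
    """Return cleaned entries: trimmed, no blanks, no case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for entry in entries:
        text = " ".join(entry.split())  # trim and collapse inner whitespace
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


def write_word_list(path, entries):
    """Write one entry per line (an assumed layout for an importable list)."""
    lines = build_word_list(entries)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Hypothetical sample entries: names, acronyms, and phrases used in this article.
sample = ["Leijten", "TPSF", "University of Missouri", "tpsf", "  DragonBar ", ""]
print(build_word_list(sample))
```

Keeping the list in a plain-text file also makes it easy to carry your custom vocabulary forward when you create a new user profile.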

Optimize the User: Training the Person

This step is focused on tailoring individual users to the speech recognition software. As the name Dragon NaturallySpeaking denotes, the software is intended to mimic natural language as much as possible. Although having to change your behavior to accommodate recognition accuracy is a serious faux pas in the speech recognition industry, a willingness to adapt to the technology can significantly increase your productivity.

Speech Habits. Clarity of speech dictation need not be exaggerated with long pauses between words, but you should enunciate each word in a consistent manner. Dragon can even learn heavy accents if the words and phrases are pronounced consistently.

Speech recognition software is most likely to misrecognize single-syllable words, such as "to," "the," "on," and "up." Clear pronunciation is the only effective way to reduce such errors. You can increase speech accuracy and decrease the time required for corrections by practicing the following dictation habits:

  • Position the microphone a few inches to one side of your mouth.
  • Dictate punctuation (this will become natural with practice).
  • Speak clearly and evenly (without slurring words together).
  • Pause before and after voice commands (such as "Save File").

Microphone State and Volume. Practicing good speech recognition habits includes understanding some basic, yet important, Dragon features. First among these is the microphone state and volume indicators displayed on the left end of the DragonBar (see figure 4). The microphone can be in one of three states: off, on but asleep, or on and awake.

Figure 4. DragonBar indicators of microphone status

  • The microphone is off if the indicator is red and shows a horizontal microphone. Clicking the indicator icon, or pressing the numeric keypad's plus key, will toggle the microphone on.
  • The microphone is on but asleep if the indicator is yellow-orange and shows a diagonal microphone. You can activate it by speaking the "Wake Up" or "Listen to Me" commands. Clicking the indicator icon, or pressing the numeric keypad's plus key, will also wake up the microphone.
  • The microphone is on and awake — that is, ready for dictation — if the indicator is green and shows a vertical microphone. In this state, Dragon will try to translate everything you say into text or commands. Clicking the indicator icon, or pressing the numeric keypad's plus key, will toggle the microphone off.

When dictating commands or text, the volume display to the immediate right of the microphone state indicator should turn from yellow to green. If the volume display shows red, you are speaking too loudly or the microphone is too close to your mouth. When you stop dictating, the volume display should turn yellow. If it remains green, there might be too much background noise or a problem with your audio input.

Because Dragon easily mistakes other words or background noise for the "Wake Up" command, and then begins typing text, it is best to turn the microphone off with the numeric keypad's plus key when it is not in use, rather than leaving it asleep.
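
The microphone states described above can be pictured as a small state machine. The Python sketch below is only a model of the behavior described in this section, not Dragon's implementation; the names are hypothetical, and it assumes that pressing the plus key while the microphone is off puts it directly into the awake state.

```python
from enum import Enum


class MicState(Enum):
    OFF = "red, horizontal microphone"
    ASLEEP = "yellow-orange, diagonal microphone"
    AWAKE = "green, vertical microphone"


WAKE_PHRASES = {"wake up", "listen to me"}


def press_plus_key(state):
    """Numeric keypad plus key: awake -> off; off or asleep -> awake.

    Assumption: turning the microphone on from OFF lands in AWAKE.
    """
    return MicState.OFF if state is MicState.AWAKE else MicState.AWAKE


def hear(state, sound):
    """Only a sleeping microphone reacts to a wake phrase."""
    if state is MicState.ASLEEP and sound.strip().lower() in WAKE_PHRASES:
        return MicState.AWAKE
    return state


# A sleeping microphone can be woken by anything that sounds like a wake phrase;
# a microphone that is off cannot.
print(hear(MicState.ASLEEP, "Wake Up").name)
print(hear(MicState.OFF, "Wake Up").name)
```

In this model, nothing that is heard can move the microphone out of the off state, which is why switching it off is the safer choice when you step away.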

Full Text Control. Dictating text involves Dragon transcribing your words and typing them within a document or text field. The target application must be active, with the cursor placed where you want text to appear. If the text control indicator (located to the immediate right of the microphone volume display) shows a solid green checkmark, then Dragon dictation and selection commands are fully supported. If the text control indicator shows a gray checkmark, Dragon dictation might be unreliable and editing commands limited. In such cases, it is best to rely on the keyboard until full text control is available.

Results Box. The results box is a small window that displays words and phrases that are being recognized from your dictation (see figure 5). Words spoken continuously (without a pause) are often referred to as an "utterance." An utterance might be a single word, or it can be multiple sentences or even paragraphs. It is typically a good practice to speak more than four words but fewer than two sentences per utterance.

Dragon is more accurate when it can transcribe words within the context of other words. An utterance ends when you pause in your dictation. If Dragon does not recognize your utterance as a command, it will be transcribed into text. By default, transcribed text is not shown in the results box to avoid distracting users. Dragon commands must be spoken as a single utterance — that is, with a pause before and after the word or phrase. When Dragon recognizes your utterance as a command, it will execute the command and display the text of the command in the results box. As figure 5 shows, the results box also indicates the speech status.

Figure 5. Dragon Results Box

Recognition Corrections. Each time you correct misrecognized words, Dragon collects acoustic and language data to improve future recognition accuracy. It's therefore important to make such corrections using the Dragon Spelling Window (see figure 6), as described in the Dragon Keyboard Correction Guide. Making speech recognition corrections will significantly improve accuracy over time. However, if you simply select and type or re-dictate the misrecognized text, Dragon treats the correction as a discretionary revision and will likely misrecognize the same words again in the future. The "Acoustic & Language Model Optimizer" section in the Dragon Options Guide discusses the Dragon utility that updates the user profile based on the speech data collected.

Figure 6. Dragon Spelling Window

Recognition Modes. Dragon's recognition modes indicator is on the DragonBar to the right of the full text control indicator. As figure 7 shows, Dragon offers multiple recognition modes for distinguishing between text dictation, voice commands, spelling, and number dictation.

Figure 7. Recognition modes indicator

  • Normal mode. A dot within parentheses indicates that Dragon is in Normal mode, which permits text dictation and voice commands interchangeably. This is Dragon's default mode, and you use it for most purposes. You must pause before and after uttering a command; otherwise, your speech will be interpreted as text dictation.
  • Command mode. Parentheses without a dot indicate that Dragon is in Command mode. Here, Dragon interprets all speech as commands. Response time is much faster to commands issued from the Command mode because Dragon does not have to interpret dictated text.
  • Dictation mode. A dot without parentheses indicates that Dragon is in Dictation mode. In this mode, Dragon interprets all speech as dictated text. The only commands recognized in this mode are "dictation commands," which are used for formatting text as you dictate. You do not have to pause before and after using dictation commands. Dictation mode also lets you dictate sentences that contain words Dragon would normally interpret as commands (such as, "Select one from the top shelf").
  • Spell mode. When Spell mode is on, Dragon limits its vocabulary to letters, digits, and symbols. This mode offers quick and accurate results for dictating detailed information, such as URL addresses. You cannot dictate words in Spell mode.
  • Numbers mode. Dragon's Numbers mode offers quick and accurate results for dictating numerical values within applications such as spreadsheets.

Modes can be toggled on and off by dictating the mode name followed by "on" or "off." For example, saying "Spell mode on" will activate Spell mode. Saying "Spell mode off" will return Dragon to Normal mode.
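
To make the differences between the modes concrete, here is a simplified Python model of how a single utterance might be handled. It is a sketch under stated assumptions, not a description of Dragon's internals: the function names and the sample command set are hypothetical, unrecognized speech in Command mode is assumed to be ignored, dictation commands are left out, and saying a mode's name with "off" is assumed to have an effect only when that mode is active.

```python
MODES = {"normal", "command", "dictation", "spell", "numbers"}


def switch_mode(current, utterance):
    """Handle '<name> mode on' and '<name> mode off'; otherwise keep the mode."""
    words = utterance.lower().split()
    if len(words) == 3 and words[0] in MODES and words[1] == "mode":
        if words[2] == "on":
            return words[0]
        if words[2] == "off" and words[0] == current:
            return "normal"  # turning the active mode off returns to Normal mode
    return current


def interpret(utterance, mode, known_commands):
    """Classify one complete utterance as 'command', 'text', or 'ignored'."""
    is_command = utterance.strip().lower() in known_commands
    if mode == "normal":
        return "command" if is_command else "text"
    if mode == "command":
        return "command" if is_command else "ignored"
    if mode == "dictation":
        return "text"  # everything is transcribed, even command-like phrases
    raise ValueError(f"mode not modeled in this sketch: {mode}")


commands = {"save file"}  # hypothetical sample command set
print(interpret("Save File", "normal", commands))
print(interpret("Save File", "dictation", commands))
print(switch_mode("spell", "Spell mode off"))
```

The model treats each call as one utterance bounded by pauses, which mirrors the requirement to pause before and after a command in Normal mode.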

Optimize Your Dictating Strategy

A keystone of speech recognition productivity is developing a personal strategy that defines where, when, and how to use speech. Formulating this strategy depends heavily on how you expect to benefit from speech recognition. Do you want to increase typing proficiency? Compensate for writing deficiencies? Relieve pain caused by repetitive stress or posture-related ailments? Streamline repetitive tasks through verbal commands? Completely replace the keyboard and mouse with hands-free control and dictation? Answering such questions will help you know when to use speech or keyboard/mouse input, what speech command subsets to learn, and how to optimize text production and revision.

Optimize Speech versus Keyboard

The least invasive way to introduce speech into your computing habits is to focus on composing text. Speech recognition's most significant productivity boost at the PC level is found in text dictation. This benefit can be quickly realized without investing time and energy in learning other speech commands. The Dragon Keyboard Correction Guide offers a detailed description of speech dictation using the keyboard for correcting recognition errors. It is important to follow the guide's procedure so that Dragon can learn from your speech and continually improve accuracy.

As you become more comfortable with speech recognition, you can begin using dictation commands (see table 1) while composing text. These commands control capitalization and spacing, and can be issued (without pausing) as you compose text. For example, saying, "all caps on optimize your dictating strategy all caps off," will produce OPTIMIZE YOUR DICTATING STRATEGY. You can also use these commands to specify the use of single digits or Roman numerals.
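
The effect of an inline dictation command can be mimicked in a few lines of Python. This is an illustrative sketch that handles only the "all caps on" and "all caps off" pair from the example above; the function name is hypothetical, and Dragon's own processing is certainly more involved.

```python
def apply_caps_commands(utterance):
    """Apply inline 'all caps on' / 'all caps off' commands to dictated words."""
    words = utterance.split()
    output = []
    caps = False
    i = 0
    while i < len(words):
        trio = [w.lower() for w in words[i:i + 3]]
        if trio == ["all", "caps", "on"]:
            caps = True
            i += 3
        elif trio == ["all", "caps", "off"]:
            caps = False
            i += 3
        else:
            output.append(words[i].upper() if caps else words[i])
            i += 1
    return " ".join(output)


print(apply_caps_commands(
    "all caps on optimize your dictating strategy all caps off"
))
# prints: OPTIMIZE YOUR DICTATING STRATEGY
```

Because the commands are consumed from the word stream itself, no pause is needed around them, which matches how dictation commands behave.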

Table 1. Dictation Commands

You can gain further independence from the keyboard and mouse by learning to use Dragon direct editing commands to format blocks of text in a single step (see table 2). These commands must be issued as separate utterances (that is, pausing before and after the command). For example, saying "italicize you can gain through period," would italicize the first sentence of this paragraph.
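
A command such as "italicize you can gain through period" names the start and the end of the text to act on. The Python sketch below shows one plausible way to resolve such a range; it is not how Dragon is known to work. It assumes that the first occurrence of the start phrase is used, that the range ends at the next occurrence of the end phrase, and that spoken punctuation names map to symbols through a small, hypothetical lookup table.

```python
SPOKEN_PUNCTUATION = {"period": ".", "comma": ",", "question mark": "?"}


def select_through(text, start_phrase, end_phrase):
    """Return the text from start_phrase through end_phrase, or None if absent."""
    end_phrase = SPOKEN_PUNCTUATION.get(end_phrase.lower(), end_phrase)
    lowered = text.lower()
    start = lowered.find(start_phrase.lower())
    if start == -1:
        return None
    end = lowered.find(end_phrase.lower(), start + len(start_phrase))
    if end == -1:
        return None
    return text[start:end + len(end_phrase)]


paragraph = (
    "You can gain further independence from the keyboard. "
    "These commands must be issued as separate utterances."
)
print(select_through(paragraph, "you can gain", "period"))
# prints: You can gain further independence from the keyboard.
```

Choosing start and end phrases that occur only once in the visible text keeps the selection unambiguous.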

Table 2. Direct Editing Commands

Optimize Knowledge of Commands

If you want to further reduce keyboard and mouse dependency, you should learn voice commands associated with software applications you frequently use. Dragon provides thousands of natural language commands for controlling the Windows operating system and many popular programs. As the Nuance documentation touts, "Because the number of valid commands is so large, you should just try saying what you want to do. If you get unexpected results, say 'Undo That' to undo the action and try a different way to speak the command." There is some merit in this approach, but many users find this a bit too reckless and prefer a more intentional method of learning new voice commands.

Active Accessibility Elements. When using Microsoft Office products, Dragon commands are associated with Active Accessibility element names (see figure 8). Placing the mouse pointer on any icon displayed on a command ribbon shows its element name, keyboard shortcut, and description. Typically, the element name is the speech command that you can use to select that element. For example, a tooltip window will be displayed showing the element name "text highlight color" when you position the mouse pointer on the corresponding command ribbon icon. Thus, saying "text highlight color" as a speech command will change any selected text to the active highlight color. This applies to any active ribbon. So, if you can see the element's name or icon, you can say the name as a Dragon command (that is, if you can see it, you can say it). Saying a ribbon's title will activate that ribbon and make all of its element names available as Dragon commands.

Figure 8. Active Accessibility Elements

This demo explains Dragon Active Accessibility commands in Microsoft Office (3:21 minutes):

Command Browser Subsets. Displaying the Dragon Command Browser shows lists of commands associated with specific software applications (see figure 9). To display it, select "Command Browser" from the DragonBar Tools menu or simply say, "Open Command Browser."

Figure 9. Dragon Command Browser

The list of global commands should appear when the Command Browser first opens. Choosing a specific application or context from the Context drop-down menu shows application-specific commands. Some programs have command subsets for different contexts within the program (such as the e-mail inbox window, an individual e-mail window, or various dialogue windows).

Dragon Sidebar Lists. For another aid in learning commands, choose "Dragon Sidebar" from the DragonBar Help menu or say, "Show Dragon Sidebar." Doing this displays a list of commands that dynamically changes to correspond with the active application window (see figure 10). If there are no commands specific to the active software application, the sidebar will display a list of global commands.

Figure 10. Dragon Sidebar with Microsoft Word

Optimize Text Production

Some writers follow a linear writing process (plan, compose, revise), while others follow a more recursive process, going through several cycles of planning, composing, and revising. Thus, different writing styles necessitate different strategies for optimal text production with speech.

To optimize text production, you must first make the distinction between recognition corrections and content revisions. Recognition corrections are needed when the spoken utterance does not match the text produced so far (TPSF). For example, if the user says "I asked the concierge where I could find a decent meal" and the TPSF shows "I asked to comb my hair where I could find a decadent meal," there is an obvious need for recognition corrections. However, if you deleted the text, "find a decent meal," and dictated instead, "find a vegetarian sandwich," you would be performing a content revision.
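
The distinction can be illustrated with a word-level comparison of what was said against the TPSF. The following Python sketch uses the standard library's difflib module to list the spans where the two differ; the function name is hypothetical, and the approach assumes you know exactly what was spoken (for example, from memory or from audio playback).

```python
from difflib import SequenceMatcher


def recognition_errors(spoken, tpsf):
    """Return (spoken words, transcribed words) pairs wherever the two differ."""
    said = spoken.split()
    typed = tpsf.split()
    matcher = SequenceMatcher(a=said, b=typed, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors.append((" ".join(said[i1:i2]), " ".join(typed[j1:j2])))
    return errors


spoken = "I asked the concierge where I could find a decent meal"
tpsf = "I asked to comb my hair where I could find a decadent meal"
for said, typed in recognition_errors(spoken, tpsf):
    print(f"said {said!r}, got {typed!r}")
# said 'the concierge', got 'to comb my hair'
# said 'decent', got 'decadent'
```

Every pair reported here is a recognition correction, because the spoken words and the text disagree. A content revision would not appear in such a comparison at all, since the text matches what was said and the writer simply chooses different words.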

Common Tendencies. In her study with Daniel Janssen and Luuk van Waes, "Error Correction Strategies of Professional Speech Recognition Users: Three Profiles," Mariëlle Leijten focused on variations in timing and frequency as 10 expert users made speech recognition corrections and revisions.[1] Leijten's study observed several telling and common tendencies among participants with regard to error correction.

  • Using the keyboard to make corrections. Despite being considered expert speech recognition users, study participants showed a strong preference for using the keyboard and mouse to make both recognition corrections and content revisions. This combined speech-keyboard approach lets users exploit many of the benefits of speech without sacrificing the keyboard's familiarity and dependability.
  • Delaying content revisions. Study participants often delayed content revisions, typically until the second draft (during the rereading phase of the writing task). Dragon's text selection commands can be very useful to writers who want to seamlessly revise content using speech rather than the keyboard.
  • Correcting at the end of the utterance. Recognition errors detected in the TPSF that occur close to the end of the utterance were more likely to be corrected immediately. Error detection often leads users to interrupt the text production process. To maintain a flow of thought, some writers must resist this tendency and habituate themselves to delaying recognition corrections.
  • Correcting one error leads to another. Another tendency observed among all study participants was that one correction led to another correction. When users interrupted the process of text production to make a specific recognition correction, they were likely to make another detected correction and/or content revision right away.

Error correction strategies. Given the common tendencies surrounding text dictation, it is important to be intentional with regard to forming and practicing a text production strategy. Three writing profiles defined by Leijten's study and outlined below offer a basic framework for exploring which strategy is right for you.

  • Handle immediately. Writers following the handle immediately strategy pay close attention to the TPSF and make both recognition corrections and content revisions at the point of utterance during the first draft. This leads to a highly recursive writing process involving numerous interruptions, but results in a sculpted first draft.
  • Postpone revisions. Writers following the postpone revisions strategy make more than half of the recognition corrections to the TPSF immediately, but delay content revisions until the second draft. This leads to a slightly more linear writing process than the handle immediately strategy, but still involves numerous interruptions for recognition corrections.
  • Postpone corrections and revisions. Writers following the postpone corrections and revisions strategy pay minimal attention to the TPSF, focusing on the main idea to keep the text flowing. Content revisions are postponed to the second draft, and many recognition errors are deliberately left in the text to be corrected later. If you find this approach beneficial, you can set the "Save recorded dictation with document" option to "Always" or "Ask Me" to permit postponed audio playback of dictation (see the Dragon Options Guide).

Figure 11 shows the Dragon Extras toolbar and playback feature. Figure 12 shows how to save Dragon recorded audio (DRA) files. The save prompt appears when you close a document containing dictated text. If you want to postpone corrections, you must answer "Yes" to save the audio data needed for the Playback feature. When you later open your document, you can compare the displayed text to your voice audio and correct it as needed. You can access Playback from the Spelling Window or the Dragon Extras toolbar.

Figure 11. Dragon Extras toolbar and Playback

Figure 12. Saving Dragon recorded audio files

Conclusion

In a survey of 49 IT professionals working for the University of Missouri, 61.2 percent (30 respondents) said they had tried desktop speech recognition, most using Dragon NaturallySpeaking version 10 or older. Of those 30 respondents, only four currently use desktop speech recognition. Why so few? Many of the anecdotal comments focused on speech recognition being an annoyance to others within cube-office environments and/or a lack of privacy in the office when dictating. This is a formidable obstacle that could prevent the use of speech recognition in some situations. However, people make private phone calls within similar environments, and using a soft-spoken tone of voice is equally effective for speech dictation. Thirteen survey respondents expressed interest in trying again and exploring how speech recognition could make them more productive.

Take the Survey

Have you tried speech recognition? I would appreciate your feedback. Please complete the very short survey (fewer than 10 questions) at the following link:

https://www.surveymonkey.com/s/TTDMBX3

The data collected will guide efforts in finding innovative ways to make speech recognition practical and productive. Results will be shared in a later briefing.

Why not just use Windows Speech Recognition (WSR), which is built in to Windows Vista and 7 operating systems? For some people, WSR provides an excellent, cost-effective option. Indeed, when speaking of WSR, a colleague of mine said, "I find it very convenient to browse the web while I am eating or doing anything else that takes both hands — it works amazingly well." WSR accuracy can be quite good (presumably good enough to overcome speaking with your mouth full), but it does not offer keyboard correction and lacks many of the advanced features needed to optimize your personal speech recognition experience. See the Dragon Internet Surfing Guide for using Dragon NaturallySpeaking with Internet Explorer.

The ultimate question is whether buying and learning to use Dragon is worth your time and effort. Will the spoils outweigh the cost of the hunt? Many professionals seek ways to cope with increasing demands to compose large amounts of computerized text. According to the Physicians Practice 2011 Technology Survey, 23.4 percent of 918 respondents said they use voice recognition technology to address the growing costs and demands associated with electronic health records.[2] Other professionals turn to speech recognition due to computing-related repetitive stress and posture ailments. These individuals typically find increased productivity in addition to physical relief. For me, personally, there is no questioning the benefits of speech recognition. Indeed, the text and graphics for this article were produced completely hands-free. My advice for those considering speech recognition? "Be proactive." Define a speech recognition strategy that gives you a few quick wins, and keep your headset beside your keyboard, not in the drawer.

Notes
  1. Mariëlle Leijten, Daniel Janssen, and Luuk van Waes, "Error Correction Strategies of Professional Speech Recognition Users: Three Profiles," Computers in Human Behavior, vol. 26, no. 5 (2010): 964–975.
  2. Marisa Torrieri, "Talk vs. Type: Taking Another Look at Voice Recognition," Physicians Practice, vol. 21, no. 7 (2011).