Multi-modal Web browser project
Part of a traineeship in Risø National Laboratory
Alexandre Alapetite, 06/2003 - 09/2003


Most of the time, Web browsers only provide mouse and keyboard input.
These solutions are not very efficient for people with a handicap such as arm paralysis, so it could be interesting to provide other inputs such as voice and/or eye tracking. These extra inputs could be useful even for able-bodied users, in situations where it is not convenient to use a mouse or keyboard, such as when cooking and following a recipe on the Internet.






This project is a secondary and more personal work, carried out during my traineeship at Risø.
It started with a study of the Internet standards that can be used (XHTML, CSS, etc.) and of ergonomic considerations, such as the list of spoken commands that are useful.
In addition to the very common mouse and keyboard, different input systems can be used to browse the Internet; voice and eye tracking have been studied here.


Voice input

A voice input system allows the user to say a limited number of standard commands such as "next, previous, first, last, home, up, find, help, etc." to activate the corresponding page, if available, or to perform more classic actions such as scrolling the page, activating links, navigating to the previously visited page, etc.

The system developed here is based on a voice recognition program (using Microsoft Speech SDK 5.1) that sends keyboard events to a plug-in (integrated into the Mozilla Web browser), which performs the corresponding actions.

Architecture of the system managing voice input
Voice recognition system architecture: the voice recognition grammar feeds the voice recognition program, which controls the XUL plug-in.

Standardised navigation in HTML Web pages

For the standardised navigation feature to be used, Web pages have to respect some standard accessibility recommendations. In particular, pages have to use the LINK tag to specify standard links such as "next, previous, first, etc.".
This tag has been part of the HTML specification since HTML 2.0, is still present in the current XHTML 1.1, and its use is strongly encouraged. This subject is developed in my documentation about META information in HTML Web pages.


<LINK rel="start" href="../../index.en.html" />

Currently, some browsers (Mozilla, Opera, etc.) support this kind of standard link, which can be activated through a navigation bar.
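As a sketch, looking such standard links up on the browser side could work as follows. The helper name and the sample hrefs (apart from the one above) are hypothetical; the real plug-in would query Mozilla's DOM rather than a plain array:

```javascript
// Sketch: resolve a standard navigation relation ("next", "prev", ...)
// from the LINK elements of a page. The "links" array stands in for
// document.getElementsByTagName("link").
function resolveStandardLink(links, rel) {
  for (const link of links) {
    if (link.rel.toLowerCase() === rel.toLowerCase()) {
      return link.href; // first matching LINK wins
    }
  }
  return null; // the page does not declare this relation
}

const pageLinks = [
  { rel: "start", href: "../../index.en.html" },
  { rel: "next", href: "chapter2.html" }, // hypothetical target
];
resolveStandardLink(pageLinks, "next"); // "chapter2.html"
```

A spoken command such as "computer go next" would then simply navigate to the returned href, or do nothing when the relation is absent.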

Mozilla 1.4 navigation bar using LINK

Voice recognition

A speech recognition engine based on a limited dictionary allows very high recognition accuracy, even without any training process, giving any user good usability. In this project, only about thirty commands have to be recognised. Microsoft Speech SDK 5.1 has been chosen since it is easy to deploy and convenient for the end user.
The voice program has been developed in Visual Basic, since other programs with voice recognition had already been developed in the same way for the main subject of my traineeship.

Voice recognition grammar

The recognition program is based on a Microsoft SAPI 5 Command and Control grammar, which specifies all the possible commands that can be recognised.
Here is an extract of this grammar:


<GRAMMAR LANGID="409"><!-- xml:lang="en-US" -->
      <PHRASE>computer</PHRASE><!-- introduction word -->
        <PHRASE>scroll</PHRASE><!-- scroll commands -->
        <LIST>
            <PHRASE PROPNAME="{PGDN}">down</PHRASE>
            <PHRASE PROPNAME="{PGUP}">up</PHRASE>
            <PHRASE PROPNAME="{END}">bottom</PHRASE>
            <PHRASE PROPNAME="{HOME}">top</PHRASE>
            <PHRASE PROPNAME="{DOWN}">step down</PHRASE>
            <PHRASE PROPNAME="{UP}">step up</PHRASE>
        </LIST>
        <!-- more code here -->
</GRAMMAR>

For each command, the PROPNAME attribute contains the value (token) that will be used by the recognition program, which will react accordingly.

The grammar is very constrained: most of the time, there is only one sentence that can be used for a specific action.
Here are some sentences that can be recognised, together with their meaning:

The following commands generate events as if they came from the keyboard:

computer scroll down
Scroll the window one page down
computer scroll up
Scroll the window one page up
computer scroll bottom
Scroll to the bottom of the page
computer scroll top
Scroll to the top of the page
computer scroll step down
Scroll the window 3 lines down
computer scroll step up
Scroll the window 3 lines up
computer go backward
Navigate to the previous visited page (same as "back" button)
computer go forward
Navigate to the next visited page (same as "forward" button)

The following commands are only available on Web pages that respect the accessibility rules using the LINK tag:

computer go home
Navigate to home page of the Web site
computer go up
Navigate to the parent page
computer go first
Navigate to the first page in a list of documents
computer go previous
Navigate to the previous page in a list of documents
computer go next
Navigate to the next page in a list of documents
computer go last
Navigate to the last page in a list of documents
computer go table of content
Navigate to the table of contents
computer go glossary
Navigate to the glossary
computer go index
Navigate to the index
computer go help
Navigate to the help page
computer go search
Navigate to the search page
computer go author
Navigate to the author page
computer go copyright
Navigate to the copyright page

This last set of commands deals with hyperlink navigation:

computer set display links on
Display link numbers (enabled by default)
computer set display links off
Hide link numbers
computer go link two three
(example) Navigate to the 23rd link
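The "go link" command composes its digits one by one: "two three" becomes 23. A minimal sketch of that composition, with a hypothetical helper name:

```javascript
// Sketch: turn the spoken digits of "computer go link two three"
// into the link number 23. Digits are spoken one at a time.
const DIGITS = {
  zero: 0, one: 1, two: 2, three: 3, four: 4,
  five: 5, six: 6, seven: 7, eight: 8, nine: 9,
};

function digitsToLinkNumber(words) {
  let n = 0;
  for (const word of words) {
    if (!(word in DIGITS)) return null; // unknown word: reject the command
    n = n * 10 + DIGITS[word]; // shift previous digits left, append new one
  }
  return n;
}

digitsToLinkNumber(["two", "three"]); // 23
```

The grammar only accepts digit words in this position, so the null branch is just a safety net.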

Voice recognition program

The voice recognition program has not been integrated into the Web browser and has to be executed separately.

Main window of the voice recognition program

The user just has to click on the main start button and then minimise the window. The program tries to load the grammar from the file voicelink.xml, which should be in the same directory; another grammar can be loaded if needed.
Some feedback on the different events can be displayed in the main window for testing, as in this example.

Here is an extract of the code of the voice recognition program: the code that is executed when the voice recognition engine reports a successful recognition.


'When a recognition event is raised by the voice recognition engine
Private Sub RC_Recognition(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, _
		ByVal RecognitionType As SpeechLib.SpeechRecognitionType, _
		ByVal Result As SpeechLib.ISpeechRecoResult)
	Dim Prop As ISpeechPhraseProperty
	Dim Properties As ISpeechPhraseProperties
	Dim myString As String

	myString = "" 'Keyboard sequence that will be sent to the Web browser
	Set Properties = Result.PhraseInfo.Properties
	If Not Properties Is Nothing Then
		If Properties.Count > 0 Then 'Get PROPNAME properties from the grammar
			For Each Prop In Properties
				If Not Prop Is Nothing Then
					myString = myString & Prop.Name
				End If
			Next
			SendKeys myString 'Simulate a keyboard event
		End If
	End If
	'More code here
End Sub

Eye tracking

In a second phase, an eye-tracking system will be added to this multi-modal browser.
When these words were written, data extraction from an eye-tracking system was being studied.

It could be used mainly to provide a scrolling function in Web pages: the browser could scroll the page according to the eye position. This kind of eye tracking does not require very high precision, and could provide an intelligent and pleasant reaction. Nevertheless, it is possible to imagine a spoken command to turn this feature on and off, since it can sometimes be annoying.

If this first goal is achieved, eye tracking could also be used to activate normal hyperlinks on the Web page. But this requires more precision, and therefore more complicated and/or invasive hardware.


Integration in a Web browser

Developing a Web browser from scratch is a long and heavy task, so the idea for this project is to build a plug-in that can be integrated into an existing Web browser.
Mozilla has been chosen, since it is one of the best current Web browsers: very dynamic, up-to-date and, last but not least, open-source. Netscape and other browsers are also based on Mozilla.


XUL plug-in

The XML-based User Interface Language (XUL) has been used to develop a plug-in for Mozilla that receives orders from the voice recognition and eye tracking programs.
This technology uses Internet standards such as XML, CSS2, JavaScript and RDF.
Once a XUL plug-in has been installed, it is loaded at start-up, exactly like the other functionalities of Mozilla.

Plug-in icon in Mozilla
The plug-in appears as a small icon with a microphone at the bottom of Mozilla.
A help box appears when the mouse is over it.

Here is the main XML file, which adds an icon to the status bar at the bottom right of Mozilla.


<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet href="chrome://voicelink/content/voicelink.css" type="text/css"?>
<overlay id="voicelinkOverlay"
  xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul">
 <script type="application/x-javascript" src="chrome://voicelink/content/voicelink.js" />
 <statusbar id="status-bar">
   <statusbarpanel id="risoespeech" status="off"
     tooltiptext="Risoe speech command - Disabled" />
 </statusbar>
</overlay>

A CSS2 file is used for style matters, exactly as for a Web page. Here is its structure (the style declarations themselves are elided):


statusbarpanel#risoespeech {
	/* ... */
}
statusbarpanel#risoespeech[status="off"] {
	/* ... */
}
statusbarpanel#risoespeech[status="on"] {
	/* ... */
}
statusbarpanel#risoespeech[status="wait"] {
	/* ... */
}

The programming language is JavaScript, and the code is stored in a separate file. It is possible to intercept events, interact with all the functions of Mozilla, and access and modify the current Web document. Here is a short extract of this file:


var contentArea = document.getElementById("appcontent");

function voiceLinkUnload()
{
	//...
}

function voiceLinkLoad()
{
	if (wantsActive && wantsLinksNumber) displayLinksNumber();
	//...
}

//More code here

Orders are transmitted from the voice recognition program to the browser plug-in through keyboard events. This allows the events to be managed by the user interface; with this solution, there is no need to create any special dialog.
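The plug-in side of this protocol can be sketched as a simple token-to-action mapping. The action descriptions below paraphrase the command table above; the function names are hypothetical placeholders for the real Mozilla calls:

```javascript
// Sketch: the voice program sends SendKeys tokens such as "{PGDN}";
// the plug-in maps each token it recognises to a browser action.
const ACTIONS = {
  "{PGDN}": "scroll one page down",
  "{PGUP}": "scroll one page up",
  "{END}": "scroll to bottom",
  "{HOME}": "scroll to top",
  "{DOWN}": "scroll three lines down",
  "{UP}": "scroll three lines up",
};

// Split a SendKeys-style sequence like "{END}{HOME}" into tokens
// and keep only the actions for the tokens the plug-in knows.
function actionsForSequence(sequence) {
  const tokens = sequence.match(/\{[A-Z]+\}/g) || [];
  return tokens
    .map((token) => ACTIONS[token])
    .filter((action) => action !== undefined); // ignore unknown tokens
}

actionsForSequence("{PGDN}"); // ["scroll one page down"]
```

Ordinary typing in a form field produces no such tokens, so it falls through the filter harmlessly.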


Link counting

In order for the user to be able to activate links by voice, it seemed a good idea to give a number to each link. This has been done: the user can choose to display, after each link, its number in parentheses.
Example: This is an example with a first link (1) and another one (2).
This works on any valid HTML page that respects accessibility standards; i.e. it does not work on links built with pictures and/or JavaScript.
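The numbering itself reduces to appending a 1-based index to each link label. A minimal sketch, operating on plain text labels rather than the DOM nodes the real plug-in walks:

```javascript
// Sketch of the link-numbering feature: append "(n)" after each link
// label so the user can say "computer go link ..." to activate it.
function numberLinks(labels) {
  return labels.map((label, i) => label + " (" + (i + 1) + ")");
}

numberLinks(["first link", "another one"]);
// ["first link (1)", "another one (2)"]
```

In the browser, the equivalent loop would iterate over the anchors of the current document and insert the "(n)" text node after each one, remembering the anchor for later activation.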


Future prospects for the project

If this project is successful and proves interesting on a larger scale, its development could continue, for example by providing support for other browsers such as Microsoft Internet Explorer.

Also, this project could be used to test how several modes (voice, gestures) can be used at the same time.

See also

Firefox multimodal browser