Optimized Brain Computer Interface System for Unspoken Speech Recognition: Role of Wernicke Area
In this paper, we propose an optimized brain computer
interface (BCI) system for unspoken speech recognition, based on
the fact that the constructions of unspoken words rely strongly on the
Wernicke area, situated in the temporal lobe. Our BCI system has four
modules: (i) the EEG Acquisition module based on a non-invasive
headset with 14 electrodes; (ii) the Preprocessing module to remove
noise and artifacts, using the Common Average Reference method;
(iii) the Features Extraction module, using Wavelet Packet Transform
(WPT); (iv) the Classification module based on a one-hidden layer
artificial neural network. The present study consists of comparing
the recognition accuracy of 5 Arabic words, when using all the
headset electrodes or only the 4 electrodes situated near the Wernicke
area, as well as the selection effect of the subbands produced by
the WPT module. After applying the artificial neural network on the
produced database, we obtain, on the test dataset, an accuracy of
83.4% with all the electrodes and all the subbands of 8 levels of the
WPT decomposition. However, by using only the 4 electrodes near
the Wernicke area and the 6 middle subbands of the WPT, we obtain
a large reduction of the dataset size, equal to approximately 19% of
the total dataset, with an accuracy rate of 67.5%. This reduction appears
particularly important for improving the design of a low-cost,
simple-to-use BCI trained for several words.
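The Common Average Reference step named in the Preprocessing module can be sketched as follows. This is a minimal illustration, assuming the EEG is held as a list of equal-length channels; the function name and the toy values are hypothetical and are not taken from the paper.

```python
def common_average_reference(eeg):
    """Subtract the across-channel mean from every channel, sample by sample.

    eeg: list of channels, each a list of samples of equal length.
    """
    n_channels = len(eeg)
    n_samples = len(eeg[0])
    referenced = [[0.0] * n_samples for _ in range(n_channels)]
    for t in range(n_samples):
        # Average of all electrodes at time index t
        mean_t = sum(channel[t] for channel in eeg) / n_channels
        for c in range(n_channels):
            referenced[c][t] = eeg[c][t] - mean_t
    return referenced


# Toy data: 3 channels, 2 samples (hypothetical values, not real EEG)
toy = [[1.0, 4.0], [2.0, 5.0], [3.0, 9.0]]
car = common_average_reference(toy)
```

After re-referencing, the channels sum to zero at every sample, which is a quick sanity check on any implementation.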
Speech Enhancement Using Wavelet Coefficients Masking with Local Binary Patterns
In this paper, we present a wavelet coefficients masking
based on Local Binary Patterns (WLBP) approach to enhance the
temporal spectra of the wavelet coefficients for speech enhancement.
This technique exploits the wavelet denoising scheme, which splits
the degraded speech into pyramidal subband components and extracts
frequency information without losing temporal information. Speech
enhancement in each high-frequency subband is performed by binary
labels through the local binary pattern masking that encodes the ratio
between the original value of each coefficient and the values of the
neighbour coefficients. This approach enhances the high-frequency
spectra of the wavelet transform instead of eliminating them through
a threshold. A comparative analysis is carried out with conventional
speech enhancement algorithms, demonstrating that the proposed
technique achieves significant improvements in terms of PESQ, an
international recommendation of objective measure for estimating
subjective speech quality. Informal listening tests also show that
the proposed method in an acoustic context improves the quality
of speech, avoiding the annoying musical noise present in other
speech enhancement techniques. Experimental results obtained with a
DNN based speech recognizer in noisy environments corroborate the
superiority of the proposed scheme in the robust speech recognition
The Capacity of Mel Frequency Cepstral Coefficients for Speech Recognition
Speech recognition makes an important contribution to promoting new technologies in human-computer interaction. Today, there is a growing need to employ speech technology in daily life and business activities. However, speech recognition is a challenging task that requires several stages before the desired output is obtained. Among the components of automatic speech recognition (ASR) is the feature extraction process, which parameterizes the speech signal to produce the corresponding feature vectors. The feature extraction process aims at approximating the linguistic content conveyed by the input speech signal. In the speech processing field, there are several methods to extract speech features; however, Mel Frequency Cepstral Coefficients (MFCC) is the most popular technique. It has long been observed that MFCC is dominantly used in well-known recognizers such as the Carnegie Mellon University (CMU) Sphinx and the Hidden Markov Model Toolkit (HTK). Hence, this paper focuses on the MFCC method as the standard choice to identify the different speech segments in order to obtain the language phonemes for further training and decoding steps. Owing to the good performance of MFCC, previous studies show that it dominates Arabic ASR research. In this paper, we demonstrate MFCC as well as the intermediate steps that are performed to obtain these coefficients using the HTK toolkit.
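One of the intermediate MFCC steps is the placement of filters on the mel scale. The sketch below uses a commonly used Hz-to-mel formula and spaces the filter centers evenly in mel; the filter count (26) and the 0 to 8000 Hz band are assumptions chosen for illustration, not settings reported in the abstract.

```python
import math


def hz_to_mel(f_hz):
    # Commonly used mel-scale formula
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low, f_high):
    """Center frequencies (Hz) of n_filters filters equally spaced in mel."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# Assumed configuration: 26 filters between 0 Hz and 8000 Hz
centers = mel_center_frequencies(26, 0.0, 8000.0)
```

The centers are densely packed at low frequencies and spread out at high frequencies, which is the property that makes the mel scale attractive for speech features.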
A Two-Stage Adaptation towards Automatic Speech Recognition System for Malay-Speaking Children
Recently, Automatic Speech Recognition (ASR) systems have been used to assist children in language acquisition, as they are able to detect human speech signals. Despite the benefits offered by ASR systems, there is a lack of ASR systems for Malay-speaking children. One of the contributing factors is the lack of a continuous speech database for the target users. Though cross-lingual adaptation is a common solution for developing ASR systems for under-resourced languages, it is not viable for children, as there are very limited speech databases to serve as a source model. In this research, we propose a two-stage adaptation for the development of an ASR system for Malay-speaking children using a very limited database. The two-stage adaptation comprises cross-lingual adaptation (first stage) and cross-age adaptation (second stage). For the first stage, a well-known speech database that is phonetically rich and balanced is adapted to medium-sized Malay adult speech data using supervised MLLR. The second stage uses the acoustic model generated by the first adaptation, and the target database is a small database of the target users. We measured the performance of the proposed technique using word error rate and compared it with the conventional benchmark adaptation. The two-stage adaptation proposed in this research achieves better recognition accuracy than the benchmark adaptation in recognizing children’s speech.
Possibilities, Challenges and the State of the Art of Automatic Speech Recognition in Air Traffic Control
Over the past few years, a lot of research has been
conducted to bring Automatic Speech Recognition (ASR) into various
areas of Air Traffic Control (ATC), such as air traffic control
simulation and training, monitoring live operators with the aim
of safety improvements, air traffic controller workload measurement
and conducting analysis on large quantities of controller-pilot speech.
Due to the high accuracy requirements of the ATC context and its
unique challenges, automatic speech recognition has not been widely
adopted in this field. With the aim of providing a good starting
point for researchers who are interested in bringing automatic speech
recognition into ATC, this paper gives an overview of possibilities
and challenges of applying automatic speech recognition in air traffic
control. To provide this overview, we present an updated literature
review of speech recognition technologies in general, as well as
specific approaches relevant to the ATC context. Based on this
literature review, criteria for selecting speech recognition approaches
for the ATC domain are presented, and remaining challenges and
possible solutions are discussed.
Advances in Artificial Intelligence Using Speech Recognition
This research study aims to present a retrospective
study about speech recognition systems and artificial intelligence.
Speech recognition has become one of the widely used technologies,
as it offers great opportunity to interact and communicate with
automated machines. Precisely, it can be affirmed that speech
recognition facilitates its users and helps them to perform their daily
routine tasks, in a more convenient and effective manner. This
research intends to present the illustration of recent technological
advancements, which are associated with artificial intelligence.
Recent research has revealed that speech recognition is
found to be the utmost issue, which affects the decoding of speech. In
order to overcome these issues, different statistical models were
developed by the researchers. Some of the most prominent statistical
models include acoustic model (AM), language model (LM), lexicon
model, and hidden Markov models (HMM). The research will help in
understanding all of these statistical models of speech recognition.
Researchers have also formulated different decoding methods, which
are being utilized for realistic decoding tasks and constrained
artificial languages. These decoding methods include pattern
recognition, acoustic-phonetic, and artificial intelligence approaches. It has been
recognized that artificial intelligence is the most efficient and reliable
of the methods being used in speech recognition.
Combined Automatic Speech Recognition and Machine Translation in Business Correspondence Domain for English-Croatian
The paper presents combined automatic speech
recognition (ASR) of English and machine translation (MT) for the
English-Croatian and Croatian-English language pairs in the
domain of business correspondence. The first part presents results of
training a commercial ASR system on English data sets, enriched
by error analysis. The second part presents results of machine
translation performed by a free online tool for the English-Croatian
and Croatian-English language pairs. Human evaluation in terms of
usability is conducted and internal consistency calculated by
Cronbach's alpha coefficient, enriched by error analysis. Automatic
evaluation is performed by WER (Word Error Rate) and PER
(Position-independent word Error Rate) metrics, followed by
investigation of Pearson’s correlation with human evaluation.
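The two automatic metrics can be sketched as follows. WER is computed from the word-level edit distance; PER is given in one common bag-of-words formulation, which may differ in detail from the variant used in the paper. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length (non-empty)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def position_independent_error_rate(reference, hypothesis):
    """Bag-of-words error rate: word order is ignored (assumed formulation)."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return errors / len(ref)


# Hypothetical sentence pair with swapped words and one missing word
ref_text = "please send the invoice today"
hyp_text = "send please the invoice"
wer = word_error_rate(ref_text, hyp_text)
per = position_independent_error_rate(ref_text, hyp_text)
```

Because PER ignores word order, it is never larger than WER for the same sentence pair; the gap between the two indicates how much of the error is due to reordering.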
Recognition of Noisy Words Using the Time Delay Neural Networks Approach
This paper presents a recognition system for isolated
words such as robot commands, implemented with Time Delay Neural
Networks (TDNN). The aim is to teleoperate a robot for specific tasks
such as turn, close, etc., in an industrial environment, taking into
account the noise coming from the machine. The choice of TDNN is
based on its generalization in terms of accuracy; in addition, it acts
as a filter that passes certain desirable frequency characteristics of
speech. The goal is to determine the parameters of this filter so as to
make the system adaptable to the variability of the speech signal and,
especially, to noise; for this, the back propagation technique was used
in the learning phase. The approach was applied to commands pronounced
in two languages separately: French and Arabic. The results for
two test sets of 300 spoken words each are 87% and 97.6% in a
neutral environment, and 77.67% and 92.67% when white Gaussian
noise was added at an SNR of 35 dB.
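Corrupting a clean recording with white Gaussian noise at a target SNR can be sketched as below. The test tone, sampling rate, and fixed random seed are assumptions made for the illustration; the paper does not describe how its noise was generated.

```python
import math
import random


def add_white_gaussian_noise(signal, snr_db, seed=0):
    """Return signal plus zero-mean Gaussian noise scaled to the target SNR."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """SNR in dB computed from the clean signal and the actual added noise."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


# Assumed stand-in for speech: one second of a 440 Hz tone sampled at 16 kHz
clean = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(16000)]
noisy = add_white_gaussian_noise(clean, snr_db=35.0)
```

Measuring the SNR after mixing, as `measured_snr_db` does, is a cheap way to confirm that the scaling is right before running recognition experiments.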
Analysis of Combined Use of NN and MFCC for Speech Recognition
This paper illustrates the performance and analysis of a speech
recognition system. The approach recognizes the English words
corresponding to the digits (0-9), spoken by 2 different
speakers and captured in a noise-free environment. For feature extraction,
Mel frequency cepstral coefficients (MFCC) are used,
which give a set of feature vectors from the recorded speech samples.
A neural network model is used to enhance the recognition
performance: a feed-forward neural network trained with the back
propagation algorithm. Other speech recognition
techniques, such as HMM and DTW, also exist. All experiments are carried
out in Matlab.
Bidirectional Dynamic Time Warping Algorithm for the Recognition of Isolated Words Impacted by Transient Noise Pulses
We consider the biggest challenge in speech recognition: noise reduction. Traditionally, detected transient noise pulses are removed together with the corrupted speech using pulse models. In this paper, we propose to cope with the problem directly in the Dynamic Time Warping domain. A Bidirectional Dynamic Time Warping algorithm for the recognition of isolated words impacted by transient noise pulses is proposed. It uses a simple transient noise pulse detector, employs bidirectional computation of dynamic time warping, and directly manipulates the warping results. Experimental investigation with several alternative solutions confirms the effectiveness of the proposed algorithm in reducing the impact of noise on the recognition process: a 3.9% increase in noisy speech recognition is achieved.
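For orientation, the sketch below implements plain Dynamic Time Warping between two one-dimensional sequences. It is the standard algorithm only; the pulse detector and the bidirectional manipulation of warping results proposed in the paper are not reproduced here. The function name and the symmetric step pattern are assumptions.

```python
def dtw_distance(a, b):
    """Accumulated DTW cost between sequences a and b (absolute difference)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances
                                 cost[i][j - 1],      # b advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# A repeated sample in b is absorbed by the warping at no cost
same_word = dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 4])
```

With this symmetric step pattern, running the same function on both sequences reversed yields the same total cost, which is one way to picture a backward pass over the warping grid.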
Efficient DTW-Based Speech Recognition System for Isolated Words of Arabic Language
Despite the fact that the Arabic language is currently one
of the most common languages worldwide, there has been only
little research on Arabic speech recognition relative to other
languages such as English and Japanese. Generally, digital speech
processing and voice recognition algorithms are of special
importance for designing efficient, accurate, as well as fast automatic
speech recognition systems. However, the speech recognition process
carried out in this paper is divided into three stages as follows: firstly,
the signal is preprocessed to reduce noise effects. After that, the
signal is digitized and hearingized. Consequently, the voice activity
regions are segmented using voice activity detection (VAD)
algorithm. Secondly, features are extracted from the speech signal
using Mel-frequency cepstral coefficients (MFCC) algorithm.
Moreover, delta and acceleration (delta-delta) coefficients have been
added to improve the recognition accuracy. Finally,
each test word's features are compared to the training database using
the dynamic time warping (DTW) algorithm. Using the best setup
found for all the parameters affecting the aforementioned techniques,
the proposed system achieved a recognition rate of about 98.5%,
which outperformed other HMM and ANN-based approaches
available in the literature.
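The delta and acceleration coefficients mentioned above can be sketched with a commonly used regression formula over neighbouring frames. The window size of 2 and the edge handling (repeating the first and last frame) are assumptions; the abstract does not state which settings were used.

```python
def delta(frames, window=2):
    """Regression-based delta features.

    frames: list of feature vectors (lists of floats), one per time frame.
    Frames beyond the edges are replaced by the first or last frame.
    """
    n_frames = len(frames)
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


# Hypothetical one-dimensional "cepstral" track that rises by 1 per frame
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
d1 = delta(ramp)   # delta coefficients
d2 = delta(d1)     # acceleration (delta-delta) coefficients
```

Applying the same function twice gives the acceleration coefficients, so the three streams (static, delta, delta-delta) can be concatenated frame by frame before the DTW comparison.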
Comparison of Parameterization Methods in Recognizing Spoken Arabic Digits
This paper presents an evaluation of sound parameterization methods for recognizing some spoken Arabic words, namely the digits from zero to nine. Each isolated spoken word is represented by a single template based on a specific recognition feature, and recognition is based on the Euclidean distance from those templates. The performance analysis of recognition is based on four parameterization features: the Burg Spectrum Analysis, the Walsh Spectrum Analysis, the Thomson Multitaper Spectrum Analysis and the Mel Frequency Cepstral Coefficients (MFCC) features. The main aim of this paper is to compare, analyze, and discuss the outcomes of spoken Arabic digit recognition systems based on the selected recognition features. The results acquired confirm that the use of MFCC features is a very promising method for recognizing spoken Arabic digits.
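The single-template, Euclidean-distance decision rule can be sketched as follows. The three-dimensional feature vectors and labels are hypothetical placeholders; in the paper each template would come from one of the four parameterization methods.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(features, templates):
    """Return the label of the template closest to the feature vector."""
    return min(templates, key=lambda label: euclidean(features, templates[label]))


# Hypothetical templates: one feature vector per word
templates = {
    "zero": [0.0, 1.0, 0.5],
    "one": [1.0, 0.0, 0.2],
    "two": [0.5, 0.5, 0.9],
}
label = classify([0.9, 0.1, 0.3], templates)
```

Since every word has exactly one template, recognition costs one distance computation per vocabulary entry, which keeps the comparison between parameterization methods simple.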
Recognition by Online Modeling – a New Approach of Recognizing Voice Signals in Linear Time
This work presents a novel means of extracting fixed-length parameters from voice signals, such that words can be recognized
in linear time. The power and the zero crossing rate are first
calculated segment by segment from a voice signal; by doing so, two
feature sequences are generated. We then construct an FIR system
across these two sequences. The parameters of this FIR system, used
as the input of a multilayer perceptron recognizer, can be derived by
recursive LSE (least-square estimation), implying that the complexity of the overall process is linear in the signal size. In the second part of
this work, we introduce a weighting factor λ to emphasize recent
input; therefore, we can further recognize continuous speech signals.
Experiments employ the voice signals of numbers, from zero to nine, spoken in Mandarin Chinese. The proposed method is verified to
recognize voice signals efficiently and accurately.
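The first step of the method, computing power and zero crossing rate segment by segment, can be sketched as below. Non-overlapping segments and the normalization of the crossing count are assumptions; the FIR modelling and recursive LSE stages are not shown.

```python
def segment_features(signal, segment_length):
    """Return (powers, zcrs): one value of each per non-overlapping segment."""
    powers, zcrs = [], []
    for start in range(0, len(signal) - segment_length + 1, segment_length):
        seg = signal[start:start + segment_length]
        # Mean squared amplitude of the segment
        powers.append(sum(x * x for x in seg) / segment_length)
        # Fraction of adjacent sample pairs whose signs differ
        crossings = sum(1 for a, b in zip(seg, seg[1:]) if (a >= 0) != (b >= 0))
        zcrs.append(crossings / (segment_length - 1))
    return powers, zcrs


# Toy signal: an alternating segment followed by a constant segment
powers, zcrs = segment_features([1, -1, 1, -1, 2, 2, 2, 2], 4)
```

Both features are computed in a single pass over the samples, which is consistent with the linear-time claim made for the overall method.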
Applications of Support Vector Machines on Smart Phone Systems for Emotional Speech Recognition
An emotional speech recognition system for the
applications on smart phones was proposed in this study to combine
with 3G mobile communications and social networks to provide users
and their groups with more interaction and care. This study developed
a mechanism using the support vector machines (SVM) to recognize
the emotions of speech such as happiness, anger, sadness and normal.
The mechanism uses a hierarchical classifier to adjust the weights of
acoustic features and divides various parameters into the categories of
energy and frequency for training. In this study, 28 commonly used
acoustic features including pitch and volume were proposed for
training. In addition, a time-frequency parameter obtained by
continuous wavelet transforms was also used to identify the accent and
intonation in a sentence during the recognition process. The Berlin
Database of Emotional Speech was used by dividing the speech into
male and female data sets for training. According to the experimental
results, the accuracies of male and female test sets were increased by
4.6% and 5.2% respectively after using the time-frequency parameter
for classifying happy and angry emotions. For the classification of all
emotions, the average accuracy, including male and female data, was
63.5% for the test set and 90.9% for the whole data set.
Face Localization Using Illumination-dependent Face Model for Visual Speech Recognition
A robust still image face localization algorithm
capable of operating in an unconstrained visual environment is
proposed. First, construction of a robust skin classifier within a
shifted HSV color space is described. Then various filtering
operations are performed to better isolate face candidates and
mitigate the effect of substantial non-skin regions. Finally, a novel
Bhattacharyya-based face detection algorithm is used to compare
candidate regions of interest with a unique illumination-dependent
face model probability distribution function approximation.
Experimental results show a 90% face detection success rate despite
the demands of the visually noisy environment.
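The Bhattacharyya comparison between a candidate region and a face model can be illustrated on discrete histograms. This sketch shows only the standard coefficient and distance; the illumination-dependent model and the detection thresholds of the paper are not reproduced, and the histogram values are hypothetical.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Overlap between two histograms (normalized internally); 1 = identical."""
    p_total, q_total = sum(p), sum(q)
    return sum(math.sqrt((a / p_total) * (b / q_total)) for a, b in zip(p, q))


def bhattacharyya_distance(p, q):
    """Negative log of the coefficient; infinite when there is no overlap."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0.0 else -math.log(bc)


# Hypothetical 4-bin color histograms: a face model and two candidate regions
model = [10, 30, 40, 20]
candidate_a = [12, 28, 38, 22]   # similar distribution
candidate_b = [60, 30, 5, 5]     # different distribution
score_a = bhattacharyya_coefficient(model, candidate_a)
score_b = bhattacharyya_coefficient(model, candidate_b)
```

A candidate whose coefficient is closer to 1 matches the model distribution more closely, so ranking candidates by this score is one simple way to pick the most face-like region.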
Using Teager Energy Cepstrum and HMM Distances in Automatic Speech Recognition and Analysis of Unvoiced Speech
In this study, the use of silicon NAM (Non-Audible
Murmur) microphone in automatic speech recognition is presented.
NAM microphones are special acoustic sensors, which are attached
behind the talker's ear and can capture not only normal (audible)
speech, but also very quietly uttered speech (non-audible murmur).
As a result, NAM microphones can be applied in automatic speech
recognition systems when privacy is desired in human-machine communication.
Moreover, NAM microphones show robustness against
noise and they might be used in special systems (speech recognition,
speech conversion etc.) for sound-impaired people. Using a small
amount of training data and adaptation approaches, 93.9% word
accuracy was achieved for a 20k Japanese vocabulary dictation
task. Non-audible murmur recognition in noisy environments is also
investigated. In this study, further analysis of the NAM speech has
been made using distance measures between hidden Markov model
(HMM) pairs. Using a metric distance, it has been shown that the spectral
space of NAM speech is reduced; however, the locations of the different
phonemes of NAM are similar to the locations of the phonemes
of normal speech, and the NAM sounds are well discriminated.
Promising results in using nonlinear features are also introduced,
especially under noisy conditions.
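The nonlinear features referred to in the title build on the discrete Teager energy operator, sketched below in its standard form. How the paper turns this operator into cepstral features is not described in the abstract, so only the operator itself is shown; the test tone is an assumption.

```python
import math


def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The output is two samples shorter than the input (no value at the edges).
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Assumed test signal: a sinusoid with amplitude 2 and digital frequency 0.3 rad
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(50)]
psi = teager_energy(tone)
```

For a pure sinusoid the operator returns the constant value A^2 sin^2(omega), so it tracks both amplitude and frequency, unlike the ordinary squared-amplitude energy.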
Improvement of MLLR Speaker Adaptation Using a Novel Method
This paper presents a speaker adaptation
method called WMLLR, which is based on maximum likelihood linear
regression (MLLR). In MLLR, a linear regression-based transform
that adapts the HMM mean vectors is calculated to maximize the
likelihood of the adaptation data. In this paper, the prior knowledge of the
initial model is adequately incorporated into the adaptation. A series of
speaker adaptation experiments is carried out on a database of 30 famous
city names to investigate the efficiency of the proposed method.
Experimental results show that the WMLLR method outperforms the
conventional MLLR method, especially when only a few utterances
from a new speaker are available for adaptation.
Speech Coding and Recognition
This paper investigates the performance of a speech
recognizer in an interactive voice response system for various coded
speech signals, coded by using a vector quantization technique namely
Multi Switched Split Vector Quantization Technique. The process of
recognizing the coded output can be used in Voice banking application.
The recognition technique used for the recognition of the coded speech
signals is the Hidden Markov Model technique. The spectral distortion
performance, computational complexity, and memory requirements of
Multi Switched Split Vector Quantization Technique and the
performance of the speech recognizer at various bit rates have been
computed. The results show that the speech recognizer
performs better at 24 bits/frame and that the
percentage of recognition varies from 100% to 93.33% across the
various bit rates.
A Modified Speech Enhancement Using Adaptive Gain Equalizer with Non linear Spectral Subtraction for Robust Speech Recognition
In this paper, we present an enhanced noise reduction method for robust speech recognition using an Adaptive Gain Equalizer with Non linear Spectral Subtraction. In the Adaptive Gain Equalizer (AGE) method, the input signal is divided into a number of subbands that are individually weighted in the time domain, in accordance with the short-time Signal-to-Noise Ratio (SNR) estimated in each subband at every time instant. Instead of focusing on suppressing the noise, the method focuses on enhancing the speech. When analysis was done under various noise conditions for speech recognition, it was found that the Adaptive Gain Equalizer algorithm has an obvious failing point at an SNR of -5 dB, with inadequate levels of noise suppression for SNRs below this point. This work proposes the implementation of AGE coupled with Non linear Spectral Subtraction (AGE-NSS) for robust speech recognition. The experimental results show that our AGE-NSS outperforms the AGE when the SNR drops below the -5 dB level.
Architecture of Speech-based Registration System
In this era of technology, fueled by the pervasive usage of the internet, security is a prime concern. The number of new attacks by so-called "bots", which are automated programs, is increasing at an alarming rate. They are most likely to attack online registration systems. A technology called "CAPTCHA" (Completely Automated Public Turing test to tell Computers and Humans Apart) exists, which can differentiate between automated programs and humans and prevent replay attacks. Traditionally, CAPTCHAs have been implemented as a challenge that involves recognizing textual images and reproducing the text. We propose an approach in which the visual challenge has to be read aloud, and randomly selected keywords from it are used to verify the correctness of the spoken text and, in turn, to detect the presence of a human. This is supplemented with a speaker recognition system that can also identify the speaker. Thus, this framework fulfills both objectives: it can determine whether the user is a human or not and, if so, it can verify the user's identity.
Efficient System for Speech Recognition using General Regression Neural Network
In this paper we present an efficient system for
speaker-independent speech recognition based on a neural network
approach. The proposed architecture comprises two phases: a
preprocessing phase which consists in segmental normalization and
features extraction and a classification phase which uses neural
networks based on nonparametric density estimation namely the
general regression neural network (GRNN). The relative
performances of the proposed model are compared to the similar
recognition systems based on the Multilayer Perceptron (MLP), the
Recurrent Neural Network (RNN) and the well known Discrete
Hidden Markov Model (HMM-VQ), which we have also implemented.
Experimental results obtained with Arabic digits have shown that the
use of nonparametric density estimation with an appropriate
smoothing factor (spread) improves the generalization power of the
neural network. The word error rate (WER) is reduced significantly
over the baseline HMM method. GRNN computation is a successful
alternative to the other neural networks and DHMM.
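The role of the smoothing factor (spread) can be seen in a minimal GRNN sketch, written here as Gaussian-kernel weighted averaging of the training targets. The two-dimensional inputs, one-hot targets, and spread value are hypothetical; the paper's segmental normalization and feature extraction are not modelled.

```python
import math


def grnn_predict(x, train_inputs, train_targets, spread):
    """Kernel-weighted average of training targets (GRNN output).

    Each training pattern contributes with weight exp(-d^2 / (2 * spread^2)),
    where d is its Euclidean distance to x. Assumes at least one non-zero
    weight (x not extremely far from all patterns relative to the spread).
    """
    weights = []
    for xi in train_inputs:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-sq_dist / (2.0 * spread ** 2)))
    total = sum(weights)
    n_outputs = len(train_targets[0])
    return [sum(w * t[k] for w, t in zip(weights, train_targets)) / total
            for k in range(n_outputs)]


# Hypothetical two-class problem with one-hot targets
train_inputs = [[0.0, 0.0], [1.0, 1.0]]
train_targets = [[1.0, 0.0], [0.0, 1.0]]
near_first = grnn_predict([0.1, 0.0], train_inputs, train_targets, spread=0.3)
midpoint = grnn_predict([0.5, 0.5], train_inputs, train_targets, spread=0.3)
```

A small spread makes the network behave like a nearest-neighbour classifier, while a large spread averages over many patterns; choosing it appropriately is what the abstract credits for the improved generalization.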
Online Collaborative Learning System Using Speech Technology
A Web-based learning tool, the Learn IN Context
(LINC) system, designed and being used in some institution's
courses in mixed-mode learning, is presented in this paper. This
mode combines face-to-face and distance approaches to education.
LINC can achieve both collaborative and competitive learning. In
order to provide both learners and tutors with a more natural way to
interact with e-learning applications, a conversational interface has
been included in LINC. Hence, the components and essential features
of LINC+, the voice enhanced version of LINC, are described. We
report evaluation experiments of LINC/LINC+ in a real use context
of a computer programming course taught at the Université de
Moncton (Canada). The findings show that when the learning
material is delivered in the form of a collaborative and voice-enabled
presentation, the majority of learners seem to be satisfied with this
new media, and confirm that it does not negatively affect their
Speech Recognition Using Scaly Neural Networks
This research work is aimed at speech recognition
using scaly neural networks. A small vocabulary of 11 words was
established first; these words are "word, file, open, print, exit, edit,
cut, copy, paste, doc1, doc2". The chosen words are associated with
executing some computer functions such as opening a file, printing
a certain text document, cutting, copying, pasting, editing and exiting.
Each word is introduced to the computer and then subjected to a feature
extraction process using LPC (linear prediction coefficients). These features are
used as input to an artificial neural network in speaker dependent
mode. Half of the words are used for training the artificial neural
network and the other half are used for testing the system; those are
used for information retrieval.
The system consists of three parts: speech
processing and feature extraction, training and testing using neural
networks, and information retrieval.
The retrieval process proved to be 79.5-88% successful, which is
quite acceptable, considering the variation in surroundings, state of
the person, and the microphone type.
A System of Automatic Speech Recognition based on the Technique of Temporal Retiming
We report in this paper the procedure of a system of
automatic speech recognition based on dynamic programming
techniques. Temporal retiming is a technique
used to synchronize two forms to be compared. We will see how
this technique is adapted to the field of automatic speech
recognition. We first present the theory of the
retiming function, which is used to compare and align an
unknown form with the set of reference forms constituting the
vocabulary of the application. We then give
the various algorithms necessary for its implementation on a machine.
The algorithms presented were tested on part of the
Arabic word corpus Arabdic-10 and gave full
satisfaction. These algorithms are effective insofar as we apply them
to small or medium-sized vocabularies.
Speaker Independent Quranic Recognizer Based on Maximum Likelihood Linear Regression
An automatic speech recognition system for the
formal Arabic language is needed. The Quran is the most formal
spoken book in Arabic; it is spoken all over the world. In this
research, a speaker-independent automatic speech recognizer for the Quran
was developed and tested. The system was developed
based on the tri-phone Hidden Markov Model and Maximum
Likelihood Linear Regression (MLLR). The MLLR computes a set
of transformations which reduces the mismatch between an initial
model set and the adaptation data. It uses the regression class tree and
estimates a set of linear transformations for the mean and
variance parameters of a Gaussian mixture HMM system. The 30th
Chapter of the Quran, with five of the most famous readers of the
Quran, was used for the training and testing of the data. The chapter
includes about 2000 distinct words. The advantages of using the
Quranic verses as the database in this developed recognizer are the
uniqueness of the words and the high level of orderliness between
verses. The level of accuracy from the tested data ranged from 68% to 85%.
SySRA: A System of a Continuous Speech Recognition in Arab Language
We report in this paper the model adopted by our
continuous speech recognition system for the Arabic language, SySRA,
and the results obtained so far. This system uses the
Arabdic-10 database, which is a manually segmented corpus of words
for the Arabic language. Phonetic decoding is performed
by an expert system whose knowledge base is expressed in the
form of production rules. This expert system transforms a vocal
signal into a phonetic lattice. The higher level of the system takes
care of the recognition of the lattice thus obtained by rendering it in
the form of written sentences (orthographic form). This level
initially contains the lexical analyzer, which is none other than the
recognition module. We subjected this analyzer to a set of
spectrograms obtained by dictating twenty sentences in
Arabic. The recognition rate for these sentences is about 70%,
which is, to our knowledge, the best result for the recognition of the
Arabic language. The test set consists of twenty sentences from four
speakers who did not take part in the training.
Speech Activated Automation
This article presents a simple way to implement programmed voice commands for interfacing with commercial Digital and Analogue Input/Output PCI cards used in Robotics and Automation applications. Robots and Automation equipment can "listen" to voice commands and perform several different tasks, approaching human behavior and improving the human-machine interfaces for the Automation Industry. Since most PCI Digital and Analogue Input/Output cards are sold with several DLLs included (for use with different programming languages), it is possible to add speech recognition capability using a standard speech recognition engine compatible with the programming languages used. In this work, a Visual Basic 6 (the world's most popular language) application was created that listens to several voice commands and is capable of communicating directly with several standard 128 Digital I/O PCI Cards, used to control complete Automation Systems, with up to (number of boards used) x 128 Sensors and/or Actuators.
Investigation of Combined use of MFCC and LPC Features in Speech Recognition Systems
The statement of the automatic speech recognition
problem, the purpose of speech recognition and its application
fields are presented in the paper. For Azerbaijani speech,
the principles of building a speech recognition system and the
problems arising in the system are investigated. The algorithms for
computing speech features, which are the main part
of a speech recognition system, are analyzed. From this point of view,
algorithms for determining the Mel Frequency Cepstral Coefficients
(MFCC) and Linear Predictive Coding (LPC) coefficients expressing
the basic speech features are developed. The combined use of MFCC and
LPC cepstra in a speech recognition system is suggested to
improve the reliability of the system. To this end, the
recognition system is divided into MFCC- and LPC-based recognition
subsystems. The training and recognition processes are carried out in
both subsystems separately, and the recognition system accepts a decision
only when both subsystems give the same result. This decreases the
error rate during recognition. The training and recognition processes
are implemented with artificial
neural networks in the automatic speech recognition system. The
neural networks are trained by the conjugate gradient method. The
paper investigates the problems related to the number of speech features
observed when training the neural networks of the MFCC- and LPC-based
speech recognition subsystems. The variation in the results of neural
networks trained from different
initial points is analyzed. A methodology for the
combined use of neural networks trained from different initial points
in a speech recognition system is suggested to improve the reliability
of the recognition system and increase the recognition quality, and
the practical results obtained are shown.
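The agreement rule between the MFCC-based and LPC-based subsystems can be sketched as follows, under the reading that a word is accepted only when both subsystems give the same result. The score dictionaries, labels, and function names are hypothetical; the neural networks that would produce the scores are not shown.

```python
def best_label(scores):
    """Label with the highest score in a {label: score} dictionary."""
    return max(scores, key=scores.get)


def combine_decisions(mfcc_scores, lpc_scores):
    """Accept a word only if both subsystems pick it; otherwise reject (None)."""
    mfcc_label = best_label(mfcc_scores)
    lpc_label = best_label(lpc_scores)
    return mfcc_label if mfcc_label == lpc_label else None


# Hypothetical network outputs for one utterance from each subsystem
agree = combine_decisions({"one": 0.9, "two": 0.1}, {"one": 0.7, "two": 0.3})
disagree = combine_decisions({"one": 0.6, "two": 0.4}, {"one": 0.2, "two": 0.8})
```

Rejecting utterances on which the subsystems disagree trades some coverage for a lower error rate, since a wrong word is output only when both subsystems make the same mistake.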
Voice Driven Applications in Non-stationary and Chaotic Environment
Automated operations based on voice commands will become more and more important in many applications, including robotics, maintenance operations, etc. However, voice command recognition rates drop considerably in non-stationary and chaotic noise environments. In this paper, we aim to significantly improve speech recognition rates in non-stationary noise environments. First, 298 Navy acronyms were selected for automatic speech recognition. Data sets were collected under 4 types of noisy environments: factory, buccaneer jet, babble noise in a canteen, and destroyer. Within each noisy environment, 4 levels (5 dB, 15 dB, 25 dB, and clean) of Signal-to-Noise Ratio (SNR) were introduced to corrupt the speech. Second, a new algorithm to estimate speech and non-speech regions was developed, implemented, and evaluated. Third, extensive simulations were carried out. It was found that the combination of the new algorithm, the proper selection of a language model, and customized training of the speech recognizer based on clean speech yielded very high recognition rates, between 80% and 90% for the four different noisy conditions. Fourth, extensive comparative studies were also carried out.
A New Vector Quantization Front-End Process for Discrete HMM Speech Recognition System
The paper presents a complete discrete statistical framework based on a novel vector quantization (VQ) front-end process. This new VQ approach performs an optimal distribution of VQ codebook components over HMM states. This technique, which we named the distributed vector quantization (DVQ) of hidden Markov models, succeeds in unifying acoustic micro-structure and phonetic macro-structure when the estimation of HMM parameters is performed. The DVQ technique is implemented through two variants. The first variant uses the K-means algorithm (K-means-DVQ) to optimize the VQ, while the second variant exploits the benefits of the classification behavior of neural networks (NN-DVQ) for the same purpose. The proposed variants are compared with the HMM-based baseline system through experiments on the recognition of specific Arabic consonants. The results show that the distributed vector quantization technique increases the performance of the discrete HMM system.
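As background for the K-means variant, the sketch below trains a plain K-means VQ codebook and quantizes vectors to codeword indices, which is the kind of front end a discrete HMM consumes. It is standard K-means only; the distribution of codebook components over HMM states that defines DVQ is not reproduced. The data, initial codebook, and function names are hypothetical.

```python
def nearest(vector, codebook):
    """Index of the codeword closest to vector (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, codebook[i])))


def kmeans_codebook(vectors, initial_codebook, iterations=10):
    """Refine a codebook by alternating assignment and centroid update."""
    codebook = [list(c) for c in initial_codebook]
    for _ in range(iterations):
        clusters = [[] for _ in codebook]
        for v in vectors:
            clusters[nearest(v, codebook)].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old codeword if its cluster is empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


# Hypothetical 2-D acoustic vectors forming two groups
vectors = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
codebook = kmeans_codebook(vectors, [[0.0, 0.0], [10.0, 10.0]])
symbols = [nearest(v, codebook) for v in vectors]
```

The resulting `symbols` sequence is the discrete observation stream; each frame is replaced by the index of its nearest codeword before HMM training or decoding.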