Speech recognition (Japanese: 音声認識, onsei ninshiki)

A technology that analyzes a speech signal picked up by a microphone and outputs what was said. It estimates and outputs the most likely word sequence for the input speech signal by combining, among other things, an acoustic model, which describes the acoustic features of the phonemes assumed to be present in the signal; a pronunciation dictionary, which describes which phonemes make up each word of the language being recognized; and a language model, which describes how words connect to form sentences. Conventional speech recognition uses the rough shape of the log power spectrum of the speech signal (the spectral envelope; → envelope) as the acoustic feature and builds the acoustic model with a hidden Markov model (HMM), which describes the sequence of acoustic features probabilistically. Stochastic regular grammars have been used for the pronunciation dictionary and the language model. The parameters of these models (→ parameter) are obtained by supervised learning (→ machine learning), in which correct-answer data is provided. A large-vocabulary continuous speech recognition system, which targets freely spoken sentences, handles 30 to 70 types of phonemes, a vocabulary of 10,000 to 1 million words, and the frequencies of word sequences about 2 to 5 words long. To improve recognition accuracy, training on training data is carried out for about 100 to 1,000 hours. Since around 2010, when deep learning came to be widely applied to pattern recognition problems, deep learning methods have been used to build acoustic models, raising recognition accuracy, and speech recognition functions such as Apple's Siri software have come into wide use on consumer devices. (→ computer science)
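As a rough illustration of how the three components combine, the sketch below picks the word sequence W that maximizes log P(X | W) + log P(W) for an input X. Everything in it is an assumption made for illustration: the lexicon, the bigram probabilities, the per-segment phoneme likelihoods, and the names `LEXICON`, `BIGRAM`, `SEGMENTS`, and `decode`. It also assumes the signal has already been cut into one segment per phoneme and searches by brute-force enumeration, whereas the entry describes an HMM-based acoustic model, so this is a toy stand-in rather than how a real recognizer works.

```python
import math
from itertools import product

# Hypothetical pronunciation dictionary: word -> phoneme sequence.
# "hi" and "high" are homophones, so only the language model can separate them.
LEXICON = {
    "hi": ["h", "ay"],
    "high": ["h", "ay"],
    "there": ["dh", "eh", "r"],
}

# Hypothetical bigram language model: P(word | previous word); "<s>" is sentence start.
BIGRAM = {
    ("<s>", "hi"): 0.5,
    ("<s>", "high"): 0.4,
    ("<s>", "there"): 0.1,
    ("hi", "there"): 0.6,
    ("high", "there"): 0.1,
}
UNSEEN = 1e-4     # floor probability for word pairs missing from BIGRAM
AM_FLOOR = 1e-6   # floor likelihood for phonemes missing from a segment


def lm_logprob(words):
    """Log P(W) under the toy bigram model."""
    history = ["<s>"] + list(words[:-1])
    return sum(math.log(BIGRAM.get((prev, w), UNSEEN))
               for prev, w in zip(history, words))


def acoustic_logprob(words, segments):
    """Log P(X | W), assuming exactly one segment per phoneme.

    Returns None when the phoneme count does not match the segment count.
    """
    phonemes = [p for w in words for p in LEXICON[w]]
    if len(phonemes) != len(segments):
        return None
    return sum(math.log(seg.get(p, AM_FLOOR))
               for p, seg in zip(phonemes, segments))


def decode(segments, max_words=3):
    """Enumerate word sequences and return the best one with its log score."""
    best_words, best_score = None, -math.inf
    for n in range(1, max_words + 1):
        for words in product(LEXICON, repeat=n):
            am = acoustic_logprob(words, segments)
            if am is None:
                continue
            score = am + lm_logprob(words)
            if score > best_score:
                best_words, best_score = list(words), score
    return best_words, best_score


# Hypothetical acoustic evidence: per-segment likelihood of each phoneme.
SEGMENTS = [
    {"h": 0.9, "dh": 0.1},
    {"ay": 0.8, "eh": 0.2},
    {"dh": 0.7, "h": 0.3},
    {"eh": 0.9, "ay": 0.1},
    {"r": 1.0},
]

best_words, best_score = decode(SEGMENTS)
print(best_words)  # ['hi', 'there']
```

In this toy setup "hi there" and "high there" receive identical acoustic scores because the dictionary gives both first words the same phonemes, so the choice between them comes entirely from the language model. Practical systems replace the enumeration with dynamic-programming search over a much larger hypothesis space.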

Source: Encyclopaedia Britannica Concise Encyclopedia

