A machine learning technique in which learning cues are given in the form of rewards rather than labeled correct answers. In the human basal ganglia, dopamine is thought to serve as a reward signal, with behavioral learning driven by the prediction and acquisition of reward; reinforcement learning applies this principle to machine learning. An agent (→ agent system) interacting with its surroundings (the environment) gathers information while acting in the environment and learns behavioral rules (a policy) that maximize its reward. The environment is formalized as a probabilistic state-transition model, a Markov decision process: when the agent takes one of the actions executable in the environment's current state, the state transitions with a certain probability and the agent receives a corresponding reward. The Markov property means that the transition probabilities and the associated rewards are determined solely by the environment's current state and the action the agent takes. Through trial and error, the agent discovers on its own which actions are correct in the various situations it encounters and which rewards follow from which actions, aiming to maximize its reward. For designing the reward function that assigns rewards, approaches include estimating it from behavioral histories (inverse reinforcement learning) and learning the behavioral rules while estimating the reward function in parallel (apprenticeship learning). (→ Artificial Intelligence)
Source: Encyclopaedia Britannica Concise Encyclopedia
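The trial-and-error loop described above can be sketched with tabular Q-learning, one standard reinforcement-learning algorithm. This is a minimal illustration, not part of the original entry: the chain-shaped environment, its slip probability, and all parameter values below are assumptions chosen for the example.

```python
import random

# A toy Markov decision process: states 0..3 in a chain. Action 1 ("right")
# moves toward state 3 (the goal, reward +1); action 0 ("left") moves back.
# Transitions are slightly noisy to illustrate the probabilistic model.
N_STATES, GOAL = 4, 3
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(state, action):
    """Return (next_state, reward). Markov property: the outcome depends
    only on the current state and the chosen action."""
    move = 1 if action == 1 else -1
    if random.random() < 0.1:          # 10% chance the move "slips"
        move = -move
    nxt = min(max(state + move, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def q_learning(episodes=2000, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Learn action values Q(s, a) by trial and error."""
    random.seed(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[s][act])
            s2, r = step(s, a)
            # Temporal-difference update toward reward + discounted future value.
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = q_learning()
# The learned policy picks the highest-valued action in each state.
policy = [max(ACTIONS, key=lambda act: q[s][act]) for s in range(N_STATES)]
```

After training, the greedy policy moves right in every non-goal state, i.e. the agent has acquired the behavioral rule that maximizes its reward without ever being told the correct action directly.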