Regression analysis is a method that divides the variables of interest into explanatory variables (also called predictor variables or independent variables), which are used to explain or predict, and dependent variables (also called criterion variables), which serve as the criterion to be explained or predicted; a statistical model is set up between the two, and the relationship between them is inferred. These statistical models are broadly divided into linear regression models and nonlinear regression models.

[Linear regression model] A linear regression model predicts and explains a dependent variable y from p explanatory variables x1, …, xp by the prediction formula

ŷ = b1x1 + … + bpxp + c,

in which each explanatory variable is multiplied by a regression coefficient bj (j = 1, …, p) and an intercept c is added to their sum. Linear regression analysis refers to the statistical method that determines the b1, …, bp, c minimizing the magnitude of the error e = y − ŷ between this prediction and the dependent variable.

[Simple regression analysis and multiple regression analysis] When we wish to distinguish the analysis with p = 1, that is, ŷ = bx + c, from an analysis with p ≥ 2, the former is called simple regression analysis and the latter multiple regression analysis. Attaching the subscript i (= 1, …, n) to each variable to denote individuals, the prediction formula can be written

ŷi = b1xi1 + … + bpxip + c,

and substituting this into the right-hand side of yi = ŷi + ei, which expresses the dependent variable as the sum of the predicted value and an error, gives the regression analysis model (regression model)

yi = b1xi1 + … + bpxip + c + ei.

[Least squares solution for the intercept and coefficients] The least squares solution is the set of b1, …, bp, c that minimizes the sum of squared errors Σi ei² = Σi (yi − ŷi)². Writing the means of the dependent and explanatory variables as ȳ and x̄1, …, x̄p, the solution for the intercept c is

ĉ = ȳ − b1x̄1 − … − bpx̄p.

Substituting this for c in the regression model and rearranging with the mean-deviation scores ỹi = yi − ȳ and x̃ij = xij − x̄j yields

ỹi = b1x̃i1 + … + bpx̃ip + ei.

This formula shows that regression analysis of the mean-deviation scores is equivalent to analysis of the raw data, except that the intercept c becomes 0 and drops out. If the mean-deviation scores, coefficients, and errors of all n individuals are collected into the vector y = [ỹ1, …, ỹn]′, the n×p matrix X whose (i, j) element is x̃ij, the vector b = [b1, …, bp]′, and the vector e = [e1, …, en]′, then the regression model can be written y = Xb + e, and the solution for the coefficient vector b is

b̂ = (X′X)⁻¹X′y = sy DX⁻¹ RXX⁻¹ rXy.

Here RXX is the p×p matrix of correlation coefficients among the explanatory variables, rXy is the p×1 vector of correlation coefficients between the p explanatory variables and the dependent variable, DX is the diagonal matrix with the standard deviation of each explanatory variable on its diagonal, and sy is the standard deviation of the dependent variable. If regression analysis is applied to data in which all variables have been converted to standard scores with mean 0 and variance 1, so that sy = 1 and DX is the identity matrix, the results are the same as for the raw data except that sy DX⁻¹ drops out of b̂, leaving RXX⁻¹ rXy as the solution for b, and the intercept becomes 0. This solution is specifically called the standardized solution. The standardized regression coefficient in simple regression analysis equals the correlation coefficient between the dependent variable and the explanatory variable.
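A minimal numerical sketch of the least squares formulas above, assuming NumPy; the data, variable names, and coefficient values are arbitrary illustrations, not part of the original entry:

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X_raw = rng.normal(size=(n, p))                    # raw scores x_ij
y_raw = X_raw @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(size=n)

# Mean-deviation scores; the intercept drops out of this form of the model.
X = X_raw - X_raw.mean(axis=0)
y = y_raw - y_raw.mean()

# Least squares solution b^ = (X'X)^{-1} X'y.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Intercept c^ = ybar - b1*xbar1 - ... - bp*xbarp.
c_hat = y_raw.mean() - X_raw.mean(axis=0) @ b_hat

# Equivalent correlation-based form b^ = sy DX^{-1} RXX^{-1} rXy.
s_y = y_raw.std()
D_X = np.diag(X_raw.std(axis=0))
R = np.corrcoef(X_raw, y_raw, rowvar=False)        # (p+1) x (p+1) correlations
R_XX, r_Xy = R[:p, :p], R[:p, p]
b_alt = s_y * np.linalg.inv(D_X) @ np.linalg.solve(R_XX, r_Xy)
assert np.allclose(b_hat, b_alt)

# Standardized solution RXX^{-1} rXy: the coefficients obtained when all
# variables are converted to standard scores (mean 0, variance 1).
beta_std = np.linalg.solve(R_XX, r_Xy)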
[Proportion of variance explained and multiple correlation coefficient] Let

ŷi = b̂1xi1 + … + b̂pxip + ĉ

be the predicted value obtained by substituting the solutions b̂ = [b̂1, …, b̂p]′ and ĉ, together with the values of the explanatory variables, into the prediction formula. The dependent variable yi, the predicted value ŷi, and the residual êi = yi − ŷi have the following properties. (1) The mean ē of the residuals êi is 0, so the residual variance s²e based on the sum of squares Σi êi² can be regarded as the overall size of the residuals. (2) The mean of the predicted values ŷi equals the mean ȳ of the dependent variable. (3) The covariance syŷ of yi and ŷi equals the variance s²ŷ of ŷi. (4) The sum of squares of the dependent variable is partitioned as

Σi (yi − ȳ)² = Σi (ŷi − ȳ)² + Σi êi²,

which is called the decomposition of the sum of squares. From this decomposition, the variance s²y of the dependent variable yi satisfies

s²y = s²ŷ + s²e,

and dividing both sides by s²y gives

1 = s²ŷ/s²y + s²e/s²y.

Thus the ratio of the variance of the predicted values to the variance of the dependent variable, s²ŷ/s²y, takes a value between 0 and 1 and expresses the smallness of the residuals. This ratio is called the coefficient of determination or the proportion of variance accounted for, and is interpreted as the proportion of the variance of the dependent variable that is explained by the p explanatory variables. Furthermore, substituting the property s²ŷ = syŷ from (3) above into the ratio s²ŷ/s²y gives

s²ŷ/s²y = (syŷ)²/(s²y s²ŷ) = r²yŷ,

so the square of the correlation coefficient ryŷ = syŷ/(sy sŷ) between the predicted values and the dependent variable equals the proportion of variance explained. When p ≥ 2, ryŷ is called the multiple correlation coefficient between the explanatory variables and the dependent variable, and serves as an index of the correlation between several variables and a single variable. When a normal distribution is assumed, the hypothesis that the proportion of variance explained and the multiple correlation coefficient are 0 in the population can be tested by analysis of variance.

[Partial regression coefficient] When p ≥ 2, the regression coefficient bj applied to the explanatory variable xj is called the partial regression coefficient. The advantage of multiple regression analysis is that it allows us to grasp the effect of xj with the effects of the other explanatory variables removed. For example, in a simple regression analysis predicting the sales y of a product from its quality x1 alone, the effect of a variable x2 not entered into the analysis contaminates the result: when quality x1 is high, the price x2 is also high, so sales y fall and the coefficient on quality x1 comes out negative. In contrast, in a multiple regression analysis predicting sales y from both the quality x1 and the price x2 of the product, the effect of quality x1 on sales y with the effect of price x2 removed can be grasped through the partial regression coefficient b1. The t distribution is used to test the hypothesis "partial regression coefficient = 0" and for interval estimation of the coefficients. When comparing the sizes of effects on the dependent variable across explanatory variables with different variances, one must refer to the standardized partial regression coefficient, that is, the partial regression coefficient of the standardized solution.
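Continuing the sketch above (same assumed data and NumPy setup), properties (1), (2), and (4) and the equality between the proportion of variance explained and r²yŷ can be checked directly:

y_pred = X_raw @ b_hat + c_hat                     # predicted values y^_i
resid = y_raw - y_pred                             # residuals e^_i

assert np.isclose(resid.mean(), 0.0)               # property (1): mean residual is 0
assert np.isclose(y_pred.mean(), y_raw.mean())     # property (2): means agree

ss_total = np.sum((y_raw - y_raw.mean()) ** 2)
ss_model = np.sum((y_pred - y_raw.mean()) ** 2)
ss_resid = np.sum(resid ** 2)
assert np.isclose(ss_total, ss_model + ss_resid)   # (4): decomposition of sum of squares

r2 = ss_model / ss_total                           # coefficient of determination
r_yyhat = np.corrcoef(y_raw, y_pred)[0, 1]         # multiple correlation r_yy^
assert np.isclose(r2, r_yyhat ** 2)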
[Variable selection] Variable selection means finding a subset of the explanatory variables that gives the regression model a high goodness of fit, rather than entering all the explanatory variables x1, …, xp into the analysis. For example, if p = 3, regression analyses predicting y are performed for each of the seven subsets {x1}, {x2}, {x3}, {x1, x2}, {x1, x3}, {x2, x3}, {x1, x2, x3}, and the result from the subset with the best fit is adopted. One fit index is the multiple correlation coefficient adjusted for degrees of freedom, a statistic that corrects the drawback of the multiple correlation coefficient, namely that its value rises as the number of explanatory variables increases. When there are many explanatory variables, the subsets cannot be examined exhaustively, so one starts from a suitable initial set of variables and repeatedly adds explanatory variables that improve the fit, or removes variables that lower it, until a desirable set of variables is reached; this method of repeated selection and removal is called the stepwise method.

[Multicollinearity] Since the solution for the partial regression coefficients, b̂ = sy DX⁻¹ RXX⁻¹ rXy, is a function of the inverse of the correlation matrix RXX, the solution becomes unstable when the correlations among explanatory variables are very high, for example with confidence intervals of partial regression coefficients ranging from negative to positive; this phenomenon is called multicollinearity. To diagnose whether a given explanatory variable is a cause of multicollinearity, the multiple correlation coefficient between that variable and the other p − 1 explanatory variables can be used, as in the sketch below.

[Other regression analyses] When there are several (q) dependent variables, let Y be the n×q matrix of their mean-deviation scores, B the p×q matrix of partial regression coefficients, and E the error matrix; an analysis whose model can be written Y = XB + E is called multivariate regression analysis. The solution for B is given by (X′X)⁻¹X′Y, and its jth column is identical to the solution of a multiple regression analysis with the jth column of Y as the dependent variable. A multivariate regression analysis in which the p×q matrix B is constrained to equal a matrix product WV, using a matrix W with fewer columns than p and q, is called reduced-rank regression.

[Nonlinear regression model] A nonlinear regression model explains the dependent variable by a general function, not restricted to a linear form of the explanatory variables. For example, when the dependent variable yi is binary, such as correct (1) versus incorrect (0), and the explanatory variables xij are continuous, the analysis that uses the probability

P(yi = 1) = 1/(1 + exp(−(b1xi1 + … + bpxip + c)))

as the prediction formula is called logistic regression analysis. In a nonlinear regression model, when the systematic component of the distribution of the dependent variable is expressed as a linear form of the unknown parameters, the model is called a generalized linear model.
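A sketch of the multicollinearity diagnostic named above, under the same NumPy assumptions as before: each explanatory variable is regressed on the other p − 1 variables, and the multiple correlation of that regression is reported (its square underlies the usual variance inflation factor). The function name is illustrative, not from the original entry:

def multiple_correlation_with_others(X_raw):
    """For each column j, the multiple correlation between variable j and
    the remaining p - 1 explanatory variables; values near 1 flag
    variables involved in near-collinearity."""
    X = X_raw - X_raw.mean(axis=0)                 # mean-deviation scores
    p = X.shape[1]
    r = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)           # the other p - 1 columns
        b = np.linalg.solve(others.T @ others, others.T @ X[:, j])
        r[j] = np.corrcoef(X[:, j], others @ b)[0, 1]
    return r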
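The logistic prediction formula can likewise be sketched directly. Fitting the coefficients by maximum likelihood is omitted here, and the coefficient values below are purely illustrative assumptions:

def logistic_prediction(x, b, c):
    """P(y = 1) = 1 / (1 + exp(-(b1*x1 + ... + bp*xp + c)))."""
    return 1.0 / (1.0 + np.exp(-(x @ b + c)))

# Example: predicted probability of a correct answer (1) rather than an
# incorrect one (0), with made-up coefficients and explanatory values.
p_correct = logistic_prediction(np.array([1.2, -0.4]),
                                b=np.array([0.8, 0.3]), c=-0.5)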
→Causal analysis →Structural equation model →Correlation coefficient →Multivariate analysis [Kohei Adachi]

Source: Saishin Shinrigaku Jiten (The Latest Encyclopedia of Psychology)