Regression analysis is a method that divides the variables of interest into explanatory variables (also called predictor variables or independent variables), which are used to explain or predict, and dependent variables (also called criterion variables), which serve as the criterion to be explained or predicted; a statistical model is set up between the two, and the relationship between them is inferred. These statistical models are broadly divided into linear regression models and nonlinear regression models.

[Linear regression model] A linear regression model predicts and explains a dependent variable y from p explanatory variables x1, …, xp by the prediction formula

ŷ = b1x1 + … + bpxp + c,

in which each explanatory variable is multiplied by a regression coefficient bj (j = 1, …, p) and an intercept c is added to their sum. Linear regression analysis refers to the statistical method that determines the b1, …, bp, c minimizing the magnitude of the error e = y − ŷ between this prediction and the dependent variable.

[Simple regression analysis and multiple regression analysis] When we wish to distinguish the analysis with p = 1, that is, ŷ = bx + c, from an analysis with p ≥ 2, the former is called simple regression analysis and the latter multiple regression analysis. Attaching the subscript i (= 1, …, n) to each variable to denote individuals, the prediction formula can be written

ŷi = b1xi1 + … + bpxip + c,

and substituting this into the right-hand side of yi = ŷi + ei, which expresses the dependent variable as the sum of the predicted value and an error, gives the regression analysis model (regression model)

yi = b1xi1 + … + bpxip + c + ei.

[Least squares solution for the intercept and coefficients] The least squares solution is the set of b1, …, bp, c that minimizes the sum of squared errors Σi ei² = Σi (yi − ŷi)². Writing the means of the dependent and explanatory variables as ȳ and x̄1, …, x̄p, the solution for the intercept c is

ĉ = ȳ − b1x̄1 − … − bpx̄p.

Substituting this for c in the regression model and rearranging with the mean-deviation scores ỹi = yi − ȳ and x̃ij = xij − x̄j yields

ỹi = b1x̃i1 + … + bpx̃ip + ei.

This formula shows that regression analysis of the mean-deviation scores is equivalent to analysis of the raw data, except that the intercept c becomes 0 and drops out. If the mean-deviation scores, coefficients, and errors of all n individuals are collected into the vector y = [ỹ1, …, ỹn]′, the n×p matrix X whose (i, j) element is x̃ij, the vector b = [b1, …, bp]′, and the vector e = [e1, …, en]′, then the regression model can be written y = Xb + e, and the solution for the coefficient vector b is

b̂ = (X′X)⁻¹X′y = sy DX⁻¹ RXX⁻¹ rXy.

Here RXX is the p×p matrix of correlation coefficients among the explanatory variables, rXy is the p×1 vector of correlation coefficients between the p explanatory variables and the dependent variable, DX is the diagonal matrix with the standard deviation of each explanatory variable on its diagonal, and sy is the standard deviation of the dependent variable. If regression analysis is applied to data in which all variables have been converted to standard scores with mean 0 and variance 1, so that sy = 1 and DX is the identity matrix, the results are the same as for the raw data except that sy DX⁻¹ drops out of b̂, leaving RXX⁻¹ rXy as the solution for b, and the intercept becomes 0. This solution is specifically called the standardized solution. The standardized regression coefficient in simple regression analysis equals the correlation coefficient between the dependent variable and the explanatory variable.
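A minimal numerical sketch of the least squares formulas above, assuming NumPy; the data, variable names, and coefficient values are arbitrary illustrations, not part of the original entry:

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X_raw = rng.normal(size=(n, p))                    # raw scores x_ij
y_raw = X_raw @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(size=n)

# Mean-deviation scores; the intercept drops out of this form of the model.
X = X_raw - X_raw.mean(axis=0)
y = y_raw - y_raw.mean()

# Least squares solution b^ = (X'X)^{-1} X'y.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Intercept c^ = ybar - b1*xbar1 - ... - bp*xbarp.
c_hat = y_raw.mean() - X_raw.mean(axis=0) @ b_hat

# Equivalent correlation-based form b^ = sy DX^{-1} RXX^{-1} rXy.
s_y = y_raw.std()
D_X = np.diag(X_raw.std(axis=0))
R = np.corrcoef(X_raw, y_raw, rowvar=False)        # (p+1) x (p+1) correlations
R_XX, r_Xy = R[:p, :p], R[:p, p]
b_alt = s_y * np.linalg.inv(D_X) @ np.linalg.solve(R_XX, r_Xy)
assert np.allclose(b_hat, b_alt)

# Standardized solution RXX^{-1} rXy: the coefficients obtained when all
# variables are converted to standard scores (mean 0, variance 1).
beta_std = np.linalg.solve(R_XX, r_Xy)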
[Proportion of variance explained and multiple correlation coefficient] Let

ŷi = b̂1xi1 + … + b̂pxip + ĉ

be the predicted value obtained by substituting the solutions b̂ = [b̂1, …, b̂p]′ and ĉ, together with the values of the explanatory variables, into the prediction formula. The dependent variable yi, the predicted value ŷi, and the residual êi = yi − ŷi have the following properties. (1) The mean ē of the residuals êi is 0, so the residual variance s²e based on the sum of squares Σi êi² can be regarded as the overall size of the residuals. (2) The mean of the predicted values ŷi equals the mean ȳ of the dependent variable. (3) The covariance syŷ of yi and ŷi equals the variance s²ŷ of ŷi. (4) The sum of squares of the dependent variable is partitioned as

Σi (yi − ȳ)² = Σi (ŷi − ȳ)² + Σi êi²,

which is called the decomposition of the sum of squares. From this decomposition, the variance s²y of the dependent variable yi satisfies

s²y = s²ŷ + s²e,

and dividing both sides by s²y gives

1 = s²ŷ/s²y + s²e/s²y.

Thus the ratio of the variance of the predicted values to the variance of the dependent variable, s²ŷ/s²y, takes a value between 0 and 1 and expresses the smallness of the residuals. This ratio is called the coefficient of determination or the proportion of variance accounted for, and is interpreted as the proportion of the variance of the dependent variable that is explained by the p explanatory variables. Furthermore, substituting the property s²ŷ = syŷ from (3) above into the ratio s²ŷ/s²y gives

s²ŷ/s²y = (syŷ)²/(s²y s²ŷ) = r²yŷ,

so the square of the correlation coefficient ryŷ = syŷ/(sy sŷ) between the predicted values and the dependent variable equals the proportion of variance explained. When p ≥ 2, ryŷ is called the multiple correlation coefficient between the explanatory variables and the dependent variable, and serves as an index of the correlation between several variables and a single variable. When a normal distribution is assumed, the hypothesis that the proportion of variance explained and the multiple correlation coefficient are 0 in the population can be tested by analysis of variance.

[Partial regression coefficient] When p ≥ 2, the regression coefficient bj applied to the explanatory variable xj is called the partial regression coefficient. The advantage of multiple regression analysis is that it allows us to grasp the effect of xj with the effects of the other explanatory variables removed. For example, in a simple regression analysis predicting the sales y of a product from its quality x1 alone, the effect of a variable x2 not entered into the analysis contaminates the result: when quality x1 is high, the price x2 is also high, so sales y fall and the coefficient on quality x1 comes out negative. In contrast, in a multiple regression analysis predicting sales y from both the quality x1 and the price x2 of the product, the effect of quality x1 on sales y with the effect of price x2 removed can be grasped through the partial regression coefficient b1. The t distribution is used to test the hypothesis "partial regression coefficient = 0" and for interval estimation of the coefficients. When comparing the sizes of effects on the dependent variable across explanatory variables with different variances, one must refer to the standardized partial regression coefficient, that is, the partial regression coefficient of the standardized solution.
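Continuing the sketch above (same assumed data and NumPy setup), properties (1), (2), and (4) and the equality between the proportion of variance explained and r²yŷ can be checked directly:

y_pred = X_raw @ b_hat + c_hat                     # predicted values y^_i
resid = y_raw - y_pred                             # residuals e^_i

assert np.isclose(resid.mean(), 0.0)               # property (1): mean residual is 0
assert np.isclose(y_pred.mean(), y_raw.mean())     # property (2): means agree

ss_total = np.sum((y_raw - y_raw.mean()) ** 2)
ss_model = np.sum((y_pred - y_raw.mean()) ** 2)
ss_resid = np.sum(resid ** 2)
assert np.isclose(ss_total, ss_model + ss_resid)   # (4): decomposition of sum of squares

r2 = ss_model / ss_total                           # coefficient of determination
r_yyhat = np.corrcoef(y_raw, y_pred)[0, 1]         # multiple correlation r_yy^
assert np.isclose(r2, r_yyhat ** 2)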
[Variable selection] Variable selection means finding a subset of the explanatory variables that gives the regression model a high goodness of fit, rather than entering all the explanatory variables x1, …, xp into the analysis. For example, if p = 3, regression analyses predicting y are performed for each of the seven subsets {x1}, {x2}, {x3}, {x1, x2}, {x1, x3}, {x2, x3}, {x1, x2, x3}, and the result from the subset with the best fit is adopted. One fit index is the multiple correlation coefficient adjusted for degrees of freedom, a statistic that corrects the drawback of the multiple correlation coefficient, namely that its value rises as the number of explanatory variables increases. When there are many explanatory variables, the subsets cannot be examined exhaustively, so one starts from a suitable initial set of variables and repeatedly adds explanatory variables that improve the fit, or removes variables that lower it, until a desirable set of variables is reached; this method of repeated selection and removal is called the stepwise method.

[Multicollinearity] Since the solution for the partial regression coefficients, b̂ = sy DX⁻¹ RXX⁻¹ rXy, is a function of the inverse of the correlation matrix RXX, the solution becomes unstable when the correlations among explanatory variables are very high, for example with confidence intervals of partial regression coefficients ranging from negative to positive; this phenomenon is called multicollinearity. To diagnose whether a given explanatory variable is a cause of multicollinearity, the multiple correlation coefficient between that variable and the other p − 1 explanatory variables can be used, as in the sketch below.

[Other regression analyses] When there are several (q) dependent variables, let Y be the n×q matrix of their mean-deviation scores, B the p×q matrix of partial regression coefficients, and E the error matrix; an analysis whose model can be written Y = XB + E is called multivariate regression analysis. The solution for B is given by (X′X)⁻¹X′Y, and its jth column is identical to the solution of a multiple regression analysis with the jth column of Y as the dependent variable. A multivariate regression analysis in which the p×q matrix B is constrained to equal a matrix product WV, using a matrix W with fewer columns than p and q, is called reduced-rank regression.

[Nonlinear regression model] A nonlinear regression model explains the dependent variable by a general function, not restricted to a linear form of the explanatory variables. For example, when the dependent variable yi is binary, such as correct (1) versus incorrect (0), and the explanatory variables xij are continuous, the analysis that uses the probability

P(yi = 1) = 1/(1 + exp(−(b1xi1 + … + bpxip + c)))

as the prediction formula is called logistic regression analysis. In a nonlinear regression model, when the systematic component of the distribution of the dependent variable is expressed as a linear form of the unknown parameters, the model is called a generalized linear model.
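A sketch of the multicollinearity diagnostic named above, under the same NumPy assumptions as before: each explanatory variable is regressed on the other p − 1 variables, and the multiple correlation of that regression is reported (its square underlies the usual variance inflation factor). The function name is illustrative, not from the original entry:

def multiple_correlation_with_others(X_raw):
    """For each column j, the multiple correlation between variable j and
    the remaining p - 1 explanatory variables; values near 1 flag
    variables involved in near-collinearity."""
    X = X_raw - X_raw.mean(axis=0)                 # mean-deviation scores
    p = X.shape[1]
    r = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)           # the other p - 1 columns
        b = np.linalg.solve(others.T @ others, others.T @ X[:, j])
        r[j] = np.corrcoef(X[:, j], others @ b)[0, 1]
    return r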
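The logistic prediction formula can likewise be sketched directly. Fitting the coefficients by maximum likelihood is omitted here, and the coefficient values below are purely illustrative assumptions:

def logistic_prediction(x, b, c):
    """P(y = 1) = 1 / (1 + exp(-(b1*x1 + ... + bp*xp + c)))."""
    return 1.0 / (1.0 + np.exp(-(x @ b + c)))

# Example: predicted probability of a correct answer (1) rather than an
# incorrect one (0), with made-up coefficients and explanatory values.
p_correct = logistic_prediction(np.array([1.2, -0.4]),
                                b=np.array([0.8, 0.3]), c=-0.5)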
→Causal analysis →Structural equation model →Correlation coefficient →Multivariate analysis [Kohei Adachi]

Source: Saishin Shinrigaku Jiten (The Latest Encyclopedia of Psychology)