Cluster analysis (Japanese: クラスター分析, kurasutā bunseki; English: cluster analysis, clustering)
This refers to a family of methods for classifying surveyed items (variables), individuals, organizations (cases), etc., using statistical information, when they are thought to consist of heterogeneous groups or populations. When classifying individuals, a grouping (classification) is sought, based on the p-variable data xi = [xi1, ..., xip]′ of each individual i (i = 1, ..., n) or on similarity data between individuals, such that similar individuals belong to the same group (cluster) and dissimilar ones belong to different groups. In what follows, when a set of variables is being classified, "individual i" can be read as "variable i". Cluster analysis is the general name for such statistical methods, which are broadly divided into hierarchical and non-hierarchical cluster analysis.

The figure illustrates the principle of hierarchical clustering. The dendrogram on the right, which is the result of analyzing the data x1 = [4, 1]′, x2 = [1, 5]′, x3 = [5, 4]′, x4 = [1, 3]′, and x5 = [5, 1]′ scattered as in the plot on the left, is obtained through the following three steps. (1) Compute the distances between the five points in the scatter plot and merge the closest pair, x1 and x5, into one group C1; this merger appears as the junction C1 in the dendrogram on the right. (2) Take as the representative point of group C1 the centroid of its member points, c1 = 0.5(x1 + x5) = [4.5, 1]′, compute the distances among c1, x2, x3, and x4, and merge the closest pair, x2 and x4, into group C2; this appears as the junction C2 on the right. (3) Compute the distances among C2's representative point c2 = 0.5(x2 + x4), c1, and x3; since x3 and c1 are the closest, merge x3 into C1. This merger appears as the junction C3 on the right.
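The three merge steps above can be sketched as a small centroid-method routine (a minimal illustration written for this article, not taken from the source; the coordinates are those of the example, and all variable names are hypothetical):

```python
import math

# The five data points from the example in the figure.
points = {
    "x1": (4, 1), "x2": (1, 5), "x3": (5, 4), "x4": (1, 3), "x5": (5, 1),
}

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def centroid(pts):
    """Centre of gravity of a list of points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

# Start with each individual in its own cluster; repeatedly merge the pair
# of clusters whose centroids are closest (the centroid method).
clusters = {name: [p] for name, p in points.items()}
merges = []
while len(clusters) > 1:
    names = list(clusters)
    a, b = min(
        ((p, q) for i, p in enumerate(names) for q in names[i + 1:]),
        key=lambda pq: dist(centroid(clusters[pq[0]]), centroid(clusters[pq[1]])),
    )
    clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    merges.append(a + "+" + b)

print(merges)
```

Running this reproduces the order of the article's three steps: x1 and x5 merge first, then x2 and x4, then x3 joins the {x1, x5} cluster; a fourth, final merge completes the dendrogram.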

Hierarchical analysis is subdivided into several sub-methods according to how steps (2) and (3) above are carried out. The method illustrated by the figure above is called the centroid method; it is characterized by using centroids to compute the distances between a group and an individual and between groups. Other definitions of the inter-group distance include the group average method, which uses the average of the squared distances between individuals belonging to the two groups; the nearest neighbor method, which uses the shortest such distance; the furthest neighbor method, which uses the longest; and Ward's method, which defines the distance between groups A and B as the within-group inter-individual distance of the group obtained by merging A and B, minus those of A and of B, i.e., the increase in within-group inter-individual distance caused by the merger.
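The alternative inter-group distances can be compared on the two groups from the figure's example (a hedged sketch written for this article; the grouping comes from the example above, the code and names are not from the source):

```python
import math
from itertools import product

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

A = [(4, 1), (5, 1)]   # group {x1, x5} from the example
B = [(1, 5), (1, 3)]   # group {x2, x4} from the example

# All pairwise distances between an individual in A and one in B.
pair_dists = [dist(a, b) for a, b in product(A, B)]

single   = min(pair_dists)   # nearest neighbor method: shortest distance
complete = max(pair_dists)   # furthest neighbor method: longest distance
# Group average method as described in the text: average of squared distances.
average_sq = sum(d * d for d in pair_dists) / len(pair_dists)

print(single, complete, average_sq)
```

Each criterion summarizes the same four inter-individual distances differently, which is why the sub-methods can produce different dendrograms on the same data.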

Methods that do not merge individuals or groups hierarchically (sequentially), but instead define a statistically ideal classification by an objective function and optimize it, are collectively called nonhierarchical clustering. In the representative method, the K-means method, the gik that minimizes

  f(gik) = Σ(i=1..n) Σ(k=1..K) gik ∥xi − x̄k∥²

is found. Here, k (= 1, ..., K) indexes the groups; gi1, ..., giK are parameters that equal 1 only for the group to which individual i belongs and 0 for all others; x̄k is the average (centroid) of the data of the individuals belonging to group k; and ∥xi − x̄k∥ is the distance between xi and x̄k. The gik minimizing the objective function f(gik) thus represents the classification that minimizes the sum of squared distances between each individual and the average of the cluster containing it.
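A standard way to search for the minimizing gik is Lloyd's algorithm, which alternates between assigning each individual to its nearest centroid and recomputing each centroid (a minimal sketch written for this article, applied to the example's five points; the function name and parameters are hypothetical):

```python
import math
import random

def k_means(data, K, iters=20, seed=0):
    """Lloyd's-algorithm sketch of the K-means method: alternately set
    g_ik = 1 for the nearest centroid k, then recompute each centroid
    x̄_k as the mean of its members."""
    rng = random.Random(seed)
    centroids = rng.sample(data, K)
    labels = [0] * len(data)
    for _ in range(iters):
        # Assignment step: each individual joins its nearest centroid.
        labels = [
            min(range(K), key=lambda k: math.dist(x, centroids[k]))
            for x in data
        ]
        # Update step: each centroid becomes the mean of its cluster.
        for k in range(K):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:
                centroids[k] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids

data = [(4, 1), (1, 5), (5, 4), (1, 3), (5, 1)]  # x1..x5 from the figure
labels, centroids = k_means(data, K=2)
```

With K = 2 on these points, x2 and x4 end up in one cluster and x1 and x5 in the other, matching the lower branches of the dendrogram. Note that the result of K-means can depend on the random initial centroids, since the algorithm only finds a local optimum of f(gik).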

The K-means method does not allow an individual to belong to more than one group. One non-hierarchical method that does, developed in the field of quantitative psychology, is ADCLUS (additive clustering). Based on the similarity data sij between individuals i and j, it finds the binary gik (1 or 0) and continuous weights wk ≥ 0 that minimize

  Σ(i<j) ( sij − Σ(k=1..K) wk gik gjk )².

Its aim is easy to grasp if i and j are called stimuli and group k is called feature k. That is, gik gjk = 1 means that the two stimuli share feature k with weight wk, and ADCLUS seeks to describe the similarities by the sum of the weights wk of the shared features. →Multivariate analysis [Adachi Kohei]
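The ADCLUS objective can be written down directly (a toy sketch written for this article; the 3-stimulus similarity matrix, memberships, and weights below are hypothetical, chosen only to show how shared features reproduce the similarities):

```python
def adclus_loss(S, G, w):
    """Least-squares loss of the ADCLUS model: each similarity s_ij is
    approximated by the sum of the weights w_k over the features
    (clusters) k that stimuli i and j share (g_ik = g_jk = 1)."""
    n = len(S)
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            fitted = sum(w[k] for k in range(len(w)) if G[i][k] and G[j][k])
            loss += (S[i][j] - fitted) ** 2
    return loss

# Hypothetical data: stimuli 0 and 1 share feature 0 (weight 0.8),
# stimuli 1 and 2 share feature 1 (weight 0.5); stimulus 1 belongs
# to both features, which K-means would not allow.
S = [[0.0, 0.8, 0.1],
     [0.8, 0.0, 0.5],
     [0.1, 0.5, 0.0]]
G = [[1, 0],
     [1, 1],
     [0, 1]]
w = [0.8, 0.5]

print(adclus_loss(S, G, w))
```

Only the pair (0, 2), which shares no feature, contributes to the loss here; an ADCLUS fitting procedure would search over G and w to make such residuals as small as possible.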
Figure: Principle of hierarchical cluster analysis

Source: Saishin Shinrigaku Jiten (Latest Psychology Encyclopedia)

