FAMD Explained: A Beginner’s Guide to Factor Analysis of Mixed Data
Factor Analysis of Mixed Data (FAMD) is a dimensionality-reduction technique designed specifically for datasets that contain both numerical (continuous) and categorical (qualitative) variables. It blends ideas from Principal Component Analysis (PCA), which handles quantitative variables, and Multiple Correspondence Analysis (MCA), which handles categorical variables. FAMD helps reveal the main structures, patterns, and relationships in mixed datasets while reducing their dimensionality for visualization, clustering, or further modeling.
When and why use FAMD
- Use FAMD when your dataset contains a mix of numerical and categorical variables and you want a single unified method to analyze them.
- FAMD preserves the dual nature of variables: quantitative variables are treated in a PCA-like fashion, and categorical variables are treated in an MCA-like fashion. This balanced treatment prevents one type of variable from dominating the analysis.
- It’s useful for exploratory data analysis (EDA), visualization (reducing to 2–3 dimensions for plotting), preprocessing before clustering or classification, and for interpreting relationships between individuals (observations) and variables.
Key concepts and intuition
- Each quantitative variable is standardized and therefore contributes one unit of variance, as in PCA. Each categorical variable is expanded into a set of binary indicator (dummy) variables (one per level), and MCA-like weighting keeps its influence on any single dimension comparable to that of a quantitative variable.
- FAMD finds a set of principal components (dimensions) that maximize explained variance across both types of variables simultaneously. Each component is a linear combination of quantitative variables and indicator variables from categories.
- Individuals (rows) are projected into the low-dimensional space; their coordinates reflect similarity across both quantitative and categorical features. Variables (or categories) can also be projected to interpret which features drive each dimension.
Mathematical overview (concise)
- Let X_q be the matrix of standardized quantitative variables and Z be the indicator matrix for categorical variables, with each column centered and divided by the square root of its category's proportion (the MCA-style weighting).
- FAMD performs a singular value decomposition (SVD) on the concatenated, appropriately scaled matrix [X_q | Z]. The left singular vectors, scaled by the singular values, give individual coordinates; the right singular vectors give the loadings from which variable contributions are computed.
- Eigenvalues from the decomposition (proportional to the squared singular values) represent the inertia (variance) explained by each component. Scree plots and cumulative explained inertia guide how many components to retain.
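The overview above can be written out as a short set of formulas. This is a sketch of one common formulation, not the only possible convention: n individuals with uniform weights, s_j taken as the population standard deviation, G the raw 0/1 indicator matrix, p_k the proportion of individuals in category k, Q the number of quantitative variables, and K_j the number of levels of categorical variable j. All symbols other than X_q and Z are introduced here for illustration.

```latex
% Assumed notation: n, G, p_k, Q, K_j, U, \Sigma, V, F are not in the original text.
\begin{align*}
(X_q)_{ij} &= \frac{x_{ij} - \bar{x}_j}{s_j}, &
Z_{ik} &= \frac{G_{ik} - p_k}{\sqrt{p_k}}, \qquad p_k = \frac{1}{n}\sum_{i=1}^{n} G_{ik} \\
[\,X_q \mid Z\,] &= U \Sigma V^{\top}, &
F &= U \Sigma \quad \text{(individual coordinates)} \\
\lambda_s &= \frac{\sigma_s^2}{n}, &
\sum_{s} \lambda_s &= Q + \sum_{j} (K_j - 1)
\end{align*}
```

The last identity follows because each standardized quantitative column has variance 1, while a scaled indicator column has variance 1 - p_k, and the proportions of one categorical variable sum to 1.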
Steps to run FAMD (practical)
- Data cleaning: handle missing values (imputation or removal) and ensure categorical levels are meaningful.
- Standardize quantitative variables (mean 0, variance 1).
- Encode categorical variables as indicator/dummy variables; apply the MCA weighting (divide each indicator column by the square root of its category's proportion, then center it).
- Apply SVD to the combined matrix.
- Examine eigenvalues, variable contributions, and individual coordinates. Visualize individuals and variables on the first two dimensions.
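The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming complete data (no missing values), uniform individual weights, and the square-root-of-proportion weighting for indicator columns. The function name famd and the toy values for age, income, occupation, and education are hypothetical, and the sketch is not the implementation used by any particular library.

```python
import numpy as np


def famd(quant, cats, n_components=2):
    """Minimal FAMD sketch (hypothetical helper, not a library API).

    quant: (n, q) array-like of numeric columns.
    cats:  list of length-n sequences, one per categorical variable.
    Returns individual coordinates, eigenvalues, and explained-inertia shares.
    """
    quant = np.asarray(quant, dtype=float)
    n = quant.shape[0]

    # Standardize quantitative variables (mean 0, population variance 1).
    x_q = (quant - quant.mean(axis=0)) / quant.std(axis=0)

    # Indicator-encode each categorical variable, then apply MCA-style
    # weighting: center each column and divide by sqrt(category proportion).
    blocks = []
    for col in cats:
        levels, codes = np.unique(np.asarray(col), return_inverse=True)
        g = np.eye(len(levels))[codes]      # n x K indicator matrix
        p = g.mean(axis=0)                  # category proportions
        blocks.append((g - p) / np.sqrt(p))

    # SVD of the combined matrix [X_q | Z].
    m = np.hstack([x_q] + blocks)
    u, s, vt = np.linalg.svd(m, full_matrices=False)

    eigenvalues = s**2 / n                  # inertia per component
    explained = eigenvalues / eigenvalues.sum()
    coords = u * s                          # individual coordinates
    k = n_components
    return coords[:, :k], eigenvalues[:k], explained[:k]


# Hypothetical toy data: age (years), income (thousands), two categoricals.
age = [23, 35, 47, 52, 29, 41, 60, 33]
income = [28, 52, 75, 81, 39, 66, 58, 45]
occupation = ["clerical", "professional", "professional", "manager",
              "clerical", "manager", "professional", "clerical"]
education = ["secondary", "tertiary", "tertiary", "tertiary",
             "secondary", "tertiary", "secondary", "secondary"]

quant = np.column_stack([age, income])
cats = [occupation, education]

coords, eigenvalues, explained = famd(quant, cats, n_components=2)
print("Eigenvalues:", np.round(eigenvalues, 3))
print("Explained inertia:", np.round(explained, 3))
print("First individual:", np.round(coords[0], 3))
```

The signs of the coordinates are arbitrary, as with any SVD-based method, so a plot may appear mirrored relative to output from another tool.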
Interpretation tips
- Plot individuals on the first two dimensions (scatterplot). Clusters suggest groups with similar mixed-variable profiles.
- Plot variable points: quantitative variables appear as arrows, as in a PCA correlation circle; categories appear as points. Categories near a particular region indicate that individuals in that region often have that category.
- Use contribution and squared cosine (cos2) metrics to identify which variables/categories contribute most to a dimension and how well a point is represented by the selected dimensions.
- Beware of over-interpreting dimensions that explain little inertia; small eigenvalues may capture noise.
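Contribution and cos2 can be computed directly from the SVD of the scaled matrix. The sketch below assumes uniform individual weights and a matrix that has already been centered and scaled as FAMD requires; the function name diagnostics and the six-row toy matrix are hypothetical. Libraries may report these quantities on a different scale (for example as percentages).

```python
import numpy as np


def diagnostics(m):
    """Contributions and cos2 from a centered, scaled matrix (sketch)."""
    m = np.asarray(m, dtype=float)
    n = m.shape[0]
    u, s, vt = np.linalg.svd(m, full_matrices=False)

    # Drop numerically null components so we never divide by zero.
    keep = s > 1e-12 * s.max()
    u, s, vt = u[:, keep], s[keep], vt[keep]

    coords = u * s                  # individual coordinates
    eigenvalues = s**2 / n          # inertia per component

    # Share of each dimension's inertia due to each individual.
    row_contrib = coords**2 / (n * eigenvalues)
    # Quality of representation of each individual on each dimension.
    # Assumes no individual sits exactly at the origin.
    row_cos2 = coords**2 / (coords**2).sum(axis=1, keepdims=True)
    # Share of each dimension's inertia due to each column
    # (a quantitative variable or a single category).
    col_contrib = vt.T**2
    return row_contrib, row_cos2, col_contrib


# Hypothetical toy matrix: one numeric variable and one two-level category.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
g = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
x_q = ((x - x.mean()) / x.std())[:, None]
p = g.mean(axis=0)
z = (g - p) / np.sqrt(p)
m = np.hstack([x_q, z])

row_contrib, row_cos2, col_contrib = diagnostics(m)
print("Column contributions to dimension 1:", np.round(col_contrib[:, 0], 3))
print("cos2 of first individual:", np.round(row_cos2[0], 3))
```

To get the contribution of a whole categorical variable, sum the contributions of its categories. An individual with a low cos2 on the plotted dimensions is poorly represented there, so its position on the plot should be read with caution.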
Example use cases
- Market research: combine purchase frequency (numeric), customer segment (categorical), and satisfaction scores (numeric) to profile customers.
- Social science surveys: mix demographics (categorical), income (numeric), and attitudes (Likert scales) to explore respondent typologies.
- Medicine: combine lab measurements (numeric) with categorical diagnostic codes or treatment groups.
R and Python tools
- R: FactoMineR::FAMD or the PCAmixdata package offer FAMD implementations with plotting and interpretation functions.
- Python: prince (a library implementing MCA, FAMD, and related methods) can be used. scikit-learn does not natively implement FAMD, but PCA applied after the preprocessing described above (standardized numeric columns, weighted indicator columns) can approximate it.
Pitfalls and best practices
- If categorical variables have many rare levels, consider grouping small levels to avoid sparse indicator matrices that add noise.
- Missing data: imputation methods that respect variable types (e.g., multiple imputation, k-NN) are preferable.
- Scaling choices matter: FAMD’s built-in weighting is designed to balance variable types, so avoid reweighting variables without understanding the consequences.
- Validate findings with downstream methods (clustering, classification) and, when possible, with cross-validation or holdout sets.
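Grouping rare levels, as suggested in the first point above, can be done before encoding. The sketch below is one simple approach using only the standard library; the function name group_rare_levels, the 10% threshold, and the "Other" label are illustrative assumptions, and the right threshold depends on the dataset.

```python
from collections import Counter


def group_rare_levels(values, min_share=0.05, other_label="Other"):
    """Replace levels whose relative frequency is below min_share."""
    counts = Counter(values)
    n = len(values)
    rare = {level for level, c in counts.items() if c / n < min_share}
    return [other_label if v in rare else v for v in values]


# Hypothetical example: two occupations appear only once in 20 rows.
occupations = ["clerical"] * 10 + ["manager"] * 8 + ["artist", "pilot"]
grouped = group_rare_levels(occupations, min_share=0.10)
print(Counter(grouped))
```

Merging levels changes what the resulting category means, so it is worth checking that the merged group is still interpretable before running FAMD on it.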
Quick practical example (conceptual)
Suppose you have a dataset with age (numeric), income (numeric), occupation (categorical), and education level (categorical). FAMD will standardize age and income, convert occupation and education into indicators with MCA-type weighting, then extract components capturing the main axes of variation — for example, a socioeconomic dimension (high income, higher education, professional occupations) and an age-related dimension.
Summary
FAMD is a powerful and interpretable technique for exploratory analysis of mixed-type datasets. By combining PCA and MCA principles, it balances quantitative and categorical variables, producing components that can be visualized and used for further analysis such as clustering or as features for supervised models.