Python has a number of statistical modules that allow us to perform analysis without R, but it is always a good idea to compare the outputs of different implementations. I performed factor analysis with Python’s scikit-learn module for my dictionary creation system, but the outputs were completely different from those of R’s factanal function, just as someone described in a post on Stack Overflow. After long hours, I finally found the cause: I hadn’t normalized the data for scikit-learn. factanal normalizes the data automatically, but scikit-learn doesn’t. The right way to perform factor analysis is this:
from sklearn import decomposition, preprocessing

data_normal = preprocessing.scale(data)  # Normalization (zero mean, unit variance)
fa = decomposition.FactorAnalysis(n_components=1)
fa.fit(data_normal)
print(fa.components_)  # Factor loadings
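To see the effect without the original data, here is a self-contained sketch using synthetic data (the data-generating step is illustrative, not my dictionary-creation data): variables are given deliberately different scales, so the loadings fitted on the raw matrix are dominated by those scales, while the loadings fitted on the standardized matrix are the ones comparable across implementations.

import numpy as np
from sklearn import decomposition, preprocessing

rng = np.random.RandomState(0)
latent = rng.normal(size=(500, 1))  # one common factor
# Seven observed variables with very different scales plus noise
data = latent * rng.uniform(1, 10, size=7) + rng.normal(size=(500, 7))

fa_raw = decomposition.FactorAnalysis(n_components=1).fit(data)
fa_std = decomposition.FactorAnalysis(n_components=1).fit(preprocessing.scale(data))

print(fa_raw.components_)  # loadings driven by each variable's scale
print(fa_std.components_)  # loadings on standardized data, comparable to factanal's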
If we do it like this, the factor loadings estimated by scikit-learn become very close to R’s estimates:
Variable  Python (Scikit-learn)  R (factanal)
1         0.24705429             0.285719656390773
2         0.56100678             0.633553717909623
3         0.48559474             0.493731965398187
4         0.54208185             0.527418210503982
5         0.50989289             0.487150249901473
6         0.33028625             0.312724093202758
7         0.38651951             0.378827084637606
Update on 11/5/2020: a new Python package for factor analysis has been released. I haven’t tried it yet, but it looks great. Please see the comment below.
Good one! I agree that you can learn a lot by comparing different implementations, and it sometimes takes some time to understand why they differ. I have also struggled with factor analysis in Python as I was used to factanal, and hence created a Python package that wraps the R factanal function so that you can call it from Python with a pandas data frame, as sketched below.
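A minimal sketch of such a call, assuming R and rpy2 are installed (the argument names here are assumptions based on the package description; see the PyPI page below for the exact signature):

import numpy as np
import pandas as pd
from factanal.wrapper import factanal  # import path as published on PyPI

# Illustrative data: any numeric pandas DataFrame works here.
df = pd.DataFrame(np.random.normal(size=(100, 7)),
                  columns=[f'v{i}' for i in range(1, 8)])

# The keyword arguments mirror R's factanal(); names are assumptions,
# so consult the package documentation for specifics.
result = factanal(df, factors=1, rotation='promax', verbose=True)
print(result)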
For everyone interested, you can find more information here: https://pypi.org/project/factanal/