Data mining techniques – Bonferroni Rule by Rob Sebastian (LinkedIn)

Gautam suggested yesterday in class that, when data mining with a large number of potential variables, roughly 5 out of 100 will appear statistically significant by sheer chance at a 95% confidence level (p < 0.05).  Kartik correctly suggested that you can and should use your intuition to weed out potential overfitting, but I also wanted to follow up on my comment from class.  There is a method – the Bonferroni Rule – intended to correct for such overfitting when working with a large number of potential variables.  Under the Bonferroni Rule, you reject H0 only if p < α/n, where α is the desired significance level for the test of H0 and n is the number of variables under consideration.  So in the case of our 95% CI, α = 0.05, and if you’re considering 100 variables then p must be less than 0.05 / 100 = 0.0005 for inclusion.  This more stringent threshold corrects for the sheer number of variables you are considering when data mining.  The linked .pdf is from an excellent lecture by Professor Bob Stine on over-fitting from his STAT 622: Statistical Modeling elective.
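To make the arithmetic concrete, here is a minimal Python sketch of the rule as described above; the variable names and p-values are made up purely for illustration.

```python
# Bonferroni Rule sketch: with n candidate variables and a desired overall
# significance level alpha, reject H0 for a variable only if its p-value
# falls below alpha / n.

alpha = 0.05           # desired overall significance level
n = 100                # number of candidate variables considered
threshold = alpha / n  # Bonferroni-adjusted threshold: 0.05 / 100 = 0.0005

# Hypothetical p-values from testing each candidate variable
p_values = {"var_a": 0.0001, "var_b": 0.003, "var_c": 0.02, "var_d": 0.6}

for name, p in p_values.items():
    decision = "reject H0" if p < threshold else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {decision} (threshold = {threshold:.4f})")
```

Only var_a clears the adjusted threshold here, even though three of the four p-values would look "significant" against the unadjusted 0.05 cutoff.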

Comscore US Digital Year in Review

Pablo Lema (LinkedIn) found this great piece of research on U.S. digital media trends in 2009.  Enjoy!

http://www.comscore.com/Press_Events/Presentations_Whitepapers/2010/The_2009_U.S._Digital_Year_in_Review