P. Urruchi Mohino, J. López Fidalgo
We present a framework for dimension reduction applied to text corpora of tweets, with the aim of preserving predictive power in regression tasks. The main objective is to find optimal subsamples and key dimensions in an unstructured, high-volume data problem.
Sufficient dimension reduction is obtained by projecting x, the sparse matrix formed by the tokens (units) of the text corpora and their frequencies, onto beta, a sparse vector of coefficients from an L1-regularised linear model. These coefficients are estimated by Maximum A Posteriori (MAP) rather than Maximum Likelihood, since we propose prior distributions for them. The optimisation problem constrained by the Lasso penalty leads to a simpler model in which we expect beta to be sparse.
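As an illustration of the projection step, the following is a minimal sketch rather than the authors' implementation. It assumes Gaussian noise and independent Laplace priors on the coefficients, under which the MAP estimate coincides with the Lasso solution; the abstract does not state which priors are used. The toy documents, the responses y, the penalty value lam, the omission of an intercept, and all function names are hypothetical.

```python
# Toy sketch: L1-regularised linear regression on token counts, then projection.
# The data, the penalty value and every name below are illustrative assumptions.

def soft_threshold(z, t):
    """Soft-thresholding operator used in Lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def term_frequency_matrix(docs):
    """Return (vocab, X), where X[i][j] counts token j in document i."""
    vocab = sorted({tok for d in docs for tok in d.lower().split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    X = [[0.0] * len(vocab) for _ in docs]
    for i, d in enumerate(docs):
        for tok in d.lower().split():
            X[i][index[tok]] += 1.0
    return vocab, X

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = list(y)  # residual y - X beta, with beta = 0 at the start
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes beta[j].
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta

def project(X, beta):
    """Project each document's token counts onto beta (one score per document)."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

# Hypothetical toy corpus and responses.
docs = ["great match today", "terrible service today", "great great food",
        "terrible delay again", "food today"]
y = [1.0, -1.0, 2.0, -1.0, 0.0]
lam = 0.1

vocab, X = term_frequency_matrix(docs)
beta = lasso_coordinate_descent(X, y, lam)
z = project(X, beta)  # one-dimensional reduced representation of each document
```

Here z holds one scalar per document, so the vocabulary-sized representation is reduced to a single direction. Raising lam drives more entries of beta to exactly zero, and a sufficiently large value zeroes all of them.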
Regarding data reduction, we expect to formulate an Optimal Experimental Design that selects the most informative data points.
Keywords: Big Data, Optimal Experimental Design, Text Mining, Bayesian MAP
Scheduled
GT7-1 Design of Experiments
3 September 2019, 15:30
I2L5. Edificio Georgina Blanes