How to Train a More Interpretable Neural Text Classifier?

Hanjie Chen
Department of Computer Science
University of Virginia

Abstract Although neural networks have achieved remarkable performance on text classification, their lack of transparency makes it difficult to understand model predictions. At the same time, the growing demand for neural networks in text classification tasks motivates research on building more interpretable models.

In this work, we propose a novel training strategy, called learning with auxiliary examples, to improve the interpretability of existing neural text classifiers. Using sentiment classification as the example task and a widely adopted convolutional neural network as the baseline classifier, we show that the new learning strategy improves model interpretability while maintaining comparable classification performance. In addition, we propose an automatic evaluation metric that quantifies interpretability by measuring the consistency between model predictions and the corresponding explanations. Experiments on two benchmark datasets show significant improvements in the interpretability of models trained with the proposed strategy.
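The abstract does not spell out how consistency between predictions and explanations is computed. As one possible reading, the sketch below checks whether a classifier's prediction is preserved when only the top-k most important tokens from an explanation are kept; the names `predict_fn`, `explain_fn`, `consistency_score`, and the top-k masking scheme are illustrative assumptions, not the paper's actual metric.

```python
# Hypothetical sketch of a prediction-explanation consistency check.
# Assumed interfaces (not defined in the paper):
#   predict_fn(tokens) -> predicted label
#   explain_fn(tokens) -> per-token importance scores
from typing import Callable, List, Sequence


def consistency_score(
    examples: Sequence[List[str]],
    predict_fn: Callable[[List[str]], int],
    explain_fn: Callable[[List[str]], List[float]],
    k: int = 5,
    mask_token: str = "<unk>",
) -> float:
    """Fraction of examples whose prediction stays the same when only the
    top-k most important tokens (per the explanation) are kept."""
    agree = 0
    for tokens in examples:
        label = predict_fn(tokens)
        scores = explain_fn(tokens)
        # Indices of the k highest-scoring tokens according to the explanation.
        keep = set(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
        # Mask every token outside the top-k set.
        reduced = [t if i in keep else mask_token for i, t in enumerate(tokens)]
        if predict_fn(reduced) == label:
            agree += 1
    return agree / len(examples)
```

A higher score would indicate that the explanations identify tokens the model actually relies on; this is only one way such a consistency measure could be instantiated.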