This paper addresses the issue that existing open-source models, despite their strong zero-shot and image-understanding capabilities, remain weak at multi-turn interaction, especially over long contexts. To address this issue, we propose ContextQFormer, a context modeling module that enhances the representation of contextual information, and we construct and release TMDialog, a new dataset for multi-turn multi-modal dialogue research that contains longer conversations than existing datasets. In experiments on TMDialog, ContextQFormer achieves 2%-4% better performance than baseline models.
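The abstract does not describe the internals of ContextQFormer; purely as an illustration, the sketch below shows one plausible Q-Former-style design in PyTorch, where a set of learnable query tokens cross-attends over concatenated dialogue-history and current-turn features to produce a fixed-size context summary. The class name `ContextQFormerSketch`, its parameters, and the architecture itself are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a Q-Former-style context module (NOT the
# authors' implementation): learnable queries cross-attend over
# concatenated context and current-turn features.
import torch
import torch.nn as nn


class ContextQFormerSketch(nn.Module):
    """Assumed design: learnable query tokens summarize a long
    multi-turn context into a fixed-size representation."""

    def __init__(self, dim: int = 768, num_queries: int = 32,
                 num_heads: int = 12, num_layers: int = 2):
        super().__init__()
        # Learnable query tokens, as in BLIP-2-style Q-Formers.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # Decoder layers: self-attention over the queries plus
        # cross-attention into the context features.
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, context_feats: torch.Tensor,
                current_feats: torch.Tensor) -> torch.Tensor:
        # context_feats: (B, Tc, dim) features of earlier dialogue turns
        # current_feats: (B, Tv, dim) features of the current turn/image
        memory = torch.cat([context_feats, current_feats], dim=1)
        queries = self.queries.expand(context_feats.size(0), -1, -1)
        # Output: (B, num_queries, dim) fixed-size context summary,
        # which could then be fed to a language model as soft prompts.
        return self.decoder(tgt=queries, memory=memory)


if __name__ == "__main__":
    module = ContextQFormerSketch()
    ctx = torch.randn(2, 128, 768)   # long dialogue history
    cur = torch.randn(2, 32, 768)    # current-turn features
    print(module(ctx, cur).shape)    # torch.Size([2, 32, 768])
```

A fixed number of query tokens keeps the context representation constant-size regardless of conversation length, which is one common way such a module could remain tractable as dialogues grow.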