Understanding visually rich documents, which combine text, complex layouts, and embedded images, is increasingly important, yet existing Key Information Extraction (KIE) methods depend on OCR, incurring latency, high computational cost, and error propagation. To overcome these limitations, we present STNet, a novel end-to-end model that extracts key information directly from images without OCR. STNet uses special tokens to observe (see) the image regions relevant to a question and, grounded in these regions, produces accurate answers together with visual grounding (tell). To further improve performance, we leverage GPT-4 to construct the TVG (TableQA with Vision Grounding) dataset, which pairs text-based question-answering (QA) annotations with precise visual grounding. Experiments on public benchmarks, including CORD, SROIE, and DocVQA, demonstrate state-of-the-art performance. Our code will be made publicly available.