This study presents an advanced image captioning system with integrated audio descriptions, designed to improve accessibility for visually impaired users. Combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the system interprets visual content in agricultural images and generates corresponding textual captions, which are then converted into natural-sounding audio to enrich interaction with agricultural content. The system was evaluated on standard datasets of agricultural scenes, where it demonstrated notable improvements in caption accuracy and audio description quality. Beyond advancing image captioning technology, this work underscores the approach's potential in agricultural applications and its value for accessibility more broadly. The paper details the system architecture and methodology and provides a comparative analysis with existing technologies, highlighting advances in real-time image processing and accessibility.
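The overall pipeline described above (image → CNN features → RNN-generated caption → audio) can be sketched as follows. This is a minimal, hypothetical illustration only: the toy vocabulary, dimensions, random weights, and function names are assumptions for demonstration, not the paper's actual model, and the final text-to-speech step is represented by a placeholder comment.

```python
import numpy as np

# Hypothetical sketch of a CNN-encoder / RNN-decoder captioning pipeline.
# All names, sizes, and weights are illustrative assumptions.

rng = np.random.default_rng(0)

VOCAB = ["<start>", "<end>", "a", "tractor", "in", "field"]  # toy vocabulary
FEAT_DIM, HID_DIM = 8, 8

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for CNN features (in practice: a pretrained CNN's pooled output)."""
    return image.mean(axis=(0, 1))  # crude global pooling down to FEAT_DIM

# Toy RNN decoder: fixed random weights, tanh recurrence, greedy argmax decoding.
W_xh = rng.normal(size=(FEAT_DIM, HID_DIM))
W_hh = rng.normal(size=(HID_DIM, HID_DIM))
W_hy = rng.normal(size=(HID_DIM, len(VOCAB)))
E = rng.normal(size=(len(VOCAB), FEAT_DIM))  # token embeddings

def generate_caption(image: np.ndarray, max_len: int = 5) -> list:
    h = np.tanh(encode_image(image) @ W_xh)   # initialize state from image features
    tok = VOCAB.index("<start>")
    words = []
    for _ in range(max_len):
        h = np.tanh(E[tok] @ W_xh + h @ W_hh)  # one recurrent step
        tok = int(np.argmax(h @ W_hy))         # greedy token choice
        if VOCAB[tok] == "<end>":
            break
        words.append(VOCAB[tok])
    return words

caption = generate_caption(rng.normal(size=(4, 4, FEAT_DIM)))
print(" ".join(caption))
# The caption string would then be passed to a text-to-speech engine
# to produce the audio description.
```

In a real system the random decoder weights would be learned from paired image-caption data, and greedy decoding is often replaced by beam search for higher-quality captions.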