This study evaluates the performance of four widely used generative AI chatbots (ChatGPT, Google Gemini, Grok, and Perplexity) in industrial safety engineering using a controlled comparative protocol. Nine standardized prompts were administered, one for each combination of three safety domains (Electrical, Mechanical, and Fire Safety) and three prompt categories (Fundamental Knowledge, Applied Reasoning, and Image-Based Analysis). Three independent evaluators scored the responses against pre-defined gold standards on a 1–5 rubric across three key performance indicators (KPIs): accuracy, completeness, and relevance. The evaluators' scores were then averaged. Overall, the chatbots scored highest on text-based prompts (Fundamental Knowledge: 88.67%; Applied Reasoning: 79.33%) and lowest on Image-Based Analysis (68.67%). By domain, performance was highest in Fire Safety (87.33%) and lowest in Electrical Safety (70.67%). Across KPIs, relevance and accuracy were consistently stronger than completeness, indicating that responses were generally on topic and correct but frequently lacked the depth expected for professional judgement. The findings suggest that AI chatbots can support health, safety, and environment (HSE) work in drafting, summarization, and preliminary decision support, but human oversight remains essential, especially for applied reasoning in safety-critical tasks and for visual interpretation.
Keywords
AI chatbots; Industrial Safety Engineering; Fundamental Knowledge; Applied Reasoning; Image-Based Analysis.
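
The scoring protocol summarized in the abstract (1–5 rubric scores from three evaluators, averaged across three KPIs and reported as percentages) can be illustrated with a minimal sketch. The abstract does not state how rubric scores were converted to percentages, so the conversion here (mean score divided by the rubric maximum of 5) is an assumption. The ratings and the names `to_percentage`, `ratings`, and `kpi_means` are hypothetical and are not data or code from the study.

```python
from statistics import mean

MAX_RUBRIC_SCORE = 5  # top of the 1-5 rubric


def to_percentage(scores):
    """Average rubric scores and express them as a percentage of the maximum.

    Assumed conversion: the source does not specify how percentages
    were derived from the 1-5 rubric.
    """
    return round(mean(scores) / MAX_RUBRIC_SCORE * 100, 2)


# Hypothetical ratings for a single chatbot response:
# KPI -> scores given by the three independent evaluators.
ratings = {
    "accuracy": [5, 4, 4],
    "completeness": [4, 3, 4],
    "relevance": [5, 5, 4],
}

# Mean score per KPI, averaged over the three evaluators.
kpi_means = {kpi: mean(scores) for kpi, scores in ratings.items()}

# Overall percentage for the response, pooling all KPI scores.
all_scores = [s for scores in ratings.values() for s in scores]
overall = to_percentage(all_scores)

for kpi, value in kpi_means.items():
    print(f"{kpi}: {value:.2f} / {MAX_RUBRIC_SCORE}")
print(f"overall: {overall}%")
```

Under this assumed conversion, the lowest possible rubric score of 1 maps to 20% rather than 0%, which is worth keeping in mind when comparing the reported percentages.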