AI models output a layer of data that interprets the raw incoming data. For example, a video stream comes in, and an AI model outputs the x, y coordinates of each person it detects in each frame. These basic outputs are technically considered AI because producing them requires expensive, cutting-edge machine learning and deep learning training. The useful application of these “AI” outputs, however, comes from the many additional layers of processing downstream.
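As a rough illustration, the “basic layer” of output might look like nothing more than per-frame bounding-box coordinates. The sketch below is illustrative only; the `Detection` record and its values are assumptions, not the interface of any particular detection library.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected person in one video frame -- the basic layer of AI output."""
    frame_index: int
    x: float           # top-left corner of the bounding box, in pixels
    y: float
    width: float
    height: float
    confidence: float  # detector's confidence that this box contains a person

# Hypothetical output for a single frame: just coordinates, no meaning attached yet.
frame_detections = [
    Detection(frame_index=1042, x=312.0, y=188.0, width=64.0, height=170.0, confidence=0.91),
    Detection(frame_index=1042, x=540.5, y=201.0, width=58.0, height=162.0, confidence=0.87),
]
```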
To perform the job of a security-camera monitoring employee, the AI needs to provide more than this basic output. We want the AI to know, and to tell us, the things that employee would normally be there to tell us: “Who is that person?” “What are they doing?” and, most importantly, “Is there anything going on, or about to go on, that requires a security-based action? And what is that action?”
This is where we layer additional models on top of the outputs of the basic ones, such as object detection, tracking, and recognition, and incorporate real-world time-space data (timestamps, exact locations) to create an AI that can truly provide insights that directly or indirectly affect the real world.
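To make the layering concrete, here is a minimal sketch of a downstream rule that combines the raw detections from the earlier example with a timestamp and a known restricted zone to raise an actionable alert. All names, zones, and thresholds here are assumptions for illustration, not a specific product’s logic.

```python
from datetime import datetime, time

# Hypothetical restricted zone, expressed in the camera's pixel coordinates.
RESTRICTED_ZONE = {"x_min": 300, "x_max": 700, "y_min": 100, "y_max": 400}
AFTER_HOURS = (time(22, 0), time(6, 0))  # assumed overnight window: 10 PM to 6 AM

def in_zone(det, zone):
    """Check whether a detection's bounding-box center falls inside the zone."""
    cx = det.x + det.width / 2
    cy = det.y + det.height / 2
    return zone["x_min"] <= cx <= zone["x_max"] and zone["y_min"] <= cy <= zone["y_max"]

def is_after_hours(ts: datetime) -> bool:
    """True if the timestamp falls in the overnight after-hours window."""
    t = ts.time()
    start, end = AFTER_HOURS
    return t >= start or t <= end

def security_alerts(detections, timestamp: datetime):
    """Layer simple time-space rules on top of the raw per-frame detections."""
    alerts = []
    for det in detections:
        if det.confidence >= 0.8 and in_zone(det, RESTRICTED_ZONE) and is_after_hours(timestamp):
            alerts.append(
                f"Person in restricted zone at {timestamp.isoformat()} "
                f"(frame {det.frame_index}); recommended action: dispatch guard."
            )
    return alerts
```

The point of the sketch is the shape of the pipeline, not the specific rule: the detector only says “a person is at these coordinates,” while the downstream layer supplies the context (where, when, and what to do about it) that makes the output useful.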