What is Computer Vision?
How can just one camera offer a solution to this issue? The answer is an artificial intelligence technology called Computer Vision. Computer Vision aims to let computers make sense of images and videos just as a human would. To accomplish this, algorithms are trained on very large datasets of images. With enough examples, an algorithm learns the visual features of an object well enough to identify whether a new image contains it. More advanced algorithms can even determine where that object is in an image, outline it, and track how that specific object moves from frame to frame.
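To make the idea concrete, here is a minimal sketch of "identifying and outlining an object" using OpenCV's built-in, pre-trained pedestrian detector. This is not the model used in our analysis, and the file names are placeholders; it just shows the detect-and-outline step described above.

```python
import cv2

# Load OpenCV's HOG descriptor with its pre-trained person-detection
# model (itself trained on a large dataset of pedestrian images).
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("exhibit_snapshot.jpg")  # placeholder file name

# Detect people; each detection comes back as an (x, y, w, h) box.
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8))

# Outline each detected person with a rectangle.
for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detections.jpg", image)
```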
For our research on museum visitors, computer vision can be preferable to human vision in several ways. Computers don't get tired, distracted, or bored; they don't accidentally influence how people move through the exhibit; and they can sense infrared light and other things humans cannot. Perhaps the most important distinction is that computer vision can carefully observe how multiple people move through an exhibit simultaneously. Suppose 30 children on a school field trip enter an exhibit and instantly scatter. Human eyes may have a tough time observing how the students choose to move throughout the exhibit, but computer vision will excel at that task.
Why Cameras Instead of Some Other Sensor?
This sort of tracking could be done with other technologies, sure. RFID tags visitors carry with them, Indiana Jones-style pressure plates in the floor in front of exhibits, or even the classic museum staff member with a tally counter could all get the job done. But cameras are cheap, widely available, easy to install, and commonplace enough not to be distracting in the exhibit. One camera with an ideal vantage point could potentially survey movements in an entire room. Camera installation for this experiment took less than an hour, and even installing cameras high in the exhibit required only one technician in a scissor lift. Power had to be run to each camera, and video was fed wirelessly to a receiver box in a closet off the exhibit. The whole setup was incredibly efficient and low-profile.
Our Approach to Data Collection in the Museum
Data collection for our experiment involved recording a few 30-minute video segments each day over several weeks in early February. A sign was placed outside of the room to inform visitors when recording was occurring and to let them know they could come back at another time if they were uncomfortable being recorded. Files were collected from the museum and, luckily, were prepared for analysis just before the COVID-19 shutdown went into effect.
The algorithm used to analyze the footage was built from several open-source Python wrappers and pre-trained models. Most notably, the FairMOT model was used to identify people in the footage. OpenCV Python was used for image and video reading, cython-bbox was used to create the bounding boxes around people identified in the exhibit, and Matplotlib was used to generate output visualizations. A virtual machine was created within Microsoft Azure to run the analysis, and to reduce the amount of processing required for each video clip, the algorithm was written to analyze only every third frame.
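As a rough illustration of that frame-skipping logic, the loop below reads a clip with OpenCV and hands every third frame to a tracker. The `tracker.update` call is a stand-in for the FairMOT inference step, whose actual interface differs; the clip name is a placeholder, and everything else is standard OpenCV.

```python
import cv2

def analyze_clip(path, tracker, frame_stride=3):
    """Run a tracker on every `frame_stride`-th frame of a video clip."""
    cap = cv2.VideoCapture(path)
    all_tracks = []
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of clip
        if frame_idx % frame_stride == 0:
            # `tracker.update` stands in for the FairMOT inference call,
            # which returns bounding boxes plus persistent IDs per person.
            all_tracks.append(tracker.update(frame))
        frame_idx += 1
    cap.release()
    return all_tracks
```

Skipping two of every three frames cuts the processing load roughly threefold while still sampling motion often enough to follow visitors walking through a room.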
Several areas of interest, referred to as "grids," were marked out and defined in the algorithm so the computer could measure when people interacted with those areas. Each grid corresponds to an information board in the exhibit that the museum's designers wanted to learn more about.
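A simplified version of that grid check might look like the sketch below, where each grid is a hand-marked rectangle in pixel coordinates and a visit is registered whenever the bottom-center of a tracked person's bounding box (a rough proxy for where they are standing) falls inside it. The grid names and coordinates here are invented for illustration.

```python
# Hypothetical grids: hand-marked pixel rectangles (x1, y1, x2, y2),
# one per information board the exhibit designers asked about.
GRIDS = {
    "board_a": (100, 400, 300, 600),
    "board_b": (500, 350, 700, 550),
}

def grids_visited(person_box, grids=GRIDS):
    """Return the names of grids containing the person's foot point.

    `person_box` is an (x, y, w, h) bounding box from the tracker;
    the bottom-center of the box approximates where the visitor stands.
    """
    x, y, w, h = person_box
    px, py = x + w / 2, y + h
    return [
        name
        for name, (x1, y1, x2, y2) in grids.items()
        if x1 <= px <= x2 and y1 <= py <= y2
    ]

# Example: a tracked person standing in front of board A.
print(grids_visited((150, 300, 80, 250)))  # -> ['board_a']
```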