Over the past two years, Facebook AI Research (FAIR) has worked with 13 universities around the world to assemble Ego4D, the largest first-person video dataset in history, built specifically to train deep-learning image recognition models. AI trained on such a dataset could be better at controlling robots that interact with people, or at interpreting images from smart glasses. "Machines can only help us in our daily lives if they really understand the world through our eyes," says Kristen Grauman of FAIR, who leads the project.
Such technologies could support people who need assistance around the house, or guide people through tasks they are learning to complete. "The videos in this dataset are much closer to how people observe the world," says Michael Ryoo, a computer vision researcher at Google Brain and Stony Brook University in New York who is not involved with Ego4D.
But the potential for abuse is obvious and worrying. The research is funded by Facebook, the social media giant that was recently accused in the US Senate of putting profit ahead of people's well-being, a charge corroborated by MIT Technology Review's own investigations.
The business model of Facebook and other big tech companies is to collect as much data as possible about people's online behavior and sell it to advertisers. The AI described in the project could extend that reach to people's everyday offline behavior, revealing what objects are in your home, which activities you enjoy, whom you spent time with, and even where your gaze lingered: an unprecedented degree of personal information.
"There needs to be privacy work done as you take this out of the world of exploratory research and toward something that is a product," says Grauman. "That work could even take inspiration from this project."
The largest previous first-person (POV) dataset consisted of 100 hours of video footage of people in kitchens. The Ego4D dataset comprises 3,025 hours of video recorded by 855 participants in 73 locations across nine countries (the US, the UK, India, Japan, Italy, Singapore, Saudi Arabia, Colombia, and Rwanda).
The participants were of different ages and backgrounds; some were recruited for their visually interesting occupations, such as bakers, mechanics, carpenters, and landscape designers.
Previous datasets typically consisted of semi-scripted video clips only a few seconds long. For Ego4D, participants wore head-mounted cameras for up to 10 hours at a time and captured unscripted, first-person video of daily activities, including walking down the street, reading, doing laundry, shopping, playing with pets, playing board games, and interacting with other people. Some of the footage also includes audio, data about where the participants' gaze was focused, and multiple points of view of the same scene. "This is the first dataset of its kind," says Ryoo.