Facial recognition as a UX driver. From AR to emotion detection, how the camera turned out to be the best tool to decipher the world

 Facial Recognition

 

The camera is finally on stage to solve UX, technology, and communication between us all. Years after the Kinect was trashed and Google Glass failed, there is new hope. The impressive technological array that Apple miniaturized from PrimeSense's hardware into the iPhone X is the beginning of the road toward emotion-dependent interactions. It's not just new, it's better than that: it's commercialized and comes with developer access.

Mark Zuckerberg recently said that much of Facebook's focus will be on the camera and its surroundings. Snap has defined itself as a camera company. Apple and Google (Lens, Photos) are also investing heavily in cameras. There is tremendous power still hidden in the camera, and it's the power to detect emotions.

Inputs need to be easy, natural and effortless

When Facebook first introduced emoji Reactions as an enhancement of the Like, I realised they were onto something. Facebook chose five emotions to add, essentially to help them better understand emotional reactions to content. I argued it was a glorified form of the same thing, but one that works better than anything else. In the past Facebook only had the Like button, while YouTube had Like and Dislike. These are not enough for tracking emotions and cannot bring much value to researchers and advertisers. Most people expressed their emotions in comments, and yet there were more likes than comments. Comments are text, or even images and GIFs, which are harder to analyze, because there are many contextual connections an algorithm has to guess. For example: how familiar is this person with the person they are reacting to, and vice versa? What's their connection with the specific subject? Is there subtext, slang, or anything tied to previous experience? Is it a continuation of a past conversation? And so on. Facebook also did a wonderful job of keeping the conversation positive and preventing something like a Dislike button from pulling focus, which could have discouraged content creators and sharing. They kept it positively pleasant.
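This is the analytical appeal of Reactions over comments: each reaction is already a structured signal, so turning them into an emotion profile is trivial. A minimal sketch (the reaction data here is invented for illustration):

```python
from collections import Counter

# Hypothetical reaction events on a single post. The names mirror
# Facebook's reaction set, but the data itself is made up.
reactions = ["like", "like", "love", "haha", "like", "sad", "love"]

def emotion_distribution(events):
    """Turn a list of structured reactions into a normalized emotion profile."""
    counts = Counter(events)
    total = sum(counts.values())
    return {emotion: count / total for emotion, count in counts.items()}

profile = emotion_distribution(reactions)
```

No natural-language parsing, no context guessing: the emotion arrives pre-labeled, which is exactly why it is so valuable to researchers and advertisers.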

Nowadays I would compare Facebook.com to a glorified forum. Users can reply to comments and react with likes (and other emotions). We've almost reached the point where you can like a like 😂. Yet it is still very hard to know what people are feeling. Most people who read don't comment. What do they feel while reading the post?

The old user experience for cameras

What do you do with a camera? Take pictures and videos, and that's about it. Still, there has been huge development in camera apps. Many features orbit the main use case: things like HDR, slow-mo, portrait mode, etc.

Twitter Luke Cameras
https://twitter.com/lukew/status/522056776477200384

Based on the enormous number of pictures users generated, there was a new wave of smart galleries, photo processing, and metadata apps.

Photography from the Mac App Store

However, the focus has recently shifted towards the life-integrated camera: a stronger combination of the strongest traits and best use cases of what we do with mobile phones. The next generation of cameras will be fully integrated with our lives and could replace all the other input icons in a messaging app (microphone, camera, location).


It is no secret that the camera is among the three main components that have been developed at a dizzying pace: the screen, the processor, and the camera. Every new phone pushed their limits year after year. For cameras, the improvements were in megapixels, image stabilization, aperture, speed and, as mentioned above, the apps. Let's look at a few products these companies created to evaluate that evolution.

This is just a glimpse of the megapixel upgrades, not including dual cameras, flash, etc. There are many software changes too.

Most of the development focused on the back camera because, at least initially, the front camera was perceived as being for video calls only. Selfie culture and Snapchat changed that. Snapchat's masks, later copied by everyone else, are still a huge success. Face masks weren't new (Google introduced them way back), but Snapchat was more effective at putting them in front of people and growing their use.

Highlights from memory lane

In December 2009 Google introduced Google Goggles, the first time users could use their phone to get information about the things around them. Initially the information was mainly about landmarks.

In November 2011, on the Galaxy Nexus, Google introduced facial recognition to unlock the phone for the first time. Like many things done for the first time, it wasn't very good and was scrapped later on.

Samsung (Google) Nexus

In February 2013 Google released Google Glass, which had more use cases because it could receive input beyond the camera, such as voice, and it was always there and present. But it essentially failed to gain traction: it was too expensive, looked unfashionable, and triggered an antagonistic backlash from the public. It was just not ready for prime time.

Google Glass 1

Devices so far had only a limited amount of information at their disposal: audio-visual input, plus GPS and historical data. And it was constrained: Google Glass displayed information on a small screen near your eye, which made you look like an idiot while using it and prevented you from looking at anything else. I would argue that putting such technology on a phone for external use is not just a technological limitation but also a physical one. When you focus on the phone you cannot see anything else; your field of view is limited, similar to the field-of-view constraints in UX principles for VR. That's why some cities have made lanes for people on their phones, and traffic lights that help people not die while walking and texting. A premise like Microsoft's HoloLens is much more aligned with the spatial environment and can actually help users interact rather than absorbing their attention and putting them in danger.

Kinect tech bought by Apple
Microsoft HoloLens

In July 2014 Amazon introduced the Fire Phone, featuring four cameras at the front. This was a breakthrough phone in my opinion, even though it didn't succeed. The four front cameras scrolled the page once the user's eyes reached the bottom, and created 3D effects based on the accelerometer and the user's gaze. It was the first time a phone used the front camera as an input method to learn from users.

The Fire Phone

In August 2016 the Note 7 launched with iris scanning that let users unlock their phones. Samsung resurrected, in improved form, the facial recognition technology that had sat on the shelf for five years. Unfortunately, just watching the tutorial is vexing; it made clear to me that they didn't do much user-experience testing for the feature. It is extremely awkward to hold that huge phone perfectly parallel to your face, and I don't think it's something anyone should do in the street. I do understand it could work nicely for women who keep their faces covered. But the Note 7 exploded, luckily not in people's faces while doing iris scanning or VR, and this whole concept waited another full year until the Note 8 came out.

From Samsung’s Keynote

By the time the Note 8 arrived, the feature was barely mentioned. All Samsung says is that it's an additional way of unlocking your phone alongside the fingerprint sensor. My guess is that this is because it's not good enough, or because Samsung couldn't make a decision (similar to the release of the Galaxy S6 and S6 Edge). For something to succeed it needs to offer multiple things you can do with it; otherwise it risks being forgotten.

Google took a break and then, in July 2017, released the second version of Glass as a B2B product, with use cases tailored to specific industries.

Glass 2

Now Google is about to release Google Lens, bringing the main initial Goggles use case into the modern age. It's the company's effort to learn how to use visual input with additional context, and to figure out the next type of product to develop. It seems they're leaning towards a wearable camera.

Google Lens App

Many others are exploring visual input as well. Pinterest, for example, is seeing huge demand for its visual search lens and intends to use it to let people search for familiar things to buy and to help them curate.

Pinterest Visual Search

Snapchat's Spectacles let users record short videos effortlessly (even though the upload process is cumbersome).

Snap’s Specs

Facial recognition is now also on the Note 8 and Galaxy S8, but it's not panning out as well as we'd hoped it would.

Galaxy S8 Facial Recognition

Or https://twitter.com/MelTajon/status/904058526061830144/video/1


Apple is known for being slower than its competitors to adopt new technology. On the other hand, it is known for commercializing it; consider the number of Apple Watches sold compared with other brands. This time it was all about facial recognition and the infinite screen. There is no better way of making people use something than removing the alternatives (like Touch ID). It's not surprising: last year they did the same with wireless audio (removing the headphone jack) and USB-C on the MacBook Pro (by removing everything else).

 

 

I am sure there is a much bigger strategic reason why Apple chose this technology at this specific time: it's to do with their AR efforts.

Face ID has some difficulties that immediately occurred to me, like the niqāb (face coverings) in Arab countries, plastic surgery, and simply growing up. But the bigger picture here is much more interesting. This is the first time users can do something they naturally do, with no effort, and generate data that is far more meaningful for the future of technology. I still believe a screen that can read your fingers anywhere on its surface is a better way, and it seems Samsung is heading in that direction (although rumors claim Apple tried it and failed).

So where is this going? What’s the target?

In the past, companies used special glasses and devices for user testing. The only output they could produce regarding focus was heat maps: mouse movements were used as a proxy for interactions, and eye-tracking glasses recorded where people physically looked. Yet they couldn't document users' emotions and how they reacted to the things they saw.

Tobii Pro glasses — is one example
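The heat maps themselves are conceptually simple. A minimal sketch of how gaze samples become one, with the fixation data, grid size, and coordinate convention all invented for illustration:

```python
# Fixation points in normalized screen coordinates (x, y in [0, 1))
# are binned into a coarse grid; hotter cells received more fixations.
GRID = 4  # a 4x4 grid over the screen, chosen arbitrarily

def gaze_heatmap(fixations, grid=GRID):
    """Accumulate gaze samples into a grid of fixation counts."""
    heat = [[0] * grid for _ in range(grid)]
    for x, y in fixations:
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        heat[row][col] += 1
    return heat

samples = [(0.10, 0.10), (0.12, 0.15), (0.90, 0.90), (0.11, 0.12)]
heat = gaze_heatmap(samples)
```

What this can never tell you is the missing half of the picture the article is after: whether the user looking at that hot cell was delighted or disgusted.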

Based on tech trends, it seems the future involves Augmented and Virtual Reality. But in my opinion it's more about audio and 3D sound, and visual inputs, gathered simultaneously. That would allow wonderful experiences, such as being able to look anywhere, at anything, and get information about it.

What if we could know where users are looking, where their focus is? For years this is something Marketing and Design professionals have tried to capture and analyze. What better starting point than the sensor array of a device like the iPhone X? Later on, this should evolve into glasses that can see where the user's focus lies.

Reactions are powerful and addictive

Reactions help people converse and raise retention and engagement. Some apps offer a post's reaction as a message one can send to friends. There are funny YouTube videos of people reacting to all kinds of videos; there is even a TV show dedicated solely to people watching TV shows, called Gogglebox.

At Google I/O, Google decided to open up the option to pay creators on its platform, somewhat like what the brilliant Patreon site does, but in a much more dominant way: SuperChat, a way that helps you, as someone in the crowd, stand out and grab the creator's attention.

https://www.youtube.com/watch?v=b9szyPvMDTk

I keep going back to Chris Harrison's student project from 2009, in which he created a keyboard with pressure sensing in the keys: type forcefully and it reads your emotion, anger or excitement, and the letters get bigger. Now imagine combining that with a camera that sees your facial expression; we all know people express their emotions while typing to someone.

https://www.youtube.com/watch?time_continue=88&v=PDI8eYIASf0
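The pressure-to-emphasis idea can be sketched in a few lines. Everything here is invented for illustration: the pressure values, the base size, and the scaling factor; a real keyboard would supply per-keystroke force readings.

```python
# A toy version of the pressure-sensing keyboard idea: harder key
# presses render as bigger letters, so force reads as emphasis.
def emphasize(text, pressures, base_size=12, scale=8):
    """Return (character, font_size) pairs, one per keystroke.

    pressures are normalized force readings in [0, 1]; a press at
    full force doubles the emphasis added on top of base_size.
    """
    return [(ch, round(base_size + scale * p)) for ch, p in zip(text, pressures)]

styled = emphasize("no!", [0.2, 0.5, 1.0])
```

Fold in a facial-expression signal from the front camera and the same mapping could modulate not just size but tone, exactly the combination the paragraph above imagines.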

How would such UX look?

Consider the pairing of a remote and the central focus point in VR. The center is our focus, but we also have a secondary focus point: where the remote points. This type of user experience cannot work in Augmented Reality, though, unless you want everything to stand very still while you walk around with a magic wand. To take advantage of Augmented Reality, one of Apple's new focuses, they must know where the user's focus lies.

https://blog.kickpush.co/beyond-reality-first-steps-into-the-unknown-cbb19f039e51

What started as the ARKit and ARCore SDKs will be the future of development, not only because of the amazing output but also because of the input they can gather from the front and back cameras combined. This will allow a much greater focus on input.
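To make the input side concrete: ARKit's face tracking exposes per-expression blend-shape coefficients in [0, 1] (keys such as mouthSmileLeft or jawOpen). A minimal sketch of how an app might fold those into a coarse emotion signal; the rules and thresholds below are my own assumptions, not Apple's API or any shipped algorithm:

```python
# Illustrative only: map ARKit-style blend-shape coefficients (0..1)
# to a rough emotion label. Thresholds are invented for the sketch.
def rough_emotion(shapes):
    """Guess a coarse emotion from a dict of blend-shape coefficients."""
    smile = (shapes.get("mouthSmileLeft", 0) + shapes.get("mouthSmileRight", 0)) / 2
    frown = (shapes.get("browDownLeft", 0) + shapes.get("browDownRight", 0)) / 2
    jaw = shapes.get("jawOpen", 0)
    if smile > 0.5:
        return "happy"
    if frown > 0.5:
        return "angry"
    if jaw > 0.6:
        return "surprised"
    return "neutral"

label = rough_emotion({"mouthSmileLeft": 0.8, "mouthSmileRight": 0.7})
```

Even this crude mapping shows why the front camera is such a rich input channel: the device gets a continuous emotional read without the user doing anything at all.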


A more critical view on future developments

While Apple opened the hatch for facial recognition with reaction-driven Animojis, it will get much more interesting when others start implementing Face ID. Currently it manifests in a basic, harmless way, but the goal remains to get more information: information that will be used to track us, sell to us, learn about us, and sell our emotional data, while also allowing us immersive experiences.

Animoji

It is important to say that the front camera doesn't come alone; it is the long-expected result of Apple buying PrimeSense. The array of front-facing technology includes an IR camera, a depth sensor, and more (I think a heat sensor would serve them well too). It's not that someone will necessarily keep videos of our faces using the phone; rather, a scraper will document all the information about our emotions.

Can’t be fooled by a mask — From Apple’s Keynote
Or, funnily enough

Summary

It is exciting for Augmented Reality to have algorithms that can read our faces. Many books have been written about identifying people's facial expressions, and now it's time to digitize that too. It will be wonderful for many things: robots that can look at us, see how we feel, and react to those emotions, or glasses that gain more context for what we need them to do. Computationally, it's better to look at the combination of elements, because that combination creates the context that helps the machine understand you better. It could be another win for companies with ecosystems to leverage.

The things you can do when you know what the user is focusing on are endless. Attaining that knowledge is the dream of everyone who works in technology.