
Comparing 4 Most Popular Image Tagging Services

Prior to integrating image tagging into our API, we at Filestack evaluated four of the most popular image tagging services. To determine which service to use, we looked at features, pricing, image size limits, rate limits, performance, and accuracy. Ultimately, we decided to go with Google Vision, but the other services might be a good fit for your project.

Background

One of the things we pride ourselves on at Filestack is providing the world's top file handling service for developers and, in effect, building the files API for the web. Whether you want to integrate our uploader widget with a few lines of code or build a custom uploading system on top of our APIs, we want to provide you with a rock-solid platform coupled with an excellent experience.

Over time, our customers have asked for progressively more detailed data about their files and uploads, and image tagging became one of the most requested features. In past years, more than 250,000,000 files were uploaded through Filestack, with images accounting for more than 85% of those uploads. With photos dominating uploads, Exchangeable Image File Format (EXIF) data extraction and automated object recognition (image tagging) have become increasingly important to data analytics.

EXIF data is not new to Filestack: customers can already query an uploaded image's metadata to see the EXIF payload. Much like the decision we made to partner with a CDN, the Filestack engineering team had a choice to make: do we build an image tagging service ourselves, or do we go to market with a best-of-breed partner? This decision was a little more complicated because we already had a facial detection engine, so some scaffolding of a product existed, whereas content delivery was something we had no software investment in.

The AI landscape has taken off, and with it we see image recognition systems like Clarifai and Imagga, as well as large technology incumbents such as Microsoft, Google, and Amazon. It feels like there is a newcomer every few days, and competition is good! It will produce better platforms, happier customers, and more creative uses of this technology. In the interest of time, we decided to pick a handful of players in this space and put them through their paces. Speed and accuracy were the two categories we prioritized; there is an almost endless number of success criteria one could apply here, so we tried to keep it simple.

Comparing 4 popular object recognition APIs

The four platforms we put to the test were:

Google Cloud Vision

Based on the TensorFlow open-source framework that also powers Google Photos, Google launched the Cloud Vision API (beta) in February 2016. It offers multiple functions, including optical character recognition (OCR) as well as face, emotion, logo, inappropriate-content, and object detection.
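
For reference, a label-detection request to the Cloud Vision REST API is a single POST. Here is a minimal sketch in Python with the requests library; the API key and image URL are placeholders, and error handling is omitted:

```python
import requests

API_KEY = "YOUR_GOOGLE_API_KEY"              # placeholder
IMAGE_URL = "https://example.com/photo.jpg"  # placeholder

payload = {
    "requests": [{
        "image": {"source": {"imageUri": IMAGE_URL}},
        "features": [{"type": "LABEL_DETECTION", "maxResults": 10}],
    }]
}

resp = requests.post(
    "https://vision.googleapis.com/v1/images:annotate",
    params={"key": API_KEY},
    json=payload,
)
resp.raise_for_status()

# Each annotation carries a description (the tag) and a confidence score.
for label in resp.json()["responses"][0].get("labelAnnotations", []):
    print(f"{label['description']}: {label['score']:.0%}")
```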

 

 

Microsoft Cognitive Services

Formerly known as Project Oxford, Microsoft Cognitive Services encompasses 22 APIs, including a wide variety of detection APIs such as dominant color, face, emotion, celebrity, image type, and not-safe-for-work (NSFW) content. For the purposes of our object recognition testing, we focused on the Computer Vision API (preview), which tags images against an 86-category taxonomy.
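
A tagging call against the Computer Vision preview is also a single POST. A minimal Python sketch follows; the subscription key, region, and image URL are placeholders, and the preview endpoints were region-specific:

```python
import requests

SUBSCRIPTION_KEY = "YOUR_AZURE_KEY"          # placeholder
IMAGE_URL = "https://example.com/photo.jpg"  # placeholder

resp = requests.post(
    "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze",
    params={"visualFeatures": "Tags"},
    headers={
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        "Content-Type": "application/json",
    },
    json={"url": IMAGE_URL},
)
resp.raise_for_status()

# Each tag comes back with a name and a confidence between 0 and 1.
for tag in resp.json().get("tags", []):
    print(f"{tag['name']}: {tag['confidence']:.0%}")
```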

 

Amazon Rekognition

Amazon Rekognition is an image recognition service that was propelled by the quiet Orbeus acquisition back in 2015. Rekognition is focused on object, facial, and emotion detection. One major difference from the other services tested was the absence of NSFW content detection. Moving forward, Amazon has thrown their full support behind MXNet as their deep-learning framework of choice.
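
If you are already on AWS, label detection with boto3 is one call to detect_labels. Here is a minimal sketch, assuming your AWS credentials are configured; the bucket and object names are placeholders:

```python
import boto3

client = boto3.client("rekognition", region_name="us-east-1")

response = client.detect_labels(
    # Placeholder bucket and key; a raw bytes payload also works via {"Bytes": ...}.
    Image={"S3Object": {"Bucket": "my-uploads", "Name": "photo.jpg"}},
    MaxLabels=10,
    MinConfidence=50,
)

# Each label has a Name and a Confidence percentage.
for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.0f}%")
```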

 

 

Clarifai

Founded in 2013 and winner of the ImageNet 2013 classification challenge, Clarifai is one of the hottest startups in the AI space, having raised over $40 million in funding. Led by machine learning and computer vision guru Matthew Zeiler, Clarifai is making a name for itself by combining core models with additional machine learning models to tag in areas like "general," "NSFW," "weddings," "travel," and "food."
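
Clarifai's v2 API takes the image URL in a JSON body and returns concepts from whichever model you target. A rough sketch against the general model follows; the API key and image URL are placeholders, and the model ID reflects the version available at the time, so treat this as an approximation of the request shape:

```python
import requests

API_KEY = "YOUR_CLARIFAI_API_KEY"            # placeholder
IMAGE_URL = "https://example.com/photo.jpg"  # placeholder

resp = requests.post(
    "https://api.clarifai.com/v2/models/general-v1.3/outputs",
    headers={
        "Authorization": f"Key {API_KEY}",
        "Content-Type": "application/json",
    },
    json={"inputs": [{"data": {"image": {"url": IMAGE_URL}}}]},
)
resp.raise_for_status()

# Concepts come back as name/value pairs, value being the confidence.
for concept in resp.json()["outputs"][0]["data"]["concepts"]:
    print(f"{concept['name']}: {concept['value']:.0%}")
```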

 

High-level feature platform overview

Feature Comparison

Feature             Amazon   Google   Clarifai   Microsoft
Image Tagging       Yes      Yes      Yes        Yes
Video Tagging       Yes      Yes      Yes        Yes
Emotion detection   Yes      Yes      Yes        Yes
Logo detection      Yes      Yes      Yes        Yes
NSFW tagging        Yes      Yes      Yes        Yes
Dominant color      Yes      Yes      Yes        Yes
Feedback API        No       Yes      Yes        No

 

Image Size Limits

Amazon Rekognition              5 MB per image (15 MB per image from S3)
Google Vision API               20 MB per image
Clarifai                        No data in documentation
Microsoft Computer Vision API   4 MB per image
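
Because these ceilings vary so much, it can pay to check (and if necessary downscale) an image before submitting it. Below is a small sketch using Pillow that targets the strictest limit in the table; the 4 MB threshold and the quality setting are illustrative choices, not part of any provider's SDK:

```python
from io import BytesIO

from PIL import Image

MAX_BYTES = 4 * 1024 * 1024  # strictest documented limit above (Microsoft)


def shrink_to_limit(path, max_bytes=MAX_BYTES, quality=85):
    """Return JPEG bytes no larger than max_bytes, halving dimensions as needed."""
    img = Image.open(path).convert("RGB")
    while True:
        buf = BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        # Stop once we fit, or once the image gets too small to keep shrinking.
        if buf.tell() <= max_bytes or min(img.size) < 200:
            return buf.getvalue()
        img = img.resize((img.width // 2, img.height // 2))
```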

 

Rate limits

Amazon Rekognition              Not defined in documentation
Google Vision API               Varies by plan
Clarifai                        Varies by plan (we were given 30 requests per second for testing)
Microsoft Computer Vision API   10 requests per second
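
If you plan to push a large batch through any of these APIs, the simplest way to respect a documented cap is to meter requests on the client side. A minimal thread-safe sketch is below; the 10 requests-per-second figure mirrors Microsoft's documented limit and is only an example:

```python
import threading
import time


class RateLimiter:
    """Allow at most `rate` calls per second across threads by spacing them out."""

    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self):
        with self.lock:
            now = time.monotonic()
            # Reserve the next available slot and compute how long to wait for it.
            self.next_slot = max(self.next_slot, now) + self.interval
            delay = self.next_slot - self.interval - now
        if delay > 0:
            time.sleep(delay)


limiter = RateLimiter(rate=10)  # e.g. Microsoft's 10 requests per second
# Call limiter.wait() before each API request in your worker threads.
```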

 

Now that we'd set the stage, we were ready to test for the two characteristics we cared about the most: performance and accuracy.

Performance Testing

MacBook Pro, Kraków, 1000 files, 10 at a time

Provider    Average   Minimum   Maximum   90th percentile
Amazon      2.42s     1.03s     3.73s     3.21s
Google      1.23s     0.69s     1.68s     1.42s
Clarifai    4.69s     0.1s      58.16s    4.78s
Microsoft   1.11s     0.65s     5.07s     1.5s

N. Virginia, 1000 files, 10 at a time

Provider    Average   Minimum   Maximum   90th percentile
Amazon      1.1s      0.302s    3.64s     1.97s
Google      0.98s     0.4s      1.79s     1.12s
Clarifai    2.17s     0.81s     7.35s     3.34s
Microsoft   1.38s     0.81s     4.22s     2.14s

N. Virginia, 3000 files, 10 at a time

Provider    Average   Minimum   Maximum   90th percentile
Amazon      1.08s     0.25s     2.71s     1.96s
Amazon S3   1.26s     0.35s     4.02s     2.17s
Google      0.97s     0.41s     2.87s     1.11s
Clarifai    2.08s     0.84s     7.63s     3.05s
Microsoft   1.31s     0.73s     14.74s    1.87s
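
For context on how these numbers were gathered: each run sent files 10 at a time and recorded per-request latency, which we then aggregated. A minimal sketch of that kind of harness is below; tag_image is a placeholder for whichever provider call is under test:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def tag_image(path):
    """Placeholder: send one file to the provider being benchmarked."""
    raise NotImplementedError


def timed_call(path):
    start = time.monotonic()
    tag_image(path)
    return time.monotonic() - start


def benchmark(paths, concurrency=10):
    # Run up to `concurrency` requests at once and collect each call's latency.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, paths))
    return {
        "average": statistics.mean(latencies),
        "minimum": latencies[0],
        "maximum": latencies[-1],
        "90th percentile": latencies[int(len(latencies) * 0.9)],
    }
```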

What did we learn?

In terms of raw speed, Google was the fastest and most consistent at the 90th percentile across all three runs, Microsoft was close behind on average, and Clarifai was the slowest with the widest variance, including a 58-second outlier in the Kraków run.
Object recognition testing

Google Maps screenshot

Source: Flickr.com

 

Amazon Diagram (92%), Plan (92%), Atlas (60%), Map (60%)
Google Map (92%), Plan (60%)
Clarifai Map (99%), Cartography (99%), Graph (99%), Guidance (99%), Ball-Shaped (99%), Location (98%), Geography (98%), Topography (97%), Travel (97%), Atlas (96%), Road (96%), Trip (95%), City (95%), Country (94%), Universe (93%), Symbol (93%), Navigation (92%), Illustration (91%), Diagram (91%), Spherical (90%)
Microsoft Text (99%), Map (99%)

 

To start, we decided to use something we thought was pretty simple: a screenshot of Google Maps. All services performed pretty well here, but Microsoft was the odd man out with a 99% confidence that this was "Text."

 

Fruit cup

 

Source: Flickr.com

 

Amazon Fruit (96%), Dessert (95%), Food (95%), Alcohol (51%), Beverage (51%), Cocktail (51%), Drink (51%), Cream (51%), Creme (51%)
Google Food (95%), Dessert (84%), Plant (81%), Produce (77%), Frutti Di Bosco (77%), Fruit (72%), Breakfast (71%), Pavlova (71%), Meal (66%), Gelatin Dessert (57%)
Clarifai Fruit (99%), No Person (99%), Strawberry (99%), Delicious (99%), Sweet (98%), Juicy (98%), Food (97%), Health (97%), Breakfast (97%), Sugar (97%), Berry (96%), Nutrition (96%), Summer (95%), Vitamin (94%), Kiwi (93%), Tropical (92%), Juice (92%), Refreshment (92%), Leaf (90%), Ingredients (90%)
Microsoft Food (97%), Cup (90%), Indoor (89%), Fruit (88%), Plate (87%), Dessert (38%), Fresh (16%)

 

Keeping things relatively simple, we ran a test with a fruit dessert cup. Google surprisingly came in last in confidence for "fruit" as a category. Clarifai not only hit fruit, but thanks to its wide range of categorization options, we also got plenty of results around "health" tags.

 

Assorted peppers

Source: Flickr.com

 

Amazon Bell Pepper (97%), Pepper (97%), Produce (97%), Vegetable (97%), Market (84%), Food (52%)
Google Malagueta Pepper (96%), Food (96%), Pepperoncini (92%), Chili Pepper (91%), Vegetable (91%), Produce (90%), Cayenne pepper (88%), Plant (87%), Bird’s Eye Chili (87%), Pimiento (81%)
Clarifai Pepper (99%), Chili (99%), Vegetable (98%), Food (98%), Cooking (98%), No Person (97%), Spice (97%), Capsicum (97%), Bell (97%), Hot (97%), Market (96%), Pimento (96%), Healthy (95%), Ingredients (95%), Jalapeno (95%), Cayenne (95%), Health (94%), Farming (93%), Nutrition (93%), Grow (90%)
Microsoft Pepper (97%), Hot Pepper (87%), Vegetable (84%)

 

We spiced things up this round by presenting a photo full of various peppers. All providers performed pretty well here, especially Google and Clarifai, which returned the most accurate pepper tags.

 

Herman the Dog

Source: Flickr.com

 

Amazon Animal (92%), Canine (92%), Dog (92%), Golden Retriever (92%), Mammal (92%), Pet (92%), Collie (51%)
Google Dog (98%), Mammal (93%), Vertebrate (92%), Dog Breed (90%), Nose (81%), Dog Like Mammal (78%), Golden Retriever (77%), Retriever (65%), Collie (56%), Puppy (51%)
Clarifai Dog (99%), Mammal (99%), Canine (98%), Pet (98%), Animal (98%), Portrait (98%), Cute (98%), Fur (96%), Puppy (95%), No person (92%), Retriever (91%), One (91%), Eye (90%), Looking (89%), Adorable (89%), Golden Retriever (88%), Little (87%), Nose (86%), Breed (86%), Tongue (86%)
Microsoft Dog (99%), Floor (91%), Animal (90%), Indoor (90%), Brown (88%), Mammal (71%), Tan (27%), Starting (18%)

 

Next up was a stock photo of a black dog; let's call him Herman. All services fared well. Most interesting to see here is that Amazon, Google, and Clarifai all tagged him as "Golden Retriever." I'm not confident that Herman is a golden retriever, but three out of four services said otherwise.

 

Flipped Herman the dog

Source: Flickr.com

 

 

Amazon Animal (98%), Canine (98%), Dog (98%), Mammal (98%), Pet (98%), Pug (98%)
Google Dog (97%), Mammal (92%), Vertebrate (90%), Dog Like Mammal (70%)
Clarifai Dog (99%), Mammal (97%), No Person (96%), Pavement (96%), Pet (95%), Canine (94%), Portrait (94%), One (93%), Animal (93%), Street (93%), Cute (93%), Sit (91%), Outdoors (89%), Walk (88%), Sitting (87%), Puppy (87%), Looking (87%), Domestic (87%), Guard (86%), Little (86%)
Microsoft Ground (99%), Floor (90%), Sidewalk (86%), Black (79%), Domestic Cat (63%), Tile (55%), Mammal (53%), Tiled (45%), Dog (42%), Cat (17%)

 

Going back to our adorable stock photo of Herman, we flipped the image and tossed it back into the mixer. The flip clearly gave Microsoft some heartburn: "Dog" fell to 42%, and it even gave a 17% certainty of Herman being a "Cat."

 

Zoomed in Herman

Source: Flickr.com

 

Amazon Animal (89%), Canine (89%), Dog (89%), Labrador Retriever (89%), Mammal (89%), Pet (89%)
Google Dog (96%), Mammal (92%), Vertebrate (90%), Dog Like Mammal (69%)
Clarifai Animal (99%), Mammal (98%), Nature (97%), Wildlife (97%), Wild (96%), Cute (96%), Fur (95%), No Person (95%), Looking (93%), Portrait (92%), Grey (92%), Dog (91%), Young (89%), Hair (88%), Face (87%), Chordata (87%), Little (86%), One (86%), Water (85%), Desktop (85%)
Microsoft Dog (99%), Animal (98%), Ground (97%), Black (97%), Mammal (97%), Looking (86%), Standing (86%), Staring (16%)

 

Ending our round of Herman-based tests, we presented a zoomed-in image to each of the platforms. Kudos to Amazon for "Labrador Retriever," as that is Herman's actual breed. Everything else here is pretty standard.

 

 

Source: Flickr.com

 

 

Amazon Emblem (51%), Logo (51%)
Google Text (92%), Font (86%), Circle (64%), Trademark (63%), Brand (59%), Number (53%)
Clarifai Business (96%), Round (94%), No Person (94%), Abstract (94%), Symbol (92%), Internet (90%), Technology (90%), Round out (89%), Desktop (88%), Illustration (87%), Arrow (86%), Conceptual (85%), Reflection (84%), Guidance (83%), Shape (82%), Focus (82%), Design (81%), Sign (81%), Number (81%), Glazed (80%)
Microsoft Bicycle (99%), Metal (89%), Sign (68%), Close (65%), Orange (50%), Round (27%), Bicycle Rack (15%)

 

In an effort to stretch the image categories, we decided to throw a logo into our testing and see what was reported. Kudos to Amazon for reporting "Logo," as that is what we were expecting; no other service hit it. Microsoft leading with a 99% certainty that this is a "Bicycle" was the first big miss we found.

 

Source: Flickr.com

 

Amazon Clown (54%), Mime (54%), Performer (54%), Person (54%), Costume (52%), People (51%)
Google Figurine (55%), Costume Accessory (51%)
Clarifai Lid (99%), Desktop (98%), Man (97%), Person (96%), Isolated (95%), Adult (93%), Costume (92%), Young (92%), Retro (91%), Culture (91%), Style (91%), Boss (90%), Traditional (90%), Authority (89%), Party (89%), Crown (89%), Funny (87%), People (87%), Celebration (86%), Fun (86%)
Microsoft No data returned

 

This was the only test we ran that completely stumped one of our contestants: a logo of Uncle Sam was not detected at all by Microsoft. The other three services varied in their responses, but none of them was able to correctly identify the logo.

 

What did we learn?

Winner: Google Vision

 

Conclusion

With speed and accuracy being our top priorities, Google’s Vision API was the winner this time around. We encourage you to pick the service that solves your needs best, as every platform has strengths and weaknesses. We will continue to test, review, and ultimately integrate with best of breed services – this is only the beginning. Our goal is to ensure we provide our customers with the best value for their dollar and help them solve complex challenges around intelligent content.

If you want to check out our image tagging implementation, reach out to us to see a demo today!
