Prior to integrating image tagging into our our API, here at Filestack we evaluated four of the most popular image tagging services. To determine which service to use, we looked at features, pricing, image size limits, rate limits, performance and accuracy. Ultimately, we decided to go with Google Vision today but the other services might be a good fit for your project and we might even use one ourselves at some point in the future.
One of the things Filestack prides ourselves on is providing the best file uploader in the world, and in effect, building the Files API for the web. Whether you want to integrate our uploader widget with a few lines of code or you want to build a custom uploading system on top of our APIs, we want to provide you a rock solid platform coupled with an excellent experience.
As we move into 2017, our customers are progressively asking for more detailed data about their files and uploads. This led to image tagging being one of the features most requested features by customers. Last year, more than 250,000,000 files were uploaded through Filestack, with images accounting for more than 85% of those uploads. With photos dominating uploads, Exchangeable image file format (Exif) data extraction and automated object recognition (image tagging) are becoming dominant figures in data analytics.
Exif data is not new to Filestack, customers today can query an uploaded image’s metadata to see the Exif payload. Conversely, object recognition, or image tagging, is not a service we offered before March 1, 2017. Much like the decision to partner with a CDN we made a couple of years ago, the Filestack engineering team had a decision to make: do we build an image tagging service, or do we look for a best of breed partner to go to market with. This decision was a little more complicated because we already had a facial detection engine, so there is some scaffolding of a product, whereas content delivery was something we had no software investment in.
The AI landscape is taking off, and with it we see up and comers like Clarifai and Imagga, as well as large technology incumbents such as Microsoft, Google, and Amazon. It feels like there is a newcomer every few days, and competition is good! This is going to produce better platforms, happier customers, and more creative uses of this technology. In the interest of time, we decided to pick a handful of players in this space and put them through the paces. Speed and accuracy were the two categories we prioritized; there is an almost endless amount of success criteria one could place on this, so we tried to keep it simple.
Comparing 4 popular object recognition APIs
The four platforms we put to the test were:
Based on the Tensorflow open-source framework that also powers Google Photos, Google launched the Cloud Vision API (beta) in February 2016. It includes multiple functions, including optical character recognition (OCR), as well as face, emotion, logo, inappropriate content and object detection.
Formerly known as Project Oxford, Microsoft Cognitive Services encompasses 22 API’s that include a wide variety of detection API’s such as dominant color, face, emotion, celebrity, image type and not-safe-for-work content (NSFW). For the purposes of our object recognition testing, we focused on the Computer Vision API (preview), which employs the 86-category concept for tagging.
Announced at re:Invent 2016, Amazon Rekognition is an image recognition service that was propelled by the quiet Orbeus acquisition in 2015. Rekognition is focused on object, facial, and emotion detection. One major difference from the other services tested was the absence of NSFW content detection. Moving forward, Amazon has thrown their full support behind MXNet as their deep-learning framework of choice.
Founded in 2013, and winner of the Imagenet 2013 Classification Challenge, Clarafi is one of the hottest startups in the AI space raising over 40 million in funding. Led by Machine Learning / Computer Vision Guru Matthew Zeiller, Clarifai is making a name for themselves by combining core models with additional machine learning models to tag in areas like “general,” “NSFW,” “weddings,” “travel,” and “food.”
High-level feature platform overview
|Amazon Rekognition||$1 / 1000 events||https://cl.ly/1e2R2d071I2g|
|Google Vision API||$1.5 / 1000 events||https://cl.ly/1W3h0P423N1J|
|Clarifai||$19 / 20k, $479 / 250k events||https://cl.ly/1M1P293n0C3E|
|Microsoft Computer Vision API||$1.5 / 1000 Events||https://cl.ly/301X3i0q3U0W|
Image Size Limits
|Amazon Rekognition||5Mb / Image, 15Mb / Image from S3|
|Google Vision API||4 MB / Image|
|Clarifai||No data in documentation|
|Microsoft Computer Vision API||4 MB / Image|
|Amazon Rekognition||Not defined in documentation|
|Google Vision API||10 Requests per second|
|Clarifai||Depends from plan (we get 30 rps for testing purposes)|
|Microsoft Computer Vision API||10 Requests per second|
Now that we’ve set the stage, we were ready to test for the two characteristics we cared about the most: performance and accuracy.
MacBook Pro, Kraków, 1000 files, 10 at a time
N. Virginia, 1000 files, 10 at a time
N. Virginia, 3000 files, 10 at a time
What did we learn?
- Google Vision API provided us with the most steady and predictable performance during our tests, but it does not allow injection with URL’s. In order to use it, we had to send the entire file, or we could alternatively use Google Cloud Storage to save on bandwidth costs.
- Microsoft showed reasonable performance with some higher times on high load.
- Amazon Rekognition supports injection directly from S3 but there was no major improvement in performance. Google was faster in processing even if an image came from a server located in AWS’s infrastructure. Using S3 links could potentially save outgoing bandwidth costs, and using S3 allows us to use much larger files (15Mb).
- Clarifai was the slowest provider but was flexible enough to increase our rate limits to 30 requests per second. It was not clear that more tagging options linearly scaled with time to ingest and tag images.
- Filestack POV: Google’s investment in their network infrastructure once again proved to be a big winner. Even with the cost of bandwidth hitting all-time lows, we have to be mindful of the egress cost of processing millions of files across multiple cloud storage providers and social media sites.
Winner: Google Vision API
Object recognition testing
Google Maps screenshot
|Amazon||Diagram (92%), Plan (92%), Atlas (60%), Map (60%)|
|Map (92%), Plan (60%)|
|Clarifai||Map (99%), Cartography (99%), Graph (99%), Guidance (99%), Ball-Shaped (99%), Location (98%), Geography (98%), Topography (97%), Travel (97%), Atlas (96%), Road (96%), Trip (95%), City (95%), Country (94%), Universe (93%), Symbol (93%), Navigation (92%), Illustration (91%), Diagram (91%), Spherical (90%)|
|Microsoft||Text (99%), Map (99%)|
To start we decided to use something we thought was pretty simple, a screenshot of Google Maps. All services performed pretty well here; but Microsoft was the odd man out with a 99% confidence that this was “Text.”
Fruit dessert cup
|Amazon||Fruit (96%), Dessert (95%), Food (95%), Alcohol (51%), Beverage (51%), Coctail (51%), Drink (51%), Cream (51%), Creme (51%)|
|Food (95%), Dessert (84%), Plant (81%), Produce (77%), Frutti Di Bosco (77%), Fruit (72%), Breakfast (71%), Pavlova (71%), Meal (66%), Gelatin Dessert (57%)|
|Clarifai||Fruit (99%), No Person (99%), Strawberry (99%), Delicious (99%), Sweet (98%), Juicy (98%), Food (97%), Health (97%), Breakfast (97%), Sugar (97%), Berry (96%), Nutrition (96%), Summer (95%), Vitamin (94%), Kiwi (93%), Tropical (92%), Juice (92%), Refreshment (92%), Leaf (90%), Ingredients (90%)|
|Microsoft||Food (97%), Cup (90%), Indoor (89%), Fruit (88%), Plate (87%), Dessert (38%), Fresh (16%)|
Keeping things relatively simple, we ran a test with a fruit dessert cup. Google surprising came in last in confidence as “fruit” as a category. Clarifai not only hit fruit, but with their wide range of categorization options, we got plenty of results around “health” tags.
|Amazon||Bell Pepper (97%), Pepper (97%), Produce (97%), Vegetable (97%), Market (84%), Food (52%)|
|Malagueta Pepper (96%), Food (96%), Pepperoncini (92%), Chili Pepper (91%), Vegetable (91%), Produce (90%), Cayenne pepper (88%), Plant (87%), Bird’s Eye Chili (87%), Pimiento (81%)|
|Clarifai||Pepper (99%), Chili (99%), Vegetable (98%), Food (98%), Cooking (98%), No Person (97%), Spice (97%), Capsicum (97%), Bell (97%), Hot (97%), Market (96%), Pimento (96%), Healthy (95%), Ingredients (95%), Jalapeno (95%), Cayenne (95%), Health (94%), Farming (93%), Nutrition (93%), Grow (90%)|
|Microsoft||Pepper (97%), Hot Pepper (87%), Vegetable (84%)|
We spiced things up this round by presenting a photo full of various peppers. All providers performed pretty well here, especially Google and Clarifai with the most accurate pepper tags.
Herman the Dog
|Amazon||Animal (92%), Canine (92%), Dog (92%), Golden Retriever (92%), Mammal (92%), Pet (92%), Collie (51%)|
|Dog (98%), Mammal (93%), Vertebrate (92%), Dog Breed (90%), Nose (81%), Dog Like Mammal (78%), Golden Retriever (77%), Retriever (65%), Collie (56%), Puppy (51%)|
|Clarifai||Dog (99%), Mammal (99%), Canine (98%), Pet (98%), Animal (98%), Portrait (98%), Cute (98%), Fur (96%), Puppy (95%), No person (92%), Retriever (91%), One (91%), Eye (90%), Looking (89%), Adorable (89%), Golden Retriever (88%), Little (87%), Nose (86%), Breed (86%), Tongue (86%)|
|Microsoft||Dog (99%), Floor (91%), Animal (90%), Indoor (90%), Brown (88%), Mammal (71%), Tan (27%), Starting (18%)|
Next up was a stock photo of a black dog, let’s call him Herman. All services fared well, most interesting to see here is Amazon, Google, and Clarifai all tag with “Golden Retriever.” I’m not confident that Herman is a golden retriever, but three out of four services said otherwise.
Flipped Herman the dog
|Amazon||Animal (98%), Canine (98%), Dog (98%), Mammal (98%), Pet (98%), Pug (98%)|
|Dog (97%), Mammal (92%), Vertebrate (90%), Dog Like Mammal (70%)|
|Clarifai||Dog (99%), Mammal (97%), No Person (96%), Pavement (96%), Pet (95%), Canine (94%), Portrait (94%), One (93%), Animal (93%), Street (93%), Cute (93%), Sit (91%), Outdoors (89%), Walk (88%), Sitting (87%), Puppy (87%), Looking (87%), Domestic (87%), Guard (86%), Little (86%)|
|Microsoft||Ground (99%), Floor (90%), Sidewalk (86%), Black (79%), Domestic Cat (63%), Tile (55%), Mammal (53%), Tiled (45%), Dog (42%), Cat (17%)|
Going back to our adorable stock photo of Herman, we flipped the image and tossed it back into the mixer. Herman adequately gave Microsoft some heartburn, as “Dog” fell to 42%, and it even gave a 17% certainty of Herman being a “Cat.”
Zoomed in Herman
|Amazon||Animal (89%), Canine (89%), Dog (89%), Labrador Retriever (89%), Mammal (89%), Pet (89%)|
|Dog (96%), Mammal (92%), Vertebrate (90%), Dog Like Mammal (69%)|
|Clarifai||Animal (99%), Mammal (98%), Nature (97%), Wildlife (97%), Wild (96%), Cute (96%), Fur (95%), No Person (95%), Looking (93%), Portrait (92%), Grey (92%), Dog(91%), Young (89%), Hair (88%), Face (87%), Chordata (87%), Little (86%), One (86%), Water (85%), Desktop (85%)|
|Microsoft||Dog (99%), Animal (98%), Ground (97%), Black (97%), Mammal (97%), Looking (86%), Standing (86%), Staring (16%)|
Ending our round of Herman based tests, we presented a zoomed image to each of the platforms. Kudos to Amazon for “Labrador Retriever” as that is what breed Herman is. Everything else here is pretty standard.
|Emblem (51%), Logo (51%)|
|Text (92%), Font (86%), Circle (64%), Trademark (63%), Brand (59%), Number (53%)|
|Clarifai||Business (96%), Round (94%), No Person (94%), Abstract (94%), Symbol (92%), Internet (90%), Technology (90%), Round out (89%), Desktop (88%), Illustration (87%), Arrow (86%), Conceptual (85%), Reflection (84%), Guidance (83%), Shape (82%), Focus (82%), Design (81%), Sign (81%), Number (81%), Glazed (80%)|
|Microsoft||Bicycle (99%), Metal (89%), Sign (68%), Close (65%), Orange (50%), Round (27%), Bicycle Rack (15%)|
In an effort to stretch the image categories, decided we’d throw a logo into our testing and see what was reported. Kudos to Amazon for reporting “Logo”, as that is what we were expecting, and no one else hit. Microsoft leading with a 99% certainty this is a “Bicycle” was the first big miss we found.
Uncle Sam logo
|Amazon||Clown (54%), Mime (54%), Performer (54%), Person (54%), Costume (52%), People (51%)|
|Figurine (55%), Costume Accessory (51%)|
|Clarifai||Lid (99%), Desktop (98%), Man (97%), Person (96%), Isolated (95%), Adult (93%), Costume (92%), Young (92%), Retro (91%), Culture (91%), Style (91%), Boss (90%), Traditional (90%), Authority (89%), Party (89%), Crown (89%), Funny (87%), People (87%), Celebration (86%), Fun (86%)|
|Microsoft||No data returned|
The only test we ran that stumped one of our contestants, a logo of Uncle Sam was not detected at all by Microsoft. The other three service were varied in their responses, but none of the them were correctly able to identify the logo.
What did we learn?
- Google Vision was quite accurate and detailed. There was no major misses, the biggest being the telephone logo picture, but it did pick up “brand.”
- Microsoft took a bit of a beating, as it did not perform well on flipped Herman, or the telephone logo. The “Uncle Sam” logo also returned no data.
- Amazon Rekognition was pretty reliable and did not have any major surprises. As far a feature maturity goes, this is one of the newest entrants to the market. Given Amazon Web Service’s track record, we expect this service to grow very fast and new features to be added at a blistering pace.
- Clarifai by far had the most image tags, but had a few hiccups along the way. The zoomed Herman was able to make it drop “Dog” from it’s suggested tags. More tags is not always better, as some of them were inaccurate.
- Filestack POV: Logo detection and accuracy is very difficult. We also learned that more image tagging categories does not necessarily correlate to more accurate tagging.
Winner: Google Vision
With speed and accuracy being our top priorities, Google’s Vision API was the winner this time around. We have confidence in what we saw in our testing, and the pedigree of Google alone instills confidence that they will continue to improve their service. We encourage you to pick the service that solves your needs best, as every platform has strengths and weaknesses. We will continue to test, review, and ultimately integrate with best of breed services — this space is super hot right now, this is only the beginning. Our goal is to ensure we provide our customers with the best value for their dollar and help them solve complex challenges around intelligent content.
If you want to check out our image tagging implementation, we’ve added it to our “Services” catalog and you can demo it today! If you’re ready to add it into your application, it’s currently only available via API, but we are hard at work to add it into our flagship file picker widget. If you have any questions please reach out to firstname.lastname@example.org or hit us up on Twitter at @filestack.
In a follow-up post, I will discuss the explicit content detection feature across these four providers. That post will detail the results, and how we took them into consideration when integrating with a partner for our new image tagging service.