UNIT III
Cloud APIs for Computer Vision: The landscape of visual recognition APIs: Clarifai, Microsoft Cognitive Services, Google Cloud Vision, IBM Watson Visual Recognition. Getting up and running with cloud APIs; training our own custom classifier. Performance tuning for cloud APIs: effect of resizing on image labelling APIs, effect of compression on image labelling APIs, effect of compression on OCR APIs, effect of resizing on OCR APIs.
Google Cloud ML Engine: pros of using Cloud ML Engine, cons of using Cloud ML Engine, building a classification API, TensorFlow Serving, Kubeflow: Pipelines, Fairing.
Edge ML: constraints and optimizations, TensorFlow Lite, running TensorFlow Lite, processing the image buffer, federated learning.
The Landscape of Visual Recognition APIs
Clarifai
• Clarifai was one of the first visual recognition APIs, started by Matthew Zeiler, a graduate student from New York University.
• It offers multilingual tagging in more than 23 languages, visual similarity search among previously uploaded photographs, a face-based multicultural appearance classifier, a photograph aesthetic scorer, a focus scorer, and embedding vector generation to help us build our own reverse-image search.
• It also offers recognition in specialized domains including clothing and fashion, travel and
hospitality, and weddings.
• Through its public API, the image tagger supports 11,000 concepts.
Microsoft Cognitive Services
• With the creation of ResNet-152 in 2015, Microsoft was able to win seven tasks at the 2015 ILSVRC, the COCO Image Captioning Challenge, and the Emotion Recognition in the Wild challenge, ranging from classification and detection (localization) to image description.
• Originally starting out as Project Oxford from Microsoft Research in 2015, it was eventually renamed Cognitive Services in 2016, and much of this research was translated into cloud APIs.
• It’s a comprehensive set of more than 50 APIs spanning vision, natural language processing, speech, search, knowledge graph linkage, and more.
• Historically, many of the same libraries were run by divisions at Xbox (image tagging) and Bing (image search and tagging), but they are now being exposed to developers externally.
• Some viral applications showcasing creative ways developers use these APIs include how-old.net (How Old Do I
Look?), Mimicker Alarm (which requires making a particular facial expression in order to defuse the morning
alarm), and CaptionBot.ai.
• As illustrated in Figure 8-2, the API offers image captioning, handwriting understanding, and headwear recognition.
• Because it serves many enterprise customers, Cognitive Services does not use customer image data to improve its services.
Google Cloud Vision
• Google won the 2014 ILSVRC (ImageNet Large Scale Visual Recognition Challenge)
using GoogLeNet, a deep 22-layer neural network.
• This led to the development of the now-standard Inception architectures.
• In December 2015, Google released a set of Vision APIs to complement the
Inception models.
• With access to vast amounts of consumer data, Google can significantly improve its
classifiers.
• For instance, insights from Google Street View help enhance real-world text
extraction, such as reading billboards.
• For human faces, it provides the most detailed facial key points (Figure 8-3)
including roll, tilt, and pan to accurately localize the facial features.
• The APIs also return images from the web that are similar to the given input. A simple way to try out the performance of Google’s system without writing any code is to upload photographs to Google Photos and search through the tags.
Amazon Rekognition
• Amazon Rekognition API is largely based on Orbeus, a Sunnyvale, California-based
startup that was acquired by Amazon in late 2015.
• The startup, founded in 2012, had a chief scientist with winning entries in the ILSVRC 2014 detection challenge.
• The same APIs were used to power PhotoTime, a popular photo organization app. The API’s services are available as part of the AWS offerings.
• Rekognition’s differentiators include license plate recognition, video recognition APIs, and better end-to-end integration of its APIs with other AWS offerings such as Kinesis Video Streams, Lambda, and others.
• Additionally, Amazon’s API is the only one among these services that can determine whether the subject’s eyes are open or closed.
IBM Watson Visual Recognition
• Under the Watson brand, IBM’s Visual Recognition offering started in early 2015.
• After IBM purchased AlchemyAPI, a Denver-based deep learning startup, its AlchemyVision technology was used to power the Visual Recognition APIs.
• Like others, IBM also offers custom classifier training.
Algorithmia
• Algorithmia is a marketplace for hosting algorithms as APIs on the cloud.
• Founded in 2013, this Seattle-based startup offers both its own in-house algorithms and those created by others. In testing, this API did tend to have the slowest response time.
• Its offerings include a colorization service for black-and-white photos (Figure 8-6), image stylization, image similarity, and the ability to run these services on premises or on any cloud provider.
Getting Up and Running with Cloud APIs
• Calling these cloud services requires minimal code.
• At a high level, get an API key, load the image, specify the intent, make a POST
request with the proper encoding (e.g., base64 for the image), and receive the
results.
• Most of the cloud providers offer software development kits (SDKs) and sample code
showcasing how to call their services.
• They additionally provide pip-installable Python packages to further simplify calling them.
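For example, Google’s pip-installable google-cloud-vision package wraps the Vision API in a few lines. The snippet below is a minimal sketch: the DogAndBaby.jpg test image is an assumption, and the client library expects credentials to be configured via the GOOGLE_APPLICATION_CREDENTIALS environment variable.

# Minimal sketch using the official google-cloud-vision package
# (pip install google-cloud-vision). Assumes application credentials
# are configured via the GOOGLE_APPLICATION_CREDENTIALS variable.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open('DogAndBaby.jpg', 'rb') as f:  # hypothetical test image
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)  # request image labels
for label in response.label_annotations:
    print('{}: {:.2f}'.format(label.description, label.score))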
• Now, let’s test the same image using Google Cloud Vision’s REST API directly. Get an API key from their website and use it in the code below.
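The helper google_cloud_tagimage is invoked below without its body being shown; here is a minimal sketch of what it might look like, assuming the public images:annotate REST endpoint with LABEL_DETECTION and a hypothetical API_KEY placeholder.

import base64
import requests

API_KEY = 'YOUR_API_KEY'  # placeholder: obtain from the Google Cloud console
VISION_URL = 'https://vision.googleapis.com/v1/images:annotate?key=' + API_KEY

def google_cloud_tagimage(filename):
    # The REST API expects the image as a base64-encoded string
    with open(filename, 'rb') as f:
        content = base64.b64encode(f.read()).decode('utf-8')
    body = {
        'requests': [{
            'image': {'content': content},
            'features': [{'type': 'LABEL_DETECTION', 'maxResults': 10}]
        }]
    }
    result = requests.post(VISION_URL, json=body).json()
    for label in result['responses'][0]['labelAnnotations']:
        print('{}: {:.2f}'.format(label['description'], label['score']))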
google_cloud_tagimage('DogAndBaby.jpg')
Similarly, let’s test the same image using Microsoft Cognitive Services. Get an API key from the Azure portal and use it in the code below.
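As before, the body of cognitive_services_tagimage is not shown; below is a minimal sketch assuming the Computer Vision analyze REST endpoint. The region (westus), API version (v2.0), and SUBSCRIPTION_KEY placeholder are assumptions to be matched to your Azure resource.

import requests

SUBSCRIPTION_KEY = 'YOUR_KEY'  # placeholder: obtain from the Azure portal
# Region and API version below are assumptions; adjust to your resource.
ANALYZE_URL = ('https://westus.api.cognitive.microsoft.com'
               '/vision/v2.0/analyze?visualFeatures=Description,Tags')

def cognitive_services_tagimage(filename):
    headers = {
        'Ocp-Apim-Subscription-Key': SUBSCRIPTION_KEY,
        'Content-Type': 'application/octet-stream'  # raw image bytes
    }
    with open(filename, 'rb') as f:
        result = requests.post(ANALYZE_URL, headers=headers, data=f.read()).json()
    # Print the generated caption, then the tags
    for caption in result['description']['captions']:
        print('Caption: {} ({:.2f})'.format(caption['text'], caption['confidence']))
    for tag in result['tags']:
        print('{}: {:.2f}'.format(tag['name'], tag['confidence']))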
cognitive_services_tagimage('DogAndBaby.jpg')
Training Our Own Custom Classifier
A few of these cloud providers give us the ability to train our own custom classifier merely by using a drag-and-drop interface. The polished user interfaces give no indication that, under the hood, they are using transfer learning. Cognitive Services Custom Vision, Google AutoML, Clarifai, and IBM Watson all provide us the option for custom training.
Additionally, some of them even allow building custom detectors, which can identify the location of objects with a bounding box.
The key process in all of them is the following:
1. Upload images
2. Label them
3. Train a model
4. Evaluate the model
5. Publish the model as a REST API
6. Bonus: Download a mobile-friendly model for inference on
smartphones and edge devices
Let’s walk through a step-by-step example using Microsoft’s Custom Vision.
1. Create a project (Figure 8-14): Choose a domain
that best describes our use case. For most purposes,
“General” would be optimal. For more specialized
scenarios, we might want to choose a relevant
domain.
As an example, if we have an ecommerce website with photos of products against a pure white background, we might want to select the “Retail” domain. If we intend to run this model on a mobile phone eventually, we should instead choose the “Compact” version of the model; it is smaller in size, with only a slight loss in accuracy.
2. Upload (Figure 8-15): For each category, upload images and tag them. It’s important to upload at least 30 photographs per category. For our test, we uploaded more than 30 images of Maltese dogs and tagged them appropriately.
3. Train (Figure 8-16): Click the Train button, and then in about three minutes, we have a spanking new classifier ready.
4. Analyze the model’s performance: Check the precision and recall of the model. By default,
the system sets the threshold at 90% confidence and gives the precision and recall metrics at
that value.
For higher precision, increase the confidence threshold. This would come at the expense of
reduced recall. Figure 8-17 shows example output.
5. Ready to go: We now have a production-ready API endpoint that we can call from any
application.
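To illustrate, here is a minimal sketch of calling such a published endpoint over REST. The region, PROJECT_ID, iteration name, and PREDICTION_KEY are placeholders (based on the Custom Vision v3.0 prediction API), and the threshold filter mirrors the precision/recall trade-off described in step 4.

import requests

PREDICTION_KEY = 'YOUR_PREDICTION_KEY'  # placeholder from the Custom Vision portal
# Region, project ID, and iteration name below are placeholders.
PREDICT_URL = ('https://southcentralus.api.cognitive.microsoft.com'
               '/customvision/v3.0/Prediction/PROJECT_ID'
               '/classify/iterations/Iteration1/image')

def classify_image(filename, threshold=0.9):
    headers = {
        'Prediction-Key': PREDICTION_KEY,
        'Content-Type': 'application/octet-stream'
    }
    with open(filename, 'rb') as f:
        result = requests.post(PREDICT_URL, headers=headers, data=f.read()).json()
    # Keep only predictions above the confidence threshold; raising the
    # threshold increases precision at the expense of recall.
    for prediction in result['predictions']:
        if prediction['probability'] >= threshold:
            print('{}: {:.2f}'.format(prediction['tagName'], prediction['probability']))

classify_image('maltese.jpg')  # hypothetical test image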
Performance Tuning for Cloud APIs
A photograph taken by a modern cell phone can have a high resolution and be
upward of 4 MB in size.
Depending on the network quality, it can take a few seconds to upload such an
image to the service.
There are two ways to reduce the size of the image:
Resizing
Most CNNs take an input image with a size of 224 x 224 or 448 x 448 pixels. Much of a cell
phone photo’s resolution would be unnecessary for a CNN. It would make sense to
downsize the image prior to sending it over the network, instead of sending a large image
over the network and then downsizing it on the server.
Compression
Most image libraries perform lossy compression while saving a file. Even a little bit of
compression can go a long way in reducing the size of the image while minimally affecting
the quality of the image itself. Compression does introduce noise, but CNNs are usually
robust enough to deal with some of it.
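To make both effects concrete, here is a minimal sketch using the Pillow library; the filenames and the JPEG quality setting of 85 are assumptions.

import os
from PIL import Image  # pip install Pillow

img = Image.open('DogAndBaby.jpg')  # e.g., a multi-megabyte phone photo

# Resizing: shrink toward the CNN's expected input size before uploading;
# thumbnail() preserves the aspect ratio and resizes in place.
img.thumbnail((448, 448), Image.LANCZOS)

# Compression: save as lossy JPEG; quality=85 keeps most visual detail
img.save('DogAndBaby_small.jpg', format='JPEG', quality=85)

for path in ('DogAndBaby.jpg', 'DogAndBaby_small.jpg'):
    print('{}: {:.0f} KB'.format(path, os.path.getsize(path) / 1024))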