Realtime Machine Learning with PyTorch and Filestack

Richard HerbertTutorials, Working with Filestack0 Comments

Ai GIF - Find & Share on GIPHY

Only a few years after its name was coined, deep learning found itself at the forefront of the digital zeitgeist. Over the course of the past two decades, online services evolved into large-scale cloud platforms, while popular libraries like Tensorflow, Torch and Theano later made machine learning integration far simpler and more efficient. Algorithms deemed intractable prior to 2000 became the de facto standard for classification and regression tasks on large datasets. Now, deep learning is a staple for any company whose bread and butter is data.

However, deep learning requires access to thousands or millions of hand-labelled, highly-representative examples. Ian Goodfellow suggests in his book, Deep Learning, that to achieve merely “acceptable” results, a model needs at least 5,000 labeled samples per class.

This is not easy to come by, especially for new services and companies without a defined corner of the market. A company that must wait to collate hundreds of thousands of good, labelled examples before getting to work on their machine learning goals risks losing headway to their competitors. They also lose precious time training, important for deep models, which are known to require great amounts of time for hyper-parameter tuning.

Luckily, recent improvements in unsupervised learning and file uploading mean it’s easier than ever to build, implement and train deep models without labels or supervision. This article will show how to create a real-time, unsupervised deep autoencoder using PyTorch, Filestack, and perceptual loss.

Warning: I presume a basic level of machine learning understanding and experience

Getting the Data

There is no cloud data without file ingestion, so the first thing we need to do is build a simple webpage that allows users to upload a photo. Filestack offers an incredibly easy-to-integrate file uploader that lets users pick files from any of the major cloud storage providers, as well as locally from their computer. All you need is to do is sign up for the free plan and get an API key to embed it into your webpage.

Still though, as the common parlance goes, nothing on the frontend is easy, so do what you must to mentally prepare — download a mindfulness app, schedule some puppy therapy, spend a year meditating near the mountainous regions of Hakone — because this first step is a doozy.

    <script src=""></script>

        var client = filestack.init('YOUR_API_KEY_HERE')
            accept: ['image/jpeg', 'image/png'],
            fromSources: ['local_file_system', 'imagesearch', 'facebook', 'instagram', 'googledrive', 'dropbox', 'flickr', 'github', 'gmail', 'onedrive']

Just kidding. It’s a few lines of HTML and Javascript. All done. Now you have a fully functioning file uploader with access to myriad cloud providers, without spending dozens of hours connecting those services yourself.

As you can see from the code snippet, Filestack allows developers to easily specify restrictions on filetypes, either through mimetypes or extensions. We will therefore restrict our inputs to JPEGs and PNGs, good image types for convolutional learning and usage with PIL, the standard Python imaging library. There are multitudes of preprocessing options, and the picker lets users crop and edit their photos to their liking prior to uploading, which is handy if you need, say, images of cropped faces.

Then all we need to do is add some Javascript to send the image and wait for a response from the server (we will send back the autoencoded output).

var pickFile = function(options){
    client.pick(options).then(function(results) {
        var files = results.filesUploaded

        var fd = new FormData()
        fd.append('url', files[0].url)

        var xhr = new XMLHttpRequest()
        xhr.responseType = 'blob'

        xhr.addEventListener('load', function(response){
            var data =
            var image_field = document.getElementById('image')
            image_field.src = URL.createObjectURL(data)
        })'POST', 'http://localhost:5000')

With that out of the way, we can build a deep convolutional network.  


PyTorch is a relatively new machine learning framework that runs on Python, but retains the accessibility and speed of Torch. It also offers the graph-like model definitions that Theano and Tensorflow popularized, as well as the sequential-style definitions of Torch. You can find installation instructions here.

If you’ve never used PyTorch or any machine learning framework before, take a look at this tutorial, which goes over the basic operations and some simple models. There is also a tutorial made specifically for previous Torch users migrating to PyTorch. 


Our model will be an autoencoder using recent techniques in perceptual loss to gain a rich understanding of important visual features from a stream of images, all without the need for labels. This kind of unsupervised pre-training radically reduces the amount of samples and training time needed to reach convergence on classification and regression tasks.

Traditional autoencoders use an encoder-decoder format: an image (or any data you like) is compressed to a hidden state and fed into a decoder that transforms the data back to its original form. This allows for learning the salient visual features of images without human supervision. 

The problem with traditional autoencoders is they learn via pixel-by-pixel reconstruction loss, typically using mean-squared error regression. In other words, they attempt to learn lossless compression, which typically results in blurry, sometimes formless reconstructions that do not perform well. For a few years, this relegated autoencoders to being merely a fun exercise for beginner’s rather than a tool in the professional deep learning engineer’s toolbox.

However, using perceptual loss radically increases the performance of autoencoders and their ability to extract useful features.

Perceptual Loss

Prisma, the app that transforms your photos into something that looks like it could be showcased at a fine art museum, is largely based on an open-source paper, A Neural Algorithm of Artistic Style. In that paper, the authors created an algorithm that separately reconstructed the content and style of an image by sending it through a pretrained convolutional network, VGG19. The authors found that higher convolutional layers preserved the images’s content, while lower layers seemed to preserve its style. If you send two images through, they found, you can mix the content of one with the style of the other. 

Prisma thus takes one of your iPhone selfies and turns them into something akin to a Vincent van Gogh. This feature-based extraction is called perceptual loss, and it is how we will train our autoencoder.


The hope is that by building a traditional autoencoder, and then passing both its output and the original image through a pretrained network, and computing loss on the extracted features, we can teach the model to reconstruct not the exact image pixel-by-pixel, but an image that leads to the same extracted features. We want the general form of Tom Cruise’s face — male, skinny, brown hair, wide smile, the things you would use to describe him — not the exact flop of his hair, the precise background details, or the particular folds in his suit. 

To do this means we need a public, pretrained network like VGG, Inception or Alexnet, all of which come bundled with PyTorch.

Instantiating the Perceptual Loss Model

Following the general algorithm defined in, we will use the ReLU layer “relu2_2” of the publicly-available VGG19 deep convolutional network. We will pass both our autoencoder’s output, as well as the original training image, through this network before computing the error.

vgg = models.vgg19(pretrained=True)
class VGG(nn.Module):
    def __init__(self):
        super(VGG, self).__init__()
        self.main = nn.Sequential(
    def forward(self, x):
        return self.main(x)

Important to note is that we don’t want to change the gradients of the VGG network as we run our backpropagation, so we need to go through each VGG layer and add a flag that lets Autograd, the PyTorch differentiation module, know not to update those gradients.

for param in vgg.parameters():
    param.requires_grad = False

Building the Model

This article does not have the scope to provide an in-depth tutorial or explanation of convolutional neural networks. I would suggest Ian Goodfellow’s Deep Learning or the public website for the Stanford deep learning course to get yourself up to speed.

For the autoencoder, I used the deconvolutional architecture popularized by Chintala et al in their DCGAN network, but any kind of model that encodes to a hidden state and decodes back to the original image will work.

No matter what specific architecture you end up using, I would highly recommend adding batch normalization and leaky ReLU activations between convolutions. You should also use the hyperbolic tangent function (tanh) as the final activation layer of your decoder, instead of sigmoid.

self.encoder = nn.Sequential(
    nn.Conv2d(channels, ndf, 4, 2, 1, bias=False),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ndf * 2),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ndf * 4),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ndf * 8),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 8, ndf * 16, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ndf * 16),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 16, ndf * 32, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ndf * 32),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 32, z_dim, 4, 1, 0, bias=False),
self.decoder = nn.Sequential(
    nn.ConvTranspose2d(z_dim, ngf * 32, 4, 1, 0, bias=False),
    nn.BatchNorm2d(ngf * 32),
    nn.LeakyReLU(0.2, inplace=True),
    nn.ConvTranspose2d(ngf * 32, ngf * 16, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf * 16),
    nn.LeakyReLU(0.2, inplace=True),
    nn.ConvTranspose2d(ngf * 16, ngf * 8, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf * 8),
    nn.LeakyReLU(0.2, inplace=True),
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf * 4),
    nn.LeakyReLU(0.2, inplace=True),
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf * 2),
    nn.LeakyReLU(0.2, inplace=True),
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
    nn.LeakyReLU(0.2, inplace=True),
    nn.ConvTranspose2d(ngf, channels, 4, 2, 1, bias=False),

For optimization, I used ADAM without learning rate decay, but you are free to choose the gradient optimization algorithm of your choice. PyTorch’s optim package provides you with implementations of the most popular ones, as well as giving you direct access to the parameters with the model.parameters function, if you prefer a custom optimization method.

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

Integrating Filestack Into the Backend

The free-tier of Filestack stores our images for us and gives back a public URL, which we can then send to our Flask server and use at our leisure. This is perfect for our toy example, and you can always upgrade if you want to direct the files to your own S3 bucket or an alternative storage solution. For now, all we need to do is send the URL to our deep learning server on the backend, where our model lives, before redirecting it to wherever your user data ultimately goes.

def train_on_image(url):
    response = requests.get(url)
    image =

    image_tensor = loader(image).cuda()
    image_tensor = image_tensor.view(1, 3, 256, 256)

Using the loader, we can then turn the image into a tensor and preprocess it however we need using Torchvision, the PyTorch image transformation library.

loader = transforms.Compose([


Once we have the image tensor ready, we can pass it through either alone in real-time or as a member of a queued-up minibatch. 

target = Variable(image_tensor, volatile=True)
output = model(Variable(image_tensor))

Pretrained PyTorch models expect a certain kind of normalization for their inputs, so we must modify the outputs from our autoencoder using the mean and standard deviation declared here before sending it through the loss model. Note that these alterations must happen via PyTorch Variables so they can be stored in the differentiation graph. 

def normalize(x):
    """normalization for VGG input"""
    mean_tensor = torch.FloatTensor(
    mean_tensor[:, 0, :, :] = 0.485
    mean_tensor[:, 1, :, :] = 0.456
    mean_tensor[:, 2, :, :] = 0.406

    std_tensor = torch.FloatTensor(
    std_tensor[:, 0, :, :] = 0.229
    std_tensor[:, 1, :, :] = 0.224
    std_tensor[:, 2, :, :] = 0.225

    x = (x - Variable(mean_tensor)) / Variable(std_tensor)


Then, we can compute the mean-squared error (or the loss function of your choice) on the outputted perceptual features, and make the backwards call to update the model’s gradients and optimize them.

vgg_fake = vgg(output)
vgg_real = vgg(target)
loss = criterion(vgg_fake, Variable(, requires_grad=False))


I ran my model by sending images from the CelebA dataset through one-at-a-time and validating on a test set the network never encountered. It should be noted that certain celebrities may exist in both the test and training set (though in different photos), and celebrity images tend to over-represent certain facial characteristics/ages, meaning that this model performs less well on non-celebrity photos. 

Animated GIF  - Find & Share on GIPHY

As you can see, the result is not an exact reconstruction, but a (creepily) representative reconstruction, building a face that shares the broader features of the input without worrying about getting environmental or ephemeral details from the photo exactly right. Because of this, we can be confident that our realtime, pretrained model has already learned the salient features for any potential classification task in the future involving celebrity faces. All that’s left is to specify which specific features we would like to extract and run predictions on by modifying the last layer of the encoder and further training it on a small set of labelled examples, once they are available.


In summation, problems companies encounter when training models in realtime include immediate access to high-quality labels and access to a wide array of connected cloud services without time-consuming development. Filestack makes file uploading simple and streamlined, enabling quick deployment of data services, while an autoencoder that learns using perceptual loss removes the immediate need for supervised learning. This allows any company encountering data to start pretraining their model immediately, before a large, labelled dataset is available. It also possibly reduces the amount of samples needed for model convergence on classification and regression tasks.