Project Description

photosearch

I’ve been studying deep learning lately, and I was looking for a project to solidify all the reading I’ve been doing. It’s an incredibly promising technology that has made dramatic leaps in the past few years in the area of perception (object detection in images, speech recognition, etc.). This post simplifies the space a bit, but if you’re interested in learning more, there are plenty of great resources out there to get you ramped up; Facebook in particular has been making impressive progress recently. The theory behind how these deep networks operate is fascinating, but it’s certainly not something you can do justice to in a single blog post.

After toying around with a few ideas, I put together Photosearch, a mobile web app that lets you search your Facebook images using natural language. Facebook photos are organized into albums and tagged with people and places, but that isn’t really how we remember events. When was the last time you thought, “Give me the 11th photo in the album! That one was really cool”? Computers index data this way, but humans think much more associatively: “What was that really cool photo of me and Bill biking last year?” Photosearch lets you search for images exactly that way:

[Screenshots: natural-language searches in Photosearch]

I set up the server on a LEMP stack (Ubuntu Linux, nginx, MongoDB, and PHP) on a Dreamhost VPS. Once you log into the web app with Facebook, the server downloads all the photos you’re tagged in and analyzes them by passing them through an AlexNet implementation on Caffe. Caffe is a framework for configuring and running deep networks for image analysis. You can use it to show a computer a bunch of images of, say, footballs, and out comes a trained model that can detect footballs in any image you throw at it. It’s not perfectly accurate, but the technology is good enough to suggest tags for Facebook photos, recognize text from camera streams in Google Translate, and power Google’s photo search. In goes your Facebook photo, and out comes a list of objects, each with a probability indicating how sure the network is that it recognized that object.
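To give a feel for that step, here’s roughly what the classification looks like through Caffe’s Python interface. The app itself shells out to Caffe, and the file paths below follow Caffe’s model-zoo conventions rather than anything from the project, so treat this as an illustration of what the step produces:

import caffe
import numpy as np

# Standard Caffe model-zoo files for AlexNet; the paths are illustrative.
net = caffe.Classifier(
    'deploy.prototxt', 'bvlc_alexnet.caffemodel',
    mean=np.load('ilsvrc_2012_mean.npy').mean(1).mean(1),
    channel_swap=(2, 1, 0),   # Caffe expects BGR channel order
    raw_scale=255,
    image_dims=(256, 256))

probs = net.predict([caffe.io.load_image('photo.jpg')])[0]
labels = [line.strip() for line in open('synset_words.txt')]

# Top five guesses, highest probability first.
for i in probs.argsort()[::-1][:5]:
    print('%.3f  %s' % (probs[i], labels[i]))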

As you can imagine, downloading every Facebook photo someone is tagged in and analyzing each one individually takes a while. Since we can’t do all that work in the AJAX handler, it made sense to decouple the web UI from the background work: the initial AJAX call to the server sends a message over RabbitMQ to a daemon running on the same machine. The RabbitMQ consumer loads Facebook photos for the requested user in pages of 10, passes each photo into Caffe through shell_exec, and stores the object labels from Caffe in the database along with the people and place tags for each photo. A cron job kicks off every minute to make sure the consumer daemon is still running and restarts it if need be.
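In outline, the consumer has this shape. This is a Python/pika sketch of the idea, with an invented queue name and message format; the actual daemon presumably lived in the PHP codebase:

import json
import pika

def analyze_user_photos(user_id):
    # Placeholder for the real work: page through the user's tagged photos
    # 10 at a time, classify each with Caffe, and store the labels plus
    # people/place tags in MongoDB.
    print('analyzing photos for user %s' % user_id)

def on_message(channel, method, properties, body):
    request = json.loads(body)
    analyze_user_photos(request['user_id'])
    channel.basic_ack(delivery_tag=method.delivery_tag)  # ack only once the work is done

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='photo_analysis', durable=True)
channel.basic_qos(prefetch_count=1)  # the analysis is heavy; take one user at a time
channel.basic_consume(queue='photo_analysis', on_message_callback=on_message)
channel.start_consuming()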

This process took about 15 minutes to complete for me, so I had the RabbitMQ consumer email the user once their photos were ready. A nice side effect of the design was that you could check the intermediate status of the analysis from any device you were logged into Facebook on. Once the analysis was complete, the UI showed a single search bar and you could type in whatever you wanted to find. No multiple fields, no “Advanced Options”, none of that. Just a single search bar.

Once you issue a search, the AJAX handler sends the string off to a Python app built on NLTK, a natural language library that performs part-of-speech tagging on the search query. This analysis lets the search handler understand time frames, nouns, verbs, and more specific entities like people and places. That information is used to build a MongoDB query to find matching photos. Since there are multiple ways to ask for the same information, I added a thesaurus layer to map synonyms to each other: if the network detected a creek, then queries for “brook” or “stream” should also work. Understanding context is a difficult problem (is “June” a person or a month?). There’s an entire academic discipline devoted to solving it, but my goal was just to be accurate enough to give meaningful results, so I didn’t push much further.
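The heart of that Python app is only a few lines of NLTK. This is a sketch of the tagging step, not the app’s exact script:

import nltk

# One-time setup: nltk.download() for 'punkt', 'averaged_perceptron_tagger',
# 'maxent_ne_chunker', and 'words'.
query = 'with Lindsay and Erik at a waterfall'
tagged = nltk.pos_tag(nltk.word_tokenize(query))
# [('with', 'IN'), ('Lindsay', 'NNP'), ('and', 'CC'), ('Erik', 'NNP'), ...]
tree = nltk.ne_chunk(tagged)  # groups proper nouns into PERSON/GPE/... chunks
print(tree)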

A side note on the database technology: I picked MongoDB because of the variable nature of the image metadata. A photo could have no people in it, or five. You can handle this kind of data in a relational database like MySQL with junction tables, but it’s kind of a pain. In MongoDB you can store all the data for a photo in a single document, which makes querying easier.
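For instance, these two documents coexist without any schema gymnastics. The field names here are my guesses, modeled on the query dump later in this post:

from datetime import datetime
from pymongo import MongoClient

photos = MongoClient().photosearch.photos  # db/collection names invented

# A photo with nobody tagged in it...
photos.insert_one({
    'fbid': '123',
    'time': datetime(2015, 6, 14),
    'people': [],
    'objects': ['waterfall', 'cliff'],
})

# ...and one with five people live side by side in the same collection,
# no junction tables or schema migrations required.
photos.insert_one({
    'fbid': '456',
    'time': datetime(2015, 6, 20),
    'people': [{'name': n} for n in ('Lindsay', 'Erik', 'Bill', 'Anne', 'June')],
    'place': 'Watkins Glen',
    'objects': ['waterfall'],
})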

I didn’t make the app available to the world because the neural network and language processing took quite a bit of RAM to execute (~1 GB per user). I didn’t really have the time to optimize for memory usage (e.g., moving to a microservice framework and spreading the work across multiple machines), and I didn’t have the budget to dump the whole shebang on AWS and pay for the consequences, so I ended up just treating this as an exercise. Feel free to let me know if you have any interest in playing around with it and I’ll see what I can do to get you access.

Here are some more screenshots of the app in action:

[Screenshots: more Photosearch searches]

Let’s walk through a sample search to see what happens at each step. The query string “with Lindsay and Erik at a waterfall in June” is passed to the query handler.

The query handler validates the access token with Facebook to make sure the user is authenticated. Then the analysis begins by extracting time information: in this case, “in June” is expanded to timestamp limits:

array(4) {
  ["min_date"]=>
  string(19) "2015-06-01 00:00:00"
  ["max_date"]=>
  string(19) "2015-06-30 23:59:59"
  ["start_idx"]=>
  int(37)
  ["length"]=>
  int(7)
}

Since the time information has now been handled, it’s removed from the query string; the start index and length from the extraction mark the portion to cut. That leaves us with:

string(36) "with Lindsay and Erik at a waterfall"
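For the curious, here’s a sketch of that expansion in Python. The project’s handler is PHP, judging by the var_dump output, and it surely understands more than bare month names, but the mechanics match the dump above:

import calendar
import re

def extract_month(query, year=2015):
    # Find "in <month>" and expand it to date bounds plus the span to cut.
    m = re.search(r'\bin (January|February|March|April|May|June|July|'
                  r'August|September|October|November|December)\b', query)
    if m is None:
        return None, query
    month = list(calendar.month_name).index(m.group(1))
    last_day = calendar.monthrange(year, month)[1]
    bounds = {
        'min_date': '%d-%02d-01 00:00:00' % (year, month),
        'max_date': '%d-%02d-%02d 23:59:59' % (year, month, last_day),
        'start_idx': m.start(),
        'length': m.end() - m.start(),
    }
    # Remove the matched span so the rest of the pipeline never sees it.
    remainder = (query[:m.start()] + query[m.end():]).strip()
    return bounds, remainder

bounds, rest = extract_month('with Lindsay and Erik at a waterfall in June')
print(bounds)  # start_idx 37, length 7, June 1 through June 30
print(rest)    # 'with Lindsay and Erik at a waterfall'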

The remaining query string is passed into the Python script to be analyzed with NLTK. This step determines the part of speech of each word in the query and chunks proper nouns into named entities:

('with', 'IN')
[PERSON -> ('Lindsay', 'NNP')]
('and', 'CC')
[PERSON -> ('Erik', 'NNP')]
('at', 'IN')
('a', 'DT')
('waterfall', 'NN')

The abbreviations after each word indicate the part of speech. Here’s a quick cheat sheet for the previous example (full list, if you’re interested):

IN   Preposition
NNP  Proper noun, singular
CC   Coordinating conjunction
DT   Determiner
NN   Common noun, singular

As a side note, NLTK handles variations in names pretty well. For example, if I give a first and last name together, it realizes it’s still one name and combines the two into a single entity:

('with', 'IN')
[PERSON -> ('Lindsay', 'NNP') ('Smith', 'NNP')]
('and', 'CC')
[PERSON -> ('Erik', 'NNP')]
('at', 'IN')
('a', 'DT')
('waterfall', 'NN')

Once the parts of speech are known, the query parser builds arrays of the entities of interest, categorized as people, places, objects, and actions:

array(4) {
  ["people"]=>
  array(2) {
    [0]=>
    string(7) "Lindsay"
    [1]=>
    string(4) "Erik"
  }
  ["places"]=>
  array(0) {
  }
  ["objects"]=>
  array(1) {
    [0]=>
    string(9) "waterfall"
  }
  ["actions"]=>
  array(0) {
  }
}
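A Python sketch of that bucketing step, reusing NLTK’s chunk labels (the mapping from labels to buckets is my guess at the rules, not the post’s exact logic):

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def categorize(query):
    # Bucket the tagged query into the four entity lists the search uses.
    entities = {'people': [], 'places': [], 'objects': [], 'actions': []}
    for node in ne_chunk(pos_tag(word_tokenize(query))):
        if isinstance(node, Tree):                # a chunked named entity
            name = ' '.join(word for word, tag in node.leaves())
            if node.label() == 'PERSON':
                entities['people'].append(name)
            elif node.label() in ('GPE', 'LOCATION', 'FACILITY'):
                entities['places'].append(name)
        else:
            word, tag = node
            if tag.startswith('NN'):              # common nouns -> objects
                entities['objects'].append(word)
            elif tag.startswith('VB'):            # verbs -> actions
                entities['actions'].append(word)
    return entities

print(categorize('with Lindsay and Erik at a waterfall'))
# {'people': ['Lindsay', 'Erik'], 'places': [], 'objects': ['waterfall'], 'actions': []}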

From there, each object in the list is looked up in the thesaurus collection in MongoDB to find synonyms. Everything is then bundled into a search query that’s sent off to the photo collection:

array(1) {
  ["$and"]=>
  array(5) {
    [0]=>
    array(1) {
      ["fbuser"]=>
      string(17) "10153064890191401"
    }
    [1]=>
    array(1) {
      ["time"]=>
      array(2) {
        ["$gte"]=>
        object(MongoDate)#21 (2) {
          ["sec"]=>
          int(1433116800)
          ["usec"]=>
          int(0)
        }
        ["$lte"]=>
        object(MongoDate)#23 (2) {
          ["sec"]=>
          int(1435708799)
          ["usec"]=>
          int(0)
        }
      }
    }
    [2]=>
    array(1) {
      ["people.name"]=>
      object(MongoRegex)#28 (2) {
        ["regex"]=>
        string(11) ".*Lindsay.*"
        ["flags"]=>
        string(1) "i"
      }
    }
    [3]=>
    array(1) {
      ["people.name"]=>
      object(MongoRegex)#29 (2) {
        ["regex"]=>
        string(8) ".*Erik.*"
        ["flags"]=>
        string(1) "i"
      }
    }
    [4]=>
    array(1) {
      ["$or"]=>
      array(7) {
        [0]=>
        array(1) {
          ["objects"]=>
          object(MongoRegex)#32 (2) {
            ["regex"]=>
            string(13) ".*waterfall.*"
            ["flags"]=>
            string(1) "i"
          }
        }
        [1]=>
        array(1) {
          ["objects"]=>
          object(MongoRegex)#34 (2) {
            ["regex"]=>
            string(10) ".*geyser.*"
            ["flags"]=>
            string(1) "i"
          }
        }
        [2]=>
        array(1) {
          ["objects"]=>
          object(MongoRegex)#35 (2) {
            ["regex"]=>
            string(9) ".*river.*"
            ["flags"]=>
            string(1) "i"
          }
        }
        [3]=>
        array(1) {
          ["objects"]=>
          object(MongoRegex)#36 (2) {
            ["regex"]=>
            string(9) ".*creek.*"
            ["flags"]=>
            string(1) "i"
          }
        }
        [4]=>
        array(1) {
          ["objects"]=>
          object(MongoRegex)#37 (2) {
            ["regex"]=>
            string(9) ".*brook.*"
            ["flags"]=>
            string(1) "i"
          }
        }
        [5]=>
        array(1) {
          ["objects"]=>
          object(MongoRegex)#38 (2) {
            ["regex"]=>
            string(10) ".*stream.*"
            ["flags"]=>
            string(1) "i"
          }
        }
        [6]=>
        array(1) {
          ["objects"]=>
          object(MongoRegex)#39 (2) {
            ["regex"]=>
            string(10) ".*rapids.*"
            ["flags"]=>
            string(1) "i"
          }
        }
      }
    }
  }
}
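Rebuilt in Python/pymongo terms, the query construction looks roughly like this. The PHP app used the old Mongo driver, hence the MongoRegex/MongoDate objects in the dump above, and the thesaurus schema here is invented for illustration:

import re
from datetime import datetime
from pymongo import MongoClient

db = MongoClient().photosearch  # database name invented for the sketch

def build_query(fbuser, entities, time_bounds=None):
    clauses = [{'fbuser': fbuser}]
    if time_bounds:
        clauses.append({'time': {'$gte': time_bounds[0], '$lte': time_bounds[1]}})
    # Each person becomes a case-insensitive substring match on people.name.
    for person in entities['people']:
        clauses.append({'people.name': re.compile('.*%s.*' % person, re.I)})
    # Each object expands through the thesaurus into an $or of synonyms.
    for obj in entities['objects']:
        entry = db.thesaurus.find_one({'words': obj}) or {'words': [obj]}
        clauses.append({'$or': [
            {'objects': re.compile('.*%s.*' % word, re.I)}
            for word in entry['words']
        ]})
    return {'$and': clauses}

query = build_query(
    '10153064890191401',
    {'people': ['Lindsay', 'Erik'], 'places': [], 'objects': ['waterfall'], 'actions': []},
    (datetime(2015, 6, 1), datetime(2015, 6, 30, 23, 59, 59)),
)
matches = db.photos.find(query)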

And out comes a photo:

[Photo: a waterfall]