Computer Vision and NLP for Intelligent Image Processing

Moscow, the city of mystery... (Photo provided by Igor Shabalin)

1. Description

        These days, everyone with a smartphone has the potential to be a photographer. As a result, tons of new photos appear on social media, websites, blogs and personal photo libraries every day. While the process of taking pictures can be very exciting, sorting them out and manually describing each one afterwards can be tedious and time-consuming.

        This article discusses how computer vision (CV) and natural language processing (NLP) techniques can be combined to obtain a set of descriptive labels for photos, and then generate meaningful descriptions based on these labels, saving valuable time.

2. What's in the photo?

        We humans can answer that question in a split second once the photo is in our hands. Machines can answer it too, provided they are equipped with CV and NLP techniques. Please see the photo below:

How does your application know what's in the picture above? With a tool like Clarifai's Predict API, this can be a breeze. Here's a set of descriptive tags that the API gives you after processing the photo above:

‘straw’, ‘hay’, ‘pasture’, ‘wheat’, ‘cereal’, ‘rural’, ‘bale’, …

        As you can see, these tags give you appropriate information about what can be seen in the image. If all you need is to automatically categorize your visual content, having these tags should be enough to get your job done. However, for the task of image description generation, you need to go a step further and leverage some NLP techniques.
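
        For instance, here's a minimal categorization sketch that buckets photo files by their most confident tag. It assumes the what_is_photo() tagging function defined in section 4 below; categorize_photos() is an illustrative name, not part of any library:

# Minimal categorization sketch: bucket photo files by their most
# confident tag (what_is_photo() is defined in section 4 below)
def categorize_photos(photofilenames):
  buckets = {}
  for filename in photofilenames:
    top_tag = what_is_photo(filename).split(", ")[0]
    buckets.setdefault(top_tag, []).append(filename)
  return buckets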

        In this article, you'll see a simplified example of how this can be done, showing you how to weave some of the words from the resulting tag list into simple phrases. For a conceptual discussion of this topic, you might also want to check out my article on the Clarifai blog: Generating Image Descriptions Using Natural Language Processing.

3. Prerequisites

        To follow the scripts discussed in this article, the following software components are required:

Python 2.7+/3.4+

spaCy v2.0+

A pretrained English model for spaCy

Clarifai API Python client

Clarifai API key

        You can find installation instructions on the respective sites. You'll also need the wikipedia Python library, which lets you fetch and parse content from Wikipedia.
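
        As a quick sanity check once everything is installed, a sketch like the following should run without errors:

import clarifai
import spacy
import wikipedia

# Confirm the packages are importable and report the spaCy version
print('spaCy version:', spacy.__version__)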

4. Automatically tag photos

        First, let's take a look at the code you can use to automatically tag photos. In the implementation below, we use Clarifai's generic image recognition model to obtain descriptive labels for submitted photos.

from clarifai.rest import ClarifaiApp, Image

def what_is_photo(photofilename):
  # Authenticate and load Clarifai's general image recognition model
  app = ClarifaiApp(api_key='Your Clarifai API key here')
  model = app.models.get("general-v1.3")
  # Submit the photo and obtain the model's predictions
  image = Image(file_obj=open(photofilename, 'rb'))
  result = model.predict([image])
  # Collect the predicted concept names, skipping the 'no person' tag
  items = result['outputs'][0]['data']['concepts']
  tags = [item['name'] for item in items if item['name'] != 'no person']
  return ', '.join(tags)

        To test the above function, you can append the following main block to your script:

if __name__ == "__main__":
  # Tag the photo and keep the top seven concepts
  tag_list = what_is_photo("country.jpg").split(", ")
  print(tag_list[:7])

        In this particular example, we pick the top seven descriptive tags generated for the submitted photo. For the photo shown in the "What's in the photo?" section earlier, this script produced the following list of descriptive tags:

['straw', 'hay', 'pasture', 'wheat', 'cereal', 'rural', 'bale'] 

        This is sufficient for classification purposes and can be used as source data for NLP to generate meaningful descriptions, as described in the next section.

5. Use NLP to convert descriptive labels into descriptions

        They tell us in school that in order to master a language, you need to read a lot. In other words, you have to train on the best examples of the language. Going back to our discussion, we need some text that uses the words from the tag list. Of course, you could get a huge corpus, such as a Wikipedia database dump, that contains tons of different articles. However, in the age of AI-driven search, you can narrow your corpus down to just the text most relevant to the words in your tag list. The following code shows how to get the content of a single Wikipedia entry related to the list of tags (append this to the code in the main block from the previous section):

import wikipedia

# Build a search query from the top tags and fetch the best-matching page
query = " ".join(tag_list[:7])
wiki_resp = wikipedia.page(query)
print("Article url: ", wiki_resp.url)

        Now that you have some text data to process, it's time to put NLP to work. Here are the initial steps: initialize spaCy's text-processing pipeline and then apply it to the text (append this to the previous code snippet).

import spacy

# Load the pretrained English model and parse the article's content
nlp = spacy.load('en')
doc = nlp(wiki_resp.content)
# Check how many sentences the parsed text contains
print(len(list(doc.sents)))

        In the code below, you iterate through the sentences of the retrieved article, analyzing the grammatical dependencies in each sentence. In particular, you look for phrases that contain words from the tag list: within a phrase, two words from the list should be grammatically related through a head/child dependency. If you're confused about the terminology used here, I recommend checking out Natural Language Processing with Python, which explains NLP concepts in detail and includes many easy-to-follow examples. You can start reading right away: chapters 2 and 12 are free. Also, an example of how syntactic dependency analysis might be used in practice can be found in my recent article for Oracle Magazine, Generating Intents and Entities for Oracle Digital Assistant Skills.
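
        To make the head/child idea concrete, here's a tiny sketch that prints each token's syntactic head for the phrase this article ultimately produces. In a typical spaCy parse of "Hay in bales", 'in' attaches to 'Hay' and 'bales' attaches to 'in':

# Sketch: inspect head/child dependencies in a short phrase
example = nlp(u'Hay in bales')
for t in example:
  print(t.text, t.dep_, '<-', t.head.text)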

        Going back to the code below, note that it is a simplification; real-world code would be a bit more complex. (Append the code below to the previous code in the main script.)

# Collect candidate phrases: pairs and triples of tag words linked
# through head/child dependencies within a single sentence
x = []
for sent in doc.sents:
  # Two tag words directly related as child -> head
  pairs = [t for t in sent if t.lemma_ in tag_list[:7]
           and t.head.lemma_ in tag_list[:7]
           and t.head.lemma_ != t.lemma_]
  if pairs:
    t = pairs[0]
    # Order the two words by their position in the sentence
    y = sorted([(t.i, t), (t.head.i, t.head)], key=lambda tup: tup[0])
    x.append((y[0][1].text + ' ' + y[1][1].text, 2))
  # Two tag words related through an intermediate token (child -> head -> head)
  triples = [t for t in sent if t.lemma_ in tag_list[:7]
             and t.head.head.lemma_ in tag_list[:7]
             and t.head.lemma_ != t.lemma_
             and t.head.head.lemma_ != t.head.lemma_]
  if triples:
    t = triples[0]
    if t.i > t.head.i > t.head.head.i:
      y = [(t.head.head.i, t.head.head), (t.head.i, t.head), (t.i, t)]
      x.append((y[0][1].text + ' ' + y[1][1].text + ' ' + y[2][1].text, 3))
# Prefer the longer (three-word) phrases
x.sort(key=lambda tup: tup[1], reverse=True)
if x:
  print(x[0][0])

        This code gives me the following phrase for the photo shown in the "What's in the photo?" section earlier in this article:

Hay in bales 

        This looks like a relevant description for the photo.
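
        As a closing usage sketch, the steps above could be glued into a single helper. The names here are illustrative, not from any library: the sketch assumes what_is_photo() from section 4, the spaCy pipeline nlp loaded earlier, and that the phrase-extraction loop above has been factored into a best_phrase(doc, tags) function returning the top-ranked phrase:

import wikipedia

# Hypothetical end-to-end pipeline: photo file -> short description
def describe_photo(photofilename, top_n=7):
  tag_list = what_is_photo(photofilename).split(", ")[:top_n]
  page = wikipedia.page(" ".join(tag_list))
  doc = nlp(page.content)  # nlp is the spaCy pipeline loaded earlier
  return best_phrase(doc, tag_list)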
