Using Vision Framework for Text Detection in iOS 11

The Vision framework was one of the many powerful frameworks Apple released at this year’s WWDC. With it, you can easily bring computer vision techniques into your apps without any specialized background in the field. Vision lets your app perform a number of powerful tasks such as identifying faces and facial features (ex: smile, frown, left eyebrow, etc.), barcode detection, classifying scenes in images, object detection and tracking, and horizon detection.

If you have been programming in Swift for some time, you are probably wondering what the purpose of Vision is when Core Image and AVFoundation already exist. If we take a look at the table below, presented at WWDC, we can see that Vision is far more accurate and available on more platforms. However, it does require more processing time and power.

Difference between AVFoundation and Vision framework
Image credit: Apple’s WWDC video – Vision Framework: Building on Core ML

In this tutorial, we will be leveraging the Vision framework for text detection. We will build an app that can detect text regardless of its font, its color, or the object it appears on. As shown in the picture below, the Vision framework can detect text that is either printed or handwritten.

Text Recognition Demo App

To save you the time of building the app’s UI and let you focus on learning the Vision framework, download the starter project to begin with.

Please note that you will need Xcode 9 to complete the tutorial. You will also need a device that is running iOS 11 in order to test this tutorial. Also the code is written in Swift 4.

Creating a Live Stream

When you open the project, you will see that the views in the storyboard are already set up for you. Heading over to ViewController.swift, you will find the code skeleton with a couple of outlets and functions. Our first step is to create the live stream that will be used to detect text. Right under the imageView outlet, declare another property for AVCaptureSession:
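The declaration could look like the following minimal sketch. It assumes the starter project's `ViewController` class and its `imageView` outlet, and it shows only the members relevant to this step.

```swift
import UIKit
import AVFoundation

class ViewController: UIViewController {
    // Outlet assumed to be wired up in the starter project's storyboard.
    @IBOutlet weak var imageView: UIImageView!

    // Capture session that coordinates the flow of data from the camera.
    var session = AVCaptureSession()
}
```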

This initializes an AVCaptureSession object, which performs real-time or offline capture. It is used whenever you want to perform some actions based on a live stream. Next, we need to connect the session to our device. Start by adding the following function in ViewController.swift.
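As a hedged sketch of such a function: the name `startLiveVideo` and the `.photo` preset are assumptions, and the `guard` replaces force-unwrapping so that a missing camera (for example in the Simulator) does not crash the app. The numbered comments correspond to the three steps in the walkthrough.

```swift
import UIKit
import AVFoundation

class ViewController: UIViewController {
    @IBOutlet weak var imageView: UIImageView!
    var session = AVCaptureSession()

    // `startLiveVideo` is an assumed name for the function described in the text.
    func startLiveVideo() {
        // 1. Configure the session and ask for the default video camera.
        session.sessionPreset = .photo
        guard let captureDevice = AVCaptureDevice.default(for: .video),
              let deviceInput = try? AVCaptureDeviceInput(device: captureDevice) else {
            print("No usable camera found")
            return
        }

        // 2. Define the output format and attach input and output to the session.
        let deviceOutput = AVCaptureVideoDataOutput()
        deviceOutput.videoSettings = [
            kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_32BGRA)
        ]
        // Frames are delivered to the delegate on a background queue.
        deviceOutput.setSampleBufferDelegate(self, queue: DispatchQueue.global(qos: .default))
        session.addInput(deviceInput)
        session.addOutput(deviceOutput)

        // 3. Show the camera preview inside the image view and start the session.
        let previewLayer = AVCaptureVideoPreviewLayer(session: session)
        previewLayer.frame = imageView.bounds
        imageView.layer.addSublayer(previewLayer)
        session.startRunning()
    }
}

// Empty conformance so the sketch compiles; the delegate method comes later.
extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {}
```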

If you have worked with AVFoundation before, you will find most of this code familiar. If you haven’t, don’t worry. We’ll go through the code line by line.

  1. We begin by modifying the settings of our AVCaptureSession. Then, we set the AVMediaType to video, because we want a live stream that runs continuously.
  2. Next, we define the device input and output. The input is what the camera sees, and the output is the format in which the video frames are delivered. We want the frames delivered as kCVPixelFormatType_32BGRA, which is a pixel format. You can learn more about pixel format types here. Lastly, we add the input and output to the AVCaptureSession.
  3. Finally, we add a sublayer containing the video preview to the imageView and get the session running.

Call this function in the viewWillAppear method:
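A minimal sketch of that call, with `startLiveVideo` as an assumed name and stubbed out so the snippet stands on its own:

```swift
import UIKit

class ViewController: UIViewController {
    // Stub standing in for the capture-session setup function (name assumed).
    func startLiveVideo() { /* session setup goes here */ }

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        startLiveVideo()
    }
}
```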

Since the bounds of the image view are not yet finalized in viewWillAppear(), override the viewDidLayoutSubviews() method to update the layer’s bounds:
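One way to write this override is sketched below. It assumes the video preview layer was the first sublayer added to `imageView`, which is an assumption about the setup code rather than something the framework guarantees.

```swift
import UIKit

class ViewController: UIViewController {
    @IBOutlet weak var imageView: UIImageView!

    override func viewDidLayoutSubviews() {
        super.viewDidLayoutSubviews()
        // Assumes the preview layer was added first, so it is the first sublayer.
        if let previewLayer = imageView.layer.sublayers?.first {
            previewLayer.frame = imageView.bounds
        }
    }
}
```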

Before you give it a run, add an entry in Info.plist (the NSCameraUsageDescription key, shown in Xcode as “Privacy - Camera Usage Description”) to provide a reason why you need to use the camera. This has been required by Apple since the release of iOS 10.


The live stream should work as expected. However, there is no text detection going on because we haven’t implemented the Vision framework yet. This is what we will do next.

Implementing Text Detection

Before we implement the text detection part, we need to understand how the Vision framework works. Basically, there are 3 pieces to implementing Vision in your app, which are:

  • Requests – A request describes what you want the framework to detect for you.
  • Handlers – A handler performs the request on an image and lets you run code once the request has completed, that is, it “handles” the request.
  • Observations – Observations are the results the framework gives back to you, which you can then act on.

Now to start, let’s begin with a request. Right under the initialization of the variable session, declare another variable as follows:
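The declaration might look like this sketch, which assumes the `session` property from the earlier step and shows only the relevant members:

```swift
import UIKit
import AVFoundation
import Vision

class ViewController: UIViewController {
    var session = AVCaptureSession()

    // Vision requests to run on every camera frame.
    var requests = [VNRequest]()
}
```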

We have created an array that will contain a generic VNRequest. Next, let’s create the function that will start the text detection in the ViewController class.
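A hedged sketch of this function follows. The name `startTextDetection` is an assumption. The empty `detectTextHandler` is a placeholder added only so the snippet compiles on its own; in the tutorial that handler is written in the next step.

```swift
import UIKit
import Vision

class ViewController: UIViewController {
    var requests = [VNRequest]()

    // `startTextDetection` is an assumed name for the function described in the text.
    func startTextDetection() {
        // Ask Vision for rectangles containing text; call the handler when done.
        let textRequest = VNDetectTextRectanglesRequest(completionHandler: detectTextHandler)
        // Also report a box for each individual character.
        textRequest.reportCharacterBoxes = true
        requests = [textRequest]
    }

    // Placeholder so this sketch compiles; the real handler is defined next.
    func detectTextHandler(request: VNRequest, error: Error?) {}
}
```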

In this function, we create a constant textRequest that is a VNDetectTextRectanglesRequest. Basically it is just a specific type of VNRequest that only looks for rectangles with some text in them. When the framework has completed this request, we want it to call the function detectTextHandler. We also want to know exactly what the framework has recognized, which is why we set the property reportCharacterBoxes to true. Finally, we set the variable requests created earlier to an array containing textRequest.

Now, at this point you should get an error. This is because we have not defined the function that is supposed to handle the request. To get rid of the error, create the function like this:
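A first version of the handler could be sketched as follows. Using `compactMap` (Swift 4.1 or later) to drop results that are not text observations is a choice made here, and the final `print` is only there to make the result visible.

```swift
import UIKit
import Vision

class ViewController: UIViewController {
    func detectTextHandler(request: VNRequest, error: Error?) {
        // All results produced by the text rectangles request.
        guard let observations = request.results else {
            print("no result")
            return
        }

        // Keep only the results that are text observations.
        let result = observations.compactMap { $0 as? VNTextObservation }
        print("Detected \(result.count) text region(s)")
    }
}
```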

In the code above, we begin by defining a constant observations which will contain all the results of our VNDetectTextRectanglesRequest. Next, we define another constant named result which will go through all the results of the request and transform them into the type of VNTextObservation.

Now update the viewWillAppear() method:
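For instance, assuming the two setup functions are named `startLiveVideo` and `startTextDetection` (both names are assumptions, stubbed here so the snippet is self-contained):

```swift
import UIKit

class ViewController: UIViewController {
    // Stubs standing in for the functions built in the earlier steps.
    func startLiveVideo() {}
    func startTextDetection() {}

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        startLiveVideo()
        startTextDetection()
    }
}
```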

If you run your app now, you won’t see any difference. This is because while we told the VNDetectTextRectanglesRequest to report the character boxes, we never did anything with the boxes it reports. This is what we’ll accomplish next.

Drawing the Boxes

In our app, we’ll draw 2 kinds of boxes: one around each letter the framework detects and one around each word. Let’s start by creating the function for each word.
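A possible shape for the word function is sketched below. The name `highlightWord`, the red border, and the min/max approach to merging the character boxes are assumptions. Vision reports normalized coordinates with the origin at the bottom left, so the sketch scales them to the view and flips the y-axis.

```swift
import UIKit
import Vision

class ViewController: UIViewController {
    @IBOutlet weak var imageView: UIImageView!

    // `highlightWord` is an assumed name for the function described in the text.
    func highlightWord(box: VNTextObservation) {
        guard let boxes = box.characterBoxes, !boxes.isEmpty else { return }

        // Collect the extreme corners of all character boxes (normalized 0...1).
        let xs = boxes.flatMap { [$0.bottomLeft.x, $0.bottomRight.x] }
        let ys = boxes.flatMap { [$0.bottomLeft.y, $0.topLeft.y] }
        guard let minX = xs.min(), let maxX = xs.max(),
              let minY = ys.min(), let maxY = ys.max() else { return }

        // Scale to the view and flip the y-axis (UIKit's origin is at the top left).
        let size = imageView.frame.size
        let outline = CALayer()
        outline.frame = CGRect(x: minX * size.width,
                               y: (1 - maxY) * size.height,
                               width: (maxX - minX) * size.width,
                               height: (maxY - minY) * size.height)
        outline.borderWidth = 2.0
        outline.borderColor = UIColor.red.cgColor
        imageView.layer.addSublayer(outline)
    }
}
```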

In this function we begin by defining a constant named boxes which is a combination of all the characterBoxes our request has found. Then, we define some points on our view to help us position our boxes. Finally, we create a CALayer with the given constraints defined and apply it to our imageView. Next, let’s create the boxes for each letter.
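The per-letter function might be sketched like this. The name `highlightLetters` and the blue border are assumptions; the coordinate conversion again scales Vision's normalized, bottom-left-origin coordinates to the view.

```swift
import UIKit
import Vision

class ViewController: UIViewController {
    @IBOutlet weak var imageView: UIImageView!

    // `highlightLetters` is an assumed name for the function described in the text.
    func highlightLetters(box: VNRectangleObservation) {
        let size = imageView.frame.size

        // Convert the character box from normalized coordinates to view coordinates.
        let outline = CALayer()
        outline.frame = CGRect(x: box.topLeft.x * size.width,
                               y: (1 - box.topLeft.y) * size.height,
                               width: (box.topRight.x - box.topLeft.x) * size.width,
                               height: (box.topLeft.y - box.bottomLeft.y) * size.height)
        outline.borderWidth = 1.0
        outline.borderColor = UIColor.blue.cgColor
        imageView.layer.addSublayer(outline)
    }
}
```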

Similar to the code we wrote earlier, we use the VNRectangleObservation to define the constraints that make outlining the box easier. Now we have all our functions laid out. The final step is connecting all the dots.

Connecting the Dots

There are 2 main dots to connect. The first is to connect the box-drawing functions to the “handle” function of our request. Let’s do that first. Update the detectTextHandler method like this:
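A hedged sketch of the updated handler follows. The drawing functions are stubbed under assumed names so the snippet compiles on its own, and the cleanup step assumes the video preview layer is the first sublayer of `imageView`. Because `compactMap` already discards results that are not text observations, no separate nil check is needed inside the loop.

```swift
import UIKit
import Vision

class ViewController: UIViewController {
    @IBOutlet weak var imageView: UIImageView!

    // Stubs standing in for the drawing functions (names assumed).
    func highlightWord(box: VNTextObservation) {}
    func highlightLetters(box: VNRectangleObservation) {}

    func detectTextHandler(request: VNRequest, error: Error?) {
        guard let observations = request.results else {
            print("no result")
            return
        }
        let result = observations.compactMap { $0 as? VNTextObservation }

        // Layer changes have to happen on the main queue.
        DispatchQueue.main.async {
            // Remove the boxes drawn for the previous frame, keeping the
            // first sublayer (assumed to be the video preview).
            self.imageView.layer.sublayers?.dropFirst().forEach { $0.removeFromSuperlayer() }

            for region in result {
                // One box around the whole word...
                self.highlightWord(box: region)

                // ...and one around each character, if any were reported.
                if let boxes = region.characterBoxes {
                    for characterBox in boxes {
                        self.highlightLetters(box: characterBox)
                    }
                }
            }
        }
    }
}
```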

We begin by having the code run asynchronously on the main queue. First, we remove the box layers drawn for the previous frame from our imageView while keeping the video preview layer (if you noticed, we were adding a lot of layers to our imageView). Next, we check to see if a region exists within the results from our VNTextObservation. If it does, we call the function which draws a box around the region, or as we defined it, the word. Then, we check to see if there are character boxes within the region. If there are, we call the function which draws a box around each letter.

Now the last step in connecting the dots is to run our Vision code with the live stream. Each video frame arrives as a CMSampleBuffer, and we need to hand its image data to Vision. In the extension of ViewController.swift insert the following code:
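The delegate method could be sketched as follows. The `.right` orientation is an assumption that matches a device held in portrait with the back camera, and the labeled `CMGetAttachment(_:key:attachmentModeOut:)` spelling is the Swift 4.2 and later form (earlier toolchains spell the same call without argument labels). The minimal `ViewController` is included only so the snippet compiles on its own.

```swift
import UIKit
import AVFoundation
import Vision

class ViewController: UIViewController {
    // Populated by the text detection setup from the earlier step.
    var requests = [VNRequest]()
}

extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Skip frames that carry no image data.
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

        // Pass the camera intrinsics along to Vision when they are available.
        var requestOptions: [VNImageOption: Any] = [:]
        if let cameraData = CMGetAttachment(sampleBuffer,
                                            key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
                                            attachmentModeOut: nil) {
            requestOptions = [.cameraIntrinsics: cameraData]
        }

        // Run the stored requests on this frame.
        let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                                        orientation: .right,
                                                        options: requestOptions)
        do {
            try imageRequestHandler.perform(requests)
        } catch {
            print(error)
        }
    }
}
```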

Hang in there! This is the last part of the code. The extension adopts the AVCaptureVideoDataOutputSampleBufferDelegate protocol. The function first checks that the CMSampleBuffer delivered by the AVCaptureOutput actually contains an image buffer. Next, we create a variable requestOptions, which is a dictionary keyed by VNImageOption. VNImageOption is a structure whose keys describe extra properties and data from the camera. Finally, we create a VNImageRequestHandler object and perform the text request that we created earlier.

Build and run the app and see what you get!



Well, that was a big one! Try testing the app on different fonts, sizes, objects, lighting, etc. See if you can expand upon this app. Post how you’ve expanded this project in the comments below. You can also go beyond by combining Vision with Core ML. For more information on Core ML, check out my introductory tutorial on Core ML.

For reference, you can refer to the complete Xcode project on GitHub.

For more details about the Vision framework, you can refer to the official Vision framework documentation. You can also refer to Apple’s sessions on the Vision framework during WWDC 2017:

Vision Framework: Building on Core ML

Advances in Core Image: Filters, Metal, Vision, and More

  • Jong Hwan Kim

    hi great post : )

    does vision framework provide actual OCR capability? like instead of just detecting a portion of image is a text, can it actually return a ‘text’ string? if this isn’t embedded, I’m wondering whats the use for the framework..

    • Sai Kambampati

      I’m not sure if it can actually return a ‘text’ string, but that’s definitely something I can look into. Coming to the use of the Vision framework, text detection isn’t the only possibility. You can also use it for facial recognition (i.e. detecting key features in a face to create, for example, something like Snapchat’s filters), barcode detection and a lot more as listed above.

      • Josh Kneedler

        any luck finding a sane solution to getting the ‘text’ string? i wonder why apple hit this wall and didn’t follow through?

        • Sai Kambampati

          Unfortunately, I have not found a way. However here are 2 alternatives.
          1) If you want to recognize the text based off of handwriting, I encourage you to check out Apple’s NLP API’s. Here is the WWDC 2017 video -> https://developer.apple.com/videos/play/wwdc2017/208/
          2) If you want to recognize text from a picture, I would suggest using a Core ML model to do so. You can see some of my Core ML tutorials for a better understanding.
          Unfortunately, that’s all I have. I will update you guys if there are any developments. Thanks!

    • Josh Braun

      AFAIK, it cannot actually return the text detected, just the rectangle/region that was detected. Whatever one might say about Vision and its usefulness, a glaring omission in most of the tutorials demonstrating Vision text detection is that they tend to gloss over this point. At this point in time you’ll need to use an OCR lib or a CoreML compatible ocr model (e.g., a trained Tesseract model can do this).

      • Jong Hwan Kim

        thanks for the reply. should either go with OCR library or deep-train a local text.

  • Tony Merritt

    can we get it to snapshot a photo once a rectangle of text is detected to use in an ocr?

  • Gabriele Quatela

    Is it possible to detect the text and draw the box inside a view that is on the camera?

  • JoshD

    Can you please explain why CGImagePropertyOrientation needs to be “right” when initialising the VNImageRequestHandler?

  • Hackt

    Hey Sai,
    I am trying to build this on the iPhone X, but I keep getting this error https://uploads.disquscdn.com/images/f03197516c6f9816820c1ca54aa26af40f1a66188e62f9aef0ba0a92c4ee84d9.png
    It is happening on this line:
    let deviceInput = try! AVCaptureDeviceInput(device: captureDevice!)
    its because captureDevice is null
    I tried changing out this:
    let captureDevice = AVCaptureDevice.default(for: AVMediaType.video)
    to this:
    let captureDevice = AVCaptureDevice.default(.builtInWideAngleCamera, for: AVMediaType.video, position: .back)
    but had the same result, any ideas?

  • Francesco Maddaloni

    Hi, I added a button for saving the image, but opening another videostream make the image darker, do u know how I can save the image from an external button?

  • shimon rubin

    Hi, Thank you very much for the info!
    In the beginning of this Article you wrote: “(ex: smile, frown, left eyebrow, etc.)”
    Are you sure about the smile thing? Is it really possible to get a “isSmiling” boolean or so from vision?
    If so, do you know how?

  • Manolo Suarez

    Great article, I follow “Appcoda” since the beginning, happy with all their books, and the way you guys teach step-by-step. I have a time calculator already embed in an app, I insert the times to sum or subtract, I was wondering if I can capture the times with the iPhone camera in a screen or paper and make the calculations without need to insert manually.