Using Vision Framework for Text Detection in iOS

Among many of the powerful frameworks Apple released at this year’s WWDC, the Vision framework was one of them. With the Vision framework, you can easily implement computer vision techniques into your apps with no higher knowledge at all! With Vision, you can have your app perform a number of powerful tasks such as identifying faces and facial features (ex: smile, frown, left eyebrow, etc.), barcode detection, classifying scenes in images, object detection and tracking, and horizon detection.

Now for those of you who have been programming in Swift for some time are probably wondering, what is the purpose of Vision when there is Core Image and AVFoundation? If we take a look at the table below presented in WWDC, we can see that Vision is far more accurate and available on more platforms. However it does require more processing time and power.

Difference between AVFoundation and Vision framework

Image credit: Apple’s WWDC video – Vision Framework: Building on Core ML

In this tutorial, we will be leveraging the Vision framework for text detection. We will build an app that will be able to detect text regardless of the font, object, and color. As shown in the picture below, the Vision framework can recognize text that are both printed and hand-written.

Text Recognition Demo App

To save you time from building the UI of the app and focus on learning the Vision framework, download the starter project to begin with.

Please note that you will need Xcode 9 to complete the tutorial. You will also need a device that is running iOS 11 in order to test this tutorial. Also the code is written in Swift 4.

Creating a Live Stream

When you open the project, you see that the views in the storyboard are all ready and set up for you. Heading over to ViewController.swift, you will find the code skeleton with a couple of outlets and funcions. Our first step is to create the live stream that will be used to detect text. Right under the imageView outlet, declare another property for AVCaptureSession:

var session = AVCaptureSession()

This initalizes an object of AVCaptureSession that performs a real-time or offline capture. It is used whenever you want to perform some actions based on a live stream. Next, we need to connect the session to our device. Start by adding the following function in ViewController.swift.

func startLiveVideo() {
    //1
    session.sessionPreset = AVCaptureSession.Preset.photo
    let captureDevice = AVCaptureDevice.default(for: AVMediaType.video)
    
    //2
    let deviceInput = try! AVCaptureDeviceInput(device: captureDevice!)
    let deviceOutput = AVCaptureVideoDataOutput()
    deviceOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_32BGRA)]
    deviceOutput.setSampleBufferDelegate(self, queue: DispatchQueue.global(qos: DispatchQoS.QoSClass.default))
    session.addInput(deviceInput)
    session.addOutput(deviceOutput)
       
    //3
    let imageLayer = AVCaptureVideoPreviewLayer(session: session)
    imageLayer.frame = imageView.bounds
    imageView.layer.addSublayer(imageLayer)
        
    session.startRunning()
}

If you have worked with AVFoundation before, you will find most of this code familiar. If you haven’t, don’t worry. We’ll go thorough the code line-by-line.

We begin by modifying the settings of our AVCaptureSession. Then, we set the AVMediaType as video because we want a live stream so it should always be continuously running.
Next, we define the device input and output. The input is what the camera is seeing and the output is what the video should appear as. We want the video to appear as a kCVPixelFormatType_32BGRA which is a type of video format. You can learn more about pixel format types here. Lastly, we add the input and output to the AVCaptureSession.
Finally, we add a sublayer containing the video preview to the imageView and get the session running.

Call this function in the viewWillAppear method:

override func viewWillAppear(_ animated: Bool) {
    startLiveVideo()
}

Since the bounds of the image view is not yet finalized in viewWillAppear(), override the viewDidLayoutSubviews() method to update the layer’s bound:

override func viewDidLayoutSubviews() {
    imageView.layer.sublayers?[0].frame = imageView.bounds
}

Before you give it a run, add an entry in Info.plist to provide a reason why you need to use the camera. This is required by Apple since the release of iOS 10:

text-detection-infoplist

The live stream should work as expected. However, there is no text detection going on because we haven’t implemented the Vision framework yet. This is what we will do next.

Implementing Text Detection

Before we implement the text detection part, we need to understand how the Vision framework works. Basically, there are 3 steps to implement Vision in your app, whic are:

Requests – Requests are when you request the framework to detect something for you.
Handlers – Handlers are when you want the framework to perform something after the request is made or “handle” the request.
Observations – Observations are what you want to do with the data provided with you.

request-observation

Now to start, let’s begin with a request. Right under the initialization of the variable session, declare another variable as follows:

var requests = [VNRequest]()

We have created an array that will contain a generic VNRequest. Next, let’s create the function that will start the text detection in the ViewController class.

func startTextDetection() {
    let textRequest = VNDetectTextRectanglesRequest(completionHandler: self.detectTextHandler)
    textRequest.reportCharacterBoxes = true
    self.requests = [textRequest]
}

In this function, we create a constant textRequest that is a VNDetectTextRectanglesRequest. Basically it is just a specific type of VNRequest that only looks for rectangles with some text in them. When the framework has completed this request, we want it to call the function detectTextHandler. We also want to know exactly what the framework has recognized which is why we set the property reportCharacterBoxes equal to true. Finally, we set the variable requests created earlier to textRequest.

Now, at this point you should get an error. This is because we have not defined the function that is supposed to handle the request. To get rid of the error, create the function like this:

func detectTextHandler(request: VNRequest, error: Error?) {
    guard let observations = request.results else {
        print("no result")
        return
    }
        
    let result = observations.map({$0 as? VNTextObservation})
}

In the code above, we begin by defining a constant observations which will contain all the results of our VNDetectTextRectanglesRequest. Next, we define another constant named result which will go through all the results of the request and transform them into the type of VNTextObservation.

Now update the viewWillAppear() method:

override func viewWillAppear(_ animated: Bool) {
    startLiveVideo()
    startTextDetection()
}

If you run your app now, you won’t see any difference. This is because while we told the VNDetectTextRectanglesRequest to report the character boxes, we never told it how to do so. This is what we’ll accomplish next.

Drawing the Boxes

In our app, we’ll have the framework to draw 2 boxes: one for each letter it detects and the other one for each word. Let’s start by creating the function for each word.

func highlightWord(box: VNTextObservation) {
    guard let boxes = box.characterBoxes else {
        return
    }
        
    var maxX: CGFloat = 9999.0
    var minX: CGFloat = 0.0
    var maxY: CGFloat = 9999.0
    var minY: CGFloat = 0.0
        
    for char in boxes {
        if char.bottomLeft.x < maxX {
            maxX = char.bottomLeft.x
        }
        if char.bottomRight.x > minX {
            minX = char.bottomRight.x
        }
        if char.bottomRight.y < maxY {
            maxY = char.bottomRight.y
        }
        if char.topRight.y > minY {
            minY = char.topRight.y
        }
    }
        
    let xCord = maxX * imageView.frame.size.width
    let yCord = (1 - minY) * imageView.frame.size.height
    let width = (minX - maxX) * imageView.frame.size.width
    let height = (minY - maxY) * imageView.frame.size.height
        
    let outline = CALayer()
    outline.frame = CGRect(x: xCord, y: yCord, width: width, height: height)
    outline.borderWidth = 2.0
    outline.borderColor = UIColor.red.cgColor
        
    imageView.layer.addSublayer(outline)
}

In this function we begin by defining a constant named boxes which is a combination of all the characterBoxes our request has found. Then, we define some points on our view to help us position our boxes. Finally, we create a CALayer with the given constraints defined and apply it to our imageView. Next, let’s create the boxes for each letter.

func highlightLetters(box: VNRectangleObservation) {
    let xCord = box.topLeft.x * imageView.frame.size.width
    let yCord = (1 - box.topLeft.y) * imageView.frame.size.height
    let width = (box.topRight.x - box.bottomLeft.x) * imageView.frame.size.width
    let height = (box.topLeft.y - box.bottomLeft.y) * imageView.frame.size.height
        
    let outline = CALayer()
    outline.frame = CGRect(x: xCord, y: yCord, width: width, height: height)
    outline.borderWidth = 1.0
    outline.borderColor = UIColor.blue.cgColor
    
    imageView.layer.addSublayer(outline)
}

Similar to the code we wrote earlier, we use the VNRectangleObservation to define our constraints that will make outlining the box easier. Now, we have all our function laid out. The final step is connecting all the dots.

Connecting the Dots

There are 2 main dots to connect. The first thing is the boxes to the “handle” function of our request. Let’s do that first. Update the detectTextHandler method like this:

func detectTextHandler(request: VNRequest, error: Error?) {
    guard let observations = request.results else {
        print("no result")
        return
    }
    
    let result = observations.map({$0 as? VNTextObservation})
    
    DispatchQueue.main.async() {
        self.imageView.layer.sublayers?.removeSubrange(1...)
        for region in result {
            guard let rg = region else {
                continue
            }
            
            self.highlightWord(box: rg)
            
            if let boxes = region?.characterBoxes {
                for characterBox in boxes {
                    self.highlightLetters(box: characterBox)
                }
            }
        }
    }
}

We begin by having the code run asynchronously. First, we remove the bottommost layer in our imageView (if you noticed, we were adding a lot of layers to our imageView). Next, we check to see if a region exists within the results from our VNTextObservation. Now, we call in our function which draws a box around the region, or as we defined it, the word. Then, we check to see if there are character boxes within the region. If there are, we call in the function which draws a box around each letter.

Now the last step in connecting the dots is to run our Vision code with the live stream. We need to take the video output and convert it into a CMSampleBuffer. In the extension of ViewController.swift insert the following code:

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
        return
    }
        
    var requestOptions:[VNImageOption : Any] = [:]
        
    if let camData = CMGetAttachment(sampleBuffer, kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix, nil) {
        requestOptions = [.cameraIntrinsics:camData]
    }
        
    let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: 6, options: requestOptions)
        
    do {
        try imageRequestHandler.perform(self.requests)
    } catch {
        print(error)
    }
}

Hang in there! It’s our last part of the code. The extension adopts the AVCaptureVideoDataOutputSampleBufferDelegate protocol. Basically what this function does is that it checks if the CMSampleBuffer exists and is giving an AVCaptureOutput. Next, we create a variable requestOptions which is a dictionary for the type VNImageOption. VNImageOption is a type of structure that can hold the properties and data from the camera. Finally we create a VNImageRequestHandler object and perform the text request that we create earlier.

Build and run the app and see what you get!

text-detection-example

Conclusion

Well, that was a big one! Try testing the app on different fonts, sizes, objects, lighting, etc. See if you can expand upon this app. Post how you’ve expanded this project in the comments below. You can also go beyond by combining Vision with Core ML. For more information on Core ML, check out my introductory tutorial on Core ML.

For reference, you can refer to the complete Xcode project on GitHub.

For more details about the Vision framework, you can refer to the official Vision framework documentation. You an also refer to Apple’s sessions on the Vision framework during WWDC 2017:

Vision Framework: Building on Core ML

Advances in Core Image: Filters, Metal, Vision, and More