The Vision framework was one of the many powerful frameworks Apple released at this year’s WWDC. With Vision, you can easily bring computer vision techniques into your apps without any advanced knowledge. Vision lets your app perform a number of powerful tasks, such as identifying faces and facial features (e.g. smile, frown, left eyebrow), barcode detection, scene classification, object detection and tracking, and horizon detection.
Those of you who have been programming in Swift for some time are probably wondering: what is the purpose of Vision when Core Image and AVFoundation already exist? If we take a look at the table below, presented at WWDC, we can see that Vision is far more accurate and available on more platforms. However, it does require more processing time and power.

In this tutorial, we will be leveraging the Vision framework for text detection. We will build an app that can detect text regardless of the font, object, and color. As shown in the picture below, the Vision framework can recognize text that is both printed and handwritten.

To save you time building the UI of the app so you can focus on learning the Vision framework, download the starter project to begin with.
Please note that you will need Xcode 9 to complete this tutorial, as well as a device running iOS 11 to test it. The code is written in Swift 4.
Creating a Live Stream
When you open the project, you will see that the views in the storyboard are all ready and set up for you. Heading over to `ViewController.swift`, you will find the code skeleton with a couple of outlets and functions. Our first step is to create the live stream that will be used to detect text. Right under the `imageView` outlet, declare another property for `AVCaptureSession`:
```swift
var session = AVCaptureSession()
```
This initializes an `AVCaptureSession` object that performs a real-time or offline capture. It is used whenever you want to perform some action based on a live stream. Next, we need to connect the session to our device. Start by adding the following function in `ViewController.swift`:
```swift
func startLiveVideo() {
    //1
    session.sessionPreset = AVCaptureSession.Preset.photo
    let captureDevice = AVCaptureDevice.default(for: AVMediaType.video)

    //2
    let deviceInput = try! AVCaptureDeviceInput(device: captureDevice!)
    let deviceOutput = AVCaptureVideoDataOutput()
    deviceOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_32BGRA)]
    deviceOutput.setSampleBufferDelegate(self, queue: DispatchQueue.global(qos: DispatchQoS.QoSClass.default))
    session.addInput(deviceInput)
    session.addOutput(deviceOutput)

    //3
    let imageLayer = AVCaptureVideoPreviewLayer(session: session)
    imageLayer.frame = imageView.bounds
    imageView.layer.addSublayer(imageLayer)

    session.startRunning()
}
```
If you have worked with `AVFoundation` before, you will find most of this code familiar. If you haven’t, don’t worry. We’ll go through the code line by line.
- We begin by modifying the settings of our `AVCaptureSession`. Then, we set the `AVMediaType` to video because we want a live stream, so it should always be running continuously.
- Next, we define the device input and output. The input is what the camera is seeing, and the output is the format the video frames should be delivered in. We want the video to come in as `kCVPixelFormatType_32BGRA`, which is a type of pixel format. You can learn more about pixel format types here. Lastly, we add the input and output to the `AVCaptureSession`. Note that both `captureDevice` and the `AVCaptureDeviceInput` are force-unwrapped here; a more defensive variant is sketched right after this list.
- Finally, we add a sublayer containing the video preview to the `imageView` and get the session running.
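As a quick aside, `AVCaptureDevice.default(for:)` returns `nil` and the `try!` will crash when no camera is available (in the Simulator, for example). If you prefer not to force-unwrap, here is a minimal sketch of a more defensive body for `startLiveVideo()`, using the same `session` and `imageView` properties; the force-unwrapped version above works fine on a real device.

```swift
func startLiveVideo() {
    session.sessionPreset = AVCaptureSession.Preset.photo

    // Bail out gracefully instead of crashing when no camera is available.
    guard let captureDevice = AVCaptureDevice.default(for: AVMediaType.video),
          let deviceInput = try? AVCaptureDeviceInput(device: captureDevice) else {
        print("No usable camera found")
        return
    }

    let deviceOutput = AVCaptureVideoDataOutput()
    deviceOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_32BGRA)]
    deviceOutput.setSampleBufferDelegate(self, queue: DispatchQueue.global(qos: .default))

    session.addInput(deviceInput)
    session.addOutput(deviceOutput)

    let imageLayer = AVCaptureVideoPreviewLayer(session: session)
    imageLayer.frame = imageView.bounds
    imageView.layer.addSublayer(imageLayer)

    session.startRunning()
}
```

Either version works for the rest of the tutorial.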
Call this function in the `viewWillAppear` method:
```swift
override func viewWillAppear(_ animated: Bool) {
    startLiveVideo()
}
```
Since the bounds of the image view are not yet finalized in `viewWillAppear()`, override the `viewDidLayoutSubviews()` method to update the layer’s bounds:
```swift
override func viewDidLayoutSubviews() {
    imageView.layer.sublayers?[0].frame = imageView.bounds
}
```
Before you give it a run, add an entry to Info.plist that provides a reason why you need to use the camera: the `NSCameraUsageDescription` key, which appears as “Privacy – Camera Usage Description” in the property list editor. Apple has required this since the release of iOS 10:

The live stream should work as expected. However, there is no text detection going on because we haven’t implemented the Vision framework yet. This is what we will do next.
Implementing Text Detection
Before we implement the text detection part, we need to understand how the Vision framework works. Basically, there are three steps to implementing Vision in your app, which are (a short sketch illustrating all three follows the list):
- Requests – Requests are when you ask the framework to detect something for you.
- Handlers – Handlers are when you want the framework to perform something after the request is made, or to “handle” the request.
- Observations – Observations are the results the framework hands back, and what you do with that data.
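To see how these three pieces fit together before we wire them into the live stream, here is a minimal sketch that runs the same kind of text request against a single `UIImage`. The function name `detectTextRectangles(in:)` is just for illustration; it is not part of the starter project.

```swift
import UIKit
import Vision

// A minimal sketch of the request -> handler -> observation flow on a still image.
func detectTextRectangles(in image: UIImage) {
    guard let cgImage = image.cgImage else { return }

    // 1. Request: ask Vision to find regions that contain text.
    let textRequest = VNDetectTextRectanglesRequest { request, error in
        // 3. Observations: the results tell us where text was found (normalized coordinates).
        guard let observations = request.results as? [VNTextObservation] else { return }
        for observation in observations {
            print("Found a text region at \(observation.boundingBox)")
        }
    }
    textRequest.reportCharacterBoxes = true

    // 2. Handler: perform the request on a specific image.
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([textRequest])
}
```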

Now to start, let’s begin with a request. Right under the initialization of the variable `session`, declare another variable as follows:
```swift
var requests = [VNRequest]()
```
We have created an array that will contain generic `VNRequest`s. Next, let’s create the function that will start the text detection in the `ViewController` class.
```swift
func startTextDetection() {
    let textRequest = VNDetectTextRectanglesRequest(completionHandler: self.detectTextHandler)
    textRequest.reportCharacterBoxes = true
    self.requests = [textRequest]
}
```
In this function, we create a constant `textRequest` that is a `VNDetectTextRectanglesRequest`. Basically, it is just a specific type of `VNRequest` that only looks for rectangles containing some text. When the framework has completed this request, we want it to call the function `detectTextHandler`. We also want to know exactly which characters the framework has recognized, which is why we set the property `reportCharacterBoxes` to `true`. Finally, we set the `requests` variable created earlier to an array containing `textRequest`.
Now, at this point you should get an error. This is because we have not defined the function that is supposed to handle the request. To get rid of the error, create the function like this:
```swift
func detectTextHandler(request: VNRequest, error: Error?) {
    guard let observations = request.results else {
        print("no result")
        return
    }

    let result = observations.map({$0 as? VNTextObservation})
}
```
In the code above, we begin by defining a constant `observations`, which will contain all the results of our `VNDetectTextRectanglesRequest`. Next, we define another constant named `result`, which goes through all the results of the request and casts them to the type `VNTextObservation`.
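As a small optional tweak, you could filter out anything that is not a `VNTextObservation` right here, so the drawing loop we add later would not need to unwrap optionals. This assumes Swift 4.1 or later, where `flatMap` on sequences was renamed `compactMap`; on Swift 4.0 (Xcode 9.0–9.2), use `flatMap` instead.

```swift
// Keep only the text observations and drop any failed casts.
let result = observations.compactMap { $0 as? VNTextObservation }
```

The rest of the tutorial sticks with the `map` version shown above.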
Now update the `viewWillAppear()` method:
```swift
override func viewWillAppear(_ animated: Bool) {
    startLiveVideo()
    startTextDetection()
}
```
If you run your app now, you won’t see any difference. This is because, while we told the `VNDetectTextRectanglesRequest` to report the character boxes, we never did anything with the boxes it reports, such as drawing them on screen. That’s what we’ll accomplish next.
Drawing the Boxes
In our app, we’ll have the framework draw two boxes: one around each letter it detects and one around each word. Let’s start by creating the function for each word.
```swift
func highlightWord(box: VNTextObservation) {
    guard let boxes = box.characterBoxes else {
        return
    }

    // Find the extremes of all character boxes to get one rectangle around the whole word.
    // Vision coordinates are normalized (0...1) with the origin at the bottom-left.
    var minX: CGFloat = 9999.0
    var maxX: CGFloat = 0.0
    var minY: CGFloat = 9999.0
    var maxY: CGFloat = 0.0

    for char in boxes {
        if char.bottomLeft.x < minX {
            minX = char.bottomLeft.x
        }
        if char.bottomRight.x > maxX {
            maxX = char.bottomRight.x
        }
        if char.bottomRight.y < minY {
            minY = char.bottomRight.y
        }
        if char.topRight.y > maxY {
            maxY = char.topRight.y
        }
    }

    // Convert the normalized rectangle into the image view's coordinate space
    // (UIKit's origin is at the top-left, hence the 1 - maxY flip).
    let xCord = minX * imageView.frame.size.width
    let yCord = (1 - maxY) * imageView.frame.size.height
    let width = (maxX - minX) * imageView.frame.size.width
    let height = (maxY - minY) * imageView.frame.size.height

    let outline = CALayer()
    outline.frame = CGRect(x: xCord, y: yCord, width: width, height: height)
    outline.borderWidth = 2.0
    outline.borderColor = UIColor.red.cgColor

    imageView.layer.addSublayer(outline)
}
```
In this function, we begin by defining a constant named `boxes`, which combines all the `characterBoxes` our request has found. Then, we compute the extreme points of those boxes to position our outline on the view. Finally, we create a `CALayer` with the resulting frame and add it to our `imageView`. Next, let’s create the boxes for each letter.
```swift
func highlightLetters(box: VNRectangleObservation) {
    let xCord = box.topLeft.x * imageView.frame.size.width
    let yCord = (1 - box.topLeft.y) * imageView.frame.size.height
    let width = (box.topRight.x - box.bottomLeft.x) * imageView.frame.size.width
    let height = (box.topLeft.y - box.bottomLeft.y) * imageView.frame.size.height

    let outline = CALayer()
    outline.frame = CGRect(x: xCord, y: yCord, width: width, height: height)
    outline.borderWidth = 1.0
    outline.borderColor = UIColor.blue.cgColor

    imageView.layer.addSublayer(outline)
}
```
Similar to the code we wrote earlier, we use the `VNRectangleObservation` to define the constraints that make outlining the box easier. Now we have all our functions laid out. The final step is connecting all the dots.
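Before we do that, one optional refinement: both highlight functions repeat the same conversion from Vision’s normalized, bottom-left-origin coordinates to UIKit’s top-left-origin view coordinates. If you prefer, you could factor that math into a small helper like the sketch below (`viewRect(forNormalizedRect:in:)` is a name introduced here for illustration, not something from the starter project):

```swift
// Convert a normalized Vision rectangle (origin bottom-left, values 0...1)
// into a CGRect in the given view's coordinate space (origin top-left, points).
func viewRect(forNormalizedRect normalizedRect: CGRect, in view: UIView) -> CGRect {
    let size = view.frame.size
    return CGRect(x: normalizedRect.minX * size.width,
                  y: (1 - normalizedRect.maxY) * size.height,
                  width: normalizedRect.width * size.width,
                  height: normalizedRect.height * size.height)
}
```

For example, `highlightLetters(box:)` could build its outline frame with `viewRect(forNormalizedRect: box.boundingBox, in: imageView)`, which is roughly equivalent since the detected rectangles are close to axis-aligned.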
Connecting the Dots
There are two main dots to connect. The first is connecting the box-drawing functions to the handler of our request. Let’s do that first. Update the `detectTextHandler` method like this:
```swift
func detectTextHandler(request: VNRequest, error: Error?) {
    guard let observations = request.results else {
        print("no result")
        return
    }

    let result = observations.map({$0 as? VNTextObservation})

    DispatchQueue.main.async() {
        self.imageView.layer.sublayers?.removeSubrange(1...)
        for region in result {
            guard let rg = region else {
                continue
            }

            self.highlightWord(box: rg)

            if let boxes = region?.characterBoxes {
                for characterBox in boxes {
                    self.highlightLetters(box: characterBox)
                }
            }
        }
    }
}
```
We begin by having the drawing code run asynchronously on the main queue. First, we remove every sublayer except the first one, which is the video preview layer; this clears out the boxes drawn for the previous frame (if you noticed, we keep adding layers to our `imageView`). Next, we check whether a region exists within the results from our `VNTextObservation`. If it does, we call the function that draws a box around the region, or as we defined it, the word. Then, we check whether there are character boxes within the region. If there are, we call the function that draws a box around each letter.
Now the last step in connecting the dots is to run our Vision code on the live stream. Each frame of the video output arrives as a `CMSampleBuffer`, which we need to hand over to Vision. In the extension in `ViewController.swift`, insert the following code:
```swift
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
        return
    }

    var requestOptions: [VNImageOption: Any] = [:]

    if let camData = CMGetAttachment(sampleBuffer, kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix, nil) {
        requestOptions = [.cameraIntrinsics: camData]
    }

    // .right (raw value 6) tells Vision the pixel buffer is rotated, which matches
    // the camera's native orientation when the device is held in portrait.
    let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .right, options: requestOptions)

    do {
        try imageRequestHandler.perform(self.requests)
    } catch {
        print(error)
    }
}
```
Hang in there! It’s our last part of the code. The extension adopts the `AVCaptureVideoDataOutputSampleBufferDelegate` protocol, so this method is called for every frame the `AVCaptureOutput` delivers. Basically, what this function does is check that it can pull a pixel buffer out of the `CMSampleBuffer`. Next, we create a variable `requestOptions`, which is a dictionary keyed by `VNImageOption`. `VNImageOption` is a structure whose keys let you pass properties and data from the camera, such as the intrinsic matrix, along with the image. Finally, we create a `VNImageRequestHandler` object and perform the text request that we created earlier. Note that the handler is given an orientation of `.right` (raw value 6), because the camera delivers buffers rotated relative to portrait orientation.
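In case you are wondering where this method lives: the starter project’s extension conforms `ViewController` to the delegate protocol, which is what makes AVFoundation call `captureOutput(_:didOutput:from:)` for every frame. Assuming the usual setup, it looks roughly like this:

```swift
// The delegate conformance lives in an extension in ViewController.swift.
// Conforming to AVCaptureVideoDataOutputSampleBufferDelegate is what makes
// AVFoundation deliver each captured frame to captureOutput(_:didOutput:from:).
extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    // captureOutput(_:didOutput:from:) from above goes here.
}
```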
Build and run the app and see what you get!

Conclusion
Well, that was a big one! Try testing the app with different fonts, sizes, objects, lighting conditions, etc. See if you can expand upon this app, and post how you’ve expanded this project in the comments below. You can also go further by combining Vision with Core ML. For more information on Core ML, check out my introductory tutorial on Core ML.
For reference, you can refer to the complete Xcode project on GitHub.
For more details about the Vision framework, you can refer to the official Vision framework documentation. You can also refer to Apple’s sessions on the Vision framework during WWDC 2017:
Vision Framework: Building on Core ML
Advances in Core Image: Filters, Metal, Vision, and More
Comments
Jong Hwan Kim
hi great post : )
does the Vision framework provide actual OCR capability? Like, instead of just detecting that a portion of an image is text, can it actually return a ‘text’ string? If this isn’t embedded, I’m wondering what the use of the framework is.
Sai Kambampati
I’m not sure if it can actually return a ‘text’ string, but that’s definitely something I can look into. As for the use of the Vision framework, text detection isn’t the only possibility. You can also use it for facial recognition (i.e. detecting key features in a face to create, for example, something like Snapchat’s filters), barcode detection, and a lot more, as listed above.
Josh Kneedler
Any luck finding a sane solution to getting the ‘text’ string? I wonder why Apple hit this wall and didn’t follow through?
Sai Kambampati
Unfortunately, I have not found a way. However, here are two alternatives.
1) If you want to recognize text based on handwriting, I encourage you to check out Apple’s NLP APIs. Here is the WWDC 2017 video -> https://developer.apple.com/videos/play/wwdc2017/208/
2) If you want to recognize text from a picture, I would suggest using a Core ML model to do so. You can see some of my Core ML tutorials for a better understanding.
Unfortunately, that’s all I have. I will update you guys if there are any developments. Thanks!
Josh Braun
AFAIK, it cannot actually return the text detected, just the rectangle/region that was detected. Whatever one might say about Vision and its usefulness, a glaring omission in most of the tutorials demonstrating Vision text detection is that they tend to gloss over this point. At this point in time you’ll need to use an OCR lib or a CoreML compatible ocr model (e.g., a trained Tesseract model can do this).
Jong Hwan Kim
Thanks for the reply. Should either go with an OCR library or deep-train a local text.
Tony Merritt
Can we get it to snapshot a photo once a rectangle of text is detected, to use in an OCR?
Wei Keat Ho
Try Tesseract
Gabriele Quatela
Is it possible to detect the text and draw the box inside a view that is on the camera?
JoshD
Can you please explain why CGImagePropertyOrientation needs to be “right” when initialising the VNImageRequestHandler?
Hackt
Hey Sai,
I am trying to build this on the iPhone X, but I keep getting this error https://uploads.disquscdn.com/images/f03197516c6f9816820c1ca54aa26af40f1a66188e62f9aef0ba0a92c4ee84d9.png
It is happening on this line:
let deviceInput = try! AVCaptureDeviceInput(device: captureDevice!)
It’s because captureDevice is nil.
I tried changing out this:
let captureDevice = AVCaptureDevice.default(for: AVMediaType.video)
to this:
let captureDevice = AVCaptureDevice.default(.builtInWideAngleCamera, for: AVMediaType.video, position: .back)
but had the same result, any ideas?
Hackt
I have also tried to change the format inside the line deviceOutput.videoSettings, but without any luck.
Sai Kambampati
This should help you out! https://stackoverflow.com/questions/40318481/avcapturevideodataoutput-setvideosettings-unsupported-pixel-format-type-in-ios1/42240045#42240045
The code is in Obj-C but you can use an online converter to make sense of it.
Hackt
Now I am getting this error https://uploads.disquscdn.com/images/74c00e5b7cfe007c4d886b863dd0321d743fff2e45bf7e22827419e68b722ceb.png
I had the other error when I was trying different devices, but I went back to the iPhone 8 Plus, like you set up the project with, and now captureDevice does not get defined, so when it unwraps it, it is nil.
Francesco Maddaloni
Hi, I added a button for saving the image, but opening another video stream makes the image darker. Do you know how I can save the image from an external button?
shimon rubin
Hi, thank you very much for the info!
In the beginning of this Article you wrote: “(ex: smile, frown, left eyebrow, etc.)”
Are you sure about the smile thing? Is it really possible to get a “isSmiling” boolean or so from vision?
If so, do you know how?
Manolo Suarez
Great article. I have followed AppCoda since the beginning and am happy with all the books and the way you guys teach step by step. I have a time calculator already embedded in an app, where I insert the times to sum or subtract. I was wondering if I could capture the times with the iPhone camera from a screen or paper and make the calculations without needing to insert them manually.