You can use Firebase ML to recognize text in images. Firebase ML has
both a general-purpose API suitable for recognizing text in images, such as
the text of a street sign, and an API optimized for recognizing the text of
documents.
Before you begin
If you have not already added Firebase to your app, do so by following the steps in the getting started guide.
Use Swift Package Manager to install and manage Firebase dependencies.
- In Xcode, with your app project open, navigate to File > Add Packages.
- When prompted, add the Firebase Apple platforms SDK repository:
https://github.com/firebase/firebase-ios-sdk.git
- Choose the Firebase ML library.
- Add the -ObjC flag to the Other Linker Flags section of your target's build settings.
- When finished, Xcode will automatically begin resolving and downloading your dependencies in the background.
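If you manage dependencies with a Package.swift manifest instead of the Xcode UI, the equivalent declaration might look like the following sketch (the version requirement and the package/target names are assumptions; adjust them to your project):
Swift
// swift-tools-version:5.7
// Sketch of a Package.swift manifest; the version pin below is an assumption.
import PackageDescription

let package = Package(
    name: "MyApp",
    platforms: [.iOS(.v13)],
    dependencies: [
        // The Firebase Apple platforms SDK repository from the step above.
        .package(url: "https://github.com/firebase/firebase-ios-sdk.git", from: "10.0.0"),
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [
                // The Firebase ML library chosen above.
                .product(name: "FirebaseMLModelDownloader", package: "firebase-ios-sdk"),
            ]
        ),
    ]
)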
Next, perform some in-app setup:
- In your app, import Firebase:
Swift
import FirebaseMLModelDownloader
Objective-C
@import FirebaseMLModelDownloader;
- If you have not already enabled Cloud-based APIs for your project, do so now:
- Open the Firebase ML APIs page of the Firebase console.
- If you have not already upgraded your project to the Blaze pricing plan, click Upgrade to do so. (You will be prompted to upgrade only if your project isn't on the Blaze plan.) Only Blaze-level projects can use Cloud-based APIs.
- If Cloud-based APIs aren't already enabled, click Enable Cloud-based APIs.
Now you are ready to start recognizing text in images.
Input image guidelines
- For Firebase ML to accurately recognize text, input images must contain text that is represented by sufficient pixel data. Ideally, for Latin text, each character should be at least 16x16 pixels; for Chinese, Japanese, and Korean text, each character should be 24x24 pixels. For all languages, there is generally no accuracy benefit to characters larger than 24x24 pixels.
So, for example, a 640x480 image might work well to scan a business card that occupies the full width of the image. To scan a document printed on letter-sized paper, a 720x1280 pixel image might be required. (A sketch that downscales oversized captures with these limits in mind follows this list.)
- Poor image focus can hurt text recognition accuracy. If you aren't getting acceptable results, try asking the user to recapture the image.
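Building on those guidelines, here is a minimal sketch of a helper that downscales an oversized UIImage before upload. The maxDimension default and the helper itself are illustrative assumptions, not part of the Firebase ML API:
Swift
import UIKit

/// Illustrative helper (not part of Firebase ML): scales an image down so its
/// longer side is at most `maxDimension` pixels. Per the guidelines above,
/// resolution beyond ~24x24 px per character rarely improves accuracy, so
/// shrinking very large captures mainly saves upload time and bandwidth.
func scaledForTextRecognition(_ image: UIImage, maxDimension: CGFloat = 1280) -> UIImage {
    let longestSide = max(image.size.width, image.size.height)
    guard longestSide > maxDimension else { return image }
    let scale = maxDimension / longestSide
    let newSize = CGSize(width: image.size.width * scale,
                         height: image.size.height * scale)
    let renderer = UIGraphicsImageRenderer(size: newSize)
    return renderer.image { _ in
        image.draw(in: CGRect(origin: .zero, size: newSize))
    }
}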
Recognize text in images
To recognize text in an image, run the text recognizer as described
below.
1. Run the text recognizer
Pass the image as a UIImage or a CMSampleBufferRef to the VisionTextRecognizer's process(_:completion:) method:
- Get an instance of VisionTextRecognizer by calling cloudTextRecognizer:
Swift
let vision = Vision.vision()
let textRecognizer = vision.cloudTextRecognizer()

// Or, to provide language hints to assist with language detection:
// See https://cloud.google.com/vision/docs/languages for supported languages
// let options = VisionCloudTextRecognizerOptions()
// options.languageHints = ["en", "hi"]
// let textRecognizer = vision.cloudTextRecognizer(options: options)
Objective-C
FIRVision *vision = [FIRVision vision];
FIRVisionTextRecognizer *textRecognizer = [vision cloudTextRecognizer];

// Or, to provide language hints to assist with language detection:
// See https://cloud.google.com/vision/docs/languages for supported languages
// FIRVisionCloudTextRecognizerOptions *options =
//     [[FIRVisionCloudTextRecognizerOptions alloc] init];
// options.languageHints = @[@"en", @"hi"];
// FIRVisionTextRecognizer *textRecognizer = [vision cloudTextRecognizerWithOptions:options];
- In order to call Cloud Vision, the image must be formatted as a base64-encoded string. (When you use the recognizer shown here, the SDK handles this encoding and upload for you when you pass it a VisionImage; manual encoding is only needed if you call the Cloud Vision REST API directly.) To base64-encode a UIImage yourself:
Swift
guard let imageData = uiImage.jpegData(compressionQuality: 1.0) else { return }
let base64encodedImage = imageData.base64EncodedString()
Objective-C
NSData *imageData = UIImageJPEGRepresentation(uiImage, 1.0f);
NSString *base64encodedImage =
    [imageData base64EncodedStringWithOptions:NSDataBase64Encoding76CharacterLineLength];
- Then, wrap the image in a VisionImage object and pass it to the process(_:completion:) method:
Swift
let visionImage = VisionImage(image: uiImage)
textRecognizer.process(visionImage) { result, error in
    guard error == nil, let result = result else {
        // ...
        return
    }
    // Recognized text
}
Objective-C
FIRVisionImage *visionImage = [[FIRVisionImage alloc] initWithImage:uiImage];
[textRecognizer processImage:visionImage
                  completion:^(FIRVisionText *_Nullable result,
                               NSError *_Nullable error) {
    if (error != nil || result == nil) {
      // ...
      return;
    }
    // Recognized text
}];
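If your app uses Swift concurrency, you can wrap this completion-based call in an async function. A minimal sketch, assuming the types shown above (the recognizeText name and the fallback error are illustrative, not Firebase API):
Swift
// Illustrative async wrapper around process(_:completion:); not part of the SDK.
func recognizeText(in visionImage: VisionImage,
                   using textRecognizer: VisionTextRecognizer) async throws -> VisionText {
    try await withCheckedThrowingContinuation { continuation in
        textRecognizer.process(visionImage) { result, error in
            if let error = error {
                continuation.resume(throwing: error)
            } else if let result = result {
                continuation.resume(returning: result)
            } else {
                // Neither result nor error: surface a generic failure (assumed domain/code).
                continuation.resume(throwing: NSError(domain: "TextRecognition", code: -1))
            }
        }
    }
}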
2. Extract text from blocks of recognized text
If the text recognition operation succeeds, it returns a VisionText object. A VisionText object contains the full text recognized in the image and zero or more VisionTextBlock objects.
Each VisionTextBlock represents a rectangular block of text, which contains zero or more VisionTextLine objects. Each VisionTextLine object contains zero or more VisionTextElement objects, which represent words and word-like entities (dates, numbers, and so on).
For each VisionTextBlock, VisionTextLine, and VisionTextElement object, you can get the text recognized in the region and the bounding coordinates of the region.
For example:
Swift
let resultText = result.text
for block in result.blocks {
    let blockText = block.text
    let blockConfidence = block.confidence
    let blockLanguages = block.recognizedLanguages
    let blockCornerPoints = block.cornerPoints
    let blockFrame = block.frame
    for line in block.lines {
        let lineText = line.text
        let lineConfidence = line.confidence
        let lineLanguages = line.recognizedLanguages
        let lineCornerPoints = line.cornerPoints
        let lineFrame = line.frame
        for element in line.elements {
            let elementText = element.text
            let elementConfidence = element.confidence
            let elementLanguages = element.recognizedLanguages
            let elementCornerPoints = element.cornerPoints
            let elementFrame = element.frame
        }
    }
}
Objective-C
NSString *resultText = result.text;
for (FIRVisionTextBlock *block in result.blocks) {
    NSString *blockText = block.text;
    NSNumber *blockConfidence = block.confidence;
    NSArray<FIRVisionTextRecognizedLanguage *> *blockLanguages = block.recognizedLanguages;
    NSArray<NSValue *> *blockCornerPoints = block.cornerPoints;
    CGRect blockFrame = block.frame;
    for (FIRVisionTextLine *line in block.lines) {
        NSString *lineText = line.text;
        NSNumber *lineConfidence = line.confidence;
        NSArray<FIRVisionTextRecognizedLanguage *> *lineLanguages = line.recognizedLanguages;
        NSArray<NSValue *> *lineCornerPoints = line.cornerPoints;
        CGRect lineFrame = line.frame;
        for (FIRVisionTextElement *element in line.elements) {
            NSString *elementText = element.text;
            NSNumber *elementConfidence = element.confidence;
            NSArray<FIRVisionTextRecognizedLanguage *> *elementLanguages = element.recognizedLanguages;
            NSArray<NSValue *> *elementCornerPoints = element.cornerPoints;
            CGRect elementFrame = element.frame;
        }
    }
}
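As one use of these regions, here is a minimal sketch that draws each block's frame over the original image, assuming the frames are in the image's coordinate space (drawTextFrames is an illustrative helper, not Firebase API):
Swift
import UIKit

// Illustrative helper (not part of Firebase ML): overlays bounding boxes on an image.
func drawTextFrames(_ frames: [CGRect], on image: UIImage) -> UIImage {
    let renderer = UIGraphicsImageRenderer(size: image.size)
    return renderer.image { context in
        image.draw(at: .zero)
        context.cgContext.setStrokeColor(UIColor.red.cgColor)
        context.cgContext.setLineWidth(2)
        for frame in frames {
            context.cgContext.stroke(frame)
        }
    }
}

// Usage with a VisionText result:
// let annotated = drawTextFrames(result.blocks.map { $0.frame }, on: uiImage)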
Recognize text in images of documents
To recognize the text of a document, configure and run the document text recognizer as described below.
The document text recognition API, described below, provides an interface that is intended to be more convenient for working with images of documents. However, if you prefer the interface provided by the sparse text API, you can use it instead to scan documents by configuring the cloud text recognizer to use the dense text model.
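A minimal sketch of that configuration, assuming a modelType option on VisionCloudTextRecognizerOptions (verify the exact property name against the SDK reference for your version):
Swift
// Sketch: configure the sparse text recognizer to use the dense text model.
// Assumes VisionCloudTextRecognizerOptions exposes a modelType property.
let vision = Vision.vision()
let options = VisionCloudTextRecognizerOptions()
options.modelType = .dense
let denseTextRecognizer = vision.cloudTextRecognizer(options: options)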
To use the document text recognition API:
1. Run the text recognizer
Pass the image as a UIImage or a CMSampleBufferRef to the VisionDocumentTextRecognizer's process(_:completion:) method:
- Get an instance of VisionDocumentTextRecognizer by calling cloudDocumentTextRecognizer:
Swift
let vision = Vision.vision()
let textRecognizer = vision.cloudDocumentTextRecognizer()

// Or, to provide language hints to assist with language detection:
// See https://cloud.google.com/vision/docs/languages for supported languages
// let options = VisionCloudDocumentTextRecognizerOptions()
// options.languageHints = ["en", "hi"]
// let textRecognizer = vision.cloudDocumentTextRecognizer(options: options)
Objective-C
FIRVision *vision = [FIRVision vision];
FIRVisionDocumentTextRecognizer *textRecognizer = [vision cloudDocumentTextRecognizer];

// Or, to provide language hints to assist with language detection:
// See https://cloud.google.com/vision/docs/languages for supported languages
// FIRVisionCloudDocumentTextRecognizerOptions *options =
//     [[FIRVisionCloudDocumentTextRecognizerOptions alloc] init];
// options.languageHints = @[@"en", @"hi"];
// FIRVisionDocumentTextRecognizer *textRecognizer = [vision cloudDocumentTextRecognizerWithOptions:options];
- As before, Cloud Vision requires the image as a base64-encoded string; the SDK handles this when you pass it a VisionImage, so manual encoding only applies to direct calls to the Cloud Vision REST API. To base64-encode a UIImage yourself:
Swift
guard let imageData = uiImage.jpegData(compressionQuality: 1.0) else { return }
let base64encodedImage = imageData.base64EncodedString()
Objective-C
NSData *imageData = UIImageJPEGRepresentation(uiImage, 1.0f);
NSString *base64encodedImage =
    [imageData base64EncodedStringWithOptions:NSDataBase64Encoding76CharacterLineLength];
- Then, wrap the image in a VisionImage object and pass it to the process(_:completion:) method:
Swift
let visionImage = VisionImage(image: uiImage)
textRecognizer.process(visionImage) { result, error in
    guard error == nil, let result = result else {
        // ...
        return
    }
    // Recognized text
}
Objective-C
FIRVisionImage *visionImage = [[FIRVisionImage alloc] initWithImage:uiImage];
[textRecognizer processImage:visionImage
                  completion:^(FIRVisionDocumentText *_Nullable result,
                               NSError *_Nullable error) {
    if (error != nil || result == nil) {
      // ...
      return;
    }
    // Recognized text
}];
2. Extract the text
If the text recognition operation succeeds, it returns a VisionDocumentText object. A VisionDocumentText object contains the full text recognized in the image and a hierarchy of objects that reflect the structure of the recognized document: blocks, paragraphs, words, and symbols.
For each VisionDocumentTextBlock, VisionDocumentTextParagraph, VisionDocumentTextWord, and VisionDocumentTextSymbol object, you can get the text recognized in the region and the bounding coordinates of the region.
For example:
Swift
let resultText = result.text
for block in result.blocks {
    let blockText = block.text
    let blockConfidence = block.confidence
    let blockRecognizedLanguages = block.recognizedLanguages
    let blockBreak = block.recognizedBreak
    let blockCornerPoints = block.cornerPoints
    let blockFrame = block.frame
    for paragraph in block.paragraphs {
        let paragraphText = paragraph.text
        let paragraphConfidence = paragraph.confidence
        let paragraphRecognizedLanguages = paragraph.recognizedLanguages
        let paragraphBreak = paragraph.recognizedBreak
        let paragraphCornerPoints = paragraph.cornerPoints
        let paragraphFrame = paragraph.frame
        for word in paragraph.words {
            let wordText = word.text
            let wordConfidence = word.confidence
            let wordRecognizedLanguages = word.recognizedLanguages
            let wordBreak = word.recognizedBreak
            let wordCornerPoints = word.cornerPoints
            let wordFrame = word.frame
            for symbol in word.symbols {
                let symbolText = symbol.text
                let symbolConfidence = symbol.confidence
                let symbolRecognizedLanguages = symbol.recognizedLanguages
                let symbolBreak = symbol.recognizedBreak
                let symbolCornerPoints = symbol.cornerPoints
                let symbolFrame = symbol.frame
            }
        }
    }
}
Objective-C
NSString *resultText = result.text;
for (FIRVisionDocumentTextBlock *block in result.blocks) {
    NSString *blockText = block.text;
    NSNumber *blockConfidence = block.confidence;
    NSArray<FIRVisionTextRecognizedLanguage *> *blockRecognizedLanguages = block.recognizedLanguages;
    FIRVisionTextRecognizedBreak *blockBreak = block.recognizedBreak;
    CGRect blockFrame = block.frame;
    for (FIRVisionDocumentTextParagraph *paragraph in block.paragraphs) {
        NSString *paragraphText = paragraph.text;
        NSNumber *paragraphConfidence = paragraph.confidence;
        NSArray<FIRVisionTextRecognizedLanguage *> *paragraphRecognizedLanguages = paragraph.recognizedLanguages;
        FIRVisionTextRecognizedBreak *paragraphBreak = paragraph.recognizedBreak;
        CGRect paragraphFrame = paragraph.frame;
        for (FIRVisionDocumentTextWord *word in paragraph.words) {
            NSString *wordText = word.text;
            NSNumber *wordConfidence = word.confidence;
            NSArray<FIRVisionTextRecognizedLanguage *> *wordRecognizedLanguages = word.recognizedLanguages;
            FIRVisionTextRecognizedBreak *wordBreak = word.recognizedBreak;
            CGRect wordFrame = word.frame;
            for (FIRVisionDocumentTextSymbol *symbol in word.symbols) {
                NSString *symbolText = symbol.text;
                NSNumber *symbolConfidence = symbol.confidence;
                NSArray<FIRVisionTextRecognizedLanguage *> *symbolRecognizedLanguages = symbol.recognizedLanguages;
                FIRVisionTextRecognizedBreak *symbolBreak = symbol.recognizedBreak;
                CGRect symbolFrame = symbol.frame;
            }
        }
    }
}
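The recognizedBreak values let you rebuild the text with its original spacing. A minimal sketch, assuming the break types mirror Cloud Vision's DetectedBreak (verify the exact enum cases and optionality against the SDK reference):
Swift
// Illustrative helper (not part of Firebase ML): reassembles a word's symbols,
// appending whitespace according to each symbol's recognized break.
func reassemble(_ word: VisionDocumentTextWord) -> String {
    var text = ""
    for symbol in word.symbols {
        text += symbol.text
        if let recognizedBreak = symbol.recognizedBreak {
            switch recognizedBreak.breakType {
            case .space, .sureSpace:
                text += " "
            case .eolSureSpace, .lineBreak:
                text += "\n"
            case .hyphen:
                // Hyphen inserted where a line was wrapped.
                text += "-\n"
            default:
                break
            }
        }
    }
    return text
}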
Next steps