{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "mjAScbd2vl9P"
},
"source": [
"# OCR model for reading Captchas\n",
"\n",
"**Author:** [A_K_Nain](https://twitter.com/A_K_Nain)
\n",
"**Date created:** 2020/06/14
\n",
"**Last modified:** 2020/06/26
\n",
"**Description:** How to implement an OCR model using CNNs, RNNs and CTC loss."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wWvlZPBJvl9U"
},
"source": [
"## Introduction\n",
"\n",
"This example demonstrates a simple OCR model built with the Functional API. Apart from\n",
"combining CNN and RNN, it also illustrates how you can instantiate a new layer\n",
"and use it as an \"Endpoint layer\" for implementing CTC loss. For a detailed\n",
"guide to layer subclassing, please check out\n",
"[this page](https://keras.io/guides/making_new_layers_and_models_via_subclassing/)\n",
"in the developer guides."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Yq0Pe4Zuvl9U"
},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "5q-xCl8Qvl9V"
},
"outputs": [],
"source": [
"import os\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import sys\n",
"\n",
"from pathlib import Path\n",
"from collections import Counter\n",
"\n",
"import tensorflow as tf\n",
"from tensorflow import keras\n",
"from tensorflow.keras import layers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "KIc-3qB0L5OE",
"outputId": "c6fb9b04-386f-4d84-ae46-f85cc4a36647",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {},
"execution_count": 2
}
],
"source": [
"tf.executing_eagerly()"
]
},
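{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a preview of the \"endpoint layer\" idea mentioned in the introduction, here is\n",
"a minimal sketch of such a layer (the example's own definition appears later in\n",
"the notebook): it takes `(y_true, y_pred)`, computes the CTC loss with\n",
"`keras.backend.ctc_batch_cost`, registers it via `self.add_loss`, and simply\n",
"passes the predictions through."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class CTCLayer(layers.Layer):\n",
"    def __init__(self, name=None):\n",
"        super().__init__(name=name)\n",
"        self.loss_fn = keras.backend.ctc_batch_cost\n",
"\n",
"    def call(self, y_true, y_pred):\n",
"        # Compute the training-time CTC loss and add it to the layer's losses\n",
"        batch_len = tf.cast(tf.shape(y_true)[0], dtype=\"int64\")\n",
"        input_length = tf.cast(tf.shape(y_pred)[1], dtype=\"int64\")\n",
"        label_length = tf.cast(tf.shape(y_true)[1], dtype=\"int64\")\n",
"        input_length = input_length * tf.ones(shape=(batch_len, 1), dtype=\"int64\")\n",
"        label_length = label_length * tf.ones(shape=(batch_len, 1), dtype=\"int64\")\n",
"        loss = self.loss_fn(y_true, y_pred, input_length, label_length)\n",
"        self.add_loss(loss)\n",
"        # At test time, just return the computed predictions\n",
"        return y_pred"
]
},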
{
"cell_type": "markdown",
"metadata": {
"id": "sSm7N--8vl9W"
},
"source": [
"## Load the data: [Captcha Images](https://www.kaggle.com/fournierp/captcha-version-2-images)\n",
"Let's download the data."
]
},
{
"cell_type": "code",
"source": [
"!unzip -qq images_10k.zip"
],
"metadata": {
"id": "GmxnAtyRz-L2"
},
"execution_count": 4,
"outputs": []
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "g3EVJfHBvl9X",
"outputId": "9869f54b-be6e-4cdf-8a6c-6c2a034495a4",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Number of images found: 10780\n",
"Number of labels found: 10780\n",
"Number of unique characters: 21\n",
"Characters present: [' ', '0', '2', '4', '8', 'A', 'D', 'G', 'H', 'J', 'K', 'M', 'N', 'P', 'R', 'S', 'T', 'V', 'W', 'X', 'Y']\n"
]
}
],
"source": [
"substitutions = {\n",
" 'B': '8',\n",
" 'F': 'P',\n",
" 'U': 'V',\n",
" '5': 'S',\n",
" '6': 'G',\n",
" 'Z': '2',\n",
" 'O': '0'\n",
"}\n",
"\n",
"def apply_substitutions(input_string):\n",
" output_string = \"\"\n",
" for char in input_string:\n",
" if char in substitutions:\n",
" output_string += substitutions[char]\n",
" else:\n",
" output_string += char\n",
"\n",
" return output_string\n",
"\n",
"data_dir = Path(\"./images_10k/\")\n",
"\n",
"# Get list of all the images\n",
"images = sorted(list(map(str, list(data_dir.glob(\"*.png\")))))\n",
"labels = [apply_substitutions(img.split(os.path.sep)[-1].split(\".png\")[0]) for img in images]\n",
"\n",
"# Maximum length of any captcha in the dataset\n",
"max_length = max([len(label) for label in labels])\n",
"labels = [x + ' ' * (max_length - len(x)) for x in labels]\n",
"\n",
"characters = set(char for label in labels for char in label)\n",
"characters = sorted(list(characters))\n",
"\n",
"print(\"Number of images found: \", len(images))\n",
"print(\"Number of labels found: \", len(labels))\n",
"print(\"Number of unique characters: \", len(characters))\n",
"print(\"Characters present: \", characters)\n",
"\n",
"# Batch size for training and validation\n",
"batch_size = 16\n",
"\n",
"# Desired image dimensions\n",
"img_width = 300\n",
"img_height = 80\n",
"\n",
"# Factor by which the image is going to be downsampled\n",
"# by the convolutional blocks. We will be using two\n",
"# convolution blocks and each block will have\n",
"# a pooling layer which downsample the features by a factor of 2.\n",
"# Hence total downsampling factor would be 4.\n",
"downsample_factor = 4"
]
},
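{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (added here as an illustration, not part of the original\n",
"example), we can verify that the downsampled width leaves CTC enough time steps:\n",
"the image width becomes the time axis after the transpose in the preprocessing\n",
"step below, and CTC needs at least one time step per label character."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check: after downsampling, the RNN will see\n",
"# img_width // downsample_factor time steps along the width axis.\n",
"num_time_steps = img_width // downsample_factor\n",
"print(\"CTC time steps: \", num_time_steps)\n",
"print(\"Max label length: \", max_length)\n",
"# CTC requires at least as many time steps as label characters\n",
"assert num_time_steps >= max_length"
]
},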
{
"cell_type": "markdown",
"metadata": {
"id": "gqn-NjRovl9Y"
},
"source": [
"## Preprocessing"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "MjQltH0Mvl9Y"
},
"outputs": [],
"source": [
"# Mapping characters to integers\n",
"char_to_num = layers.StringLookup(\n",
" vocabulary=list(characters), mask_token=None,\n",
")\n",
"\n",
"# Mapping integers back to original characters\n",
"num_to_char = layers.StringLookup(\n",
" vocabulary=char_to_num.get_vocabulary(), mask_token=None, invert=True\n",
")\n",
"\n",
"\n",
"def split_data(images, labels, train_size=0.75, shuffle=True):\n",
" # 1. Get the total size of the dataset\n",
" size = len(images)\n",
" # 2. Make an indices array and shuffle it, if required\n",
" indices = np.arange(size)\n",
" if shuffle:\n",
" np.random.shuffle(indices)\n",
" # 3. Get the size of training samples\n",
" train_samples = int(size * train_size)\n",
" # 4. Split data into training and validation sets\n",
" x_train, y_train = images[indices[:train_samples]], labels[indices[:train_samples]]\n",
" x_valid, y_valid = images[indices[train_samples:]], labels[indices[train_samples:]]\n",
" return x_train, x_valid, y_train, y_valid\n",
"\n",
"\n",
"# Splitting data into training and validation sets\n",
"x_train, x_valid, y_train, y_valid = split_data(np.array(images), np.array(labels))\n",
"\n",
"\n",
"def encode_single_sample(img_path, label):\n",
" # 1. Read image\n",
" img = tf.io.read_file(img_path)\n",
" # 2. Decode and convert to grayscale\n",
" img = tf.io.decode_png(img, channels=1)\n",
" # 3. Convert to float32 in [0, 1] range\n",
" img = tf.image.convert_image_dtype(img, tf.float32)\n",
" # 4. Resize to the desired size\n",
" img = tf.image.resize(img, [img_height, img_width])\n",
" # 5. Transpose the image because we want the time\n",
" # dimension to correspond to the width of the image.\n",
" img = tf.transpose(img, perm=[1, 0, 2])\n",
" # 6. Map the characters in label to numbers\n",
" label = char_to_num(tf.strings.unicode_split(label, input_encoding=\"UTF-8\"))\n",
" # 7. Return a dict as our model is expecting two inputs\n",
" return {\"image\": img, \"label\": label}"
]
},
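{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration (not part of the original example), we can round-trip a\n",
"string through the two `StringLookup` layers to confirm that they are exact\n",
"inverses of each other. The sample string `\"8ADN\"` is an arbitrary choice that\n",
"only uses characters from the vocabulary built above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Encode a sample string to integer ids and decode it back\n",
"sample = tf.strings.unicode_split(\"8ADN\", input_encoding=\"UTF-8\")\n",
"ids = char_to_num(sample)\n",
"recovered = tf.strings.reduce_join(num_to_char(ids))\n",
"print(ids.numpy())\n",
"print(recovered.numpy().decode(\"utf-8\"))"
]
},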
{
"cell_type": "markdown",
"metadata": {
"id": "fnwhurZ-vl9Z"
},
"source": [
"## Create `Dataset` objects"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "k2MZdcpXvl9Z"
},
"outputs": [],
"source": [
"train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))\n",
"train_dataset = (\n",
" train_dataset.map(\n",
" encode_single_sample, num_parallel_calls=tf.data.AUTOTUNE\n",
" )\n",
" .batch(batch_size)\n",
" .prefetch(buffer_size=tf.data.AUTOTUNE)\n",
")\n",
"\n",
"validation_dataset = tf.data.Dataset.from_tensor_slices((x_valid, y_valid))\n",
"validation_dataset = (\n",
" validation_dataset.map(\n",
" encode_single_sample, num_parallel_calls=tf.data.AUTOTUNE\n",
" )\n",
" .batch(batch_size)\n",
" .prefetch(buffer_size=tf.data.AUTOTUNE)\n",
")"
]
},
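{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before visualizing, it can help to peek at one batch and confirm the tensor\n",
"shapes (a quick check added for illustration, not part of the original\n",
"example): because of the transpose in `encode_single_sample`, images come out\n",
"as `(batch_size, img_width, img_height, 1)` and labels as `(batch_size, max_length)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect one batch to confirm the shapes produced by the pipeline\n",
"for batch in train_dataset.take(1):\n",
"    print(\"image batch shape: \", batch[\"image\"].shape)\n",
"    print(\"label batch shape: \", batch[\"label\"].shape)"
]
},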
{
"cell_type": "markdown",
"metadata": {
"id": "NI0NRV5Ivl9Z"
},
"source": [
"## Visualize the data"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "7GT5RSNgvl9Z",
"outputId": "d1ba100d-2b96-448f-f41b-7a77742374cc",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 405
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"