Image To Speech Converter

2021.07.07

Note: More than a year has passed since this article was published. Please be aware that the information may be out of date.

Introduction

    Visual impairment is one of the biggest limitations for humanity, especially in this day and age when information is communicated largely through text (electronic and paper-based) rather than voice. Optical character recognition (OCR) has become one of the most successful applications of technology in the field of pattern recognition and artificial intelligence. OCR is the process of converting scanned images of machine-printed or handwritten text (numerals, letters, and symbols) into computer-readable text. Speech synthesis is the artificial synthesis of human speech. A Google Text-to-Speech (gTTS) synthesizer is a computer-based system that can read any text aloud, whether it was entered directly into the computer by an operator or scanned and submitted to an OCR system.

    The operational stages of the system are image capture, image preprocessing, image filtering, character recognition, and text-to-speech conversion. The software platform used is Python with a set of libraries. This technology aims to help people with disabilities. The basic framework is a system that captures an image, extracts only the region of interest (i.e. the region of the image that contains text), and converts that text to speech.

Pre-Requisites

Programming language: Python 3, with the Atom IDE. Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components. Python's simple, easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form, free of charge, for all major platforms, and can be freely distributed.

Libraries Used:-

  1. TKINTER
  2. PYTESSERACT
  3. GTTS
  4. PIL
TKINTER:

Tkinter is the standard GUI library for Python. Combined with Tkinter, Python provides a fast and easy way to create GUI applications. Tkinter provides a powerful object-oriented interface to the Tk GUI toolkit. Here we use Buttons and Labels for user interaction.

PYTESSERACT:

Python-tesseract is an optical character recognition (OCR) tool for Python: it recognizes and "reads" the text embedded in images. Python-tesseract is a wrapper for Google's Tesseract-OCR engine. It is also useful as a stand-alone invocation script for tesseract, since it can read all image types supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff, and others, whereas tesseract-ocr by default only supports tiff and bmp. Additionally, when used as a script, Python-tesseract prints the recognized text instead of writing it to a file.

GTTS:

gTTS (Google Text-to-Speech) is a Python interface to Google's Text-to-Speech API. It can create an mp3 file via the gTTS module or the gtts-cli command-line utility, and it handles text of unlimited length by tokenizing long sentences at points where the speech would naturally pause.

PIL:

The Python Imaging Library (PIL) adds image processing capabilities to your Python interpreter. This library supports many file formats and provides powerful image processing and graphics capabilities.
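For example, the greyscale conversion used later in this article takes only a few lines. The image here is generated in memory so the snippet is self-contained:

```python
import PIL.Image

# Create a small RGB test image, then convert it to greyscale ('L' mode),
# the same preprocessing step applied before OCR later in this article
img = PIL.Image.new("RGB", (100, 100), (255, 0, 0))
grey = img.convert("L")
print(grey.mode)  # -> L
```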

Proposed System

Implementation

import PIL.Image
import pytesseract
from gtts import gTTS
import os
from tkinter import *
import cv2
CODE TO CAPTURE THE IMAGE FROM THE WEB CAMERA.
def cam():
    camera_port = 0
    # Number of frames to throw away while the camera adjusts to light levels
    ramp_frames = 30
    # Initialize the camera capture object with the cv2.VideoCapture class.
    # All it needs is the index of a camera port.
    camera = cv2.VideoCapture(camera_port)

    # Captures a single frame from the camera and returns it
    def get_image():
        # read() is the easiest way to get a full image out of a VideoCapture object.
        retval, im = camera.read()
        return im

    # Ramp the camera - these frames are discarded and only serve to let
    # v4l2 adjust light levels, if necessary
    for i in range(ramp_frames):
        temp = get_image()
    print("Taking image...")
    # Take the actual image we want to keep, saved under the name convert() reads
    camera_capture = get_image()
    file = "image1.jpg"
    # A nice feature of the imwrite method is that it automatically chooses the
    # correct format based on the file extension you provide. Convenient!
    cv2.imwrite(file, camera_capture)
    # Release the camera, otherwise you won't be able to create a new
    # capture object until the script exits
    camera.release()
EXTRACT THE TEXT FROM THE IMAGE AND SAVE THE TEXT IN TO A TEXT DOCUMENT
def convert():
    img = PIL.Image.open('image1.jpg')
    img = img.convert('L')  # greyscale conversion improves OCR accuracy
    img.save('image.jpg')
    text = pytesseract.image_to_string(PIL.Image.open('image.jpg'))
    print(text)
    with open("output.txt", "w") as f:
        f.write(text)
TEXT DOCUMENT IS GIVEN AS INPUT AND CONVERTS FROM TEXT TO SPEECH USING GTTS LIBRARY.
def speech():
    with open("output.txt", "r") as f:
        tts = f.read()
    tt = gTTS(tts, lang='en')
    tt.save("good.mp3")
    os.system("good.mp3")  # on Windows, opens the file with the default media player

# GUI: a Label and three Buttons, as described in the Tkinter section above
root = Tk()
root.title("Image To Speech Converter")
Label(root, text="Image To Speech Converter").pack()
Button(root, text="Capture", command=cam).pack()
Button(root, text="Convert", command=convert).pack()
Button(root, text="Speech", command=speech).pack()
root.mainloop()

Output:

Run the script in Windows PowerShell.

GUI of the application after the script is executed.

Click on the Capture button to take the picture.

Output displayed once the Capture button is pressed.

The image that has to be captured.

Now click on the Convert button, which identifies the text in the given image and stores it in a file.

The text in the image is recognized and saved in the text document.

Now click on the Speech button to listen.

The output of the text document is converted to voice.

Benefits:

  • People with learning disabilities
  • People who have literacy difficulties
  • People who speak the language but do not read it
  • People who multitask
  • People with visual impairments
  • People who access content on mobile devices
  • People with different learning styles

Conclusion

This article presented an approach for image-to-speech conversion using optical character recognition and text-to-speech technology. The application developed is user-friendly, cost-effective, and applicable in real time. With this approach we can read text from a document, web page, or e-book and generate synthesized speech through a computer's or phone's speakers. People with poor vision, visual dyslexia, or total blindness can use this approach for reading documents and books, and people with speech loss can use it to turn typed words into vocalization.