The International Journal of Multimedia & Its Applications (IJMA) Vol.10, No.1/2/3, June 2018
DOI: 10.5121/ijma.2018.10301
COMPUTER VISION FOR DATA CATALOGING IN
ECOLOGICAL RESEARCH
Shannon Heh
Lynbrook High School, San Jose, California, USA
ABSTRACT
Data collection is an essential but time-consuming procedure in ecological research. The author
developed an algorithm that incorporates two important computer vision techniques to automate
butterfly cataloguing: Optical Character Recognition (OCR) for reading text and contour detection
for image processing. Proper pre-processing is first done on the images to improve the accuracy of
character recognition and butterfly measurement. Although there are limitations to Tesseract's
detection of certain fonts, overall it can successfully identify words in basic fonts. Contour detection
is an advanced technique that can be utilized to measure an image. Multiple mathematical algorithms
are used to calculate and determine the precise location of the points on which to draw the body and
forewing lines of the butterfly. Overall, 92% accuracy is achieved by the program for the set of
butterflies measured.
KEYWORDS
Computer Vision, Image Recognition, Character Recognition, Ecology, Butterfly Cataloguing.
1. INTRODUCTION
Data collection is an important step of scientific research, especially in ecological and
evolutionary studies. Scientists must gather a large amount of data to perform analyses of certain
species and support their hypotheses. Much of the information is still contained in physical books
that include images and descriptions. The traditional way to digitize the data is to manually
measure the body sizes of organisms from these images and type the information into documents
and spreadsheets. However, this method is time-consuming and inefficient. It requires a lot of
manpower and is also prone to inaccurate measurements due to human errors.
During my Earth Science internship at Stanford University, one of my tasks was to go through 50
books, measure the forewing and body length of thousands of butterflies, extract taxonomic
information about each species, and record everything in an Excel spreadsheet. The data
collection process not only requires knowledge of Lepidoptera classification, but also needs focus
and patience. Two interns can only measure 300 butterflies in three hours.
My internship experience inspired me to search for efficient ways to automate the data collection
process. While studying computer vision with Professor Susan Fox, I realized the application of
computer vision would be the perfect solution to optimize the data collection process for
ecological research. The goal of my project is to employ computer vision techniques to facilitate
mass data collection and measurement. With computer vision, the ecological data catalogue
process can be sped up significantly.
2. BACKGROUND INFORMATION
2.1. What is Tesseract?
Tesseract is an open-source Optical Character Recognition (OCR) engine that can detect
approximately 100 languages [1]. In this project, pyTesseract [2], a Python wrapper for Google’s
Tesseract-OCR Engine, is implemented for character detection. Tesseract follows a step-by-step
process for character detection. The first step is a connected component analysis to organize
character outlines into “Blobs.” Text lines are then found and broken into words based on the
spacing between words. Character recognition is described as a “two-pass process”: the first pass
involves recognizing each word and passing it to an adaptive classifier as training data, while the
second pass recognizes words again to ensure that all words are well detected [3].
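To make the wrapper's role concrete, the following is a minimal sketch of how pyTesseract can be invoked on a pre-processed image; the file name is a placeholder.

```python
# Minimal sketch: running Tesseract via the pyTesseract wrapper.
# "description.png" is a placeholder for any pre-processed text image.
from PIL import Image
import pytesseract

image = Image.open("description.png")
text = pytesseract.image_to_string(image)  # runs the blob/line/word, two-pass pipeline
print(text)
```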
2.2. How are the butterflies measured?
Two measurements are taken to quantify the butterfly’s body size. The first is the forewing length
(or basal-apical length) which extends approximately from the upper half of the butterfly body to
the outer edge of the butterfly’s forewing as shown below. The second is the butterfly’s body
length which starts right below the butterfly’s head and ends at the tip of the body. (In my
program, the body length will be measured from the top of the head to the end of the body.)
Figure 1. Measurements taken for a butterfly
2.3. How are the butterflies typically displayed in books?
Figure 2 is a mock-up example showing how the butterflies are displayed in books. The data
recorded about the butterflies include but are not limited to: the book in which the butterflies are
displayed, the butterfly’s family, genus, species, subspecies, authority, authority year, sex, page
number, figure number, magnitude or scale, forewing length, and body length. The characteristics
of the butterfly are often displayed beside or below the butterfly’s images. Other descriptions of
the butterfly’s habitat and behaviours are sometimes included in the books, but unnecessary for
data collection about the butterflies’ body sizes.
Figure 2. A mock-up example of a typical butterfly display in books
2.4. Related Work
In past studies of computer vision, butterflies had not been detected and measured for ecological
purposes. However, fish have often been studied and measured to automate their inefficient
manual sorting. D.J. White, C. Svellingen, and N.J.C. Strachan [4] detected the principal
axis of a fish by finding the two furthest points on its outline, an idea I implemented
in my own project to determine the butterfly's forewing length.
Various interesting methods were proposed for text detection, including Support Vector Machine
(SVM), K-Nearest Neighbor (KNN), Neural Networks, and Tesseract for OpenCV. SVM, KNN,
and Neural Networks all implement machine learning techniques to train the system to recognize
letters and symbols. SVM [5] is a discriminative classifier defined by a hyperplane that maximizes
the margin of training data to find the best match between the characters with which the classifier
has been trained and the character needing recognition. Yafang Xue [6] implemented SVM
classifiers as one method of OCR, but the algorithm accuracy reached only 83%. P. Kumar, N.
Sharma, and A. Rana [7] measured a 94.8% accuracy for the SVM Classifier, but only 80.96%
accuracy for Neural Networks. KNN [8] is another classification algorithm which locates the
“nearest neighbor,” or most similar match, in the training data to the character needing recognition.
The data collector can choose k, the number of nearest neighbors to consider for matching. While
SVM and KNN are simple machine learning algorithms for character recognition, the process for
training the classifiers is arduous; one needs to train the system for each character and font to
ensure the highest accuracy. After reading a study from K.M. Sajjad [9] about Tesseract in
automatic license plate recognition, I decided that using Tesseract in my project would be the
fastest, most accurate OCR method, since character recognition is already implemented in the
engine.
The papers by Xue, Sajjad, White, and Kumar also presented effective pre-processing techniques,
such as binarization, color thresholding, and resizing, that were applied to my butterfly images.
3. SOLUTION
Two main computer vision techniques were investigated for this project:
1. Optical Character Recognition (OCR): to read the taxonomic information of butterflies
2. Contour Detection: to process the images of butterflies
Optical Character Recognition is the conversion of typed, printed, or handwritten texts into
machine-readable texts by a computer. I integrated an existing OCR Engine called Tesseract to
conduct the character identification. The butterfly descriptions and measurements are then
scanned and input into columns of a spreadsheet using the openpyxl library in Python.
Contour Detection was necessary to isolate the shape of the butterfly and locate the points on the
butterfly from which to measure their body and wing lengths. Multiple mathematical methods are
used to help calculate and detect the locations for measurement. Although I used butterflies as a
proxy for measurement, any object can be measured with computer vision.
The solution is divided into two phases: (1) character recognition and (2) butterfly measurement
and detection. Phase one incorporates Optical Character Recognition (OCR) to recognize and print
the characters in the image, while Phase two uses various Python functions and libraries to
detect contours, measure linear distances on butterflies, and enter data into a spreadsheet. The two-
phase solution can be applied to scientific research for automated and efficient data collection.
3.1 Character Recognition
Figure 3 is the flow-chart to show how character recognition is implemented.
Figure 3. Character Recognition flow-chart
3.1.1 Pre-processing
Pre-processing is necessary to enhance the features of an image. Listed below are the pre-
processing techniques that I implemented.
1) Gray-scale: The first step is to convert the image to a gray-scale image which only uses one
channel and eliminates extraneous color information. Feature detection is improved, and
noise is reduced as a result.
2) Binarization: Binarization converts the image to black and white using a threshold value.
Binary thresholding sets values above the threshold to the max value (white), and values
less than or equal to the threshold to zero (black).
3) Resize: The next step is to resize the dimensions of the image to two times the original. By
testing the code, I found that scaling the text larger makes the character recognition
more accurate. The font size is assumed to be around 10-12 points.
4) Erosion: The next pre-processing step is erosion, which erodes the white background by a
layer of pixels and thereby thickens the dark text strokes.
5) Sharpening: The final pre-processing step is sharpening. Sharpening is used to enhance the
edges and features of each character. For example, I found that with one text, the program
would confuse the letter ‘1’ and the symbol ‘]’. After image sharpening, the program was
able to make the distinction.
Figure 4. Result of each step of pre-processing
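As an illustration, these five steps can be chained with OpenCV roughly as follows; the threshold value, kernel sizes, and file name below are assumptions for the sketch, not the exact parameters used in the program.

```python
import cv2
import numpy as np

img = cv2.imread("text_region.png")  # placeholder input image

# 1) Gray-scale: collapse to one channel to discard color information.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# 2) Binarization: values above the threshold become white, the rest black.
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# 3) Resize: scale both dimensions by 2 to aid recognition of ~10-12 pt text.
resized = cv2.resize(binary, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)

# 4) Erosion: shrink the white background so the dark strokes thicken.
eroded = cv2.erode(resized, np.ones((3, 3), np.uint8), iterations=1)

# 5) Sharpening: convolve with a kernel that boosts character edges.
sharpen_kernel = np.array([[0, -1, 0],
                           [-1, 5, -1],
                           [0, -1, 0]])
sharpened = cv2.filter2D(eroded, -1, sharpen_kernel)
```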
3.1.2. Optical Character Recognition
In the second half of the character recognition process, the program utilizes Tesseract, an
open-source Optical Character Recognition (OCR) engine. Developed by HP in 1985,
Tesseract is currently one of the most accurate OCR software tools.
3.1.3. Test the Program
I tested the program with three different fonts: Sans-Serif (Calibri), Serif (Times New Roman),
and Lucida Calligraphy, to analyze how the type of font affects accuracy. I also included
narrow (Arial Narrow), italicized, bold, all-capitalized, colored, and highlighted
text, the full alphabet, and a text comparison for further analysis.
The standard text I use for the tests is:
Species: Danaus
Genus: Plexippus
Authority: Shannon 2017
Shown in the table (Table 1) below are the results for each font type and characteristic added to
the text. Note that kerning and line spacing are detected by Tesseract. All text was written in size
12 font. Serif font was written in Times New Roman and Sans-Serif was written in Calibri.
Table 1. Final results of OCR in the program. (The per-condition screenshots of recognized text
are omitted here; the conditions tested were:)

- Sans-Serif: regular and all capitalized
- Serif: regular and all capitalized
- Lucida Calligraphy: regular and all capitalized
- Arial Narrow: regular and all capitalized
- Italicized (Sans-Serif, regular) and Bold (Sans-Serif, regular)
- Full alphabet: Serif and Sans-Serif
- Red text (Sans-Serif) and Yellow highlight (Sans-Serif)
- Results on August 19, 2017: Lucida Calligraphy (top), Serif, Sans-Serif, Arial Narrow (bottom);
  note the text differs in this example.
- Results on August 26, 2017: Serif (top), Sans-Serif, Lucida Calligraphy, Arial Narrow (bottom);
  note the text differs in this example.
In general, simple fonts (Times New Roman, Calibri, and Arial Narrow) are more accurately
recognized by the program. Text written in all capitals produced more incorrect results compared
to text written regularly. Sans-Serif fonts, written regularly, were all recognized by Tesseract.
When the Sans-Serif text is written in all capital letters, the software produced some errors (e.g. in
the word "Plexippus"). Serif fonts, written both ways, are accurately detected; there is again
only one small error in the word "Plexippus" when written in all capitals. Recognition of fonts
with ornamentation, such as Lucida Calligraphy, produced the most inaccurate results, with no
words being detected correctly. Narrow spacing between letters (Arial Narrow) did not hinder the
program’s detection; this text was actually detected with the best accuracy.
I input the full alphabet to see whether the program could detect all letters with the two most
basic Serif and Sans-Serif fonts. Both fonts' recognition produced the same results, with both c's
replaced by o's. I then tested the program with italicized, bolded, colored, and highlighted text.
Italicized and bolded text had no effect on the program, besides the small error in "Plexippus"
where 'l' was confused with 'i' in the bold text. This may be because bolding over-thickened
the characters (which were already thickened by erosion during pre-processing). The
color and background color of the text did not affect the program’s character detection because
color information was eliminated in the pre-processing.
I found it interesting that the results of the text recognition changed when recorded on separate
days. The earlier results show errors in both Times New Roman and Lucida Calligraphy text. The
later results produce errors only in the Lucida Calligraphy text, but the errors are in different
characters. The printed text also showed differences in line spacing. Recognition of Times New
Roman and Lucida Calligraphy text is improved, while recognition of Calibri and Arial Narrow
text is consistently perfect. My hypothesis is that differences in the image window size
affected the clarity of the text. Further research would have to be done to determine the exact
causes of the discrepancies.
3.2. Butterfly Recognition and Measurement
Figure 5 shows the flow-chart of butterfly recognition and measurement.
Figure 5. Butterfly measurement flow-chart
3.2.1 Pre-processing
The butterfly image was converted to a gray-scale image to eliminate unnecessary color
information. Again, color information makes the image contours more difficult to detect.
The image is also sharpened to make contours clearer.
3.2.2 Contour Detection
All the contours on the butterfly image are detected. Only the outer shape of the butterfly is
needed, so the second-largest contour by area is chosen (the largest contour is typically the
border of the image itself). This step is shown in Figure 6.
Figure 6. Before and after the largest contour was found
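A minimal sketch of this step, assuming OpenCV 4's two-value findContours interface and an illustrative threshold:

```python
import cv2

gray = cv2.imread("butterfly.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name
_, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)

# Find every contour, then rank them by enclosed area.
contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
contours = sorted(contours, key=cv2.contourArea, reverse=True)

butterfly_outline = contours[1]  # second largest: the butterfly's outer shape
```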
3.2.3 Measuring the Body Length
The following step is to calculate the body length. My initial approach was to locate the midline of
the bounding box of the butterfly and find the highest and lowest points within 10 pixels of the
midline, which correspond to the head and bottom end of the body. However, this method is
ineffective for non-symmetrical butterflies, where the midline is skewed.
Figure 7. Close-up of the bottom end of the body
My current technique is to locate the bottom end of the butterfly body and calculate its distance
from the head. I limited the search to contour points within 50 pixels, left and right, of the
midline to account for the potential skew caused by non-symmetrical cases. Next, for each
candidate point I took two more points 2 and 4 pixels away and tested whether the middle point is
a local minimum, as displayed in Figure 8. A line is then drawn from the bottom end to the point
where it intersects the head, and the body length is determined.
Figure 8. Body length of the butterfly (in green)
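A rough sketch of the local-minimum test, under the simplifying assumptions that the contour is available as (x, y) pairs, the image y-axis points downward, and neighboring points are approximated by index spacing rather than exact pixel distance:

```python
def find_body_bottom(contour, mid_x, band=50, step=2):
    # Keep only contour points within `band` pixels of the vertical midline.
    pts = sorted((x, y) for x, y in contour if abs(x - mid_x) <= band)
    best = None
    for i in range(len(pts) - 2 * step):
        left, mid, right = pts[i], pts[i + step], pts[i + 2 * step]
        # Image y grows downward, so the visually lowest point is a local
        # maximum in y: the middle point lies at or below both neighbors.
        if mid[1] >= left[1] and mid[1] >= right[1]:
            if best is None or mid[1] > best[1]:
                best = mid
    return best  # (x, y) of the body's bottom end, or None if not found
```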
3.2.4 Measuring the forewing length
1) Drawing an Ellipse
In order to measure the butterfly’s forewing length, I first drew an ellipse, which is similar to the
butterfly body shape, around the butterfly’s body. The major axis is half of the body length (Figure
9) and the minor axis is around 10 pixels, an approximation of the maximum body width of the
butterfly. The minor axis is kept constant because there is usually little variation in the body width
of butterflies among different species.
Figure 9. Ellipse around the butterfly’s body (in red)
2) Locating Point A
The next task is to find the two points between which to draw the line for the forewing length.
Figure 10 shows where the points are usually found. I will call one point A, and the other point B.
Figure 10. How the butterfly’s forewing length is measured
I tested two different methods for detecting Point A. I first tried detecting the top of the butterfly
head, then shifted this point 10 pixels down and five pixels to the right, and used that point as Point
A. However, because of variations in the butterfly’s body size, this method proved unreliable.
After drawing the ellipse, I discovered that I could use the equation of the ellipse to find Point A(x,
y) on the body.
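The equation itself did not survive in this copy of the paper; a plausible reconstruction of Equation (1), assuming the ellipse is centered at (x_c, y_c) with semi-major axis a along the body, semi-minor axis b across it, and θ measured from the top of the ellipse, is the standard parametric form:

A = (x, y) = (x_c + b sin θ, y_c − a cos θ)    (1)

Since the image y-axis points downward, subtracting a cos θ places Point A near the top of the body, shifted slightly to the right for a small positive θ such as π/20.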
After testing multiple angles for θ, I observed that θ should be around π/20.
3) Locating Point B
The last step for the wing length measurement is finding Point B. I predicted that Point B would be
the furthest point on the butterfly contour from Point A. Point B, in this case, must be on the right
wing, so I made sure that the point was above and to the right of Point A. Finally, the distance
formula was used to calculate the distance between Points A and B.
Figure 11. Forewing length of the butterfly (in green)
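A sketch of this search, again assuming the contour is given as (x, y) pairs with the image y-axis pointing downward:

```python
import math

def find_point_b(contour, point_a):
    # Point B: the farthest contour point from Point A that lies above and
    # to the right of A (i.e. on the right forewing; image y grows downward).
    ax, ay = point_a
    best, best_dist = None, 0.0
    for x, y in contour:
        if x > ax and y < ay:
            d = math.hypot(x - ax, y - ay)  # Euclidean distance formula
            if d > best_dist:
                best, best_dist = (x, y), d
    return best, best_dist  # forewing tip and forewing length in pixels
```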
3.3. Converting from Pixels to Millimeters
The final task for butterfly measurements is converting the body and wing length units from pixels
to millimeters. I measured both lengths on a printed image of a butterfly, and used the pixel to
millimeter ratio as the scale.
3.4. Separating the butterfly image from the text
The contours around the letters can be a distraction to the program’s line and contour detection on
the butterfly. To avoid this issue, I separated the butterfly and text into two sub-images. Anything
below the lowest point on the butterfly contour is treated as text (assuming the text is below the
butterfly). Character recognition was done on the text image, while butterfly measurements were
performed on the butterfly image.
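A sketch of the split, assuming the butterfly contour comes from OpenCV in its usual (N, 1, 2) array shape:

```python
def split_butterfly_and_text(img, outline):
    # `outline` is an OpenCV contour of shape (N, 1, 2); since the image
    # y-axis grows downward, the contour's largest y is the butterfly's
    # lowest point. Everything below that row is treated as text.
    lowest_y = int(outline[:, :, 1].max())
    butterfly_img = img[:lowest_y, :]  # upper region: measured for lengths
    text_img = img[lowest_y:, :]       # lower region: passed to Tesseract
    return butterfly_img, text_img
```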
3.5. Sorting the information into an Excel Spreadsheet
I incorporated the openpyxl library in Python to manipulate Excel spreadsheets for data collection.
Each time a set of images is passed through the program, a new workbook is generated. The first
row of the spreadsheet contains the designated headers, as shown in Figure 12. Labels in the
original image are used for each line of text to identify the genus, species, sex, etc. (in reality,
there are no labels for the butterfly's description). Each line of text is then separated into a list of
strings. For each butterfly image, a row is appended to the worksheet, and the corresponding
information is filled in the cells by matching the text labels to the column headers. An example is
shown in Figure 12. The text labels are detected through fuzzy matching, with the
SequenceMatcher class in the difflib library. In case Tesseract incorrectly recognizes some
character(s) in a text label, I set a threshold: as long as at least 75% of the characters in the label
are correct, the program matches the label with the header.
Family → YES (100% match)
Famly → YES (83% match)
Femllv → NO (50% match)
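A sketch of the matching and spreadsheet logic with difflib and openpyxl; the header list, sample OCR lines, and the 0.75 ratio threshold are illustrative (the paper's 75% criterion is over characters, which SequenceMatcher's ratio approximates):

```python
from difflib import SequenceMatcher
from openpyxl import Workbook

HEADERS = ["Family", "Genus", "Species", "Sex", "Authority"]  # illustrative subset

def match_header(label, threshold=0.75):
    # Return the best-matching column header, or None if below the threshold.
    best, best_ratio = None, 0.0
    for header in HEADERS:
        ratio = SequenceMatcher(None, label.lower(), header.lower()).ratio()
        if ratio > best_ratio:
            best, best_ratio = header, ratio
    return best if best_ratio >= threshold else None

wb = Workbook()      # a new workbook per set of images
ws = wb.active
ws.append(HEADERS)   # first row: the designated headers

row = {h: "" for h in HEADERS}
for line in ["Famly: Danaus", "Species: Plexippus"]:  # sample OCR output lines
    label, _, value = line.partition(":")
    header = match_header(label.strip())              # fuzzy label-to-header match
    if header:
        row[header] = value.strip()
ws.append([row[h] for h in HEADERS])
wb.save("butterflies.xlsx")
```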
3.6. Test the Program
The Python program was tested on 12 images of butterflies taken from the book Learning About
Butterflies by Carolyn Klass and Robert Dirig [10] and various websites. The program goes
through all given directories and subdirectories to select the butterfly images.
Figure 12. Final result for the 12 butterfly images
Results show that 11 out of 12 butterflies (92%) were successfully measured and cataloged.
The one butterfly (red box in Figure 14) was not identified correctly because its wing color is
similar to the background color, which hampered the program's ability to distinguish the
butterfly's contour. I also noticed that rotation of the butterfly image disrupts the program's
contour detection. Because of differences in the fonts of the text, there were minor errors in the
character recognition. Overall, the program proved to be highly accurate and efficient for
measurement and data cataloging.
Figure 14. Cataloged data in spreadsheet for 12 butterflies
5. CONCLUSIONS
Data collection is an essential but time-consuming and labor-intensive step in ecological
research. In this project, two important computer vision methods were implemented to automate
data cataloging for ecological study. Optical Character Recognition is an effective approach to
scan in written or printed information, which will then be sorted into a spreadsheet. Proper pre-
processing is first done on the images before Tesseract can be integrated for character recognition.
Although there are limitations to Tesseract’s detection of certain fonts, overall, Tesseract can
successfully identify words of basic fonts. Contour detection is an advanced technique that can be
utilized to measure an image. Shapes and mathematical calculations are crucial in determining the
precise location of the points on which to draw the body and forewing lines of the butterfly.
Finally, the butterfly information is input into a spreadsheet, and 92% accuracy is achieved by
the program for the set of butterflies measured. While this program is currently limited to butterfly
measurements, similar techniques can be applied to the measurement of more organisms.
With the help of computer vision, scientists no longer need to invest significant amounts of their
time on data cataloging and measurement. The outcome of this project allows researchers to
automate the data collection process and focus more on their research and analyses.
6. FUTURE WORK
In this project, I integrated OCR, specifically Tesseract, to automate mass data collection.
Tesseract is a highly useful tool for character detection, but it still has limitations. Many factors,
such as font size, font type, kerning, and line spacing, must be considered to ensure the most
accurate results. I would like to explore more of these limitations,
and analyze whether other OCR algorithms, such as K-Nearest Neighbor (KNN) and Support
Vector Machine (SVM), can be trained to detect more fonts and handwritten text. Another possible
project would be the implementation of a function that recognizes which words are the family,
species, genus, sex, and authority and year, without the labels.
Computer-based measurements were also successfully implemented in this project. The next step
will be to test the program on more species of butterflies, and to observe whether various wing
shapes, patterns, and colors may disrupt the program’s detection and yield unexpected results. To
achieve a higher accuracy with the 12 butterflies above, I would explore additional pre-processing
techniques that could discern the butterfly contour from the background. I would also like to
integrate a feature that could rotate a butterfly image so the butterfly’s body would be parallel to
the vertical edges of the image window. Finally, since the current program is specific to butterflies
only, a potential project would be to apply similar techniques in the measurement of other
organisms.
ACKNOWLEDGEMENTS
I would like to thank Professor Susan Fox at Macalester College for mentoring me in computer
vision and guiding me through this research project. Thank you to the staff at Pioneer for all your
help. Thank you to my Earth Science internship mentor Dr. Noel Heim at Stanford University for
your encouragement in pursuing this project.
REFERENCES
[1] Tesseract OCR (Optical Character Recognition). Google. Retrieved September 04, 2017, from
https://opensource.google.com/projects/tesseract
[2] pytesseract 0.1.7. Python Package Index. Retrieved September 04, 2017, from
https://pypi.python.org/pypi/pytesseract
[3] Smith, R. (2007, September). An overview of the Tesseract OCR engine. In Document Analysis and
Recognition, 2007. ICDAR 2007. Ninth International Conference on (Vol. 2, pp. 629-633). IEEE.
[4] White, D. J., Svellingen, C., & Strachan, N. J. C. (2006). Automated measurement of species and
length of fish by computer vision. Fisheries Research, 80(2), 3rd ser., pp. 203-210.
[5] Introduction to Support Vector Machines (SVM). Retrieved September 2017, from
http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
[6] Xue, Y. (2014). Optical Character Recognition. Department of Biomedical Engineering, University of
Michigan. https://pdfs.semanticscholar.org/5b6b/e3357dbdbac38e92515a7b7aebb7e622f635.pdf
[7] Kumar, P., Sharma, N., & Rana, A. (2012). Handwritten Character Recognition using Different Kernel
based SVM Classifier and MLP Neural Network (Comparison). International Journal of Computer
Applications, 53(11), September 2012.
[8] Understanding k-Nearest Neighbour. Retrieved September 04, 2017, from
http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_ml/py_knn/py_knn_understanding/py_knn_understanding.html#knn-understanding
[9] Sajjad, K. M. (2012). Automatic license plate recognition using Python and OpenCV. Department of
Computer Science and Engineering, MES College of Engineering. Retrieved September 2017, from
http://sajjad.in/content/ALPR_paper.pdf
[10] Dirig, R., & Klass, C. (1992, March). Learning About Butterflies. Cornell University, NY: Cornell
Cooperative Extension.
[11] Sweigart, A. (2015, April 14). Automate the Boring Stuff with Python: Practical Programming for
Total Beginners. San Francisco, CA: No Starch Press, Inc.
Author
Shannon Heh is a high school student (11th grade) at Lynbrook High School, San Jose,
California, USA. She is President of the Girls Who Code Organization in San Jose for the
2018 to 2019 school year.