An Algorithm to Transcribe Ancient Kuzushiji into Contemporary Japanese Characters

Oleg Panichev
Author: Oleg Panichev, Senior Research Engineer at Ciklum. Oleg has 5+ years of experience in machine learning, deep learning, and data science, with a background in biomedical signal processing. His team took 5th place in the Epileptic Seizure Prediction competition organized by Melbourne University.

The Situation

Imagine the history contained in a thousand years of books. What stories are in those books? What knowledge can we learn from the world before our time? What was the weather like 500 years ago? What happened when Mt. Fuji erupted? How can one fold 100 cranes using only one piece of paper? The answers to these questions are in those books.

The Challenge

The central challenge was to transcribe Kuzushiji into contemporary Japanese characters. A working model would help the Center for Open Data in the Humanities (CODH) develop better algorithms for Kuzushiji recognition. Such a model is not only a contribution to the machine learning community but also a way to make millions of documents more accessible, which could lead to new discoveries in Japanese history and culture. The task was further complicated by show-through: on especially thin paper, characters from the opposite side of the page are sometimes visible. Those characters had to be ignored.

Duration: 2 months

The Data

The Kuzushiji dataset contains over 4300 unique characters. However, the frequency distribution is very long-tailed, and a large fraction of the characters (Kanji with very specific meanings) may appear only once or twice in a book. As a result, the dataset is highly imbalanced.
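To make the long-tail point concrete, here is a minimal sketch of how one might measure the share of character classes that appear only once or twice. The function name `rare_class_share` and the toy labels are invented for illustration; they are not taken from the competition data.

```python
from collections import Counter


def rare_class_share(labels, max_count=2):
    """Return the fraction of distinct classes that occur at most max_count times."""
    counts = Counter(labels)
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c <= max_count)
    return rare / len(counts)


# Toy labels, invented for illustration only (not real dataset statistics).
toy_labels = ["の"] * 50 + ["に"] * 30 + ["し"] * 15 + ["鶴", "鶴", "噴", "富"]

# Three of the six distinct characters occur at most twice.
print(rare_class_share(toy_labels))  # 0.5
```

On a real label file, a high value here would be a signal that a classifier sees very few examples for many of its classes.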

Examples of training images from the competition

The Solution

Because character recognition is a complex task and there are more than 4300 different characters, the Ciklum team split the problem between two models: a neural network for character detection and a separate model for character classification.
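The two-stage idea can be sketched in a few lines. This is only an outline under stated assumptions: the article does not describe the actual architectures, so `detect` and `classify` are hypothetical callables standing in for the two models, and the page is a plain nested list rather than a real image.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), an assumed box format


def transcribe_page(
    image: Sequence[Sequence[int]],
    detect: Callable[[Sequence[Sequence[int]]], List[Box]],
    classify: Callable[[List[Sequence[int]]], str],
) -> List[Tuple[str, int, int]]:
    """Run detection, then classify each detected crop.

    Returns one (label, center_x, center_y) tuple per detected box.
    """
    results = []
    for x, y, w, h in detect(image):
        # Cut the detected region out of the page and classify it on its own.
        crop = [row[x:x + w] for row in image[y:y + h]]
        results.append((classify(crop), x + w // 2, y + h // 2))
    return results


# Stand-in models, hypothetical and for illustration only.
def fake_detect(image):
    return [(0, 0, 2, 2), (2, 2, 2, 2)]


def fake_classify(crop):
    return "A" if crop[0][0] == 1 else "B"


page = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 2],
]
print(transcribe_page(page, fake_detect, fake_classify))
```

Separating the stages means the detector only has to answer "where is a character?", while the classifier handles the choice among thousands of classes on a tightly cropped input.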

The Result

The final result was evaluated using the F1-score. A perfect model would score 1, meaning that every character was detected and classified correctly. The Ciklum team's model detected and classified Kuzushiji characters with an F1-score of 0.873, placing 24th among 293 teams.
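For readers unfamiliar with the metric, here is a minimal sketch of the standard F1 computation from counts of true positives, false positives, and false negatives. The article does not state how the competition matched predictions to ground truth, so that step is left out, and the counts below are invented.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: the harmonic mean of precision and recall."""
    if tp == 0:
        # No correct predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented counts: 8 characters found and labeled correctly,
# 2 wrong or spurious predictions, 2 characters missed.
print(round(f1_score(tp=8, fp=2, fn=2), 3))

# A perfect model has no false positives or false negatives.
print(f1_score(tp=10, fp=0, fn=0))
```

Because F1 combines precision and recall, a model is penalized both for predicting characters that are not there (for example, show-through from the reverse side) and for missing characters that are.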
