Parallel Annotated Synthetic Corpora

Perfect multilingual alignment

Accurate and idiomatic translations

Rich set of annotation tags

Overview

The goal of the Parallel Annotated Synthetic Corpora (PASC) project is to create large-scale synthetic corpora for various natural language processing (NLP) applications, including machine translation (MT), in particular neural machine translation (NMT), and generative AI. Synthetic data mimics natural language and is used especially for training machine learning models when real data is scarce or expensive to obtain. Synthetic corpora show great promise in improving machine translation quality.

The PASC project aims to create synthetic corpora using supervised generation techniques. Unlike augmented corpora, which expand existing corpora, PASC constructs synthetic corpora from scratch using predefined sentence templates, ensuring strict adherence to linguistic rules. This meticulous approach yields precise translations, accurate alignment and phonemic transcriptions, grammatical annotation, and more.
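As a rough illustration of template-based generation, the sketch below fills paired English and Japanese templates with the same name record, so that every generated row is aligned by construction. It is not PASC's actual pipeline: the `Name` record, the `TEMPLATES` list, the `generate` function, and the ID scheme are all assumptions made for this example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Name:
    """Hypothetical name record holding English and Japanese forms."""
    given_en: str
    family_en: str
    given_ja: str
    family_ja: str


# Hypothetical paired templates. Square brackets mark the annotated name spans.
# The Japanese side places the family name first, joined by a middle dot.
TEMPLATES = [
    ("My full name is [{given_en} {family_en}].",
     "私の姓名は[{family_ja}・{given_ja}]です。"),
    ("[{given_en}] is my given name and [{family_en}] is my surname.",
     "[{given_ja}]は私の名前で、[{family_ja}]は私の苗字です。"),
]


def generate(names, templates=TEMPLATES):
    """Return (id, english, japanese) rows, one per name and template pair."""
    rows = []
    for name_no, name in enumerate(names, start=1):
        fields = asdict(name)
        for tpl_no, (english, japanese) in enumerate(templates, start=1):
            # Assumed ID scheme: four-digit name number, two-digit template number.
            row_id = f"{name_no:04d}-{tpl_no:02d}"
            rows.append((row_id, english.format(**fields), japanese.format(**fields)))
    return rows


if __name__ == "__main__":
    for row in generate([Name("Michael", "Owen", "マイケル", "オーウェン")]):
        print("\t".join(row))
```

Because both sides of a row are produced from the same record and the same template pair, alignment does not have to be recovered after the fact, which is the main appeal of generating from scratch rather than augmenting an existing corpus.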

PASC includes very large-scale databases consisting of tens to hundreds of millions of entries for each domain. Currently it focuses on named entities, especially personal names, place names and points of interest for CJK languages and Arabic, to be followed by technical terms. Its distinctive features include full alignment, translation accuracy, accurate transcriptions, multilingual formats, full annotation, and consistency.

Parallel Annotated Synthetic Corpora

ID | ENGLISH | JAPANESE
0002-01 | My full name is [Michael Owen]. | 私の姓名は[オーウェン・マイケル]です。
0002-02 | [Michael] is my given name and [Owen] is my surname. | [マイケル]は私の名前で、[オーウェン]は私の苗字です。
0002-03 | I’m called [Michael Owen]. | [オーウェン・マイケル]と言います。
0002-04 | Both [Michael] and [Owen] are personal names. | [オーウェン]と[マイケル]は両方とも人名です。
0002-05 | [Michael Owen] is my full name. | [オーウェン・マイケル]とは私のフルネームです。
0002-06 | [Michael Owen] is what’s written on my ID. | 旅券に記載されている姓名は[オーウェン・マイケル]です。
0002-07 | I’ve never heard of anyone called [Michael Owen]. | [オーウェン・マイケル]と言う人のことを聞いたことがない。
0002-08 | I go by the name [Michael Owen]. | [オーウェン・マイケル]と言う名前で呼ばれています。
0002-09 | Do you know of anyone who goes by the name of [Michael Owen]? | [オーウェン・マイケル]という人を知っていますか。
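The square brackets in the sample rows mark the annotated name spans on both sides. A minimal sketch of how a consumer might pull those spans out and sanity-check a row is shown here. The function names are hypothetical, and the check only compares span counts, since the order of spans can differ between languages (as in row 0002-04).

```python
import re

# Matches one bracketed span and captures the text inside the brackets.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def tagged_spans(sentence):
    """Return the bracketed spans of a sentence, in order of appearance."""
    return BRACKETED.findall(sentence)


def span_counts_match(english, japanese):
    """Check that both sides of a row carry the same number of tagged spans."""
    return len(tagged_spans(english)) == len(tagged_spans(japanese))


if __name__ == "__main__":
    english = "Both [Michael] and [Owen] are personal names."
    japanese = "[オーウェン]と[マイケル]は両方とも人名です。"
    print(tagged_spans(english))
    print(tagged_spans(japanese))
    print(span_counts_match(english, japanese))
```

A count check of this kind catches dropped or duplicated tags. Pairing individual spans across languages would need extra information, such as explicit span identifiers, which the sample does not show.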

Practical Applications

PASC can enhance the quality of language models and NLP algorithms for various applications, such as:

Neural Machine Translation

Automatic Speech Recognition

Text-to-Speech

Related Resources

CNV

Chinese Personal Name Variants

Over 7 million Chinese and non-Chinese names and romanized variants

DAN

Database of Arabic Names

6.5 million Arabic personal names and their romanized variants

JNV

Japanese Personal Name Variants

Japanese personal names and their romanized variants