Corpora of Vietnamese Texts (CVT)
The Basics
The Corpora of Vietnamese Texts (CVT) consists of more than 1 million
Vietnamese words. Two corpora were combined to form the CVT: 1) Vietnamese
children’s literature and 2) Vietnamese online newspapers.
The corpus of Vietnamese children’s literature (i.e., picture books) includes
a variety of genres such as repetitive books, stories, translated stories, and
folklore, ranging from preschool to fifth-grade reading level. Chapter books and
comics were excluded from the corpus. Materials were published in Vietnam and
abroad. Click here to see a full list of books included in the
Vietnamese children’s literature corpus.
The second corpus consists of online Vietnamese newspaper articles from a
total of four sources: two sources published in Viet Nam and two sources
published in the U.S.A. Articles were collected from April to July of 2006.
Topics included World news, Viet Nam news, Politics, Health and Medicine,
Education, Current Events, Sports, Editorials, News about Vietnamese abroad,
Economics, Science and Technology, Relaxation, Love, and Daily life. Non-story
items such as advertisements and comics were excluded from the corpus.
Click here for more details on the Vietnamese newspaper corpus.
The following table summarizes the composition of the Corpora of Vietnamese
Texts.
Composition of CVT
| Corpus | Source | Published | # words |
|---|---|---|---|
| 1. Children’s literature | 78 books | Abroad | 42,690 |
| | 279 books | VN | 161,793 |
| | SUBTOTAL | | 204,443 |
| 2. Newspaper articles | Thanh Niên | VN | 114,099 |
| | Tuổi Trẻ | VN | 151,183 |
| | VNN | USA | 542,834 |
| | VOA | USA | 43,058 |
| | SUBTOTAL | | 851,174 |
| TOTAL WORDS | | | 1,055,617 |
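The subtotals in the table above can be cross-checked by summing the per-source counts. The following is a minimal sketch in Python, not part of the CVT materials. The names `ROWS`, `LISTED`, `tally`, and `compare` are illustrative assumptions, and the figures are copied from the table as listed. Summed this way, the newspaper rows match their listed subtotal, while the two children’s literature rows add up to 204,483, which is 40 more than the listed 204,443. The source does not indicate which of those figures is the intended one.

```python
# Word counts copied from the "Composition of CVT" table.
# Tuple layout (an assumption for this sketch): corpus, source, published, words.
ROWS = [
    ("Children's literature", "78 books", "Abroad", 42_690),
    ("Children's literature", "279 books", "VN", 161_793),
    ("Newspaper articles", "Thanh Niên", "VN", 114_099),
    ("Newspaper articles", "Tuổi Trẻ", "VN", 151_183),
    ("Newspaper articles", "VNN", "USA", 542_834),
    ("Newspaper articles", "VOA", "USA", 43_058),
]

# Subtotals and grand total exactly as listed in the table.
LISTED = {
    "Children's literature": 204_443,
    "Newspaper articles": 851_174,
    "TOTAL": 1_055_617,
}


def tally(rows):
    """Sum word counts per corpus, plus an overall total."""
    sums = {}
    for corpus, _source, _published, words in rows:
        sums[corpus] = sums.get(corpus, 0) + words
    sums["TOTAL"] = sum(row[3] for row in rows)
    return sums


def compare(rows, listed):
    """Return {label: (computed, listed, computed - listed)}."""
    computed = tally(rows)
    return {
        label: (computed[label], value, computed[label] - value)
        for label, value in listed.items()
    }


if __name__ == "__main__":
    for label, (got, want, diff) in compare(ROWS, LISTED).items():
        print(f"{label}: computed {got:,}, listed {want:,}, difference {diff:+,}")
```

A difference of zero means the listed figure agrees with the row sums. A nonzero difference only flags an inconsistency and does not say which number is wrong.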