Copyright © 2006 Giang M Tang
CORPORA OF VIETNAMESE TEXTS (CVT)
CVT: THE BASICS

The Corpora of Vietnamese Texts (CVT) consists of more than 1 million Vietnamese words. Two corpora were combined to form the CVT: 1) Vietnamese children’s literature and 2) Vietnamese online newspapers.

The corpus of Vietnamese children’s literature (i.e., picture books) includes a variety of genres, such as repetitive books, stories, translated stories, and folklore, ranging from preschool to fifth-grade reading level. Chapter books and comics were excluded from the corpus. Materials were published in Vietnam and abroad. A full list of the books included in the Vietnamese children’s literature corpus is provided separately.

The second corpus comprises online Vietnamese newspaper articles from four sources: two published in Vietnam and two published in the U.S.A. Articles were collected from April to July 2006. Topics included world news, Vietnam news, politics, health and medicine, education, current events, sports, editorials, news about Vietnamese abroad, economics, science and technology, relaxation, love, and daily life. Non-story items, such as advertisements and comics, were excluded from the corpus. More details on the Vietnamese newspaper corpus are provided separately.

The following table summarizes the composition of the Corpora of Vietnamese Texts.

Composition of CVT

Corpus                      Source        Published      Words
1. Children’s literature    78 books      Abroad         42,690
                            279 books     VN            161,793
                            SUBTOTAL                    204,443
2. Newspaper articles       Thanh Niên    VN            114,099
                            Tuổi Trẻ      VN            151,183
                            VNN           USA           542,834
                            VOA           USA            43,058
                            SUBTOTAL                    851,174
TOTAL WORDS                                           1,055,617
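
The word counts in the table above can be cross-checked with a few lines of Python. This is a minimal sketch and not part of the original CVT materials: the dictionary layout, the variable names, and the `check` helper are illustrative assumptions, and the figures are copied from the table as printed.

```python
# Word counts copied from the "Composition of CVT" table.
# The data layout and all names below are illustrative assumptions.
children = {"78 books (abroad)": 42_690, "279 books (VN)": 161_793}
newspapers = {
    "Thanh Niên": 114_099,
    "Tuổi Trẻ": 151_183,
    "VNN": 542_834,
    "VOA": 43_058,
}
reported = {"children": 204_443, "newspapers": 851_174, "total": 1_055_617}


def check(label, rows, reported_subtotal):
    """Print and return (sum of rows) minus (reported subtotal)."""
    summed = sum(rows.values())
    diff = summed - reported_subtotal
    print(f"{label}: rows sum to {summed:,}, "
          f"reported {reported_subtotal:,}, difference {diff:+,}")
    return diff


children_diff = check("Children's literature", children, reported["children"])
newspaper_diff = check("Newspaper articles", newspapers, reported["newspapers"])

# Compare the two reported subtotals against the reported grand total.
total_diff = reported["children"] + reported["newspapers"] - reported["total"]
print(f"Subtotals vs. total: difference {total_diff:+,}")
```

With the figures as printed, the newspaper rows match their subtotal and the two subtotals add up to the reported total. The two children's literature rows sum to 204,483, which is 40 more than the printed subtotal of 204,443; the source does not indicate which of these figures is the intended one.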

© 2004-2006 VNSpeechTherapy.com. All rights reserved.