Corpora of Vietnamese Texts (CVT)
Methods
Data collection
Because text-scanning software for the Vietnamese language
was unavailable at the time, all the children’s books in the Vietnamese
children’s literature corpus were typed into a word processor and saved as text
files. (See Vietnamese children’s literature corpus for a complete list
of books.) I used all Vietnamese picture books available to me (excluding
chapter books and comics): more than 350 books borrowed from
elementary school and public libraries and purchased from bookstores in Viet
Nam. All texts were typed in Vietnamese using VPS Keys to enter Vietnamese
characters. For more information about VPS Keys by the Vietnamese Professional
Society © 1993-2003, please refer to www.vps.org.
The newspaper articles in the Vietnamese newspaper corpus
were all collected from online sources. I aimed to gather articles
published in both Viet Nam and the United States, because applications of CVT
primarily target Vietnamese American populations. Articles from a variety of
categories were selected to capture a broad representation of daily language use.
(See Vietnamese newspaper corpus for a detailed description of newspaper
categories.) Online articles were copied and pasted into a word processor and
saved as text files.
Data analysis
A concordance program, MonoConc Pro 2.2 © 1996,
2004 Michael Barlow, was used to analyze the data. Although MonoConc Pro 2.2
can read multiple languages, it has yet to be programmed for
the Vietnamese language. Therefore, I needed to format and code
language-specific characters, such as tone markers and vowels, so that MonoConc
Pro 2.2 could read them. Click here to see the font coding system created specifically
for this project. For more information about MonoConc software, please refer to
www.athel.com or write to info@athel.com.
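To illustrate what coding tone markers and vowels for a concordancer can involve, here is a minimal Python sketch. It is not the font coding system created for this project. The mark-to-symbol mapping is an assumption loosely modeled on the VIQR convention, and the names MARK_CODES and encode_ascii are hypothetical.

```python
import unicodedata

# Hypothetical ASCII codes for Vietnamese combining marks (VIQR-like).
# This is an assumption for illustration, NOT the CVT coding system.
MARK_CODES = {
    "\u0301": "'",  # acute accent (sac tone)
    "\u0300": "`",  # grave accent (huyen tone)
    "\u0309": "?",  # hook above (hoi tone)
    "\u0303": "~",  # tilde (nga tone)
    "\u0323": ".",  # dot below (nang tone)
    "\u0302": "^",  # circumflex (vowel quality)
    "\u0306": "(",  # breve (vowel quality)
    "\u031b": "+",  # horn (vowel quality)
}


def encode_ascii(text):
    """Replace Vietnamese diacritics with ASCII codes placed after the base letter."""
    out = []
    # NFD splits each accented letter into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", text):
        if ch in MARK_CODES:
            out.append(MARK_CODES[ch])
        elif ch == "\u0111":  # lowercase d with stroke has no decomposition
            out.append("dd")
        elif ch == "\u0110":  # uppercase D with stroke
            out.append("DD")
        else:
            out.append(ch)
    return "".join(out)
```

In this sketch the codes follow the order in which Unicode normalization emits the marks, which can differ from the order VIQR prescribes. Symbols such as `?` and `.` also double as punctuation, so a real scheme would have to choose codes the concordancer does not treat as word boundaries.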
Click here to see the complete list of words in the
CVT in order of frequency.
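A frequency-ordered word list of this kind can be approximated with a short Python sketch. The CVT list itself was produced with MonoConc Pro 2.2, so the tokenization below is an assumption: it treats every run of letters as one word, which in Vietnamese orthography usually means one syllable. The names frequency_list and read_corpus, the folder layout, and the UTF-8 encoding are hypothetical.

```python
import re
import unicodedata
from collections import Counter
from pathlib import Path

# A token is a run of Unicode letters (no digits, underscores, or punctuation).
TOKEN = re.compile(r"[^\W\d_]+")


def frequency_list(texts):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter()
    for text in texts:
        # NFC keeps each accented Vietnamese letter as a single character.
        normalized = unicodedata.normalize("NFC", text).lower()
        counts.update(TOKEN.findall(normalized))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def read_corpus(folder):
    """Yield the contents of every .txt file in a folder (assumed UTF-8)."""
    for path in sorted(Path(folder).glob("*.txt")):
        yield path.read_text(encoding="utf-8")
```

Calling `frequency_list(read_corpus("corpus"))` on a folder of saved text files would then give the words in descending order of frequency. Multi-syllable Vietnamese words are counted syllable by syllable under this assumption.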