The main objective of this project is to develop a method for creating a dictionary using corpus data harvested from websites on tourism.
The method comprises four stages in a cycle, shown in the following figure.
Figure 1. Method for Developing the Dictionary of English for Tourism
Stage 1. Collecting websites, relevance analysis, and selection of websites
In the first stage, the researchers conducted an online search to collect the addresses of websites on tourism. The search focused on websites managed by governments (including tourism departments) and on websites owned and managed by individuals, such as travel blogs.
From the online search, the researchers collected 166 addresses of websites managed by governments and 524 addresses of websites owned and managed by individuals.
After the website addresses were obtained, a relevance analysis was conducted to evaluate whether the content of each website was related to tourism. In the relevance analysis, the researchers visited the websites and quickly scanned their content. The results of the scan were used as the basis for selecting websites. Only websites with content relevant to tourism were included in the research.
Based on the results of the relevance analysis, the researchers decided to include all 690 websites in the next stage of the research, the extraction of corpus data.
Stage 2. Extracting corpus data, analysis of data, classification and selection
In this stage, corpus data were extracted from the 690 websites using the tlCorpus software (https://tshwanedje.com/corpus/).
To harvest the corpus data from the websites, the researchers used the web crawling feature of tlCorpus. This kind of feature, commonly known as a web crawler or web spider, browses websites in a methodical, automated manner (Science Daily, 2018).
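The general idea of web crawling can be illustrated with a minimal sketch. It is not the tlCorpus implementation, whose internals are not described here. The sketch assumes a breadth-first traversal that stays within one site, and it uses a hypothetical in-memory site (SITE) in place of real network requests; the names PageParser, crawl, and SITE are illustrative only.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the text fragments and outgoing links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of one site.

    `fetch(url)` is a caller-supplied function returning the page HTML,
    or None when the page cannot be retrieved.
    """
    queue = deque([start_url])
    seen = {start_url}
    corpus = {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page unavailable: nothing to add to the corpus
        parser = PageParser()
        parser.feed(html)
        parser.close()
        corpus[url] = " ".join(parser.text_parts)
        for href in parser.links:
            target = urljoin(url, href)
            # Follow only links that stay on the starting site.
            if target.startswith(start_url) and target not in seen:
                seen.add(target)
                queue.append(target)
    return corpus


# Hypothetical two-page site standing in for a real tourism website.
SITE = {
    "https://example.org/": (
        '<p>Welcome to the island.</p>'
        '<a href="/beaches">Beaches</a>'
        '<a href="https://other.example/">Partner</a>'
    ),
    "https://example.org/beaches": (
        '<h1>Beaches</h1><p>Snorkeling tours daily.</p><a href="/">Home</a>'
    ),
}

corpus = crawl("https://example.org/", SITE.get)
for url, text in corpus.items():
    print(url, "->", text)
```

In a real crawl, the fetch function would issue HTTP requests and would return None for pages that the site refuses to serve.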
Each of the 690 websites was crawled. However, not every crawl ran successfully and generated corpus data. Because of restrictions set by the website administrators, crawling succeeded on only 188 websites (27%: 73 websites managed by governments and 116 websites managed by individuals). The web crawling feature of tlCorpus could not extract corpus data from the other 502 websites (73% of the collected websites).
Because 73% is a substantial share, the researchers applied technical measures, such as adjusting tlCorpus settings, to try to collect corpus data from the 502 websites. To avoid copyright infringement, the researchers made no changes to the websites themselves. Despite these measures, no corpus data could be retrieved from these websites. Therefore, the corpus data were harvested only from the 188 websites (a sample of the generated data is presented in the figure below).
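The percentages above can be checked with a short calculation. The sketch below uses only the totals reported in the text (166 and 524 collected addresses, 188 successful crawls); the variable names are illustrative. The per-category breakdown quoted above (73 and 116) sums to 189 rather than 188, so it is left out of the calculation.

```python
# Totals reported in the text.
government_sites = 166
individual_sites = 524
crawled = 188

total_sites = government_sites + individual_sites
not_crawled = total_sites - crawled

# Shares of the collected websites, rounded to whole percentages.
share_crawled = round(100 * crawled / total_sites)
share_not_crawled = round(100 * not_crawled / total_sites)

print(total_sites, not_crawled, share_crawled, share_not_crawled)
```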
At present, the researchers are analyzing the corpus data. The collected words are classified, and only words related to tourism are selected for inclusion in the dictionary database. The selection uses an expert-judgment method carried out by two researchers (see the Researchers section).
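One way to combine the judgments of two researchers is sketched below. The agreement rule is an assumption, since the text does not state how disagreements are handled: here a word is selected only when both judges mark it as tourism-related, and words on which the judges differ are set aside for discussion. The function select_terms and the sample words are hypothetical.

```python
def select_terms(judge_a, judge_b):
    """Combine two judges' decisions (word -> True/False).

    Assumed rule: keep a word only if both judges accept it;
    list words with differing decisions separately.
    """
    selected, disputed = [], []
    for word in sorted(judge_a.keys() & judge_b.keys()):
        a, b = judge_a[word], judge_b[word]
        if a and b:
            selected.append(word)
        elif a != b:
            disputed.append(word)
    return selected, disputed


# Hypothetical decisions; True means "related to tourism".
judge_a = {"itinerary": True, "hostel": True, "because": False, "ferry": True}
judge_b = {"itinerary": True, "hostel": True, "because": False, "ferry": False}

selected, disputed = select_terms(judge_a, judge_b)
print(selected)
print(disputed)
```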
Stage 3. Database building and website development
Due to resource and time constraints, the word entries are borrowed from the New Oxford American Dictionary (NOAD; Oxford University Press, 2013), and the sample sentences are obtained from the Writefull database (using the Writefull application).
To provide a voice file for each word, the following code from https://responsivevoice.org/ is used:
On the page, the code appears as a ‘play’ button that plays the pronunciation of the word.
The transcription of each word is written using Phonetizer (https://www.phonetizer.com/ui), based on the International Phonetic Alphabet (IPA).
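The pieces of information gathered in this stage can be pictured as one record per word. The sketch below is an assumption about how such a record might look, not the project's actual database schema; the class DictionaryEntry, its field names, and the placeholder values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DictionaryEntry:
    headword: str
    definition: str       # entry text, borrowed from NOAD
    sample_sentence: str  # obtained from the Writefull database
    transcription: str    # IPA transcription produced with Phonetizer


# Placeholder values only; real content comes from the sources named above.
entry = DictionaryEntry(
    headword="itinerary",
    definition="(definition text)",
    sample_sentence="(sample sentence)",
    transcription="(IPA transcription)",
)

# ensure_ascii=False keeps IPA symbols readable in the stored JSON.
serialized = json.dumps(asdict(entry), ensure_ascii=False)
print(serialized)
```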
Stage 4. Triangulation and improvement