Emesal Demo

Origins of Emesal // University of Helsinki

Process description
1. Data collection
Data is collected from Oracc with a modified Compass Data Acquisition script based on Niek Veldhuis' example. In addition, CDLI metadata is collected to fill out some metadata fields.

Texts are provided with some additional metadata: they are dated to the first or second millennium BCE based on Oracc / CDLI metadata. For composite texts the dating is decided by a majority vote (i.e. if most of the witnesses come from the OB period, the text is labeled as 2nd millennium). In addition, texts are provided with fixed GPS coordinates and other Korp-Oracc metadata fields.
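
A minimal sketch of the majority vote, assuming each witness has already been resolved to a millennium label (the helper name and label strings are hypothetical):

from collections import Counter

def date_composite(witness_periods):
    # Return the most common millennium label among the witnesses.
    return Counter(witness_periods).most_common(1)[0][0]

# Most witnesses are Old Babylonian, so the composite is 2nd millennium:
date_composite(['2nd millennium', '2nd millennium', '1st millennium'])
# -> '2nd millennium'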

All data can be re-downloaded and processed using pull_data.py. If the new dataset contains previously unseen ES words, these must be added manually to the TSV version of the equivalence list. VRT conversion with all data enrichment can be done for all downloaded files with make_vrt.py, and all vector files and visualizations can be recreated with emesal_vectors.py. All other scripts are imported as modules by these scripts and should not be modified.
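
A minimal sketch of the pipeline order, assuming the scripts are run from the repository root and need no extra arguments:

import subprocess

# Run the pipeline in order: download, VRT conversion, vectors/plots.
for script in ('pull_data.py',        # re-download Oracc and CDLI data
               'make_vrt.py',         # VRT conversion with data enrichment
               'emesal_vectors.py'):  # rebuild vectors and visualizations
    subprocess.run(['python', script], check=True)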

2. Data selection
Data is read from the Oracc JSON files and transformed into Python objects via the VRT format. Currently, the following Oracc projects contain Emesal, and the final dataset is a subset of these corpora. The subset can be defined by filtering the Emesal Text objects (see the sketch after the list below):
> vrt/blms.vrt
> vrt/dsst.vrt
> vrt/eisl.vrt
> vrt/epsd2-earlylit.vrt
> vrt/epsd2-literary.vrt
> vrt/epsd2-praxis-liturgy.vrt
> vrt/epsd2-praxis-udughul.vrt
> vrt/epsd2-praxis-varia.vrt
> vrt/epsd2-praxis.vrt
> vrt/epsd2-royal.vrt
> vrt/obel.vrt
> vrt/dcclt.vrt
> vrt/ribo-babylon2.vrt
> vrt/ribo-babylon6.vrt
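
A minimal sketch of such filtering; the Text class and its attributes (project, period) are hypothetical stand-ins for whatever the VRT parsing actually produces:

# Keep only first-millennium texts from selected projects;
# the attribute names `project` and `period` are assumptions.
def select_subset(texts, projects=('blms', 'eisl'), period='1st millennium'):
    return [t for t in texts if t.project in projects and t.period == period]
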
3. Anonymization
The data is anonymized, meaning that the distinction between ES and EG words is purposefully lost. For example, gašan[lady]N and nin[lady]N are both anonymized into gašan[lady]N.nin[lady]N. This allows the word vectors to better capture the semantic contexts of the words and reduces bias toward the language variant being used. In addition, all ES lemmata are suffixed with ^e to distinguish ambiguous lemmata such as dumu[child]N, which in Oracc may point to the ES or EG spelling of the word.

Note: Some dubious ES:EG pairs such as a-ra-zu have been discarded from the equivalence lists, and thus in some instances words are not anonymized. For example, a-ra-zu exists as two lemmata: arazu[supplication]N and arazu[supplication]N^e. This is not a bug, but a result of unclean source data. The same applies to other words incorrectly labeled as ES in Oracc.

In this demo, all proper nouns have been reduced to their part-of-speech tags (royal names to RN, temple names to TN, divine names to DN, etc.), as seeing ES or EG names might bias the center word prediction.
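
A minimal sketch of the pair merging and proper noun reduction (the lookup tables and helper are hypothetical, the real pairs come from the TSV equivalence list, and the ^e suffixing of ES lemmata is assumed to happen beforehand):

ES_TO_EG = {'gašan[lady]N': 'nin[lady]N'}        # assumed pair from the TSV list
EG_TO_ES = {eg: es for es, eg in ES_TO_EG.items()}
PROPER_NOUN_TAGS = {'RN', 'TN', 'DN'}            # royal, temple, divine names

def anonymize(lemma, pos):
    if pos in PROPER_NOUN_TAGS:
        return pos                               # reduce names to their POS tag
    if lemma in ES_TO_EG:                        # ES member of a known pair
        return f'{lemma}.{ES_TO_EG[lemma]}'
    if lemma in EG_TO_ES:                        # EG member of the same pair
        return f'{EG_TO_ES[lemma]}.{lemma}'
    return lemma                                 # no known equivalent

anonymize('nin[lady]N', 'N')   # -> 'gašan[lady]N.nin[lady]N'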

4. Center word prediction
The script emesal_vectors.py builds individual vector spaces for each ES:EG word pair such that all the surrounding words are anonymized. Vector spaces are built using an n-fold split: the first split uses the first 20% of the data for testing and the remaining 80% for building the vectors, the second split uses the second 20% for testing and the remaining 80% for building the vectors, and so on. The concatenation of all the test sets thus reproduces the original source data. This approach aims to ensure that exactly the same contexts are not present in the training data when center word predictions are made.
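
A minimal sketch of the split, here with n = 5 to match the 20% test segments described above (the function name is illustrative):

def nfold_splits(lines, n=5):
    # Yield (train, test) pairs; the test slices concatenate back to `lines`.
    size = len(lines) // n
    for i in range(n):
        lo = i * size
        hi = (i + 1) * size if i < n - 1 else len(lines)
        yield lines[:lo] + lines[hi:], lines[lo:hi]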

Note: In the current state, each test data segment is predicted only once and the results are concatenated to produce predictions over the full data set. This is because seeing the same context twice is very rare, and there is no clear advantage in averaging the predictions over identical contexts. For this reason, exactly the same contexts in the concordance view may at times have slightly different scores.

In the prediction process, every occurrence of a word (the center word) that has known ES and EG forms is iterated over, and the surrounding words within the prediction window (which can be an arbitrary number of preceding and following words, or a line) are used to predict whether the center word should be written in ES or EG. The prediction can be made either by averaging the cosine similarities of all context words (excluding lacunae, unlemmatized words, and line boundary markers) within the prediction window to the center word, or by summing the context word vectors and measuring the sum's similarity to the center word. These approaches seem to yield approximately the same results.

Predictions are scored by the cosine similarity difference, that is, how much greater the context's cosine similarity is to the correct center word than to the incorrect one. Thus, if the context has a cosine similarity of 0.2 to the expected center word and 0.15 to the incorrect center word, the score is 0.05.
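
A minimal numpy sketch of both aggregation strategies and the resulting cosine similarity difference score (function names are illustrative, not the actual ones in emesal_vectors.py):

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(context_vecs, es_vec, eg_vec, method='sum'):
    if method == 'sum':
        ctx = np.sum(context_vecs, axis=0)   # one vector for the whole context
        sim_es, sim_eg = cosine(ctx, es_vec), cosine(ctx, eg_vec)
    else:
        sim_es = np.mean([cosine(v, es_vec) for v in context_vecs])
        sim_eg = np.mean([cosine(v, eg_vec) for v in context_vecs])
    # Score: how much closer the context is to the winning form,
    # e.g. 0.20 vs 0.15 -> ('ES', 0.05).
    if sim_es >= sim_eg:
        return 'ES', sim_es - sim_eg
    return 'EG', sim_eg - sim_es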

Note: The scores displayed in the concordance view (on mouse-over), the scatter plots, and the statistics view come from globally calculated word vectors instead of the n-fold split. They show actual similarities measured from the data.

The vector spaces are built using SPMI+SVD with pmi_embeddings.py. The following parameters are used by default:

import pmi_embeddings as embs   # SPMI+SVD embedding module

chunk_size = 400000
parameters = {
    "window_size": vector_window,   # set elsewhere in emesal_vectors.py
    "min_count": 2,
    "subsampling_rate": None,
    "k_factor": 0,
    "dynamic_window": True,
    "window_scaling": False,
    "dirty_stopwords": False,
    "verbose": False
}

pmi_parameters = {
    'shift_type': 0,
    'alpha': None,
    'lambda_': None,
    'threshold': 5,
    'pmi_variant': None
}

# vectordata and train are set elsewhere in emesal_vectors.py
train_path = f'./{vectordata}/global/{train}'
vec_path = f'./{vectordata}/vec/global/'

embeddings = embs.Cooc(train_path, chunk_size, **parameters)
embeddings.count_cooc()                      # count co-occurrences
embeddings.calculate_pmi(**pmi_parameters)   # apply (shifted) PMI weighting
dimensions = 60
embeddings.factorize(dimensions)             # SVD down to 60 dimensions

Folder structure
.                         scripts
./oracc/                  Oracc JSON zip files
./cdli/                   CDLI catalogue
./vrt/                    VRT files produced from JSON
./vectordata/global/      WPL files without n-fold split
./vectordata/vec/         Anonymized vectors
./vectordata/vec/global/  Vectors without n-fold split
./vectordata/test/        WPL files for test data
./vectordata/train/       WPL files for train data
./vectordata/stats/       HTML output, see PREFIX_index.html for the main page


generated with emesal_vectors.py -- asahala 2022