Part 3: Analyse data with LoanPy

The following six steps describe how to input aligned CLDF data to loanpy, and how to mine sound correspondences and evaluate and visualise their predictive power.

Step 1: Mine phonotactic inventory

The phonotactic inventory is needed to predict phonotactic repairs during loanword adaptation.

cldfbench ronataswestoldturkic.mineEAHinvs invs.json

Create a list of all prosodic structures (like “CVCV”) attested in the target language and store it in a json-file.

ronataswestoldturkiccommands.mineEAHinvs.register(parser)

Register arguments. Only one argument is necessary: the name of the output file, which should end in .json.

ronataswestoldturkiccommands.mineEAHinvs.run(args)
  1. Read the aligned data in edictor/WOT2EAHedicted.tsv with Python’s built-in csv module

  2. Pass it on to loanpy’s get_prosodic_inventory function, which will extract all phonotactic structures (like “CVCV”) from the target language.

  3. Write the inventory of prosodic structures to a json-file with Python’s built-in json module. The file will have the name that was passed as an argument to the command and will be written to the folder loanpy.
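The idea behind the inventory extraction can be sketched in a few lines. This is a minimal illustration of what get_prosodic_inventory does, not its actual implementation; the vowel set and the segment format are simplifying assumptions made for this example:

```python
# Toy sketch: map each IPA segment to "C" or "V" and collect the
# distinct CV templates. The vowel set is deliberately incomplete.
VOWELS = set("aeiouyøœɑɒɛæɔʊɪə")

def prosodic_structure(segments):
    """Turn a list of IPA segments into a CV template like 'CVCV'."""
    return "".join("V" if seg[0] in VOWELS else "C" for seg in segments)

def get_inventory(forms):
    """Collect all distinct prosodic structures from a list of forms."""
    return sorted({prosodic_structure(f) for f in forms})

inventory = get_inventory([["h", "a", "d"], ["a", "l", "m", "a"]])
# inventory == ["CVC", "VCCV"]
```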

Step 2: Create heuristic sound substitutions

Since any phoneme can, in principle, enter a language through a loanword, we have to create heuristic adaptation predictions for as many IPA characters as possible, in this case 6491.

cldfbench ronataswestoldturkic.makeheur EAH heur.json

Create heuristic predictions of phoneme adaptations based on feature vector similarities of phonemes.

ronataswestoldturkiccommands.makeheur.register(parser)

Register arguments. Two arguments are necessary: the ID of the target language, i.e. the one in which loanwords are adapted (valid IDs can be found in column ID in etc/language.csv), and the name of the output file, which should end in .json.

ronataswestoldturkiccommands.makeheur.run(args)
  1. Pass on the target language ID as defined in etc/languages.csv to loanpy’s get_heur function, which will read the cldf/.transcription-report.json file and extract the phoneme inventory of the target language from there. It will also read the file ipa_all.csv, which is shipped with loanpy. From these two files it creates a heuristic prediction of phoneme substitution patterns in loanword adaptations.

  2. Write the results to a file named according to the value passed to the second argument. It will be written to the folder loanpy. Expected file size: ca. 2.5MB.
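The core idea of such a heuristic can be sketched as follows. The feature vectors below are invented for illustration and are far simpler than the ones in loanpy’s ipa_all.csv; only the ranking-by-feature-distance principle carries over:

```python
# Toy feature vectors: (voiced, continuant, labial, coronal).
# These values are made up for illustration only.
FEATURES = {
    "p": (0, 0, 1, 0),
    "b": (1, 0, 1, 0),
    "f": (0, 1, 1, 0),
    "t": (0, 0, 0, 1),
}

def closest(phoneme, inventory):
    """Rank inventory phonemes by feature-vector distance to `phoneme`."""
    src = FEATURES[phoneme]
    return sorted(
        inventory,
        key=lambda tgt: sum(a != b for a, b in zip(src, FEATURES[tgt])),
    )

# If the target language lacks /b/, /p/ is the nearest substitute:
ranking = closest("b", ["t", "f", "p"])
# ranking == ["p", "f", "t"]
```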

Step 3: Mine vertical and horizontal sound correspondences

The output will serve as fuel for predicting loanword adaptations and historical reconstructions later on.

cldfbench ronataswestoldturkic.minesc H EAH
cldfbench ronataswestoldturkic.minesc WOT EAH heur.json

Read in aligned data and write a json-file with information about sound correspondences.

ronataswestoldturkiccommands.minesc.register(parser)

Register arguments. Two arguments are necessary: the IDs of the target and source languages. In horizontal transfers, the donor language is the source and the recipient language is the target. In vertical transfers, the target language is the ancestor and the source is the descendant, since reconstructions run backwards in time. Valid IDs can be found in column ID in etc/language.csv. A third, optional argument is the path to the json-file containing the heuristic phoneme adaptations.

ronataswestoldturkiccommands.minesc.run(args)
  1. If a third argument was provided, read the json-file containing heuristic phoneme-adaptation predictions with Python’s built-in json module.

  2. Read aligned forms from edictor/{srclg}2{tgtlg}edicted.tsv

  3. Extract sound and phonotactic correspondences from the data with loanpy’s get_correspondences function

  4. Write the sound and phonotactic correspondences to a file named {srclg}2{tgtlg}sc.json in the folder loanpy.
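A stripped-down version of the mining step might look like this. The real get_correspondences function returns six dictionaries with more context; this sketch only counts aligned segment pairs:

```python
from collections import Counter

def mine_correspondences(aligned_pairs):
    """Count aligned segment pairs across source/target alignments."""
    counts = Counter()
    for src, tgt in aligned_pairs:
        for s, t in zip(src, tgt):
            counts[(s, t)] += 1
    return counts

# Two toy cognate pairs, segmented and aligned position by position:
pairs = [
    (["a", "l", "m", "a"], ["ɑ", "l", "m", "ɑ"]),
    (["a", "t"], ["ɑ", "t"]),
]
sc = mine_correspondences(pairs)
# sc[("a", "ɑ")] == 3
```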

Step 4: Make sound correspondences human-readable

The sound-correspondence file is stored as a computer-readable json. To create a human-readable tsv-file, run:

cldfbench ronataswestoldturkic.vizsc H EAH
cldfbench ronataswestoldturkic.vizsc WOT EAH

Read json file containing six dictionaries about sound and phonotactic correspondences and turn it into a human-readable tsv-file with additional info for easier manual inspection.

ronataswestoldturkiccommands.vizsc.register(parser)

Register arguments. Two arguments are necessary: the IDs of the target and source languages. In horizontal transfers, the donor language is the source and the recipient language is the target. In vertical transfers, the target language is the ancestor and the source is the descendant, since reconstructions run backwards in time. Valid IDs can be found in column ID in etc/language.csv.

ronataswestoldturkiccommands.vizsc.run(args)
  1. Read sound-correspondence json at loanpy/{srclg}2{tgtlg}sc.json.

  2. Transform computer-readable data structure to human-readable tables with loanpy.utils.scjson2tsv

  3. Merge IDs with info from related tables for easier manual inspection.

  4. Write the correspondence tables to two files named {srclg}2{tgtlg}sc.tsv and {srclg}2{tgtlg}sc_phonotactics.tsv in the folder loanpy.
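The conversion from a correspondence dictionary to a tsv table can be sketched like this, with hypothetical data; loanpy.utils.scjson2tsv additionally merges in IDs and info from related tables:

```python
import csv
import io

def sc_to_tsv(sc_counts):
    """Flatten {(source, target): count} into a tab-separated table."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target", "count"])
    for (src, tgt), n in sorted(sc_counts.items()):
        writer.writerow([src, tgt, n])
    return buf.getvalue()

tsv = sc_to_tsv({("a", "ɑ"): 3, ("l", "l"): 1})
```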

Step 5: Evaluate vertical and horizontal sound correspondences

In this section, we check the predictive power of the mined sound correspondences with loanpy’s eval_all function.

cldfbench ronataswestoldturkic.evalsc H EAH "[10, 100, 500, 700, 1000, 5000, 7000]"
cldfbench ronataswestoldturkic.evalsc WOT EAH "[10, 100, 500, 700, 1000, 5000, 7000]" True True heur.json

Specify the aligned wordlist data together with a few arguments and pass them on to loanpy’s evaluator module. Write the output to a json-file: a list of tuples pairing false positives with true positives.

ronataswestoldturkiccommands.evalsc.register(parser)

Register arguments. Three arguments are necessary: the first two are the IDs of the target and source language, as specified in column ID in etc/language.csv. The third is a list of integers specifying how many guesses to make per input word during each iteration; this roughly corresponds to the false-positive rate. Another three arguments are optional: set adapt=True for horizontal transfers, prosody=True if phonotactics should be repaired during loanword-adaptation prediction, and pass a file name to the parameter “heur” to add heuristic predictions of sound substitutions.

ronataswestoldturkiccommands.evalsc.run(args)
  1. Read aligned data between forms of the source and target language with the inbuilt csv library.

  2. If a filename was passed to parameter “heur”, open that file. It has to be located in the loanpy folder.

  3. Read the other parameters with literal_eval from Python’s built-in ast module.

  4. Pass the parameters and the data to loanpy’s eval_all function

  5. Write the results to a file called tpfp{srclg}2{tgtlg}.json.
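The evaluation logic can be sketched as follows. Here `predict` is a hypothetical stand-in for loanpy’s adaptation and reconstruction predictors, which this example does not reproduce; only the counting loop mirrors what eval_all does:

```python
def evaluate(pairs, guess_list, predict):
    """For each guess budget n, count how many gold forms appear
    among the model's first n guesses. Returns (n, tp) tuples."""
    results = []
    for n in guess_list:
        tp = sum(1 for src, gold in pairs if gold in predict(src, n))
        results.append((n, tp))
    return results

# Dummy predictor: proposes the source itself, then numbered variants.
def predict(src, n):
    return [src] + [src + str(i) for i in range(n - 1)]

points = evaluate([("alma", "alma"), ("kek", "kék")], [1, 10], predict)
# points == [(1, 1), (10, 1)]
```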

Step 6: Plot the evaluations

To gauge the performance of the model, we can plot an ROC curve, calculate its optimum cut-off value and its area under the curve (AUC), a common metric to evaluate predictive models:

cldfbench ronataswestoldturkic.plot_eval H EAH
cldfbench ronataswestoldturkic.plot_eval WOT EAH

The results:

Predicting reconstructions from modern Hungarian words:

A coordinate system with an x- and a y-axis and a blue graph going in an arch from the lower left-hand corner towards the upper left-hand corner and then the upper right-hand corner. The title of the plot reads "Predicting Early Ancient Hungarian forms (AUC: 0.7177)". The y-axis reads "Relative number of true positives in data (100%=406)"; its values go from 0.45 to 0.75 in steps of 0.05. The x-axis reads "Relative number of guesses per word (100%=7000)"; its values go from 0.0 to 1.0 in steps of 0.2. In the bottom right-hand corner there is an info box with a blue line saying "ROC Curve" and a dark yellow or ochre line with a dark yellow X on it that reads "Optimum: howmany=700 (tp: 70%)". There is a dark yellow/ochre X on the graph at 0.1 on the x-axis and 0.7 on the y-axis.

The ROC curve shows how the relative number of true positives (y-axis) increases as the relative number of false positives (x-axis) increases. The optimal cut-off point is at 700 false positives per word, which yields 284 correct reconstructions out of 406 (i.e. 70%). The AUC is just above 0.7, which is considered acceptable. Note that the relative number of false positives and the AUC stay the same, irrespective of whether false positives are counted on a per-word basis (7000) or as an aggregate sum (7000 × 813 = 5,691,000). The absolute number of possible true positives (406) was reached after prefiltering the 512 cognate sets of the raw data in Part 2, step 1.

The performance of this model can be improved by removing irregular sound correspondences. By inspecting the file loanpy/H2EAHsc.tsv we can see that many words contain sound correspondences that occur only once throughout the entire etymological dictionary. Counting the number of those cognate sets shows that 106 out of 406, or 26%, of all etymologies contain at least one sound correspondence that is irregular, i.e. occurs in only one single etymology. (Note that the pre-filtering did not skew this ratio because it picked all cognate sets with an Early Ancient Hungarian and a Hungarian counterpart.) If we remove those 106 cognate sets with irregular sound correspondences from our training and test data, 300 cognate sets remain and we get the following result:

A coordinate system with an x- and a y-axis and a blue graph going in a steep arch from the lower left-hand corner towards the upper left-hand corner and then the upper right-hand corner. The title of the plot reads "Predicting Early Ancient Hungarian forms (AUC: 0.9538)". The y-axis reads "Relative number of true positives in data (100%=300)"; its values go from 0.70 to 0.95 in steps of 0.05. The x-axis reads "Relative number of guesses per word (100%=7000)"; its values go from 0.0 to 1.0 in steps of 0.2. In the bottom right-hand corner there is an info box with a blue line saying "ROC Curve" and a dark yellow or ochre line with a dark yellow X on it that reads "Optimum: howmany=100 (tp: 93%)". There is a dark yellow/ochre X on the graph at almost 0.0 on the x-axis and 0.93 on the y-axis.

This model performs significantly better than the previous one. At its optimum of 100 guesses per word it reconstructs 279 out of 300 forms (93%) correctly. The AUC is above 0.9, which is considered outstanding.

Predicting loanword adaptations from West Old Turkic words:

A coordinate system with an x- and a y-axis and a blue graph going in a diagonal line from the lower left-hand corner towards the upper left-hand corner and then straight towards the upper right-hand corner. The title of the plot reads "Predicting Early Ancient Hungarian forms (AUC: 0.9318)". The y-axis reads "Relative number of true positives in data (100%=384)"; its values go from 0.84 to 0.94 in steps of 0.02. The x-axis reads "Relative number of guesses per word (100%=7000)"; its values go from 0.0 to 1.0 in steps of 0.2. This time, the info box is in the upper left-hand corner, with a blue line saying "ROC Curve" and a dark yellow or ochre line with a dark yellow X on it that reads "Optimum: howmany=100 (tp: 93%)". There is a dark yellow/ochre X on the graph at almost 0.0 on the x-axis and 0.9 on the y-axis.

Out of 512 etymologies, 384 contained loanword adaptations from West Old Turkic into Early Ancient Hungarian. This pre-filtering was carried out in Part 2, step 1. At its optimum of 100 guesses per word, the model predicted 346 words out of 384 (90%) correctly. The AUC is above 0.9, which is considered outstanding.

What happened under the hood:

Plot the results of the sound-correspondence file evaluation.

ronataswestoldturkiccommands.plot_eval.auc(points: List[Tuple[Union[int, float], Union[int, float]]]) → float

Calculate the area under the curve with the trapezoidal rule

Parameters:

points (list of tuples of integers or floats) – A list of x-y-coordinates

Returns:

The area under the curve

Return type:

float
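The trapezoidal rule can be sketched in a few lines; this is an illustration of the method, not the package’s actual implementation:

```python
def auc(points):
    """Area under the curve: sum the trapezoid between each pair of
    consecutive (x, y) points."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        total += (x2 - x1) * (y1 + y2) / 2
    return total

# A perfect ROC curve (straight up to the top-left corner) has AUC 1.0:
area = auc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
# area == 1.0
```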

ronataswestoldturkiccommands.plot_eval.euclidean_distance(point1, point2)

Calculate the Euclidean distance between two points.

Parameters:
  • point1 (a tuple of two integers or floats) – The first point

  • point2 (a tuple of two integers or floats) – The second point

Returns:

The Euclidean distance

Return type:

float

ronataswestoldturkiccommands.plot_eval.find_optimum(points: List[Tuple[Union[int, float], Union[int, float]]]) → Tuple[Union[int, float], Union[int, float]]

Calculate the Euclidean distance of each point to the upper left-hand corner and return the point with the lowest distance.

Parameters:

points (a list of tuples of floats or integers) – A list of coordinates representing points in the graph.

Returns:

The optimal cut-off point of the ROC curve

Return type:

a tuple of two floats or integers
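The two functions above can be sketched together as follows; a minimal version that assumes the ideal classifier sits at the coordinate (0, 1):

```python
import math

def euclidean_distance(point1, point2):
    """Straight-line distance between two 2D points."""
    return math.dist(point1, point2)

def find_optimum(points):
    """Return the ROC point closest to the perfect classifier (0, 1)."""
    return min(points, key=lambda p: euclidean_distance(p, (0, 1)))

best = find_optimum([(0.0, 0.0), (0.1, 0.7), (1.0, 1.0)])
# best == (0.1, 0.7)
```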

ronataswestoldturkiccommands.plot_eval.plot_curve(points: List[Tuple[Union[int, float], Union[int, float]]], absfp: List[int], maxtp: int, file_name: Union[str, Path]) → None
  1. Get a list of x- and y-axis values out of the points argument

  2. Plot them to a graph with matplotlib, add labels to axes.

  3. Calculate the area under the curve and add it to the plot title

  4. Calculate the optimal cut-off point and mark it with an “x” on the graph.

  5. Add a legend to the plot and write the image as a jpeg to the specified path

Parameters:
  • points (list of tuples of floats or integers) – List of coordinate points as tuples

  • absfp (a list of integers) – The absolute numbers of guesses made. Needed to add information to legend on how many guesses are the optimum.

  • maxtp (an integer) – The absolute number of possible true positives. Needed so a human reader can contextualise the relative numbers on the plot.

  • file_name (a string or pathlike object) – The desired name and location of the output jpeg-file.

Returns:

Writes the image to the specified path and returns None

Return type:

None

ronataswestoldturkiccommands.plot_eval.register(parser)

Register command line arguments and pass them on to the main function. Two non-optional arguments will be registered: srclg (source language) and tgtlg (target language). Only strings contained in column ID in etc/languages.csv are valid arguments.

ronataswestoldturkiccommands.plot_eval.run(args)
  1. Read the file loanpy/tpfp{srclg}2{tgtlg}.json, generated by the evalsc command, containing the true-positive/false-positive pairs, the length of the dataframe including the header (“maxtp”), and the guess list.

  2. Plot the data as an ROC curve, showing the AUC and the optimal cut-off.