Part 3: Analyse data with LoanPy
The following six steps describe how to input aligned CLDF data to loanpy, and how to mine sound correspondences and evaluate and visualise their predictive power.
Step 1: Mine phonotactic inventory
The phonotactic inventory is necessary to predict phonotactic repairs during loanword adaptation.
cldfbench ronataswestoldturkic.mineEAHinvs invs.json
Create a list of all possible prosodic structures (like “CVCV”) in the target language and store them in a json-file.
- ronataswestoldturkiccommands.mineEAHinvs.register(parser)
Register arguments. Only one argument is necessary: the name of the output file, which should end in .json.
- ronataswestoldturkiccommands.mineEAHinvs.run(args)
Read the aligned data in edictor/WOT2EAHedicted.tsv with the inbuilt csv package and pass it on to loanpy’s get_prosodic_inventory function, which will extract all phonotactic structures (like “CVCV”) from the target language.
Write the inventory of prosodic structures to a json file with the inbuilt json package. It will have the name that was passed as an argument to the command and will be written to the folder loanpy.
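The idea behind mining the prosodic inventory can be sketched in plain Python. This is a simplified stand-in, not loanpy’s actual implementation: the vowel set and segment handling below are crude assumptions, whereas loanpy classifies segments via its IPA data.

```python
# Simplified sketch of mining a phonotactic (prosodic) inventory.
# The vowel set is a crude stand-in for loanpy's IPA-based classification.
VOWELS = set("aeiouyøœɑɒɛɪɔʊəɨʉ")

def prosodic_structure(segments):
    """Map a list of IPA segments to a CV template like 'CVCV'."""
    return "".join("V" if seg[0] in VOWELS else "C" for seg in segments)

def get_inventory(forms):
    """Collect the set of prosodic structures occurring in a wordlist."""
    return sorted({prosodic_structure(form) for form in forms})

forms = [["h", "ɒ", "l"], ["k", "ɛ", "z", "ɛ"], ["v", "iː", "z"]]
print(get_inventory(forms))  # ['CVC', 'CVCV']
```

The resulting list of templates is what gets serialised to invs.json.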
Step 2: Create heuristic sound substitutions
Since any existing phoneme can be adapted when entering a language through a loanword, we have to create heuristic adaptation predictions for as many IPA characters as possible, in this case 6491.
cldfbench ronataswestoldturkic.makeheur EAH heur.json
Create heuristic predictions of phoneme adaptations based on feature vector similarities of phonemes.
- ronataswestoldturkiccommands.makeheur.register(parser)
Register arguments. Two arguments are necessary: the ID of the target language, i.e. the one in which loanwords are adapted (valid IDs can be found in column ID in etc/languages.csv), and the name of the output file, which should end in .json.
- ronataswestoldturkiccommands.makeheur.run(args)
Pass on the target language ID as defined in etc/languages.csv to loanpy’s get_heur function, which will read the cldf/.transcription-report.json file and extract the phoneme inventory of the target language from there. It will also read the file ipa_all.csv, which is shipped with loanpy. From these two files it creates a heuristic prediction of phoneme substitution patterns in loanword adaptations.
Write the results to a file named according to the value passed to the second argument. It will be written to the folder loanpy. Expected file size: ca. 2.5 MB.
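The principle of feature-vector-based substitution can be illustrated with a toy feature table. The feature values below are invented for illustration; loanpy’s real feature vectors come from ipa_all.csv, and its similarity measure may differ.

```python
# Toy sketch of feature-vector similarity ranking, in the spirit of get_heur.
# Feature tuples are hypothetical: (voiced, nasal, continuant).
FEATURES = {
    "p": (0, 0, 0),
    "b": (1, 0, 0),
    "m": (1, 1, 0),
    "f": (0, 0, 1),
}

def similarity(a, b):
    """Count matching feature values between two phonemes."""
    return sum(x == y for x, y in zip(FEATURES[a], FEATURES[b]))

def rank_substitutes(phoneme, inventory):
    """Rank the target-language inventory by similarity to `phoneme`."""
    return sorted(inventory, key=lambda c: -similarity(phoneme, c))

print(rank_substitutes("b", ["f", "p"]))  # ['p', 'f']
```

Ranking every foreign phoneme against the target inventory in this way yields the heuristic substitution table stored in heur.json.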
Step 3: Mine vertical and horizontal sound correspondences
The output will serve as fuel for predicting loanword adaptations and historical reconstructions later on.
cldfbench ronataswestoldturkic.minesc H EAH
cldfbench ronataswestoldturkic.minesc WOT EAH heur.json
Read in aligned data and write a json-file with information about sound correspondences.
- ronataswestoldturkiccommands.minesc.register(parser)
Register arguments. Two arguments are necessary: the IDs of the target and source language. In horizontal transfers, the donor language is the source and the recipient language is the target; in vertical transfers, the target language is the ancestor and the source is the descendant, for backwards reconstructions. Valid IDs can be found in column ID in etc/languages.csv. A third argument is optional: the path to the json file containing the heuristic phoneme adaptations.
- ronataswestoldturkiccommands.minesc.run(args)
If argument three was provided, read the json file containing heuristic phoneme adaptation predictions with the inbuilt json package.
Read aligned forms from edictor/{srclg}2{tgtlg}edicted.tsv.
Extract sound and phonotactic correspondences from the data with loanpy’s get_correspondences function.
Write the sound and phonotactic correspondences to a file named {srclg}2{tgtlg}sc.json in the folder loanpy.
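At its core, mining sound correspondences means counting which source segment aligns with which target segment across all cognate pairs. The following is a minimal sketch of that idea, not loanpy’s actual code (which also handles gaps, phonotactics, and bookkeeping of example IDs):

```python
from collections import Counter

# Simplified sketch of correspondence mining: count column-wise
# segment pairs across aligned cognate pairs.
def get_correspondences(pairs):
    """pairs: list of (source_alignment, target_alignment) segment lists."""
    counts = Counter()
    for src, tgt in pairs:
        for s, t in zip(src, tgt):
            counts[(s, t)] += 1
    return counts

pairs = [(["b", "a", "l"], ["p", "a", "l"]),
         (["b", "o", "r"], ["p", "o", "r"])]
print(get_correspondences(pairs).most_common(1))  # [(('b', 'p'), 2)]
```

The counts double as evidence weights: frequent correspondences are tried first when predicting adaptations or reconstructions.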
Step 4: Make sound correspondences human-readable
The sound-correspondence file is stored as a computer-readable json. To create a human-readable tsv-file, run:
cldfbench ronataswestoldturkic.vizsc H EAH
cldfbench ronataswestoldturkic.vizsc WOT EAH
Read json file containing six dictionaries about sound and phonotactic correspondences and turn it into a human-readable tsv-file with additional info for easier manual inspection.
- ronataswestoldturkiccommands.vizsc.register(parser)
Register arguments. Two arguments are necessary: the IDs of the target and source language. In horizontal transfers, the donor language is the source and the recipient language is the target; in vertical transfers, the target language is the ancestor and the source is the descendant, for backwards reconstructions. Valid IDs can be found in column ID in etc/languages.csv.
- ronataswestoldturkiccommands.vizsc.run(args)
Read the sound-correspondence json at loanpy/{srclg}2{tgtlg}sc.json.
Transform the computer-readable data structure into human-readable tables with loanpy.utils.scjson2tsv.
Merge IDs with info from related tables for easier manual inspection.
Write the correspondence tables to two files named {srclg}2{tgtlg}sc.tsv and {srclg}2{tgtlg}sc_phonotactics.tsv in the folder loanpy.
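The json-to-tsv transformation can be sketched as follows. The nested-dictionary layout shown here is an assumption for illustration (the real file holds six dictionaries, and scjson2tsv additionally merges in IDs from related tables):

```python
import csv
import io
import json

# Sketch of flattening a correspondence json into a human-readable tsv,
# analogous in spirit to loanpy.utils.scjson2tsv. The json layout below
# ({source: {target: count}}) is assumed for illustration only.
sc_json = json.loads('{"b": {"p": 2}, "a": {"a": 1}}')

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["source", "target", "count"])
for src, targets in sc_json.items():
    for tgt, count in targets.items():
        writer.writerow([src, tgt, count])
print(buf.getvalue())
```

A tab-separated table like this can then be sorted and filtered in any spreadsheet program for manual inspection.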
Step 5: Evaluate vertical and horizontal sound correspondences
In this section, we check the predictive power of the mined sound correspondences with loanpy’s eval_all function:
cldfbench ronataswestoldturkic.evalsc H EAH "[10, 100, 500, 700, 1000, 5000, 7000]"
cldfbench ronataswestoldturkic.evalsc WOT EAH "[10, 100, 500, 700, 1000, 5000, 7000]" True True heur.json
Specify the aligned wordlist data together with a few arguments and pass them on to loanpy’s evaluator module. Write the output to a json file: a list of tuples with pairs of false and true positives.
- ronataswestoldturkiccommands.evalsc.register(parser)
Register arguments. Three arguments are necessary: the first two are the IDs of the target and source language, as specified in column ID in etc/languages.csv. The third is a list of integers that specifies how many guesses should be made per input word during each iteration; this roughly corresponds to the false positive rate. Another three arguments are optional: set adapt=True if we are dealing with horizontal transfers, prosody=True if phonotactics should be repaired during loanword adaptation prediction, and pass a file name to parameter “heur” to add heuristic predictions of sound adaptations.
- ronataswestoldturkiccommands.evalsc.run(args)
Read aligned data between forms of the source and target language with the inbuilt csv library.
If a filename was passed to parameter “heur”, open that file. It has to be located in the loanpy folder.
Read the other parameters with literal_eval from the standard library’s ast module.
Pass the parameters and the data to loanpy’s eval_all function.
Write the results to a file called tpfp{srclg}2{tgtlg}.json.
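The evaluation logic can be sketched in a few lines. This toy version only captures the core idea behind eval_all (is the attested form among the first n guesses?); the helper name `evaluate` and the data shapes are assumptions for illustration:

```python
# Toy sketch of the evaluation idea: for each cutoff n in the guesslist,
# a word counts as a true positive if the attested form appears among
# the first n ranked guesses.
def evaluate(predictions, gold, guesslist):
    """predictions: ranked guess lists; gold: attested forms."""
    points = []
    for n in guesslist:
        tp = sum(g in p[:n] for p, g in zip(predictions, gold))
        points.append((n, tp / len(gold)))
    return points

preds = [["pal", "bal", "fal"], ["por", "bor"]]
gold = ["bal", "tor"]
print(evaluate(preds, gold, [1, 2, 3]))  # [(1, 0.0), (2, 0.5), (3, 0.5)]
```

Raising the cutoff n trades more false positives for more true positives, which is exactly what the ROC curve in Step 6 visualises.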
Step 6: Plot the evaluations
To gauge the performance of the model, we can plot an ROC curve, calculate its optimum cut-off value and its area under the curve (AUC), a common metric to evaluate predictive models:
cldfbench ronataswestoldturkic.plot_eval H EAH
cldfbench ronataswestoldturkic.plot_eval WOT EAH
The results:
Predicting reconstructions from modern Hungarian words:
The performance of this model can be improved by removing irregular sound correspondences. By inspecting the file loanpy/H2EAHsc.tsv we can see that many words contain sound correspondences that occur only once throughout the entire etymological dictionary. Counting the number of those cognate sets shows that 106 out of 406, or 26%, of all etymologies contain at least one sound correspondence that is irregular, i.e. occurs in only a single etymology. (Note that the pre-filtering did not skew this ratio because it picked all cognate sets with an Early Ancient Hungarian and a Hungarian counterpart.) If we remove those 106 cognate sets with irregular sound correspondences from our training and test data, 300 cognate sets remain and we get the following result:
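The irregularity check described above can be sketched as follows (a plain-Python illustration with a hypothetical helper, not part of the dataset’s commands):

```python
from collections import Counter

# Sketch of the irregularity filter: flag every cognate set that contains
# a sound correspondence attested in only one etymology.
def irregular_sets(cogsets):
    """cogsets: list of sets of correspondences, one per etymology."""
    freq = Counter(c for cs in cogsets for c in set(cs))
    return [cs for cs in cogsets if any(freq[c] == 1 for c in cs)]

cogsets = [{("b", "p"), ("a", "a")}, {("b", "p")}, {("x", "k")}]
print(len(irregular_sets(cogsets)))  # 2
```

Dropping the flagged sets from training and test data removes the one-off correspondences that the model cannot generalise from.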
Predicting loanword adaptations from West Old Turkic words:
What happened under the hood:
Plot the results of the sound-correspondence file evaluation.
- ronataswestoldturkiccommands.plot_eval.auc(points: List[Tuple[Union[int, float], Union[int, float]]]) → float
Calculate the area under the curve with the trapezoidal rule
- Parameters:
points (list of tuples of integers or floats) – A list of x-y-coordinates
- Returns:
The area under the curve
- Return type:
float
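The trapezoidal rule can be sketched as follows; this is a plain-Python illustration of the described behaviour, not necessarily the command’s exact code:

```python
# Area under a curve given as a list of (x, y) points, via the
# trapezoidal rule: sum the trapezoid areas between consecutive points.
def auc(points):
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2
    return area

print(auc([(0, 0), (0.5, 0.8), (1, 1)]))  # ~0.65
```

An AUC close to 1 means the model finds most true positives at a low false-positive cost.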
- ronataswestoldturkiccommands.plot_eval.euclidean_distance(point1, point2)
Calculate the Euclidean distance between two points.
- Parameters:
point1 (a tuple of two integers or floats) – The first point
point2 (a tuple of two integers or floats) – The second point
- Returns:
The Euclidean distance
- Return type:
float
- ronataswestoldturkiccommands.plot_eval.find_optimum(points: List[Tuple[Union[int, float], Union[int, float]]]) → Tuple[Union[int, float], Union[int, float]]
Calculate the Euclidean distance of each point to the upper left-hand corner and return the point with the lowest distance.
- Parameters:
points (a list of tuples of floats or integers) – A list of coordinates representing points in the graph.
- Returns:
The optimal cut-off point of the ROC curve
- Return type:
a tuple of two floats or integers
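The optimum search can be sketched in a couple of lines, assuming the upper left-hand corner of an ROC plot is (0, 1), i.e. zero false positives and all true positives:

```python
from math import dist

# Sketch of find_optimum: the optimal ROC cut-off is the point closest
# (by Euclidean distance) to the upper left corner (0, 1).
def find_optimum(points):
    return min(points, key=lambda p: dist(p, (0, 1)))

print(find_optimum([(0, 0), (0.2, 0.8), (0.6, 0.9), (1, 1)]))  # (0.2, 0.8)
```

This picks the guesslist cutoff with the best trade-off between true and false positives.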
- ronataswestoldturkiccommands.plot_eval.plot_curve(points: List[Tuple[Union[int, float], Union[int, float]]], absfp: List[int], maxtp: int, file_name: Union[str, Path]) → None
Get a list of x- and y-axis values out of the points argument
Plot them to a graph with matplotlib, add labels to axes.
Calculate the area under the curve and add it to the plot title
Calculate the optimal cut-off point and mark it with an “x” on the graph.
Add a legend to the plot and write the image as a jpeg to the specified path
- Parameters:
points (list of tuples of floats or integers) – List of coordinate points as tuples
absfp (a list of integers) – The absolute numbers of guesses made. Needed to add information to legend on how many guesses are the optimum.
maxtp (an integer) – The absolute number of possible true positives. Needed to contextualise the relative information on the plot as a human reader.
file_name (a string or pathlike object) – The desired name and location of the output jpeg-file.
- Returns:
Writes the images to the specified path, returns None
- Return type:
None
- ronataswestoldturkiccommands.plot_eval.register(parser)
Register command line arguments and pass them on to the main function. Two non-optional arguments will be registered: srclg (source language) and tgtlg (target language). Only strings contained in column ID in etc/languages.csv are valid arguments.
- ronataswestoldturkiccommands.plot_eval.run(args)
Read the file loanpy/tpfp{srclg}2{tgtlg}.json, which contains the true-positive/false-positive ratios, the length of the dataframe with header (“maxtp”), and the guesslist generated by the evalsc command.
Plot the data as an ROC curve, providing the AUC and the optimal cut-off.
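Before plotting, the absolute counts written by evalsc have to be turned into relative ROC coordinates. The helper name `to_roc` and the exact file layout are assumptions for illustration:

```python
# Sketch of the preprocessing in plot_eval.run: convert absolute
# (false positives, true positives) pairs into relative ROC coordinates.
def to_roc(tpfp, maxfp, maxtp):
    """tpfp: list of (false_positives, true_positives) tuples."""
    return [(fp / maxfp, tp / maxtp) for fp, tp in tpfp]

points = to_roc([(10, 3), (100, 6), (1000, 9)], maxfp=1000, maxtp=10)
print(points)  # [(0.01, 0.3), (0.1, 0.6), (1.0, 0.9)]
```

These normalised points are then handed to plot_curve, which adds the AUC and the optimal cut-off to the figure.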