This repository contains a complete pipeline for training, evaluating, and analyzing a sequence‑to‑sequence model that transliterates Linear A sign sequences into Latin characters. It also explores data‑driven hypotheses about morphological patterns in Linear A through positional and co‑occurrence analyses, attention heatmaps, and embedding extraction.
- Cleans and tokenizes raw Linear A sequences.
- Simplifies repetitive patterns and incorporates morphological tags.
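The cleaning and repeat-simplification steps above might look like the following minimal sketch. The function names, the hyphen-separated input format, and the `<REP>` placeholder token are illustrative assumptions, not the repository's actual API.

```python
import re

def clean_sequence(raw: str) -> list[str]:
    """Split a raw Linear A transcription into sign tokens.

    Assumes signs are hyphen- or whitespace-separated (e.g. "A-SA-SA-RA");
    tokens marked '?' (damaged/illegible signs) are dropped.
    """
    return [t for t in re.split(r"[-\s]+", raw.strip()) if t and t != "?"]

def collapse_repeats(tokens: list[str]) -> list[str]:
    """Replace immediate sign repetitions with a placeholder token,
    e.g. ["SA", "SA"] -> ["SA", "<REP>"]."""
    out, prev = [], None
    for tok in tokens:
        if tok == prev:
            out.append("<REP>")
        else:
            out.append(tok)
            prev = tok
    return out
```

For example, `collapse_repeats(clean_sequence("A-SA-SA-RA"))` yields `["A", "SA", "<REP>", "RA"]`, shrinking the vocabulary of repeated-sign runs before modeling.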
- Implements a PyTorch-based Seq2Seq architecture with attention.
- Trains both a baseline and a “tagged” variant that encodes prefixes/suffixes.
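A toy version of such an attentional encoder-decoder can be sketched in a few dozen lines of PyTorch. This uses simple dot-product attention over GRU states; the actual repository architecture (layer sizes, attention variant, tag handling) may differ.

```python
import torch
import torch.nn as nn

class Seq2SeqAttn(nn.Module):
    """Minimal GRU encoder-decoder with dot-product attention (illustrative)."""

    def __init__(self, src_vocab: int, tgt_vocab: int, hidden: int = 64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden * 2, tgt_vocab)

    def forward(self, src, tgt):
        enc_out, h = self.encoder(self.src_emb(src))          # (B, S, H)
        dec_out, _ = self.decoder(self.tgt_emb(tgt), h)       # (B, T, H)
        # Dot-product attention: each decoder step scores all encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))  # (B, T, S)
        attn = torch.softmax(scores, dim=-1)
        context = torch.bmm(attn, enc_out)                    # (B, T, H)
        logits = self.out(torch.cat([dec_out, context], dim=-1))
        return logits, attn
```

The returned `attn` tensor is exactly what an attention-heatmap visualization would plot: one row of weights over source signs per decoded character.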
- Computes BLEU, exact-match accuracy, and edit distance on a held‑out test set.
- Generates quantitative reports and plots to compare model variants.
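Exact-match accuracy and edit distance are straightforward to compute in plain Python (BLEU would typically come from a library such as sacrebleu or NLTK). The function names below are illustrative, not the repository's actual API.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, via rolling-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def exact_match(preds: list[str], refs: list[str]) -> float:
    """Fraction of predictions identical to their references."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)
```

For transliteration, edit distance is often more informative than exact match, since a single wrong character otherwise counts an entire sequence as a failure.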
- Extracts encoder/decoder embeddings for cluster analysis.
- Produces attention heatmaps to highlight which Linear A signs influence transliterations.
- Explores co‑occurrence patterns of signs with numerals and place names to test linguistic hypotheses.
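The co-occurrence analysis above can be sketched with standard-library counters. The numeral token set and function names here are hypothetical placeholders, not the repository's actual data or API.

```python
from collections import Counter
from itertools import combinations

NUMERALS = {"1", "10", "100"}  # placeholder numeral tokens

def signs_near_numerals(sequences: list[list[str]]) -> Counter:
    """Count how often each sign appears in a sequence that also
    contains a numeral token."""
    counts = Counter()
    for seq in sequences:
        if NUMERALS & set(seq):
            counts.update(t for t in seq if t not in NUMERALS)
    return counts

def pair_counts(sequences: list[list[str]]) -> Counter:
    """Count unordered sign pairs that co-occur within a sequence,
    the raw input for a co-occurrence heatmap."""
    counts = Counter()
    for seq in sequences:
        for a, b in combinations(sorted(set(seq)), 2):
            counts[(a, b)] += 1
    return counts
```

Signs that co-occur with numerals far more often than chance predicts are candidates for commodity or quantity terms, which is the kind of hypothesis these counts are meant to surface.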
- Templates for collaborating with domain experts.
- Guidance for iteratively refining hypotheses and expanding the tag set.
If you use this work in your research, please cite:
**horus84(Tanishk), “Seq2Seq Transliteration and Linguistic Analysis of the Linear A Script,” 2025.**
This repository houses a Jupyter notebook that walks through the preprocessing, visualization, and preliminary sequence‑to‑sequence experiments on the Indus Valley Script (IVS). While this is an exploratory proof‑of‑concept, it lays the groundwork for more comprehensive modeling and linguistic hypothesis testing.

**Key Components**
- `IVS.ipynb` contains all the code cells for data loading, cleaning, tokenization, and initial modeling experiments.
- Inline plots visualize sign‑frequency distributions, co‑occurrence heatmaps, and sample attention weights.
- A hand‑curated JSON/CSV file of IVS sign sequences paired with proposed transliteration labels, including metadata on sequence provenance (e.g., find‑spot, inscription length).
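A record in such a hand-curated file might look like the following. Every field name and value here is a hypothetical illustration of the described schema (sign sequence, proposed label, provenance metadata), not the repository's actual format.

```python
import json

# Hypothetical JSON-lines record; field names are illustrative only.
record = {
    "sequence": ["A", "SA", "RA"],          # tokenized sign sequence
    "transliteration": "a-sa-ra",           # proposed label
    "provenance": {
        "find_spot": "Mohenjo-daro",        # example site name
        "inscription_length": 3,
    },
}

line = json.dumps(record)        # serialize one record per line
parsed = json.loads(line)        # round-trips without loss
```

Keeping provenance alongside each sequence makes it easy to slice evaluations by site or inscription length later, without joining against a separate metadata table.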