
SUTRA: A New Precedent for Multilingual LLMs & Future AI


7 Discussion and Conclusion

Looking ahead, SUTRA paves the way for the development of phonetic models (the approach behind SUTRA-Dhvanim), which benefit from the clear separation between concept modeling and language learning. By replacing the NMT decoder with a phonetic decoder, we enable the generation of phonetic responses for more seamless integration with speech models. Our next frontier for optimization is to examine the accuracy and performance impact of structured sparsity and int4 precision, which could significantly reduce SUTRA’s GPU memory footprint and yield improvements in latency.
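
As a rough illustration of why lower precision and sparsity matter, the sketch below estimates the weight-only memory footprint of a model at fp16, int8, and int4. The 7B parameter count, the 2:4 sparsity ratio, and the helper names are hypothetical placeholders for illustration, not SUTRA’s actual configuration, and the estimate ignores activation memory and quantization metadata.

```python
# Illustrative only: back-of-the-envelope weight-memory estimate for a
# hypothetical parameter count; not SUTRA's actual configuration.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gib(num_params: float, precision: str, sparsity: float = 0.0) -> float:
    """Approximate GPU memory (GiB) needed just to store the weights.

    `sparsity` is the fraction of weights removed by structured sparsity
    (e.g. 0.5 for 2:4 sparsity), ignoring index/metadata overhead.
    """
    dense_bytes = num_params * BYTES_PER_PARAM[precision]
    return dense_bytes * (1.0 - sparsity) / 2**30

if __name__ == "__main__":
    params = 7e9  # hypothetical 7B-parameter model
    for prec in ("fp16", "int8", "int4"):
        print(f"{prec}: {weight_memory_gib(params, prec):6.2f} GiB dense, "
              f"{weight_memory_gib(params, prec, sparsity=0.5):6.2f} GiB at 2:4 sparsity")
```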

This research has introduced SUTRA, a state-of-the-art multilingual conversational language model, showcasing its superior ability to handle multiple languages with remarkable efficiency and performance. SUTRA is already proficient in 31 languages across multiple tasks, as detailed in Table 10, and is being extended to support over 50 languages. Unlike its predecessors, which struggle with the nuanced requirements of multi-language understanding, SUTRA exhibits a robust proficiency that is evident across a range of linguistic contexts. This is particularly notable in its application to languages with fewer resources available for model training, which traditionally lag in performance metrics. The innovative architecture of SUTRA, with its decoupled concept and language processing, allows for a scalable and flexible approach to language model training. This not only opens the door for more equitable representation of less commonly spoken languages but also ensures that the quality of interaction remains high across all languages. The efficient tokenization strategy of SUTRA, reducing token fertility for non-English languages, also points to potential cost reductions in deploying AI in multi-language environments, a notable consideration for global accessibility.
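
Token fertility here means the average number of tokens a tokenizer produces per word; lower fertility on non-English text means fewer tokens, and therefore lower inference cost, for the same content. The sketch below shows one way such a measurement could be set up for an arbitrary tokenizer; the toy tokenizer, the sample sentences, and the whitespace-based word count are simplifying assumptions (scripts without spaces would need a proper word segmenter), and none of it is tied to SUTRA’s actual tokenizer.

```python
# Minimal sketch: measuring token fertility (tokens per word) for a tokenizer.
# `tokenize` is any callable mapping text -> list of tokens, e.g. a trained
# subword tokenizer's tokenize method; the samples below are illustrative.
from typing import Callable, Dict, List

def token_fertility(tokenize: Callable[[str], List[str]], texts: List[str]) -> float:
    """Average tokens per whitespace-delimited word over a list of texts."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / max(total_words, 1)

def fertility_by_language(tokenize: Callable[[str], List[str]],
                          samples: Dict[str, List[str]]) -> Dict[str, float]:
    """Compute fertility per language from a dict of language -> sample texts."""
    return {lang: token_fertility(tokenize, texts) for lang, texts in samples.items()}

if __name__ == "__main__":
    def toy_tokenize(text: str) -> List[str]:
        # Crude stand-in: keeps ASCII words whole and splits non-ASCII words
        # into characters, mimicking a tokenizer with poor non-Latin coverage.
        tokens: List[str] = []
        for word in text.split():
            tokens.extend([word] if word.isascii() else list(word))
        return tokens

    samples = {
        "en": ["language models learn from text"],
        "hi": ["भाषा मॉडल पाठ से सीखते हैं"],
    }
    print(fertility_by_language(toy_tokenize, samples))
```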

In conclusion, SUTRA sets a new precedent for multilingual language models by delivering high performance and efficiency without sacrificing linguistic diversity. Its architecture, which mirrors human cognitive development by separating concept understanding from linguistic expression, allows for more natural and extensive language comprehension. This breakthrough bears significant implications for the global adoption and application of AI, paving the way for more inclusive and equitable access to technology across language barriers.

Table 10: Although SUTRA can support more than 50 languages, the languages listed in the table are the ones we have tested across a number of tasks. Support for additional languages will be released in future versions of SUTRA.


About Two Platforms

Two Platforms (TWO) is a tech startup that aims to redefine human-AI interaction and is at the forefront of the next generation of AI that is visual and immersive. TWO is building consumer AI apps and services powered by its proprietary gen-AI models. TWO is headquartered in Silicon Valley, with offices in Seoul and Mumbai.

