India’s AI Leap: Building Indic LLMs

The advent of ChatGPT and similar Large Language Models (LLMs) has caused massive disruption in the technology sector. Because they use deep learning to interpret and categorise raw text prompts, LLMs can streamline processes and enable more efficient decision making by delivering deeper insights into factors such as content consumption and pattern identification.

These capabilities make LLMs invaluable in content creation, customer support, customer experience, law, finance, marketing, advertising, education, and healthcare. That said, the true extent of their potential is yet to be discovered.

Need for Indic Language-based LLMs

As with any AI-based application, LLMs need to be trained on datasets in a particular language. ChatGPT and many models like it are built predominantly around English, so their effectiveness diminishes markedly when inputs are given, or outputs are expected, in any other language.

India is an extremely diverse country, home to myriad cultures. The fact that the Indian Constitution recognises twenty-two major languages is testament to this diversity. Although about 265 million Indians speak English, LLMs will become significantly more accessible and effective across the country if they are modelled around Indic languages and accurately reflect their linguistic and cultural uniqueness.

Challenges to be Surmounted

Although the Indian government, along with a few academic institutions and business organisations, has begun working in earnest towards the creation of Indic-language LLMs, there are a few challenges that they will need to address.

The biggest roadblock in the development of such LLMs for India is the severe paucity of datasets in these languages. Whatever is available is fragmented and scattered. Beyond the data itself, training these LLMs requires significant computational capacity, which is another resource in short supply in the country.

The Road Ahead

Undeterred by these roadblocks, researchers are using innovation and technology to work around them. To build credible datasets and reduce fragmentation, they are gathering material from various sources, including books, documents, transcriptions, and collaboration with linguists and communities, and collating it into a single body of data. This work will go a long way towards streamlining the training of LLMs in India that reflect the finer nuances of the many languages used across the country.
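
As a rough illustration of what collating text from several sources can involve, here is a minimal Python sketch. It is not any particular project's pipeline. It assumes that the passages are already available as plain strings, that Unicode NFC normalisation plus whitespace cleanup is enough to spot exact duplicates, and that the names `normalise`, `collate`, and the sample data are hypothetical.

```python
import unicodedata


def normalise(text: str) -> str:
    """Apply Unicode NFC normalisation and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def collate(sources: dict[str, list[str]]) -> list[dict[str, str]]:
    """Merge passages from several sources, skipping empty and duplicate text."""
    seen: set[str] = set()
    corpus: list[dict[str, str]] = []
    for source_name, passages in sources.items():
        for passage in passages:
            cleaned = normalise(passage)
            if not cleaned or cleaned in seen:
                continue  # drop blank passages and exact repeats
            seen.add(cleaned)
            corpus.append({"source": source_name, "text": cleaned})
    return corpus


# Hypothetical sample data: the same Hindi greeting appears in two sources.
sources = {
    "books": ["नमस्ते  दुनिया", "வணக்கம் உலகம்"],
    "transcripts": ["नमस्ते दुनिया", "   "],
}
corpus = collate(sources)
```

Each retained passage keeps a label for the source it came from, so the provenance of a scattered dataset is not lost when it is merged. Real corpora would also need near-duplicate detection and script-specific cleanup, which this sketch does not attempt.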

For these efforts to gather momentum, close collaboration across stakeholders is imperative. The open-source model should be preferred, as it will enable democratic development of, and access to, these emerging Indic-language LLMs.