Why India Needs Its Own AI
By Arunima Rajan
Abhivardhan, President of the Indian Society of Artificial Intelligence and Law (ISAIL) and Managing Partner at Indic Pacific Legal Research, explains why building indigenous AI foundation models is important for India.
Why is this the right moment for India to invest in its own AI?
Consider the success of the Chinese language model DeepSeek R1, whose research was published in November 2024. Its impact shows that, over time, a given level of compute investment enables ever-higher model performance, thanks to improvements in algorithms, hardware and training methods. The Jevons Paradox then applies: even as compute becomes more efficient per task, its overall consumption may increase because efficiency stimulates demand. It also means that achieving a specific performance level requires less compute investment as time passes. Hence, there is an economic incentive to build foundation models in the large language model space for India, and for Indians.
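To make the Jevons Paradox point concrete, here is a toy sketch with entirely hypothetical numbers (the elasticity values and costs are illustrative assumptions, not estimates): when demand for AI inference is elastic enough, cheaper per-task compute raises total compute spend.

```python
# Toy Jevons Paradox illustration. All numbers are hypothetical.
# Constant-elasticity demand: tasks = base_tasks * (cost / base_cost) ** (-elasticity)

def total_spend(cost_per_task: float, elasticity: float,
                base_cost: float = 1.0, base_tasks: float = 1000.0) -> float:
    tasks = base_tasks * (cost_per_task / base_cost) ** (-elasticity)
    return tasks * cost_per_task

for cost in [1.0, 0.5, 0.1]:  # compute gets 2x, then 10x cheaper per task
    print(f"cost/task={cost:4.2f}  "
          f"spend(elasticity=0.5)={total_spend(cost, 0.5):7.1f}  "
          f"spend(elasticity=1.5)={total_spend(cost, 1.5):7.1f}")

# With elasticity > 1, cheaper compute *increases* total spend (Jevons);
# with elasticity < 1, total spend falls even as usage grows.
```

In the elastic case, the sketch mirrors the argument above: per-task efficiency improves, yet overall compute consumption, and hence the market for indigenous foundation models, grows.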
Data security is a universal concern, even for India-based language models (especially foundational ones). For example, an AI tool used to analyse patient scans for diagnostic support must have robust security protocols to ensure that scan images and associated patient identifiers are only accessible to authorised personnel and are protected from cyberattacks, just as physical patient files are secured.
However, the lack of diversity in training data (i.e., data diversity) across sector-specific and sector-agnostic contexts creates a policy and industry problem. Stakeholders in industry and government cannot understand how these machine learning algorithms act, or which technical and data-related considerations drive which outputs, because (1) the training basis of foreign language models is hard to decipher; as Dario Amodei, the CEO of Anthropic, has acknowledged, we do not really know why these models act the way they do; and (2) language models carry technical bias issues. Since they largely perform pattern matching, and their "reasoning" capabilities are neither quantifiable nor standardisable, building indigenous foundational AI models can be helpful.
Why hasn’t India cracked multilingual AI yet?
As Pratyush Kumar of Sarvam AI has pointed out, managing inference costs is a reasonable concern because traditional large language models require 4-8 tokens per word (on average) for Indian languages. Tokens are the basic units of text (word parts or characters) that AI models process, and needing more tokens to represent the same amount of text costs more, since both the computational processing and the billing by AI service providers are based on the total number of tokens handled. This is why inference costs for Indian languages are substantially higher than for English when using traditional large language models, making widespread and affordable deployment challenging; a short sketch after the list below illustrates the effect. Now, here is my understanding of the practical obstacles that still stand in the way:
Limited digitised content exists for many Indian languages, creating a significant obstacle to training comprehensive AI models. The digital ecosystem for these languages remains underdeveloped despite their cultural richness and widespread usage. In addition, the lack of standardised digital datasets makes it difficult to compare linguistic approaches across languages.
India's languages use multiple writing systems derived from Brahmic scripts rather than the Latin alphabet, complicating both processing and tokenisation, i.e., making it difficult for AI to consistently break intricate letter combinations into simple units. Abugida writing systems (like Devanagari), with inherent vowels, conjunct characters and diacritics, pose technical challenges that English-centric algorithms cannot adequately address; the sketch after this list shows how a single Devanagari syllable decomposes into several code points.
Inefficient tokenisation leads to higher computational requirements and costs when processing Indian languages. Training comprehensive models requires substantial computational infrastructure and resources that may be inaccessible to many developers. Traditional LLMs trained primarily on English struggle with the complex grammar, cultural nuances and sentence structures of Indian languages.
Language in India is deeply embedded in cultural contexts with region-specific idioms and expressions. This means that maintaining accurate grammar, syntax, cultural relevance, and local expressions is challenging during annotation and model development.
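To make the tokenisation and script obstacles above concrete, here is a minimal, hedged sketch. It assumes the open-source tiktoken library with its English-centric cl100k_base vocabulary; exact counts will vary by tokenizer, but the direction of the effect is well established.

```python
import unicodedata

import tiktoken  # pip install tiktoken; any English-centric BPE tokenizer shows the same effect

enc = tiktoken.get_encoding("cl100k_base")

english = "India has many languages."
hindi = "भारत में कई भाषाएँ हैं।"  # roughly the same sentence in Hindi

# 1. Token inflation: the same content needs far more tokens in Devanagari,
#    so per-request billing and latency are correspondingly higher.
for label, text in [("English", english), ("Hindi", hindi)]:
    tokens = enc.encode(text)
    print(f"{label:8s} {len(text.split()):2d} words -> {len(tokens):3d} tokens")

# 2. Script complexity: one user-perceived syllable is several code points.
#    'क्षि' (kshi) = consonant + virama + consonant + vowel sign.
for ch in "क्षि":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

On an English-centric vocabulary, the Hindi sentence typically encodes to several times as many tokens as its English counterpart, which is precisely the inference-cost gap described above, and the decomposition of a single syllable into four code points is exactly what English-centric tokenisers handle poorly.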
Most Indian languages do not have large, high-quality, annotated datasets for training AI. What are some realistic ways to fill this gap, so these languages do not get left behind?
Pre-existing datasets can serve as foundations for generating synthetic data, helping to mitigate data scarcity (a toy sketch below illustrates the idea). Recording studios can be established specifically for capturing studio-quality speech from professional voice artists for expressive text-to-speech systems. Beyond that, establishing standardised protocols and methodologies specifically for Indian language corpus creation is critical for ensuring quality. Initiatives like AI4Bharat at the Indian Institute of Technology Madras, L3Cube, the Natural Language Toolkit for Indic Languages (iNLTK) and the Indian Institute of Technology Bombay are leading coordinated efforts, and one should study their approaches.
In addition, data collection efforts are increasingly focusing on specific domains and dialects to ensure representation of linguistic diversity, which is a brilliant move; annotated datasets with grammatical information (like part-of-speech tags and morphological features) are also being developed for specific language families.
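As one hedged illustration of using pre-existing data as a foundation for synthetic data, here is a toy template-based augmentation sketch. The templates, slot values and language are hypothetical placeholders; a real corpus pipeline would need native-speaker validation before any training use.

```python
import itertools

# Toy synthetic-data sketch: expand a small seed set with templates.
# Templates and slot fillers are hypothetical placeholders; in a real
# pipeline, seeds would come from existing datasets, and native speakers
# would validate grammar, morphology and cultural fit.

templates = [
    "{city} में {service} कहाँ मिलेगी?",        # "Where can I find {service} in {city}?"
    "{city} के पास {service} का समय क्या है?",   # "What are the hours of {service} near {city}?"
]
cities = ["पुणे", "जयपुर", "कोच्चि"]
services = ["बस सेवा", "डाकघर"]

synthetic = [
    t.format(city=c, service=s)
    for t, c, s in itertools.product(templates, cities, services)
]
print(len(synthetic), "synthetic utterances, e.g.:", synthetic[0])

# Note: gender agreement (मिलेगी vs मिलेगा) will be wrong for some fillers;
# this is exactly the kind of detail annotators must check, as argued above.
```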
Considering Bhashini and other initiatives, the tools derived from these language models must have defined workflows, and their AI use cases should be stakeholder-, context- and scope-specific. They cannot be deployed as business-to-consumer, preview-style deliverables such as conversational AI tools with too many general-purpose and unstable use cases.
Indian language AI models can genuinely benefit knowledge workers and logistics personnel who have been left behind. Teachers and gig workers can also benefit if the tools derived from these models are value-driven and tailor-made. In addition, these stakeholders should be involved through a feedback-ownership model in which their inputs are treated like a data bank, much as Bhashini did when collecting corrections on translations in its early days.
The cost of using or adapting global AI models can put them out of reach for many startups, small businesses, and public sector users in India. What would it take to build an AI ecosystem that is truly affordable and inclusive for everyone?
As discussed in a celebrated paper by Konstantin Pilz, Lennart Heim and others, two industry effects around language models and AI compute, the access effect and the performance effect, can help us as Indians build inclusive AI ecosystems, since "increasing compute efficiency democratises access while simultaneously enabling performance improvements across all resource levels".
These twin effects are particularly relevant for India's AI strategy, where the government is actively expanding AI compute infrastructure. Now, I believe that several venture capitalists, bureaucrats, judges and vendors are not aware of three key issues, which should also be highlighted:
First, they are not aware of how the Jevons Paradox and these access and performance effects shape the resource economics of tokenisation. This makes the discourse opaque and distorts the feedback loops used to estimate the rationale for investing in the various layers of AI infrastructure, from both data and machine learning algorithm perspectives.
Second, most of them are not aware of how AI workflows, or AI "workflows", are actually created: often only around 5 per cent of the work involves AI, while the remaining 95 per cent is still manual, curated effort. Moreover, in both the Chinese and American markets, the two largest AI markets right now, the hype is such that many generative AI entrants and language model marketing claims do not make clear how automations, AI workflows and AI "agentic workflows" (i.e., multi-agent coordinated automation of complex tasks) actually work.
Third, AI benchmarks are truly a farce. Companies in both the American and Chinese markets are not transparent enough to share the exact data and metadata files required to enable benchmarking of their products, services and models. This makes it impossible to trust these benchmarks beyond, perhaps, a few days or weeks.
How can India make sure its own language models avoid repeating or reinforcing existing social or regional biases?
Diverse datasets can reduce bias in language models, but the challenges do not end once diverse India-based datasets are acknowledged and used for training. There are also questions of efficiency and explainability of these language models as AI systems, which can be addressed by having more open-source models, or at least open-access collaborative benchmarks that are regularly updated; the altruistic work of IIT Madras' AI4Bharat shows this is possible. On that basis, social and regional biases can then be quantified (a minimal sketch after this answer shows one way to do it).

The algorithmic techniques behind these models, grounded in published computer science research, should be red-teamed by involving genuine (not doublespeak or merely placeholder-type) multidisciplinary stakeholders who understand how AI use cases, workflows and the final end-user relationship are established at the product, service or infrastructure level, assuming that resource parameters around data, model training and compute are fairly standard. For instance, an ethicist or a social worker in healthcare must demonstrate basic competence to understand the data inflow and outflow, how these algorithms are trained in a commonsensical way, and at which stages of the data and AI governance lifecycle their feedback is valuable, as per their and the tech teams' mutual understanding. This alone can address bias issues quite swiftly.

However, technique-enabled or technical biases should also be acknowledged: merely asserting subjective ethical ideas, without quantifying or proving what kind of bias is imprinted in an AI system's behaviour and without any legal understanding, makes the role of multidisciplinary stakeholders counterproductive. Hence, start focusing on technical biases.
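As a minimal, hedged illustration of that quantification step (the group labels and results below are entirely hypothetical), one simple technical-bias measure is the per-group accuracy gap on a shared benchmark:

```python
from collections import defaultdict

# Minimal sketch of quantifying one kind of technical bias: per-group
# accuracy gaps on a shared benchmark. Groups and outcomes are hypothetical;
# a real audit would use a curated, regularly updated open-access benchmark.

# (language/region group, was the model's prediction correct?)
results = [
    ("hindi", True), ("hindi", True), ("hindi", False), ("hindi", True),
    ("tamil", True), ("tamil", False), ("tamil", False), ("tamil", True),
    ("bodo", False), ("bodo", False), ("bodo", True), ("bodo", False),
]

per_group = defaultdict(list)
for group, correct in results:
    per_group[group].append(correct)

accuracy = {g: sum(v) / len(v) for g, v in per_group.items()}
gap = max(accuracy.values()) - min(accuracy.values())

for g, acc in sorted(accuracy.items()):
    print(f"{g:6s} accuracy = {acc:.2f}")
print(f"max accuracy gap across groups = {gap:.2f}")

# A large, persistent gap is a quantifiable, reportable bias signal that
# red-teaming and dataset diversification can then target.
```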
Concretely, then, what would it take to build an AI ecosystem that is affordable and inclusive for everyone?
DeepSeek R1's success has already broken the myth that strong models require enormous compute budgets. The access and performance effects discussed in response to the previous questions clearly show that compute should be used effectively. Hence, here are some suggestions:
First, since token pricing dynamics will ease over time, the Government should not over-rely on providing used GPUs under the IndiaAI Mission.
Second, the Government should promote AI research and AI safety research not in hackathon mode but in a strict public-private partnership (PPP) format, creating Centres of Excellence or multi-domain AI research consortiums. These should give primacy to credible research and foster AI research ecosystems through four kinds of output: AI research papers, AI software patents, AI data repositories, and AI prototyped workflows, provided they are not hyped and not limited to generative AI and language models.
Third, involve context-specific tech or policy standard-setting bodies as AI-related Self-Regulatory Organisations.
What are the most practical and promising uses of AI in the Indian context for hospital CXOs?
Grounded AI tools that handle symptom detection, and tools that use privacy-preserving techniques to scrutinise and document patient data, can be really promising: both use cases are scalable as solutions and involve fewer data privacy and leakage problems. Diagnostics require some tailor-made use of AI tools, so I am not sure whether AI agents are capable or trustworthy enough for that, for now.
On the paucity of healthcare workers and professionals: improving detection rates and workflows at a human-to-human level with third-party AI involvement is clearly possible, provided the data outflow is known to senior personnel in hospitals. Even the 2023 ICMR guidelines on AI ethics, which are not binding, highlight this, for instance in Section 1.2 (point ii), Section 1.3 (points i, iv, v and vi), Section 1.4 and Section 1.5, among others. The use of machine learning algorithms must be disclosed quite clearly, and a commonsensical understanding has to be achieved, no matter how. Proper agreements and dispute resolution clauses need to be settled between hospitals and data processors, because under the Digital Personal Data Protection Act, 2023, hospitals as data fiduciaries (the entities that determine the purpose and means of processing personal data, as distinct from data processors) will have to accept the liability imposed on them for a data breach, for the exercise of data protection rights, or where the threshold of legitimate use is exceeded through the activities of their data processors.