Subhalingam D

Data Scientist at KnowDis AI & Molecule AI

View Résumé

About Me

I am a Data Scientist with a deep interest in Natural Language Processing (NLP), Information Retrieval (IR), and Deep Learning. I hold a B.Tech. in Mathematics and Computing from the Indian Institute of Technology, Delhi (IIT Delhi).

I currently work across two organizations — KnowDis AI and Molecule AI. At KnowDis AI, I work on building intelligent search systems, spanning components like spell correction, query segmentation, query intent identification, and product categorization. At Molecule AI, I work on applying machine learning to drug discovery — building models that help accelerate the research and development pipeline in the life sciences domain.

I am particularly passionate about applying NLP to Indian languages — an area I believe holds immense untapped potential. Beyond core NLP, I also explore the use of machine learning in quantitative finance and stock markets.

Outside of work, you'll find me listening to music, watching football, teaching others, or indulging in good food — biryani and brownies being the usual suspects.

Education

Indian Institute of Technology, Delhi

B.Tech. in Mathematics and Computing

CGPA: 8.196

Chennai Public School, Chennai

CBSE Std. XII

Marks: 96.4%

Chennai Public School, Chennai

CBSE Std. X

CGPA: 10

Experience

KnowDis AI, Delhi


Data Scientist

Product Category Prediction for Automated Product Mapping (for a Large B2B Marketplace)
  • Designed a retriever-reranker pipeline to classify products into one of 100K+ categories using metadata like title and specifications
  • Curated training data from a 120M+ product catalog and augmented it with task-specific heuristics to improve data quality
  • Fine-tuned a decoder-based retriever; aggregated results using Mean Reciprocal Rank to shortlist candidates across multiple retrievers
  • Developed few-shot prompts for Qwen LLM reranker for single-step decoding; optimized latency using prefix-caching and vLLM
  • Achieved 13% accuracy gain, reduced grossly wrong predictions by 30%, and maintained 92%+ high-confidence coverage
  • Integrated the end-to-end system; led the dockerization and coordinated client delivery & testing; solution deployed to production

Attribute-Value Extraction from Product Titles & User Queries (for a Large B2B Marketplace) (Research accepted at KDD 2025)
  • Generated weakly supervised training data with incomplete labeling from product specifications covering 25K+ attributes
  • Designed a novel two-stage system that employs a marker-augmented generative model to identify potential attributes, followed by a cross-encoder-based token classification model that determines the associated values for each attribute
  • Regenerated training data to expand attribute-value annotations and trained a faster NER classifier on the enriched dataset
  • Improved recall by 20% while maintaining 90% precision; used for dynamic feature highlighting to enhance search experience

Additional Projects in E-commerce Domain:
  • Explored non-autoregressive generation methods to convert Roman Hindi words in search queries to English with low latency
  • Enhanced the spell correction model by augmenting input with lexically similar product titles to better handle low-frequency terms
  • Dockerized a product search system with large disk-based ANN indexes, deployed using NVIDIA Triton Inference Server
  • Developed a shopping assistant to enhance user experience via conversational product recommendations and discovery
  • Experimented with lexical string matching using Elasticsearch to handle model numbers in a search query
  • Building an agentic system to automatically audit and approve products when sellers list them on the platform

Style-Controllable English-to-Hindi Translator
  • Extended an encoder-decoder model with style tokens to control translation style; fine-tuned on in-house parallel corpora
  • Obtained English translations for scraped style-specific monolingual data using Google Translate API to augment the training data
  • Developed a style classifier to sample representative training examples, leading to improved in-style word usage in translations

Molecule AI, Delhi


Data Scientist

  • Integrated gradient guidance into de novo molecular generation to better optimize drug-relevant properties (ICML 2024 Workshop)
  • Developed a "bad rings" detector to filter out generated molecules with non-drug-like ring structures

KnowDis AI, Delhi


Data Science Intern

  • Built a transformer-based classifier to predict the most relevant product category from 100K+ labels using bootstrapped search logs
  • Designed heuristics to enhance atomic label representations and sampled data to improve category distribution and coverage
  • Achieved 88% accuracy (on par with prior seq2seq model) while reducing response time 3x and eliminating timeouts; integrated and deployed to production at a large B2B marketplace

Data Group, Indian Institute of Technology, Delhi


Undergraduate Researcher • Supervised by Prof. Srikanta Bedathur & Prof. Maya Ramanath • In collaboration with IBM AI Horizons Network

  • Prepared a dataset consisting of How-to troubleshooting FAQs by scraping WikiHow pages from Computers and Electronics category
  • Constructed BERT-based baselines to predict changes in properties of the entities involved at each step of the process
  • Surveyed the literature to build a next-step recommender from a given sequence of performed actions and developed LSTM baselines

Samsung R&D Institute, Delhi


Software Engineering Intern (S/W Intelligence Team)

  • Developed sound source direction estimation module using time delay of arrival of signals between pairs of microphones in an array
  • Added modules for tracking active sound sources and extracting individual signals for downstream object identification pipeline
  • Integrated stationary noise estimation module for ambient noise removal and reduced maximum direction of arrival error to 7°

Received Pre-Placement Offer for impeccable performance during the internship

MateRate Education Pvt Ltd, Delhi


Machine Learning Researcher & Developer

  • Developed Item Response Theory-based models to estimate and analyze the ability of 5000+ students & difficulty of 200+ questions

Backend Web Developer and AWS Associate

  • Designed database schema and built Web APIs using Django REST framework to display students’ performance reports to parents
  • Deployed Django backend using Elastic Beanstalk with MySQL on RDS and React frontend to S3 with CloudFront CDN integration
  • Set up an Auto Scaling group and attached a Load Balancer for horizontal scaling; the portal went live with the results of 5000+ students

Received Letter of Recommendation from CEO for exemplary work accomplishments

Publications

A Framework for Leveraging Partially-Labeled Data for Product Attribute-Value Identification

Subhalingam, D., Kolluru, K., Mausam & Singal, S.
Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (Association for Computing Machinery, 2025)

Paper

TAGMol: Target-Aware Gradient-guided Molecule Generation

Dorna, V., Subhalingam, D., Kolluru, K., Tuli, S., Singh, M., Singal, S., Krishnan, N. M. A. & Ranu, S.
ICML'24 Workshop ML for Life and Material Science (2024)

Paper Code

Tracking entities in technical procedures -- a new dataset and baselines

Goyal, S., Pandey, P., Gaur, G., Subhalingam, D., Bedathur, S. & Ramanath, M.
CoRR, 2021

PDF Code

Activities

Invited Reviewer

Aug '25 - Oct '25

KDD 2026 Datasets and Benchmarks Track (August)

Invited Emergency Reviewer

Mar '25 - Apr '25

ACL Rolling Review - February 2025

Teaching Assistant

Aug '21 - Dec '21

COL764: Information Retrieval & Web Search
(Graduate-level course taught by Prof. Srikanta Bedathur at IIT Delhi)

General Secretary

Aug '21 - Jul '22

Mathematics Society, IIT Delhi

Overall Coordinator

Jul '20 - Jul '21

Mathematics Society, IIT Delhi

Web Development Executive

Sep '19 - Jul '20

Student Incubation Cell, IIT Delhi

Language Mentor

Aug '19 - Dec '19

Board for Student Welfare (BSW), IIT Delhi

Regularly helped newcomers improve their English communication skills

Executive

Jul '19 - Jul '20

Mathematics Society, IIT Delhi

Volunteer

National Service Scheme (NSS), IIT Delhi

Over 120 hours of community work primarily in Teaching projects

Technical Executive

Aug '19 - Oct '19

Rendezvous, IIT Delhi

Part of the Web Frontend Development team

Projects

Identification of Hate Spreaders on Social Media (Bachelor's Thesis)

Prof. Niladri Chatterjee

We propose a novel model that uses pre-trained word embeddings for encoding the words and incorporates the sentiment scores as weights to mark the importance of the words. It then computes a weighted sum to get the tweet representation and aggregates these to obtain the user representation. The user representation is finally fed to an ML classifier. Our model achieves an accuracy of 76% on the test set and outperforms the best model in the competition.
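The representation step described above can be sketched roughly as follows. This is an illustrative sketch and not the thesis code: the weighting rule (1 plus the absolute sentiment score), the use of a mean to aggregate tweets into a user vector, and the names `tweet_vector` and `user_vector` are all assumptions made for the example.

```python
from typing import Dict, List


def tweet_vector(tokens: List[str],
                 embeddings: Dict[str, List[float]],
                 sentiment: Dict[str, float],
                 dim: int) -> List[float]:
    """Sentiment-weighted sum of word vectors for a single tweet."""
    vec = [0.0] * dim
    for tok in tokens:
        if tok not in embeddings:
            continue  # skip out-of-vocabulary words
        # Assumed weighting: strongly polar words count more than neutral ones.
        weight = 1.0 + abs(sentiment.get(tok, 0.0))
        for i, x in enumerate(embeddings[tok]):
            vec[i] += weight * x
    return vec


def user_vector(tweets: List[List[str]],
                embeddings: Dict[str, List[float]],
                sentiment: Dict[str, float],
                dim: int) -> List[float]:
    """Aggregate tweet vectors into one user vector (assumed here: the mean)."""
    if not tweets:
        return [0.0] * dim
    vecs = [tweet_vector(t, embeddings, sentiment, dim) for t in tweets]
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

The resulting user vector would then be passed to any standard classifier as its feature vector.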

Ongoing Project

chaii - Hindi and Tamil Question Answering

Prof. Mausam

Fine-tuned XLM-RoBERTa for multilingual Q/A using chaii-1 dataset augmented with MLQA, XQuAD & SQuAD and attained test Jaccard score of 68.72%.

View Project

Context-Sensitive Word Sense Disambiguation

Prof. Mausam

Compared non-contextual and contextual embeddings (GloVe+BiLSTM vs BERT) using WiC dataset for WSD task.

View Project

Tweet Sentiment Classifier

Prof. Mausam

Processed tweets with tweet normalization, internet slang dictionary, stemming, etc.; vectorized with TF-IDF; fed into LR.

View Project

Rule-based Written-to-Spoken Text Converter

Prof. Mausam

Built a regex-based system that accounts for chunks with abbreviations, dates, numerical quantities and inflections. Obtained test F1-score of 97.94%.

View Project

Bankruptcy Prediction

Prof. Niladri Chatterjee

Reviewed state-of-the-art bankruptcy prediction models and observed poor recall. Hypothesized class imbalance & missing values to be the reasons. Trained an ensemble model with Mean Imputation & SMOTE on Polish companies dataset and gained 10% improvement in recall.

View Project

Adaptive Network-based Fuzzy Inference System for Diabetes Prediction

Prof. Niladri Chatterjee

Trained a Takagi–Sugeno type neuro-fuzzy model in TensorFlow for diabetes prediction and obtained accuracy of 81.3%.

View Project

Document Reranking using Pseudo-Relevance Feedback

Prof. Srikanta Bedathur

Used probabilistic query expansion and relevance-model-based language modeling with unigram/bigram settings & Dirichlet smoothing to rerank retrieved documents and improve the system's MRR and nDCG scores.

View Project

Vector Space Model for News Articles Retrieval

Prof. Srikanta Bedathur

Implemented an end-to-end retrieval system indexed with TF-IDF weights & cosine similarity-based ranking. Added prefix searching and named-entity-based searching (using StanfordNER) to narrow down the retrieval results. Compressed the index file by encoding differences between document IDs, reducing its size by half (topped class leaderboard for index size).
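The index compression idea (storing differences between sorted document IDs rather than the IDs themselves) can be illustrated with a small sketch. The variable-byte encoding of the gaps is an assumption added for the example, since the project description only mentions difference encoding; the function names are hypothetical.

```python
from typing import Iterable, List


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Turn a sorted postings list into gaps between consecutive IDs."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps: Iterable[int]) -> List[int]:
    """Rebuild the original document IDs by cumulative summation."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers: Iterable[int]) -> bytes:
    """Variable-byte encode: 7 payload bits per byte, high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:  # last byte of the current number
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers
```

Small gaps fit in a single byte, so a postings list such as `[1000, 1003, 1010, 1500]` is stored as the gaps `[1000, 3, 7, 490]` and takes fewer bytes than fixed-width IDs would.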

View Project

Web Designing & Development for SAC, IIT Delhi

Revamped the website using CSS & JavaScript for better user experience and easy accessibility & retrieval of information.

Visit Website

Triangulation Topology Analysis using Graph Theory

Prof. Subodh Kumar

Generated generic Graph data structure to store triangles, points & edges for given triangulation topology of 3D shapes. Implemented traversal algorithms to get neighbours, boundary edges, count of connected components & closest components.

View Project

Priority-based Job Scheduler

Prof. Subodh Kumar

Implemented Trie, Red-Black Tree & Max-Heap to execute jobs from users for projects based on priorities & resources. Added features for fetching job status & top budget consuming users, flushing starving jobs & updating project priorities.

View Project

Symbolic Differentiation

Prof. Subhashis Banerjee

Generated a Binary Tree by parsing a fully parenthesised infix expression and computed its derivative by traversal. The parser was made to support a variety of functions like algebraic, trigonometric, exponential & composite functions.
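A minimal sketch of the approach, assuming a single variable `x`, non-negative integer constants, and only the four binary operators (the trigonometric, exponential and composite functions mentioned above are left out). The class and function names are hypothetical and this is not the original project code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tokenize(expr: str) -> List[str]:
    """Split into parentheses, operators, the variable x, and integers."""
    tokens, i = [], 0
    while i < len(expr):
        if expr[i].isspace():
            i += 1
        elif expr[i].isdigit():
            j = i
            while j < len(expr) and expr[j].isdigit():
                j += 1
            tokens.append(expr[i:j])
            i = j
        else:
            tokens.append(expr[i])
            i += 1
    return tokens


def parse(tokens: List[str]) -> Node:
    """Parse '(' expr op expr ')' or a leaf; consumes tokens from the front."""
    tok = tokens.pop(0)
    if tok != "(":
        return Node(tok)
    left = parse(tokens)
    op = tokens.pop(0)
    right = parse(tokens)
    tokens.pop(0)  # closing ")"
    return Node(op, left, right)


def diff(n: Node) -> Node:
    """Differentiate with respect to x by recursive traversal."""
    if n.left is None or n.right is None:
        return Node("1" if n.value == "x" else "0")
    dl, dr = diff(n.left), diff(n.right)
    if n.value in ("+", "-"):
        return Node(n.value, dl, dr)
    if n.value == "*":  # product rule
        return Node("+", Node("*", dl, n.right), Node("*", n.left, dr))
    if n.value == "/":  # quotient rule
        num = Node("-", Node("*", dl, n.right), Node("*", n.left, dr))
        return Node("/", num, Node("*", n.right, n.right))
    raise ValueError(f"unsupported operator: {n.value}")


def evaluate(n: Node, x: float) -> float:
    """Evaluate a tree numerically, used here to check the derivative."""
    if n.left is None or n.right is None:
        return x if n.value == "x" else float(n.value)
    a, b = evaluate(n.left, x), evaluate(n.right, x)
    if n.value == "+":
        return a + b
    if n.value == "-":
        return a - b
    if n.value == "*":
        return a * b
    return a / b
```

For example, `diff(parse(tokenize("((x*x)+(3*x))")))` yields an unsimplified tree for the derivative, which `evaluate` can then check at a chosen point.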

View Project

Skills
