
Laxman Singh Tomar
Senior NLP Engineer in Bangalore, India (he/him)
About
Full Stack NLP Engineer @ Emplay Inc.
Work Experience
Building next-gen conversational search and recommender systems.
Worked at the intersection of Cybersecurity and Machine Learning to build products for Content Moderation and Network Anomaly Detection.
Worked on building Voicenet, a speech recognition library aimed at helping developers build voice-based applications such as age and emotion detection from speech samples.
Projects
Currently integrating the generative capabilities of LLMs such as GPT-3 and ChatGPT into products for SAP and P&G, spanning Search, Generation, Information Retrieval, and multi-purpose Agents, via an LLM stack that includes LangChain.
Developed an AI-powered content moderation engine for Simpplr on a dataset of ~2M comments. Trained a MiniLM model to detect racial and religious hate, insults, and explicit comments with ~91% accuracy. Applied ONNX conversion and dynamic quantization to optimize the model for speed and storage. Served in production with FastAPI and Docker; actively used by 500+ companies.
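A minimal sketch of the export-and-quantize step, assuming a MiniLM checkpoint and four moderation labels (checkpoint name, label count, and file names are all stand-ins, not the production setup):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from onnxruntime.quantization import quantize_dynamic, QuantType

# Stand-in checkpoint and label count; the production model and labels differ.
model_name = "microsoft/MiniLM-L12-H384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)
model.eval()

# Export to ONNX with dynamic axes so batch size and sequence length stay flexible.
dummy = tokenizer("placeholder comment", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "moderation.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)

# Dynamic quantization: weights stored as int8, activations quantized at runtime,
# which shrinks the model on disk and speeds up CPU inference.
quantize_dynamic("moderation.onnx", "moderation-int8.onnx", weight_type=QuantType.QInt8)
```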
Architected a scalable microservice-based system to build and search a knowledge index of over 1 million documents for P&G, with multi-tenant support and a config-driven content hierarchy. Utilized Docker and Kubernetes for orchestration and the OpenAPI standard for REST API design.
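A sketch of what one search endpoint in such a service could look like with FastAPI, which generates the OpenAPI schema automatically; the route, header, and model names here are hypothetical:

```python
from typing import List

from fastapi import FastAPI, Header, Query
from pydantic import BaseModel

app = FastAPI(title="Knowledge Index Search")  # OpenAPI/Swagger docs come for free

class SearchHit(BaseModel):
    doc_id: str
    title: str
    score: float

@app.get("/v1/search", response_model=List[SearchHit])
def search(
    q: str = Query(..., description="Free-text query"),
    top_k: int = Query(10, ge=1, le=100),
    x_tenant_id: str = Header(..., description="Tenant whose index slice to search"),
) -> List[SearchHit]:
    # Placeholder body: a real implementation would resolve the tenant's config
    # (hierarchy, permissions) and query that tenant's slice of the index.
    return []
```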
Built a pipeline for generating high-quality question-answer pairs from indexed text documents, using T5 and RoBERTa models fine-tuned on the SQuAD dataset. Designed and implemented asynchronous request handling with Celery and RabbitMQ, and deployed the pipeline as REST APIs using FastAPI on Docker and Kubernetes. Developed a monitoring dashboard and a testing suite for unit and stress testing with Pytest and Locust.
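A condensed sketch of the async request-handling pattern (a Celery worker behind a FastAPI front end); the broker URL, routes, and task names are placeholders, and the model code is elided:

```python
from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel

# Broker URL and service names are illustrative placeholders.
celery_app = Celery("qg", broker="amqp://guest@rabbitmq//", backend="rpc://")
api = FastAPI(title="QA-pair generation")

@celery_app.task(name="generate_qa_pairs")
def generate_qa_pairs(doc_id: str) -> list:
    # The worker does the heavy lifting: run the fine-tuned T5 question generator
    # over the indexed document, then validate answers with the RoBERTa model.
    # Model code elided in this sketch.
    return []

class SubmitRequest(BaseModel):
    doc_id: str

@api.post("/qa-pairs")
def submit(req: SubmitRequest):
    # Enqueue and return immediately; the client polls with the task id.
    task = generate_qa_pairs.delay(req.doc_id)
    return {"task_id": task.id}

@api.get("/qa-pairs/{task_id}")
def status(task_id: str):
    result = celery_app.AsyncResult(task_id)
    return {"state": result.state, "result": result.result if result.ready() else None}
```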
Annotated a dataset of low- and good-quality QA pairs, adopting techniques like Weak Supervision and Active Labeling to improve annotation efficiency. Engineered features capturing question presence, grammatical question structure, and text readability. Developed a Random Forest classifier with ~75% accuracy to filter out low-quality questions, with data and experiment tracking via DVC and MLflow. Built an evaluation suite to identify low-confidence samples via CleanLab and key data slices via Snorkel/Sliceline.
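A minimal sketch of the quality filter plus the CleanLab check, with random placeholder features and labels standing in for the engineered ones:

```python
import numpy as np
from cleanlab.filter import find_label_issues
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Placeholders: X = engineered features per QA pair (question presence,
# grammatical structure, readability, ...); y = 0 low quality, 1 good quality.
rng = np.random.default_rng(42)
X = rng.random((1000, 8))
y = rng.integers(0, 2, size=1000)

clf = RandomForestClassifier(n_estimators=300, random_state=42)

# Out-of-sample predicted probabilities, as CleanLab expects.
pred_probs = cross_val_predict(clf, X, y, cv=5, method="predict_proba")

# Flag samples whose given label disagrees with the model's confident prediction.
issue_idx = find_label_issues(
    labels=y, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
print(f"{len(issue_idx)} potentially mislabeled / low-confidence samples")

clf.fit(X, y)  # final filter model
```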
Side Projects
RecSys for Cart Abandonment and Recommendations. Built on the SIGIR eCom 2021 Challenge dataset, with Prefect as the orchestrator and Metaflow for ML DAGs. Trained an LSTM model with TF/Keras and deployed it with Serverless.
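A minimal sketch of a session-level LSTM of this kind, assuming each session is encoded as a padded sequence of product/action ids (the vocabulary size and layer widths below are placeholders):

```python
import tensorflow as tf

VOCAB_SIZE = 50_000  # number of distinct products/actions (placeholder)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(cart is abandoned)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.summary()
# model.fit(padded_sessions, abandoned_labels, validation_split=0.1, epochs=5)
```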
Multi-Class Text Classification using Project Metadata. Used a dataset comprising project titles, descriptions, and associated categories/tags. Used Airbyte for data pipelines, Airflow as the orchestrator, and Feast as the feature store. Built a model using TF-IDF feature vectorization and an SGD classifier to obtain a 0.85 F1 score.
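The modelling step roughly reduces to a scikit-learn pipeline like the one below; the texts, labels, and hyperparameters are illustrative stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy stand-ins for "title + description" text and category labels.
texts = [
    "Deep learning pipeline for medical image segmentation",
    "React dashboard for visualising sales metrics",
]
labels = ["computer-vision", "web-dev"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("sgd", SGDClassifier(class_weight="balanced", random_state=42)),
])
clf.fit(texts, labels)
print(clf.predict(["Transformer-based semantic search for podcasts"]))
```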
Replicating Semantic Podcast Search at Spotify. Used the Listen Notes dataset comprising metadata for 100k podcasts. Obtained a Recall@30 of 0.57 with a fine-tuned distilUSE model, compared to 0.29 for the same model without fine-tuning.
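A sketch of such a fine-tuning setup with sentence-transformers, assuming (query, relevant episode description) training pairs and a multiple-negatives ranking loss; the actual objective and pairs are not stated above, so treat both as assumptions:

```python
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Stand-in (query, relevant podcast description) pairs.
pairs = [
    ("true crime interviews", "A weekly show digging into unsolved cold cases."),
    ("startup founder stories", "Conversations with founders about building companies."),
]

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
train_examples = [InputExample(texts=[query, doc]) for query, doc in pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)

# At query time: embed the query and all episode descriptions, rank by cosine
# similarity, and evaluate Recall@30 against the labeled relevant episodes.
```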