Jigyasa Grover & Rishabh Misra


A 10-time award winner in Artificial Intelligence and Open Source and co-author of the book ‘Sculpting Data For ML’, Jigyasa Grover is passionate about making a dent in the world of technology and bridging its gaps. An AI & Research Lead, she has many years of ML engineering and data science experience deploying large-scale, low-latency systems for user personalization and monetization on popular social networking apps like Twitter and Facebook, and for e-commerce at Faire, particularly in ads prediction, sponsored content ranking, and recommendation, with a recent focus on Generative AI. She is also one of the few ML Google Developer Experts and Google Women Techmakers Ambassadors globally. As a World Economic Forum Global Shaper, she leverages her technical skills and connections for solution-building, policy-making, and lasting change.


Sculpting Data for Machine Learning: Generative AI edition


The emergence of GenAI has revolutionized various domains through creative content generation spanning text, synthetic images, video, and much more. However, the effectiveness of GenAI models relies heavily on the quality of the underlying data used during fine-tuning. Large volumes of raw data are available on the web today; what is needed are the skills to identify and extract meaningful datasets and present them to GenAI models to unleash their full potential. This talk presents the power of the most fundamental aspect of AI, data curation, which often does not get its due limelight. It will also walk the audience through constructing good-quality datasets with hands-on Pythonic examples. By emphasizing the indispensability of quality data, this talk underscores the need for robust data collection and preprocessing practices to propel advancements in GenAI.


Introduction (5 minutes)

  • Popularity of GenAI & Applications
  • Significance of honing dataset-building skills in the context of GenAI applications
  • Importance in Academia: Expanding domains to perform research on, Solving novel problems using AI, Pushing the development of state-of-the-art technology, etc.
  • Importance in Industry: Leveraging domain-specific data to drive business outcomes, Integrating AI pipelines to develop new product features, Proactively identifying data to log to solve specific problems, etc.

Finding data source(s) (10 minutes)

  • Guided search based on a problem definition: Identifying essential data signals
  • Unguided search with no problem definition in mind: Dealing with ambiguity
  • Tips on identifying data sources for GenAI applications

Dataset Preparation (10 minutes)

  • Anonymizing to maintain confidentiality
  • Standardizing and Structuring
  • Augmentation for GenAI applications
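
As an illustration of the first two preparation steps above, here is a minimal Python sketch. It is not taken from the talk or the book. The record fields (`user_id`, `text`), the helper names, the `[EMAIL]` placeholder, and the salt value are all hypothetical. A salted hash gives pseudonymization only, which is weaker than full anonymization, and augmentation is not covered here.

```python
import hashlib
import re
import unicodedata

# Deliberately simple e-mail pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(user_id: str, salt: str) -> str:
    """Replace an identifier with a truncated, salted SHA-256 digest."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:12]


def standardize(text: str) -> str:
    """Normalize Unicode, mask e-mail addresses, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return re.sub(r"\s+", " ", text).strip()


def prepare(records, salt="demo-salt"):
    """Build structured rows, dropping empty and duplicate texts."""
    seen = set()
    rows = []
    for rec in records:
        text = standardize(rec["text"])
        if not text or text in seen:
            continue
        seen.add(text)
        rows.append({"user": pseudonymize(rec["user_id"], salt), "text": text})
    return rows


# Hypothetical raw records, as they might come out of a scrape or a log.
raw = [
    {"user_id": "alice", "text": "  Great   product!\nContact me at alice@example.com "},
    {"user_id": "bob", "text": "Great product! Contact me at bob@example.com"},
    {"user_id": "carol", "text": "   "},
]
rows = prepare(raw)
print(rows)
```

In this sample, the second record becomes a duplicate of the first once the e-mail addresses are masked, and the third is empty after whitespace is collapsed, so only one row survives. Whether to deduplicate after masking is a design choice that depends on the use case.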

Conclusion and Takeaways (5 minutes)

  • Reiterating the need for good, reliable datasets: Laying a strong foundation for AI
  • Pointers on why and how to proceed with different data curation techniques enabling GenAI applications, keeping in mind the pros & cons
  • Some personal anecdotes and recommendations for different use cases as ML Engineers

This talk is based on our book, Sculpting Data for ML: The First Act of Machine Learning.