Leonce Nshuti
Data Engineer at Sony Music Publishing
About
I am a Data Engineer with experience across the healthcare (Vanderbilt), finance (UBS), and entertainment (Sony) industries. I build reliable data pipelines and machine learning systems to support business goals. I hold a Master's degree in Biostatistics from Harvard University.
Technical Skills
Experience
Data Engineer
Sony Music Publishing • March 2024 - Present
- Led data engineering recruiting for 2 data engineer and 3 senior data engineer candidates during my first six months at the company.
- Architected and deployed scalable data extraction and orchestration pipelines, enhancing data reliability by 35% and enabling seamless analysis of datasets exceeding 15TB, supporting critical business intelligence initiatives.
- Developed and integrated advanced experimentation systems for machine learning models, accelerating iteration cycles by 40% and improving model performance against large-scale indexes by 25%, which facilitated faster deployment of high-impact features.
- Led Sony Music Publishing's summer internship program, guiding two interns through building a full-stack music recommendation application with a React front end, FastAPI back end, RDS PostgreSQL database, GitLab CI/CD pipelines, and AWS Amplify for deployment. The project was integrated into the larger enterprise stack and is used to recommend songs using fast nearest-neighbor search.
Data Engineer
UBS Financial Services • Nov 2022 - March 2024
- Increased operational efficiency, saving $1.2M annually, by leading a project to automate reporting for HR Americas, replacing manual spreadsheet reporting with a self-serve Tableau dashboard backed by ETL logic built in Alteryx.
- Developed RESTful APIs using the Flask and FastAPI frameworks, improving data exchange and system integration. This led to a 25% increase in data accessibility speed and estimated savings of $500K annually from improved decision-making and reduced downtime.
- Created Power BI dashboards to visualize compliance data, which improved data quality and reduced manual validation errors.
- Built and maintained deployment and test pipelines on GitLab and acted as the subject matter expert on the migration from Jira to GitLab for project management and version control.
Data Engineer
Vanderbilt University Medical Center • Jun 2018 - Feb 2022
- Led a team of 3 junior and 2 senior data scientists to define research objectives, design and clarify work plans, delegate tasks, and deliver data products (Shiny dashboards, memos, peer-reviewed publications).
- Developed open-source R packages (github.com/graveja0/health-care-markets) for analyzing health insurance networks in the United States (Git, GitHub, R, SQL).
- Created and ran ETL jobs on Medicare formulary datasets (8 years, 20GB+) using SAS (PROC SQL, DATA step, macros, window functions).
- Created and maintained AWS storage and computing environment (S3, EC2, EMR), credentials, and billing (IAM, Organizations).
- Merged and cleaned 15TB+ datasets from varying sources to create analytic datasets.
- Led four quarterly performance reviews and participated in the hiring process for 20+ candidates.
- Ran analyses and wrote methods sections for 10+ peer-reviewed papers in prestigious journals (e.g., Health Affairs, JAMA).
Featured Projects
English to SQL Gradio RAG App
This Python application uses Retrieval Augmented Generation (RAG) to let users ask data questions in plain English. The application uses OpenAI's GPT-4o mini model to convert the question (prompt) into SQL, runs that SQL against the DuckDB database that stores the data, and returns the result along with the SQL statement that produced it.
Front End: Gradio | Back End: Python | Database: DuckDB | Production: AWS
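The question-to-SQL-to-result flow described above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the hypothetical `translate_to_sql` function returns canned SQL in place of the retrieval step and the model call, the standard-library `sqlite3` module stands in for DuckDB so the example runs without extra dependencies, and the `tracks` table is made up.

```python
import sqlite3


def translate_to_sql(question: str) -> str:
    """Hypothetical stand-in for the retrieval + language-model step.

    A real app would send the question (plus retrieved schema context)
    to a model; here a canned lookup keeps the sketch self-contained.
    """
    canned = {
        "how many tracks are there?": "SELECT COUNT(*) FROM tracks",
    }
    return canned[question.strip().lower()]


def answer(conn: sqlite3.Connection, question: str):
    """Translate the question, run the SQL, and return both rows and SQL."""
    sql = translate_to_sql(question)
    rows = conn.execute(sql).fetchall()
    return rows, sql


# Made-up example data; sqlite3 is an assumption standing in for DuckDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (title TEXT, artist TEXT)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?)",
    [("Song A", "Artist 1"), ("Song B", "Artist 2")],
)

rows, sql = answer(conn, "How many tracks are there?")
print(rows, sql)
```

Returning the SQL alongside the rows mirrors the description above: the user sees both the answer and the statement that generated it, which makes the result easy to check.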
View Project
Song Recommender App
A song recommendation system that lets users enter a favorite song and artist and get personalized recommendations based on feature similarity. The app uses hnswlib for efficient approximate nearest-neighbor search over a large music dataset (115,000 tracks). This project showcases proficiency in building scalable machine learning applications with efficient algorithms and user-friendly interfaces for music discovery.
Front End: Gradio | Back End: Python | Machine Learning: hnswlib for similarity search | Data Visualization: Matplotlib, Seaborn
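Recommendation by feature similarity can be illustrated with a small exact search. This sketch is a brute-force cosine-similarity baseline in plain Python, not the app's hnswlib index: an approximate nearest-neighbor index aims to return similar results without scanning every track. The track names, feature vectors, and the `recommend` helper are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recommend(query_title, features, k=2):
    """Return the k tracks most similar to the query (exact search)."""
    query = features[query_title]
    scored = [
        (cosine_similarity(query, vec), title)
        for title, vec in features.items()
        if title != query_title  # never recommend the query itself
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:k]]


# Made-up audio features for illustration only.
features = {
    "Track A": [0.9, 0.1, 0.3],
    "Track B": [0.8, 0.2, 0.4],
    "Track C": [0.1, 0.9, 0.7],
    "Track D": [0.85, 0.15, 0.35],
}

recommendations = recommend("Track A", features, k=2)
print(recommendations)
```

Exact search like this compares the query against every track, which is fine for a handful of vectors but grows linearly with catalog size; that cost is the usual motivation for an approximate index at the scale of 115,000 tracks.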
View Project
Publications
Gupta, Arjun, Leonce Nshuti, Udhayvir S. Grewal, Ramy Sedhom, Devon K. Check, Helen M. Parsons, Anne H. Blaes et al. "Financial burden of drugs prescribed for cancer-associated symptoms." JCO Oncology Practice (2021): OP-21.
Graves, John A., Leonce Nshuti, Jordan Everson, Michael Richards, Melinda Buntin, Sayeh Nikpay, Zilu Zhou, and Daniel Polsky. "Breadth and exclusivity of hospital and physician networks in US insurance markets." JAMA Network Open 3, no. 12 (2020): e2029419-e2029419.
Education
Harvard University
MS in Biostatistics • 2018
Sewanee, University of the South
BS in Economics • 2016