Roles and Responsibilities:
• Assist with onboarding a large catalog of datasets
• Clean and normalize these datasets for downstream teams
• Develop scripts for data extraction, transformation (clean, scrub, flatten, normalize, denormalize, etc.), and visualization (a minimal example sketch follows this list)
• Maintain the scripts and code, run quality checks, and automate the pipelines
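For illustration only, a minimal sketch of the kind of clean/normalize script this role develops, using pandas; the file path and column names (record_id, amount) are hypothetical placeholders:

# Minimal clean/normalize sketch in pandas; paths and column names are assumptions.
import pandas as pd

def clean_and_normalize(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Clean: drop exact duplicates and rows missing the key column.
    df = df.drop_duplicates().dropna(subset=["record_id"])
    # Scrub: trim whitespace and lowercase string columns.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip().str.lower()
    # Normalize: scale a numeric column to the 0-1 range.
    amount = df["amount"].astype(float)
    df["amount_norm"] = (amount - amount.min()) / (amount.max() - amount.min())
    return df

if __name__ == "__main__":
    clean_and_normalize("datasets/sample_catalog.csv").to_csv("output/sample_catalog_clean.csv", index=False)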
Mandatory Skills:
• Experience designing ETL pipelines and working with distributed systems (Spark); tools such as AWS and Databricks (a PySpark sketch follows this list)
• Development using NumPy, pandas, and PySpark; expertise in SQL
• Experience with cloud technologies such as Amazon S3 and Redshift
• Analytical thinker with strong attention to detail and good verbal and written communication skills
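For illustration only, a minimal PySpark sketch of an ETL step reading from Amazon S3 and transforming with SQL; the bucket, paths, and column names are hypothetical placeholders, and S3 credentials are assumed to be configured (e.g. on Databricks):

# Minimal extract-transform-load sketch with PySpark and SQL; all names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

# Extract: read raw data from S3.
raw = spark.read.parquet("s3a://example-bucket/raw/events/")

# Transform: express the aggregation in SQL.
raw.createOrReplaceTempView("events")
daily = spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date
""")

# Load: write the result back to S3 for downstream use (e.g. a Redshift COPY).
daily.write.mode("overwrite").parquet("s3a://example-bucket/curated/daily_counts/")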
Secondary Skills:
• Experience working with REST APIs to download data from different sources (a sketch follows at the end of this list)
• Familiarity with AWS, Airflow, MongoDB and Linux
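For illustration only, a minimal sketch of downloading data from a paginated REST API into a DataFrame; the URL and pagination parameters are hypothetical placeholders:

# Minimal REST API download sketch; the endpoint and parameters are assumptions.
import requests
import pandas as pd

def fetch_records(base_url: str, page_size: int = 100) -> pd.DataFrame:
    records, page = [], 1
    while True:
        resp = requests.get(base_url, params={"page": page, "per_page": page_size}, timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:  # stop when the API returns an empty page
            break
        records.extend(batch)
        page += 1
    return pd.DataFrame(records)

if __name__ == "__main__":
    fetch_records("https://api.example.com/v1/datasets").to_csv("downloads/datasets.csv", index=False)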