
Skills

Kafka
Impala
HBase
Spark Streaming
Apache Spark
Cassandra
Zookeeper
Hive
MapReduce
HDFS
Python
Java
SQL
Scala
Cloudera CDH
Hortonworks HDP
AWS
Azure
Logistic Regression
Decision Tree
Random Forest
K-Nearest Neighbor
Principal Component Analysis
Shell scripting
PL/SQL
PySpark
HiveQL
JavaScript
HTML/CSS
Bootstrap
PHP
Git
GitHub
Bitbucket
Eclipse
Visual Studio
NetBeans
JUnit
CI/CD
Oracle
MySQL
DynamoDB
Cassandra
Teradata
PostgreSQL
Snowflake
HBase
MongoDB

Work Experience

Cloud Data Engineer
10.2023 - Present |Nevro Corporation
Hadoop, GCP HDFS, Python, GCP, Azure, AWS Glue, AWS Athena, Redshift, EMR, Pig, Sqoop, Hive, NoSQL, HBase, Shell Scripting, Scala, Spark, Spark SQL, AWS, SQL Server, Tableau
● Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling via AWS CloudFormation.
● Worked with two different datasets, one using HiveQL and the other using Pig Latin.
● Involved in moving raw data between different systems using Apache NiFi.
● Supported persistent storage in AWS using Elastic Block Store (EBS), S3, and Glacier; created volumes and configured snapshots for EC2 instances.
● Automated scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
● Keen on following the newer technology stack that Google Cloud Platform (GCP) adds.
● Used the DataFrame API in Scala to work with distributed collections of data organized into named columns, developing predictive analytics with the Apache Spark Scala APIs.
● Programmed in Hive, Spark SQL, Java, C#, and Python to streamline incoming data, build data pipelines that deliver useful insights, and orchestrate those pipelines.
● Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
● Developed Scala scripts using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
● Developed Hive queries to pre-process the data required for running the business process.
● Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL sources, and a variety of portfolios.
● Configured and managed AWS EMR, Google Cloud Dataproc, and Azure HDInsight clusters.
● Enabled scalable and efficient big data processing using Hadoop and Spark.
● Integrated AWS EMR with S3 and Redshift.
● Established a GCP data lake using Google Cloud Storage, BigQuery, and Bigtable, providing seamless and secure data storage and querying capabilities.
● Developed and executed ETL processes on Google Cloud Dataproc, AWS Glue, and Azure Data Factory.
● Ensured data consistency, accuracy, and efficient integration across multiple platforms.
● Implemented a generalized solution model using AWS SageMaker.
● Hands-on experience migrating on-premises ETL to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
● Implemented a CI/CD (Continuous Integration and Continuous Deployment) pipeline for code deployment.
● Reduced access time by refactoring data models, optimizing queries, and implementing a Redis cache to support Snowflake.
● Worked on creating and updating Auto Scaling and CloudWatch monitoring via the AWS CLI.
● Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive.
● Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow for ETL batch processing that loads into Snowflake for analytical processes (an illustrative DAG sketch follows this list).
● Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
● Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
● Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
● Involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
● Architected and implemented medium to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
● Developed Spark jobs on Databricks to perform data cleansing, validation, and standardization, then applied transformations per the use cases.
● Extensive expertise using the core Spark APIs and processing data on an EMR cluster.
● Worked on ETL migration by developing and deploying AWS Lambda functions to generate a serverless data pipeline whose output is written to the Glue Catalog and can be queried from Athena.
● Used reporting tools like Tableau to connect with Hive for generating daily data reports.
● Involved in using and tuning relational databases (Microsoft SQL Server, Oracle, MySQL) and columnar databases (Amazon Redshift, Microsoft SQL Data Warehouse).
● Responsible for implementing monitoring solutions in Ansible, Terraform, Docker, and TeamCity/Jenkins.
● Used Jira for bug tracking and Bitbucket to check in and check out code changes.
● Worked in an Agile environment, using GitHub for version control and TeamCity for continuous builds.
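Below is a minimal, illustrative Airflow DAG of the kind referenced in the Airflow/Snowflake bullet above. It is a sketch, not the production pipeline: the DAG id, schedule, script paths, and loader script are hypothetical placeholders, and it assumes Airflow 2.x with the standard BashOperator.

# Hypothetical daily batch ETL DAG (Airflow 2.x); names and paths are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_batch_etl",                  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",             # run daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    # Submit the Spark aggregation job (cluster settings are placeholders).
    aggregate = BashOperator(
        task_id="spark_aggregate",
        bash_command="spark-submit --master yarn /opt/jobs/aggregate.py",
    )

    # Load the aggregated output into the warehouse via an external loader script.
    load_warehouse = BashOperator(
        task_id="load_warehouse",
        bash_command="python /opt/jobs/load_snowflake.py",
    )

    aggregate >> load_warehouse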
AWS Data Engineer
09.2022 - 09.2023 |Verizon
Informatica PowerCenter 10.x/9.x, IDQ, AWS Redshift, Snowflake, S3, Postgres, MS SQL Server, BigQuery, SQL, Python, Tableau, EMR, GitHub
● Worked extensively with AWS services like EC2, S3, VPC, ELB, Auto Scaling Groups, Route 53, IAM, CloudTrail, CloudWatch, CloudFormation, CloudFront, SNS, and RDS.
● Integrated Terraform with data services and platforms such as AWS Glue and Amazon Redshift for provisioning and managing data infrastructure.
● Built serverless ETL pipelines using AWS Lambda functions to extract data from source systems, transform it according to business logic, and load it into target data stores (see the illustrative handler sketch after this list).
● Experienced in tuning model hyperparameters using SageMaker's built-in hyperparameter optimization (HPO) functionality.
● Capable of conducting automated experiments to find optimal hyperparameter configurations for improved model performance.
● Spearheaded end-to-end Snowflake data warehouse implementation, including designing and building the data architecture, setting up schemas, roles, and warehouses, and ensuring optimal data performance.
● Developed Python scripts to parse XML and JSON files and load the data into the Snowflake data warehouse on AWS.
● Implemented and supported data warehousing ETL using Talend.
● Familiar with ethical considerations and best practices in AI and ML, including fairness, transparency, accountability, and privacy.
● Experienced in implementing fairness-aware algorithms and bias mitigation techniques to ensure equitable and responsible AI systems.
● Knowledgeable about StreamSets support for real-time data streaming, CDC (Change Data Capture), and data quality monitoring.
● Familiar with DataStage metadata management capabilities, including defining and maintaining metadata repositories, data lineage, and impact analysis; experienced in leveraging metadata to ensure data governance and compliance.
● Utilized Power BI to create analytical dashboards that help business users get quick insights into the data.
● Developed Kafka producers and connectors to stream data from source systems into Kafka topics using the Kafka Connect framework or custom implementations.
● Proficient in designing DataStage jobs to extract, transform, and load (ETL) data from various sources into target systems.
● Skilled in creating job designs that optimize performance, scalability, and data quality.
● Used Terraform in AWS Virtual Private Cloud to automatically set up and modify settings by interfacing with the control layer.
● Proficient in monitoring DataStage jobs and diagnosing issues using built-in monitoring tools and log files.
● Proficient in building and training machine learning models using Amazon SageMaker.
● Performed end-to-end delivery of PySpark ETL pipelines on AWS to transform data, orchestrated via AWS Glue and scheduled through AWS Lambda.
● Built a program with Python and Apache Beam and executed it in Cloud Dataflow to run data validation between raw source files and BigQuery tables.
● Created storyboards of backlog items in Agile and developed items according to business needs.
● Designed, developed, validated, and deployed ETL processes for the data warehouse team using Hive.
● Developed complex stored procedures using SQL and dynamic SQL queries with good performance.
● Applied required transformations using AWS Glue and loaded data back to Redshift and S3.
● Experience in analyzing and writing SQL queries to extract data in JSON format through REST API calls with API keys, admin keys, and query keys, and loading the data into the data warehouse.
● Worked on data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
● Developed SSIS packages to extract, transform, and load (ETL) data into the SQL Server database from legacy mainframe data sources.
● Experienced in developing complex ETL processes using DataStage, including data cleansing, aggregation, and transformation tasks.
● Experience in developing ETL solutions using Spark SQL in AWS Glue for data extraction, transformation, and aggregation from multiple file formats and data sources, analyzing and transforming the data to uncover insights into customer usage patterns.
● Developed machine learning models and algorithms using Python libraries such as scikit-learn, TensorFlow, or PyTorch to solve predictive modeling and classification tasks.
● Experienced in preparing datasets for model training and inference using Amazon SageMaker.
● Capable of handling data preprocessing tasks such as feature engineering, normalization, and encoding.
● Hands-on experience with Informatica PowerCenter and PowerExchange for integrating with different applications and relational databases.
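The serverless ETL bullet above is illustrated by the following minimal AWS Lambda handler sketch: it reacts to an S3 "ObjectCreated" event, applies a simple transformation, and writes the curated output back to S3. This is a hedged example only; the bucket name, key layout, and field names are hypothetical, and the real business logic was more involved.

import json

import boto3

s3 = boto3.client("s3")

TARGET_BUCKET = "curated-data-bucket"  # hypothetical target bucket


def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Extract: read the raw JSON object that landed in S3.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        rows = json.loads(body)

        # Transform: keep only valid rows and normalize the amount field.
        cleaned = [
            {"id": r["id"], "amount": round(float(r["amount"]), 2)}
            for r in rows
            if r.get("id") and r.get("amount") is not None
        ]

        # Load: write the curated output, keyed by the source object name.
        s3.put_object(
            Bucket=TARGET_BUCKET,
            Key=f"curated/{key}",
            Body=json.dumps(cleaned).encode("utf-8"),
        )

    return {"status": "ok"}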
Azure Data Engineer
05.2020 - 07.2022 |Michelin
Hadoop, MapReduce, Spark, Spark SQL, SQL, Excel, VBA, SAS, MATLAB, Azure, AWS, SPSS, Cassandra, Oracle, MongoDB, SQL Server 2012, DB2, T-SQL, PL/SQL, XML, Tableau
● Gathered business requirements from business partners and subject matter experts.
● Installed and configured a Hadoop cluster on Amazon Web Services (AWS) for POC purposes.
● Involved in implementing a nine-node CDH4 Hadoop cluster on Red Hat Linux.
● Leveraged Amazon Web Services such as EC2, RDS, EBS, ELB, Auto Scaling, AMI, and IAM through the AWS console and API integration.
● Created scripts in Python that integrated with the Amazon API to control instance operations (see the illustrative sketch after this list).
● Extensively used SQL, NumPy, Pandas, scikit-learn, XGBoost, TensorFlow, Keras, PyTorch, Spark, and Hive for data extraction, analysis, and model building.
● Experience in using various packages in R and Python, such as ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, seaborn, SciPy, matplotlib, Beautiful Soup, and Rpy2.
● Expertise in database programming (SQL, PL/SQL), XML, DB2, Informix, Teradata, database tuning, and query optimization.
● Monitored, scheduled, and automated big data workflows using Apache Airflow.
● Experience in designing, developing, and scheduling reports/dashboards using Tableau and Cognos.
● Expertise in performing data analysis and data profiling using complex SQL on various source systems, including Oracle and Teradata.
● Good knowledge of NoSQL databases, including Apache Cassandra, MongoDB, DynamoDB, CouchDB, and Redis.
● Experience querying various relational database management systems, including MySQL, Oracle, and DB2, with SQL and PL/SQL; built and maintained SQL scripts, indexes, and complex queries for data analysis and extraction.
● Performed performance tuning of SQL queries and stored procedures.
● Performed fundamental tasks related to the design, construction, monitoring, and maintenance of Microsoft SQL databases.
● Implemented a data lake in Azure Blob Storage, Azure Data Lake, Azure Analytics, and Databricks; loaded data to Azure SQL Data Warehouse using PolyBase and Azure Data Factory.
● Extracted and loaded data into the data lake environment (MS Azure) using Sqoop, which was accessed by business users.
● Involved in extracting data using SSIS and Teradata BTEQ from OLTP to OLAP.
● Updated UNIX shell scripts to trigger jobs, move files, run DMC checks, and send email notifications.
● Automated continuous integration and deployment (CI/CD) using Jenkins, Docker, Ansible, and AWS CloudFormation templates.
● Used SSIS to create ETL packages to validate, extract, transform, and load data to data warehouse and data mart databases.
● Experienced in building a data warehouse on the Azure platform using Azure Databricks and Data Factory.
● Involved in administration tasks such as publishing workbooks, setting permissions, managing ownership, providing user access, adding users to specific groups, and scheduling report instances in Tableau Server.
● Created a SharePoint site to share project-related documents.
● Used Azure SQL Data Warehouse, which facilitated reading, writing, and managing large data sets residing in distributed storage.
● Developed weekly internal corrective actions and failures reports for failure trend analysis and reporting during the monthly product quality review meeting.
● Participated in EWO (Engineering Change Release) meetings for product design-related changes.
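The Python/Amazon API bullet above can be pictured with the following boto3 sketch, which stops running EC2 instances that carry a given tag. It is an illustration under stated assumptions: the region, tag key, and tag value are hypothetical placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is a placeholder


def stop_tagged_instances(tag_key="Environment", tag_value="poc"):
    """Find running instances with the given tag and stop them."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for reservation in response["Reservations"]
        for inst in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids


if __name__ == "__main__":
    print("Stopped:", stop_tagged_instances())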
Data Engineer
04.2019 - 04.2020 |Matrix Technologies
Hadoop, Hive, Impala, Spark, Cassandra, Sqoop, Oozie, MapReduce, SQL, Ab Initio, AWS (S3, Redshift, CFT, EMR, CloudWatch), Kafka, Zookeeper, PySpark
● Automated the installation and configuration of Scala, Python, Hadoop, and the necessary dependencies, configuring the required files accordingly.
● Fully configured the Hadoop cluster for faster and more efficient processing and analysis of the data.
● Working knowledge of Spark RDD, DataFrame API, Dataset API, Spark SQL, and Spark Streaming.
● Extensively used Sqoop to ingest batch data present in Oracle and MS SQL into HDFS at scheduled intervals.
● Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
● Involved in ingesting real-time data into HDFS using Kafka and implemented a job for daily imports.
● Configured ZooKeeper to manage Kafka cluster nodes and coordinate the broker/cluster topology.
● Created Hive tables to load transformed data; implemented dynamic partitioning and bucketing in Hive (see the illustrative PySpark sketch after this list).
● Worked on performance tuning and optimization of Hive.
● Improved the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
● Involved in exporting Spark SQL DataFrames into Hive tables as Parquet files; performed analysis on Hive tables based on the business logic and continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
● Created a data pipeline using Oozie workflows that runs jobs daily.
● Worked on deriving structured data from unstructured data using Spark.
● Used the Spark-Cassandra Connector to load data to and from Cassandra.
● Worked on data collection, data processing, and exploration projects in Scala.
● Automated the extraction of data from warehouses and weblogs by developing workflows and coordinating interdependent Hadoop jobs using Airflow.
● Worked on ingesting data (MySQL, MSSQL, MongoDB) into HDFS for analysis using Spark, Hive, and Sqoop; experience working with Cosmos DB (Mongo API).
● Involved in PL/SQL query optimization to reduce the runtime of stored procedures.
● Worked extensively with Sqoop for importing metadata from Oracle.
● Involved in converting JSON data into pandas DataFrames and storing it in Hive tables.
● Modeled complex ETL jobs that transform data visually with data flows or by using compute services like Azure Databricks and Azure SQL Database.
● Generated scheduled reports for Kibana dashboards and visualizations; worked in an Agile development environment using Kanban methodology.
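As a companion to the Hive dynamic-partitioning bullet above, here is a short PySpark sketch of the pattern: transformed data is appended into a Hive table partitioned by a derived date column. Paths, table and column names are hypothetical, and the real jobs included additional business logic.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("load_transformed_events")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow Hive dynamic partitioning for the write below.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

# Read raw JSON landed in HDFS (for example, by the Kafka ingestion job).
raw = spark.read.json("hdfs:///data/raw/events/")

# Light transformation: drop malformed rows and derive a partition column.
events = (
    raw.where(F.col("event_id").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
)

# Append into a Hive table partitioned by event_date.
(
    events.write.mode("append")
    .partitionBy("event_date")
    .format("parquet")
    .saveAsTable("analytics.events")
)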
ETL Developer
03.2018 - 03.2019 |Arena IT soft
Informatica PowerCenter 8.5, UNIX, Oracle SQL, ETL, Teradata, Data Warehouse, Python, Java
● Analyzed business requirements and worked closely with the various application and business teams to develop ETL procedures that are consistent across all applications and systems.
● Involved in data analysis for source and target systems; good understanding of data warehousing concepts, staging tables, dimensions, facts, and star/snowflake schemas.
● Installed Informatica PowerCenter 9.x and set up services and client tools.
● Created the ETL framework and reviewed the high-level design document.
● Created the logical and physical data models.
● Extensively worked on performance tuning of programs, ETL procedures, and processes.
● Responsible for all kinds of ETL activities, such as normalization, cleansing, scrubbing, aggregation, summarization, and integration (a simplified sketch of such a cleansing/aggregation step follows this list).
● Worked on importing and cleansing high-volume data from various sources such as Teradata, Oracle, flat files, SQL Server, semi-structured and unstructured files, XML files, and MS Word documents.
● Involved in supporting ETL application production jobs and changing existing code based on business needs.
● Developed logical and physical data flow models for ETL applications; translated data access, transformation, and movement requirements into functional requirements and mapping designs.
● Documented both high-level and low-level design documents; involved in the ETL design.
● Implemented Type 4 and Type 6 slowly changing dimensions.
● Developed complex MultiLoad and FastLoad scripts for loading data into Teradata tables from legacy systems.
● Involved in various phases of the software development life cycle, from analysis, design, development, and testing to production support.
● Performed and documented unit testing to validate the mappings against the mapping specification documents.
● Performed production support activities in the data warehouse (Informatica), including monitoring and resolving production issues, pursuing information, fixing bugs, and supporting end users.
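The production cleansing and aggregation described above was implemented in Informatica PowerCenter mappings; the following standalone Python sketch only mirrors the shape of such a step (scrub bad rows, aggregate per key, write a staging-ready file). File and column names are hypothetical.

import csv
from collections import defaultdict


def cleanse_and_aggregate(src_path, out_path):
    """Scrub malformed rows and summarize amounts per account."""
    totals = defaultdict(float)

    with open(src_path, newline="", encoding="utf-8") as src:
        for row in csv.DictReader(src):
            account = row.get("account_id", "").strip()
            amount = row.get("amount", "").strip()
            # Cleansing: skip rows with missing keys or non-numeric amounts.
            if not account or not amount:
                continue
            try:
                totals[account] += float(amount)
            except ValueError:
                continue

    # Summarization: one aggregated row per account, ready for the staging load.
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["account_id", "total_amount"])
        for account, total in sorted(totals.items()):
            writer.writerow([account, f"{total:.2f}"])


if __name__ == "__main__":
    cleanse_and_aggregate("legacy_extract.csv", "staged_totals.csv")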

Additional Education

AWS Cloud Practitioner
Certificate

Languages

English: Fluent