Key Responsibilities:
• Design, implement, and optimize data ingestion pipelines using Apache NiFi to ingest data from sources such as CSV files and relational databases (RDBMS), converting it to formats such as Parquet.
• Configure and manage a Spark standalone cluster for efficient data processing.
• Set up and maintain a MinIO cluster for object storage, including raw and processed buckets.
• Orchestrate end-to-end data workflows using Apache Airflow.
• Monitor system performance, logs, and health across nodes using built-in tools and optional monitoring services; ensure high availability and quick issue resolution.
• Work cross-functionally with other stakeholders to align infrastructure with business needs, including documentation and knowledge sharing.
• Develop and maintain ETL pipelines using PySpark and Python.
• Write and optimize complex SQL queries; strong SQL proficiency is expected.
Required Qualifications:
• Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field (Master's preferred).
• 3+ years of experience in data engineering, infrastructure, or operations roles, with a focus on building and maintaining data pipelines and systems.
• Proven hands-on experience with Apache NiFi for data ingestion and ETL processes.
• Strong expertise in Apache Spark (standalone or clustered) for distributed data processing.
• Proficiency with object storage solutions like MinIO (or S3-compatible systems) and database management using SQL Server and Oracle.
• Experience with workflow orchestration tools such as Apache Airflow.
• Solid understanding of data formats (e.g., Parquet, CSV), data flows, and optimization techniques for performance and scalability.
• Knowledge of monitoring, logging, and troubleshooting in data environments.