Data Engineer - Analytics
About the Role:
We are seeking an experienced and tenacious Data Engineer to take full ownership of our high-volume behavioral data pipeline. The role is central to the reliability, performance, and integrity of the data behind real-time reporting, business intelligence, and our core customer recommendation engine. You will be the sole expert responsible for optimizing performance from the source database (ClickHouse) through the ETL processing layer (Spark) to the final reporting dashboard.
Key Responsibilities:
● ETL Pipeline Ownership: Design, develop, and maintain PySpark ETL jobs that transform behavioral data (5+ billion records) from ClickHouse into clean, aggregated daily reports (see the PySpark sketch after this list).
● Performance Engineering: Serve as the subject matter expert in optimizing query performance and data throughput in high-scale analytical databases. This includes designing and tuning effective table schemas and indexing strategies.
● Data Modeling & Warehousing: Design and implement highly performant analytical data models in ClickHouse, optimizing storage and query speed for daily dashboard consumption.
● System Resiliency: Tune and manage the distributed computing environment (Spark) to ensure job stability, efficient resource utilization, and mitigation of common failures like memory spilling and read/write timeouts.
● Automation & Scheduling: Implement and maintain robust scheduling mechanisms (e.g., Cron or Airflow) so the daily run reliably processes each day's rolling data window (see the Airflow sketch below).
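To make the ETL-ownership and tuning expectations concrete, here is a minimal sketch of the kind of daily aggregation job this role owns. It is an illustration under assumptions, not our actual pipeline: the JDBC URL, driver class, and the `events` table and its columns are all hypothetical.

```python
# Minimal sketch of a daily behavioral-aggregation job.
# Hypothetical connection details and table/column names throughout.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("daily-behavioral-aggregates")
    # Settings like these help mitigate memory spilling on wide shuffles;
    # the right values depend on cluster sizing.
    .config("spark.sql.shuffle.partitions", "2000")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

# Read one day's slice from ClickHouse over JDBC (URL and driver are assumptions).
events = (
    spark.read.format("jdbc")
    .option("url", "jdbc:clickhouse://clickhouse-host:8123/analytics")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .option("query",
            "SELECT user_id, event_type, event_date "
            "FROM events WHERE event_date = yesterday()")
    .load()
)

# Aggregate into the clean daily report consumed by the dashboards.
daily = (
    events.groupBy("event_date", "event_type")
    .agg(
        F.countDistinct("user_id").alias("unique_users"),
        F.count("*").alias("event_count"),
    )
)
daily.write.mode("overwrite").parquet("/warehouse/reports/daily_events")
```

Pushing the date filter into the JDBC `query` option keeps each run to one slice of the multi-billion-row table instead of pulling it whole into Spark.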
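The rolling-window schedule from the last bullet could look like the following Airflow sketch (Airflow 2.4+ syntax; the DAG id, job path, and timing are hypothetical, and a plain Cron entry would serve equally well):

```python
# Minimal Airflow DAG sketch for the nightly rolling-window run.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_behavioral_etl",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # nightly, after the day's source data has landed
    catchup=False,
) as dag:
    # {{ ds }} is the logical date, so the processed window slides each day.
    run_etl = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit /opt/jobs/daily_aggregates.py --date {{ ds }}",
        retries=2,  # absorb transient read/write timeouts before failing the run
        retry_delay=timedelta(minutes=10),
    )
```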
Required Technical Stack & Expertise:
● Deep Expertise: 3+ years of hands-on experience with Apache Spark (PySpark) for large-scale data transformation.
● ClickHouse Mastery: Proven expertise in designing, tuning, and maintaining large-scale ClickHouse or similar specialized OLAP databases (see the ClickHouse sketch after this list).
● Software Engineering: 3+ years of hands-on Python experience, including managing environments and dependencies on containerized/virtualized platforms.
● SQL Fluency: Mastery of ANSI SQL and complex aggregations.
● Operating Systems: Familiarity with Linux command-line tools and job scheduling.
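To give a flavor of the ClickHouse side, here is a sketch of the kind of MergeTree layout and aggregate query the role involves, written with the clickhouse-connect Python client. The database, table, and column names are hypothetical.

```python
# Sketch of a ClickHouse reporting table and a dashboard-style aggregation.
# Hypothetical database/table/column names; host is an assumption.
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse-host")

# In ClickHouse, ORDER BY doubles as the sparse primary index, so it should
# lead with the columns dashboards filter on most often.
client.command("""
    CREATE TABLE IF NOT EXISTS analytics.events_daily
    (
        event_date   Date,
        event_type   LowCardinality(String),
        unique_users UInt64,
        event_count  UInt64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (event_date, event_type)
""")

# A trailing-30-day rollup of the sort a daily dashboard would issue.
result = client.query("""
    SELECT event_date,
           event_type,
           sum(event_count)  AS events,
           sum(unique_users) AS users
    FROM analytics.events_daily
    WHERE event_date >= today() - 30
    GROUP BY event_date, event_type
    ORDER BY event_date, event_type
""")
print(result.result_rows[:5])
```

Monthly partitions keep old data cheap to drop, and the (date, type) sort key means dashboard range scans touch only the relevant granules.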
Bonus Points:
● Experience with Metabase or similar visualization tools (Tableau, Looker).
● Familiarity with distributed scheduling platforms (e.g., Apache Airflow).