What you'll do
We are looking for a strong Senior Data Engineer with deep experience in Java-based data platforms and hands-on expertise with GCP, GCS, Apache Iceberg and Parquet. The role involves building efficient data pipelines, improving storage and query performance, and enabling a scalable data lake architecture. Experience with Trino or Apache Spark is a plus.
Java-based data engineering experience is a must given our tech stack; we will not consider PySpark candidates.
Key Responsibilities
1. Data Engineering and Development
- Design and develop scalable data ingestion and transformation frameworks using Java.
- Build and maintain Iceberg tables stored on GCS using Parquet format.
- Continuously improve pipeline performance through better partitioning, compression, data layouts and efficient Java code.
2. Cloud Engineering (Google Cloud Platform)
- Develop and optimize data solutions using GCP storage and compute services.
- Tune GCS usage, IAM configuration and lifecycle rules for reliability and cost.
- Implement data residency and security controls for high-performance, low-latency workloads.
3. Data Lake Operations
- Manage Iceberg metadata, schema evolution, commit operations and manifest handling.
- Improve read and write performance through partition strategies, clustering, file sizing and metadata compaction.
- Troubleshoot concurrent write conflicts and optimize execution paths.
4. Integration and Query Layer
- Work with Trino or Spark to run efficient queries on Iceberg datasets.
- Improve Trino catalog performance through caching, connector tuning and configuration changes.
- Integrate Java-based applications with data lake endpoints and reduce application query latencies.
5. Testing and Quality
- Build comprehensive automated tests for schema validation, data correctness and regression detection.
- Validate data performance under different loads and benchmark improvements.
6. DevOps and Observability
- Implement CI/CD pipelines for data services.
- Develop monitoring for Iceberg metadata operations, GCS performance, Trino query speeds and storage metrics.
- Identify bottlenecks and drive continuous performance improvements across the platform.