Job Description:
Candidate profile:
--Working with big data technologies such as Hadoop, Hive, Presto, Spark, HQL, Elasticsearch, and YARN.
--Working with any of the major cloud environments (GCP/AWS/Azure). A cloud certification (GCP Data Engineer, AWS Solutions Architect, or Azure Data Engineer) is a plus.
--Working with Python, Spark (with Scala) or PySpark, and SQL stored procedures.
--Building batch and streaming data pipelines from various sources (RDBMS, NoSQL, IoT/telemetry data, APIs) into data lakes or data warehouses.
--Building streaming data ingestion pipelines using Apache Kafka and cloud-based services (such as AWS IoT Core/Kinesis/MSK or GCP Pub/Sub); a Kafka producer sketch follows this list.
--Using ETL tools such as Apache Spark and native cloud services such as GCP Dataflow/Dataproc/Data Fusion, AWS Glue, and Azure Data Factory; a PySpark batch example follows this list.
--API data integration using methods such as Postman, curl, and Python libraries; a requests-based sketch follows this list.
--Working with various data lakes and data warehouses, both cloud-based (BigQuery, Redshift, Snowflake, Synapse, S3, GCS, Azure Blob Storage) and on-prem open source.
--Developing incremental data pipelines into a NoSQL database (MongoDB, AWS DynamoDB, Azure Cosmos DB, GCP Bigtable, or GCP Firestore); an upsert sketch follows this list.
--Working with structured and unstructured datasets in various file formats such as Avro, Parquet, JSON, CSV, XML, and plain text.
--Job scheduling using orchestrators, preferably Apache Airflow; a minimal DAG sketch follows this list.
--Setting up IAM, data catalog, logging, and monitoring using cloud or open-source services.
--Developing dashboards using a BI tool (Power BI, Tableau, or Qlik) is a plus.
--Developing web crawlers (e.g., social media crawlers) is a plus.
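
A minimal streaming-ingestion sketch of the kind described above, assuming the confluent-kafka Python client, a local broker, and a hypothetical "telemetry" topic (all placeholders, not part of this role's actual stack):

    from confluent_kafka import Producer
    import json
    import time

    # Assumed local broker; for MSK or Confluent Cloud this would be the cluster endpoint.
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    def delivery_report(err, msg):
        # Called once per message to confirm delivery or surface errors.
        if err is not None:
            print(f"Delivery failed: {err}")

    def publish_reading(device_id, value):
        # Serialize one telemetry reading as JSON and send it to the topic.
        event = {"device_id": device_id, "value": value, "ts": time.time()}
        producer.produce("telemetry", key=device_id,
                         value=json.dumps(event), callback=delivery_report)
        producer.poll(0)  # serve delivery callbacks

    publish_reading("sensor-001", 23.7)
    producer.flush()  # block until all queued messages are delivered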
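
A minimal PySpark batch ETL sketch (CSV in, partitioned Parquet out); the paths, table, and column names are assumptions and would normally point at S3/GCS/ADLS locations:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders-batch-etl").getOrCreate()

    # Read raw CSV (hypothetical path) with header row and schema inference.
    orders = (spark.read
              .option("header", True)
              .option("inferSchema", True)
              .csv("data/raw/orders.csv"))

    # Example transformation: keep completed orders and stamp a load date.
    completed = (orders
                 .filter(F.col("status") == "COMPLETED")
                 .withColumn("load_date", F.current_date()))

    # Write partitioned Parquet to a hypothetical curated zone of the lake.
    (completed.write
     .mode("overwrite")
     .partitionBy("load_date")
     .parquet("data/curated/orders"))

    spark.stop()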
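
A minimal API-integration sketch using the Python requests library; the endpoint, token, and pagination parameters are hypothetical:

    import requests

    BASE_URL = "https://api.example.com/v1/devices"  # placeholder endpoint
    TOKEN = "replace-with-real-token"                # placeholder credential

    def fetch_devices(page=1):
        # Pull one page of records; raise on HTTP errors so failures are visible.
        resp = requests.get(
            BASE_URL,
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={"page": page, "per_page": 100},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json().get("results", [])

    rows = fetch_devices()
    print(f"fetched {len(rows)} records")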
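
An incremental-load sketch against MongoDB using pymongo upserts keyed on a business identifier; the connection string, database, collection, and key field are assumptions:

    from datetime import datetime, timezone
    from pymongo import MongoClient, UpdateOne

    client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    target = client["analytics"]["orders"]

    def upsert_batch(records):
        # Incremental load: insert new documents, update existing ones by order_id.
        ops = [UpdateOne({"order_id": r["order_id"]}, {"$set": r}, upsert=True)
               for r in records]
        if ops:
            target.bulk_write(ops)

    upsert_batch([{"order_id": 1,
                   "status": "COMPLETED",
                   "updated_at": datetime.now(timezone.utc)}])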
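
A minimal Airflow DAG sketch for job scheduling, assuming an Airflow 2.x installation; the DAG id and task commands are placeholders:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hypothetical daily pipeline: an extract step followed by a Spark load step.
    with DAG(
        dag_id="daily_ingest",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract",
                               bash_command="python extract.py")
        load = BashOperator(task_id="load",
                            bash_command="spark-submit load_job.py")
        extract >> load  # run extract before load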
Skills:
Hadoop, Hive, Presto, Spark, HQL, Elasticsearch, YARN