apache spark - no record is processed and all the checkpoint files have inconsistent data

I'm trying to use AWS Glue Streaming ETL job to read and write using trigger.AvailableNow with Kinesis Data Streams as I used with Kafka, but no record is processed and all the checkpoint files have inconsistent data related to startingPosition and iteratorType.

streaming_df: DataFrame = spark \
.readStream \
.format("kinesis") \
.option("streamName", "oxg-cdp-cdc-stream-sbx") \
.option("endpointUrl", ";) \
.option("region", "eu-central-1")\
.option("startingPosition", "earliest")\
.load()


streaming_df \
.writeStream \
.option("checkpointLocation", <s3_location>") \
.trigger(availableNow=True) \
.foreachBatch(for_each_batch_funtion) \
.start() \
.awaitTermination()

It seems that availableNow doesn't work properly with Kinesis, but I can't find any oficial documentation stating that.

streaming_df: DataFrame = spark \
.readStream \
.format("kinesis") \
.option("streamName", "oxg-cdp-cdc-stream-sbx") \
.option("endpointUrl", "https://kinesis.eu-central-1.amazonaws") \
.option("region", "eu-central-1")\
.option("startingPosition", "earliest")\
.load()


streaming_df \
.writeStream \
.option("checkpointLocation", <s3_location>") \
.trigger(availableNow=True) \
.foreachBatch(for_each_batch_funtion) \
.start() \
.awaitTermination()

It seems that availableNow doesn't work properly with Kinesis, but I can't find any oficial documentation stating that.

Share Improve this question edited Nov 28, 2024 at 14:32 Rob 15.2k30 gold badges48 silver badges73 bronze badges asked Nov 18, 2024 at 20:54 Alexandre Silva 133 bronze badges

Add a comment |

1 Answer 1

Sorted by: Reset to default 0

Here Spark github example: Spark-Kinesis

I also find the official documentation from Spark: Spark Streaming + Kinesis

Hope this can help you.

Programmer puzzle solving

apache spark - no record is processed and all the checkpoint files have inconsistent data - Stack Overflow

1 Answer 1

Articles related to this article

comment list (0)