最新消息:Welcome to the puzzle paradise for programmers! Here, a well-designed puzzle awaits you. From code logic puzzles to algorithmic challenges, each level is closely centered on the programmer's expertise and skills. Whether you're a novice programmer or an experienced tech guru, you'll find your own challenges on this site. In the process of solving puzzles, you can not only exercise your thinking skills, but also deepen your understanding and application of programming knowledge. Come to start this puzzle journey full of wisdom and challenges, with many programmers to compete with each other and show your programming wisdom! Translated with DeepL.com (free version)

apache spark - no record is processed and all the checkpoint files have inconsistent data - Stack Overflow

matteradmin7PV0评论

I'm trying to use AWS Glue Streaming ETL job to read and write using trigger.AvailableNow with Kinesis Data Streams as I used with Kafka, but no record is processed and all the checkpoint files have inconsistent data related to startingPosition and iteratorType.

streaming_df: DataFrame = spark \
.readStream \
.format("kinesis") \
.option("streamName", "oxg-cdp-cdc-stream-sbx") \
.option("endpointUrl", ";) \
.option("region", "eu-central-1")\
.option("startingPosition", "earliest")\
.load()


streaming_df \
.writeStream \
.option("checkpointLocation", <s3_location>") \
.trigger(availableNow=True) \
.foreachBatch(for_each_batch_funtion) \
.start() \
.awaitTermination()

It seems that availableNow doesn't work properly with Kinesis, but I can't find any oficial documentation stating that.

I'm trying to use AWS Glue Streaming ETL job to read and write using trigger.AvailableNow with Kinesis Data Streams as I used with Kafka, but no record is processed and all the checkpoint files have inconsistent data related to startingPosition and iteratorType.

streaming_df: DataFrame = spark \
.readStream \
.format("kinesis") \
.option("streamName", "oxg-cdp-cdc-stream-sbx") \
.option("endpointUrl", "https://kinesis.eu-central-1.amazonaws") \
.option("region", "eu-central-1")\
.option("startingPosition", "earliest")\
.load()


streaming_df \
.writeStream \
.option("checkpointLocation", <s3_location>") \
.trigger(availableNow=True) \
.foreachBatch(for_each_batch_funtion) \
.start() \
.awaitTermination()

It seems that availableNow doesn't work properly with Kinesis, but I can't find any oficial documentation stating that.

Share Improve this question edited Nov 28, 2024 at 14:32 Rob 15.2k30 gold badges48 silver badges73 bronze badges asked Nov 18, 2024 at 20:54 Alexandre SilvaAlexandre Silva 133 bronze badges
Add a comment  | 

1 Answer 1

Reset to default 0

Here Spark github example: Spark-Kinesis

I also find the official documentation from Spark: Spark Streaming + Kinesis

Hope this can help you.

Post a comment

comment list (0)

  1. No comments so far