
Do you have plans to support streaming in the near future? Interested in the readStream use case: spark.readStream.format("bigquery") #259

Open
nmusku opened this issue Oct 28, 2020 · 11 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@nmusku
nmusku commented Oct 28, 2020

If not, how can I do it with the current connector? Any thoughts?

@davidrabinowitz
Member

Streaming is on our roadmap. Could you please elaborate on your use case? Feel free to contact us directly.

@nmusku
Author

nmusku commented Oct 28, 2020

Hi, we have data flowing directly into BigQuery (via fluentd) in real time.
My use case is to query/filter and transform that raw data into meaningful events using this Spark connector. The ingested data is keyed by timestamp, so if there are delays in ingestion I would like to go back in time (say, a threshold of 15 minutes) and read the delayed data as well. I'm not sure how to achieve this via batch jobs. For example:
spark.read.format("bigquery").option("filter", "start_time > current-5 minutes").option("filter", "end_time > current")
Might not work ^^

Note: the reads will be from a view.
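One way to approximate this today is a scheduled batch job with a lookback window, built as a single pushed-down filter. A minimal sketch, assuming the table/view has a TIMESTAMP column named `event_ts` (an illustrative name, not from the thread); note that the connector's `filter` option takes one string, so both bounds must be combined with AND rather than passed as two separate options (a second `.option("filter", ...)` overwrites the first):

```python
from datetime import datetime, timedelta, timezone

def lookback_filter(column: str, now: datetime, lookback_minutes: int) -> str:
    """Build a single filter string covering [now - lookback, now).

    Both bounds go into ONE string because DataFrameReader options are a
    key-value map: setting "filter" twice keeps only the last value.
    """
    start = now - timedelta(minutes=lookback_minutes)
    fmt = "%Y-%m-%d %H:%M:%S"
    return (f"{column} >= TIMESTAMP '{start.strftime(fmt)}' "
            f"AND {column} < TIMESTAMP '{now.strftime(fmt)}'")

# Usage with the connector (not executed here; requires a Spark session and
# the spark-bigquery-connector on the classpath):
#
# df = (spark.read.format("bigquery")
#       .option("viewsEnabled", "true")   # required when loading from a view
#       .option("filter", lookback_filter("event_ts",
#                                         datetime.now(timezone.utc), 15))
#       .load("project.dataset.view_name"))
```

Running this every few minutes with a 15-minute lookback re-reads recent rows, which tolerates late ingestion but means downstream consumers must deduplicate overlapping windows.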

@nmusku
Author

nmusku commented Oct 29, 2020

@davidrabinowitz any thoughts? Is it possible to use any timestamp or any offset?

@davidrabinowitz
Member

@nmusku Yes, for the time being you can implement it with a query like you've suggested. BTW, you can also merge it:

spark.read.format("bigquery").option("filter", "start_time > current-5 minutes AND end_time > current")
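The `current-5 minutes` shorthand above is pseudocode rather than valid BigQuery SQL; a hedged sketch of the same merged filter spelled out in standard BigQuery SQL (the `start_time`/`end_time` column names come from the thread, the table path is assumed):

```python
# The merged filter from the suggestion above, written in BigQuery
# standard SQL so it can be pushed down as-is.
merged_filter = (
    "start_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE) "
    "AND end_time > CURRENT_TIMESTAMP()"
)

# Not executed here; requires a Spark session with the connector installed:
#
# df = (spark.read.format("bigquery")
#       .option("filter", merged_filter)
#       .load("project.dataset.table"))
```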

@nmusku
Author

nmusku commented Oct 30, 2020

OK, one more question: are the events in BigQuery ordered?

@rwagenmaker

Is there any news on this? Now with GA4 it would be cool to get streaming integration in Spark.

@Magicbeanbuyer

Hi @davidrabinowitz ,

I am also interested in a readStream feature.

We have an ETL pipeline extracting campaign data from BigQuery and loading it into our Delta Lake.

The struggle we face is doing incremental ETL without loading duplicate data into our Delta Lake. With readStream and checkpointing, hopefully this would be solved.

Could you maybe share more information on the timeline for readStream feature?

@benney-au-le

We are also interested in this use case.
We land data in BigQuery in real time from sources such as Fivetran, fluentd, etc.
We would like to build Spark streaming applications starting from spark.readStream.format("bigquery") and trigger new micro-batches when new data arrives.

@kaiseu

kaiseu commented Apr 28, 2023

@davidrabinowitz any update on this topic? we're also interested in this.

@davidrabinowitz
Member

Can you please elaborate on the use case, especially how you want to read?

@davidrabinowitz davidrabinowitz added the question Further information is requested label Jun 7, 2023
@kaiseu

kaiseu commented Jul 18, 2023

@davidrabinowitz our use case is streaming reads of incremental data from BigQuery tables, something like spark.readStream.format("bigquery").option("inc_col", "create_time"), where we can configure the incremental column so that each run only reads newly added data. Is this supported now? Any suggestions?
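Since the connector does not expose a readStream source in this thread's timeframe, the incremental-column behavior can be emulated by persisting a watermark between batch runs. A hypothetical sketch (the `inc_col`/`create_time` names come from the comment above; the watermark storage and table path are assumptions):

```python
from typing import Optional

def next_filter(inc_col: str, last_seen: Optional[str]) -> str:
    """Filter selecting only rows newer than the stored watermark.

    On the first run there is no watermark yet, so read everything.
    """
    if last_seen is None:
        return "TRUE"
    return f"{inc_col} > TIMESTAMP '{last_seen}'"

# Each scheduled run (cron, Airflow, ...) would then do roughly:
#
# last_seen = load_watermark()            # hypothetical helper, e.g. GCS file
# df = (spark.read.format("bigquery")
#       .option("filter", next_filter("create_time", last_seen))
#       .load("project.dataset.table"))
# new_max = df.agg({"create_time": "max"}).collect()[0][0]
# if new_max is not None:
#     save_watermark(str(new_max))        # hypothetical helper
```

This gives at-least-once semantics: a run that fails after writing output but before saving the watermark will re-read the same rows, so downstream sinks should deduplicate or merge on a key.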

@isha97 isha97 added the enhancement New feature or request label May 13, 2024