[noah] New bug introduced by PR #727 "Add Elasticsearch ingestion pipeline" #822

Open
baiqiushi opened this issue Feb 7, 2022 · 0 comments

Symptom

When multithreading is enabled in the geotagger module, e.g.,

java -cp noah-assembly-1.0-SNAPSHOT.jar \
edu.uci.ics.cloudberry.noah.TwitterJSONTagToADM \
    -state /mnt/disk/data/twitter/web/public/data/state.json \
    -county /mnt/disk/data/twitter/web/public/data/county.json \
    -city /mnt/disk/data/twitter/web/public/data/city.json \
    -thread 32

AsterixDB reports a large number of parsing errors of the form ... expecting a rectangle type for the attribute ....

Suspected cause

PR #727 "Add Elasticsearch ingestion pipeline" introduced a knob, var file = "ADM" // By default, generate ADM file. The main function tagOneTweet(...) checks whether file is set to ADM (i.e., file.equals("ADM")). My guess is that this check is not thread-safe: with multiple threads it sometimes evaluates to false and falls through to the other branch, which outputs JSON-formatted tweets. That would explain why AsterixDB, which expects ADM, fails to parse some records.

Current workaround

Before assembling the noah project, use git to check out the commit preceding PR #727:

cd cloudberry/examples/twittermap
git checkout 2455b69d70a45f50b55492304138e16af9125e94
sbt "project noah" assembly

If the assembly fails with errors about duplicate library files, the merge strategy is not appropriate. Modify examples/twittermap/project/commons.scala as follows:

...
case x => MergeStrategy.first
      //  val oldStrategy = (assemblyMergeStrategy in assembly).value
      //  oldStrategy(x)
}

Next steps

Make the knob var file = "ADM" // By default, generate ADM file. thread-safe, then test the fix by ingesting a large number of tweets into a clean AsterixDB instance.
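
One possible direction, sketched under assumptions: replace the shared mutable var with an immutable value that is fixed before the worker threads start and passed to the tagging function explicitly. Everything below is a hypothetical stand-in, not the actual noah code: OutputFormat, KnobSketch, countMismatches, and this simplified tagOneTweet signature are invented for illustration, and the real function takes different arguments. Whether the var is really the root cause is still unconfirmed.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical replacement for `var file = "ADM"`: an immutable value
// that no thread can change after it has been chosen.
sealed trait OutputFormat
case object ADM extends OutputFormat
case object JSON extends OutputFormat

object KnobSketch {
  // Hypothetical stand-in for tagOneTweet(...). The format is an explicit
  // argument, so the function never reads shared mutable state.
  def tagOneTweet(tweet: String, format: OutputFormat): String = format match {
    case ADM  => s"ADM:$tweet"
    case JSON => s"JSON:$tweet"
  }

  // Tags `n` dummy tweets on `threads` worker threads and returns how many
  // results came out in a format other than the requested one.
  def countMismatches(n: Int, threads: Int, format: OutputFormat): Int = {
    val expectedPrefix = format match {
      case ADM  => "ADM:"
      case JSON => "JSON:"
    }
    val pool = Executors.newFixedThreadPool(threads)
    val mismatches = new AtomicInteger(0)
    (1 to n).foreach { i =>
      pool.submit(new Runnable {
        def run(): Unit =
          if (!tagOneTweet(s"tweet-$i", format).startsWith(expectedPrefix))
            mismatches.incrementAndGet()
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    mismatches.get()
  }

  def main(args: Array[String]): Unit =
    println(s"mismatched records: ${countMismatches(100000, 32, ADM)}")
}
```

A counter like this only checks the toy function above. For the real module, the equivalent test is the one described in this section: ingest a large number of tweets with -thread 32 into a clean AsterixDB and confirm that no parsing errors appear.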

@baiqiushi baiqiushi added the bug label Feb 7, 2022