
# Data-pipeline-using-python-kafka-and-mongodb-quickstart

This project contains two files (besides the README.md).

SourceCode.py combines the data-scraping code and the Kafka producer code that streams data into MongoDB (the data lake).

Data is scraped in a Jupyter Notebook using Python and the Selenium library (Untitled1).

Job data is pulled from topcv.vn using Python (Untitled).

## Kafka

See the Kafka website for details.

Install Kafka (a single Kafka instance): download Kafka from the official website, https://kafka.apache.org/. A Kafka build with Scala is recommended; this project uses kafka_2.13-3.2.0:

wget https://dlcdn.apache.org/kafka/3.2.0/kafka_2.13-3.2.0.tgz

See more at https://kafka.apache.org/quickstart

## Connect Kafka to MongoDB

This project uses the mongo-kafka-connect JAR to configure the sink connector. Crawled data is streamed to Kafka, then from Kafka into the database. Download the JAR from GitHub or from Maven (https://search.maven.org/artifact/org.mongodb.kafka/mongo-kafka-connect). Copy the JAR and any dependencies into the Kafka plugins directory, which you can specify in your plugin.path. Then create a file MongoSinkConnector.properties in Kafka's config directory. Example:

    name=mongo-sink
    topics=topcv
    connector.class=com.mongodb.kafka.connect.MongoSinkConnector
    tasks.max=1
    key.ignore=true
    connection.uri=mongodb://localhost:27017
    database=topcv
    collection=transaction
    max.num.retries=3
    retries.defer.timeout=5000
    type.name=kafka-connect
    schemas.enable=false

Start Kafka (open three terminals):

1. Start the ZooKeeper server:
bin/zookeeper-server-start.sh config/zookeeper.properties

2. Start the Kafka server:
bin/kafka-server-start.sh config/server.properties

3. Start the connector:
bin/connect-standalone.sh config/connect-standalone.properties config/MongoSinkConnector.properties
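The MongoSinkConnector.properties file above can be sanity-checked before starting Connect. A minimal Python sketch, where the `parse_properties` helper and the required-key list are illustrative assumptions, not part of this project:

```python
def parse_properties(text):
    """Parse Java-style key=value properties into a dict, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

SINK_CONFIG = """\
name=mongo-sink
topics=topcv
connector.class=com.mongodb.kafka.connect.MongoSinkConnector
tasks.max=1
connection.uri=mongodb://localhost:27017
database=topcv
collection=transaction
"""

# Keys the sink connector cannot start without (an assumed minimal set).
REQUIRED = {"name", "topics", "connector.class", "connection.uri"}

props = parse_properties(SINK_CONFIG)
missing = REQUIRED - props.keys()
assert not missing, f"missing keys: {missing}"
print(props["topics"])  # topcv
```

Running a check like this before `connect-standalone.sh` catches typos in the properties file early, since Connect's own error messages for a misconfigured sink can be hard to read.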

## Test

1. Open a terminal.

2. Create the test topic:
    bin/kafka-topics.sh --create --topic topcv --bootstrap-server localhost:9092

3. Start a Kafka console producer:
    bin/kafka-console-producer.sh --topic topcv --bootstrap-server localhost:9092
and send a message:
    {"hello":"world"}

4. Check the database (MongoDB): in another terminal, run
    mongosh                  # start the Mongo shell
    show databases           # list all databases
    use topcv                # switch to the topcv database (the sink's target)
    show collections         # list all collections in the database
    db.transaction.find()    # show all records in the collection
If the message is displayed, the connection succeeded.
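The same test message can also be sent from Python with the kafka-python library the project already depends on. A minimal sketch, using the broker address and topic from the steps above; the network call sits behind the `__main__` guard because it needs a running broker:

```python
import json


def to_kafka_bytes(record):
    """Serialize a dict to the UTF-8 JSON bytes the MongoDB sink connector expects."""
    return json.dumps(record).encode("utf-8")


if __name__ == "__main__":
    # Requires a broker on localhost:9092 (see the Kafka start-up steps above).
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("topcv", to_kafka_bytes({"hello": "world"}))
    producer.flush()  # block until the message is actually delivered
```

With schemas.enable=false in the sink config, plain JSON bytes like these are inserted into MongoDB as-is.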

## Crawl data and stream it through Kafka to MongoDB

Run crawl.py on Windows from cmd. Requirements: install Python and the libraries kafka-python, beautifulsoup4, and selenium, and download chromedriver (check that its version matches your Chrome). crawl.py collects job data from http://topcv.vn.
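As a rough illustration of the parsing step inside crawl.py, job titles can be pulled out of a page with BeautifulSoup. The `job-title` class name and the sample HTML below are assumptions for the example; topcv.vn's real markup differs, and the real script drives a browser with Selenium before parsing:

```python
from bs4 import BeautifulSoup


def extract_job_titles(html):
    """Return job titles, assuming they sit in elements with class 'job-title'."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(".job-title")]


SAMPLE = """
<div class="job-title">Data Engineer</div>
<div class="job-title">Python Developer</div>
"""

print(extract_job_titles(SAMPLE))  # ['Data Engineer', 'Python Developer']
```

Each extracted record would then be serialized to JSON and sent to the topcv topic with kafka-python, landing in the transaction collection via the sink connector.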
