
This is based on Martin Karlson's excellent integration of Jupyter with Spark and Hadoop, with the addition of machine learning and mathematical statistics capabilities.


MalinYamato/machine_learning_on_bigdata

 
 


Machine Learning on BigData -- distributed ML-related computation based on real-time micro-batch streams as well as big data.

It is a layer on top of Martin Karlson's framework that adds machine learning, data mining, AI, and mathematical statistics capabilities: an instance that runs upgraded versions of Spark and PySpark (3.3.1) and Python (3.10.6). The instance runs on Ubuntu 22.04 instead of Alpine, which consumes somewhat more resources but makes integrating new software a great deal less painful. It is currently capable of running most Python machine learning packages, including the model calibration utility GridSearchCV, with the help of joblibspark, which enables scikit-learn classes such as GridSearchCV to run on executors on Spark worker nodes.
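As an illustration of that last point, here is a minimal sketch of how joblibspark distributes a scikit-learn grid search, assuming an active SparkSession in the notebook; the iris dataset, parameter grid, and n_jobs value are illustrative only.

    from sklearn import datasets, svm
    from sklearn.model_selection import GridSearchCV
    from joblib import parallel_backend
    from joblibspark import register_spark

    # Register the Spark backend with joblib so that scikit-learn's
    # internal parallelism is dispatched to Spark executors.
    register_spark()

    X, y = datasets.load_iris(return_X_y=True)
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
    search = GridSearchCV(svm.SVC(), param_grid, cv=3)

    # Each cross-validation fit runs as a task on the Spark workers.
    with parallel_backend("spark", n_jobs=4):
        search.fit(X, y)

    print(search.best_params_)

Without the parallel_backend context, the same search simply runs locally, so the snippet degrades gracefully when the cluster is not up.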

Malin Yamato 大和まりん

Email : [email protected]
post-graduate student, Artificial Intelligence Applied to Medicine (AIM)
Department of Physiology and Pharmacology (FYFA), Karolinska Institutet, and Department of Computer and Systems Sciences, Stockholm University

Currently what we have here - the vision, the goal

Why I am doing this, or the objective of this project

To provide the next generation of intensive care methods and methods in anaesthetics, applying machine learning and artificial intelligence to swiftly provide relief and optimal treatment in intensive care, which mostly involves care for those suffering from multi-disease conditions, sepsis, and multiple organ failure, all based on ASAP AI analytics provided by a framework such as this one. The figure on the left is what we have today and the one on the right is where we are heading.

Coming up: version 2

Purpose

To emulate the Hadoop/Spark/Kafka cluster by having services run in Docker on a virtual machine (VM) with its own IP address. This will make it easier to add data nodes and worker nodes locally or on physical machines by cloning a virtual machine node and deploying it there.

Pre-requisite

  • this has only been tested on Debian-based Linux distributions; it should work on all Linuxes, and I see no reason for it not to work on Windows
  • latest Docker and VirtualBox

Recommended hardware requirements

  • 32 GB RAM
  • 8 core CPU

Start

Execute bash master-build.sh to run the build and start the containers.

Hadoop

Access the Hadoop UI at http://localhost:9870

Spark

Access the Spark Master UI at http://localhost:8080

Jupyter

Access the Jupyter UI at http://localhost:8888
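Once the containers are up, a quick smoke test from a notebook cell can confirm that Jupyter reaches the Spark master. The master URL spark://spark-master:7077 below is an assumption about the container hostname in this setup; check the Spark Master UI at http://localhost:8080 for the actual address.

    from pyspark.sql import SparkSession

    # Hypothetical master URL -- verify the real spark:// address
    # in the Spark Master UI at http://localhost:8080.
    spark = (
        SparkSession.builder
        .master("spark://spark-master:7077")
        .appName("smoke-test")
        .getOrCreate()
    )

    # Trivial distributed job: sum 0..99 across the executors.
    print(spark.sparkContext.parallelize(range(100)).sum())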

