Hello again, it's Abdeljalil with a new project.
Spam email detection using the SVM machine learning algorithm ---> importing the datasets ---> extracting features and splitting the dataset ---> training the model and tuning parameters ---> building the final recognition system
To build the dataset I imported 4 datasets from the Kaggle website, with 5572, 83K, 179, and 5931 samples respectively.
The reason I use all of these datasets is to balance the number of spam and ham samples used in my system.
I downloaded all of these datasets as CSV files and stored them in a folder inside the project directory.
I created DATASETGENERATOR.py to get the dataset ready for use. Inside that file I open a connection to a MySQL database (I prefer MySQL tables over CSV files because I have been working with tables for a while).
I wrote one block per dataset: import the CSV file, open it, remove the header, then store the rows in a MySQL table with two columns, one for the label and the other for the email message. These tables are called dataset{k}, where k takes values from 1 to 4, one for each dataset.
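The import block for one dataset could look roughly like the sketch below. It is only an illustration, not the code from DATASETGENERATOR.py: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL so that it runs without a server, and the function name load_csv_into_table, the column names, and the sample rows are made up for the example.

```python
import csv
import io
import sqlite3


def load_csv_into_table(conn, csv_file, table_name):
    """Copy (label, message) rows from an open CSV file into a two-column table."""
    # table_name comes from our own code (dataset1 .. dataset4), never from user input.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table_name} (label TEXT, emailmessage TEXT)"
    )
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    rows = [(row[0], row[1]) for row in reader if len(row) >= 2]
    conn.executemany(
        f"INSERT INTO {table_name} (label, emailmessage) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


# Tiny in-memory CSV instead of a real Kaggle file (hypothetical sample rows).
sample_csv = io.StringIO("label,text\nham,See you at lunch\nspam,WIN CASH NOW!!!\n")
conn = sqlite3.connect(":memory:")
inserted = load_csv_into_table(conn, sample_csv, "dataset1")
stored = conn.execute(
    "SELECT label, emailmessage FROM dataset1 ORDER BY rowid"
).fetchall()
```

With a MySQL driver the flow is the same, but the placeholder in the INSERT statement is usually %s instead of ?.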
Next I created a new table called FINALDATASETTABLES that contains all four datasets. In the end I get 94K samples in total in my dataset.
When I counted the number of samples of each label (class) I found 44K versus 43K, but unfortunately I can't use all of them.
So far I have built a new dataset from several parts, and I'm ready for the next step.
The second step starts by creating a function that preprocesses the dataset messages to remove noise such as repeated whitespace and linking words like and, or, if, while, using the re library and regular expressions that I found entirely through Google.
The next step is to load the dataset from the MySQL database and start extracting features with a set of functions. The features used are: the number of phone numbers in the mail, the number of links in the mail, the presence or absence of currency symbols, the rate of special symbols in the mail, the presence or absence of special words in the mail, the rate of uppercase letters, and the rate of long words. Each of these features is extracted by a function that returns a numerical value for later use.
Next I made a simple test to check that my functions extract the features correctly. Once I confirmed they work well, I read my dataset table row by row, extract the features from each row, and save them in a new MySQL table called FINALDATASETFEATURESTABLES. I also convert the labels ham and spam to 0 and 1, because the model expects numeric labels. I do that for all 94K samples and store the result.
The last step before splitting the dataset is to remove the empty emails from the features table (I discovered them late, after finding some spam emails whose features were all 0).
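To make the feature list above more concrete, here is a minimal sketch of the preprocessing function and the seven features. The regular expressions, the word lists, the long-word threshold of 10 characters, and all function names are my assumptions for the example. The original expressions are not shown in this write-up.

```python
import re

# All patterns and word lists below are illustrative assumptions.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
LINK_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
CURRENCY_RE = re.compile(r"[$€£¥]")
STOP_WORD_RE = re.compile(r"\b(?:and|or|if|while)\b", re.IGNORECASE)
SPECIAL_WORDS = {"free", "win", "winner", "prize", "urgent", "offer"}
LONG_WORD_LEN = 10  # a word longer than this counts as a "long word"


def preprocess(message):
    """Remove linking words and collapse repeated whitespace."""
    text = STOP_WORD_RE.sub(" ", message)
    return re.sub(r"\s+", " ", text).strip()


def extract_features(message):
    """Return the seven numerical features for one (preprocessed) message."""
    words = message.split()
    letters = [c for c in message if c.isalpha()]
    non_space = [c for c in message if not c.isspace()]
    symbols = [c for c in non_space if not c.isalnum()]
    lowered = {w.strip(".,!?:;").lower() for w in words}
    return [
        len(PHONE_RE.findall(message)),                      # phone numbers
        len(LINK_RE.findall(message)),                       # links
        1 if CURRENCY_RE.search(message) else 0,             # currency symbol present
        len(symbols) / len(non_space) if non_space else 0.0, # special symbol rate
        1 if lowered & SPECIAL_WORDS else 0,                 # special word present
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        sum(len(w) > LONG_WORD_LEN for w in words) / len(words) if words else 0.0,
    ]


sample = "WIN a FREE prize!   Call +1 555 123 4567 or visit http://example.com and pay $5"
cleaned = preprocess(sample)
features = extract_features(cleaned)
empty_features = extract_features("")  # an empty email gives all zeros
```

The all-zero result for an empty message is the same symptom that revealed the empty emails in the features table.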
Now I'm ready to split the dataset features into training, validation, and test data.
I created three MySQL tables.
In a for loop I fetch only the ham emails and store the first 30K ham emails as training data, the next 10K as validation data, and the next 500 as test data, ignoring the rest. After that I fetch only the spam emails and do the same thing: 30K for training, 10K for validation, and 500 for testing, saved in the same three tables.
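The per-class split can be sketched as below. This version works on plain Python lists instead of MySQL tables, and the function name split_by_class and the toy rows are assumptions for the example. In the project the sizes are 30K, 10K, and 500 per class.

```python
def split_by_class(rows, n_train, n_val, n_test):
    """Split rows into train/validation/test with equal counts per class.

    rows is a list of (features, label) tuples, label 0 for ham and 1 for spam.
    """
    train, val, test = [], [], []
    for wanted in (0, 1):  # ham first, then spam
        same_class = [row for row in rows if row[-1] == wanted]
        train += same_class[:n_train]
        val += same_class[n_train:n_train + n_val]
        test += same_class[n_train + n_val:n_train + n_val + n_test]
        # anything after that is ignored
    return train, val, test


# Toy data: 10 ham and 10 spam rows, split 6 / 3 / 1 per class.
toy_rows = [([i], i % 2) for i in range(20)]
train, val, test = split_by_class(toy_rows, 6, 3, 1)
```

Note that each resulting list holds all ham rows first and then all spam rows, so a later limit on the training data should take the same number of rows from each class rather than just the first rows.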
With that, the first and second steps are finished.
The next step is to train the model and tune the hyperparameters, which are: the feature extraction method (either manual or TF-IDF), the margin setting, the kernel trick, and the amount of training data.
First I load all the training data and then set a limit on the quantity I use for training, so not all training samples are used.
After loading the training samples and limiting them, I convert the selected training features to a NumPy array to match the model's arguments.
I created an accuracy function to get percentages.
Then I created a simple model with simple data to confirm that the model works without issues.
Once that was confirmed, I load all the validation data row by row, convert each row to a NumPy array, pass it to the model's predict function, and save all of the model's classifications in a list to calculate the accuracy later.
After finishing the validation data, I call the accuracy function and give it the model's classifications. It sums all the correct classifications, divides by the total number of validation samples, and multiplies by 100 to get a percentage.
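The accuracy calculation described above can be written in a few lines. This is a sketch with an assumed function name and signature (predicted labels and true labels as two lists), not necessarily the function used in the project.

```python
def accuracy(predictions, labels):
    """Percentage of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels) * 100


# 3 of 4 predictions are correct.
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])  # 75.0
```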
At the end I created an SQL table called MODEL SPECIFICATION to store the hyperparameter values and the corresponding accuracy, with an INSERT statement to add this information. Then I go back to the top of the code, modify the model hyperparameters, and look for the best results: changing the kernel trick, the amount of training data, the margins. Try and try and try.
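The try-and-try-again loop can also be automated. The sketch below rests on several assumptions: it uses scikit-learn's SVC (this write-up does not name the SVM library), it treats the C parameter as the "margin" setting (a smaller C gives a softer margin), it trains on random toy data instead of the real feature tables, and it keeps the results in a Python list instead of the MODEL SPECIFICATION table.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)


def make_toy_data(n_per_class):
    """Hypothetical stand-in for the feature tables: 2 classes, 7 features."""
    ham = rng.normal(0.0, 1.0, size=(n_per_class, 7))
    spam = rng.normal(2.0, 1.0, size=(n_per_class, 7))
    X = np.vstack([ham, spam])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


X_train, y_train = make_toy_data(200)  # rows 0-199 are ham, 200-399 are spam
X_val, y_val = make_toy_data(50)

results = []
for kernel in ("linear", "rbf", "poly"):
    for C in (0.1, 1.0, 10.0):
        for limit in (50, 200):
            # Take the same number of training rows from each class.
            idx = np.concatenate([np.arange(limit), 200 + np.arange(limit)])
            model = SVC(kernel=kernel, C=C)
            model.fit(X_train[idx], y_train[idx])
            acc = float((model.predict(X_val) == y_val).mean() * 100)
            results.append(
                {"kernel": kernel, "C": C, "per_class": limit, "accuracy": acc}
            )

best = max(results, key=lambda r: r["accuracy"])
```

Each entry in results corresponds to one row that would be inserted into the specification table, and best holds the combination to reuse in the final system.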
After getting the best accuracy we can build our recognition system in a new file called FINALPROJECT.py.
First import the libraries and create the functions that get the training data ready (the features and the labels). Then create a function that builds a trained model and takes two arguments, the features and the labels.
Next create some functions that get the user's data ready for classification (we already saw them in the previous file), plus a function called SYSTEM RECOGNITION HEADER that calls those functions and returns a NumPy array to use as the sample to classify. At the end there is a function that converts the label from 0 and 1 back to ham and spam.
In the main method I call the two functions for labels and features and pass what they return to the MODELGENERATOR function. When it returns a model, I enter a while loop: ask the user for an email, use the predict function to decide the class of the email, then print the label.
With that I think everything is okay and working fine.
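The overall shape of FINALPROJECT.py can be sketched as follows. Everything here is an assumption made for illustration: scikit-learn's SVC as the SVM, a reduced set of three features, a handful of made-up training messages instead of the MySQL tables, and the function names. The linear kernel and C=1.0 are placeholders for whatever the tuning step found best.

```python
import re

import numpy as np
from sklearn.svm import SVC

# Made-up training messages; the project loads these from the MySQL tables.
TRAINING = [
    ("Are we still meeting for lunch tomorrow?", 0),
    ("Please find the report attached", 0),
    ("Thanks for your help with the move", 0),
    ("WIN $1000 NOW at http://win.example", 1),
    ("FREE PRIZE!!! CLAIM £500 www.prize.example", 1),
    ("URGENT: you WON €200, visit http://cash.example", 1),
]


def extract_features(message):
    """Reduced feature set: link count, currency flag, uppercase rate."""
    letters = [c for c in message if c.isalpha()]
    return [
        len(re.findall(r"(?:https?://|www\.)\S+", message, re.IGNORECASE)),
        1 if re.search(r"[$€£¥]", message) else 0,
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
    ]


def generate_model(features, labels):
    """Train and return the model from the two arguments."""
    model = SVC(kernel="linear", C=1.0)  # placeholder hyperparameters
    model.fit(np.array(features), np.array(labels))
    return model


def label_name(value):
    """Convert 0/1 back to ham/spam."""
    return "spam" if value == 1 else "ham"


def classify(model, message):
    """Turn one user message into a NumPy sample and predict its class."""
    sample = np.array([extract_features(message)])
    return label_name(int(model.predict(sample)[0]))


def run(model, read=input, write=print):
    """Ask for emails until an empty line is entered."""
    while True:
        message = read("Email text (empty line to quit): ")
        if not message.strip():
            break
        write(classify(model, message))


features = [extract_features(text) for text, _ in TRAINING]
labels = [label for _, label in TRAINING]
model = generate_model(features, labels)
# run(model)  # uncomment to start the interactive loop
```

Passing read and write as arguments keeps the loop easy to try without a keyboard, for example with a list of prepared messages.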
G-Suite: [email protected] | IDE: PyCharm, DataGrip, Excel | Time: 144 h from scratch