-
Introduction
- 1.1 Basics of Scrabble Gameplay
- 1.2 Game Elements and Components
- 1.3 Event flow in Gameplay
- 1.4 Detected Events
-
Data Description
- 2.1 Dataset Preparation
- 2.1.1 Approach
- 2.1.2 Challenges in Data Recording
- 2.1.3 The Second Approach
- 2.2 Examples of Dataset Images
-
Methods and Techniques
- 3.1 Preprocessing Steps
- 3.2 Detection and Recognition Techniques
- 3.2.1 Board Detection
- 3.2.2 Letter detection
- 3.2.3 Turn detection
- 3.3 Word recognition
- 3.4 Score counting
-
Results and Analysis
- 4.1 Effectiveness per Dataset
- 4.2 Challenges, Limitations and Observations
-
Summary of Findings
-
References
The game involves placing interrelated words on a board using tiles with letters of varying values, resembling the construction of a crossword puzzle. The goal of the game is to achieve the highest score.
- Board: The main element of the game where words are placed.
- Letter racks: Used by players to hold their letter tiles.
- Letter bag: Used to draw letter tiles randomly.
- 100 letter tiles: Game elements used by players to form words. Each letter has a specific point value.
- Player marker: Used to identify players or their positions.
The gameplay consists of the following stages:
- A player draws tiles from the bag and places them on their rack.
- During their turns, players can either:
- form a word on the board using the available tiles,
- exchange a chosen number of tiles,
- skip the turn.
- At the end of the turn, the player replenishes their rack to a full set of tiles.
- The points collected in the turn are added up, taking into account the created word, any multipliers from special fields, and the bonus for using all the tiles from the player's rack in one turn.
- The game ends when the letter bag is empty, and no player can make a valid move.
Our system provides a virtual view of the game in real-time. It tracks the state of the game by monitoring:
- turn (player 1 or player 2),
- each player's score,
- board state (created words),
and does it by detection of the following events:
- change of turn: moving the player marker (pink card in our case),
- building a word,
- detecting the created word (assuming the Polish language and accordance with the game rules of SJP),
- checking for special tiles (optional multipliers of the created words or single letters),
- skipping turn (moving the card without making changes to the board).
We recorded Scrabble games ourselves, capturing a variety of scenarios to simulate real gameplay. Each approach was designed to include three difficulty levels:
-
Easy:
- Perfect view of the board and game components.
- The game elements are not covered by hands when interacting with them.
-
Medium:
- Different lighting dynamics in the images.
- Presence of shadows and light reflections.
-
Difficult:
- Includes all conditions from the medium level.
- Angled views (a few or several degrees) of the board.
- Game components partially covered by hands during movement.
- Slight shaking of the camera during recording.
Our first approach involved using a wooden table as the background. This setup presented significant challenges for detection due to the following issues:
-
Lighting and Colors:
- Varying light conditions caused inconsistent detection results.
- The dark blue border of the board blended with the dark brown wooden table, making segmentation difficult.
-
Overlapping Components:
- The bag of letters often overlapped with the board. Since both the bag and the board border shared similar colors, the color segmentation function could not distinguish between them.
- Similarly, letter racks shared the same color as the board, causing additional segmentation issues.
-
Recording Quality:
- For unknown reasons, the camera we used caused us some trouble. Not only did it stop recording in the middle of a session, but, more importantly, it significantly lowered the quality of the recording. Although we eventually had to lower the quality for processing anyway, this made the already hard problem of letter detection even more difficult.
These challenges highlighted the importance of carefully selecting backgrounds and controlling the recording environment to ensure effective data processing.
Due to the significant challenges encountered in the first dataset, we decided it was necessary to create a second dataset with several important changes to improve detection accuracy:
-
Black paper beneath the board:
- This provided a clear contrast with all game components, making segmentation more effective.
-
Ensured distances between components:
- Game elements such as the letter bag, racks, and tiles were positioned with enough spacing to avoid overlap and confusion during detection.
-
White tape around the board:
- A white border was created around the board using tape. This greatly improved the detection of the board's boundaries.
- The white border was resistant to light reflections, resolving the issues in the first dataset where reflections made corner detection impossible.
-
Stabilized the board:
- In the initial recordings, the board moved slightly during players' turns. For the new dataset, we ensured the board remained stable, reducing variability in detection.
These changes addressed the limitations of the first approach and provided a more reliable dataset for developing the Scrabble detection system.
As mentioned before, there were 3 levels of difficulty. We recorded 3 different clips per difficulty level, each at least a minute long. Due to unforeseen problems, we had to do it twice.
Level 2: Uneven light from a lamp and flashlight flickering.

Level 3: Angled camera, video disruptions caused by camera shaking, and all the previous impediments.

We were very restrained with the preprocessing steps. The only change that we implemented was an optional resize of each frame. This step is a double-edged sword: a smaller frame lowers the computational cost, while a larger one keeps sharper details (more information). We also experimented with histogram equalization methods, yet they didn't prove very helpful in mitigating the light distortions at the corners of the board, which made board detection impossible (assuming we want to keep all the tiles). This shouldn't come as a surprise, because the problem was rooted in the hue being unrecognizable wherever a white circle of light covered the area. The CLAHE algorithm would be more useful if the hue were well defined and only the value or saturation needed adjustment.
The detection and recognition part of our project might be divided into three main tasks: board, letter and turn detection/recognition.
The purpose of board detection is quite obvious. As a matter of fact, it was the fundamental part of the entire project. Proper letter detection and, later on, score counting heavily depend on this step. That's why we wanted to make sure that it is as good as it gets.
Initially, uneven lighting, especially on the borders of the board, along with the standard dark-green colour of the border, was too much of a challenge for a simple hue-based corner detection. It was very hard to create a mask that would cover the entire board while simultaneously discarding the same-coloured tile bag that overlapped with the playing field.
For those and some other reasons we decided to record the videos again, this time in a more deliberately arranged environment.
The white tape worked like a charm. The larger distance between the bag and the board is not a coincidence either.
Finally, let's talk about the detection algorithm. Its steps are as follows:
- converting the frame to HSV (to isolate the different shades of the border),
- masking based on lower and upper HSV thresholds (to keep only a range of green),
- filling in pixels whose count of closest neighbours is greater than a threshold,
- finding contours inside the mask and choosing the one with the maximal area,
- applying the Hough Lines algorithm to the detected contour, merging the falsely separated lines, and finally finding the corners as intersections of the lines,
- iteratively rotating and then swapping the corners until the desired orientation is found.
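The line-merging and corner-finding step can be illustrated with a minimal sketch. It assumes the lines come in the (rho, theta) normal form that `cv2.HoughLines` returns. The helper names (`merge_similar`, `intersect`, `board_corners`) and the tolerances are hypothetical and not taken from our code.

```python
import math


def merge_similar(lines, rho_tol=10.0, theta_tol=0.1):
    """Average lines whose (rho, theta) are close to the first line of a group.

    Assumption: tolerances are illustrative; theta wrap-around near pi is ignored.
    """
    groups = []
    for rho, theta in lines:
        for group in groups:
            g_rho, g_theta = group[0]
            if abs(rho - g_rho) < rho_tol and abs(theta - g_theta) < theta_tol:
                group.append((rho, theta))
                break
        else:
            groups.append([(rho, theta)])
    return [
        (sum(r for r, _ in g) / len(g), sum(t for _, t in g) / len(g))
        for g in groups
    ]


def intersect(line_a, line_b, eps=1e-9):
    """Intersection of two lines given as x*cos(theta) + y*sin(theta) = rho."""
    rho1, theta1 = line_a
    rho2, theta2 = line_b
    det = math.cos(theta1) * math.sin(theta2) - math.sin(theta1) * math.cos(theta2)
    if abs(det) < eps:
        return None  # (nearly) parallel lines never meet
    x = (rho1 * math.sin(theta2) - rho2 * math.sin(theta1)) / det
    y = (rho2 * math.cos(theta1) - rho1 * math.cos(theta2)) / det
    return (x, y)


def board_corners(vertical, horizontal):
    """Corner candidates: every near-vertical line crossed with every near-horizontal one."""
    return [intersect(v, h) for v in vertical for h in horizontal]
```

With two vertical and two horizontal border lines this yields four corner candidates, which would then go through the rotation and swapping step described in the last bullet.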
The first step of letter detection is to transform the board in such a way that we obtain its top-down view. Perhaps this wouldn't matter so much in a deep learning approach, but in our case it is crucial, and in a moment I'll explain why. Below, you can see an example of such a perspective transform:
The picture was taken from this article; I recommend it for getting a better intuition behind the perspective transform.
Implementation note: luckily for us, the cv2 library contains the getPerspectiveTransform and warpPerspective functions, which allowed us to turn the idea into code effortlessly.
Considering the above, we were able to create a grid-like representation of the board, where each square corresponded to a particular cell in a "chessboard" coordinate system, e.g. A5, B8, C3 etc. This proved invaluable in word recognition.
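The mapping from a pixel of the top-down view to a cell label can be sketched as follows. This assumes a standard 15x15 board, columns labelled with letters and rows with numbers. The function name `cell_label` is hypothetical, and the exact labelling convention is an assumption.

```python
import string

BOARD_CELLS = 15  # a standard Scrabble board is 15x15


def cell_label(x, y, board_px):
    """Return a label such as "B5" for pixel (x, y) of a square top-down view.

    Assumption: columns are letters (A..O) and rows are numbers (1..15).
    """
    cell = board_px / BOARD_CELLS
    # Clamp so that pixels on the far edge still fall into the last cell.
    col = min(int(x // cell), BOARD_CELLS - 1)
    row = min(int(y // cell), BOARD_CELLS - 1)
    return f"{string.ascii_uppercase[col]}{row + 1}"
```

For example, on a 750-pixel view each cell is 50 pixels wide, so `cell_label(75, 220, 750)` lands in the second column and fifth row.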
Now, why did we need this transformation so much? To use template matching!
We cut out and labeled an example of each letter from a well-visible (in terms of lighting and occlusion) video, cropped it a bit so as not to impose any bias (e.g. a black stripe on the side would suggest that this particular letter should be the last one in a word) and then binarized it to obtain the mask.
The Polish language introduced some specific troubles at this point. Letters with diacritics are visually very similar to the letters they originate from, which misleads the template matching algorithm, for example:
"A" -> "Ą", "S" -> "Ś", "Z" -> "Ź", etc.
We decided to use only "basic" letters for the purpose of template matching and resolved the remaining ambiguity in the upcoming Word recognition section.
Turn detection should be separated into two steps:
- Marker detection: as mentioned before, in our dataset we used a pink card as an artificial turn marker, introduced for the purposes of this project (it is not a Scrabble standard). The player who has the card on their side is on the move. The marker was detected simply from its colour: pink doesn't appear anywhere else in our frames (except for a hand under certain light), so it was quite straightforward to select it using an HSV range.
- Turn recognition: in our dataset the camera doesn't change its position relative to the board throughout the game. We used this fact to compare the average Y-coordinate of the detected card with the height of the picture. If the midpoint passed half of the picture's height, a switch of turn was announced.
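The turn recognition rule can be sketched as below. This is a minimal illustration, assuming the marker pixels' Y-coordinates have already been extracted per frame. The function names and the assignment of the upper half to player 1 are assumptions.

```python
def marker_side(marker_ys, frame_height):
    """Return 1 or 2 depending on which half of the frame the marker's mean Y falls in.

    Assumption: the upper half belongs to player 1, the lower half to player 2.
    """
    mean_y = sum(marker_ys) / len(marker_ys)
    return 2 if mean_y > frame_height / 2 else 1


def detect_turn_changes(frames_marker_ys, frame_height):
    """Yield (frame_index, new_player) whenever the marker changes sides."""
    changes = []
    current = None
    for index, ys in enumerate(frames_marker_ys):
        if not ys:
            continue  # marker not visible in this frame (e.g. covered by a hand)
        side = marker_side(ys, frame_height)
        if current is not None and side != current:
            changes.append((index, side))
        current = side
    return changes
```

Frames in which the marker is not found are skipped, so a hand briefly covering the card does not trigger a false switch.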
Our implementation allows us to retrieve only those letters that were added in the latest turn. That's very good already, but there are certain issues to be resolved:
- bad letter prediction (common case: "F" instead of "E"),
- false positives (letters that are simply not there),
- false negatives (letter is there but wasn't detected by the template matching),
- special empty tile.
How did we cope with that? Solutions to each problem, respectively:
- We created the WordValidator class which, given a dictionary of legal words, implements a word_matcher method that finds the closest allowed word to the given one using a quasi-Levenshtein distance. It's not exactly the Levenshtein distance, because only substitutions are allowed in this case: we assume that the number of letters in the word is correct. If that is not sufficient, we try to append or prepend one letter to the current word. This heuristic, although imperfect, works surprisingly well.
- We check for letters that are isolated from the rest ("islands"). Checking the vertical and horizontal neighbourhood of a letter is very simple yet very effective, because the chance of a false positive is relatively small in our case, and the chance of 2 false positives in a row is even smaller.
- The aforementioned word_matcher method solves that with the prepend/append heuristic. It is negligibly rare that a letter inside a word is not detected.
- It is not possible to detect the empty tile by template matching. The only way is to infer its existence, so to say. This can be done by treating the board as a graph and checking its connectivity, since a disconnected graph is forbidden by Scrabble's rules. Another way is to look at the letters added during the last turn and once again check their connectivity. Only one word may be added during a turn, thus we can conclude that an empty field between two strings is, in fact, an empty tile.
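The heuristics above can be sketched in a few small functions. This is a simplified illustration, not the actual WordValidator code: the names `match_word`, `remove_islands` and `infer_blanks`, the `max_subs` limit and the `(row, col)` cell representation are all assumptions.

```python
def substitutions(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def match_word(detected, dictionary, max_subs=1):
    """Closest legal word using substitutions only, then a prepend/append fallback."""
    words = sorted(dictionary)  # sorted for deterministic tie-breaking
    same_len = [w for w in words if len(w) == len(detected)]
    if same_len:
        best = min(same_len, key=lambda w: substitutions(w, detected))
        if substitutions(best, detected) <= max_subs:
            return best
    # Fallback: assume one letter at the start or end was missed.
    for w in words:
        if len(w) == len(detected) + 1 and detected in (w[1:], w[:-1]):
            return w
    return None


def remove_islands(cells):
    """Drop detections with no horizontal or vertical neighbour.

    `cells` maps (row, col) to the detected letter.
    """
    def has_neighbour(row, col):
        steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
        return any((row + dr, col + dc) in cells for dr, dc in steps)

    return {pos: letter for pos, letter in cells.items() if has_neighbour(*pos)}


def infer_blanks(new_positions, occupied):
    """Unoccupied cells inside the span of this turn's tiles, taken to be blank tiles."""
    rows = {r for r, _ in new_positions}
    cols = {c for _, c in new_positions}
    if len(rows) == 1:
        row = next(iter(rows))
        span = [(row, c) for c in range(min(cols), max(cols) + 1)]
    elif len(cols) == 1:
        col = next(iter(cols))
        span = [(r, col) for r in range(min(rows), max(rows) + 1)]
    else:
        return []  # tiles of one turn must share a row or a column
    return [pos for pos in span if pos not in occupied]
```

For instance, with a dictionary containing "KOT" and "KOTY", the misread "KOF" is corrected to "KOT" by one substitution, while "OTY" falls through to the prepend step and becomes "KOTY".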
Score counting by now is fairly simple. As we have the information about the newly added word, we can simply check for the hardcoded bonuses corresponding to fields that the tiles occupy. Bonuses don't change throughout the game, so this is sufficient.
We repeat this approach iteratively, adding the points to each player, depending on the parity of the turn.
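A minimal sketch of this scoring scheme follows. The function names are hypothetical, and the letter values and bonus positions are passed in as parameters because the tables used in the usage example are illustrative placeholders, not the real board layout. The sketch assumes the standard rules that premium fields count only for newly placed tiles and that using all 7 rack tiles earns 50 extra points.

```python
def score_word(word_cells, new_positions, letter_values,
               letter_mult, word_mult, rack_size=7, full_rack_bonus=50):
    """Score one word.

    word_cells: list of ((row, col), letter) for the whole word.
    new_positions: set of (row, col) placed this turn.
    letter_mult / word_mult: dicts mapping (row, col) to a multiplier.
    """
    total = 0
    multiplier = 1
    for pos, letter in word_cells:
        value = letter_values.get(letter, 0)  # unknown or blank tile scores 0
        if pos in new_positions:  # premium fields count only for new tiles
            value *= letter_mult.get(pos, 1)
            multiplier *= word_mult.get(pos, 1)
        total += value
    total *= multiplier
    if len(new_positions) == rack_size:
        total += full_rack_bonus
    return total


def add_turn_score(scores, turn_index, points):
    """Credit the points to player 0 or 1 depending on the parity of the turn."""
    scores[turn_index % 2] += points
    return scores


# Usage with placeholder tables (assumed values, for illustration only).
values = {"K": 2, "O": 1, "T": 2}
word = [((7, 7), "K"), ((7, 8), "O"), ((7, 9), "T")]
points = score_word(word, {(7, 7), (7, 8), (7, 9)}, values,
                    letter_mult={(7, 9): 2}, word_mult={(7, 7): 2})
scores = add_turn_score([0, 0], 0, points)
```

This sketch scores a single word only, in line with the description above; any additional cross-words formed in the same turn are not handled.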
Easy Dataset: The processing algorithm works flawlessly. The only issue encountered, which is consistent across all subsequent datasets, is that a hand sometimes disrupts the board detection (or the detection of the pink marker, although this is generally harmless), which may cause various sorts of trouble. Luckily, we managed to mitigate it.
Medium Dataset: In the medium dataset a new problem arose: light blinding the board and thus the letters. If the light is strongly concentrated in one point, there is not much that can be done. Nonetheless, our heuristic of finding the most probable word for the given letters does a great job here, reducing this issue to a negligible size.
Hard Dataset: Although we had to cope with an acute angle and irregular shaking of the camera in this case, it didn't present a significant challenge (in comparison to light concentrated at one point). The algorithm is robust to slight angle changes (±20 degrees) in the forward and backward direction. Camera shaking is not a big deal thanks to the fixed frame rate (not every frame is processed) and a workaround that averages certain bounding boxes over recent frames.
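The averaging of bounding boxes over recent frames can be sketched as a simple moving average. The class name `BoxSmoother`, the window size and the `(x, y, w, h)` box format are assumptions made for illustration.

```python
from collections import deque


class BoxSmoother:
    """Moving average of the last `window` bounding boxes (x, y, w, h)."""

    def __init__(self, window=5):
        # deque with maxlen drops the oldest box automatically
        self.history = deque(maxlen=window)

    def update(self, box):
        """Add the newest box and return the averaged box."""
        self.history.append(box)
        count = len(self.history)
        return tuple(sum(b[i] for b in self.history) / count for i in range(4))
```

A larger window suppresses more jitter but reacts more slowly when the board or the camera really moves.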
We have to admit that our approach is indeed quite limited, partly due to our lack of experience in the field and probably partly due to the inherent nature of classical Computer Vision algorithms (we used neither ANNs nor other ML methods). Nonetheless, our results could be reproduced given the same Scrabble set (particularly the board and the tiles) and a reasonable colour distinction between the board and the table.
Other than that, light was a factor that we underestimated in the initial stage of the project. It turns out, contrary to our earliest assumptions, that the biggest problem to avoid is not shadows, but rather too much light concentrated in one place.
Despite its poor adaptability, the approach performed better than we had estimated. It is quite efficient and accurate, considering its simplicity. Presumably its biggest advantage over DL-based methods is its explainability, and this conclusion could be extrapolated to the entire family of similar algorithms.











