Overview
Video games & Card-games
Video games serve as immersive virtual environments where players engage using various peripheral devices such as keyboards, mice, and gamepads. Designed for entertainment, these games enable players to unwind while making strategic decisions within established rules to achieve rewards.
Games can be classified into various categories, including action, first-person shooter (FPS), third-person shooter (TPS), and strategy, among others. Notably, a single game can belong to multiple genres, such as being categorized as both action and hack 'n slash, or as a third-person shooter.
Figure 1.1: Poker Championship, a Poker-based online game 1
Card games, both in digital and physical formats, are among the most popular forms of entertainment. A prime example of this genre is Poker, which illustrates the enduring appeal and complexity of card games.
Poker Championship is an engaging multiplayer poker game featuring identical playing cards. Players can create customized decks from a large set of cards, enhancing their gaming experience. Key features include:
• Participate in massive tournaments with up to 500 players and enjoy daily free buy-in events.
• Engage in the World Tour Season to earn Tour Credits and improve your ranking.
• Receive free chips every four hours and unlock rewards by collecting them.
• Enjoy the thrill of hitting the Jackpot at every poker table.
• Showcase your achievements with a variety of avatar portraits and decorations.
• Stay competitive with exciting seasonal events, daily and weekly missions.
• Compete in real time with Texas hold'em players globally and aim to become the world champion.
• Play seamlessly across devices, starting on mobile and continuing on PC or TV.
The game is compatible with Windows and macOS, requiring a minimum of 1 GB of RAM and sufficient storage space. Players have shared positive feedback about the immersive gameplay and competitive features of Poker Championship.
Card games are a form of strategy game characterized by imperfect information: players are aware only of their own cards and those on the table, while remaining unaware of their opponents' cards and the ones still in the deck. This unique aspect of card games emphasizes strategic thinking and decision-making.
Artificial Intelligence
Artificial Intelligence (AI) has been a significant area of research and development since its inception in the 1950s. Today, AI plays a crucial role in various computer applications, including recommendation systems, recognition, and detection. The gaming industry is no exception, as AI has become an integral component of most computer games, enhancing gameplay and user experience.
In the 1990s, classical AI methods were implemented in games. For instance, Deep Blue used search algorithms to play chess. Although Deep Blue's performance was good, its algorithm demands significant computational power to work well on much larger state-space games.

Figure 1.2: Deep Blue in IBM's headquarters in Armonk, N.Y. (Yvonne Hemsey/Getty Images)
In recent years, Machine Learning approaches (Supervised Learning, SL, and Reinforcement Learning, RL) have been implemented and gradually improved, resulting in high performance and low computational power requirements.
Objectives
Card games are a good application for testing AI systems. We can evaluate the efficiency of AI systems on these games since they contain complicated state and action spaces.
This project focuses on developing a contemporary card game that utilizes intelligent agents powered by modern machine learning techniques. To accomplish this, we will undertake a series of tasks aimed at enhancing gameplay and agent performance:
• Investigate card games to prepare for designing an alternative card game with smart agents based on machine learning models.
• Build the proposed Card game in order to support the smart agent.
• Build Intelligent Agents using modern methods (RL, combined RL - SL) and evaluate their performance.
We decided to build our own game instead of using an existing game because:
• The game's complexity is easily manageable, allowing players to quickly grasp its functions and interactions without dedicating excessive time to exploration and understanding.

• On-demand customization and integration can be achieved quickly due to our development of the core components, making it simpler to create new features or subsystems rather than modifying existing code from other sources.
Development process
To achieve the above objectives, we break down the development process into phases as follows:
• Phase 1: Research both game engines and machine learning methods: foundation, technical basis, mathematical basis, advantages and disadvantages.
• Phase 2: Choose techniques to build the games and agents, and design the core game components.
• Phase 3: Implement a game prototype with basic interactions and an agent-training environment.
• Phase 4: Build a complete game and an intermediate component to link the game and the training environment.
• Phase 5: Update the game and train agents based on chosen approaches.
• Phase 6: Organize a self-retrospective meeting to evaluate the whole thesis.
This dissertation outlines the development process. Chapter 2 presents the mathematical and technical foundations related to game engines, games, and reinforcement learning approaches. Chapter 3 discusses key considerations for game engines and modern machine learning techniques, culminating in a summary of our selected technologies and development strategy. In Chapter 4, we detail the main components, including the communication server, the proposed game, and the game agents. Chapter 5 evaluates the performance of our smart agents, while Chapter 6 concludes the work with a self-retrospective on our achievements, challenges, and potential improvements.
Game Engines
Definition
The term "game engine" emerged in the mid-1990s, particularly in relation to first-person shooter games like Doom by id Software This game featured a clear separation between its core software components—such as 3D graphics rendering, collision detection, and audio systems—and the art assets, game worlds, and gameplay mechanics that shaped the player's experience This separation proved valuable as developers began to license and modify games, allowing them to create new products by altering art, world layouts, weapons, characters, vehicles, and game rules with minimal changes to the underlying engine software.
Game engines have become highly customizable through scripting languages such as QuakeC, allowing developers to license them as a profitable secondary revenue stream. Nowadays, game developers can license game engines and reuse substantial parts of their essential software components, making this a cost-effective alternative to creating all core engine components from scratch.
The distinction between a game and its engine is often unclear; some engines maintain a clear separation while others do not. In certain games, the rendering code is specifically tailored to draw particular characters such as orcs, whereas in others a more general rendering engine allows for flexibility through a data-driven architecture. This data-driven approach is what sets a game engine apart from a standard game, which typically relies on hard-coded logic and specific rendering code that hinders reusability. In contrast, a game engine is designed to be extensible, serving as a versatile foundation for creating various games with minimal modifications.
Game engines are designed to create various types of games but tend to be optimized for specific genres. With advancements in graphics cards, processing units, and rendering algorithms, it is feasible to use a first-person shooter (FPS) engine to develop a strategy game. However, there remains a trade-off between generality and optimality: a strategy game built on an FPS engine may experience slower rendering and longer logic processing compared to one developed on a specialized strategy engine.
Game Engine       | Platform | Scripting Language(s)
Unreal Engine 4   | 2D/3D    | C++, Python, Blueprint visual scripting
Amazon Lumberyard | 3D       | Lua, Script Canvas

Table 2.1: Some popular game engines with their platforms and scripting languages.
1 https://www.gameenginebook.com/figures.html
Runtime Engine Architecture
A game engine typically comprises a suite of tools and a runtime component, as illustrated in Figure 2.2, which highlights the essential runtime elements of a standard 3D game engine. Like other software systems, game engines are structured in layers, with higher layers relying on the foundational lower layers, ensuring a one-way dependency.
Figure 2.2: The runtime game engine architecture 2
This section will introduce some main components that developers usually make use of in any game engine.
2.1.2.1 Third-Party SDKs and Middleware
Many game engines utilize various third-party software development kits (SDKs) and middleware, with the interface of an SDK commonly referred to as an application programming interface (API).
Games, like any software system, rely significantly on container data structures and algorithms for efficient data manipulation. Notable third-party libraries that provide these facilities include Boost, which offers a data structures and algorithms library in the spirit of standard C++ and the Standard Template Library (STL); Folly, designed to enhance the performance of the standard C++ library and Boost; and Loki, which also contributes to these capabilities.
Most game rendering engines are built on top of a hardware interface library, such as:
• Glide, a 3D graphics SDK designed for the older Voodoo graphics cards.
• OpenGL, a widely adopted portable 3D graphics SDK.
• DirectX, Microsoft's proprietary 3D graphics SDK.
• Vulkan, a low-level library developed by the Khronos Group that gives game developers direct access to the GPU for rendering and GPGPU compute tasks, with fine-grained control over memory and resources shared between the CPU and GPU.
Collision detection and rigid body dynamics (briefly, "physics") are provided by these well-known SDKs: Havok (a popular industrial-strength physics and collision engine), PhysX (another popular industrial-strength physics and collision engine, from NVIDIA), and the Open Dynamics Engine.
The resource manager serves as a centralized interface for accessing various game assets and engine input data. Some engines, such as Unreal and OGRE, implement this consistently through packages and resource management classes, whereas other engines adopt a more ad hoc approach, requiring game programmers to manually access raw files stored on disk or within compressed archives.

The rendering engine is a crucial and intricate part of any game engine, with various architectural designs available. Most contemporary rendering engines adhere to key design principles influenced significantly by the 3D graphics hardware they rely on.
This layer includes the engine's fundamental rendering capabilities, emphasizing the rapid and detailed display of geometric primitives while largely ignoring the visibility of specific scene elements.
2 https://www.gameenginebook.com/figures.html
Figure 2.3: Cocos 3.1.0 engine allows user to choose desired physics simulation library.
This component optimizes rendering by limiting the number of primitives submitted based on visibility. In small game worlds, frustum culling, which eliminates objects outside the camera's view, is typically sufficient. However, for larger game worlds, advanced spatial subdivision data structures improve rendering efficiency by rapidly determining the potentially visible set (PVS) of objects.

Modern game engines offer extensive visual effects capabilities, featuring advanced particle systems, decal systems, and techniques like light mapping and environment mapping. They also incorporate dynamic shadows and full-screen post-processing effects, such as high dynamic range (HDR) tone mapping, bloom, and full-screen anti-aliasing (FSAA), enhancing the overall gaming experience.
Most games employ some kind of 2D graphics overlaid on the 3D scene for various purposes: a heads-up display (HUD), in-game menus, and an in-game graphical user interface (GUI).
Collision detection is crucial in games, as it prevents objects from inter-penetrating and ensures meaningful interactions within the virtual environment. Additionally, many games incorporate realistic or semi-realistic physics simulations to enhance the overall experience.

Collisions and physics are intricately linked, with collision detection often resolved through physics integration and constraint satisfaction. Most game engines rely on third-party SDKs, such as PhysX or Havok, for their physics engine.
To process inputs from the player, games need human interface devices (HIDs), including the keyboard and mouse, a joypad, or other specialized game controllers.
Figure 2.4: Unity scripting system requires shutting the game down before recompilation.
The player I/O component, as it is often called, is also responsible for delivering output to the player through various human interface devices. This includes features like force-feedback or rumble effects on game controllers and audio output from devices like the Wiimote.

Gameplay encompasses the actions and rules within a game's virtual environment. It is often developed in the same native programming language as the game engine or through a high-level scripting language. To connect gameplay code with low-level engine systems, many game engines incorporate a layer known as gameplay foundation systems.
• Game Worlds and Object Models
The gameplay foundations layer establishes a game world composed of both static and dynamic elements, typically modeled using an object-oriented approach. This diverse collection of object types, including background geometry, dynamic rigid bodies, player characters, non-player characters, weapons, projectiles, lights, and cameras, forms a real-time simulation that enriches the gaming experience.

In game development, effective communication between game objects is essential. An event-driven architecture is a widely used method for facilitating this inter-object communication. In this approach, the sender generates a data structure known as an event or message, which includes the message type and any relevant argument data. The event is then transmitted to the receiving object by invoking its event handler function.
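A minimal sketch of this pattern follows; the class and event names are hypothetical and only illustrate the message-type plus argument-data structure described above.

```python
# Minimal sketch of event-driven inter-object communication (hypothetical names).
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Event:
    type: str                                            # message type, e.g. "damage"
    args: Dict[str, Any] = field(default_factory=dict)   # argument data


class GameObject:
    def __init__(self, name: str):
        self.name = name
        self.handlers: Dict[str, List[Callable[[Event], None]]] = {}

    def on(self, event_type: str, handler: Callable[[Event], None]) -> None:
        """Register an event handler for a given message type."""
        self.handlers.setdefault(event_type, []).append(handler)

    def send(self, event: Event) -> None:
        """Deliver the event by invoking every handler registered for its type."""
        for handler in self.handlers.get(event.type, []):
            handler(event)


# Usage: a projectile notifies a minion that it has been hit.
minion = GameObject("minion")
minion.on("damage", lambda e: print(f"{minion.name} takes {e.args['amount']} damage"))
minion.send(Event("damage", {"amount": 3}))
```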
Reinforcement Learning
Markov Decision Process
In reinforcement learning, the most typical decision-making framework is the Markov Decision Process (MDP). An MDP is defined by the 5-tuple \((S, A, P^{a}_{s \to s'}, R^{a}_{s}, \gamma)\) described below:
Here, \(S\) is the state space, \(A\) is the set of possible actions available to the agent, and \(P^{a}_{s \to s'}\) is the transition probability function, which defines the likelihood of the environment moving to a new state \(s'\) when the agent executes a specific action \(a\) in the current state \(s\).
\(R^{a}_{s}\) is a reward function; it says how much reward we expect to get from taking action \(a\) in state \(s\). And \(\gamma \in [0, 1)\) is a hyperparameter called the discount factor.
Following an MDP, the agent starts in state \(s_0\) and gets to choose some action \(a_0 \in A\). As a result, the state of the MDP randomly transitions to some successor state \(s_1\), drawn according to \(P^{a_0}_{s_0 \to s_1}\); the agent then picks another action \(a_1\), and so on.
In a Markov Decision Process (MDP), an agent selects an action \( a \) from a set of actions \( A \) based on a defined policy \( \pi \). This policy \( \pi \) serves as a function that maps a given state \( s \) to the corresponding action \( a = \pi(s) \). To evaluate the effectiveness of the policy \( \pi \), we establish a value function that quantifies the expected return for following this policy.
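For reference, a standard discounted-return form of this value function, consistent with the reward \(R^{a}_{s}\) and discount factor \(\gamma\) defined above, is:

\[ V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R^{\pi(s_t)}_{s_t} \;\middle|\; s_0 = s \right] \tag{2.3} \]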
In (2.3), \(V^{\pi}(s)\) is the expected reward upon starting at state \(s\) and taking actions according to \(\pi\). Our goal in an MDP is to find an optimal policy \(\pi^{*}\) that maximizes the expected total reward.
A key feature of Markov Decision Processes is that both the reward and the subsequent state are determined solely by the current state and the action taken. This property is also present in reinforcement learning techniques that build on MDPs, including Q-learning.
Q-learning
Q-learning is one of the core methods applying reinforcement learning to MDPs. The goal of Q-learning, like that of the MDP setting mentioned in section 2.2.1, is to interact with the emulator (or the environment) by selecting actions in a way that maximizes the rewards. In order to understand the concepts of Q-learning, we first define the state-action value function (or Q-value function) \(Q^{\pi}(s, a)\), representing the expected reward when starting in state \(s\), performing action \(a\) and following policy \(\pi\).
In Q-learning, our goal is to find a strategy that maximizes the expected cumulative reward achievable from a given (state, action) pair at time step \(t\).
6 http://www.cs.cmu.edu/~10601b/slides/MDP_RL.pdf
7 https://www.dropbox.com/s/wekmlv45omd266o/deep_rl_intro.pdf?dl=0
The optimal Q-value function obeys an important identity known as the Bellman Equation:

\[ Q^{*}(s_t, a_t) = \mathbb{E}\left[ R^{a_t}_{s_t} + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1}) \;\middle|\; s_t, a_t \right] \]

Intuitively, the optimal strategy at the current time step is to select the action that maximizes the expected value of the reward \(R^{a_t}_{s_t}\) plus the discounted optimal value \(\gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})\) at the next time step \(t+1\). This relies on knowing the optimal value \(Q^{*}(s_{t+1}, a_{t+1})\) for each possible action at \(t+1\), which guides decision-making toward the best outcomes.
Our goal is to compute the state-action value functions and derive \(Q^{*}\) by maximizing over them. Many algorithms for computing state-action values rely on estimating the function by using the Bellman equation as an iterative update.
Value iteration algorithms converge to the optimal action-value function \(Q^{*}\) as the number of iterations goes to infinity. However, this basic approach is impractical because it does not scale: it requires computing \(Q(s, a)\) for every possible state-action pair during updates. In scenarios with extensive state spaces, such as card games, enumerating the entire state space is computationally infeasible. Consequently, using a function approximator to estimate the action-value function is the more common and efficient practice.
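To make the iterative Bellman update concrete, the sketch below shows plain tabular Q-learning; the environment interface (reset/step, a discrete action count) and the hyperparameters are assumptions for illustration, not part of this thesis's implementation.

```python
# Minimal tabular Q-learning sketch (hypothetical env with reset()/step(), hashable states).
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    q = defaultdict(float)                       # Q(s, a), defaults to 0
    actions = list(range(env.n_actions))         # assumed discrete action space
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: q[(s, a_)])
            s_next, r, done = env.step(a)
            # iterative Bellman update toward r + gamma * max_a' Q(s', a')
            target = r if done else r + gamma * max(q[(s_next, a_)] for a_ in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
    return q
```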
Fictitious Self-play
Fictitious play is a game-theoretic model centered on self-play learning, where players select best responses to their opponents' average strategies. This process leads to the convergence of players' average strategies toward Nash equilibria, where each player's strategy is the best response to the others. By facilitating the approximation of best responses and the updating of average strategies, fictitious play proves to be highly applicable in machine learning contexts.
In 2015, Johannes Heinrich introduced Fictitious Self-play (FSP), a sample-based and machine-learning-based class of algorithms that applies fictitious play to extensive-form games.
In the FSP framework, the best response is determined through reinforcement learning, while the average strategy is refined using supervised learning. The FSP agent maintains two datasets: \(M_{RL}\), which captures experience transition tuples for reinforcement learning, and \(M_{SL}\), which records the agent's own behavioral experiences for supervised learning. The agent uses the \(M_{RL}\) dataset to approximate an MDP, allowing Q-learning to derive an approximate best response. This response is then added to the \(M_{SL}\) dataset, which is used to update the agent's average strategy through supervised classification.
Supervised Learning
Multilayer perceptron
A multilayer perceptron (MLP) is a type of artificial neural network comprising three essential kinds of layers: the input layer, hidden layers, and the output layer. The input layer receives the data to be processed, while the output layer delivers the results, such as classifications. The hidden layers, which can vary in number, serve as the computational engine that drives the processing of information within the network.
In an MLP, as in other feed-forward neural networks, data flows in the forward direction from the input layer to the output layer [10].
The output \(a^{(i)}\) of layer \(L = i\) is calculated by:

\[ a^{(i)} = f\left(\theta^{(i)T} a^{(i-1)} + b^{(i)}\right) \qquad (2.9) \]
In (2.9), \(\theta^{(i)}\) denotes the weights of layer \(i\), \(a^{(i-1)}\) is the output of the previous layer, \(b^{(i)}\) is the bias, and \(f\) is the activation function. Among various activation functions such as sigmoid, tanh, and ReLU, the Rectified Linear Unit (ReLU) is the most widely used due to its effectiveness in addressing the vanishing gradient problem and its computational simplicity.
After completing the feed-forward pass and obtaining the output layer results, the MLP calculates the loss and updates its weights through gradient descent to minimize this loss. The weight update is \(\theta \leftarrow \theta - \alpha \frac{\partial L}{\partial \theta}\) (2.10), where \(\theta\) denotes the weights, \(\alpha\) is the learning rate, and \(\frac{\partial L}{\partial \theta}\) is the gradient of the loss with respect to the weights.
The learning rate \(\alpha\) is one of the most crucial hyperparameters in a neural network, as it significantly influences the model's performance. A learning rate that is excessively high can prevent the loss from decreasing, while one that is too low makes training converge very slowly. A good learning rate is therefore the largest value that avoids divergence. To find one, a common approach is to start with a larger value and gradually reduce it if the network begins to diverge. Additionally, artificial neural networks commonly use the backpropagation algorithm to make the derivative calculations in gradient-based algorithms efficient.
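As a toy illustration of equation 2.9 and the weight update above, the NumPy sketch below runs one forward pass and one gradient-descent step for a two-layer MLP; the layer sizes and squared-error loss are arbitrary choices for illustration, not the networks used later in this thesis.

```python
# Tiny MLP: forward pass a^(i) = f(theta^(i)T a^(i-1) + b^(i)) and one SGD step (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                     # input a^(0)
y = rng.normal(size=(2, 1))                     # target output

theta1, b1 = rng.normal(size=(4, 8)), np.zeros((8, 1))   # layer 1 weights and bias
theta2, b2 = rng.normal(size=(8, 2)), np.zeros((2, 1))   # layer 2 weights and bias
relu = lambda z: np.maximum(z, 0.0)             # activation function f

# Forward propagation
a1 = relu(theta1.T @ x + b1)
a2 = theta2.T @ a1 + b2                         # linear output layer
loss = 0.5 * np.sum((a2 - y) ** 2)

# Backpropagation through the computational graph, then theta <- theta - alpha * dL/dtheta
alpha = 0.01
d_a2 = a2 - y
d_theta2, d_b2 = a1 @ d_a2.T, d_a2
d_a1 = (theta2 @ d_a2) * (a1 > 0)               # ReLU derivative mask
d_theta1, d_b1 = x @ d_a1.T, d_a1
theta2 -= alpha * d_theta2; b2 -= alpha * d_b2
theta1 -= alpha * d_theta1; b1 -= alpha * d_b1
```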
Given the computational graph shown in Figure 2.7, which computes output \(f\) from three inputs \(a\), \(b\), \(c\), a useful property of the graph is that we can compute the derivative of \(f\) with respect to \(a\) by applying the approach shown in equation 2.11.

Figure 2.6: Multilayer perceptron.
The computational graph serves as the mathematical foundation for backpropagation. By representing the forward propagation process as a computational graph, we can effectively calculate the derivative of the loss function \( L \) with respect to the parameters \( \theta \), as outlined in equation 2.10.
Using the Sigmoid activation function alongside the Cross-entropy loss function significantly enhances computational efficiency, leading to faster training speeds for the network.
In reinforcement learning, multilayer perceptrons are used to build imitation models that leverage current state information to select the most advantageous actions from a defined action space. This process closely resembles classification, which is a prevalent application of MLPs.
Figure 2.7: An example of a computational graph.
Recurrent neural network
In recent years, deep learning has advanced significantly and is now widely applied across various data types. Different architectures have emerged to accommodate the unique characteristics of these data, including recurrent neural networks (RNN), convolutional neural networks (CNN), and deep neural networks (DNN). While CNNs and DNNs excel at processing spatial data, they struggle with temporal information, making the RNN the leading choice for research focused on sequence data.

The dominance of Recurrent Neural Networks (RNNs) on sequence data can be attributed to their cyclic connection architecture, which enables the model to update its current state based on both previous states and current input data. This capability makes RNNs particularly effective for modeling sequential data.
The architecture of a Recurrent Neural Network (RNN) and its forward computation process are depicted in Figure 2.9. At each time step \( t \), the activation \( a_t \) and output \( y_t \) are calculated as \( a_t = g_1(W a_{t-1} + U x_t + b_a) \) and \( y_t = g_2(V a_t + b_y) \), where \( W, U, V, b_a, \) and \( b_y \) are coefficients shared across time steps, and \( g_1 \) and \( g_2 \) are activation functions. The activation \( a_t \) is then used to compute the subsequent activation \( a_{t+1} \) at the next time step.
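A direct transcription of these two equations (with arbitrary sizes, and tanh and softmax standing in for \(g_1\) and \(g_2\)) is sketched below.

```python
# One forward step of a vanilla RNN: a_t = g1(W a_{t-1} + U x_t + b_a), y_t = g2(V a_t + b_y).
import numpy as np

rng = np.random.default_rng(1)
hidden, inp, out = 16, 8, 4                           # arbitrary sizes for illustration
W = rng.normal(size=(hidden, hidden))
U = rng.normal(size=(hidden, inp))
V = rng.normal(size=(out, hidden))
b_a, b_y = np.zeros(hidden), np.zeros(out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(a_prev, x_t):
    a_t = np.tanh(W @ a_prev + U @ x_t + b_a)         # W, U, V, b_a, b_y are shared across time
    y_t = softmax(V @ a_t + b_y)
    return a_t, y_t

# Unfold over a short input sequence; each a_t feeds the computation of a_{t+1}.
a_t = np.zeros(hidden)
for x_t in rng.normal(size=(5, inp)):
    a_t, y_t = rnn_step(a_t, x_t)
```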
To accommodate diverse data characteristics and output requirements, several types of Recurrent Neural Networks exist. As illustrated in Figure 2.10, there are five distinct types of RNNs, each with applications across various domains of deep learning, as detailed in Table 2.2.

While Recurrent Neural Networks are effective for processing sequence data, they face significant challenges. The reliance on previous states for training leads to slow and complex computations, and attempts to exploit weight sharing have not fully mitigated these limitations.
Figure 2.9: RNN and the unfolding in time of the computation involved in its forward computa- tion [1].
Type of RNN              | Typical application example
One to One               | Traditional neural network
One to Many              | Music generation
Many to One              | Sentiment classification
Many to Many (Tx ≠ Ty)   | Machine translation
Many to Many (Tx = Ty)   | Named entity recognition

Table 2.2: Applications of RNN.

In addition, one of the most severe disadvantages of RNNs is the vanishing gradient and exploding gradient problems [17].
Imitation learning significantly broadens the application of recurrent neural networks (RNN) in reinforcement learning by framing the core task of mimicking an object's behavior as a sequential prediction problem. This approach has led to impressive outcomes, demonstrating the effectiveness of RNNs in imitation learning scenarios.
Game Engines
Unity
Unity is a 3D/2D game engine and powerful cross-platform IDE for developers.
The Unity game engine offers essential built-in features such as physics simulation, 3D rendering, and collision detection, enabling developers to focus on game design without the need to create new systems from scratch. This allows for efficient handling of complex calculations, including material movements, light interactions, and HUDs that adapt to various screen resolutions.
Unity serves as a comprehensive Integrated Development Environment (IDE) that provides all the essential tools for complete game development in a single platform. Its visual editor enables creators to drag and drop elements into scenes while adjusting their properties. Additionally, Unity offers a wide array of powerful features and tools, for example to manage folders in the project, to create animations via a timeline tool, or to edit variables' values directly from its UI.

Figure 3.1: Unity Editor user interface.
Unity Engine uses C# to handle code and logic, with its own namespaces that help make building the game much easier.
The engine uses Roslyn as its compiler, and users can opt for Mono or IL2CPP as their project's scripting backend. The former uses just-in-time (JIT) compilation, the latter uses ahead-of-time (AOT) compilation, which transforms C# code into Intermediate Language (IL) and then converts it to C++ before generating native binary files such as exe, apk, or xap. Using IL2CPP can enhance the performance, security, and platform compatibility of Unity projects.
The engine is designed for both community and commercial use, featuring a finely structured internal architecture. Its core systems, including animation, physics, and graphics, are developed to operate seamlessly and reliably.

Unity's paid versions include a powerful profiler tool that monitors each thread within a game, analyzing independent tasks such as script execution, logic updates, physics simulations, and graphics calculations. This tool enables developers to identify and optimize tasks that slow down frame draw calls. While optimization is also possible in the free version, it requires manual intervention for each issue.
Some advantages of Unity are:
• Extensive features and tools, enabling users to efficiently build their games without the need to reinvent or construct low- or intermediate-level subsystems.
• Good for prototyping games. There is an asset store with many pre-made assets, ranging from low-poly models to detailed explosion particle effects, or even a completed scene.
• A diverse range of modular libraries, equipped with essential modules that can be easily extended through well-implemented and tested packages such as AR, Shader Graph, and the Universal Render Pipeline. These tools serve as ideal extensions, allowing developers to concentrate on product creation rather than dealing with middleware complexities.
• A supportive community that plays a crucial role in maintaining the engine by providing updates, resolving issues, and reporting both major and minor bugs in new releases. Community members also contribute by creating valuable assets and modular libraries, many of which have been integrated into the official engine.
Besides some advantages that have been listed, Unity game engine:
• Can be relatively slow at simulating the game, as each component is fully implemented. When the engine is not stripped down and the build lacks optimization, it may take additional milliseconds to render each frame on the screen.
• Cannot be customized, similar to other commercial engines, Unity’s true implementation is hidden from the users.
Cocos Creator
Cocos2d-x is a multi-platform framework for building 2D games, interactive books, demos and other graphical applications. It is based on cocos2d-iphone, but instead of using Objective-C, it uses C++. It works on iOS, Android, macOS, Windows and Linux.

1 https://www.androidauthority.com/what-is-unity-1131558/

2 https://github.com/dotnet/roslyn
Cocos2d-x offers a feature set comparable to Unity, including scene management, effects, transformation actions like moving, rotating, and scaling, as well as a particle system, font support, and input management. Additionally, it supports adaptive resolution and uses C++ as its primary programming language, with bindings available for Lua and JavaScript.
De facto, Cocos2d-x does not come with an IDE; to debug a program, users must use CMake to generate the corresponding project files. For Android platform projects, users can use Android Studio.
Cocos Creator is an all-in-one game development tool that facilitates the creation of complete games across multiple platforms, including iOS, Android, PC, and HTML5. Built on the Cocos2d-x framework, it features a built-in Integrated Development Environment (IDE) that allows users to design user interfaces (UI) directly within the platform.

The 2.x version primarily focuses on 2D platform games with limited support for 3D, while the latest 3.x version aims to cater to both 2D and 3D games, although many 3D features are still not fully integrated. Both versions use JavaScript and TypeScript as their core programming languages, optimized for performance.

When developing a game using Cocos Creator, the project runs on a pure JavaScript engine runtime for web and mini-game platforms. This ensures compatibility with TypeScript, as the engine seamlessly converts TypeScript code into JavaScript. For native platforms, the framework is built in C++ to enhance runtime efficiency.

Cocos Creator is a lightweight, open-source game engine that allows developers to easily customize and modify components to meet project requirements. It enables quick script compilation and scene simulation, ensuring low resource consumption and a small installation package size.

Great documentation makes the engine easy to use; additionally, it helps users understand each component's behavior for the customization process mentioned above.
3 https://github.com/cocos2d/cocos2d-x
3 https://docs.cocos2d-x.org/cocos2d-x/v4/en/about/getting_started.html
As section 3.1.2 pointed out, the Cocos Creator engine has/is:
• Lightweight IDE: compiling scripts can be done fast, and previewing games has low latency even though all the built-in components are not stripped and optimized.
• Straightforward, users can learn to use the engine while they are working with it, most of its features are widely used and well-documented.
• Customizable, any execution can be re-arranged, or added to the core engine.
• Highly supported by the communities: both the engine community and the language (JavaScript & TypeScript) communities are active and supportive, and answers to most of the common questions and issues can be found online.
Even though the engine is good in some aspects, still, some disadvantages are:
• Despite its supportive community and user-friendly interface, Cocos Creator is not the ideal choice for game prototyping, due to the significant time investment required to create numerous elements from scratch.

• Certain features may be incomplete or unavailable, and the networking, graphics, or physics simulation systems sometimes contain poorly structured code. In these cases, specific issues might be papered over with a simple null return or default value. While customization can rectify these problems, identifying and resolving them can be challenging for basic and intermediate users.

• Using certain scripting features alongside the Cocos Creator editor can be challenging, particularly with serializable classes, which may require extensive debugging to display correctly. Additionally, users may encounter inconsistent error-catching between the text editor (such as Visual Studio Code, Atom, or Notepad++) and the Cocos Creator editor itself.
• Sometimes, it is hard to fix a linking error or eliminate a warning, as those messages in Cocos Creator can be undescriptive.
Summary
Cocos Creator is an excellent option for users seeking a lightweight, free game engine that allows for purpose-specific development without the need to build from the ground up. However, for developers creating complex, high-performance games that require highly optimized frameworks and comprehensive functionality, it may be more beneficial to use custom-built engines or commercial alternatives that ensure all components are fully integrated and operational.

Unity is an excellent platform for game development, offering a wide range of free and paid pre-made assets in its store. While integrating third-party libraries may pose challenges, Unity's package managers and extensions simplify the process, making it manageable. Although Unity and other engines share several features, key differences can influence the final decision on which engine to use.

Creating a game involves substantial investment in both logic and graphics. Developing graphics from scratch, including 2D textures, 3D models, and visual effects, can be time-consuming, especially for non-professionals. To streamline this process, Unity's asset store offers a valuable way to acquire high-quality graphics efficiently.

This thesis emphasizes speed in game development, prioritizing rapid prototyping and construction over uniqueness. We aim to select an engine that facilitates quick development, leveraging existing tools and available assets to streamline the process.

The Unity engine is the one in question. It meets our requirements well, and with some prior experience, using Unity will speed up the building process.
Games
Yu-Gi-Oh!
Yu-Gi-Oh! was first a manga (comics) in Japan; the original series was serialized in the Weekly Shounen Jump magazine from September 1996 to March 2004. The highly successful manga laid the groundwork for a vast media franchise that encompasses various spin-off manga and anime series, a trading card game, and an array of video games, prominently featuring the Yu-Gi-Oh! Power of Chaos series.
In the digital versions of Yu-Gi-Oh!, players take the role of characters and craft unique card decks that blend Monster Cards, Spell Cards, and Special Summon Cards to devise strategic gameplay. To enhance their decks, players battle other characters, aiming to collect additional cards. In contrast to Hearthstone's minions, monsters are the primary card type in Yu-Gi-Oh!; each monster is characterized by its name, level, attribute, type, and specific attack and defense points.
The Yu-Gi-Oh! ruleset is quite similar to the Hearthstone ruleset, except that it splits a turn into small phases: Draw Phase, Standby Phase, Main Phase 1, Battle Phase, Main Phase 2 and End Phase. In a Duel, players take actions aimed at the ultimate goal of the match: reducing their opponent's Life Points (LP) to zero or leaving them unable to draw a card when required.
The game's rules and interactions are varied and deeply connected; detailed information can be found in the Yu-Gi-Oh! Trading Card Game rulebook index.
4 https://www.yugioh-card.com/en/rulebook/index.html
Figure 3.4: Yu-Gi-Oh! Power of Chaos: Yugi The Destiny gameplay.
The renowned manga greatly influenced the game's development and adaptation: by incorporating familiar characters, it allows players to quickly acclimate to the gameplay. This familiarity encourages players to explore and creatively engage with new playstyles at their own pace.

Both the digital and real-life versions of Yu-Gi-Oh! regularly introduce new cards, enhancing the game's diversity. These new additions not only offer fresh strategies but also create counter moves for both existing and newly released cards.

Yu-Gi-Oh! features gameplay mechanics that are less common in modern card games, including the ability for monsters to be placed in either Attack or Defense Position. Additionally, when a monster attacks, any excess damage can be inflicted on the opponent or their monsters, and the game also includes mechanics for monster evolution.

The immersive experience is significantly enhanced by character dialogues, highlighting that enjoyment stems not just from code but also from sound and graphics. This is where the Yu-Gi-Oh! series truly excels, offering players a richer and more engaging gameplay experience.

In the latest version, the number of cards is over 10,000; this large card set is a great reference for our thesis' game.

While rich interactions and game elements enhance the player experience, they can pose significant challenges on the technical side. In the context of this thesis' experimental environment, an excessive number of interactions or cards would increase the time spent on implementation, bug fixing, and testing.
Hearthstone is a digital turn-based card game, in which one assembles a deck containing 30 cards and then plays against an opponent and their deck. It is a combinatorially difficult, discrete, imperfect-information game, featuring many complex interactions. Each moment of the game is a game state.

5 https://www.konami.com/yugioh/lotd_le/asia/en/

Figure 3.5: An example of Hearthstone battlefield interface.
In this game, each player selects one of nine unique heroes, each equipped with distinct powers. At the start, every hero has 30 health points (HP). The primary objective is to deplete the opponent's hero HP to zero, thereby securing victory.

Players utilize two types of cards: minions and spells. Each player has a hand, which consists of cards that can be played during their turn using a resource known as mana. Once played, minions from a player's hand appear on the board, where they serve their primary roles of attacking and defending. Minions possess an attack value, their own health points (HP), and may also have special properties that enhance their abilities.

Spell cards represent effects that can impact a player's board, hand, or hero upon being played. Certain heroes feature unique interactions when casting spell cards in succession, while others rely on spells as their primary strategy for achieving victory in the game.

At the start of each turn in Hearthstone, players draw a random card from their deck to add to their hand. They can then play cards or use their hero's power, allowing for strategic moves. Players can also use their minions to attack the opponent's minions or hero. Winning a Hearthstone match hinges on effectively managing the heroes' health points (HP) and optimizing the resources available on the board and in hand.

Hearthstone, derived from the iconic Warcraft franchise, expands the beloved Warcraft universe in a unique and engaging manner. As a game cherished by multiple generations, Hearthstone captures the essence of its characters, bringing them to life with distinct keyword abilities that enhance gameplay. This design not only enriches the player's experience but also makes the game's dynamics easier to understand.
6 https://hearthstone.gamepedia.com/File:Ui-guide-small.png
In Hearthstone, the hero avatar can inflict damage through a unique hero ability that targets enemy minions or the enemy hero directly. Additionally, players can equip Weapon cards, enabling their heroes to attack while adhering to the game's damage rules, where both characters deal damage to one another. The game's success is further supported by its modern visuals and impressive graphic effects, which are crucial elements in attracting players.

Average match length is roughly 10 to 20 minutes; the gaming trend now favors quick games rather than long ones.

Hearthstone features simplified mechanics, yet its interactions offer a diverse and engaging player experience. However, as highlighted in section 3.2.1.2, developing a game like Hearthstone would require a significant investment.

Both Yu-Gi-Oh! and Hearthstone share the same spirit of card games, with great illustrations, visual effects, sounds, and attractive gameplay with minimal controls.
Reinforcement learning approaches
Deep Q-network
In 2.2.2, we mentioned that it is more common to use an approximator to estimate the Q-value function rather than computing every action-state tuple \((s, a)\) at a time step. Deep Q-network (DQN) is a Q-learning-based algorithm that implements a neural network \(\theta\) as the approximator. The optimal Q-value function can then be redefined by:

\[ Q^{*}(s, a) \approx Q(s, a; \theta) \qquad (3.1) \]
7 https://hearthstone.gamepedia.com/Weapon
8 http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf
In DQN, the network can be trained by minimizing a sequence of loss functions \(L_t(\theta_t)\) that changes at each time step \(t\),

\[ L_t(\theta_t) = \mathbb{E}\left[ \left( y_t - Q(s_t, a_t; \theta_t) \right)^2 \right] \qquad (3.2) \]

where

\[ y_t = \mathbb{E}\left[ R^{a_t}_{s_t} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \;\middle|\; s_t, a_t \right] \qquad (3.3) \]

is the target for time step \(t\). Differentiating the loss function with respect to the network weights \(\theta\) gives:

\[ \nabla_{\theta_t} L_t(\theta_t) = \mathbb{E}\left[ \left( y_t - Q(s_t, a_t; \theta_t) \right) \nabla_{\theta_t} Q(s_t, a_t; \theta_t) \right] \qquad (3.4) \]
In practice, stochastic gradient descent is commonly employed to compute the expectation in the gradient when optimizing the loss function. The weights \(\theta\) are updated at each time step, with the expectation replaced by individual samples \((s_t, a_t, r_t, s_{t+1})\).
Two key elements of Deep Q-Networks (DQN) are the target network and experience replay. The target network, with parameters \(\theta^{-}\), is used to compute the Q-value for the next time step; to keep learning stable, it is only updated from the primary network \(\theta\) every \(\tau\) time steps. With it, the target used in DQN is defined as

\[ y_t = \mathbb{E}\left[ R^{a_t}_{s_t} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^{-}) \;\middle|\; s_t, a_t \right] \qquad (3.5) \]

which enhances learning efficiency and stability.
Experience replay enhances algorithm performance by storing observed transitions in a replay memory and sampling uniformly from this data to update the network. These two components significantly boost the effectiveness of the learning process.
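The sketch below shows how the loss (3.2), the target network, and experience replay fit together for a small fully connected Q-network; the layer sizes, optimizer, and use of PyTorch are illustrative assumptions, not the exact setup used later in this thesis.

```python
# Sketch of a DQN update with experience replay and a target network (PyTorch).
import random
from collections import deque
import torch
import torch.nn as nn

n_states, n_actions, gamma = 32, 8, 0.99
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())        # theta^- starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                          # replay memory of (s, a, r, s', done) tuples

def train_step(batch_size=32):
    s, a, r, s_next, done = zip(*random.sample(replay, batch_size))
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    # Target y_t = r + gamma * max_a' Q(s', a'; theta^-), kept out of the gradient (eq. 3.5).
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)             # squared error of eq. 3.2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Every tau steps, copy the online weights theta into the target network theta^-.
    target_net.load_state_dict(q_net.state_dict())
```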
DQN is a powerful method for estimating Q-values, addressing the limitations of traditional Q-tables used in Q-learning. While Q-tables map state-action pairs to stored Q-values, they pose significant storage challenges for large problems. DQN overcomes this by employing a neural network to approximate Q-values, eliminating the need for extensive Q-table storage. This not only conserves memory but also improves execution speed by avoiding the exploration of every game state needed to build the table.
Many Q-learning algorithms, such as DQN, face the challenge of overestimating the action-value function. This occurs because Q-learning tends to learn excessively high state-action values: it relies on maximizing over estimated values, which yields unrealistically high estimates.
In Q-learning, the value estimate is updated using a greedy target \( y = E[R \mid s, a] + \gamma \max_{a'} Q(s', a') \). However, when estimation errors \( \epsilon \) are present, the maximum is likely to be overestimated, since \( E[\max_{a'} (Q(s', a') + \epsilon)] \geq \max_{a'} Q(s', a') \). For example, if we aim to estimate a value whose true number is 2 with an error of \( \epsilon = 1 \), obtaining estimates of 1 and 3 would lead us to select 3 as the maximum, despite both estimates being equally inaccurate. This consistent overestimation bias, caused by \( \epsilon \), propagates through the Bellman equation, presenting a significant challenge since errors from function approximation are unavoidable.
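The toy simulation below reproduces this effect numerically: with zero-mean noise added to equal true values, the average of the noisy maxima sits clearly above the true maximum (the values and noise range are made up for illustration).

```python
# Empirical illustration of max-operator overestimation: E[max_a(Q + eps)] >= max_a Q.
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([2.0, 2.0, 2.0])                        # true action values, all equal to 2
noisy = true_q + rng.uniform(-1, 1, size=(100_000, 3))    # zero-mean estimation error eps

print("true max:", true_q.max())                          # 2.0
print("average noisy max:", noisy.max(axis=1).mean())     # roughly 2.5, i.e. overestimated
```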
Double Deep Q-Network
The target \(Q^{*}(s_t, a_t)\) of DQN in (3.5) can be rewritten as:

\[ y_t = R^{a_t}_{s_t} + \gamma\, Q\!\left(s_{t+1}, \arg\max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^{-}); \theta^{-}\right) \qquad (3.6) \]

In (3.6), DQN uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in the overestimation bias. To prevent it, we can separate the selection and evaluation processes. This is the intuition behind Double Deep Q-Network (Double DQN) [20].
In Double DQN, two sets of weights, \(\theta_t\) and \(\theta'_t\), are utilized; one set determines the greedy policy and the other evaluates its value. This allows the value of the greedy policy to be estimated using \(\theta_t\), while the second network \(\theta'_t\) provides a fairer assessment of that value, enhancing overall performance compared to standard DQN.
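Continuing the hypothetical DQN sketch above, the only change is in how the target is formed: the online network selects the greedy action and the second network evaluates it.

```python
# DQN target vs Double DQN target for a batch of next states s_next (PyTorch).
# Reuses the illustrative q_net, target_net, r, gamma, s_next from the earlier sketch.
import torch

with torch.no_grad():
    # DQN: the same (target) network both selects and evaluates the action.
    dqn_target = r + gamma * target_net(s_next).max(dim=1).values

    # Double DQN: q_net (theta_t) selects the greedy action, target_net (theta'_t) evaluates it.
    greedy_a = q_net(s_next).argmax(dim=1, keepdim=True)
    double_dqn_target = r + gamma * target_net(s_next).gather(1, greedy_a).squeeze(1)
```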
Double DQN enhances the stability and accuracy of the state-action value function by effectively separating action selection from action evaluation. As illustrated in Figure 3.6, taken from [20], this separation contributes to a more reliable approximation, showcasing the improved performance of Double DQN in reinforcement learning.
In Figure 3.6, the DQN algorithm often exhibits significant over-optimism about the value of its current greedy policy, as can be seen by comparing the orange learning curves in the top row to the straight orange lines. Conversely, Double DQN is more stable and achieves superior scores in the two games presented at the bottom.
Double DQN effectively addresses the overestimation bias but still demands significant training time. Fortunately, advancements in hardware, particularly GPUs, can mitigate this drawback and speed up the training process.
Figure 3.6: Value estimates for six Atari games, showing the performance of DQN (in orange) compared to Double DQN (in blue). The results were obtained with six different random seeds and the hyper-parameters from [22]; the darker line shows the median and the shaded area covers the extreme values using linear interpolation. The straight horizontal lines in the top row (orange for DQN, blue for Double DQN) show the average discounted return actually obtained from each state after the learning phase; these lines would coincide with the learning curves if no bias were present. The middle row shows, on a log scale, the extreme overoptimism of DQN in two games, while the bottom row shows how this overestimation hurts the agent's scores during training, which decline as the overestimations grow. Learning with Double DQN is significantly more stable.
Neural Fictitious Self-play
Neural Fictitious Self-play (NFSP) combines Fictitious Self-play (FSP) with Deep Q-Networks (DQN), allowing an agent to learn from interactions with other agents. It maintains two types of memories, \(M_{RL}\) and \(M_{SL}\), which store game transitions and the agent's best-response behavior, respectively. The agent uses data from \(M_{RL}\) to train a neural network \(Q(s, a \mid \theta^{Q})\) with an \(\varepsilon\)-greedy strategy, approximating its best-response strategy \(\beta = \varepsilon\text{-greedy}(Q)\). It also trains a separate network \(\Pi(s, a \mid \theta^{\Pi})\) to imitate its own past best responses through supervised classification on data from \(M_{SL}\), mapping states to action probabilities and defining the average strategy \(\pi = \Pi\). At each time step, the agent selects actions from a mixture of \(\beta\) and \(\pi\), utilizing anticipatory dynamics and reservoir sampling to minimize sample correlation. The algorithm's pseudo-code is detailed in Algorithm 1.
Algorithm 1: Neural Fictitious Self-play (NFSP) with fitted Q-learning [9]

Initialize game Γ and execute an agent via RUNAGENT for each player in the game
function RUNAGENT(Γ)
    Initialize replay memories M_RL (circular buffer) and M_SL (reservoir)
    Initialize average-policy network Π(s, a | θ^Π) with random parameters θ^Π
    Initialize action-value network Q(s, a | θ^Q) with random parameters θ^Q
    Initialize target network parameters θ^Q′ ← θ^Q
    for each episode do
        Set policy σ ← ε-greedy(Q) with probability η, or Π with probability 1 − η
        Observe initial information state s_1 and reward r_1
        for each t ∈ [1, T] do
            Sample action a_t from policy σ
            Execute action a_t in the game and observe reward r_t and information state s_{t+1}
            Store transition (s_t, a_t, r_t, s_{t+1}) in reinforcement learning memory M_RL
            if agent follows best response policy σ = ε-greedy(Q) then
                Store behavior tuple (s_t, a_t) in supervised learning memory M_SL
            end
            Update θ^Π with stochastic gradient descent on the negative log-likelihood loss over M_SL
            Update θ^Q with stochastic gradient descent on the Q-learning loss over M_RL
            Periodically update target network parameters θ^Q′ ← θ^Q
        end
    end
end function
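Two details that the pseudo-code above only names are the anticipatory mixture between the best response β and the average strategy π, and the reservoir buffer backing M_SL; a minimal sketch of both follows (the function names and the anticipatory parameter value are hypothetical).

```python
# Sketch of NFSP's mixed action selection and reservoir sampling for M_SL.
import random

class ReservoirBuffer:
    """Keeps a uniform random sample of all (s, a) behaviour tuples seen so far."""
    def __init__(self, capacity):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, item):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            j = random.randrange(self.seen)          # classic reservoir sampling
            if j < self.capacity:
                self.data[j] = item

def select_action(state, best_response, average_policy, eta=0.1):
    """With probability eta act greedily w.r.t. Q (beta); otherwise follow the average policy pi."""
    use_best_response = random.random() < eta
    action = best_response(state) if use_best_response else average_policy(state)
    return action, use_best_response                 # the flag decides whether (s, a) goes into M_SL
```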
NFSP is the first end-to-end reinforcement learning approach that finds approximate Nash equilibria in imperfect-information games without requiring prior knowledge. By integrating reinforcement learning with supervised learning, NFSP simplifies the learning process and makes a deep understanding of game theory unnecessary. This accessibility leads to a reduced learning curve for mastering the concept and implementing the model effectively.

NFSP struggles in games with extensive information spaces: the complexity of the environment makes opponents' strategies more intricate and best responses harder to predict. Additionally, NFSP's reliance on DQN for calculating the best response leads to lengthy convergence times, hindering performance.
Remarks
In reinforcement learning, methods such as DQN and Double DQN are considered memoryless, as they rely solely on the current state to compute Q-values, rendering previous states irrelevant. In contrast, NFSP offers a more comprehensive approach, blending best responses with the agent's own past experience to form an average policy. This allows NFSP agents to employ heuristics similar to human problem-solving, resulting in superior performance compared to DQN and Double DQN in specific games.

Despite NFSP's challenges with large-scale games, our game's simplicity allows us to set these issues aside. Therefore, we choose to use NFSP to develop our agent.
Summary
In this chapter, we have listed our considerations of game engines, referenced games and reinforcement learning approaches for the thesis's problem.
Cocos Creator and Unity are both highly supported by their respective communities, and each offers unique advantages for game development. Cocos Creator excels in customization, making it suitable for various game genres, while Unity provides a more structured environment with extensive assets and mature tools. Ultimately, we decided to use Unity for the development of our base game.

Hearthstone and Yu-Gi-Oh! are renowned for their engaging gameplay and intricate card mechanics. While Hearthstone captivates players with its stunning graphics and visual effects, Yu-Gi-Oh! offers a variety of unique gameplay features. Drawing inspiration from both games, we aim to develop our own design that incorporates their essential elements while maintaining a manageable complexity for smart-agent experimentation. To ensure a balanced experience, the number of cards in each deck will be fixed, preventing excessively large game states. Additionally, we plan to keep the average game duration between 10 and 20 minutes for optimal playability.

In our analysis of reinforcement learning methods, we anticipate that the Neural Fictitious Self-Play (NFSP) approach will outperform other strategies in our game, despite its challenges with large-scale scenarios. Consequently, we will implement NFSP to develop our intelligent agent. Additionally, we will create Q-learning-based agents to provide the reinforcement learning element of NFSP and for evaluation purposes.
In this chapter, we present a smart agent using the NFSP method. Furthermore, the game, which is called "Quest of the Divinity", is described in detail.
System Architecture
Application Programming Interface
At the beginning of a match, the game environment transmits state information to the agent via a socket server. Upon receiving this state information, the agent calculates the reward using predefined functions and predicts the optimal policy. The agent then selects the best action based on this policy and sends it back to the game environment, which executes the action, resulting in a new game state. This updated state information is sent back to the agent, which stores the transition tuple in its replay memory, and the process continues.
In order to serve information transition task, we design two main APIs:
1 GetEnvironmentState: the agents call this API to retrieve environment state from the game session.
2 SendAction: computed action from agent will be sent to the game.
Communication Server
The intermediate server, or communication server, is responsible for transferring data from the game to the agent, and for transferring the agent's decision back to the game.
The server went through several approaches and programming languages, with an initial focus on integrating communication directly into the game. Building a server from the ground up proved to be a challenging and time-consuming process. We then attempted to utilize an HTTP server package; however, it fell short of our needs as it lacked the capability to maintain persistent connections, which was a critical requirement for our server.
We opted to separate the server from both the game and the agent. To expedite server development, we selected Python for its simplicity, as the architecture primarily focuses on data transfer and maintaining live connections. Consequently, the communication server is designed as a socket server, which handles the following tasks:
• Accepting connections from one train client and one game client.
• Sending byte streams; the APIs from section 4.1.1 are encoded into byte streams with defined prefixes:
– GET-ENV-STATE for sending an environment state request from the train client to the game client.
– ENV-STATE for sending the environment state result from the game client to the train client.
– ACTION for sending the computed action from the train client to the game client.
– GAME-OVER for signaling the end of a game from the game client to the train client.

Environment states are formatted in JSON for efficient parsing into training data, while action codes are sent as plain strings and then converted into numeric data types. Since all data transfer occurs on the local computer, no encryption or decryption is needed.
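A minimal sketch of the train-client side of this protocol is shown below; the host, port, and newline framing are assumptions for illustration, while the message prefixes are the ones listed above.

```python
# Minimal train-client sketch for the communication server protocol described above.
import json
import socket

HOST, PORT = "127.0.0.1", 9000                        # assumed local address of the socket server

def request_state(sock: socket.socket) -> dict:
    """GetEnvironmentState: request the current environment state and parse the JSON reply."""
    sock.sendall(b"GET-ENV-STATE\n")
    reply = sock.recv(65536).decode()
    if reply.startswith("GAME-OVER"):
        return {"game_over": True}
    return json.loads(reply[len("ENV-STATE"):])       # reply format: ENV-STATE<json payload>

def send_action(sock: socket.socket, action: int) -> None:
    """SendAction: transmit the computed action code as a plain string."""
    sock.sendall(f"ACTION{action}\n".encode())

with socket.create_connection((HOST, PORT)) as sock:
    state = request_state(sock)
    if not state.get("game_over"):
        send_action(sock, 0)                          # hypothetical action code
```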
Quest of the Divinity
Game board
The game board serves as the central playing area, featuring designated spaces for key components such as the Avatars, Decks, and Board sides. In this setup, Avatar 1 is positioned at the lower end of the board while Avatar 2 occupies the upper end. Typically, the player controls Avatar 1, with their opponent managing Avatar 2.
Each player manages their own resources: life, gold, hand, deck and ground, as shown in Figure 4.3, where:

• Sections 1 and 1' are the player's and the enemy's avatars, respectively. These sections not only represent the players, but also indicate their remaining lives.

• Sections 2 and 2' are where players control their minions. Minions that are able to attack are indicated by a green outline on their avatars, while opposing minions serving as guards are marked with a yellow outline, highlighting their defensive role during gameplay.

• Sections 3 and 3' are the players' hands, where drawn cards are kept; players know which cards they hold and how many cards their opponents have, but not their details.

• Sections 4 and 4' hold one of the most important resources, the gold, which is used to play cards from the hand to the ground.
Figure 4.3: Elements of the game board.
The board contains supplementary elements that players interact with and that provide additional information: the end turn button and the timer, shown in Section 5 of Figure 4.3, where:
• The end turn button, on click, ends the current player's turn before the turn's time runs out and lets the other player make their moves.
• The timer displays the remaining time of the current turn and identifies the active player by pointing to their side at the start of the turn. Once it has rotated 180 degrees from its initial position, the turn ends; a full turn typically takes about one minute.
In case the players want to start over, they can click the reset button, which is placed next to the end turn button.
Ruleset
Just as laws are made to regulate citizens' behavior, the game's ruleset is made not only to restrict what players can do, but also to provide them with information about how the game functions.
Quest of the Divinity's rules are:
1. A card costs 1 to 8 gold to activate.
2. A card can only be one of three types: minion, spell or sacrifice.
3. A minion card belongs to at most one of five special classes, which are described in section 4.2.5.
4. A player's deck initially contains 52 cards, including 40 minion, 8 spell and 4 sacrifice cards.
5. Each player starts the game with 40 life, 1 gold and 3 cards costing 1 to 3 gold.
6. Play order is randomized for each game.
8. The player moving on even turns gets 1 additional card at the beginning of the game.
9. Players use their current gold to place cards.
10. A player's gold limit increases by one at the beginning of their turn.
11. The gold limit is capped at 10, but can temporarily exceed this cap through special effects.
12. A player's current gold is reset to its limit at the beginning of their turn.
13. At the start of each turn, the player receives a random card from the deck. The randomization follows the distribution in Table 4.1.
14. After a minion card is activated, it must wait 1 turn before it can attack.
15. When a minion attacks its target, the target also deals damage back to the minion.
The rules of the game not only dictate the gameplay but also give players an initial understanding of their environment. Key rules, such as rules 2, 4, 5, 11 and 13, shape the game dynamics and influence player decisions based on the cards they have played, their current hand, the cards on the board, and their available gold.
Table 4.1: Card randomization distribution over card costs (1-gold, 2-gold, 3-gold, 4-gold, 5-gold, 6-gold, 7-gold, 8-gold).
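As an illustration of how rules 10 to 13 could translate into turn-start bookkeeping, a minimal sketch is given below; the player object and its draw_with_cost helper are hypothetical, and the drawing weights are placeholders, since the real cost distribution is the one given in Table 4.1.

```python
import random

GOLD_CAP = 10                      # rule 11
# Placeholder weights for 1-gold ... 8-gold cards; the real values follow Table 4.1.
DRAW_WEIGHTS = [0.2, 0.2, 0.15, 0.15, 0.1, 0.1, 0.05, 0.05]

def start_turn(player) -> None:
    """Apply the per-turn rules for the player whose turn begins (sketch)."""
    # Rule 10: the gold limit grows by one each turn, capped by rule 11.
    player.gold_limit = min(player.gold_limit + 1, GOLD_CAP)
    # Rule 12: current gold is refilled to the limit.
    player.gold = player.gold_limit
    # Rule 13: draw a random card, with cost distributed as in Table 4.1.
    cost = random.choices(range(1, 9), weights=DRAW_WEIGHTS, k=1)[0]
    card = player.deck.draw_with_cost(cost)    # hypothetical helper
    if card is not None:
        player.hand.append(card)
```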
Game Controls
Card games have simple controls. In the two referenced games, Hearthstone and Yu-Gi-Oh!, the mouse is clicked, dragged and hovered, while the keyboard is rarely used. From that point, the possible controls in Quest of the Divinity are:
1. Hover over the hand zone to focus on drawn cards.
3. Hold & drag to play cards from the hand.
How To Play
Simple controls and basic rules make it easy for new players to get used to the game. The game follows these steps (a loop sketch follows the list):
1. Game start: the system randomly chooses a player to move first. Starting cards are drawn for both players.
2. The current-turn player makes moves:
• Choose a minion to attack the opponent's minions or the opponent's avatar.
3. If one of the two players' life reaches 0, the other player wins and the game terminates.
4. If the elapsed time since the turn began exceeds the time limit, or the current player chooses to end their turn, the turn switches to the other player.
5. Randomly draw a card for the current-turn player.
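Assembled into a loop, the steps above could look roughly like the sketch below; the Player objects, their methods (draw_starting_hand, make_move, draw_card) and the one-minute time handling are assumptions for illustration rather than the game's actual implementation.

```python
import random
import time

TURN_LIMIT_SECONDS = 60  # a turn lasts roughly one minute (section 4.2.2)

def play_match(player_a, player_b):
    """Run one match and return the winner (hypothetical Player objects)."""
    players = [player_a, player_b]
    random.shuffle(players)             # step 1: random move order
    for p in players:
        p.draw_starting_hand()          # step 1: starting cards for both players

    current = 0
    while all(p.life > 0 for p in players):
        player = players[current]
        turn_start = time.monotonic()
        # Step 2: the current player makes moves until time runs out or they pass.
        while (time.monotonic() - turn_start < TURN_LIMIT_SECONDS
               and not player.turn_ended):
            player.make_move(opponent=players[1 - current])
        # Step 4: switch to the other player.
        player.turn_ended = False
        current = 1 - current
        # Step 5: draw a card for the player whose turn begins.
        players[current].draw_card()

    # Step 3: the player whose opponent reached 0 life wins.
    return player_a if player_b.life <= 0 else player_b
```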
Card Design
As mentioned in section 4.2.2, each card belongs to one of three types: Minion, Spell and Sacrifice.
Each card features a front and a back design. The front displays crucial information such as cost, name, description, and a portrait that illustrates the card's effect, helping players easily recall its function. For minion cards, the front additionally shows damage and life stats, indicating the minion's strength. The card back conceals information from opponents: players see only the backs of their opponents' cards until they are activated.
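The front-of-card information could be captured by a small data structure such as the following; the field names are illustrative and do not claim to match the game's internal classes.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class CardType(Enum):
    MINION = auto()
    SPELL = auto()
    SACRIFICE = auto()

@dataclass
class CardData:
    """Front-of-card information visible to the owner (see Card Design)."""
    card_id: int
    name: str
    cost: int                      # 1 to 8 gold (rule 1)
    description: str
    card_type: CardType
    portrait_path: str             # artwork hinting at the card's effect
    damage: Optional[int] = None   # minion-only stats
    life: Optional[int] = None
```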
Spell cards give players instant effects and can be a turnaround at a critical moment; the player only pays gold to gain the instant effect or resource.
Sacrifice cards are a strategic way to enhance the team's strength or limit the opponent's actions. While sacrifices offer greater power than spells, they cost not only gold but also the loss of a minion and, often, some of the player's own life. Sacrificing a higher-cost minion yields a longer-lasting effect, whereas using a minimum-cost minion provides only a temporary advantage; on the other hand, the more expensive the sacrificed minion, the greater the damage the player must endure.
Minions serve as essential tools for both offense and defense, allowing players to strategically place cards to combat their opponents. As stated in rule 3, each minion card belongs to at most one of five distinct classes: Sage, Warrior, Abyss, Nature and Dragon. Each class represents a playstyle, correspondingly: Fending and Counterattack, Bursting damage, Aggression and Summoning, Persistence, and Conquer.
(a) A finalized card. (b) Card visual on board. (c) Card with subtype Guard on board.
Figure 4.5: Spell and Sacrifice cards overview.
Subtypes are auxiliary attributes that define minions' own powers, making them unique and advantageous within their class's playstyle and in specific situations. They are:
• Raider: minions with this subtype can bypass rule 14 and attack immediately on their first turn.
• Champion minions create special effects when they are activated.
• Pioneer minions create special effects on death.
• Warchief minions usually grant a buff to allies or a debuff to enemies while they are alive. Some also have special effects at the beginning of every turn.
• Guard minions serve as a defensive mechanism by compelling enemy minions to target them first, before ordinary minions or the player's avatar. Positioned well, guards are essential for protecting weaker minions and the avatar in battle.
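A brief sketch of how the Guard and Raider subtypes could constrain attacks is given below; the attacker and minion objects and their fields are hypothetical stand-ins for this illustration.

```python
def legal_attack_targets(attacker, enemy_boardside, enemy_avatar) -> list:
    """Return what the attacker may hit, honouring the Guard subtype (sketch)."""
    # Rule 14 / Raider: a freshly played minion may only attack if it is a Raider.
    if attacker.just_played and attacker.subtype != "Raider":
        return []
    guards = [m for m in enemy_boardside if m.subtype == "Guard"]
    # Guards must be attacked first, before ordinary minions or the avatar.
    if guards:
        return guards
    return list(enemy_boardside) + [enemy_avatar]
```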
Please refer to Appendix B for detailed information on every card and their art credits.
Architecture Design
Figure 4.6 illustrates the key components of the game, highlighting the various classes, interfaces, and their interconnections through inheritance. Each primary component consists of one or more classes that contribute to the overall structure of the game.
The Client component is in charge of:
• Requesting a socket connection to the Communication server, which has been mentioned in section 4.1.2.
• Receiving JSON data from the Communication server.
• Pre-processing the JSON data and calling the proper functions corresponding to the processed data in the API Director component.
• Sending environment states or other byte streams to the Communication server.
The API Director reads a CSV file containing card IDs when it is integrated into the system, initializes the data, and instructs the Client instance to connect to the socket server. In addition, it:
• Receives actions from the Client and calls the corresponding function in the Game Manager component to proceed with the game.
• Sends a new environment state after an action has been performed in the game.
• Responds to other components' requests for card IDs.
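The real API Director lives in the game code; purely for illustration, a Python sketch of its message dispatching could look as follows, where connect, collect_environment_state and perform_action are assumed handler names rather than the game's actual API.

```python
import csv
import json

class APIDirector:
    """Routes prefixed messages between the Client and the Game Manager (sketch)."""

    def __init__(self, game_manager, client, card_id_csv: str):
        self.game_manager = game_manager
        self.client = client
        with open(card_id_csv, newline="") as f:           # card IDs from a CSV file
            self.card_ids = [row[0] for row in csv.reader(f)]
        self.client.connect()                               # hypothetical call

    def on_message(self, message: bytes) -> None:
        if message.startswith(b"GET-ENV-STATE"):
            state = self.game_manager.collect_environment_state()  # hypothetical call
            self.client.send(b"ENV-STATE" + json.dumps(state).encode())
        elif message.startswith(b"ACTION"):
            action = int(message.removeprefix(b"ACTION").decode())
            self.game_manager.perform_action(action)               # hypothetical call

    def lookup_card_id(self, index: int) -> str:
        """Answer other components' requests for card IDs."""
        return self.card_ids[index]
```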
Figure 4.6: The game's main architecture.
Besides the networking components, log data is recorded for every game, every time the game state changes. The Log Writer component serves this purpose.
On every new game, a new log file is created. When an action is made, the component appends a new log line to the log file.
The Log Writer operates independently of player presence, generating logs for any game and any player, unlike the API Director, which only transmits the environment state when agents are active. These logs are instrumental for training imitation agents and assessing the performance of smart agents.
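A minimal sketch of this behaviour, with an assumed JSON log-line format, is shown below; one file is created per game and one line is appended per action.

```python
import json
import time
from pathlib import Path

class LogWriter:
    """Creates one log file per game and appends a line whenever an action occurs."""

    def __init__(self, log_dir: str = "logs"):
        Path(log_dir).mkdir(exist_ok=True)
        # One file per game, named here by its start timestamp (an assumption).
        self.path = Path(log_dir) / f"game_{int(time.time())}.log"

    def append(self, action: dict, new_state: dict) -> None:
        # The log-line format is an assumption; JSON keeps the lines easy to parse
        # later when building datasets for imitation agents.
        with self.path.open("a") as f:
            f.write(json.dumps({"action": action, "state": new_state}) + "\n")
```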
The Mouse component is associated with the player's cursor. When the player hovers over or clicks other components, specific interactions are handled. Mostly, the Mouse interacts with the Card and Hand components, where:
• On hovering over the player's hand rectangle, the cards in hand are displayed widely.
• When the cursor is clicked, the Mouse component casts a ray to choose a widely displayed card; the chosen card can then be dragged to activate it.
• On clicking a card on the player's board side, the Mouse component notifies the Game Manager of that card's data (e.g. its owner, its reference).
Timer is a small component in the game, which only handles:
• Counting the remaining time of the current turn.
• Notifying the Game Manager that the turn is over.
The Game Manager serves as the core component of the system, functioning as both a supervisor that oversees and manages all game events and as an intermediary that facilitates internal communications between senders and receivers.
The Game Manager's job can be briefly described as follows:
• Start the game and set up players and cards.
• Receive an action command from the API Director and send the action to the Player component to perform it.
• Request the Log Writer to append new log data when an action is performed and the new state is ready.
• Notify the Timer to count the current turn's time, and change turn when time is up.
• Receive click-data notifications; depending on the received notifications, the Game Manager can choose targets for card attack, card activation, card sacrifice, etc.
• Retrieve the Players' states to update the environment state.
• Notify other components of a new-turn update.
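Again purely as a sketch, the Game Manager's coordinating role could be expressed as follows; the collaborator names mirror the components above, but every method signature here (perform, start_turn, gather_state, restart) is an assumption.

```python
class GameManager:
    """Supervises game events and relays messages between components (sketch)."""

    def __init__(self, players, timer, log_writer):
        self.players = players            # two Player instances
        self.timer = timer
        self.log_writer = log_writer
        self.current = 0                  # index of the current-turn player

    def on_action(self, action) -> None:
        """Called by the API Director when the agent sends an action."""
        player = self.players[self.current]
        player.perform(action)                              # hypothetical call
        new_state = self.collect_environment_state()
        self.log_writer.append(action, new_state)           # log every action

    def on_turn_timeout(self) -> None:
        """Called by the Timer when the current turn's time is up."""
        self.current = 1 - self.current
        self.players[self.current].start_turn()             # draw card, reset gold
        self.timer.restart()

    def collect_environment_state(self) -> dict:
        return {i: p.gather_state() for i, p in enumerate(self.players)}
```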
There are two Players in any game; each Player holds its own instance of Hand, Deck and Boardside. The Player component is in charge of:
• Updating its components on a new turn accordingly: drawing a card from its deck, adding the drawn card to its hand and calling the boardside to update the minions.
• Activating a card if it satisfies the gold condition, spot condition and so on.
• Gathering data from its hand, deck and boardside to send to the Game Manager.
Hand is one of three components included in every Player; it interacts with the Mouse, holds card instances and is controlled by the Player:
• On Mouse hover, it displays the drawn cards widely for the player to choose or inspect.
• Adds one or more cards from the deck to itself on the Player's request.
• Removes an activated card and transmits it to the boardside, or destroys it if the card is a spell or sacrifice card.
Deck is one of three components included in every Player; it is initialized with the whole card dataset. The Deck only communicates with cards and its Player component:
• Randomly chooses a card from its card pool and instantiates it on the Player's request.
• Returns nothing if all the cards in its pool have been drawn.
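A minimal Deck sketch consistent with these two behaviours could look like this; the card objects themselves are left abstract.

```python
import random
from typing import Optional

class Deck:
    """Holds the player's remaining cards and hands them out at random (sketch)."""

    def __init__(self, card_dataset: list):
        self.pool = list(card_dataset)     # initialized with the whole card dataset

    def draw(self) -> Optional[object]:
        # Returns nothing once every card in the pool has been drawn.
        if not self.pool:
            return None
        card = random.choice(self.pool)
        self.pool.remove(card)
        return card
```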
Boardside is the last of the three components included in every Player. It holds activated Minion cards and controls them:
• Receives the update signal from its Player, then propagates the signal to the minions to update their states on a new turn.
• Adds a new minion if there is a spare spot, and removes a minion when it dies.
Card is the umbrella component for the three card types: Minion, Spell and Sacrifice.
In the game, effects generated by spell and sacrifice cards that affect the player and minions are stored directly in the Player component and updated each turn, like the other components. Minions interact with the boardside, and any effects they create that affect players and minions are handled in the same manner as those from spell and sacrifice cards.
As the most important component of the game, Card:
• Acts as a target for the Mouse's ray casting.
• Sends activation requests to its owner.
• Is only a class with data and logic functions; any translation, rotation or scaling is requested from the Visualizer.
Table 4.2: Features which describe a game state.
• Indices [3, 338]: 14 24-dimensional vectors for cards on board (both player and opponent).
• Indices [339, 347]: one 9-dimensional vector for cards on the player's hand.
• Two 56-dimensional vectors for positions of cards on board (both player and opponent) and on the player's hand.
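Based only on the dimensions listed in Table 4.2, and with a placeholder per-card encoder, the flat state vector could be assembled roughly as follows; the leading indices before position 3 are not detailed in the table and are omitted here.

```python
import numpy as np

BOARD_SLOTS, BOARD_CARD_DIM = 14, 24   # 14 24-dimensional vectors for board cards
HAND_DIM = 9                           # one 9-dimensional vector for the hand
POSITION_DIM = 56                      # two 56-dimensional position vectors

def encode_card(card) -> np.ndarray:
    """Placeholder per-card encoder; the real features are the game's own."""
    return np.zeros(BOARD_CARD_DIM)

def encode_state(board_cards, hand_vector, position_vectors) -> np.ndarray:
    """Concatenate the Table 4.2 feature groups into one flat vector (sketch)."""
    board = np.zeros((BOARD_SLOTS, BOARD_CARD_DIM))
    for i, card in enumerate(board_cards[:BOARD_SLOTS]):
        board[i] = encode_card(card)

    hand = np.asarray(hand_vector, dtype=float)              # shape (9,)
    positions = np.concatenate([np.asarray(v, dtype=float)   # 2 x 56 dims
                                for v in position_vectors])

    return np.concatenate([board.ravel(), hand, positions])
```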
The Visualizer mainly separates visual effects from the Card component. It is in charge of:
• Showing the card on the board and in the hand.
• Translating, rotating and scaling the card on demand.
It is essential to implement a callback on the Card component after actions such as activation, attacking, or dying. This callback, which fires after the visual effects are executed, is crucial due to the interconnected nature of interactions: for instance, when a minion attacks, it first moves closer to its target before dealing damage and then gradually returns to its original position.
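This attack flow is a natural place for a completion callback. A simplified sketch, in Python for consistency with the rest of this chapter and with instant "animations" standing in for the real tweening, could look like this; the card and target fields (position, life, damage) are hypothetical.

```python
from typing import Callable

class Visualizer:
    """Plays visual effects and reports back to the Card when they finish (sketch)."""

    def move_to(self, card, position) -> None:
        # Placeholder for the real tweening/animation; here it completes instantly.
        card.position = position

    def play_attack_animation(self, card, target,
                              on_complete: Callable[[], None]) -> None:
        home = card.position
        self.move_to(card, target.position)   # move towards the target first
        self.move_to(card, home)              # then return to the original spot
        on_complete()                         # only now may the Card apply damage

def attack(card, target, visualizer: Visualizer) -> None:
    def deal_damage() -> None:
        target.life -= card.damage            # damage is applied after the animation
    visualizer.play_attack_animation(card, target, on_complete=deal_damage)
```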