# Design Considerations of Linear Algebra Processor for Wearable Brain-Computer Interface System

## Woo Seok Byun<sup>1</sup>, Do Kyun Kim<sup>2</sup>, Sung Yeon Kim<sup>3</sup> and Ji Hoon Kim<sup>a</sup>

<sup>1</sup>Information & Electronics Research Institute, Korea Advanced Institute of Science and Technology, <sup>2</sup>Hanwha Systems, <sup>3</sup>Synopsys Korea, <sup>a</sup>Department of Electronic and Electrical Engineering, Ewha Womans University

E-mail: <sup>1</sup>wooseokbyun@kaist.ac.kr

Abstract - In this paper, we introduce design considerations for a wearable brain-computer interface (BCI) that performs a target identification algorithm based on linear algebra. Steady-state visual evoked potential (SSVEP)-based wearable BCIs have been studied to enable paralyzed patients to communicate with others. However, performance indicators such as target identification accuracy and information transfer rate (ITR) still need further improvement for wearable devices. This paper discusses several considerations for designing algorithms and linear algebra acceleration hardware. For target identification algorithms, two techniques proposed in previous works can be used to reduce computational complexity: signal binarization for single-channel SSVEP-based software implementations, and candidate reduction for multi-channel SSVEP processing in hardware. For hardware architecture design, we introduce architectural considerations for a processing element array that can efficiently perform various linear algebra operations.

Keywords—Brain-Computer Interface (BCI), Linear algebra processor, System-on-a-chip, Target identification

#### I. INTRODUCTION

A brain-computer interface (BCI) can convert electrical signals of brain activity into interpretable information that reflects the user's intent [1]-[3]. Because BCI technology requires only electrical signals from the brain, it can provide a new control channel for people with neuromuscular disorders [1], [3]. For paralyzed patients, the BCI speller allows the use of word processing programs to communicate with others.

BCI techniques can be classified as invasive or non-invasive depending on whether surgery is required. Among non-invasive BCIs, the electroencephalogram (EEG)-based BCI speller has been widely used by paralyzed patients because of its high temporal resolution, low cost, safety, and wide range of applications [4]-[6]. Because spelling systems for paralyzed patients require fast and intuitive communication, the steady-state visual evoked potential (SSVEP) has been widely used [5], [7], [8].



Fig. 1. (a) Principle of target identification of a visual evoked potential (VEP)-based BCI system. (b) Implementation example as a headband-type device with a system-on-a-chip (SoC) for signal processing.

Recent work on high-speed visual-stimulus-based BCI (V-BCI) systems has mainly used electrode-array caps with EEG recording devices and external signal-processing machines. This type of BCI spelling system is inconvenient, takes a long time to set up, and relies on external computing systems [9]. Such systems achieve a high information transfer rate (ITR), but their high cost and large form factor limit their practicality in real-world situations.

In a previous study, an in-ear V-BCI was reported [10]. However, the wearable in-ear device of this system only acquires the SSVEP signal and transmits the raw SSVEP data wirelessly, which results in high power consumption [11]. Therefore, an energy-efficient wearable V-BCI that performs the signal processing algorithms on chip is required.

Fig. 1(a) illustrates the operation of the wearable V-BCI, and Fig. 1(b) shows an example implementation. (1) When the user gazes at a blinking character on the display, the blinking frequency of that character is reflected in the brain signal as an SSVEP. (2) The SSVEP can be measured over the occipital area of the scalp. (3) The system then performs signal acquisition and runs a target identification algorithm to determine which target the user has focused on. (4) The result of the target identification algorithm is transmitted wirelessly to the display. (5) After one character has been typed, the user shifts his or her gaze to the next visual target to continue typing.

a. Corresponding author; jihoonkim@ewha.ac.kr

Manuscript Received May 10, 2021, Revised Jun. 11, 2021, Accepted Jun. 11, 2021

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.



Fig. 2. Performance indicators of the target identification algorithm.

The key performance indicators of the V-BCI system are target identification accuracy and information transfer rate (ITR). As shown in the ITR equation in Fig. 2, the ITR is determined by the number of visual target stimuli $N_f$, the accuracy $P$, and the target selection time $T$. The ITR can be improved by increasing $P$ and $N_f$ while decreasing $T$. Increasing $N_f$ also improves the practicality of the wearable V-BCI system, since $N_f$ is the number of intents that can be expressed. However, the higher $N_f$ is, the better the target identification algorithm must perform, because smaller visual stimuli must be placed in the limited display area and distinguished at smaller frequency intervals.
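The text refers to the ITR equation in Fig. 2 without reproducing it. Assuming it is the widely used Wolpaw definition (bits per selection scaled by selections per minute), a minimal Python sketch makes the dependence on $N_f$, $P$, and $T$ explicit. The function names and the example numbers (12 targets, 90% accuracy, $T$ = 1.5 s) are illustrative assumptions, not values from the paper.

```python
import math


def bits_per_selection(n_targets: int, accuracy: float) -> float:
    """Wolpaw bits per selection for N equiprobable targets and accuracy P.

    Assumes errors are spread uniformly over the N - 1 wrong targets.
    """
    if n_targets < 2:
        raise ValueError("need at least two targets")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    n, p = n_targets, accuracy
    bits = math.log2(n)
    if p > 0.0:
        bits += p * math.log2(p)
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n - 1))
    return bits


def itr_bits_per_min(n_targets: int, accuracy: float, t_select_s: float) -> float:
    """ITR in bits/min, where t_select_s is the target selection time T in seconds."""
    return bits_per_selection(n_targets, accuracy) * 60.0 / t_select_s


# Hypothetical example: 12 targets, 90% accuracy, T = 1.5 s.
print(round(itr_bits_per_min(12, 0.90, 1.5), 1))
```

At fixed accuracy, more targets carry more bits per selection, while a longer $T$ divides the rate directly, which is the trade-off discussed above.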

The target selection time $T$ consists of a gaze shift time $T_{gaze}$ (usually 0.5 s) to move the focus of sight, the SSVEP recording time $T_R$, and the signal processing time $T_{proc}$ required to complete the signal processing after the end of the recording, that is, $T = T_{gaze} + T_R + T_{proc}$. Reducing $T$ is also an effective way to improve the ITR. In general, however, a longer $T_R$ yields a higher $P$, because a longer SSVEP recording contains more target information. In addition, a target identification algorithm with high accuracy can require a long $T_{proc}$, which works against reducing $T$. Therefore, $T_R$ and the target identification algorithm should be chosen carefully.

The accuracy can be improved by increasing $T_R$ or the number of channels, as indicated in Fig. 2. However, the amount of computation grows rapidly with the number of SSVEP channels, and the increase in computational complexity is dramatic when the system uses multiple channels rather than a single channel.

Therefore, if high accuracy is not required in a V-BCI system, an appropriate design strategy is to optimize a single-channel SSVEP-based algorithm for low power consumption on a low-cost embedded system. Because a single SSVEP channel carries a limited amount of target information, a V-BCI that must meet high accuracy requirements can instead be implemented with a multi-channel target identification algorithm. Performing such a multi-channel algorithm in software on a wearable device is inefficient and difficult. In this case, it is important to run the algorithm on signal processing hardware that accelerates the frequently used linear algebra operations with high energy efficiency.



Fig. 3. Algorithm optimization for single-channel target identification on a low-cost MCU-based V-BCI system. This optimization method not only reduced the memory footprint of the target identification program by 92% but also significantly reduced the computational complexity.

This paper introduces a wearable V-BCI system that performs target identification based on linear algebra and discusses issues that need to be specifically considered in the design of an algorithm and a linear algebra processor. For the target identification algorithm, two topics are covered: an optimization technique that can be considered in a single-channel SSVEP-based software implementation to reduce memory footprint and computational complexity, and a target candidate reduction technique that can be considered in a multi-channel SSVEP-based implementation to reduce the computational load. For hardware design, we discuss the architecture and performance of a processing element (PE) array that can effectively perform various linear algebra operations.

#### II. DESIGN METHODOLOGY OF WEARABLE BCI SYSTEM

#### A. Algorithm

Fig. 3 shows the concept of signal binarization, reported in a previous study [12], applied to the target identification step of the wearable V-BCI. Signal binarization maps each SSVEP sample to +1 or -1: positive samples map to +1, and negative or zero samples map to -1. Target identification algorithms, which consist of linear algebra operations, are generally implemented in software using floating-point arithmetic. If each sample is stored as a 4-byte float variable, 8 samples occupy 32 bytes of memory on an MCU-based system. If each sample is instead mapped to a single binary digit, 8 samples occupy only 1 byte and can be processed with fixed-point arithmetic. As shown in Fig. 3, signal binarization thus reduces both the data memory and the code size of an MCU-based system for single-channel SSVEP.
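The memory saving and the integer-only processing can be sketched in Python with NumPy. The XOR-and-count form of the correlation below is one common way to exploit ±1 data and is our assumption, not necessarily the exact procedure of [12]; the 250 Hz sampling rate, 10 Hz stimulus, and noise level are hypothetical.

```python
import numpy as np


def binarize(x: np.ndarray) -> np.ndarray:
    """Signal binarization: bit 1 stands for +1 (positive sample), bit 0 for -1 (zero or negative)."""
    return (x > 0).astype(np.uint8)


def binary_correlation(a_packed: np.ndarray, b_packed: np.ndarray, n: int) -> int:
    """Sum of a[i] * b[i] over two +/-1 sequences of length n, computed from packed bits.

    Matching bits contribute +1 and differing bits -1, so the sum is n - 2 * (number of
    differing bits); the differing bits are found with XOR and then counted.
    """
    differing = np.unpackbits(np.bitwise_xor(a_packed, b_packed))[:n]
    return n - 2 * int(differing.sum())


# Hypothetical single-channel example: 1 s of a noisy 10 Hz SSVEP sampled at 250 Hz.
rng = np.random.default_rng(0)
fs, n = 250, 250
t = np.arange(n) / fs
ssvep = (np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(n)).astype(np.float32)
template = np.sin(2 * np.pi * 10 * t).astype(np.float32)

bits_in = np.packbits(binarize(ssvep))      # 8 samples per byte
bits_tpl = np.packbits(binarize(template))
print(ssvep.nbytes, bits_in.nbytes)         # float32 storage vs packed binary storage
print(binary_correlation(bits_in, bits_tpl, n))
```

The 250 float samples take 1000 bytes, while the packed version takes 32 bytes; the correlation then needs only XOR and bit counting instead of floating-point multiplies.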

Fig. 4(a) shows the baseline algorithm, CCA-Comb, which combines CCA-Standard (the standard version of canonical correlation analysis) and individual template-based CCA (IT-CCA) [8], [13]. IT-CCA detects temporal features of SSVEP signals by applying CCA-Standard between the SSVEP and an individual SSVEP template obtained by averaging multiple training trials. CCA-Comb uses four weight vectors as spatial filters to enhance the SNR of SSVEPs. For multi-channel SSVEP, CCA-Comb typically compares the SSVEP with all target frequencies, as shown in Fig. 4(a), so its computational complexity increases in proportion to the number of visual targets.



Fig. 4. (a) The conventional target identification algorithm analyzes the correlation for all existing visual stimuli and selects the most correlated target among them. (b) Candidate reduction (CR) reduces the number of target candidates using simple correlation analysis.

In previous work [14], the candidate reduction (CR) technique was proposed to reduce the number of target candidates with simple preprocessing, as described in Fig. 4(b). CR performs a simple correlation analysis between the input SSVEP and subject-specific SSVEP template data. For example, if the total number of targets $N_f$ is 12 and the predetermined number of candidates is 3, CR reduces the number of target candidates from 12 to 3 before the target identification algorithm runs. If the top-3 targets by correlation coefficient are selected, the target identification algorithm is performed only for those 3 candidates, with a negligible decrease in average target identification accuracy. In this case, apart from the added initial correlation calculation, the computational complexity is reduced by about 75%, since 9 of the 12 target evaluations are skipped.
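The two-stage flow can be sketched in Python with NumPy. For brevity this uses CCA-Standard with sine-cosine references rather than the full CCA-Comb with four weight vectors, and the CR ranking (Pearson correlation of channel-averaged signals) is our assumption of what "simple correlation analysis" could look like. The functions `reduce_candidates` and `identify`, the 12 frequencies, and the noise-free templates standing in for averaged training trials are all hypothetical.

```python
import numpy as np


def max_canonical_corr(x: np.ndarray, y: np.ndarray) -> float:
    """Largest canonical correlation between x (samples x channels) and y (samples x refs).

    After centering, the canonical correlations are the singular values of Qx^T Qy,
    where Qx and Qy are orthonormal bases (QR factors) of the two column spaces.
    """
    qx, _ = np.linalg.qr(x - x.mean(axis=0))
    qy, _ = np.linalg.qr(y - y.mean(axis=0))
    return float(np.linalg.svd(qx.T @ qy, compute_uv=False)[0])


def sine_reference(freq: float, fs: float, n: int, harmonics: int = 2) -> np.ndarray:
    """Sine-cosine reference signals used by CCA-Standard (n samples x 2*harmonics)."""
    t = np.arange(n) / fs
    cols = []
    for h in range(1, harmonics + 1):
        cols += [np.sin(2 * np.pi * h * freq * t), np.cos(2 * np.pi * h * freq * t)]
    return np.column_stack(cols)


def reduce_candidates(x: np.ndarray, templates: list, k: int) -> list:
    """Hypothetical CR step: rank all targets by the Pearson correlation between the
    channel-averaged input and each channel-averaged template, and keep the top k."""
    xm = x.mean(axis=1)
    scores = [np.corrcoef(xm, tpl.mean(axis=1))[0, 1] for tpl in templates]
    return [int(i) for i in np.argsort(scores)[::-1][:k]]


def identify(x: np.ndarray, freqs: np.ndarray, fs: float, candidates: list) -> int:
    """Run CCA-Standard only on the surviving candidates and return the best target."""
    n = x.shape[0]
    return max(candidates,
               key=lambda i: max_canonical_corr(x, sine_reference(freqs[i], fs, n)))


# Hypothetical data: 12 targets (8.0-13.5 Hz), 8 channels, 1 s at 250 Hz.
rng = np.random.default_rng(0)
fs, n, n_ch = 250.0, 250, 8
freqs = 8.0 + 0.5 * np.arange(12)
t = np.arange(n) / fs
gains = rng.uniform(0.5, 1.5, n_ch)            # per-channel SSVEP amplitude
templates = [np.outer(np.sin(2 * np.pi * f * t), gains) for f in freqs]
x = templates[5] + 0.5 * rng.standard_normal((n, n_ch))  # a trial for target 5

cands = reduce_candidates(x, templates, k=3)   # 12 -> 3 candidates
print(cands, identify(x, freqs, fs, cands))    # CCA runs 3 times instead of 12
```

The expensive step (QR and SVD per target) runs only on the surviving candidates, which is where the roughly 75% saving comes from when 3 of 12 targets are kept.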

Fig. 5 shows the results of analyzing the dataset used in [15] with the CCA-CR algorithm proposed in [14]. Fig. 5(a) shows, for each $T_R$, the probability that the focused target stimulus is included in the reduced candidate group as a function of the predetermined number of candidates. The accuracy follows a similar trend for each $T_R$, but it is particularly poor for a 0.5-s $T_R$. Because the amount of computation decreases drastically as the number of candidates decreases, selecting 3 candidates with a 1-s $T_R$, which gives an accuracy close to 90%, is an effective choice for system performance. Fig. 5(b) shows the subject-wise results for 1-s SSVEP as a function of the number of candidates. For subjects 2, 5, 9, and 10, the performance decreases significantly when the number of candidates falls below 3.



Fig. 5. (a) Accuracy of CCA-CR according to the number of candidates. (b) Subject-wise accuracy of CCA-CR using 1-s SSVEP according to the number of candidates.

#### B. Hardware Architecture

Linear algebra operations used in the target identification algorithm of a V-BCI typically require Basic Linear Algebra Subprograms (BLAS) Levels 1-3 and matrix factorization. A common hardware architecture for matrix multiplication is an array of processing elements (PEs) built from multiply-accumulate (MAC) units, known as a systolic array. From the point of view of data flow, the systolic array suits not only simple MAC operations but also complex operations such as matrix factorization, because the data flow of matrix factorization can be expressed directly at the hardware architecture level. In addition, the systolic array reduces memory bandwidth and the number of memory accesses, because operands are passed between neighboring PEs and reused, which helps improve energy efficiency.
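The data reuse argument can be made concrete with a small cycle-level model in Python. The output-stationary dataflow below (rows of A flow right, columns of B flow down, each PE accumulates one output) is an illustrative choice for this sketch; the text does not state which dataflow the PE array uses. The function `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary n x m systolic array computing A @ B.

    Row i of A enters PE(i, 0) skewed by i cycles and moves right; column j of B
    enters PE(0, j) skewed by j cycles and moves down. PE(i, j) accumulates C[i][j].
    Returns (C, cycles, operands fetched from memory at the array edges).
    """
    n, k, m = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    acc = [[0] * m for _ in range(n)]
    a_reg = [[None] * m for _ in range(n)]   # A operand currently held by each PE
    b_reg = [[None] * m for _ in range(n)]   # B operand currently held by each PE
    fetches = 0
    cycles = k + n + m - 2                   # until PE(n-1, m-1) sees its last pair
    for t in range(cycles):
        new_a = [[None] * m for _ in range(n)]
        new_b = [[None] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:                   # left edge: feed skewed row i of A
                    s = t - i
                    if 0 <= s < k:
                        new_a[i][j] = A[i][s]
                        fetches += 1
                else:                        # otherwise reuse the left neighbour's operand
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                   # top edge: feed skewed column j of B
                    s = t - j
                    if 0 <= s < k:
                        new_b[i][j] = B[s][j]
                        fetches += 1
                else:                        # otherwise reuse the upper neighbour's operand
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):                   # every PE does one MAC on the pair it holds
            for j in range(m):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles, fetches


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C, cycles, fetches = systolic_matmul(A, B)
print(C, cycles, fetches)  # a MAC loop without reuse would read 2 * 2 * 2 * 3 = 24 operands
```

Each element of A and B is read from memory exactly once (12 reads here) and then travels between PEs, which is the source of the memory access reduction described above.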



Fig. 6. (a) Comparison of parallel processing and sequential processing by multiple array architecture and single array architecture, respectively. (b) ITR analysis for the time component constituting the target selection time, T, which is an important factor in calculating the ITR. (c) ITR analysis over processing time,  $T_{proc}$ , for various target identification accuracy.

Target identification accuracy and ITR are closely related to the hardware architecture in a multi-channel SSVEP-based system implementation. Fig. 6(a) shows the typical difference between a single array architecture and the 3-engine architecture proposed in a previous study [14], in which each engine contains two PE arrays. A multiple array architecture can complete the target identification algorithm quickly through parallel processing. A single array architecture requires relatively low hardware complexity and power consumption, but it incurs a longer delay until the algorithm completes because of sequential processing.

In the ITR equation of Fig. 6(b), the target selection time $T$ is composed of $T_{gaze}$, $T_R$, and $T_{proc}$, as mentioned earlier. If $T_R$ can be significantly reduced with high-performance SSVEP acquisition equipment or a better target identification algorithm, $T_{proc}$ accounts for a larger fraction of $T$, so reducing $T_{proc}$ has a larger effect on the ITR. In this case, it is better to exploit the parallel processing of a multiple array architecture to improve the ITR. However, if $T_R$ is 1 s or longer, reducing $T_{proc}$ does not significantly improve the ITR, and using a multiple array architecture to reduce $T_{proc}$ is inefficient in terms of hardware area and power consumption. Therefore, for a wearable V-BCI device, sequential processing on a single array architecture is suitable, since it lowers the chip implementation cost and improves energy efficiency.



Fig. 7. Implementation procedure and methodology. In the hardware design step, algorithm profiling, chip specification, and functional modeling with Python and C simulators are particularly important for numerical processing because of the quantization error of fixed-point arithmetic.

| Parameter          | Value                                                                    |
|--------------------|--------------------------------------------------------------------------|
| Technology         | 130 nm CMOS                                                              |
| Array Architecture | 8x8 PE-based single array architecture                                   |
| Core Voltage       | 1.0 V                                                                    |
| Clock Frequency    | 90 MHz                                                                   |
| Average Power      | 122.8 mW for CCA-CR processing<br>(PE array consumes 92% of total power) |
| Application        | Linear Algebra Processing for V-BCI                                      |

TABLE I. CHIP INFORMATION

Fig. 6(c) shows the ITR as a function of $T_{proc}$ when $T_{gaze}$ and $T_R$ are set to 0.5 s and 1.0 s, respectively. When $T_{proc}$ is on the order of tens of milliseconds, even a sixfold increase in $T_{proc}$ does not change the ITR significantly.
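The insensitivity to $T_{proc}$ follows directly from $T = T_{gaze} + T_R + T_{proc}$. A minimal Python sweep, assuming the standard Wolpaw ITR formula and hypothetical values $N_f$ = 12 and $P$ = 0.9 (the paper's Fig. 6(c) curves are not reproduced here), shows the effect of a sixfold increase in $T_{proc}$:

```python
import math


def itr_bits_per_min(n: int, p: float, t_select_s: float) -> float:
    """Wolpaw ITR (bits/min); assumes 1/N < p <= 1 and errors spread over N - 1 targets."""
    bits = math.log2(n) + p * math.log2(p)
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n - 1))
    return bits * 60.0 / t_select_s


T_GAZE, T_R = 0.5, 1.0        # values stated for Fig. 6(c)
N_F, P = 12, 0.9              # hypothetical: 12 targets at 90% accuracy

for t_proc in (0.010, 0.060, 0.100, 0.500):
    T = T_GAZE + T_R + t_proc
    print(f"T_proc = {t_proc * 1e3:5.0f} ms -> ITR = {itr_bits_per_min(N_F, P, T):6.1f} bits/min")
```

Going from 10 ms to 60 ms lengthens $T$ from 1.51 s to 1.56 s, so the ITR drops by only about 3%. For instance, if one array had to do sequentially the work that six arrays would do in parallel (assuming perfect parallel scaling), this is roughly the penalty paid for the smaller, lower-power design.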

Designing a linear algebra acceleration processor requires careful consideration of a reconfigurable PE array and an instruction set architecture (ISA) that can represent various linear algebra operations. In addition, an efficient microarchitecture for repetitive operations should be considered. Many matrix factorizations are computed iteratively and typically involve plane rotations: accurate results are obtained by repeating plane rotations until the iteration converges. When mapping these operations to hardware, memory accesses should be minimized for energy efficiency by moving data directly between PEs in the array.
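As one concrete example of an iterative plane-rotation factorization, the cyclic Jacobi eigenvalue method for symmetric matrices can be sketched in Python with NumPy. The paper does not say which factorization its processor implements; Jacobi is chosen here only because it is a textbook case of repeated rotations converging to a result, and `jacobi_eigenvalues` is a hypothetical name.

```python
import numpy as np


def jacobi_eigenvalues(a, tol: float = 1e-12, max_sweeps: int = 50) -> np.ndarray:
    """Cyclic Jacobi eigenvalue method for a symmetric matrix.

    Each step applies a plane (Givens) rotation J^T A J chosen to zero the
    off-diagonal pair (p, q); sweeps repeat until the off-diagonal part converges.
    """
    a = np.array(a, dtype=float)
    n = a.shape[0]
    for _ in range(max_sweeps):
        off = np.sqrt(np.sum(np.triu(a, 1) ** 2))   # size of the upper off-diagonal part
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p, q] == 0.0:
                    continue
                # tan(2*theta) = 2*a_pq / (a_qq - a_pp) makes the new a_pq zero
                theta = 0.5 * np.arctan2(2.0 * a[p, q], a[q, q] - a[p, p])
                c, s = np.cos(theta), np.sin(theta)
                j = np.eye(n)
                j[p, p], j[q, q], j[p, q], j[q, p] = c, c, s, -s
                # A full product is used for clarity; only rows/columns p and q change.
                a = j.T @ a @ j
    return np.sort(np.diag(a))


rng = np.random.default_rng(3)
m = rng.standard_normal((5, 5))
sym = m + m.T
print(jacobi_eigenvalues(sym))
print(np.linalg.eigvalsh(sym))  # reference result, ascending order
```

Because each rotation touches only two rows and two columns, the updates are local and regular, which is why such rotations map well onto an array of PEs that pass data to their neighbors.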

Fig. 7 shows the overall design process and methodology for a linear algebra processor. The front-end and back-end of the chip implementation follow the general procedure for implementing a digital system. The algorithm profiling, chip specification, and functional simulator steps of the hardware design process are especially important in numerical processing applications that involve linear algebra. In general, serious malfunctions can occur when the quantization errors introduced by converting floating-point arithmetic into fixed-point arithmetic accumulate. Therefore, the conversion of the floating-point algorithm into a fixed-point algorithm and the design of the hardware Q-format should account for error accumulation over repeated linear algebra operations.
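The kind of check a functional simulator performs can be sketched in a few lines of Python: compute a dot product in fixed point, re-quantizing every product, and compare it with the floating-point result as the vector length grows. The truncating multiplier and the 7 and 15 fractional-bit formats are assumptions chosen for illustration, not the chip's Q-format.

```python
import random


def to_q(x: float, frac_bits: int) -> int:
    """Quantize a real value to a signed fixed-point integer with frac_bits fractional bits."""
    return round(x * (1 << frac_bits))


def q_mul(a: int, b: int, frac_bits: int) -> int:
    """Fixed-point multiply, rescaled by an arithmetic right shift (truncation toward
    minus infinity), as a simple datapath without rounding logic might do."""
    return (a * b) >> frac_bits


def fixed_dot(xs, ys, frac_bits: int) -> float:
    """Dot product where every product is re-quantized before accumulation.

    Python integers never overflow; a real accumulator also needs enough guard bits.
    """
    acc = 0
    for x, y in zip(xs, ys):
        acc += q_mul(to_q(x, frac_bits), to_q(y, frac_bits), frac_bits)
    return acc / (1 << frac_bits)


random.seed(1)
xs = [random.uniform(-1, 1) for _ in range(4096)]
ys = [random.uniform(-1, 1) for _ in range(4096)]
for n in (16, 256, 4096):
    exact = sum(x * y for x, y in zip(xs[:n], ys[:n]))
    for fb in (7, 15):
        err = abs(fixed_dot(xs[:n], ys[:n], fb) - exact)
        print(f"n={n:5d}  frac_bits={fb:2d}  |error|={err:.2e}")
```

Truncation biases every product downward by about half a least significant bit on average, so the error grows roughly linearly with the number of accumulated terms; rounding to nearest removes most of that bias, which is the kind of decision the Q-format design step has to make.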

#### **III. RESULTS AND DISCUSSIONS**

The linear algebra processor designed with the methodology introduced in the previous sections was fabricated in a 130 nm CMOS process, as summarized in Table I. The chip operates at a core voltage of 1.0 V and a clock frequency of 90 MHz, and it uses a single array to reduce power consumption, since energy efficiency is an important indicator in this work. Linear algebra processors can be designed to support different types of operations according to the requirements of the application system, so it is important to define the supported operations and an ISA specialized for the target domain. In addition, to efficiently reconfigure the hardware behavior at runtime, important issues include managing the input/output memory addresses, data sizes and matrix shapes, and Q-format settings.
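The runtime reconfiguration items listed above (memory addresses, matrix shape, Q-format) are typically packed into a configuration word or instruction. The Python sketch below shows one way to encode and validate such a descriptor; the class `LinalgConfig`, its field names, bit widths, and opcode meanings are entirely hypothetical and are not the fabricated chip's ISA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinalgConfig:
    """Hypothetical per-operation configuration for a reconfigurable PE array.

    Field names and widths are illustrative only; they mirror the items the text
    lists (I/O memory address, matrix shape, Q-format), not the chip's real ISA.
    """
    opcode: int        # e.g. 0 = matrix multiply, 1 = rotation sweep (hypothetical)
    src_addr: int      # input buffer base address
    dst_addr: int      # output buffer base address
    rows: int
    cols: int
    frac_bits: int     # Q-format: number of fractional bits

    # (field name, bit width), packed from the least significant bit upward
    FIELDS = (("opcode", 4), ("src_addr", 16), ("dst_addr", 16),
              ("rows", 8), ("cols", 8), ("frac_bits", 5))

    def encode(self) -> int:
        word, shift = 0, 0
        for name, width in self.FIELDS:
            value = getattr(self, name)
            if not 0 <= value < (1 << width):
                raise ValueError(f"{name}={value} does not fit in {width} bits")
            word |= value << shift
            shift += width
        return word

    @classmethod
    def decode(cls, word: int) -> "LinalgConfig":
        values = {}
        for name, width in cls.FIELDS:
            values[name] = word & ((1 << width) - 1)
            word >>= width
        return cls(**values)


cfg = LinalgConfig(opcode=1, src_addr=0x0100, dst_addr=0x0800, rows=8, cols=8, frac_bits=15)
word = cfg.encode()
print(hex(word), LinalgConfig.decode(word) == cfg)
```

Range checks at encode time catch shapes or Q-formats that the hardware fields cannot represent before they reach the array, which is one practical reason to model the configuration in the functional simulator.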

A high ITR for target identification is generally desirable, but not at the cost of excessively low accuracy or energy efficiency. For example, if the number of visual targets is very large and the target selection time is very short, the ITR may look high even when the accuracy is low; such a system is difficult to use in practice. When evaluating the physical performance of the chip, energy efficiency should be considered in addition to ITR. Reducing the target selection time by raising the operating frequency to shorten the signal processing latency tends to degrade energy efficiency. In terms of overall system performance, therefore, increasing the operating frequency is not always appropriate, and it is important to balance operating frequency and energy efficiency. If the designer must minimize the target selection time even at the cost of high energy consumption, it is better to adopt a multiple array architecture and raise the operating frequency. However, if energy efficiency is the most important metric, accepting longer signal processing latency by using a single array and a lower operating frequency reduces power consumption and can still provide adequate ITR and energy efficiency.

### IV. CONCLUSION

This paper introduced design considerations, in terms of algorithms and hardware architecture, for a linear algebra processor in a wearable V-BCI system. The number of SSVEP channels and the corresponding algorithm implementation method should be determined according to the system requirements of the wearable V-BCI. When processing single-channel SSVEP, the target identification algorithm can be optimized, implemented in software, and run on a low-cost MCU. When processing multi-channel SSVEP, hardware acceleration is required to handle the large amount of computation. We also introduced the candidate reduction technique, which significantly reduces the amount of computation with little loss of accuracy. The hardware architecture should be chosen by considering the signal processing time and the recording time, which are components of the target selection time of the V-BCI system. Unless the recording time is short and the signal processing time needs to be reduced, a single array architecture is suitable. In addition, the PE array architecture should be designed to reflect the repetitive nature of linear algebra operations.

#### ACKNOWLEDGMENT

This work was supported in part by the Super Computer Development Leading Program of the National Research Foundation of Korea (NRF) funded by the Korean government (Ministry of Science and ICT (MSIT)) (2020M3H6A1084852) and in part by the IC Design Education Center (IDEC), Korea.

#### REFERENCES

- [1] D. J. McFarland and J. R. Wolpaw, "Brain-computer interface operation of robotic and prosthetic devices," Computer, vol. 41, no. 10, pp. 52-56, Oct. 2008.
- [2] N. Birbaumer, N. Ghanayim, T. Hinterberger, I. Iversen, B. Kotchoubey, A. Kubler, J. Perelmouter, E. Taub, and H. Flor, "A spelling device for the paralysed," Nature, vol. 398, pp. 297-298, 1999.
- [3] N. Birbaumer and L. Cohen, "Brain-computer interfaces: Communication and restoration of movement in paralysis," J. Physiol., vol. 579, no. 3, pp. 621-636, Jan. 2007.
- [4] H. Cecotti, "Spelling with non-invasive brain-computer interfaces-current and future trends," J. Physiol. - Paris, vol. 105, pp. 106-114, Jun. 2011.
- [5] D. Zhu, J. Bieger, G. G. Molina, and R. M. Aarts, "A survey of stimulation methods used in SSVEP-based BCIs," Comput. Intell. Neurosci., vol. 2010, 2010, 702357.
- [6] M. Arvaneh, C. Guan, K. K. Ang, and C. Quek, "Optimizing spatial filters by minimizing within-class dissimilarities in electroencephalogram-based brain-computer interface," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 4, pp. 610-619, Apr. 2013.
- [7] M. Cheng, X. Gao, and S. Gao, "Design and implementation of a brain-computer interface with high transfer rates," IEEE Trans. Biomed. Eng., vol. 49, pp. 1181-1186, Oct. 2002.
- [8] Y. Wang, M. Nakanishi, Y.-T. Wang, and T.-P. Jung, "Enhancing detection of steady-state visual evoked potentials using individual training data," Proc. Ann. Int. Conf. IEEE Eng. Med. Bio. Soc., pp. 3037-3040, 2014.
- [9] M. Nakanishi, Y. Wang, X. Chen, Y.-T. Wang, X. Gao, and T.-P. Jung, "Enhancing detection of SSVEPs for a high-speed brain speller using task-related component analysis," IEEE Trans. Biomed. Eng., vol. 65, no. 1, pp. 104-112, Jan. 2018.
- [10] J. W. Ahn, Y. Ku, D. Y. Kim, J. Sohn, J.-H. Kim, and H. C. Kim, "Wearable in-the-ear EEG system for SSVEP-based brain-computer interface," Electron. Lett., vol. 54, no. 7, pp. 413–414, Mar. 2018.
- [11] N. Verma, A. Shoeb, J. Bohorquez, J. Dawson, J. Guttag, and A. P. Chandrakasan, "A micro-power EEG acquisition SoC with integrated feature extraction processor for a chronic seizure detection system," IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 804–816, Apr. 2010.
- [12] D. Kim, W. Byun, Y. Ku, and J.-H. Kim, "High-Speed Visual Target Identification for Low-Cost Wearable Brain-Computer Interfaces," IEEE Access, vol. 7, pp. 55169-55179, Apr. 2019.
- [13] M. Nakanishi, Y. Wang, Y.-T. Wang, Y. Mitsukura, and T.-P. Jung, "A high-speed brain speller using steady-state visual evoked potentials," Int. J. Neural Syst., vol. 24, no. 6, p. 1450019, 2014.
- [14] W. Byun, D. Kim, S. Y. Kim, and J. Kim, "A 110.3 bits/min 8-Ch SSVEP-based Brain-Computer Interface SoC with 87.9% Accuracy," 2019 IEEE Asian Solid-State Circuits Conference (A-SSCC), pp. 201-204, 2019.
- [15] M. Nakanishi, Y. Wang, Y. Wang, and T.-P. Jung, "A Comparison Study of Canonical Correlation Analysis Based Methods for Detecting Steady-State Visual Evoked Potentials," PLoS One, vol. 10, no. 10, pp. e0140703, Oct. 2015.



**Wooseok Byun** received the B.S. degree in electronics engineering and the M.S. and Ph.D. degrees in electronics, radio, and information communications engineering from Chungnam National University, Daejeon, South Korea, in 2013, 2015, and 2020, respectively. In 2020, he joined the Information and Electronics Research Institute, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, where he is currently a Postdoctoral Researcher. His current research interests include CPU/DSP, brain-computer interfaces, and energy-efficient neural processing engines. He received the Distinguished Design Award from the IEEE A-SSCC 2019 Student Design Contest in 2019.



**Dokyun Kim** received the B.S. and M.S. degrees in electronics engineering from Chungnam National University, Daejeon, South Korea, and SeoulTech, Seoul, South Korea, in 2017 and 2019, respectively. In 2019, he joined the Avionics R&D Center, Hanwha Systems, where he is currently an Engineer. His current research interests include avionics software such as the Operational Flight Program (OFP), communication systems in aircraft, test automation tools for software reliability testing, and brain-computer interfaces for military applications.



**Sung Yeon Kim** received the B.S. and M.S. degrees in electronic engineering from SeoulTech, Seoul, South Korea, in 2018 and 2020, respectively. In 2020, he joined Synopsys Korea, Seongnam-si, South Korea, where he is currently working as an Application Engineer. He is currently supporting Fusion Compiler front-end projects.



**Ji-Hoon Kim** received the B.S. (summa cum laude) and Ph.D. degrees in electrical engineering and computer science from KAIST, Daejeon, South Korea, in 2004 and 2009, respectively. In 2009, he joined Samsung Electronics. In 2018, he joined the Faculty of the Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, South Korea, where he is currently a Professor. His current research interests include CPU/DSP, communication modems, and low-power SoC design for security/biomedical systems. Dr. Kim is a Technical Committee Member of the circuits and systems for communications and VLSI systems and applications in the IEEE Circuits and Systems Society. He received the Best Design Award from the Dongbu HiTek IP Design Contest in 2007 and the First Place Award from the International SoC Design Conference Chip Design Contest in 2008.