Greening the AI: Evaluating tokenization methods for Energy-efficient NLP
Omar R.;Muccini H.
2025-01-01
Abstract
Tokenization is a fundamental preprocessing step in natural language processing (NLP) and large language models (LLMs) that influences both model performance and computational efficiency. Although extensive research has explored tokenization strategies in terms of accuracy, their impact on energy consumption remains underexplored. This study investigates the energy consumption and classification performance of three widely used tokenization methods (word-level, character-level, and subword-level) on sentiment analysis tasks using BiLSTM models. Our experiments measure the energy consumed by tokenization, model training, and inference, and evaluate classification performance with the F1 score. The results show that subword tokenization is the most energy-intensive during the tokenization stage, while character tokenization consumes the most energy during training and inference. Word tokenization emerges as the most energy-efficient method at every stage while achieving competitive classification performance. Despite its higher energy consumption, subword tokenization achieves an F1 score similar to that of word tokenization. Character tokenization, however, proves to be the least effective: it yields the lowest classification performance while demanding the most computational resources. These findings emphasize the need to consider tokenization choices in Green AI research, particularly for energy-efficient NLP applications. Future work should extend this analysis to other NLP models, such as transformers, to provide a broader understanding of tokenization efficiency in deep learning.
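To make the three granularities compared in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not name the tokenizers or libraries used, so the function names, the greedy longest-match subword rule, and the toy vocabulary below are all assumptions chosen for illustration only.

```python
import re

def word_tokenize(text):
    # Word-level: lowercase, then keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def char_tokenize(text):
    # Character-level: every character (including spaces) becomes one token.
    return list(text.lower())

def subword_tokenize(text, vocab):
    # Subword-level (simplified, hypothetical): greedy longest-match split of
    # each word against a toy vocabulary. If no multi-character piece matches,
    # fall back to a single-character token.
    tokens = []
    for word in word_tokenize(text):
        start = 0
        while start < len(word):
            end = len(word)
            while end > start + 1 and word[start:end] not in vocab:
                end -= 1
            tokens.append(word[start:end])
            start = end
    return tokens

# Hypothetical example sentence and vocabulary (not from the paper).
sentence = "The movie was unbelievably good"
toy_vocab = {"the", "movie", "was", "un", "believ", "ably", "good"}

word_tokens = word_tokenize(sentence)
char_tokens = char_tokenize(sentence)
subword_tokens = subword_tokenize(sentence, toy_vocab)

for name, toks in [("word", word_tokens),
                   ("character", char_tokens),
                   ("subword", subword_tokens)]:
    print(f"{name:>9}: {len(toks)} tokens")
```

The same sentence produces sequences of very different lengths depending on the granularity. Sequence length is one factor that can affect the cost of training and inference in a recurrent model such as a BiLSTM, although the abstract itself does not attribute the measured energy differences to any specific cause.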


