Greening the AI: Evaluating tokenization methods for Energy-efficient NLP

Omar R.;Muccini H.
2025-01-01

Abstract

Tokenization is a fundamental preprocessing step in natural language processing (NLP) and large language models (LLMs) that influences both model performance and computational efficiency. Although extensive research has explored tokenization strategies in terms of accuracy, their impact on energy consumption remains underexplored. This study investigates the energy consumption and classification performance of three widely used tokenization methods (word-level, character-level, and subword-level) on sentiment analysis tasks using BiLSTM models. Our experiments measure the energy consumed during tokenization, model training, and inference, along with classification performance as measured by the F1 score. The results show that subword tokenization is the most energy-intensive during the tokenization stage, while character tokenization consumes the most energy during training and inference. Word tokenization emerges as the most energy-efficient method in all stages while achieving competitive classification performance. Despite its higher energy consumption, subword tokenization achieves an F1 score similar to that of word tokenization. Character tokenization, however, proves to be the least effective, exhibiting the lowest classification performance while demanding the most computational resources. These findings emphasize the need to consider tokenization choices in Green AI research, particularly for energy-efficient NLP applications. Future work should extend this analysis to other NLP models, such as transformers, to provide a broader understanding of tokenization efficiency in deep learning.
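To make the three granularities concrete, the following is a minimal Python sketch, not the authors' implementation. The function names, the sample sentence, and `TOY_VOCAB` are all hypothetical, and the subword function uses a simple greedy longest-match rule as a stand-in for a trained subword tokenizer. It only illustrates that the same sentence yields sequences of very different lengths; longer sequences mean more recurrent steps in a BiLSTM, which is one plausible contributor to the energy differences reported above, though the abstract itself does not state the cause.

```python
def word_tokenize(text):
    """Word-level: split on whitespace after lowercasing."""
    return text.lower().split()


def char_tokenize(text):
    """Character-level: every character, including spaces, is a token."""
    return list(text.lower())


def subword_tokenize(text, vocab):
    """Subword-level (toy): greedy longest-match against a fixed vocabulary.

    Assumption: this stands in for a trained subword tokenizer; pieces not
    found in the vocabulary fall back to single characters.
    """
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate piece until it is in the vocabulary
            # or only one character is left.
            while end > start + 1 and word[start:end] not in vocab:
                end -= 1
            tokens.append(word[start:end])
            start = end
    return tokens


# Hypothetical vocabulary and sentence, chosen only for illustration.
TOY_VOCAB = {"the", "movie", "was", "un", "watch", "able"}
sentence = "The movie was unwatchable"

counts = {
    "word": len(word_tokenize(sentence)),
    "subword": len(subword_tokenize(sentence, TOY_VOCAB)),
    "char": len(char_tokenize(sentence)),
}
print(counts)
```

In this toy case the character-level sequence is several times longer than the word-level one, with the subword sequence in between.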
Use this identifier to cite or link to this document: https://hdl.handle.net/11697/284161