MACHINE LEARNING TECHNIQUES FOR NATURAL LANGUAGE PROCESSING

4. TEXT DATA IN MACHINE LEARNING

5.1 Ethics of Text data utilization

Where should the line be drawn in a digital age in which data collection happens all the time?

We know that customers are constantly solicited to buy further products or services from a range selected through data collection and analysis (Ford 2015). Data collection is now widespread in both the public and private sectors. In most cases the customer’s consent is clearly requested, but data is also collected without users’ knowledge. Companies use the information they collect about consumers in ways that range from customization to an invasion of privacy (Mark et al. 1999).

According to a survey by Adobe and Edelman Berland, 84% of respondents agree that there are too many technologies analyzing and tracking behavior, and 82% agree that companies collect too much information on consumers. 79% feel that their information is collected without their knowledge. (Jameel & Majid 2018)

When people fill out a simple registration form to obtain information or a service, they don't think about the fact that their data will be stored in the registration system for a long time.

The GDPR (General Data Protection Regulation) in Europe gives consumers data privacy rights through a set of data handling rules that companies and enterprises must respect. Customers can file a complaint when they notice a violation of their rights, so they need to learn what those rights are in order to defend their position.

Moreover, the most invasive practices happen when (Leidner & Plachouras 2017):

• Information is shared with third parties

• Customers must enter their social security number or other personal information

• An ad follows them around from one website to another

• A website knows their geographic location

As previously mentioned, information can be collected with or without the customer’s consent. However, the biggest challenge is to understand who uses this information and for what purpose. For example, many have pondered the following question: how do adverts for hotels in specific destinations start appearing in your pop-ups after you have chatted with your friends about their recent holiday, without you even touching your phone?

Data collection systems are very powerful, and we can do nothing but ponder why and how they collect and analyze data. (Taylor 2018)

Text data analysis is also a major data privacy issue. Social media posts, emails and text documents (or any other kind of secondary data) are analyzed in order to learn more about a user’s personality, character and emotions (Brey & Soraker 2009). Even if users do not want to share their information, companies can extract meaningful information about their private lives from their Internet activity and website visits. In addition, by combining information, companies can form a general idea of a user’s personality, interests, hobbies and movements during a certain period. Some of this information (e.g. social media posts) is publicly available, but companies should consider the legal consequences regarding data privacy when choosing what data to use in NLP.

6. CONCLUSIONS

The amount of text data has increased ever since the invention of the computer and has exploded since the beginning of the information age. This opens up possibilities to gather insights from data in text format. We concluded that a significant amount of work goes into preprocessing the data and only a small portion into the value-creating analysis. Most of the preprocessing methods in use are very simple, and further research is needed to derive a model for automating the preprocessing of textual data.

The analysis of text data is typically called Natural Language Processing (NLP). There is a vast number of different NLP methods, each addressing different organizational problems and business needs. The diversity of the findings surprised us: the methods we found range from chatbots to simple word counting and from sentiment analysis to automatic summarization.
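To make the simplest end of that range concrete, the following minimal sketch illustrates basic preprocessing (lowercasing, tokenization, stop-word removal) followed by simple word counting. It is only an illustration under stated assumptions and not a method taken from the reviewed literature: the stop-word list, the function names and the sample feedback texts are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "but", "to", "of"}

def preprocess(text):
    """Lowercase the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def word_counts(texts):
    """Count how often each remaining token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(preprocess(text))
    return counts

# Invented sample data standing in for customer feedback.
feedback = [
    "The delivery was fast and the support was friendly.",
    "Delivery was slow, but support was helpful.",
]
counts = word_counts(feedback)
print(counts.most_common(2))  # prints [('delivery', 2), ('support', 2)]
```

Even this small example shows that most of the code deals with preparing the text, while the analysis itself is a single counting step.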

Artificial intelligence in general raises many ethical questions, and text data is no exception. Although the processing of personal data is limited by legislation such as the GDPR, this may not be enough in the future. AI is definitely an enabler in the field of text analysis and NLP, but the question remains: where to draw the line? For example, it is deemed acceptable to conduct sentiment analysis on enterprise data such as customer feedback or social media responses, but it is definitely not acceptable to conduct sentiment analysis on employees’ email correspondence.

All in all, based on this research, the field of using artificial intelligence with text data is moving very fast and its future is difficult to predict. Further research is definitely needed in order to reach the full potential of text analysis and natural language processing. Or, as a saying often attributed to Abraham Lincoln puts it: “Best way to predict the future is to create it.”

REFERENCES

Androutsopoulou, A., Karacapilidis, N., Loukis, E. & Charalabidis, Y. (2018). Transforming the communication between citizens and government through AI-guided chatbots, Government Information Quarterly.

Bell, T. (2017). 4 business applications for natural language processing. Cio, Retrieved from https://libproxy.tuni.fi/login?url=https://search-proquest-com.lib-proxy.tuni.fi/docview/1976406187?accountid=14242

Brey, P. & Soraker, J. (2009). Philosophy of computing and information technology. In Dov M. Gabbay, Anthonie Meijers, and John Woods, editors, Philosophy of Technology and Engineering Sciences, volume 9, pages 1341–1408. North Holland, Burlington, MA, USA.

Choi, Y. & Lee, H. (2017). Data properties and the performance of sentiment classification for electronic commerce applications, Information Systems Frontiers, Vol. 19(5), pp. 993–1012.

Church, K. W. & Rau, L. F. (1995). Commercial Applications of Natural Language Processing, Commun. ACM, Vol. 38(11), pp. 71–79.

Dale, R. (2017). The commercial NLP landscape in 2017, Natural Language Engineering; Cambridge, Vol. 23(4), pp. 641–647.

Ford, M. (2015). The Rise of the Robots: Technology and the Threat of Mass Unemployment. Oneworld, New York, NY, USA.

Garcia, S., Luengo, J., Herrera, F. (2016) Tutorial on practical tips of the most influential data preprocessing algorithms in data mining, Knowledge-Based Systems, Volume 98, Pages 1-29 https://doi.org/10.1016/j.knosys.2015.12.006

Gabernet, A. R. & Limburn, J. (2017). Breaking the 80/20 rule: How data catalogs transform data scientists’ productivity, IBM. Available: https://www.ibm.com/blogs/blue-mix/2017/08/ibm-data-catalog-data-scientists-productivity/

Hirschberg, J. & Manning, C. D. (2015). Advances in natural language processing, Science, Vol. 349(6245), pp. 261–266.

Huang, M. W., Lin, W.-C., Chen, C.-W., Ke, S.-W., Tsai, C.-F. & Eberle, W. (2016). Data preprocessing issues for incomplete medical datasets, Expert Systems, 33: 432–438. doi: 10.1111/exsy.12155.

Jameel et al. | URNCST Journal (2018): Volume 2, Issue 4, pp 45-73-79-83-84 https://doi.org/10.26685/urncst.39

Krishnan, K. & Rogers, S. P. (2015). Chapter 1 - A New Universe of Data, K. Krishnan & S. P. Rogers, eds., Social Data Analytics, Boston: Morgan Kaufmann, pp. 1–10.

Leidner, L. & Plachouras, V. (2017). Proceedings of the First Workshop on Ethics in Natural Language Processing, Valencia, Spain, pages 30–40. http://aclweb.org/anthology/W17-1604.pdf

Leuhu, T. (2015). Sentiment analysis using machine learning. Available: https://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/23136/leuhu.pdf?sequence=1&isAllowed=y

Li, W., Li, L., Li, Z. & Cui, M. (2019). Statistical relational learning based automatic data cleaning, Front. Comput. Sci. 2019, 13: 215. https://doi-org.libproxy.tuni.fi/10.1007/s11704-018-7066-4

Lloret, E., Romá-Ferri, M. T. & Palomar, M. (2013). COMPENDIUM: A text summarization system for generating abstracts of research papers, Data & Knowledge Engineering, Vol. 88, pp. 164–175.

Mark, M. M., Eyssell, K. M. & Campbell, B. (1999). The ethics of data collection and analysis, Special Issue: Current and Emerging Ethical Challenges in Evaluation, Volume 1999, Issue 82, pages 47–56. Available: https://onlinelibrary.wiley.com/toc/1534875x/1999/1999/82

Merchant, K. & Pande, Y. (2018). NLP Based Latent Semantic Analysis for Legal Text Summarization, 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1803–1807.

Moreno, J. & Manuel, J. (2014). Automatic Text Summarization, ISBN 9781848216686, US: Wiley-ISTE.

Moussallem, D., Wauer, M. & Ngomo, A.-C. N. (2018). Machine Translation using Semantic Web Technologies: A Survey, Journal of Web Semantics, Vol. 51, pp. 1–19.

Neustein, A. & Markowitz, J. (2013). Advanced NLP Methods and Applications, Where Humans Meet Machines: Innovative Solutions for Knotty Natural-Language Problems. Available:

Nukarinen, V. (2018). Automated text sentiment analysis for Finnish language using deep learning, Tampere University of Technology. https://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/26114/Nukarinen.pdf?sequence=3&isAllowed=y

Oxford Reference (2016). machine translation, A Dictionary of Computer Science. Available:

... sentiment and emotion analysis of unstructured social media text, Electronic Commerce Research, Volume 18, Issue 1, Pages 181-199, https://doi.org/10.1007/s10660-017-9257-8

Taylor, F. (2018). Is AI the Future of Security? YUDU Sentinel. Available: https://sentinelcrisismanagement.blog/2018/01/05/is-ai-the-future-of-security/

Thelwall, M. (2019). The SAGE Handbook of Social Media Research Methods, pages 545-556, 55 City Road, London: SAGE Publications Ltd.

Trstenjak, B., Mikac, S. & Donko, D. (2014). KNN with TF-IDF based Framework for Text Categorization, Procedia Engineering, Vol. 69, pp. 1356–1364.

Verma, S., Vieweg, S., Corvey, W. J., Palen, L., Martin, J. H., Palmer, M., ... & Anderson, K. M. (2011, July). Natural language processing to the rescue? Extracting "situational awareness" tweets during mass emergency. In Fifth International AAAI Conference on Weblogs and Social Media.

Xiaobing, Sun, Xiangyue, Liu, Jiajun, Hu, and Junwu, Zhu. (2014) Empirical studies on the NLP techniques for source code data preprocessing. In Proceedings of the 2014 3rd International Workshop on Evidential Assessment of Software Technologies (EAST 2014). ACM, New York, NY, USA, 32-39. DOI=10.1145/2627508.2627514 http://doi.acm.org.libproxy.tuni.fi/10.1145/2627508.2627514

Zhang, B., Xiong, D. & Su, J. (2018). Neural Machine Translation with Deep Attention, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.

Zhou, L., Pan, S., Wang, J. & Vasilakos, A. V. (2017). Machine learning on big data: Opportunities and challenges, Neurocomputing, Volume 237, Pages 350-361, https://doi.org/10.1016/j.neucom.2017.01.026.
