Machine Learning Algorithms used in NLP (Natural Language Processing)

Published: 2016-06-01  Categories: Machine Translation, Natural Language Processing

Below is my personal summary; additions are welcome. Apologies, I wrote it in English and never got around to translating it, haha:

 

Natural Language Processing (NLP) is one of the most useful application areas of machine learning and has been developed over many years. As devices get smaller and smaller, NLP is becoming a potentially better input/output method for them.

 

NLP has the following sub-categories:

1. Question answering systems, like IBM’s Watson and Apple’s Siri

2. Information Extraction & Sentiment Analysis

3. Machine translation

4. Word sense disambiguation

5. Relation extraction

6. Abstractive summarization

 

The basic techniques of NLP are as follows:

1. Basic text processing, like text normalization, case folding, lemmatization, sentence segmentation, etc.

2. Part-of-speech (POS) tagging

3. Named entity recognition

4. Parsing

5. Language models. These are very useful for machine translation, speech recognition, etc.
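
As an illustration of item 1 above, here is a minimal sketch of case folding, naive sentence segmentation, and tokenization using only Python's standard `re` module. The function names (`normalize`, `split_sentences`, `tokenize`) are made up for this example, and the punctuation rule is a simplifying assumption: it will wrongly split on abbreviations like "Dr.", which is one reason real systems use trained classifiers.

```python
import re

def normalize(text):
    """Case-fold the text and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text.casefold()).strip()

def split_sentences(text):
    """Naive segmentation: split at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(text):
    """Extract lowercase word tokens, keeping internal apostrophes (e.g. don't)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", normalize(text))

if __name__ == "__main__":
    print(split_sentences("NLP is fun. Is it hard? Not really!"))
    print(tokenize("Don't  PANIC, it's 2016!"))
```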

 

NLP uses many algorithms, mostly machine learning algorithms and models. In recent years, deep learning techniques such as word vectors and LSTM (long short-term memory) networks have also been applied to NLP with good results.

The algorithms and models NLP might use are as follows:

1. Regular Expressions. Used for basic text processing, like word extraction, tokenization, or as features in the classifiers.

2. Decision trees. Used for sentence segmentation.

3. Minimum edit distance. Measures the similarity of two strings; can be used in named entity extraction and entity coreference.

4. N-gram language models. The Markov assumption simplifies the model: each word is assumed to depend only on the few words that precede it.

5. Noisy channel model. Used for spelling correction.

6. Naive Bayes. Used for text classification, like spam detection, authorship identification, assigning subject categories/topics/genres, age/gender identification, sentiment analysis, etc.

7. Traditional classification algorithms such as SVM, logistic regression, maximum entropy, and KNN. These can also be used for text classification, relation extraction, etc.

8. HMM and CRF. Used for tagging, named entity recognition, and parsing.

9. tf-idf and VSM (Vector Space Model). Can be used for information retrieval, etc.
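
To make item 3 concrete, the following is a small sketch of minimum edit distance computed by dynamic programming. It assumes unit cost for insertion, deletion, and substitution (the Levenshtein variant); other cost schemes, such as charging 2 for a substitution, give different numbers. The function name `edit_distance` is just an illustrative choice.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory grows with the length of the target string rather than with the product of both lengths.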
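
For item 4, a bigram model is the simplest case of the Markov assumption: the probability of a sentence is approximated as the product of P(word | previous word). The sketch below estimates those probabilities by plain counting on a toy corpus. The function names and the `<s>` / `</s>` boundary markers are assumptions of this example, and no smoothing is applied, so any unseen bigram gives the whole sentence probability zero.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return contexts, bigrams

def bigram_probability(contexts, bigrams, previous, word):
    """Maximum-likelihood estimate: count(previous, word) / count(previous)."""
    if contexts[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / contexts[previous]

def sentence_probability(contexts, bigrams, sentence):
    """Multiply bigram probabilities along the sentence (Markov assumption)."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        probability *= bigram_probability(contexts, bigrams, previous, word)
    return probability

if __name__ == "__main__":
    corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
    contexts, bigrams = train_bigram_model(corpus)
    print(bigram_probability(contexts, bigrams, "<s>", "i"))
    print(sentence_probability(contexts, bigrams, "I am Sam"))
```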
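
Item 6 can be illustrated with a tiny multinomial Naive Bayes classifier written from scratch. This is only a sketch: the training sentences are invented toy data, the function names are hypothetical, add-one (Laplace) smoothing is one common choice among several, and words never seen in training are simply skipped.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs is a list of (text, label) pairs; returns the count tables."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log prior plus log likelihood."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # skip words never seen in training
            # add-one smoothing so a missing word does not zero out the class
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

if __name__ == "__main__":
    training = [
        ("win cash prize now", "spam"),
        ("cheap prize win win", "spam"),
        ("meeting agenda for monday", "ham"),
        ("lunch on monday", "ham"),
    ]
    model = train_naive_bayes(training)
    print(classify(model, "win a cash prize"))
    print(classify(model, "monday meeting"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.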
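
For item 9, the sketch below builds tf-idf vectors as sparse dictionaries and compares documents by cosine similarity, which is the basic retrieval idea behind the vector space model. It assumes raw term counts for tf and log(N / df) for idf; many other weighting variants exist, and the function names and sample documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the cat chased the mouse",
        "stock markets fell sharply",
    ]
    vectors = tfidf_vectors(docs)
    print(cosine_similarity(vectors[0], vectors[1]))
    print(cosine_similarity(vectors[0], vectors[2]))
```

With this weighting, a term that occurs in every document gets weight zero, so very common words contribute nothing to the similarity.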

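About the author

Yang Wenlong (杨文龙) is a Principal Engineering Manager at Microsoft. He has previously served at various companies as Senior Director of Imaging Technology, Senior Manager of a data science team, Director of ADAS Algorithms, and Senior Deep Learning Engineer. He is passionate about innovation and invention and focuses on artificial intelligence, deep learning, image processing, machine learning, algorithms, natural language processing, and software. He currently holds 19 international patents and 28 Chinese patents.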
