Document Analysis Platform

Web app for gathering and reading news stories. Retrieve online articles about a topic, and a chatbot references them when answering your questions.

  • Flask
  • SQLite
  • Weaviate Vector Database
  • OpenAI API
  • Google Search API

Functionality

  • Seek and Store - Uses the Google Search API and BeautifulSoup to extract text from search results, then cleans it and stores it in the databases.
  • Q&A - Question answering over the dataset with a GPT-powered chatbot. Vector search via Weaviate DB.
  • NER - Visualize named-entity frequency across articles. Powered by Flair and Chart.js.
  • Summarize - Enter a URL to see a summary of the article.
Github Repo
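
The Seek and Store step (extract text, clean it, store it) can be illustrated with a minimal sketch. This is not the project's code: it assumes the standard-library html.parser as a stand-in for BeautifulSoup, and the `articles` table, its columns, and the function names are hypothetical.

```python
import re
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    """Return the page's visible text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind once markup is dropped.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def store_article(conn, url, text):
    """Insert or update one article (hypothetical schema, keyed by URL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, body TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, body) VALUES (?, ?)", (url, text)
    )
    conn.commit()
```

Keying the table on the URL means that fetching the same search result twice overwrites the stored text instead of duplicating it.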

Topic Modeling via BERTopic and LDA

A tool for quantifying topic model performance. Compares BERTopic and Latent Dirichlet Allocation.

  • BERTopic
  • scikit-learn Pipelines
  • Sentence-Transformers
  • Topic Coherence / NPMI Score
  • Matplotlib
Github Repo
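
To make the NPMI score concrete, here is a minimal sketch of NPMI-based topic coherence computed from document-level co-occurrence. It is an illustration under stated assumptions, not the tool's implementation: documents are represented as sets of tokens, the function names are hypothetical, and the return values for the two edge cases are conventions chosen here.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, documents):
    """NPMI of two words, with probabilities estimated as document frequencies."""
    n = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n
    p_b = sum(word_b in doc for doc in documents) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n
    if p_ab == 0:
        return -1.0  # convention: the words never co-occur
    if p_ab == 1:
        return 1.0  # convention: both words appear in every document
    # PMI normalized by -log p(a, b), which bounds the score to [-1, 1].
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_coherence(top_words, documents):
    """Mean NPMI over all pairs of a topic's top words (needs two or more words)."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)
```

A topic whose top words tend to appear in the same documents scores close to 1, while a topic that mixes unrelated words scores close to -1, which gives BERTopic and LDA topics a common scale for comparison.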

Algorithms

Implementations of classic algorithms from Stanford's Coursera course.

  • Sorting, search and randomized algorithms
  • Graph search and shortest paths
  • Data Structures
  • Time complexity analysis
Github Repo
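
As an example of the shortest-path material, here is a short sketch of Dijkstra's algorithm with a binary heap. Choosing Dijkstra is an assumption about what the repository covers, and the adjacency-list format (node mapped to a list of (neighbor, weight) pairs with non-negative weights) is hypothetical.

```python
import heapq


def dijkstra(graph, source):
    """Shortest-path distances from source to every reachable node.

    graph maps each node to a list of (neighbor, weight) pairs;
    weights are assumed to be non-negative.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist
```

With a binary heap this runs in O((V + E) log V) time, since each edge relaxation pushes at most one heap entry.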

News Classification

Training and evaluating different classifier models on two news datasets: the BBC's 5-class dataset and HuffPost's 42-class dataset.

  • Logistic Regression
  • Random Forest
  • Neural Classifier
Github Repo

Information Extraction

Extracting information from text in the form of subject-verb-object triples.

  • spaCy
  • Coreferee
  • Textacy
Github Repo

News Clustering Study

Scraping news articles, vectorizing their contents and comparing their similarity.

  • BeautifulSoup
  • TF-IDF
  • PCA
  • Matplotlib
Github Repo
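
The vectorize-and-compare step can be sketched in pure Python. This is a minimal illustration, not the study's code: it assumes whitespace tokenization and the unsmoothed textbook TF-IDF formula (libraries such as scikit-learn apply a smoothed variant by default), and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Articles that share distinctive terms get a similarity near 1, and articles with no terms in common get exactly 0. A term that appears in every article receives zero weight under this formula, so it does not contribute to similarity.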

Document Database

Takes input documents, tokenizes and encodes them with Sentence-Transformers, then inserts them into Milvus DB.

  • NLTK Tokenizer
  • Sentence-Transformers
  • Milvus DB
  • Docker
Github Repo