Data Engineering & NLP

Market Voice Analytics

Analyzing the market impact of central bank communications using natural language processing and financial sentiment analysis

Role: Full Stack Developer
Timeline: 2025
Type: Personal Project
Tags: Python, Apache Airflow, PostgreSQL, FinBERT, NLP, Streamlit, Docker, SQLAlchemy

Overview

Market Voice Analytics is an end-to-end data pipeline that scrapes speeches from the European Central Bank (ECB) and the Federal Reserve, performs financial sentiment analysis using FinBERT, and correlates the results with price movements in five market instruments: EUR/USD, the S&P 500, gold, the US 10-year Treasury, and the Euro Stoxx 50.

The project demonstrates the intersection of natural language processing, financial analysis, and data engineering by analyzing how central bank communications influence market behavior.

Key Features

Automated Data Ingestion

Scrapers for ECB and Federal Reserve speeches, press releases, and statements with URL-based deduplication

Historical Backfill

Archive scrapers that fetch thousands of historical speeches from the ECB foedb JSON database and the Fed yearly archives

Financial NLP

Sentiment analysis using FinBERT with intelligent sentence-based chunking for long documents

Market Data Integration

Automatic fetching of price data around speech dates from Yahoo Finance via the yfinance library

Correlation Analysis

Measures how speech sentiment correlates with market movements over 1-day and 1-week periods

Interactive Dashboard

3-page Streamlit app with sentiment distribution, speaker analysis, and market impact visualizations
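
The Correlation Analysis feature listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: it assumes a Pearson coefficient (the write-up does not name one), treats 1 day and 1 week as 1 and 5 trading days, and uses the hypothetical names `forward_return`, `pearson`, and `sentiment_market_correlation`.

```python
import math
from datetime import date
from typing import Optional, Sequence, Tuple

Close = Tuple[date, float]  # (trading day, closing price), sorted by date


def forward_return(closes: Sequence[Close], speech_date: date, horizon: int) -> Optional[float]:
    """Percent change from the first close on or after the speech date
    to the close `horizon` trading days later (assumed: 1 = 1 day, 5 = 1 week)."""
    idx = next((i for i, (day, _) in enumerate(closes) if day >= speech_date), None)
    if idx is None or idx + horizon >= len(closes):
        return None  # not enough price history after the speech
    base = closes[idx][1]
    return (closes[idx + horizon][1] - base) / base * 100.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation; None if undefined (too few points or zero variance)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def sentiment_market_correlation(
    speeches: Sequence[Tuple[date, float]],  # (speech date, sentiment score)
    closes: Sequence[Close],
    horizon: int,
) -> Optional[float]:
    """Correlate sentiment scores with forward price changes for one instrument."""
    pairs = []
    for speech_date, score in speeches:
        change = forward_return(closes, speech_date, horizon)
        if change is not None:
            pairs.append((score, change))
    if len(pairs) < 2:
        return None
    scores, changes = zip(*pairs)
    return pearson(scores, changes)
```

Speeches dated on a weekend or holiday are mapped to the next available close, and speeches without enough later price data are dropped rather than guessed.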

System Architecture

The pipeline follows a modular architecture with clear separation of concerns:

  ECB RSS Feed             Fed JSON / RSS
       |                          |
       v                          v
+--------------+           +--------------+
| ECB Scraper  |           | Fed Scraper  |
+--------------+           +--------------+
       |                          |
       +----------+  +------------+
                  |  |
                  v  v
           +----------------+
           |   PostgreSQL   |
           |    (Docker)    |
           +----------------+
                   |
   +---------------+----------------+
   |                                |
   v                                v
+---------------+          +-----------------+
| FinBERT NLP   |          |   Market Data   |
| (Sentiment)   |          |   (yfinance)    |
+---------------+          +-----------------+
   |                                |
   +---------------+----------------+
                   |
                   v
          +------------------+
          |    Streamlit     |
          |    Dashboard     |
          +------------------+

Technology Stack

Language: Python 3.11+
Orchestration: Apache Airflow 2.9
Database: PostgreSQL 16, SQLAlchemy 2.0
NLP: HuggingFace Transformers, FinBERT
Market Data: yfinance
Dashboard: Streamlit, Plotly
Infrastructure: Docker, Docker Compose
Tools: Poetry, Ruff, pytest

Technical Challenges & Solutions

Challenge: BERT Token Limit

Central bank speeches are typically 2,000-5,000 words, but BERT-based models such as FinBERT accept at most 512 tokens per input, so a full speech cannot be scored in a single pass.

Solution: Implemented intelligent sentence-based chunking that splits text on sentence boundaries, verifies exact token count for each chunk, handles edge cases like abbreviations and long sentences, and aggregates sentiment scores across all chunks.
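
The chunking and aggregation steps described above can be sketched as follows. This is an illustration under assumptions, not the project's code: the token counter is a whitespace stand-in (a real pipeline would count the FinBERT tokenizer's output), the sentence splitter is a naive regex that does not handle abbreviations, the 510-token budget assumes two positions reserved for special tokens, and the token-weighted average is only one plausible way to aggregate chunk scores. All function names are hypothetical.

```python
import re
from typing import Callable, Dict, List, Sequence, Tuple

MAX_TOKENS = 510  # assumed budget: 512 minus two special tokens


def whitespace_token_count(text: str) -> int:
    """Stand-in counter; swap in a real tokenizer-based count in practice."""
    return len(text.split())


def split_sentences(text: str) -> List[str]:
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(
    text: str,
    max_tokens: int = MAX_TOKENS,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Pack whole sentences into chunks that each stay within max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    for sentence in split_sentences(text):
        if count_tokens(sentence) > max_tokens:
            # Edge case: one sentence exceeds the budget, so split it by words.
            if current:
                chunks.append(" ".join(current))
                current = []
            piece: List[str] = []
            for word in sentence.split():
                if piece and count_tokens(" ".join(piece + [word])) > max_tokens:
                    chunks.append(" ".join(piece))
                    piece = []
                piece.append(word)
            if piece:
                chunks.append(" ".join(piece))
            continue
        # Recount the joined candidate so the limit is checked on the real chunk.
        if current and count_tokens(" ".join(current + [sentence])) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def aggregate_scores(chunk_scores: Sequence[Tuple[int, Dict[str, float]]]) -> Dict[str, float]:
    """Token-weighted mean of per-chunk label probabilities.
    Each entry is (chunk token count, {label: probability})."""
    total = sum(weight for weight, _ in chunk_scores)
    labels = chunk_scores[0][1].keys()
    return {
        label: sum(weight * scores[label] for weight, scores in chunk_scores) / total
        for label in labels
    }
```

Weighting by token count keeps a short trailing chunk from counting as much as a full-length one; a plain mean or a max over chunks would be equally easy to substitute.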

Challenge: Data Deduplication

Ensuring no duplicate speeches are stored when running daily ingestion and historical backfills.

Solution: URL-based deduplication with database constraints and validation checks before insertion.
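
A minimal sketch of this two-layer approach follows. To stay runnable without a database server it uses Python's built-in sqlite3 rather than the project's PostgreSQL and SQLAlchemy stack, and the table name, column names, and `insert_speech` helper are hypothetical.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create a speeches table whose URL column carries a UNIQUE constraint."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS speeches (
            id    INTEGER PRIMARY KEY,
            url   TEXT NOT NULL UNIQUE,
            title TEXT NOT NULL
        )
        """
    )


def insert_speech(conn: sqlite3.Connection, url: str, title: str) -> bool:
    """Store a speech once per URL. Returns True if a new row was written."""
    url = url.strip()
    # Layer 1: validation check before insertion.
    exists = conn.execute("SELECT 1 FROM speeches WHERE url = ?", (url,)).fetchone()
    if exists:
        return False
    # Layer 2: the UNIQUE constraint is the backstop if two writers race.
    cursor = conn.execute(
        "INSERT OR IGNORE INTO speeches (url, title) VALUES (?, ?)", (url, title)
    )
    conn.commit()
    return cursor.rowcount == 1
```

With this shape, a daily ingestion run and a historical backfill can both call the same insert function and overlap freely without creating duplicate rows.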

Challenge: Orchestration & Scheduling

Coordinating multiple data sources and processing steps in a reliable, scheduled manner.

Solution: Apache Airflow DAGs for orchestration with proper dependency management and error handling.
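
The dependency idea behind a DAG can be illustrated without Airflow itself. The sketch below uses the standard library's graphlib to run tasks in dependency order and to mark anything downstream of a failure as not runnable. It is not Airflow code: the task names and the `run_pipeline` helper are hypothetical, chosen to mirror the architecture described earlier.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set

# Each task maps to the set of tasks that must succeed before it runs.
PIPELINE: Dict[str, Set[str]] = {
    "scrape_ecb": set(),
    "scrape_fed": set(),
    "store_speeches": {"scrape_ecb", "scrape_fed"},
    "score_sentiment": {"store_speeches"},
    "fetch_market_data": {"store_speeches"},
    "compute_correlations": {"score_sentiment", "fetch_market_data"},
}


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    graph: Dict[str, Set[str]] = PIPELINE,
) -> Dict[str, str]:
    """Run tasks in dependency order and report a status for each one."""
    status: Dict[str, str] = {}
    for name in TopologicalSorter(graph).static_order():
        # A task runs only if every upstream task succeeded.
        if any(status.get(upstream) != "success" for upstream in graph[name]):
            status[name] = "upstream_failed"
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status
```

A failure in sentiment scoring, for example, still lets market data fetching complete but prevents the correlation step from running on partial inputs.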

Dashboard Pages

  • Overview: KPI cards showing total speeches analyzed, sentiment distribution charts, sentiment trends over time, and filterable speech table
  • Speaker Analysis: Per-speaker drill down with sentiment breakdown, speaker comparison heatmap, and individual speaker statistics
  • Market Impact: Sentiment vs market changes correlation, scatter plots with trendlines, box plots showing sentiment distribution, and speaker impact ranking showing which speakers move markets the most
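
The speaker impact ranking on the Market Impact page could be computed along these lines. This is a guess at one reasonable metric, mean absolute price change following each speaker's speeches, and not a description of the dashboard's actual calculation; `rank_speakers` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_speakers(records: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Rank speakers by the mean absolute percent change after their speeches.
    Each record is (speaker name, percent price change after one speech)."""
    moves: Dict[str, List[float]] = defaultdict(list)
    for speaker, change in records:
        moves[speaker].append(abs(change))
    ranking = [(speaker, sum(vals) / len(vals)) for speaker, vals in moves.items()]
    return sorted(ranking, key=lambda row: row[1], reverse=True)
```

Using the absolute change ranks speakers by the size of the move regardless of direction. A ranking like this shows association only, since other news on the same day also moves prices.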

Data Sources

  • European Central Bank: Speeches and press releases via RSS feed and foedb JSON archive
  • Federal Reserve: Speeches and statements via JSON endpoint and yearly HTML archives
  • Yahoo Finance: Market price data for EUR/USD, S&P 500, US 10Y Treasury, Gold, and Euro Stoxx 50
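
For the RSS-based sources above, extracting speech entries from a feed can be sketched with the standard library alone. The sample feed, its URLs, and the `parse_feed` helper are invented for illustration; the real ECB and Fed feeds may use different fields.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import Dict, List

# Hypothetical RSS 2.0 document standing in for a central bank feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example speeches</title>
  <item>
    <title>Speech A</title>
    <link>https://example.org/speeches/a</link>
    <pubDate>Mon, 06 Jan 2025 09:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Entry without a link</title>
  </item>
</channel></rss>"""


def parse_feed(xml_text: str) -> List[Dict[str, object]]:
    """Return one dict per feed item that has a link, keyed for later dedup by URL."""
    root = ET.fromstring(xml_text)
    items: List[Dict[str, object]] = []
    for item in root.iter("item"):
        url = (item.findtext("link") or "").strip()
        if not url:
            continue  # the URL is the dedup key, so skip entries without one
        published = item.findtext("pubDate")
        items.append(
            {
                "title": (item.findtext("title") or "").strip(),
                "url": url,
                # RSS dates use the RFC 822 style that email.utils can parse.
                "published": parsedate_to_datetime(published) if published else None,
            }
        )
    return items
```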

Results & Impact

The project successfully demonstrates:

  • End-to-end data engineering pipeline design and implementation
  • Integration of modern NLP techniques with financial data analysis
  • Scalable architecture using industry-standard tools (Airflow, PostgreSQL, Docker)
  • Interactive data visualization for exploratory analysis
  • Measurement of how central bank speech sentiment correlates with market movements over 1-day and 1-week periods