House Hunter
The Problem
Standard real estate search platforms provide limited filtering capabilities. Price range and bedroom count filters exist, but complex criteria like finished basement requirements, pool exclusions, or specific city restrictions require manual review of each listing. That manual review consumed hours each week, much of it spent on listings that failed basic criteria.
Real estate API data presents additional challenges. Basement information lacks standardization across listings. The same property attribute appears in structured details arrays, features lists, or free-text descriptions depending on the listing source. Terminology varies among "basement," "lower level," "walk-out," and "finished," with no consistent pattern.
The result: significant time spent on manual searching, missed listings due to review delays, and no mechanism for tracking price changes on previously reviewed properties.
The Solution
Built a three-stage automated pipeline using LangGraph StateGraph architecture. The scraper node queries Realtor API across target suburbs, applies client-side price filtering to conserve API calls, then retrieves full property details for qualifying listings. The reviewer node executes two-pass validation: fast dictionary checks for price, location, age, property type, and pool status, followed by GPT-4o-mini analysis for basement status, move-in readiness, and unstructured data extraction.
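To make the pipeline concrete, here is a minimal sketch of the three-node graph wiring. The state fields and node bodies are illustrative placeholders, not the project's actual code:

```python
# Minimal sketch of the three-node pipeline. Node bodies and state
# fields are illustrative placeholders, not the project's actual code.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class PipelineState(TypedDict):
    raw_listings: list   # scraper output: detailed property records
    reviewed: list       # reviewer output: records with pass/fail and notes
    summary: str         # summarizer output: Telegram-ready message

def scraper(state: PipelineState) -> dict:
    # Query the Realtor API per suburb, apply client-side price
    # filtering, then fetch full details for qualifying listings.
    return {"raw_listings": []}

def reviewer(state: PipelineState) -> dict:
    # Pass 1: dictionary checks (price, location, age, type, pool).
    # Pass 2: GPT-4o-mini for basement status and move-in readiness.
    return {"reviewed": []}

def summarizer(state: PipelineState) -> dict:
    # Format qualifying listings (or the closest miss) for Telegram.
    return {"summary": ""}

graph = StateGraph(PipelineState)
graph.add_node("scraper", scraper)
graph.add_node("reviewer", reviewer)
graph.add_node("summarizer", summarizer)
graph.add_edge(START, "scraper")
graph.add_edge("scraper", "reviewer")
graph.add_edge("reviewer", "summarizer")
graph.add_edge("summarizer", END)
app = graph.compile()  # run with app.invoke({...})
```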
The summarizer node formats results and delivers via Telegram Bot API. Qualifying properties include full address, price, specifications, review highlights, and direct listing links. Non-qualifying days trigger closest-miss reporting using a weighted scoring algorithm: pool presence receives -50, excluded city -40, unfinished basement -25, with price overage scaling proportionally. This ensures continuous market visibility without notification fatigue.
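The closest-miss scoring can be sketched directly from the weights above; the budget cap, excluded-city set, and the price-overage scaling factor are assumptions for illustration:

```python
# Sketch of the closest-miss penalty scoring. The -50/-40/-25 weights
# come from the description above; the budget cap, excluded cities,
# and the $1,000-per-point overage scaling are assumptions.
MAX_PRICE = 500_000
EXCLUDED_CITIES = {"Exampleville"}

def closest_miss_score(prop: dict) -> float:
    score = 0.0
    if prop.get("has_pool"):
        score -= 50
    if prop.get("city") in EXCLUDED_CITIES:
        score -= 40
    if not prop.get("finished_basement"):
        score -= 25
    # Price overage scales proportionally rather than as a flat penalty.
    score -= max(0, prop.get("price", 0) - MAX_PRICE) / 1_000
    return score

def closest_miss(props: list[dict]) -> dict:
    # On zero-match days, report the least-penalized rejection.
    return max(props, key=closest_miss_score)
```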
SQLite handles cross-run deduplication, price change tracking with timestamps, and market analytics including average price per city and price-per-square-foot comparisons. APScheduler triggers the full pipeline twice daily on EC2 via systemd, with error notifications for pipeline failures.
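A sketch of the deduplication and price-change tracking, assuming a single price_history table; the table and column names are illustrative, not the project's actual schema:

```python
# Sketch of cross-run deduplication and price-change tracking in SQLite.
# Table and column names are assumptions for illustration.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("listings.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS price_history (
        property_id TEXT,
        price       INTEGER,
        seen_at     TEXT,
        PRIMARY KEY (property_id, seen_at)
    )
""")

def record_price(property_id: str, price: int) -> int | None:
    """Store the current price; return the old price on a change, else None."""
    row = conn.execute(
        "SELECT price FROM price_history WHERE property_id = ? "
        "ORDER BY seen_at DESC LIMIT 1",
        (property_id,),
    ).fetchone()
    previous = row[0] if row else None
    if previous == price:
        return None          # already seen at this price: dedup, no new row
    conn.execute(
        "INSERT INTO price_history VALUES (?, ?, ?)",
        (property_id, price, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return previous          # None on first sighting, old price on a drop/rise
```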
Architecture
LangGraph StateGraph with three nodes in linear pipeline: scraper, reviewer, summarizer. Scraper node wraps rate-limited API client searching multiple suburbs with client-side price filtering and detailed property data retrieval. Reviewer node executes two-pass validation: dictionary-based checks for price, location, age, type, and pool, followed by GPT-4o-mini for basement analysis and move-in readiness assessment. Summarizer node formats Telegram messages, calculates closest-miss scoring for rejection reports, and detects price drops across historical data. SQLite database persists property history, review results, price tracking, and notification state. APScheduler triggers workflow on cron schedule, deployed as systemd service on EC2.
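The first reviewer pass amounts to a handful of cheap dictionary checks; the criteria values below are assumed for illustration, and only survivors reach GPT-4o-mini:

```python
# Sketch of the first-pass dictionary checks that reject most listings
# before any LLM call. Criteria values are assumed for illustration.
CRITERIA = {
    "max_price": 500_000,
    "allowed_cities": {"Springfield", "Shelbyville"},
    "max_age_years": 40,
    "allowed_types": {"single_family"},
}

def first_pass(prop: dict) -> tuple[bool, str]:
    """Return (passes, reason); only survivors go to GPT-4o-mini."""
    if prop["price"] > CRITERIA["max_price"]:
        return False, "over budget"
    if prop["city"] not in CRITERIA["allowed_cities"]:
        return False, "excluded city"
    if prop["age_years"] > CRITERIA["max_age_years"]:
        return False, "too old"
    if prop["type"] not in CRITERIA["allowed_types"]:
        return False, "wrong property type"
    if prop["has_pool"]:
        return False, "has pool"
    return True, "ok"
```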
Key Implementation Decisions
- LangGraph over plain Python script: StateGraph provides observable, traceable execution with LangSmith integration and an extensible architecture for future nodes (neighborhood scoring, safety ratings)
- Two-pass review architecture: Dictionary-based pre-filtering rejects 60-70% of properties before LLM processing, minimizing API costs
- GPT-4o-mini over GPT-4o: Property review requires structured classification rather than creative generation, making the smaller model faster and more cost-effective with equivalent accuracy
- Three-source basement detection: API data inconsistency requires checking the details array, features list, and description text, with LLM verification as a fallback (see the basement-detection sketch after this list)
- Weighted closest-miss scoring: Penalty-based ranking ensures a daily market signal even when no properties qualify, preventing silent failure days
- Rate limiting as a first-class design concern: A budget of 40 calls per run was calculated to stay within the monthly API budget while covering all target suburbs with detail-fetch capacity (see the call-budget sketch after this list)
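A sketch of the three-source basement detection named above; the field names, keyword set, and the llm_check callable are assumptions:

```python
# Sketch of three-source basement detection. Field names and keywords
# are assumptions; the LLM fallback is only invoked when a keyword hit
# is found but its meaning is ambiguous.
BASEMENT_TERMS = ("basement", "lower level", "walk-out", "finished")

def basement_signal(prop: dict) -> str | None:
    """Return the first textual evidence of a basement, or None."""
    # Source 1: structured details array
    for detail in prop.get("details", []):
        if any(t in detail.lower() for t in BASEMENT_TERMS):
            return detail
    # Source 2: features list
    for feature in prop.get("features", []):
        if any(t in feature.lower() for t in BASEMENT_TERMS):
            return feature
    # Source 3: free-text description
    desc = prop.get("description", "").lower()
    for t in BASEMENT_TERMS:
        if t in desc:
            return t
    return None

def basement_finished(prop: dict, llm_check) -> bool:
    evidence = basement_signal(prop)
    if evidence is None:
        return False
    # A keyword hit alone cannot distinguish "finished basement" from
    # "unfinished basement"; defer that judgment to GPT-4o-mini.
    return llm_check(prop.get("description", ""), evidence)
```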
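And a sketch of the per-run call budget; the 40-call figure comes from the text, while the suburb list, price cap, and API wrapper functions are hypothetical:

```python
# Sketch of a per-run API call budget. The 40-call figure comes from
# the text; the suburb list, price cap, and API wrappers are hypothetical.
MAX_PRICE = 500_000
TARGET_SUBURBS = ["Springfield", "Shelbyville"]

def search_suburb(suburb: str) -> list[dict]:
    """Hypothetical wrapper around the Realtor API suburb search."""
    return []

def fetch_details(listing_id: str) -> dict:
    """Hypothetical wrapper around the Realtor API detail endpoint."""
    return {}

class CallBudget:
    def __init__(self, total: int = 40):
        self.remaining = total

    def take(self, n: int = 1) -> bool:
        """Reserve n calls; refuse once the per-run budget is spent."""
        if self.remaining < n:
            return False
        self.remaining -= n
        return True

budget = CallBudget(40)
for suburb in TARGET_SUBURBS:
    if not budget.take():          # one call per suburb search
        break
    listings = search_suburb(suburb)
    # Client-side price filtering conserves detail-fetch budget.
    for listing in (l for l in listings if l["price"] <= MAX_PRICE):
        if not budget.take():      # detail fetches share the same budget
            break
        fetch_details(listing["id"])
```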
The Results
Quantifiable Outcomes
- Eliminated daily manual search across multiple listing platforms with automated twice-daily execution
- Processes 50+ listings per suburb, retrieves details for up to 30 properties, and delivers results via Telegram in under 2 minutes
- Detected price drops on previously reviewed properties that would have been missed through manual monitoring
- API costs held to approximately 40 calls per run through client-side pre-filtering and rate limiting
- Zero missed listings in target suburbs since deployment
Lessons Learned
- Real estate API data requires multi-source extraction: The same field (basement status) appears in different locations across listings, requiring fallback logic and validation layers
- LLM-as-reviewer is effective for semi-structured validation: GPT-4o-mini identifies basement context from phrases like "lower level family room" that keyword matching cannot detect
- Notification design shapes system utility: Closest-miss reporting on zero-match days maintains daily market visibility, and the weighted scoring required several iterations to calibrate
- Rate limiting requires upfront architectural consideration: Embedding the call budget into the search-loop design prevented unexpected API billing
Future Enhancements
- Neighborhood scoring node using crime statistics and school ratings APIs
- Photo analysis with a vision model for renovation quality assessment and red flag detection
- Web dashboard for historical search results and market trend visualization