Putting It Together: Analyse Real Estate Reviews

Duration: 5 min

Let's combine everything: load a real dataset, clean it, and run a HuggingFace model on it to extract insights — no training required.

The task

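We'll load the Yelp review dataset, run a sentiment pipeline over a sample of reviews, and compare the model's output with the star ratings. Note that yelp_review_full contains only the review text and a star label. Filtering to real estate related businesses and comparing neighbourhoods would need a dataset that also carries business category and location fields, so the code below runs on a general sample of reviews.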

from datasets import load_dataset
from transformers import pipeline
import pandas as pd

# 1. Load dataset (streaming — it's large)
print('Loading dataset...')
ds = load_dataset('yelp_review_full', streaming=True)

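# 2. Take the first 500 reviews (a stream cannot be indexed, so iterate and stop)
samples = []
for i, ex in enumerate(ds['train']):
    # Keep the first 512 characters; labels are 0-4, so add 1 to get stars
    samples.append({'text': ex['text'][:512], 'stars': ex['label'] + 1})
    if i >= 499:
        break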

df = pd.DataFrame(samples)
print(f'Loaded {len(df)} reviews')
print(df['stars'].value_counts().sort_index())

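# 3. Run sentiment analysis (model pinned explicitly; it is the pipeline's default)
print('Running sentiment analysis...')
sentiment = pipeline(
    'sentiment-analysis',
    model='distilbert/distilbert-base-uncased-finetuned-sst-2-english',
    truncation=True,  # cut inputs that exceed the model's token limit
)
results = sentiment(df['text'].tolist(), batch_size=32)
df['sentiment'] = [r['label'] for r in results]    # 'POSITIVE' or 'NEGATIVE'
df['confidence'] = [r['score'] for r in results]   # probability of the chosen label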

# 4. Compare model sentiment vs star rating
print('\nSentiment vs Stars:')
print(df.groupby('stars')['sentiment'].value_counts(normalize=True).round(2))

Output:

Loaded 500 reviews
stars
1    112
2     87
3     94
4    103
5    104

Sentiment vs Stars:
stars  sentiment
1      NEGATIVE     0.89
       POSITIVE     0.11
2      NEGATIVE     0.71
       POSITIVE     0.29
3      POSITIVE     0.52
       NEGATIVE     0.48
4      POSITIVE     0.81
       NEGATIVE     0.19
5      POSITIVE     0.94
       NEGATIVE     0.06

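The model gets the sentiment direction right for most extreme reviews: 89% of 1-star reviews are labelled NEGATIVE and 94% of 5-star reviews are labelled POSITIVE. 3-star reviews are genuinely ambiguous, and the model splits them almost 50/50 (48% NEGATIVE, 52% POSITIVE), which makes sense for a binary classifier that has to pick a side. This is a real insight from zero training.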
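To put a single number on how well the model agrees with the star ratings, one option is to map stars to an expected label and measure the match rate. The sketch below is illustrative only: the toy DataFrame stands in for the lesson's df with made-up values, and the rule "1-2 stars means NEGATIVE, 4-5 stars means POSITIVE, 3 stars is excluded" is an assumption, not something the dataset defines.

```python
import pandas as pd

# Toy stand-in for the lesson's df (values are made up for illustration).
df = pd.DataFrame({
    'stars':     [1, 1, 2, 3, 3, 4, 5, 5],
    'sentiment': ['NEGATIVE', 'NEGATIVE', 'POSITIVE', 'NEGATIVE',
                  'POSITIVE', 'POSITIVE', 'POSITIVE', 'NEGATIVE'],
})

def expected_label(stars):
    """Assumed mapping from stars to sentiment; 3 stars has no clear answer."""
    if stars <= 2:
        return 'NEGATIVE'
    if stars >= 4:
        return 'POSITIVE'
    return None

df['expected'] = df['stars'].map(expected_label)

# Drop the 3-star rows, which have no expected label.
clear = df.dropna(subset=['expected']).copy()
clear['agrees'] = clear['sentiment'] == clear['expected']

overall_agreement = clear['agrees'].mean()
agreement_by_stars = clear.groupby('stars')['agrees'].mean()

print(f'Agreement on non-3-star reviews: {overall_agreement:.2f}')
print(agreement_by_stars)
```

With the real df from the lesson, the same lines from `df['expected'] = ...` onward should work unchanged, since they only rely on the stars and sentiment columns.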

💡 Tip: This pattern — load a public dataset, apply a pre-trained model, extract insights — is the foundation of most real-world NLP projects. You rarely need to train from scratch.
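The lesson's code does not show a real estate filtering step. One way to approximate it, assuming only the review text is available, is a simple keyword match before running the sentiment pipeline. This is a rough sketch: the keyword list and the filter_real_estate helper are hypothetical, and a keyword match will miss some relevant reviews and let in some irrelevant ones (for example, "rental" also matches car rentals).

```python
import re
import pandas as pd

# Hypothetical keyword list; tune it for your own data.
REAL_ESTATE_KEYWORDS = ['apartment', 'landlord', 'lease', 'realtor',
                        'property manager', 'rental']

def filter_real_estate(df, keywords=REAL_ESTATE_KEYWORDS):
    """Keep rows whose text mentions any keyword (case-insensitive)."""
    pattern = '|'.join(re.escape(k) for k in keywords)
    mask = df['text'].str.contains(pattern, case=False, regex=True)
    return df[mask].reset_index(drop=True)

# Toy reviews, made up for illustration.
reviews = pd.DataFrame({'text': [
    'The landlord fixed the heating within a day.',
    'Best tacos in town, will be back.',
    'Signing the LEASE was painless and the realtor was helpful.',
    'Great haircut, friendly staff.',
]})

real_estate_reviews = filter_real_estate(reviews)
print(real_estate_reviews['text'].tolist())
```

Because a streamed sample of 500 general reviews may contain very few matches, it is likely you would need to iterate over many more reviews before filtering to get a usable subset.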

❓ Why do 3-star reviews confuse sentiment models?

- The model hasn't seen enough training data
- 3-star reviews are genuinely mixed: they contain both positive and negative language
- The tokenizer fails on medium-length reviews
- Sentiment models only work on 1 and 5 star reviews
