How to Scrape Data from LinkedIn with Python

Learn how to scrape Google to find LinkedIn profile data.

Tuan Nguyen

    LinkedIn is a great place to find leads and engage with prospects. To engage with potential leads, you'll need a list of users to contact.

    However, getting a list is difficult without some scraping knowledge.

    The following script searches Google for LinkedIn user profiles that match a keyword, such as a company name.

    Tools Required

    You'll need Python 2.7+ (Python 3 is recommended, since Python 2 is no longer maintained) and the requests package. The other modules the script imports ship with Python. Once Python is installed, run the following command to install requests.

    pip install requests

    LinkedIn Scraper Script

    First, we need to import all the packages that we need.

    random picks a user agent for each request, requests makes the HTTP calls, and re provides the regular expressions that extract the LinkedIn profiles and links. argparse is imported as well, although the snippets shown here do not use it.

    import random
    import argparse
    import requests
    import re

    We create a LinkedinScraper class that tracks and holds the data for each request.

    The class requires two parameters: keyword and limit.

    The keyword parameter specifies the search term. The limit parameter sets the maximum number of links to search for.

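    class LinkedinScraper(object):
        def __init__(self, keyword, limit):
            """
            :param keyword: a str of keyword(s) to search for
            :param limit: number of profiles to scrape
            """
            # encode spaces so the keyword can be placed in the query string
            self.keyword = keyword.replace(' ', '%20')
            # raw HTML of every results page, concatenated
            self.all_htmls = ""
            # results per page (the search URL currently hardcodes num=100)
            self.quantity = '100'
            self.limit = int(limit)
            # offset of the next results page to request
            self.counter = 0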
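
    The constructor only replaces spaces, so a keyword containing characters such as & or + would change the meaning of the query string. A minimal sketch of a more general encoding step, assuming Python 3 and a hypothetical helper named encode_keyword that is not part of the original script:

```python
from urllib.parse import quote


def encode_keyword(keyword):
    """Percent-encode a search keyword for use in a URL query string."""
    # safe='' also encodes '/', which quote() leaves untouched by default
    return quote(keyword, safe='')


print(encode_keyword("Tesla engineer"))  # Tesla%20engineer
print(encode_keyword("R&D manager"))     # R%26D%20manager
```

    For keywords that contain only letters, digits, and spaces, this gives the same result as the replace call in the constructor.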

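    The LinkedinScraper class has three main methods: search, parse_links, and parse_people. They are shown below on their own; when assembling the script, indent each one inside the class body.

    The search method performs the requests for the keyword. It builds a Google query URL from the keyword and the current result offset, restricted to site:linkedin.com/in. It then requests pages of 100 results until the limit is reached and appends the HTML of each page to self.all_htmls.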

    def search(self):
        """
        perform the search and store each results page in self.all_htmls
        :return: None
        """

        # pool of user agents; one is picked at random for each request
        user_agents = [
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1464.0 Safari/537.36',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0) chromeframe/10.0.648.205',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/11.10 Chromium/18.0.1025.142 Chrome/18.0.1025.142 Safari/535.19',
            'Mozilla/5.0 (Windows NT 5.1; U; de; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 Opera 11.00'
        ]
        # request one page of 100 results at a time until the limit is reached
        while self.counter < self.limit:
            headers = {'User-Agent': random.choice(user_agents)}
            url = 'http://google.com/search?num=100&start=' + str(self.counter) + '&hl=en&meta=&q=site%3Alinkedin.com/in%20' + self.keyword
            resp = requests.get(url, headers=headers)
            # stop as soon as Google serves its captcha page
            if "Our systems have detected unusual traffic from your computer network." in resp.text:
                print("Running into captchas")
                return

            self.all_htmls += resp.text
            self.counter += 100
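
    To see which pages a given limit produces, the loop's arithmetic can be reproduced without making any requests. This is a small sketch; page_offsets and build_url are illustrative names, not functions from the original script:

```python
def page_offsets(limit, page_size=100):
    """Return the `start` values the search loop would request."""
    return list(range(0, int(limit), page_size))


def build_url(keyword, start):
    """Build the Google query URL the same way the search method does."""
    return ('http://google.com/search?num=100&start=' + str(start) +
            '&hl=en&meta=&q=site%3Alinkedin.com/in%20' + keyword)


print(page_offsets(500))  # [0, 100, 200, 300, 400]
print(page_offsets(250))  # [0, 100, 200]
print(build_url("Tesla", 100))
```

    A limit of 250 therefore still triggers three requests, because the counter advances in steps of 100.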

    The parse_links function will search the HTML and perform regex parsing to extract all the LinkedIn links.

    def parse_links(self):
        # the dots in the domain are escaped so they only match a literal dot
        reg_links = re.compile(r"url=https://www\.linkedin\.com(.*?)&")
        self.temp = reg_links.findall(self.all_htmls)
        results = []
        # each match is the path captured after the domain
        for path in self.temp:
            results.append("https://www.linkedin.com" + path)
        return results
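
    The link pattern can be tried on a small string before running it against real search results. The HTML fragment below is made up to match the shape the regex expects, and real Google markup may differ; extract_links is a hypothetical standalone version of the method:

```python
import re

# Made-up fragment in the shape the pattern expects; not real Google output.
sample_html = (
    '<a href="/url?sa=t&url=https://www.linkedin.com/in/jane-doe&ved=abc">'
    '<a href="/url?sa=t&url=https://www.linkedin.com/in/john-smith&ved=def">'
)


def extract_links(html):
    """Return full LinkedIn URLs found after a `url=` query parameter."""
    pattern = re.compile(r"url=https://www\.linkedin\.com(.*?)&")
    # the group captures the path between the domain and the next '&'
    return ["https://www.linkedin.com" + path for path in pattern.findall(html)]


print(extract_links(sample_html))
```

    If the result pages do not embed links in this form, the pattern returns an empty list and needs to be adapted to the markup you actually receive.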

    Similarly, the parse_people function will search the HTML for each person's name and title.

    def parse_people(self):
        """
        parse self.all_htmls for LinkedIn profile titles using regex
        :return: a list of cleaned-up title strings
        """
        reg_people = re.compile(r'">[a-zA-Z0-9._ -]* -|\| LinkedIn')
        self.temp = reg_people.findall(self.all_htmls)
        results = []
        for iteration in self.temp:
            delete = iteration.replace(' | LinkedIn', '')
            delete = delete.replace(' - LinkedIn', '')
            delete = delete.replace(' profiles ', '')
            delete = delete.replace('LinkedIn', '')
            delete = delete.replace('"', '')
            delete = delete.replace('>', '')
            # trim leftover separators and skip matches that are now empty
            delete = delete.strip(" -|")
            if delete:
                results.append(delete)
        return results
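
    The title pattern behaves in a way that is easy to miss: it keeps everything up to the last " -" separator in a run of allowed characters. The sketch below runs the same pattern against made-up titles and drops results that end up empty; the sample markup and the extract_people name are assumptions for illustration:

```python
import re

# Made-up result titles; real Google markup may differ.
sample_html = (
    '<h3 class="r">Jane Doe - Engineer - Tesla | LinkedIn</h3>'
    '<h3 class="r">John Smith - Tesla | LinkedIn</h3>'
)


def extract_people(html):
    """Return cleaned title fragments matched by the people pattern."""
    pattern = re.compile(r'">[a-zA-Z0-9._ -]* -|\| LinkedIn')
    results = []
    for match in pattern.findall(html):
        cleaned = match.replace('LinkedIn', '').replace('"', '').replace('>', '')
        # trim leftover separators; "| LinkedIn" matches become empty here
        cleaned = cleaned.strip(" -|")
        if cleaned:
            results.append(cleaned)
    return results


print(extract_people(sample_html))  # ['Jane Doe - Engineer', 'John Smith']
```

    In both samples the text after the last " -" (here the company name) is not part of the match, so it does not appear in the output.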

    The following example uses the class to search for up to 500 profiles that match the keyword Tesla.

    ls = LinkedinScraper(keyword="Tesla", limit=500)
    ls.search()
    links = ls.parse_links()
    profiles = ls.parse_people()
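
    To turn the output into a contact list, the links can be deduplicated and written to a CSV file. This is a sketch using only the standard library; it assumes the same profile may show up more than once across result pages, and the helper names and sample links are made up:

```python
import csv
import io


def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


def write_links_csv(links, handle):
    """Write one profile URL per row to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(["profile_url"])
    for link in unique_in_order(links):
        writer.writerow([link])


# Made-up links standing in for the output of parse_links()
links = [
    "https://www.linkedin.com/in/jane-doe",
    "https://www.linkedin.com/in/john-smith",
    "https://www.linkedin.com/in/jane-doe",
]
buffer = io.StringIO()
write_links_csv(links, buffer)
print(buffer.getvalue())
```

    To write to disk instead of an in-memory buffer, pass a handle from open("profiles.csv", "w", newline="").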

    This is quite a simple script, but it should be a good starting point. It stops as soon as Google serves a captcha page, and it does not retry, handle network errors, or work around the captcha when too many requests are made.
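
    One way to add basic error handling is to retry with a growing delay. The sketch below is not part of the original script: fetch_with_backoff is a hypothetical helper that takes the page-fetching function as an argument, so it can wrap requests.get or be tested with a fake.

```python
import time

CAPTCHA_MARKER = "Our systems have detected unusual traffic from your computer network."


def fetch_with_backoff(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTML without the captcha marker.

    Returns the HTML, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            html = fetch(url)
        except OSError:
            # requests' exceptions derive from IOError/OSError
            html = None
        if html is not None and CAPTCHA_MARKER not in html:
            return html
        if attempt < max_attempts - 1:
            # wait 1x, 2x, 4x ... the base delay before trying again
            sleep(base_delay * 2 ** attempt)
    return None
```

    With requests, the fetch argument could be a small function such as lambda u: requests.get(u, headers=headers, timeout=10).text. Waiting longer will not always clear a captcha, so a None result still has to be handled by the caller.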

    You can find the complete code at https://github.com/serply-inc/python-linkedin-scraper.

    Making too many requests to Google will get your IP address blocked, so use proxies when running this script.
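
    The requests library accepts a proxies dictionary on each call, so rotating proxies can be as simple as picking one per request. A minimal sketch, assuming you have your own list of proxy URLs; the addresses and the pick_proxies helper below are placeholders:

```python
import random


def pick_proxies(proxy_urls, rng=random):
    """Return a dict usable as the `proxies` argument of requests.get."""
    proxy = rng.choice(proxy_urls)
    # route both plain and TLS traffic through the chosen proxy
    return {"http": proxy, "https": proxy}


# Placeholder addresses; replace them with proxies you control.
PROXY_URLS = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]

proxies = pick_proxies(PROXY_URLS)
# Inside the search loop this would be passed along with the headers:
# resp = requests.get(url, headers=headers, proxies=proxies)
print(proxies)
```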

    Alternatively, see Serply's API docs at https://docs.serply.io/ for performing searches without getting blocked.