FamiLinx


Data


General


FamiLinx is a scientific resource of curated genealogical, demographic, and basic phenotypic data from tens of millions of people, mostly from the last 500 years. Unlike traditional studies, this resource is the product of an ultra crowd-sourcing approach: it is based on the collaborative work of genealogy enthusiasts around the world who documented and shared their family stories.

The starting point of FamiLinx was the public information on Geni.com, a genealogy-driven social network that is operated by MyHeritage. Geni.com allows genealogists to enter their family trees into the website and to create profiles of family members with basic demographic information such as sex, birth date, marital status, and location. The genealogists decide whether they want the profiles in their trees to be public or private. New or modified family tree profiles are constantly compared to all existing profiles, and if there is high similarity to existing ones, the website offers the users the option to merge the profiles and connect the trees.

With permission from MyHeritage, we downloaded only the public profiles of individuals from Geni.com for future scientific studies. We used graph algorithms to clean the data and organize the pedigrees into formats that can be accessed quickly. We also employed natural language processing to tokenize the birth, residence, death, and burial locations of individuals and converted this information into longitude and latitude coordinates. The FamiLinx data is formatted as an SQL database, and users can create their own local copy with the download package. We also provide a Python API to query the database with advanced functions for pedigree analysis.

For privacy purposes, the resource does not contain any names, and any attempt to re-identify the users is strictly prohibited.

The main advantage of FamiLinx is its ultra-large pedigrees. The largest pedigree has 13 million individuals. To the best of our knowledge, this is the largest pedigree compiled for scientific studies.
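To make the notion of pedigree size concrete, here is a minimal sketch that assumes a pedigree can be treated as a connected component of the parent-child graph. That definition, the function name `pedigree_sizes`, and the toy edges are illustrative assumptions, not the method used to build FamiLinx.

```python
from collections import Counter


def pedigree_sizes(edges):
    """Return connected-component sizes (largest first) for parent-child links.

    Assumption: a "pedigree" is a connected component when the
    parent-child links are treated as undirected edges.
    """
    root = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        root.setdefault(x, x)
        while root[x] != x:
            root[x] = root[root[x]]  # path halving keeps the trees shallow
            x = root[x]
        return x

    for parent, child in edges:
        root_parent, root_child = find(parent), find(child)
        if root_parent != root_child:
            root[root_parent] = root_child  # merge the two components

    sizes = Counter(find(node) for node in list(root))
    return sorted(sizes.values(), reverse=True)


# Hypothetical toy data: one five-person family and one unrelated pair.
toy_edges = [(1, 3), (2, 3), (3, 5), (4, 5), (10, 11)]
sizes = pedigree_sizes(toy_edges)
print(sizes)
```

A union-find structure is used here because it processes each link once and does not need the whole graph held as adjacency lists, which matters when there are tens of millions of profiles.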


Examples


An example of a (small) FamiLinx pedigree of 6,000 people that spans over 7 generations:

Green nodes denote individuals and red nodes denote marriages



Quantitative analysis of human migration with crowd-sourced genealogy



The Database


The database has 43M individuals in two files:
  • profiles.txt: Information about each profile (such as gender and the date and location of birth, death, and burial). Click here for a list of all data fields.
  • relations.txt: The parent-child relations between profiles.
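As a rough illustration of how the parent-child relations could be used once downloaded, the sketch below reads relation pairs and collects all ancestors of a profile. The tab delimiter, the `parent` and `child` column names, and the sample rows are assumptions made for this example; check the data-field list for the actual layout of relations.txt.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for relations.txt. The real delimiter and
# column names are not documented here and may differ.
SAMPLE_RELATIONS = "parent\tchild\n1\t3\n2\t3\n3\t5\n4\t5\n"


def load_parents(handle):
    """Map each child identifier to the set of its parent identifiers."""
    parents = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        parents[row["child"]].add(row["parent"])
    return parents


def ancestors(parents, profile_id):
    """Return every ancestor of profile_id by following parent links."""
    seen = set()
    stack = [profile_id]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


parents = load_parents(io.StringIO(SAMPLE_RELATIONS))
print(sorted(ancestors(parents, "5")))
```

To run this against a downloaded file, replace the `io.StringIO(...)` argument with an open file handle and adjust the column names to match the real header.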

Visit the download page to request access to the database.

Identifiers

The downloaded data contains anonymized identifiers.
To overlay other datasets on the FamiLinx data, you will need the dynamic version that includes the Geni profile-id.
Write to yaniv@cs.columbia.edu to obtain this type of data.



How to Cite This Work


Quantitative analysis of population-scale family trees using millions of relatives
Joanna Kaplanis, Assaf Gordon, Mary Wahl, Michael Gershovits, Barak Markus,
Mona Sheikh, Melissa Gymrek, Gaurav Bhatia, Daniel G. MacArthur, Alkes L. Price,
Yaniv Erlich.
https://doi.org/10.1101/106427 bioRxiv, 2017.