Preface
As a new international graduate student without much work experience in North America, starting a career is never an easy task. But I still strongly believe that I will receive my dream offer someday. I always believe.
The following notes are mainly based on the book Python for Data Analysis (2nd Edition) written by Wes McKinney, which is available online.
- Overview
- Python Basics, IPython and Jupyter Notebooks
- Built-in Data Structures, Functions and Files
- Numpy Basics
- Pandas Basics
- Data Loading, Storage and File Formats
- Data Manipulation
Chapter 1
Overview
An overview of Python and the common libraries, plus how the rest of the book is organized.
Chapter 2
Python Basics, IPython and Jupyter Notebooks
- IPython is an enhanced Python interpreter
- Type `ipython` on the command line to launch IPython
- Jupyter Notebook was created within the IPython project
- Type `jupyter notebook` on the command line to launch Jupyter Notebook
- In IPython or Jupyter Notebooks, press the `<tab>` key to see all variables sharing the same beginning
- Typing `?` after a variable name displays some general info
- Typing `??` shows the source code if possible
- Try `np.*load*?` for yourself
- Try `%run` and `%load` for yourself
- Other magic commands in IPython include `%timeit`, `%debug`, `%pwd`; refer to Page 29 of the book
- Type `%matplotlib inline` in Jupyter Notebook to set up integrated plotting so it doesn't interfere with the console session
- Python uses whitespace (spaces or tabs) to structure code instead of braces (unlike, say, R)
- After `b = a`, both names refer to the same object: if we modify `a` in place (e.g. with `append()`, rather than assigning a new object), then `b` changes simultaneously, and vice versa
- `isinstance(object, type(s))` returns True or False
- Try `object.<tab>`
- Page 37: Binary operators and comparisons
- Strings and tuples (written with `()`) are immutable, but most of the rest is mutable (can be modified)
- Scalar types: None, str, bytes, float, bool, int (see Page 39 for explanation)
- Use single quotes `'` or double quotes `"` for strings; use triple quotes `'''` or `"""` for multiline strings
- If you need a literal backslash in a string, double it: `'pytho\\n!'`, otherwise `'pytho\n!'` will output `pytho <new line> !`
- Add `r` before a string (a raw string) if it contains a lot of `\`
- Strings can be concatenated with `+` to form a new string (see the sketch below)
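A minimal sketch of the string points above (escaping, raw strings, and concatenation); the file path is made up:

```python
# Escaped backslash vs. newline escape
print('pytho\\n!')   # prints: pytho\n!
print('pytho\n!')    # prints "pytho", a new line, then "!"

# Raw string: backslashes are kept literally
path = r'C:\new_folder\file.txt'
print(path)          # C:\new_folder\file.txt

# Concatenation builds a new string (strings are immutable)
sentence = 'this is ' + 'a new sentence'
print(sentence)      # this is a new sentence
```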
- Try importing the `datetime` module and using its `datetime`, `date` and `time` types
- `strftime('date time format')` formats a datetime as a string
- `dt.replace(minute=0)` returns a copy with the minute set to 0
- Two datetimes can be subtracted from each other (and a `timedelta` can be added to a datetime); the difference's type is `datetime.timedelta`, see P 44 for more details (a sketch follows this list)
- Try `if-elif-else`
- A `for` loop can iterate over a collection (a list, a tuple, even a string) or an iterator
- Use `continue` to skip specific values in a loop
- Use `break` to stop a loop when encountering specific values; it only exits the innermost loop
- Try `while` loops
- `pass` is Python's no-op statement and can be used as a placeholder
- `range` is often used for iterating through sequences by index
- Try the ternary expression `true-value if condition else false-value`
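A small sketch of the `datetime` points above (formatting, `replace`, and `timedelta`); the dates are made up:

```python
from datetime import datetime, timedelta

dt = datetime(2011, 10, 29, 20, 30, 21)
print(dt.strftime('%Y-%m-%d %H:%M'))    # '2011-10-29 20:30'
print(dt.replace(minute=0, second=0))   # 2011-10-29 20:00:00 (a new object)

# Subtracting two datetimes gives a timedelta
delta = datetime(2011, 11, 15, 22, 30) - dt
print(type(delta), delta)               # <class 'datetime.timedelta'> 17 days, 1:59:39

# Adding a timedelta to a datetime gives a shifted datetime
print(dt + timedelta(days=12))          # 2011-11-10 20:30:21
```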
Chapter 3
Built-in Data Structures, Functions, Files
- Python’s workhorse structures: tuples, lists, dicts and sets
- Use a comma-separated sequence of values to define a tuple or a nested tuple
- Use `tuple()` to convert any sequence (such as a list or string) or iterator to a tuple
- A tuple itself is immutable: you cannot change which object sits in each slot, although a mutable object stored in a tuple (e.g. a list) can still be modified in place
- `+` concatenates tuples/lists
- Assign a tuple to a tuple-like expression and Python will unpack the tuple
- Use `*rest` to pluck a few elements from the beginning of a tuple and collect the remainder (e.g. `a, b, *rest = values`); use `*_` to discard the unwanted values
- `tuple.count(object)` counts the occurrences of `object` in the tuple
- A list is like a tuple but mutable; use `list()` to convert
- `list.append()` adds to the end; `list.insert(index, object)` inserts the object at a certain index; `list.pop(index)` removes and returns the object at a certain index
- Use `(not) in` to check whether a list (or tuple) contains an object
- `list.extend(another_list)` extends an existing list; it's faster than using `+` to concatenate two lists and thus preferable
- Use the `sort` method to sort a list (tuples have no `sort`)
- `bisect.bisect()` returns the location where an element should be inserted to keep the list sorted; `bisect.insort()` actually inserts the element at that location
- Indexing with `[]` slices a list; try `list[1:]`, `list[-1:]`, `list[-3:-1]` and `list[-1:-3]`, `list[::2]`, `list[::-1]` for yourself
- The slicing method also works for tuples
- `enumerate(list/tuple)` returns the index and the corresponding value at each step; it is usually used in a `for` loop, e.g. to build a `dict`
- `sorted(list/tuple)` returns a new sorted list
- `zip` pairs up the elements of a number of lists, tuples, or other sequences to create a list of tuples; it can also be used to "unzip" a list via `zip(*list)`, see Page 61
- `reversed` reverses a list/tuple but returns a generator, so it has to be materialized (e.g. with `list()`); see the sketch below
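A minimal sketch of the sequence helpers above (`enumerate`, `sorted`, `zip`, `reversed`, and `bisect`), using made-up data:

```python
import bisect

names = ['foo', 'bar', 'baz']

# enumerate: index/value pairs, here used to build a dict
mapping = {}
for i, name in enumerate(names):
    mapping[name] = i                  # {'foo': 0, 'bar': 1, 'baz': 2}

# sorted always returns a new list; zip pairs sequences element by element
print(sorted([7, 1, 2, 6]))            # [1, 2, 6, 7]
print(list(zip(names, [1, 2, 3])))     # [('foo', 1), ('bar', 2), ('baz', 3)]

# reversed is a generator, so materialize it
print(list(reversed(range(5))))        # [4, 3, 2, 1, 0]

# bisect keeps a list sorted
sorted_list = [1, 2, 2, 3, 7]
print(bisect.bisect(sorted_list, 5))   # 4 -> insertion point that keeps order
bisect.insort(sorted_list, 5)          # [1, 2, 2, 3, 5, 7]
```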
- A `dict` is composed of keys and values; values can be any Python objects (strings, lists, ...), while keys generally need to be hashable
- Use `(not) in` to check whether a key is in the dict
- `del dict[key]` deletes the specified key and its value
- Use `dict.pop(key)` to delete the corresponding key and its value (and return the value)
- `list(dict.keys())` shows all keys in the dict; `list(dict.values())` shows all values
- `dict.update({another dict})` updates/adds entries in the dict
- `dict(zip(seq1, seq2))` can be used to generate a dict
- `dict.get(key, default)` gets the corresponding key's value, or returns the default if the key doesn't exist
- Check Page 64 for some practical functions regarding dicts
- Keys in a dict are usually immutable; the `hash()` function tells whether an object is hashable (and thus usable as a dict key)
- A `set` is an unordered collection of unique elements, like `{1, 2, 3}`. It can be created via the `set(list/tuple)` function or via a set literal with curly braces, just like a dict without values, only keys
- The mathematical set operators are available, see Page 66
- Set elements generally need to be immutable as well
- `set1.issubset(set2)` and `set1.issuperset(set2)` check whether set1 is the subset/superset of set2 (see the sketch below)
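A quick sketch of the dict and set operations above, using made-up values:

```python
# dict basics
d = dict(zip(['a', 'b', 'c'], [1, 2, 3]))   # {'a': 1, 'b': 2, 'c': 3}
print('a' in d)                             # True
print(d.get('z', 'missing'))                # 'missing' -> default when key absent
d.update({'d': 4})
val = d.pop('b')                            # removes 'b' and returns 2
del d['c']

# set basics and math-style operators
s1, s2 = {1, 2, 3, 4}, {3, 4, 5}
print(s1 | s2)                # union -> {1, 2, 3, 4, 5}
print(s1 & s2)                # intersection -> {3, 4}
print(s1 - s2)                # difference -> {1, 2}
print({3, 4}.issubset(s1))    # True
print(s1.issuperset({1, 2}))  # True
```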
- Try `[expr for x in collection if condition]` for yourself (list comprehension)
- Try `{key-expr : value-expr for value in collection if condition}` for yourself (dict comprehension)
- Try `{value-expr for value in collection if condition}` for yourself (set comprehension)
- Check Page 68 for nested list comprehensions, which are just a more concise expression of nested `for` loops (see the sketch below)
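A minimal sketch of the comprehension forms above, with made-up data:

```python
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

# list comprehension: [expr for x in collection if condition]
upper = [s.upper() for s in strings if len(s) > 2]   # ['BAT', 'CAR', 'DOVE', 'PYTHON']

# dict comprehension: {key-expr: value-expr ...}
lengths = {s: len(s) for s in strings}               # {'a': 1, 'as': 2, ...}

# set comprehension: {value-expr ...}
unique_lengths = {len(s) for s in strings}           # {1, 2, 3, 4, 6}

# a nested comprehension is a condensed nested for loop
pairs = [(x, y) for x in range(3) for y in range(3) if x != y]
```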
- Functions can return multiple values
- Functions are objects
- Check `re.sub()` for yourself
- `str.strip()` removes whitespace; `str.title()` converts to proper (title) case
- Anonymous (lambda) functions consist of a single statement, such as `f = lambda x: x**2`
- Using `yield` in a function creates a generator (an iterable object)
- Another way is a list comprehension written within parentheses, `(expr for x in collection)` (a generator expression), which we can materialize with `list`/`tuple`/`set`/`dict` as we want
- Check the `itertools` library in Python, see Page 77
- Try `try/except` for yourself (similar in spirit to `if/else`); we can also restrict which exceptions are caught, such as `ValueError` or `(TypeError, ValueError)`
- Try `file = open(path)` for yourself, and add `file.close()` after finishing the work (see the sketch at the end of this chapter)
- Check Page 82 for more details about dealing with files in the standard Python libraries
- The last subsection is about bytes and Unicode
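A small sketch of exception handling and file handling from the bullets above; the path 'examples/sample.txt' is hypothetical and assumed to exist:

```python
# try/except: catch only the exceptions we expect
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

print(attempt_float('1.23'))   # 1.23
print(attempt_float('hello'))  # 'hello'

# File handling: the explicit open/close pattern...
path = 'examples/sample.txt'   # hypothetical file
f = open(path)
lines = [line.rstrip() for line in f]
f.close()

# ...or a with-block, which closes the file automatically
with open(path) as f:
    lines = [line.rstrip() for line in f]
```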
Chapter 4
NumPy Basics
Reasons for using Numpy:
- Written in C and uses much less memory than built-in Python sequences
- Performs complex computations on entire arrays without the need for Python `for` loops
4.1
ndarray
- Try `data = np.random.randn(2, 3)`
- Try `data.dtype` and `data.shape`
- `np.array()` creates arrays or converts other types to arrays
- `np.ones()`, `np.zeros()` and `np.empty()` return the corresponding arrays; see Page 90 for more array creation functions
- See Page 91 for NumPy data types
- Use the `astype()` method to convert a NumPy array's data type
- Arrays can be operated on in batch without `for` loops; this is called vectorization
- A slice of an array is a view on the original array, so a change to the slice also changes the original. To avoid this, add `.copy()` after slicing
- For multi-dimensional arrays, use `array[1, 2, :]` to slice/select. In short, the slicing method is quite similar to that for Python lists
- Use boolean indexing (True/False, 1/0) to slice an array (see the sketch below)
- The comparison operator `!=` has the same effect as negating the condition with `~(A == B)`
- `and` and `or` don't work for boolean arrays; use `&` and `|` instead
- We can also select rows/columns in a custom order when indexing (fancy indexing), see Page 102
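A minimal sketch of views vs. copies and boolean/fancy indexing, with made-up data:

```python
import numpy as np

arr = np.arange(10)
view = arr[3:6]
view[:] = 99            # modifies arr too: slices are views
safe = arr[3:6].copy()  # .copy() breaks the link

data = np.random.randn(4, 3)
names = np.array(['a', 'b', 'a', 'c'])

# Boolean indexing: select rows where the condition holds
print(data[names == 'a'])
# Negation and combination: use ~, & and | (not `not`, `and`, `or`)
print(data[~(names == 'a')])
print(data[(names == 'a') | (names == 'c')])

# Fancy indexing: pick rows in a custom order
print(data[[2, 0, 1]])
```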
- Try `np.arange(30).reshape(5, 6)` and `np.arange(60).reshape(3, 4, 5)` for yourself
- `array.T` transposes a 2-D array; `np.dot()` computes the matrix inner product
- For higher dimensional arrays, `transpose` accepts a tuple of axis numbers to permute the axes, such as `arr.transpose((1, 0, 2))`
- Check `arr.swapaxes(0, 1)` for yourself
4.2
Array Functions
- Try `np.sqrt(arr)`, `np.exp(arr)`, `np.maximum(arr1, arr2)` (arr1 and arr2 must have the same shape)
- `np.modf(arr)` is the vectorized version of the built-in Python `divmod`; it returns the fractional and integral parts of an array as two arrays (see the sketch below)
- Refer to Page 107 for more functions
4.3
Array-Oriented Programming with Arrays
- Try `np.meshgrid(arr1, arr1)` for yourself
- `np.where` is similar to `x if condition else y` in standard Python, used like `np.where(cond, arr1, arr2)`: at each slot, if the condition is true the value is taken from `arr1`, otherwise from `arr2`. Note: the last two arguments don't have to be arrays of the same length; either or both can be a scalar (see the sketch below)
- Try `arr.cumsum()` and `arr.cumprod()` for yourself
- The argument `axis=1` means compute across the columns; the number of answers should equal the number of rows
- The argument `axis=0` means compute down the rows; the number of answers should equal the number of columns
- Try the basic statistical methods for yourself, see Page 112
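A small sketch of `np.where` and the `axis` argument, with made-up data:

```python
import numpy as np

arr = np.random.randn(3, 4)

# np.where: vectorized "x if cond else y"; arguments may be arrays or scalars
capped = np.where(arr > 0, arr, 0)   # replace negatives with 0
signs = np.where(arr > 0, 1, -1)     # scalar/scalar also works

# axis=1 aggregates across columns -> one value per row
print(arr.mean(axis=1).shape)    # (3,)
# axis=0 aggregates down rows -> one value per column
print(arr.sum(axis=0).shape)     # (4,)

# cumulative functions keep the array's shape along the axis
print(arr.cumsum(axis=0).shape)  # (3, 4)
```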
- `any` and `all` are especially useful for boolean arrays: `any` checks whether there is any `True` value; `all` checks whether the array is all `True`
- Arrays can also be sorted: `arr.sort()` modifies `arr` itself, while the top-level function `np.sort(arr)` returns a sorted copy instead of modifying the input
- We can use a sorted array to find a quantile value: `sorted_array[int(quantile * len(sorted_array))]`
- `np.unique(object)` returns the sorted unique values, basically the same as `sorted(set(object))`
- `np.in1d(arr1, arr2)` checks whether each value in `arr1` exists in `arr2` and returns a boolean array; it also works for other types, e.g. lists and tuples
- Refer to Page 115 for more array set operations
4.4 - 4.8
- Check Page 115 for saving and loading data with Numpy
- `*` is the element-wise product; `np.dot(x, y)` or `x.dot(y)` executes the matrix dot product, and `x @ y` also executes a matrix dot product
- Try `inv` (inverse) and `qr` (QR decomposition) in `numpy.linalg` for matrices; check Page 117 for more functions
- Use `np.random.seed(1234)` to change NumPy's GLOBAL random generation seed; `np.random.RandomState()` creates a random generator isolated from the global one; see Page 119 for more `np.random` functions (a sketch follows)
- Page 120 gives a simple application simulating (multiple) random walks
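A brief sketch of the linear algebra and random-number points above, with made-up matrices:

```python
import numpy as np
from numpy.linalg import inv, qr

x = np.random.randn(3, 3)
y = np.random.randn(3, 3)

print(x * y)      # element-wise product
print(x.dot(y))   # matrix product, same as np.dot(x, y) or x @ y

mat = x.T @ x
print(inv(mat))   # matrix inverse
q, r = qr(mat)    # QR decomposition

np.random.seed(1234)             # global seed
rng = np.random.RandomState(42)  # generator isolated from the global state
print(rng.randn(4))
```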
Chapter 5
Pandas Basics
About Pandas:
- Often used with numerical computing tools like Numpy and Scipy, analytical libraries like statsmodels and scikit-learn, data visualization libraries like matplotlib
- Designed for working with tabular or heterogeneous data, but Numpy is mainly for homogeneous numerical array data
- Open-sourced in 2010; it has over 800 distinct contributors
5.1
pandas Data Structures
- Two workhorse data structures: Series and DataFrame
- A Series is formed by a sequence of values (called values) and an associated array of data labels (called the index); the index is not required
- Try `obj = pd.Series(arr)`, then `obj.values` and `obj.index`, by yourself
- `obj[ind]` returns the corresponding value; try `obj[[ind1, ind2, ind3]]` by yourself
- A Series can also be treated as a fixed-length, ordered dict
- Many NumPy functions also work on pd.Series objects, and many dict operations work on Series as well
- `pd.Series(dict, index)` puts the dict's keys in sorted order as the index (or in the order given by `index` if you pass one). If a label in `index` is missing from the dict, its value is NaN; if a key in the dict is not in `index`, it is dropped
- Check `pd.isnull(obj)` and `pd.notnull(obj)` by yourself
- Try `obj1 + obj2`, supposing they have similar but not identical indexes
- Each Series object and its index have a `name` attribute, e.g. `obj.name = 'population'`, `obj.index.name = 'state'` (see the sketch below)
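A minimal Series sketch covering the points above, with made-up values:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj.values, obj.index)
print(obj['a'])              # indexing by a single label
print(obj[['c', 'a', 'd']])  # indexing by a list of labels

# Construction from a dict; the index controls order and introduces NaN for missing labels
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000}
obj2 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])
print(pd.isnull(obj2))       # True only for 'California'

obj2.name = 'population'
obj2.index.name = 'state'
```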
- A DataFrame is a rectangular table of data with an ordered collection of columns (NOTE: the book says the columns will be sorted automatically, but when I ran the same code on my PC they were not; my Python version is 3.7.4 and I am using Jupyter Notebook)
- A DataFrame has both a row and a column index; it can be viewed as a dict of Series sharing the same index
- You can use `pd.DataFrame(data, columns=['col3','col2','col1'])` to customize the column order
- If you pass a column that isn't contained in the dict, it will appear with missing values (NaN)
- Try `df['col1']` and `df.col1` yourself and see the output
- Try `df.loc[row_index]`
- `df['col1'] = ...` or `df.col1 = ...` can be used to assign a list/array to an existing column
- If we assign a Series to a column, its labels will be realigned exactly to the DataFrame's index, inserting NaN in any holes
- A new column cannot be created with the `df.col` syntax
- `del df['col1']` deletes col1 from df; this also doesn't work with the `df.col1` syntax
- The column returned from indexing a DataFrame is a view on the underlying data, which means any in-place change to the Series will be reflected in the DataFrame; try the `copy()` function
- `df.T` transposes the DataFrame
- Compared with Series, DataFrame has one more attribute: `df.columns.name = ...`
- Check Page 134 for more possible data inputs to DataFrame (a sketch follows at the end of this subsection)
- Index objects are immutable, but this makes it safe to share Index objects among data structures
- Pandas Index can contain duplicate labels, and there are also a number of related methods, see Page 136
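A small DataFrame sketch for the bullets above, with made-up data; 'debt' is a hypothetical column name:

```python
import pandas as pd
import numpy as np

data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 2.4]}
df = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])  # 'debt' not in dict -> NaN

print(df['state'])   # column access (df.state also works for an existing column)
print(df.loc[1])     # row access by label

df['debt'] = np.arange(3.)                  # assign an array to a column
s = pd.Series([-1.2, -1.5], index=[0, 2])
df['debt'] = s                              # Series realigned to df's index; row 1 gets NaN

del df['debt']       # delete a column (the df.debt syntax would not work here)
print(df.T)          # transpose
```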
5.2
Essential Functionality
- `obj.reindex([new_index])` rearranges the object to conform to the new index, introducing missing values where no value was present for an index label
- `obj.reindex(columns=[new_columns])` is used to re-order columns; see Page 138 for more details
- Try `obj.drop('index1')` and `obj.drop(['index1', 'index2'])`
- To drop columns in a DataFrame, pass the column labels and add `axis=1` or `axis='columns'`
- `obj.drop(index, inplace=True)` manipulates the object in place without returning a new object
- The slicing method is similar to NumPy arrays, except that we can also slice with index labels, not only integers
- When slicing with labels, the result includes the end label, unlike slicing with integers
- Plain `[]` indexing cannot select an arbitrary subset of rows and columns from a DataFrame
- `loc` (labels) and `iloc` (integers) enable us to select a subset of rows and columns from a DataFrame (see the sketch below); see Page 144 for more indexing options with `loc` and `iloc`
- When the index is integer-valued, prefer `loc` and `iloc` to avoid ambiguity
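A quick sketch of label vs. integer selection and dropping, with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

# Label-based selection with loc: endpoints are inclusive
print(df.loc['Colorado', ['two', 'three']])
print(df.loc[:'Utah', 'two'])

# Integer-based selection with iloc: endpoints are exclusive, like NumPy
print(df.iloc[2, [3, 0, 1]])
print(df.iloc[:2, :2])

# Dropping rows and columns
print(df.drop(['Colorado', 'Ohio']))
print(df.drop(['two', 'four'], axis='columns'))
```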
- Try `series1 + series2` yourself; it works like an outer join in a database (thus there are often a lot of NaNs in the output)
- To avoid NaNs, try `df1.add(df2, fill_value=0)`; the `fill_value` argument can also be used elsewhere
- All basic arithmetic methods (+, -, *, /) have a counterpart starting with the letter `r` (for reversed, I guess), e.g. `1/df1` equals `df1.rdiv(1)`; see Page 149 for more details
- When an array is added to or subtracted by one of its rows/columns, the operation is performed for every row/column of the array; this is called broadcasting, and the same applies to operations between a DataFrame and a Series
- `df.apply(func, axis)` applies the function `func` to each column (the default) or each row
- Check `%` format codes yourself
- Use `df.applymap(func)` to apply a function to every element of `df` (see the sketch below)
- `obj.sort_index(axis, ascending)` returns a new, sorted object; when sorting by values, missing values are sorted to the end of the Series by default
- Use `df.sort_values(by=...)` to sort a DataFrame by one or more columns
- Try `obj.rank()` yourself, then add `method='first'` as an argument and see what happens
- Check Page 156 for more methods
- `obj.index.is_unique` returns whether `obj`'s labels are unique
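A short sketch of apply/applymap, sorting, ranking, and broadcasting, with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4, 3), columns=['b', 'd', 'e'])

# apply: a function over each column (default) or each row (axis='columns')
print(df.apply(lambda x: x.max() - x.min()))
print(df.apply(lambda x: x.max() - x.min(), axis='columns'))

# applymap: element-wise, here with a % format code
print(df.applymap(lambda x: '%.2f' % x))

# Sorting and ranking
print(df.sort_values(by=['b', 'd']))
print(df['b'].rank(method='first'))

# Broadcasting between a DataFrame and a Series (subtract row 0 from every row)
print(df - df.iloc[0])
```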
5.3
Descriptive Statistics
- Most basic mathematical and statistical methods in pandas are similar to NumPy's, and they have built-in handling for missing values; see Pages 159 & 160
- `df.corrwith(series1/df1)` returns pairwise correlations between a DataFrame's columns and another Series/DataFrame
- Try `series.unique()` yourself
- `obj.value_counts()` returns the unique values and their counts; `pd.value_counts(obj.values)` also works for arrays
- `obj.isin(target_list)` checks whether each value is in `target_list` and returns a boolean Series; it is often used for filtering
- `Index.get_indexer` gives you an index array from an array of possibly non-distinct values into another array of distinct values; refer to Page 164 for more details
- Apply `data.apply(pd.value_counts).fillna(0)` to your data and see what happens (see the sketch below)
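A minimal sketch of `unique`, `value_counts`, and `isin`, with made-up data:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

print(obj.unique())          # array of unique values
print(obj.value_counts())    # unique values with their counts, sorted by count

mask = obj.isin(['b', 'c'])  # boolean Series, handy for filtering
print(obj[mask])

# value_counts applied column-wise to a whole DataFrame
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3]})
print(data.apply(pd.value_counts).fillna(0))
```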
Chapter 6
Data Loading, Storage, and File Formats
- Focus on using pandas
- Input and output are mainly categorized as: reading text and binary files, loading from databases, and interacting with network sources such as web APIs
6.1 - 6.2
- Page 167 provides tons of methods (e.g. `read_csv()` and `read_table()`) to read tabular data as a DataFrame
- `df = pd.read_csv('filepath')` reads the specified CSV file
- `df = pd.read_csv('filepath', sep=',')` specifies the delimiter
- `pd.read_csv('examples/ex2.csv', header=None)` lets pandas assign default column names {0, 1, 2, ...}
- `pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd'])` specifies the column names yourself
- `result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])` treats the listed strings as missing values
- Page 172 lists some frequently used options of `read_csv()` and `read_table()`
- Try `pd.options.display.max_rows = 10` and then print a DataFrame; see what happens
- `data.to_csv('out.csv')`, where `data` can be a DataFrame or a Series (see the sketch below)
- JSON (short for JavaScript Object Notation) is one of the standard formats for sending data over HTTP
- Use `json.loads(jsonfile)` to convert a JSON string to Python form
- `pd.read_json()` assumes each object in the JSON array is a row in the table
- `xlsx = pd.ExcelFile()` and then `pd.read_excel(xlsx, 'Sheet1')` to load Excel data
- `writer = pd.ExcelWriter('examples/ex2.xlsx')` and then `frame.to_excel(writer, 'Sheet1')` to write it
6.3 - 6.4
- Use the `requests` package to interact with Web APIs, see Page 187
- Use the `sqlalchemy` package to access SQL databases (see Page 190), then use `pd.read_sql()`
Since my main purpose is to learn how to do data analysis and I don't have much time right now, I will only skim this chapter for now.
Chapter 7
Data Cleaning and Preparation
7.1
Handling Missing Data
- All descriptive statistics in pandas exclude missing values by default
- Missing values for numeric data in pandas are represented as `NaN`, a floating-point value
- Try `dropna`, `fillna`, `isnull` and `notnull` yourself
- `dropna` on a DataFrame drops any row containing a missing value by default
- Adding `how='all'` will only drop rows that are all NA; `thresh=n` keeps only rows with at least n non-NA values
- `df.fillna(0, inplace=True)` fills NA with 0 and modifies the DataFrame in place
- `df.fillna(data.mean())` fills NA with the mean value; see Page 197 for more details (see the sketch below)
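A small sketch of the missing-data handling above, with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[1.0, 6.5, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan]])

print(df.isnull())
print(df.dropna())             # drops any row containing NA
print(df.dropna(how='all'))    # drops only the all-NA row
print(df.dropna(thresh=2))     # keeps rows with at least 2 non-NA values

filled = df.fillna(0)              # returns a new object...
df.fillna(df.mean(), inplace=True) # ...or modify in place, here with column means
```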
7.2
Data Transformation
- `data.duplicated()` returns a boolean Series indicating whether each row is a duplicate (of a previously observed row)
- `data.drop_duplicates()` drops the duplicates; `data.drop_duplicates(['col1'])` drops duplicates based on `col1`; adding `keep='last'` keeps the last duplicate instead of the first (the default)
- A `dict` can be viewed as a mapping in Python
- Using `map` is a convenient way to perform element-wise transformations and other data cleaning-related operations
- `data.replace(value1, value2)` replaces value1 in the data with value2, where value1 can also be a list
- We can also use `map` to modify a DataFrame's index, but a more convenient method is `rename`; `rename` won't save the change unless you add `inplace=True`
- `pd.cut(values, bins)` returns a categorical object, where `bins` defines how to divide your data (see the sketch below)